Python: End-to-end Data Analysis

Book description

Leverage the power of Python to clean, scrape, analyze, and visualize your data

About This Book

  • Clean, format, and explore your data using the popular Python libraries and get valuable insights from it

  • Analyze big data sets; create attractive visualizations; manipulate and process various data types using NumPy, SciPy, and matplotlib; and more

  • Packed with easy-to-follow examples to develop advanced computational skills for the analysis of complex data

Who This Book Is For

This course is for developers, analysts, and data scientists who want to learn data analysis from scratch. It provides a solid foundation from which to analyze data of varying complexity. A working knowledge of Python (and a strong interest in playing with your data) is recommended.

What You Will Learn

  • Understand the importance of data analysis and master its processing steps

  • Get comfortable using Python and its associated data analysis libraries such as pandas, NumPy, and SciPy

  • Clean and transform your data and apply advanced statistical analysis to create attractive visualizations

  • Analyze images and time series data

  • Mine text and analyze social networks

  • Perform web scraping and work with different databases, Hadoop, and Spark

  • Use statistical models to discover patterns in data

  • Detect similarities and differences in data with clustering

  • Work with Jupyter Notebook to produce publication-ready figures to be included in reports

    In Detail

    Data analysis is the process of applying logical and analytical reasoning to study each component of data present in the system. Python is a multi-domain, high-level programming language that offers tools and libraries suitable for a wide range of purposes, and it has gradually evolved into one of the primary languages for data science. Have you ever imagined becoming an expert at approaching data analysis problems, solving them, and extracting all of the available information from your data? If so, look no further: this is the course you need!

    In this course, we will get you started with Python data analysis by introducing the basics of data analysis and the supporting Python libraries, such as matplotlib, NumPy, and pandas. You will create visualizations by choosing color maps, shapes, sizes, and palettes, and then delve into statistical data analysis using distribution algorithms and correlations. You’ll then find your way around different data and numerical problems, get to grips with Spark and HDFS, and set up migration scripts for web mining. You’ll be able to quickly and accurately perform hands-on sorting, reduction, and subsequent analysis, and fully appreciate how data analysis methods can support business decision-making. Finally, you will delve into advanced techniques such as performing regression, quantifying cause and effect using Bayesian methods, and using Python’s tools for supervised machine learning.

    The course provides you with highly practical content explaining data analysis with Python, from the following Packt books:

    1. Getting Started with Python Data Analysis.

    2. Python Data Analysis Cookbook.

    3. Mastering Python Data Analysis.

    By the end of this course, you will have all the knowledge you need to analyze data of varying complexity and turn it into actionable insights.

    Style and Approach

    Learn Python data analysis through engaging examples and fun exercises, with a gentle, friendly, yet comprehensive "learn-by-doing" approach. The techniques are demonstrated on this course’s own data but can be applied to any other data as well. This course is designed to be both a guide and a reference for moving beyond the basics of data analysis.

    Downloading the example code for this book

    You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

    Table of contents

    1. Python: End-to-end Data Analysis
      1. Table of Contents
      2. Python: End-to-end Data Analysis
      3. Python: End-to-end Data Analysis
      4. Credits
      5. Preface
        1. What this learning path covers
        2. What you need for this learning path
        3. Who this learning path is for
        4. Reader feedback
        5. Customer support
        6. Downloading the example code
        7. Errata
        8. Piracy
          1. Questions
      6. 1. Module 1
        1. 1. Introducing Data Analysis and Libraries
          1. Data analysis and processing
          2. An overview of the libraries in data analysis
          3. Python libraries in data analysis
            1. NumPy
            2. Pandas
            3. Matplotlib
            4. PyMongo
            5. The scikit-learn library
          4. Summary
        2. 2. NumPy Arrays and Vectorized Computation
          1. NumPy arrays
            1. Data types
            2. Array creation
            3. Indexing and slicing
            4. Fancy indexing
            5. Numerical operations on arrays
          2. Array functions
          3. Data processing using arrays
            1. Loading and saving data
            2. Saving an array
            3. Loading an array
          4. Linear algebra with NumPy
          5. NumPy random numbers
          6. Summary
        3. 3. Data Analysis with Pandas
          1. An overview of the Pandas package
          2. The Pandas data structure
            1. Series
            2. The DataFrame
          3. The essential basic functionality
            1. Reindexing and altering labels
            2. Head and tail
            3. Binary operations
            4. Functional statistics
            5. Function application
            6. Sorting
          4. Indexing and selecting data
          5. Computational tools
          6. Working with missing data
          7. Advanced uses of Pandas for data analysis
            1. Hierarchical indexing
            2. The Panel data
          8. Summary
        4. 4. Data Visualization
          1. The matplotlib API primer
            1. Line properties
            2. Figures and subplots
          2. Exploring plot types
            1. Scatter plots
            2. Bar plots
            3. Contour plots
            4. Histogram plots
          3. Legends and annotations
          4. Plotting functions with Pandas
          5. Additional Python data visualization tools
            1. Bokeh
            2. MayaVi
          6. Summary
        5. 5. Time Series
          1. Time series primer
          2. Working with date and time objects
          3. Resampling time series
          4. Downsampling time series data
          5. Upsampling time series data
          6. Time zone handling
          7. Timedeltas
          8. Time series plotting
          9. Summary
        6. 6. Interacting with Databases
          1. Interacting with data in text format
            1. Reading data from text format
            2. Writing data to text format
          2. Interacting with data in binary format
            1. HDF5
          3. Interacting with data in MongoDB
          4. Interacting with data in Redis
            1. The simple value
            2. List
            3. Set
            4. Ordered set
          5. Summary
        7. 7. Data Analysis Application Examples
          1. Data munging
            1. Cleaning data
            2. Filtering
            3. Merging data
            4. Reshaping data
          2. Data aggregation
          3. Grouping data
          4. Summary
        8. 8. Machine Learning Models with scikit-learn
          1. An overview of machine learning models
          2. The scikit-learn modules for different models
          3. Data representation in scikit-learn
          4. Supervised learning – classification and regression
          5. Unsupervised learning – clustering and dimensionality reduction
          6. Measuring prediction performance
          7. Summary
      7. 2. Module 2
        1. 1. Laying the Foundation for Reproducible Data Analysis
          1. Introduction
          2. Setting up Anaconda
            1. Getting ready
            2. How to do it...
            3. There's more...
            4. See also
          3. Installing the Data Science Toolbox
            1. Getting ready
            2. How to do it...
            3. How it works...
            4. See also
          4. Creating a virtual environment with virtualenv and virtualenvwrapper
            1. Getting ready
            2. How to do it...
            3. See also
          5. Sandboxing Python applications with Docker images
            1. Getting ready
            2. How to do it...
            3. How it works...
            4. See also
          6. Keeping track of package versions and history in IPython Notebook
            1. Getting ready
            2. How to do it...
            3. How it works...
            4. See also
          7. Configuring IPython
            1. Getting ready
            2. How to do it...
            3. See also
          8. Learning to log for robust error checking
            1. Getting ready
            2. How to do it...
            3. How it works...
            4. See also
          9. Unit testing your code
            1. Getting ready
            2. How to do it...
            3. How it works...
            4. See also
          10. Configuring pandas
            1. Getting ready
            2. How to do it...
          11. Configuring matplotlib
            1. Getting ready
            2. How to do it...
            3. How it works...
            4. See also
          12. Seeding random number generators and NumPy print options
            1. Getting ready
            2. How to do it...
            3. See also
          13. Standardizing reports, code style, and data access
            1. Getting ready
            2. How to do it...
            3. See also
        2. 2. Creating Attractive Data Visualizations
          1. Introduction
          2. Graphing Anscombe's quartet
            1. How to do it...
            2. See also
          3. Choosing seaborn color palettes
            1. How to do it...
            2. See also
          4. Choosing matplotlib color maps
            1. How to do it...
            2. See also
          5. Interacting with IPython Notebook widgets
            1. How to do it...
            2. See also
          6. Viewing a matrix of scatterplots
            1. How to do it...
          7. Visualizing with d3.js via mpld3
            1. Getting ready
            2. How to do it...
          8. Creating heatmaps
            1. Getting ready
            2. How to do it...
            3. See also
          9. Combining box plots and kernel density plots with violin plots
            1. How to do it...
            2. See also
          10. Visualizing network graphs with hive plots
            1. Getting ready
            2. How to do it...
          11. Displaying geographical maps
            1. Getting ready
            2. How to do it...
          12. Using ggplot2-like plots
            1. Getting ready
            2. How to do it...
          13. Highlighting data points with influence plots
            1. How to do it...
            2. See also
        3. 3. Statistical Data Analysis and Probability
          1. Introduction
          2. Fitting data to the exponential distribution
            1. How to do it...
            2. How it works…
            3. See also
          3. Fitting aggregated data to the gamma distribution
            1. How to do it...
            2. See also
          4. Fitting aggregated counts to the Poisson distribution
            1. How to do it...
            2. See also
          5. Determining bias
            1. How to do it...
            2. See also
          6. Estimating kernel density
            1. How to do it...
            2. See also
          7. Determining confidence intervals for mean, variance, and standard deviation
            1. How to do it...
            2. See also
          8. Sampling with probability weights
            1. How to do it...
            2. See also
          9. Exploring extreme values
            1. How to do it...
            2. See also
          10. Correlating variables with Pearson's correlation
            1. How to do it...
            2. See also
          11. Correlating variables with the Spearman rank correlation
            1. How to do it...
            2. See also
          12. Correlating a binary and a continuous variable with the point biserial correlation
            1. How to do it...
            2. See also
          13. Evaluating relations between variables with ANOVA
            1. How to do it...
            2. See also
        4. 4. Dealing with Data and Numerical Issues
          1. Introduction
          2. Clipping and filtering outliers
            1. How to do it...
            2. See also
          3. Winsorizing data
            1. How to do it...
            2. See also
          4. Measuring central tendency of noisy data
            1. How to do it...
            2. See also
          5. Normalizing with the Box-Cox transformation
            1. How to do it...
            2. How it works
            3. See also
          6. Transforming data with the power ladder
            1. How to do it...
          7. Transforming data with logarithms
            1. How to do it...
          8. Rebinning data
            1. How to do it...
          9. Applying logit() to transform proportions
            1. How to do it...
          10. Fitting a robust linear model
            1. How to do it...
            2. See also
          11. Taking variance into account with weighted least squares
            1. How to do it...
            2. See also
          12. Using arbitrary precision for optimization
            1. Getting ready
            2. How to do it...
            3. See also
          13. Using arbitrary precision for linear algebra
            1. Getting ready
            2. How to do it...
            3. See also
        5. 5. Web Mining, Databases, and Big Data
          1. Introduction
          2. Simulating web browsing
            1. Getting ready
            2. How to do it…
            3. See also
          3. Scraping the Web
            1. Getting ready
            2. How to do it…
          4. Dealing with non-ASCII text and HTML entities
            1. Getting ready
            2. How to do it…
            3. See also
          5. Implementing association tables
            1. Getting ready
            2. How to do it…
          6. Setting up database migration scripts
            1. Getting ready
            2. How to do it…
            3. See also
          7. Adding a table column to an existing table
            1. Getting ready
            2. How to do it…
          8. Adding indices after table creation
            1. Getting ready
            2. How to do it…
            3. How it works…
            4. See also
          9. Setting up a test web server
            1. Getting ready
            2. How to do it…
          10. Implementing a star schema with fact and dimension tables
            1. How to do it…
            2. See also
          11. Using HDFS
            1. Getting ready
            2. How to do it…
            3. See also
          12. Setting up Spark
            1. Getting ready
            2. How to do it…
            3. See also
          13. Clustering data with Spark
            1. Getting ready
            2. How to do it…
            3. How it works…
            4. There's more…
            5. See also
        6. 6. Signal Processing and Timeseries
          1. Introduction
          2. Spectral analysis with periodograms
            1. How to do it...
            2. See also
          3. Estimating power spectral density with the Welch method
            1. How to do it...
            2. See also
          4. Analyzing peaks
            1. How to do it...
            2. See also
          5. Measuring phase synchronization
            1. How to do it...
            2. See also
          6. Exponential smoothing
            1. How to do it...
            2. See also
          7. Evaluating smoothing
            1. How to do it...
            2. See also
          8. Using the Lomb-Scargle periodogram
            1. How to do it...
            2. See also
          9. Analyzing the frequency spectrum of audio
            1. How to do it...
            2. See also
          10. Analyzing signals with the discrete cosine transform
            1. How to do it...
            2. See also
          11. Block bootstrapping time series data
            1. How to do it...
            2. See also
          12. Moving block bootstrapping time series data
            1. How to do it...
            2. See also
          13. Applying the discrete wavelet transform
            1. Getting started
            2. How to do it...
            3. See also
        7. 7. Selecting Stocks with Financial Data Analysis
          1. Introduction
          2. Computing simple and log returns
            1. How to do it...
            2. See also
          3. Ranking stocks with the Sharpe ratio and liquidity
            1. How to do it...
            2. See also
          4. Ranking stocks with the Calmar and Sortino ratios
            1. How to do it...
            2. See also
          5. Analyzing returns statistics
            1. How to do it...
          6. Correlating individual stocks with the broader market
            1. How to do it...
          7. Exploring risk and return
            1. How to do it...
            2. See also
          8. Examining the market with the non-parametric runs test
            1. How to do it...
            2. See also
          9. Testing for random walks
            1. How to do it...
            2. See also
          10. Determining market efficiency with autoregressive models
            1. How to do it...
            2. See also
          11. Creating tables for a stock prices database
            1. How to do it...
          12. Populating the stock prices database
            1. How to do it...
          13. Optimizing an equal weights two-asset portfolio
            1. How to do it...
            2. See also
        8. 8. Text Mining and Social Network Analysis
          1. Introduction
          2. Creating a categorized corpus
            1. Getting ready
            2. How to do it...
            3. See also
          3. Tokenizing news articles in sentences and words
            1. Getting ready
            2. How to do it...
            3. See also
          4. Stemming, lemmatizing, filtering, and TF-IDF scores
            1. Getting ready
            2. How to do it...
            3. How it works
            4. See also
          5. Recognizing named entities
            1. Getting ready
            2. How to do it...
            3. How it works
            4. See also
          6. Extracting topics with non-negative matrix factorization
            1. How to do it...
            2. How it works
            3. See also
          7. Implementing a basic terms database
            1. How to do it...
            2. How it works
            3. See also
          8. Computing social network density
            1. Getting ready
            2. How to do it...
            3. See also
          9. Calculating social network closeness centrality
            1. Getting ready
            2. How to do it...
            3. See also
          10. Determining the betweenness centrality
            1. Getting ready
            2. How to do it...
            3. See also
          11. Estimating the average clustering coefficient
            1. Getting ready
            2. How to do it...
            3. See also
          12. Calculating the assortativity coefficient of a graph
            1. Getting ready
            2. How to do it...
            3. See also
          13. Getting the clique number of a graph
            1. Getting ready
            2. How to do it...
            3. See also
          14. Creating a document graph with cosine similarity
            1. How to do it...
            2. See also
        9. 9. Ensemble Learning and Dimensionality Reduction
          1. Introduction
          2. Recursively eliminating features
            1. How to do it...
            2. How it works
            3. See also
          3. Applying principal component analysis for dimension reduction
            1. How to do it...
            2. See also
          4. Applying linear discriminant analysis for dimension reduction
            1. How to do it...
            2. See also
          5. Stacking and majority voting for multiple models
            1. How to do it...
            2. See also
          6. Learning with random forests
            1. How to do it...
            2. There's more…
            3. See also
          7. Fitting noisy data with the RANSAC algorithm
            1. How to do it...
            2. See also
          8. Bagging to improve results
            1. How to do it...
            2. See also
          9. Boosting for better learning
            1. How to do it...
            2. See also
          10. Nesting cross-validation
            1. How to do it...
            2. See also
          11. Reusing models with joblib
            1. How to do it...
            2. See also
          12. Hierarchically clustering data
            1. How to do it...
            2. See also
          13. Taking a Theano tour
            1. Getting ready
            2. How to do it...
            3. See also
        10. 10. Evaluating Classifiers, Regressors, and Clusters
          1. Introduction
          2. Getting classification straight with the confusion matrix
            1. How to do it...
            2. How it works
            3. See also
          3. Computing precision, recall, and F1-score
            1. How to do it...
            2. See also
          4. Examining a receiver operating characteristic and the area under a curve
            1. How to do it...
            2. See also
          5. Visualizing the goodness of fit
            1. How to do it...
            2. See also
          6. Computing MSE and median absolute error
            1. How to do it...
            2. See also
          7. Evaluating clusters with the mean silhouette coefficient
            1. How to do it...
            2. See also
          8. Comparing results with a dummy classifier
            1. How to do it...
            2. See also
          9. Determining MAPE and MPE
            1. How to do it...
            2. See also
          10. Comparing with a dummy regressor
            1. How to do it...
            2. See also
          11. Calculating the mean absolute error and the residual sum of squares
            1. How to do it...
            2. See also
          12. Examining the kappa of classification
            1. How to do it...
            2. How it works
            3. See also
          13. Taking a look at the Matthews correlation coefficient
            1. How to do it...
            2. See also
        11. 11. Analyzing Images
          1. Introduction
          2. Setting up OpenCV
            1. Getting ready
            2. How to do it...
            3. How it works
            4. There's more
          3. Applying Scale-Invariant Feature Transform (SIFT)
            1. Getting ready
            2. How to do it...
            3. See also
          4. Detecting features with SURF
            1. Getting ready
            2. How to do it...
            3. See also
          5. Quantizing colors
            1. Getting ready
            2. How to do it...
            3. See also
          6. Denoising images
            1. Getting ready
            2. How to do it...
            3. See also
          7. Extracting patches from an image
            1. Getting ready
            2. How to do it...
            3. See also
          8. Detecting faces with Haar cascades
            1. Getting ready
            2. How to do it...
            3. See also
          9. Searching for bright stars
            1. Getting ready
            2. How to do it...
            3. See also
          10. Extracting metadata from images
            1. Getting ready
            2. How to do it...
            3. See also
          11. Extracting texture features from images
            1. Getting ready
            2. How to do it...
            3. See also
          12. Applying hierarchical clustering on images
            1. How to do it...
            2. See also
          13. Segmenting images with spectral clustering
            1. How to do it...
            2. See also
        12. 12. Parallelism and Performance
          1. Introduction
          2. Just-in-time compiling with Numba
            1. Getting ready
            2. How to do it...
            3. How it works
            4. See also
          3. Speeding up numerical expressions with Numexpr
            1. How to do it...
            2. How it works
            3. See also
          4. Running multiple threads with the threading module
            1. How to do it...
            2. See also
          5. Launching multiple tasks with the concurrent.futures module
            1. How to do it...
            2. See also
          6. Accessing resources asynchronously with the asyncio module
            1. How to do it...
            2. See also
          7. Distributed processing with execnet
            1. Getting ready
            2. How to do it...
            3. See also
          8. Profiling memory usage
            1. Getting ready
            2. How to do it...
            3. See also
          9. Calculating the mean, variance, skewness, and kurtosis on the fly
            1. Getting ready
            2. How to do it...
            3. See also
          10. Caching with a least recently used cache
            1. Getting ready
            2. How to do it...
            3. See also
          11. Caching HTTP requests
            1. Getting ready
            2. How to do it...
            3. See also
          12. Streaming counting with the Count-min sketch
            1. How to do it...
            2. See also
          13. Harnessing the power of the GPU with OpenCL
            1. Getting ready
            2. How to do it...
            3. See also
        13. A. Glossary
        14. B. Function Reference
          1. IPython
          2. Matplotlib
          3. NumPy
          4. pandas
          5. Scikit-learn
          6. SciPy
          7. Seaborn
          8. Statsmodels
        15. C. Online Resources
          1. IPython notebooks and open data
          2. Mathematics and statistics
            1. Presentations
        16. D. Tips and Tricks for Command-Line and Miscellaneous Tools
          1. IPython notebooks
          2. Command-line tools
          3. The alias command
          4. Command-line history
          5. Reproducible sessions
          6. Docker tips
      8. 3. Module 3
        1. 1. Tools of the Trade
          1. Before you start
          2. Using the notebook interface
          3. Imports
          4. An example using the Pandas library
          5. Summary
        2. 2. Exploring Data
          1. The General Social Survey
            1. Obtaining the data
            2. Reading the data
          2. Univariate data
            1. Histograms
              1. Making things pretty
              2. Characterization
            2. Concept of statistical inference
            3. Numeric summaries and boxplots
          3. Relationships between variables – scatterplots
          4. Summary
        3. 3. Learning About Models
          1. Models and experiments
          2. The cumulative distribution function
          3. Working with distributions
          4. The probability density function
          5. Where do models come from?
          6. Multivariate distributions
          7. Summary
        4. 4. Regression
          1. Introducing linear regression
            1. Getting the dataset
            2. Testing with linear regression
          2. Multivariate regression
            1. Adding economic indicators
            2. Taking a step back
          3. Logistic regression
            1. Some notes
          4. Summary
        5. 5. Clustering
          1. Introduction to cluster finding
            1. Starting out simple – John Snow on cholera
          2. K-means clustering
            1. Suicide rate versus GDP versus absolute latitude
          3. Hierarchical clustering analysis
            1. Reading in and reducing the data
            2. Hierarchical cluster algorithm
          4. Summary
        6. 6. Bayesian Methods
          1. The Bayesian method
            1. Credible versus confidence intervals
            2. Bayes formula
            3. Python packages
          2. U.S. air travel safety record
            1. Getting the NTSB database
            2. Binning the data
            3. Bayesian analysis of the data
              1. Binning by month
            4. Plotting coordinates
              1. Cartopy
              2. Mpl toolkits – basemap
          3. Climate change - CO2 in the atmosphere
            1. Getting the data
            2. Creating and sampling the model
          4. Summary
        7. 7. Supervised and Unsupervised Learning
          1. Introduction to machine learning
          2. Scikit-learn
          3. Linear regression
            1. Climate data
            2. Checking with Bayesian analysis and OLS
          4. Clustering
          5. Seeds classification
            1. Visualizing the data
            2. Feature selection
            3. Classifying the data
              1. The SVC linear kernel
              2. The SVC Radial Basis Function
              3. The SVC polynomial
              4. K-Nearest Neighbour
              5. Random Forest
            4. Choosing your classifier
          6. Summary
        8. 8. Time Series Analysis
          1. Introduction
          2. Pandas and time series data
          3. Indexing and slicing
          4. Resampling, smoothing, and other estimates
          5. Stationarity
          6. Patterns and components
            1. Decomposing components
            2. Differencing
          7. Time series models
            1. Autoregressive – AR
            2. Moving average – MA
            3. Selecting p and q
              1. Automatic function
              2. The (Partial) AutoCorrelation Function
            4. Autoregressive Integrated Moving Average – ARIMA
          8. Summary
        9. E. More on Jupyter Notebook and matplotlib Styles
          1. Jupyter Notebook
            1. Useful keyboard shortcuts
              1. Command mode shortcuts
              2. Edit mode shortcuts
            2. Markdown cells
            3. Notebook Python extensions
              1. Installing the extensions
              2. Codefolding
              3. Collapsible headings
              4. Help panel
              5. Initialization cells
              6. NbExtensions menu item
              7. Ruler
              8. Skip-traceback
              9. Table of contents
            4. Other Jupyter Notebook tips
              1. External connections
              2. Export
              3. Additional file types
          2. Matplotlib styles
          3. Useful resources
            1. General resources
            2. Packages
            3. Data repositories
            4. Visualization of data
          4. Summary
      9. A. Bibliography
      10. Index

    Product information

    • Title: Python: End-to-end Data Analysis
    • Author(s): Phuong Vothihong, Martin Czygan, Ivan Idris, Magnus Vilhelm Persson, Luiz Felipe Martins
    • Release date: May 2017
    • Publisher(s): Packt Publishing
    • ISBN: 9781788394697