Machine Learning with Python Cookbook

Book description

This practical guide provides nearly 200 self-contained recipes to help you solve machine learning challenges you may encounter in your daily work. If you’re comfortable with Python and its libraries, including pandas and scikit-learn, you’ll be able to address specific problems such as loading data, handling text or numerical data, model selection, and dimensionality reduction, among many other topics.

Each recipe includes code that you can copy, paste, and run on a toy dataset to make sure it actually works. From there, you can insert, combine, or adapt the code to build your own application. Recipes also include a discussion that explains the solution and provides meaningful context. This cookbook takes you beyond theory and concepts by providing the nuts and bolts you need to construct working machine learning applications.
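
To give a flavor of the recipe format, here is a minimal illustrative sketch in the same spirit (it is not taken from the book). It standardizes a numerical feature on a toy dataset with scikit-learn, one of the topics the recipes cover:

    # Illustrative sketch only, not the book's code:
    # standardize a numerical feature to mean 0 and variance 1.
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # A toy dataset: one numerical feature, one observation per row
    feature = np.array([[-500.5], [-100.1], [0.0], [100.1], [900.9]])

    # Fit the scaler and transform the feature
    scaler = StandardScaler()
    standardized = scaler.fit_transform(feature)

    print(standardized)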

You’ll find recipes for:

  • Vectors, matrices, and arrays
  • Handling numerical and categorical data, text, images, and dates and times
  • Dimensionality reduction using feature extraction or feature selection
  • Model evaluation and selection
  • Linear and logistic regression, trees and forests, and k-nearest neighbors
  • Support vector machines (SVM), naïve Bayes, clustering, and neural networks
  • Saving and loading trained models (see the sketch after this list)
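
As one example of the last topic above, here is a minimal sketch of saving and reloading a trained model. It assumes the standalone joblib package; the book's own recipes may use different tooling:

    # Illustrative sketch, assuming the joblib package is installed
    import joblib
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    # Train a classifier on a sample dataset
    features, target = load_iris(return_X_y=True)
    model = RandomForestClassifier(random_state=0)
    model.fit(features, target)

    # Persist the trained model to disk, then load it back
    joblib.dump(model, "model.joblib")
    restored_model = joblib.load("model.joblib")

    # The restored model predicts just like the original
    print(restored_model.predict(features[:3]))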

Table of contents

  1. Preface
    1. Who This Book Is For
    2. Who This Book Is Not For
    3. Terminology Used in This Book
    4. Acknowledgments
  2. 1. Vectors, Matrices, and Arrays
    1. 1.0. Introduction
    2. 1.1. Creating a Vector
    3. 1.2. Creating a Matrix
    4. 1.3. Creating a Sparse Matrix
    5. 1.4. Selecting Elements
    6. 1.5. Describing a Matrix
    7. 1.6. Applying Operations to Elements
    8. 1.7. Finding the Maximum and Minimum Values
    9. 1.8. Calculating the Average, Variance, and Standard Deviation
    10. 1.9. Reshaping Arrays
    11. 1.10. Transposing a Vector or Matrix
    12. 1.11. Flattening a Matrix
    13. 1.12. Finding the Rank of a Matrix
    14. 1.13. Calculating the Determinant
    15. 1.14. Getting the Diagonal of a Matrix
    16. 1.15. Calculating the Trace of a Matrix
    17. 1.16. Finding Eigenvalues and Eigenvectors
    18. 1.17. Calculating Dot Products
    19. 1.18. Adding and Subtracting Matrices
    20. 1.19. Multiplying Matrices
    21. 1.20. Inverting a Matrix
    22. 1.21. Generating Random Values
  3. 2. Loading Data
    1. 2.0. Introduction
    2. 2.1. Loading a Sample Dataset
    3. 2.2. Creating a Simulated Dataset
    4. 2.3. Loading a CSV File
    5. 2.4. Loading an Excel File
    6. 2.5. Loading a JSON File
    7. 2.6. Querying a SQL Database
  4. 3. Data Wrangling
    1. 3.0. Introduction
    2. 3.1. Creating a DataFrame
    3. 3.2. Describing the Data
    4. 3.3. Navigating DataFrames
    5. 3.4. Selecting Rows Based on Conditionals
    6. 3.5. Replacing Values
    7. 3.6. Renaming Columns
    8. 3.7. Finding the Minimum, Maximum, Sum, Average, and Count
    9. 3.8. Finding Unique Values
    10. 3.9. Handling Missing Values
    11. 3.10. Deleting a Column
    12. 3.11. Deleting a Row
    13. 3.12. Dropping Duplicate Rows
    14. 3.13. Grouping Rows by Values
    15. 3.14. Grouping Rows by Time
    16. 3.15. Looping Over a Column
    17. 3.16. Applying a Function Over All Elements in a Column
    18. 3.17. Applying a Function to Groups
    19. 3.18. Concatenating DataFrames
    20. 3.19. Merging DataFrames
  5. 4. Handling Numerical Data
    1. 4.0. Introduction
    2. 4.1. Rescaling a Feature
    3. 4.2. Standardizing a Feature
    4. 4.3. Normalizing Observations
    5. 4.4. Generating Polynomial and Interaction Features
    6. 4.5. Transforming Features
    7. 4.6. Detecting Outliers
    8. 4.7. Handling Outliers
    9. 4.8. Discretizing Features
    10. 4.9. Grouping Observations Using Clustering
    11. 4.10. Deleting Observations with Missing Values
    12. 4.11. Imputing Missing Values
  6. 5. Handling Categorical Data
    1. 5.0. Introduction
    2. 5.1. Encoding Nominal Categorical Features
    3. 5.2. Encoding Ordinal Categorical Features
    4. 5.3. Encoding Dictionaries of Features
    5. 5.4. Imputing Missing Class Values
    6. 5.5. Handling Imbalanced Classes
  7. 6. Handling Text
    1. 6.0. Introduction
    2. 6.1. Cleaning Text
    3. 6.2. Parsing and Cleaning HTML
    4. 6.3. Removing Punctuation
    5. 6.4. Tokenizing Text
    6. 6.5. Removing Stop Words
    7. 6.6. Stemming Words
    8. 6.7. Tagging Parts of Speech
    9. 6.8. Encoding Text as a Bag of Words
    10. 6.9. Weighting Word Importance
  8. 7. Handling Dates and Times
    1. 7.0. Introduction
    2. 7.1. Converting Strings to Dates
    3. 7.2. Handling Time Zones
    4. 7.3. Selecting Dates and Times
    5. 7.4. Breaking Up Date Data into Multiple Features
    6. 7.5. Calculating the Difference Between Dates
    7. 7.6. Encoding Days of the Week
    8. 7.7. Creating a Lagged Feature
    9. 7.8. Using Rolling Time Windows
    10. 7.9. Handling Missing Data in Time Series
  9. 8. Handling Images
    1. 8.0. Introduction
    2. 8.1. Loading Images
    3. 8.2. Saving Images
    4. 8.3. Resizing Images
    5. 8.4. Cropping Images
    6. 8.5. Blurring Images
    7. 8.6. Sharpening Images
    8. 8.7. Enhancing Contrast
    9. 8.8. Isolating Colors
    10. 8.9. Binarizing Images
    11. 8.10. Removing Backgrounds
    12. 8.11. Detecting Edges
    13. 8.12. Detecting Corners
    14. 8.13. Creating Features for Machine Learning
    15. 8.14. Encoding Mean Color as a Feature
    16. 8.15. Encoding Color Histograms as Features
  10. 9. Dimensionality Reduction Using Feature Extraction
    1. 9.0. Introduction
    2. 9.1. Reducing Features Using Principal Components
    3. 9.2. Reducing Features When Data Is Linearly Inseparable
    4. 9.3. Reducing Features by Maximizing Class Separability
    5. 9.4. Reducing Features Using Matrix Factorization
    6. 9.5. Reducing Features on Sparse Data
  11. 10. Dimensionality Reduction Using Feature Selection
    1. 10.0. Introduction
    2. 10.1. Thresholding Numerical Feature Variance
    3. 10.2. Thresholding Binary Feature Variance
    4. 10.3. Handling Highly Correlated Features
    5. 10.4. Removing Irrelevant Features for Classification
    6. 10.5. Recursively Eliminating Features
  12. 11. Model Evaluation
    1. 11.0. Introduction
    2. 11.1. Cross-Validating Models
    3. 11.2. Creating a Baseline Regression Model
    4. 11.3. Creating a Baseline Classification Model
    5. 11.4. Evaluating Binary Classifier Predictions
    6. 11.5. Evaluating Binary Classifier Thresholds
    7. 11.6. Evaluating Multiclass Classifier Predictions
    8. 11.7. Visualizing a Classifier’s Performance
    9. 11.8. Evaluating Regression Models
    10. 11.9. Evaluating Clustering Models
    11. 11.10. Creating a Custom Evaluation Metric
    12. 11.11. Visualizing the Effect of Training Set Size
    13. 11.12. Creating a Text Report of Evaluation Metrics
    14. 11.13. Visualizing the Effect of Hyperparameter Values
  13. 12. Model Selection
    1. 12.0. Introduction
    2. 12.1. Selecting Best Models Using Exhaustive Search
    3. 12.2. Selecting Best Models Using Randomized Search
    4. 12.3. Selecting Best Models from Multiple Learning Algorithms
    5. 12.4. Selecting Best Models When Preprocessing
    6. 12.5. Speeding Up Model Selection with Parallelization
    7. 12.6. Speeding Up Model Selection Using Algorithm-Specific Methods
    8. 12.7. Evaluating Performance After Model Selection
  14. 13. Linear Regression
    1. 13.0. Introduction
    2. 13.1. Fitting a Line
    3. 13.2. Handling Interactive Effects
    4. 13.3. Fitting a Nonlinear Relationship
    5. 13.4. Reducing Variance with Regularization
    6. 13.5. Reducing Features with Lasso Regression
  15. 14. Trees and Forests
    1. 14.0. Introduction
    2. 14.1. Training a Decision Tree Classifier
    3. 14.2. Training a Decision Tree Regressor
    4. 14.3. Visualizing a Decision Tree Model
    5. 14.4. Training a Random Forest Classifier
    6. 14.5. Training a Random Forest Regressor
    7. 14.6. Identifying Important Features in Random Forests
    8. 14.7. Selecting Important Features in Random Forests
    9. 14.8. Handling Imbalanced Classes
    10. 14.9. Controlling Tree Size
    11. 14.10. Improving Performance Through Boosting
    12. 14.11. Evaluating Random Forests with Out-of-Bag Errors
  16. 15. K-Nearest Neighbors
    1. 15.0. Introduction
    2. 15.1. Finding an Observation’s Nearest Neighbors
    3. 15.2. Creating a K-Nearest Neighbor Classifier
    4. 15.3. Identifying the Best Neighborhood Size
    5. 15.4. Creating a Radius-Based Nearest Neighbor Classifier
  17. 16. Logistic Regression
    1. 16.0. Introduction
    2. 16.1. Training a Binary Classifier
    3. 16.2. Training a Multiclass Classifier
    4. 16.3. Reducing Variance Through Regularization
    5. 16.4. Training a Classifier on Very Large Data
    6. 16.5. Handling Imbalanced Classes
  18. 17. Support Vector Machines
    1. 17.0. Introduction
    2. 17.1. Training a Linear Classifier
    3. 17.2. Handling Linearly Inseparable Classes Using Kernels
    4. 17.3. Creating Predicted Probabilities
    5. 17.4. Identifying Support Vectors
    6. 17.5. Handling Imbalanced Classes
  19. 18. Naive Bayes
    1. 18.0. Introduction
    2. 18.1. Training a Classifier for Continuous Features
    3. 18.2. Training a Classifier for Discrete and Count Features
    4. 18.3. Training a Naive Bayes Classifier for Binary Features
    5. 18.4. Calibrating Predicted Probabilities
  20. 19. Clustering
    1. 19.0. Introduction
    2. 19.1. Clustering Using K-Means
    3. 19.2. Speeding Up K-Means Clustering
    4. 19.3. Clustering Using Meanshift
    5. 19.4. Clustering Using DBSCAN
    6. 19.5. Clustering Using Hierarchical Merging
  21. 20. Neural Networks
    1. 20.0. Introduction
    2. 20.1. Preprocessing Data for Neural Networks
    3. 20.2. Designing a Neural Network
    4. 20.3. Training a Binary Classifier
    5. 20.4. Training a Multiclass Classifier
    6. 20.5. Training a Regressor
    7. 20.6. Making Predictions
    8. 20.7. Visualizing Training History
    9. 20.8. Reducing Overfitting with Weight Regularization
    10. 20.9. Reducing Overfitting with Early Stopping
    11. 20.10. Reducing Overfitting with Dropout
    12. 20.11. Saving Model Training Progress
    13. 20.12. k-Fold Cross-Validating Neural Networks
    14. 20.13. Tuning Neural Networks
    15. 20.14. Visualizing Neural Networks
    16. 20.15. Classifying Images
    17. 20.16. Improving Performance with Image Augmentation
    18. 20.17. Classifying Text
  22. 21. Saving and Loading Trained Models
    1. 21.0. Introduction
    2. 21.1. Saving and Loading a scikit-learn Model
    3. 21.2. Saving and Loading a Keras Model
  23. Index

Product information

  • Title: Machine Learning with Python Cookbook
  • Author(s): Chris Albon
  • Release date: March 2018
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491989388