Machine Learning Pocket Reference

Book description

With detailed notes, tables, and examples, this handy reference will help you navigate the basics of machine learning with structured data. Author Matt Harrison delivers a valuable guide that you can use for additional support during training and as a convenient resource when you dive into your next machine learning project.

Ideal for programmers, data scientists, and AI engineers, this book includes an overview of the machine learning process and walks you through classification with structured data. You’ll also learn methods for clustering, predicting a continuous value (regression), and reducing dimensionality, among other topics.

This pocket reference includes sections that cover:

  • Classification, using the Titanic dataset
  • Cleaning data and handling missing values
  • Exploratory data analysis
  • Common preprocessing steps using sample data
  • Selecting features useful to the model
  • Model selection
  • Metrics and classification evaluation
  • Regression examples using k-nearest neighbor, decision trees, boosting, and more
  • Metrics for regression evaluation
  • Clustering
  • Dimensionality reduction
  • Scikit-learn pipelines
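
For a flavor of the workflow the book covers, here is a minimal illustrative sketch (not code from the book): a scikit-learn pipeline that imputes missing values, standardizes features, and fits a baseline classifier. It uses scikit-learn's built-in breast cancer dataset rather than the Titanic data the book works through, so it runs without a download; the step names and model choice here are assumptions for illustration.

    # A minimal pipeline sketch: impute -> standardize -> baseline classifier.
    # Illustrative only; not taken from the book.
    from sklearn.datasets import load_breast_cancer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # fill any missing values
        ("scale", StandardScaler()),                   # zero mean, unit variance
        ("clf", LogisticRegression(max_iter=1000)),    # simple baseline model
    ])
    pipe.fit(X_train, y_train)
    print(f"holdout accuracy: {pipe.score(X_test, y_test):.3f}")

Bundling preprocessing and model into a single estimator keeps the transform steps from leaking test data into training, which is the usual reason to reach for Pipeline.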

Table of contents

  Preface
    1. What to Expect
    2. Who This Book Is For
    3. Conventions Used in This Book
    4. Using Code Examples
    5. O’Reilly Online Learning
    6. How to Contact Us
    7. Acknowledgments
  1. Introduction
    1. Libraries Used
    2. Installation with Pip
    3. Installation with Conda
  2. Overview of the Machine Learning Process
  3. Classification Walkthrough: Titanic Dataset
    1. Project Layout Suggestion
    2. Imports
    3. Ask a Question
    4. Terms for Data
    5. Gather Data
    6. Clean Data
    7. Create Features
    8. Sample Data
    9. Impute Data
    10. Normalize Data
    11. Refactor
    12. Baseline Model
    13. Various Families
    14. Stacking
    15. Create Model
    16. Evaluate Model
    17. Optimize Model
    18. Confusion Matrix
    19. ROC Curve
    20. Learning Curve
    21. Deploy Model
  4. Missing Data
    1. Examining Missing Data
    2. Dropping Missing Data
    3. Imputing Data
    4. Adding Indicator Columns
  5. Cleaning Data
    1. Column Names
    2. Replacing Missing Values
  6. Exploring
    1. Data Size
    2. Summary Stats
    3. Histogram
    4. Scatter Plot
    5. Joint Plot
    6. Pair Grid
    7. Box and Violin Plots
    8. Comparing Two Ordinal Values
    9. Correlation
    10. RadViz
    11. Parallel Coordinates
  7. Preprocess Data
    1. Standardize
    2. Scale to Range
    3. Dummy Variables
    4. Label Encoder
    5. Frequency Encoding
    6. Pulling Categories from Strings
    7. Other Categorical Encoding
    8. Date Feature Engineering
    9. Add col_na Feature
    10. Manual Feature Engineering
  8. Feature Selection
    1. Collinear Columns
    2. Lasso Regression
    3. Recursive Feature Elimination
    4. Mutual Information
    5. Principal Component Analysis
    6. Feature Importance
  9. Imbalanced Classes
    1. Use a Different Metric
    2. Tree-based Algorithms and Ensembles
    3. Penalize Models
    4. Upsampling Minority
    5. Generate Minority Data
    6. Downsampling Majority
    7. Upsampling Then Downsampling
  10. Classification
    1. Logistic Regression
    2. Naive Bayes
    3. Support Vector Machine
    4. K-Nearest Neighbor
    5. Decision Tree
    6. Random Forest
    7. XGBoost
    8. Gradient Boosted with LightGBM
    9. TPOT
  11. Model Selection
    1. Validation Curve
    2. Learning Curve
  12. Metrics and Classification Evaluation
    1. Confusion Matrix
    2. Metrics
    3. Accuracy
    4. Recall
    5. Precision
    6. F1
    7. Classification Report
    8. ROC
    9. Precision-Recall Curve
    10. Cumulative Gains Plot
    11. Lift Curve
    12. Class Balance
    13. Class Prediction Error
    14. Discrimination Threshold
  13. Explaining Models
    1. Regression Coefficients
    2. Feature Importance
    3. LIME
    4. Tree Interpretation
    5. Partial Dependence Plots
    6. Surrogate Models
    7. Shapley
  14. Regression
    1. Baseline Model
    2. Linear Regression
    3. SVMs
    4. K-Nearest Neighbor
    5. Decision Tree
    6. Random Forest
    7. XGBoost Regression
    8. LightGBM Regression
  15. Metrics and Regression Evaluation
    1. Metrics
    2. Residuals Plot
    3. Heteroscedasticity
    4. Normal Residuals
    5. Prediction Error Plot
  16. Explaining Regression Models
    1. Shapley
  17. Dimensionality Reduction
    1. PCA
    2. UMAP
    3. t-SNE
    4. PHATE
  18. Clustering
    1. K-Means
    2. Agglomerative (Hierarchical) Clustering
    3. Understanding Clusters
  19. Pipelines
    1. Classification Pipeline
    2. Regression Pipeline
    3. PCA Pipeline
  Index

Product information

  • Title: Machine Learning Pocket Reference
  • Author(s): Matt Harrison
  • Release date: August 2019
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492047544