Python: Real World Machine Learning

Book description

Learn to solve challenging data science problems by building powerful machine learning models using Python

About This Book

  • Understand which algorithms to use in a given context with the help of this exciting recipe-based guide

  • This practical tutorial tackles real-world computing problems through a rigorous and effective approach

  • Build state-of-the-art models and develop personalized recommendations to perform machine learning at scale

Who This Book Is For

This Learning Path is for Python programmers who want to use machine learning algorithms to create real-world applications. It is ideal for Python professionals who want to work with large and complex datasets, and for Python developers, analysts, and data scientists who want to extend their existing skills with some of the most powerful recent trends in data science. Readers are expected to have experience with Python, Jupyter Notebooks, and command-line execution, along with enough mathematical knowledge to understand the concepts. Basic knowledge of machine learning is also expected.

What You Will Learn

  • Use predictive modeling and apply it to real-world problems

  • Understand how to perform market segmentation using unsupervised learning

  • Apply your newfound skills to solve real problems, through clearly explained code for every technique and test

  • Compete with top data scientists by gaining a practical and theoretical understanding of cutting-edge deep learning algorithms

  • Increase predictive accuracy with deep learning and scalable data-handling techniques

  • Work with modern state-of-the-art large-scale machine learning techniques

  • Learn to use Python code to implement a range of machine learning algorithms and techniques

In Detail

Machine learning is increasingly pervasive in the modern data-driven world. It is used extensively in fields such as search engines, robotics, and self-driving cars. Machine learning is transforming the way we understand and interact with the world around us.

In the first module, Python Machine Learning Cookbook, you will learn how to perform various machine learning tasks, applying a wide variety of algorithms to real-world problems and implementing them in Python.

The second module, Advanced Machine Learning with Python, takes you on a guided tour of the most relevant and powerful machine learning techniques. You’ll acquire a broad set of skills in the area of feature selection and feature engineering.

The third module in this Learning Path, Large Scale Machine Learning with Python, dives into scalable machine learning and the three forms of scalability. It covers the most effective machine learning techniques on a MapReduce framework in Hadoop and Spark, using Python.

This Learning Path will teach you Python machine learning for the real world. The machine learning techniques covered in this Learning Path are at the forefront of commercial practice.

This Learning Path combines some of the best that Packt has to offer in one complete, curated package. It includes content from the following Packt products:

  • Python Machine Learning Cookbook by Prateek Joshi

  • Advanced Machine Learning with Python by John Hearty

  • Large Scale Machine Learning with Python by Bastiaan Sjardin, Alberto Boschetti, Luca Massaron

Style and approach

This course is a smooth learning path that will teach you how to get started with Python machine learning for the real world and develop solutions to real-world problems. Through this comprehensive course, you’ll learn to build the most effective machine learning techniques from scratch, and more!

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of contents

    1. Python: Real World Machine Learning
      1. Table of Contents
      2. Python: Real World Machine Learning
      3. Python: Real World Machine Learning
      4. Credits
      5. Preface
        1. What this learning path covers
        2. What you need for this learning path
        3. Who this learning path is for
        4. Reader feedback
        5. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      6. I. Module 1
        1. 1. The Realm of Supervised Learning
          1. Introduction
          2. Preprocessing data using different techniques
            1. Getting ready
            2. How to do it…
              1. Mean removal
              2. Scaling
              3. Normalization
              4. Binarization
              5. One Hot Encoding
          3. Label encoding
            1. How to do it…
          4. Building a linear regressor
            1. Getting ready
            2. How to do it…
          5. Computing regression accuracy
            1. Getting ready
            2. How to do it…
          6. Achieving model persistence
            1. How to do it…
          7. Building a ridge regressor
            1. Getting ready
            2. How to do it…
          8. Building a polynomial regressor
            1. Getting ready
            2. How to do it…
          9. Estimating housing prices
            1. Getting ready
            2. How to do it…
          10. Computing the relative importance of features
            1. How to do it…
          11. Estimating bicycle demand distribution
            1. Getting ready
            2. How to do it…
            3. There's more…
        2. 2. Constructing a Classifier
          1. Introduction
          2. Building a simple classifier
            1. How to do it…
            2. There's more…
          3. Building a logistic regression classifier
            1. How to do it…
          4. Building a Naive Bayes classifier
            1. How to do it…
          5. Splitting the dataset for training and testing
            1. How to do it…
          6. Evaluating the accuracy using cross-validation
            1. Getting ready…
            2. How to do it…
          7. Visualizing the confusion matrix
            1. How to do it…
          8. Extracting the performance report
            1. How to do it…
          9. Evaluating cars based on their characteristics
            1. Getting ready
            2. How to do it…
          10. Extracting validation curves
            1. How to do it…
          11. Extracting learning curves
            1. How to do it…
          12. Estimating the income bracket
            1. How to do it…
        3. 3. Predictive Modeling
          1. Introduction
          2. Building a linear classifier using Support Vector Machines (SVMs)
            1. Getting ready
            2. How to do it…
          3. Building a nonlinear classifier using SVMs
            1. How to do it…
          4. Tackling class imbalance
            1. How to do it…
          5. Extracting confidence measurements
            1. How to do it…
          6. Finding optimal hyperparameters
            1. How to do it…
          7. Building an event predictor
            1. Getting ready
            2. How to do it…
          8. Estimating traffic
            1. Getting ready
            2. How to do it…
        4. 4. Clustering with Unsupervised Learning
          1. Introduction
          2. Clustering data using the k-means algorithm
            1. How to do it…
          3. Compressing an image using vector quantization
            1. How to do it…
          4. Building a Mean Shift clustering model
            1. How to do it…
          5. Grouping data using agglomerative clustering
            1. How to do it…
          6. Evaluating the performance of clustering algorithms
            1. How to do it…
          7. Automatically estimating the number of clusters using the DBSCAN algorithm
            1. How to do it…
          8. Finding patterns in stock market data
            1. How to do it…
          9. Building a customer segmentation model
            1. How to do it…
        5. 5. Building Recommendation Engines
          1. Introduction
          2. Building function compositions for data processing
            1. How to do it…
          3. Building machine learning pipelines
            1. How to do it…
            2. How it works…
          4. Finding the nearest neighbors
            1. How to do it…
          5. Constructing a k-nearest neighbors classifier
            1. How to do it…
            2. How it works…
          6. Constructing a k-nearest neighbors regressor
            1. How to do it…
            2. How it works…
          7. Computing the Euclidean distance score
            1. How to do it…
          8. Computing the Pearson correlation score
            1. How to do it…
          9. Finding similar users in the dataset
            1. How to do it…
          10. Generating movie recommendations
            1. How to do it…
        6. 6. Analyzing Text Data
          1. Introduction
          2. Preprocessing data using tokenization
            1. How to do it…
          3. Stemming text data
            1. How to do it…
            2. How it works…
          4. Converting text to its base form using lemmatization
            1. How to do it…
          5. Dividing text using chunking
            1. How to do it…
          6. Building a bag-of-words model
            1. How to do it…
            2. How it works…
          7. Building a text classifier
            1. How to do it…
            2. How it works…
          8. Identifying the gender
            1. How to do it…
          9. Analyzing the sentiment of a sentence
            1. How to do it…
            2. How it works…
          10. Identifying patterns in text using topic modeling
            1. How to do it…
            2. How it works…
        7. 7. Speech Recognition
          1. Introduction
          2. Reading and plotting audio data
            1. How to do it…
          3. Transforming audio signals into the frequency domain
            1. How to do it…
          4. Generating audio signals with custom parameters
            1. How to do it…
          5. Synthesizing music
            1. How to do it…
          6. Extracting frequency domain features
            1. How to do it…
          7. Building Hidden Markov Models
            1. How to do it…
          8. Building a speech recognizer
            1. How to do it…
        8. 8. Dissecting Time Series and Sequential Data
          1. Introduction
          2. Transforming data into the time series format
            1. How to do it…
          3. Slicing time series data
            1. How to do it…
          4. Operating on time series data
            1. How to do it…
          5. Extracting statistics from time series data
            1. How to do it…
          6. Building Hidden Markov Models for sequential data
            1. Getting ready
            2. How to do it…
          7. Building Conditional Random Fields for sequential text data
            1. Getting ready
            2. How to do it…
          8. Analyzing stock market data using Hidden Markov Models
            1. How to do it…
        9. 9. Image Content Analysis
          1. Introduction
          2. Operating on images using OpenCV-Python
            1. How to do it…
          3. Detecting edges
            1. How to do it…
          4. Histogram equalization
            1. How to do it…
          5. Detecting corners
            1. How to do it…
          6. Detecting SIFT feature points
            1. How to do it…
          7. Building a Star feature detector
            1. How to do it…
          8. Creating features using visual codebook and vector quantization
            1. How to do it…
          9. Training an image classifier using Extremely Random Forests
            1. How to do it…
          10. Building an object recognizer
            1. How to do it…
        10. 10. Biometric Face Recognition
          1. Introduction
          2. Capturing and processing video from a webcam
            1. How to do it…
          3. Building a face detector using Haar cascades
            1. How to do it…
          4. Building eye and nose detectors
            1. How to do it…
          5. Performing Principal Components Analysis
            1. How to do it…
          6. Performing Kernel Principal Components Analysis
            1. How to do it…
          7. Performing blind source separation
            1. How to do it…
          8. Building a face recognizer using Local Binary Patterns Histogram
            1. How to do it…
        11. 11. Deep Neural Networks
          1. Introduction
          2. Building a perceptron
            1. How to do it…
          3. Building a single layer neural network
            1. How to do it…
          4. Building a deep neural network
            1. How to do it…
          5. Creating a vector quantizer
            1. How to do it…
          6. Building a recurrent neural network for sequential data analysis
            1. How to do it…
          7. Visualizing the characters in an optical character recognition database
            1. How to do it…
          8. Building an optical character recognizer using neural networks
            1. How to do it…
        12. 12. Visualizing Data
          1. Introduction
          2. Plotting 3D scatter plots
            1. How to do it…
          3. Plotting bubble plots
            1. How to do it…
          4. Animating bubble plots
            1. How to do it…
          5. Drawing pie charts
            1. How to do it…
          6. Plotting date-formatted time series data
            1. How to do it…
          7. Plotting histograms
            1. How to do it…
          8. Visualizing heat maps
            1. How to do it…
          9. Animating dynamic signals
            1. How to do it…
      7. II. Module 2
        1. 1. Unsupervised Machine Learning
          1. Principal component analysis
            1. PCA – a primer
            2. Employing PCA
          2. Introducing k-means clustering
            1. Clustering – a primer
            2. Kick-starting clustering analysis
            3. Tuning your clustering configurations
          3. Self-organizing maps
            1. SOM – a primer
            2. Employing SOM
          4. Further reading
          5. Summary
        2. 2. Deep Belief Networks
          1. Neural networks – a primer
            1. The composition of a neural network
            2. Network topologies
          2. Restricted Boltzmann Machine
            1. Introducing the RBM
              1. Topology
              2. Training
            2. Applications of the RBM
            3. Further applications of the RBM
          3. Deep belief networks
            1. Training a DBN
            2. Applying the DBN
            3. Validating the DBN
          4. Further reading
          5. Summary
        3. 3. Stacked Denoising Autoencoders
          1. Autoencoders
            1. Introducing the autoencoder
              1. Topology
              2. Training
            2. Denoising autoencoders
            3. Applying a dA
          2. Stacked Denoising Autoencoders
            1. Applying the SdA
            2. Assessing SdA performance
          3. Further reading
          4. Summary
        4. 4. Convolutional Neural Networks
          1. Introducing the CNN
            1. Understanding the convnet topology
              1. Understanding convolution layers
              2. Understanding pooling layers
              3. Training a convnet
              4. Putting it all together
            2. Applying a CNN
          2. Further Reading
          3. Summary
        5. 5. Semi-Supervised Learning
          1. Introduction
          2. Understanding semi-supervised learning
          3. Semi-supervised algorithms in action
            1. Self-training
              1. Implementing self-training
              2. Finessing your self-training implementation
                1. Improving the selection process
            2. Contrastive Pessimistic Likelihood Estimation
          4. Further reading
          5. Summary
        6. 6. Text Feature Engineering
          1. Introduction
          2. Text feature engineering
            1. Cleaning text data
              1. Text cleaning with BeautifulSoup
              2. Managing punctuation and tokenizing
              3. Tagging and categorising words
                1. Tagging with NLTK
                2. Sequential tagging
                3. Backoff tagging
            2. Creating features from text data
              1. Stemming
              2. Bagging and random forests
            3. Testing our prepared data
          3. Further reading
          4. Summary
        7. 7. Feature Engineering Part II
          1. Introduction
          2. Creating a feature set
            1. Engineering features for ML applications
              1. Using rescaling techniques to improve the learnability of features
              2. Creating effective derived variables
              3. Reinterpreting non-numeric features
            2. Using feature selection techniques
              1. Performing feature selection
                1. Correlation
                2. LASSO
                3. Recursive Feature Elimination
                4. Genetic models
          3. Feature engineering in practice
            1. Acquiring data via RESTful APIs
              1. Testing the performance of our model
              2. Twitter
                1. Translink Twitter
                2. Consumer comments
                3. The Bing Traffic API
              3. Deriving and selecting variables using feature engineering techniques
                1. The weather API
          4. Further reading
          5. Summary
        8. 8. Ensemble Methods
          1. Introducing ensembles
            1. Understanding averaging ensembles
              1. Using bagging algorithms
              2. Using random forests
            2. Applying boosting methods
              1. Using XGBoost
            3. Using stacking ensembles
              1. Applying ensembles in practice
          2. Using models in dynamic applications
            1. Understanding model robustness
              1. Identifying modeling risk factors
            2. Strategies for managing model robustness
          3. Further reading
          4. Summary
        9. 9. Additional Python Machine Learning Tools
          1. Alternative development tools
            1. Introduction to Lasagne
              1. Getting to know Lasagne
            2. Introduction to TensorFlow
              1. Getting to know TensorFlow
              2. Using TensorFlow to iteratively improve our models
            3. Knowing when to use these libraries
          2. Further reading
          3. Summary
        10. A. Chapter Code Requirements
      8. III. Module 3
        1. 1. First Steps to Scalability
          1. Explaining scalability in detail
            1. Making large scale examples
            2. Introducing Python
            3. Scale up with Python
            4. Scale out with Python
          2. Python for large scale machine learning
            1. Choosing between Python 2 and Python 3
            2. Package upgrades
            3. Scientific distributions
            4. Introducing Jupyter/IPython
          3. Python packages
            1. NumPy
            2. SciPy
            3. Pandas
            4. Scikit-learn
              1. The matplotlib package
              2. Gensim
              3. H2O
              4. XGBoost
              5. Theano
              6. TensorFlow
              7. The sknn library
              8. Theanets
              9. Keras
              10. Other useful packages to install on your system
          4. Summary
        2. 2. Scalable Learning in Scikit-learn
          1. Out-of-core learning
            1. Subsampling as a viable option
            2. Optimizing one instance at a time
            3. Building an out-of-core learning system
          2. Streaming data from sources
            1. Datasets to try the real thing yourself
            2. The first example – streaming the bike-sharing dataset
            3. Using pandas I/O tools
            4. Working with databases
            5. Paying attention to the ordering of instances
          3. Stochastic learning
            1. Batch gradient descent
            2. Stochastic gradient descent
            3. The Scikit-learn SGD implementation
            4. Defining SGD learning parameters
          4. Feature management with data streams
            1. Describing the target
            2. The hashing trick
            3. Other basic transformations
            4. Testing and validation in a stream
            5. Trying SGD in action
          5. Summary
        3. 3. Fast SVM Implementations
          1. Datasets to experiment with on your own
            1. The bike-sharing dataset
            2. The covertype dataset
          2. Support Vector Machines
            1. Hinge loss and its variants
            2. Understanding the Scikit-learn SVM implementation
            3. Pursuing nonlinear SVMs by subsampling
            4. Achieving SVM at scale with SGD
          3. Feature selection by regularization
          4. Including non-linearity in SGD
            1. Trying explicit high-dimensional mappings
          5. Hyperparameter tuning
            1. Other alternatives for SVM fast learning
              1. Nonlinear and faster with Vowpal Wabbit
              2. Installing VW
              3. Understanding the VW data format
              4. Python integration
              5. A few examples using reductions for SVM and neural nets
              6. Faster bike-sharing
              7. The covertype dataset crunched by VW
          6. Summary
        4. 4. Neural Networks and Deep Learning
          1. The neural network architecture
            1. What and how neural networks learn
            2. Choosing the right architecture
              1. The input layer
              2. The hidden layer
              3. The output layer
            3. Neural networks in action
            4. Parallelization for sknn
          2. Neural networks and regularization
          3. Neural networks and hyperparameter optimization
          4. Neural networks and decision boundaries
          5. Deep learning at scale with H2O
            1. Large scale deep learning with H2O
            2. Gridsearch on H2O
          6. Deep learning and unsupervised pretraining
          7. Deep learning with theanets
          8. Autoencoders and unsupervised learning
            1. Autoencoders
          9. Summary
        5. 5. Deep Learning with TensorFlow
          1. TensorFlow installation
            1. TensorFlow operations
              1. GPU computing
              2. Linear regression with SGD
              3. A neural network from scratch in TensorFlow
          2. Machine learning on TensorFlow with SkFlow
            1. Deep learning with large files – incremental learning
          3. Keras and TensorFlow installation
          4. Convolutional Neural Networks in TensorFlow through Keras
            1. The convolution layer
            2. The pooling layer
            3. The fully connected layer
          5. CNNs with an incremental approach
          6. GPU Computing
          7. Summary
        6. 6. Classification and Regression Trees at Scale
          1. Bootstrap aggregation
          2. Random forest and extremely randomized forest
          3. Fast parameter optimization with randomized search
            1. Extremely randomized trees and large datasets
          4. CART and boosting
            1. Gradient Boosting Machines
              1. max_depth
              2. learning_rate
              3. Subsample
              4. Faster GBM with warm_start
                1. Speeding up GBM with warm_start
              5. Training and storing GBM models
          5. XGBoost
            1. XGBoost regression
              1. XGBoost and variable importance
            2. XGBoost streaming large datasets
            3. XGBoost model persistence
          6. Out-of-core CART with H2O
            1. Random forest and gridsearch on H2O
            2. Stochastic gradient boosting and gridsearch on H2O
          7. Summary
        7. 7. Unsupervised Learning at Scale
          1. Unsupervised methods
          2. Feature decomposition – PCA
            1. Randomized PCA
            2. Incremental PCA
            3. Sparse PCA
          3. PCA with H2O
          4. Clustering – K-means
            1. Initialization methods
            2. K-means assumptions
            3. Selection of the best K
            4. Scaling K-means – mini-batch
          5. K-means with H2O
          6. LDA
            1. Scaling LDA – memory, CPUs, and machines
          7. Summary
        8. 8. Distributed Environments – Hadoop and Spark
          1. From a standalone machine to a bunch of nodes
            1. Why do we need a distributed framework?
          2. Setting up the VM
            1. VirtualBox
            2. Vagrant
            3. Using the VM
          3. The Hadoop ecosystem
            1. Architecture
            2. HDFS
            3. MapReduce
            4. YARN
          4. Spark
            1. pySpark
          5. Summary
        9. 9. Practical Machine Learning with Spark
          1. Setting up the VM for this chapter
          2. Sharing variables across cluster nodes
            1. Broadcast read-only variables
            2. Accumulators write-only variables
            3. Broadcast and accumulators together – an example
          3. Data preprocessing in Spark
            1. JSON files and Spark DataFrames
            2. Dealing with missing data
            3. Grouping and creating tables in-memory
            4. Writing the preprocessed DataFrame or RDD to disk
            5. Working with Spark DataFrames
          4. Machine learning with Spark
            1. Spark on the KDD99 dataset
            2. Reading the dataset
            3. Feature engineering
            4. Training a learner
            5. Evaluating a learner's performance
            6. The power of the ML pipeline
            7. Manual tuning
            8. Cross-validation
              1. Final cleanup
          5. Summary
        10. A. Introduction to GPUs and Theano
          1. GPU computing
          2. Theano – parallel computing on the GPU
          3. Installing Theano
      9. A. Bibliography
      10. Index

Product information

  • Title: Python: Real World Machine Learning
  • Author(s): Prateek Joshi, John Hearty, Bastiaan Sjardin, Luca Massaron, Alberto Boschetti
  • Release date: November 2016
  • Publisher(s): Packt Publishing
  • ISBN: 9781787123212