Data Science on the Google Cloud Platform

Book description

Learn how easy it is to apply sophisticated statistical and machine learning methods to real-world problems when you build on top of the Google Cloud Platform (GCP). This hands-on guide shows developers entering the data science field how to implement an end-to-end data pipeline with the statistical and machine learning tools available on GCP. Over the course of the book, you’ll work through a sample business decision by employing a variety of data science approaches.

Follow along by implementing these statistical and machine learning solutions in your own project on GCP, and discover how this platform provides a transformative and more collaborative way of doing data science.

You’ll learn how to:

  • Automate and schedule data ingest, using an App Engine application
  • Create and populate a dashboard in Google Data Studio
  • Build a real-time analysis pipeline to carry out streaming analytics
  • Conduct interactive data exploration with Google BigQuery
  • Create a Bayesian model on a Cloud Dataproc cluster
  • Build a logistic regression machine-learning model with Spark
  • Compute time-aggregate features with a Cloud Dataflow pipeline
  • Create a high-performing prediction model with TensorFlow
  • Use your deployed model as a microservice you can access from both batch and real-time pipelines

Table of contents

  1. Preface
    1. Who This Book Is For
    2. Conventions Used in This Book
    3. Using Code Examples
    4. O’Reilly Online Learning
    5. How to Contact Us
    6. Acknowledgments
  2. 1. Making Better Decisions Based on Data
    1. Many Similar Decisions
    2. The Role of Data Engineers
    3. The Cloud Makes Data Engineers Possible
    4. The Cloud Turbocharges Data Science
    5. Case Studies Get at the Stubborn Facts
    6. A Probabilistic Decision
    7. Data and Tools
      1. Getting Started with the Code
    8. Summary
  3. 2. Ingesting Data into the Cloud
    1. Airline On-Time Performance Data
      1. Knowability
      2. Training–Serving Skew
      3. Download Procedure
      4. Dataset Fields
    2. Why Not Store the Data in Situ?
      1. Scaling Up
      2. Scaling Out
      3. Data in Situ with Colossus and Jupiter
    3. Ingesting Data
      1. Reverse Engineering a Web Form
      2. Dataset Download
      3. Exploration and Cleanup
      4. Uploading Data to Google Cloud Storage
    4. Scheduling Monthly Downloads
      1. Ingesting in Python
      2. Cloud Functions
      3. Securing the URL
      4. Scheduling the Cloud Function
      5. Improving the Cloud Function Design
    5. Summary
    6. Code Break
  4. 3. Creating Compelling Dashboards
    1. Explain Your Model with Dashboards
    2. Why Build a Dashboard First?
    3. Accuracy, Honesty, and Good Design
    4. Loading Data into Google Cloud SQL
    5. Create a Google Cloud SQL Instance
    6. Interacting with Google Cloud Platform
    7. Controlling Access to MySQL
    8. Create Tables
    9. Populating Tables
    10. Building Our First Model
      1. Contingency Table
      2. Threshold Optimization
      3. Machine Learning
    11. Building a Dashboard
    12. Getting Started with Data Studio
      1. Creating Charts
      2. Adding End-User Controls
      3. Showing Proportions with a Pie Chart
      4. Explaining a Contingency Table
    13. Summary
  5. 4. Streaming Data: Publication and Ingest
    1. Designing the Event Feed
    2. Time Correction
    3. Apache Beam/Cloud Dataflow
      1. Parsing Airports Data
      2. Adding Time Zone Information
      3. Converting Times to UTC
      4. Correcting Dates
      5. Creating Events
      6. Running the Pipeline in the Cloud
    4. Publishing an Event Stream to Cloud Pub/Sub
      1. Get Records to Publish
      2. Paging Through Records
      3. Building a Batch of Events
      4. Publishing a Batch of Events
    5. Real-Time Stream Processing
      1. Streaming in Java Dataflow
      2. Executing the Stream Processing
      3. Analyzing Streaming Data in BigQuery
      4. Real-Time Dashboard
    6. Summary
  6. 5. Interactive Data Exploration
    1. Exploratory Data Analysis
    2. Loading Flights Data into BigQuery
      1. Advantages of a Serverless Columnar Database
      2. Staging on Cloud Storage
      3. Access Control
      4. Federated Queries
      5. Ingesting CSV Files
    3. Exploratory Data Analysis in Cloud AI Platform Notebooks
      1. Jupyter Notebooks
      2. Cloud AI Platform Notebooks
      3. Installing Packages in Cloud AI Platform Notebooks
      4. Jupyter Magic for Google Cloud Platform
    4. Quality Control
      1. Oddball Values
      2. Outlier Removal: Big Data Is Different
      3. Filtering Data on Occurrence Frequency
    5. Arrival Delay Conditioned on Departure Delay
      1. Applying Probabilistic Decision Threshold
      2. Empirical Probability Distribution Function
      3. The Answer Is...
    6. Evaluating the Model
      1. Random Shuffling
      2. Splitting by Date
      3. Training and Testing
    7. Summary
  7. 6. Bayes Classifier on Cloud Dataproc
    1. MapReduce and the Hadoop Ecosystem
      1. How MapReduce Works
      2. Apache Hadoop
      3. Google Cloud Dataproc
      4. Need for Higher-Level Tools
      5. Jobs, Not Clusters
      6. Initialization Actions
    2. Quantization Using Spark SQL
      1. JupyterLab on Cloud Dataproc
      2. Independence Check Using BigQuery
      3. Spark SQL in JupyterLab
      4. Histogram Equalization
      5. Dynamically Resizing Clusters
    3. Bayes Classification Using Pig
      1. Running a Pig Job on Cloud Dataproc
      2. Automating Cloud Dataproc with Workflow Templates
      3. Limiting to Training Days
      4. The Decision Criteria
      5. Evaluating the Bayesian Model
    4. Summary
  8. 7. Machine Learning: Logistic Regression in Spark and BigQuery
    1. Logistic Regression
      1. Spark ML Library
      2. Getting Started with Spark Machine Learning
      3. Spark Logistic Regression
      4. Creating a Training Dataset
      5. Dealing with Corner Cases
      6. Creating Training Examples
      7. Training
      8. Predicting by Using a Model
      9. Evaluating a Model
    2. Feature Engineering
      1. Experimental Framework
      2. Creating the Held-Out Dataset
      3. Feature Selection
      4. Scaling and Clipping Features
      5. Feature Transforms
      6. Categorical Variables
      7. Scalable Machine Learning Models in BigQuery
      8. Repeatable, Real Time
    3. Summary
  9. 8. Time-Windowed Aggregate Features
    1. The Need for Time Averages
    2. Dataflow in Java
      1. Setting Up Development Environment
      2. Filtering with Beam
      3. Pipeline Options and Text I/O
      4. Run on Cloud
      5. Parsing into Objects
    3. Computing Time Averages
      1. Grouping and Combining
      2. Parallel Do with Side Input
      3. Debugging
      4. BigQueryIO
      5. Mutating the Flight Object
      6. Sliding Window Computation in Batch Mode
      7. Running in the Cloud
    4. Monitoring, Troubleshooting, and Performance Tuning
      1. Troubleshooting Pipeline
      2. Side Input Limitations
      3. Redesigning the Pipeline
      4. Removing Duplicates
    5. Summary
  10. 9. Machine Learning Classifier Using TensorFlow
    1. Toward More Complex Models
    2. Reading Data into TensorFlow
    3. Training and Evaluation in Keras
      1. Model Function
        1. Input and Features
      2. Training and Evaluating Input Functions
      3. Saving and Exporting
      4. Performing a Training Run
      5. Training in the Cloud
      6. Wide-and-Deep Model
      7. Hyperparameter Tuning
    4. Deploying the Model
      1. Predicting with the Model
      2. Explaining the Model
    5. Summary
  11. 10. Real-Time Machine Learning
    1. Invoking Prediction Service
      1. Java Classes for Request and Response
      2. Post Request and Parse Response
      3. Client of Prediction Service
    2. Adding Predictions to Flight Information
      1. Batch Input and Output
      2. Data Processing Pipeline
      3. Identifying Inefficiency
      4. Batching Requests
    3. Streaming Pipeline
      1. Flattening PCollections
      2. Executing Streaming Pipeline
      3. Late and Out-of-Order Records
      4. Watermarks and Triggers
    4. Transactions, Throughput, and Latency
      1. Possible Streaming Sinks
      2. Cloud Bigtable
      3. Designing Tables
      4. Designing the Row Key
      5. Streaming into Cloud Bigtable
      6. Querying from Cloud Bigtable
    5. Evaluating Model Performance
      1. The Need for Continuous Training
      2. Evaluation Pipeline
      3. Evaluating Performance
      4. Marginal Distributions
      5. Checking Model Behavior
      6. Identifying Behavioral Change
    6. Summary
    7. Book Summary
  12. A. Considerations for Sensitive Data within Machine Learning Datasets
    1. Handling Sensitive Information
      1. Identifying Sensitive Data
    2. Protecting Sensitive Data
      1. Removing Sensitive Data
      2. Masking Sensitive Data
      3. Coarsening Sensitive Data
    3. Establishing a Governance Policy
  13. Index

Product information

  • Title: Data Science on the Google Cloud Platform
  • Author(s): Valliappa Lakshmanan
  • Release date: December 2017
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491974513