Table of Contents

  1. Chapter 1 Analyzing Big Data

    1. The Challenges of Data Science

    2. Introducing Apache Spark

    3. About This Book

  2. Chapter 2 Introduction to Data Analysis with Scala and Spark

    1. Scala for Data Scientists

    2. The Spark Programming Model

    3. Record Linkage

    4. Getting Started: The Spark Shell and SparkContext

    5. Bringing Data from the Cluster to the Client

    6. Shipping Code from the Client to the Cluster

    7. Structuring Data with Tuples and Case Classes

    8. Aggregations

    9. Creating Histograms

    10. Summary Statistics for Continuous Variables

    11. Creating Reusable Code for Computing Summary Statistics

    12. Simple Variable Selection and Scoring

    13. Where to Go from Here

  3. Chapter 3 Recommending Music and the Audioscrobbler Data Set

    1. Data Set

    2. The Alternating Least Squares Recommender Algorithm

    3. Preparing the Data

    4. Building a First Model

    5. Spot Checking Recommendations

    6. Evaluating Recommendation Quality

    7. Computing AUC

    8. Hyperparameter Selection

    9. Making Recommendations

    10. Where to Go from Here

  4. Chapter 4 Predicting Forest Cover with Decision Trees

    1. Fast Forward to Regression

    2. Vectors and Features

    3. Training Examples

    4. Decision Trees and Forests

    5. Covtype Data Set

    6. Preparing the Data

    7. A First Decision Tree

    8. Decision Tree Hyperparameters

    9. Tuning Decision Trees

    10. Categorical Features Revisited

    11. Random Decision Forests

    12. Making Predictions

    13. Where to Go from Here

  5. Chapter 5 Anomaly Detection in Network Traffic with K-means Clustering

    1. Anomaly Detection

    2. K-means Clustering

    3. Network Intrusion

    4. KDD Cup 1999 Data Set

    5. A First Take on Clustering

    6. Choosing k

    7. Visualization in R

    8. Feature Normalization

    9. Categorical Variables

    10. Using Labels with Entropy

    11. Clustering in Action

    12. Where to Go from Here

  6. Chapter 6 Understanding Wikipedia with Latent Semantic Analysis

    1. The Term-Document Matrix

    2. Getting the Data

    3. Parsing and Preparing the Data

    4. Lemmatization

    5. Computing the TF-IDFs

    6. Singular Value Decomposition

    7. Finding Important Concepts

    8. Querying and Scoring with the Low-Dimensional Representation

    9. Term-Term Relevance

    10. Document-Document Relevance

    11. Term-Document Relevance

    12. Multiple-Term Queries

    13. Where to Go from Here

  7. Chapter 7 Analyzing Co-occurrence Networks with GraphX

    1. The MEDLINE Citation Index: A Network Analysis

    2. Getting the Data

    3. Parsing XML Documents with Scala’s XML Library

    4. Analyzing the MeSH Major Topics and Their Co-occurrences

    5. Constructing a Co-occurrence Network with GraphX

    6. Understanding the Structure of Networks

    7. Filtering Out Noisy Edges

    8. Small-World Networks

    9. Where to Go from Here

  8. Chapter 8 Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data

    1. Getting the Data

    2. Working with Temporal and Geospatial Data in Spark

    3. Temporal Data with JodaTime and NScalaTime

    4. Geospatial Data with the Esri Geometry API and Spray

    5. Preparing the New York City Taxi Trip Data

    6. Sessionization in Spark

    7. Where to Go from Here

  9. Chapter 9 Estimating Financial Risk through Monte Carlo Simulation

    1. Terminology

    2. Methods for Calculating VaR

    3. Our Model

    4. Getting the Data

    5. Preprocessing

    6. Determining the Factor Weights

    7. Sampling

    8. Running the Trials

    9. Visualizing the Distribution of Returns

    10. Evaluating Our Results

    11. Where to Go from Here

  10. Chapter 10 Analyzing Genomics Data and the BDG Project

    1. Decoupling Storage from Modeling

    2. Ingesting Genomics Data with the ADAM CLI

    3. Predicting Transcription Factor Binding Sites from ENCODE Data

    4. Querying Genotypes from the 1000 Genomes Project

    5. Where to Go from Here

  11. Chapter 11 Analyzing Neuroimaging Data with PySpark and Thunder

    1. Overview of PySpark

    2. Overview and Installation of the Thunder Library

    3. Loading Data with Thunder

    4. Categorizing Neuron Types with Thunder

    5. Where to Go from Here

  12. Appendix Deeper into Spark

    1. Serialization

    2. Accumulators

    3. Spark and the Data Scientist’s Workflow

    4. File Formats

    5. Spark Subprojects

  13. Appendix Upcoming MLlib Pipelines API

    1. Beyond Mere Modeling

    2. The Pipelines API

    3. Text Classification Example Walkthrough