Doing Data Science

Book description

Now that people are aware that data can make the difference in an election or a business model, data science as an occupation is gaining ground. But how can you get started working in a wide-ranging, interdisciplinary field that’s so clouded in hype? This insightful book, based on Columbia University’s Introduction to Data Science class, tells you what you need to know.

In many of these chapter-long lectures, data scientists from companies such as Google, Microsoft, and eBay share new algorithms, methods, and models by presenting case studies and the code they use. If you’re familiar with linear algebra, probability, and statistics, and have programming experience, this book is an ideal introduction to data science.

Topics include:

  • Statistical inference, exploratory data analysis, and the data science process
  • Algorithms
  • Spam filters, Naive Bayes, and data wrangling
  • Logistic regression
  • Financial modeling
  • Recommendation engines and causality
  • Data visualization
  • Social networks and data journalism
  • Data engineering, MapReduce, Pregel, and Hadoop

Doing Data Science is a collaboration between course instructor Rachel Schutt, Senior VP of Data Science at News Corp, and data science consultant Cathy O’Neil, a senior data scientist at Johnson Research Labs, who attended and blogged about the course.

Table of contents

  1. Preface
    1. Motivation
    2. Origins of the Class
    3. Origins of the Book
    4. What to Expect from This Book
    5. How This Book Is Organized
    6. How to Read This Book
    7. How Code Is Used in This Book
    8. Who This Book Is For
    9. Prerequisites
    10. Supplemental Reading
    11. About the Contributors
    12. Conventions Used in This Book
    13. Using Code Examples
    14. O’Reilly Online Learning
    15. How to Contact Us
    16. Acknowledgments
  2. 1. Introduction: What Is Data Science?
    1. Big Data and Data Science Hype
    2. Getting Past the Hype
    3. Why Now?
      1. Datafication
    4. The Current Landscape (with a Little History)
      1. Data Science Jobs
    5. A Data Science Profile
    6. Thought Experiment: Meta-Definition
    7. OK, So What Is a Data Scientist, Really?
      1. In Academia
      2. In Industry
  3. 2. Statistical Inference, Exploratory Data Analysis, and the Data Science Process
    1. Statistical Thinking in the Age of Big Data
      1. Statistical Inference
      2. Populations and Samples
      3. Populations and Samples of Big Data
      4. Big Data Can Mean Big Assumptions
      5. Modeling
    2. Exploratory Data Analysis
      1. Philosophy of Exploratory Data Analysis
      2. Exercise: EDA
    3. The Data Science Process
      1. A Data Scientist’s Role in This Process
    4. Thought Experiment: How Would You Simulate Chaos?
    5. Case Study: RealDirect
      1. How Does RealDirect Make Money?
      2. Exercise: RealDirect Data Strategy
  4. 3. Algorithms
    1. Machine Learning Algorithms
    2. Three Basic Algorithms
      1. Linear Regression
      2. k-Nearest Neighbors (k-NN)
      3. k-means
    3. Exercise: Basic Machine Learning Algorithms
      1. Solutions
    4. Summing It All Up
    5. Thought Experiment: Automated Statistician
  5. 4. Spam Filters, Naive Bayes, and Wrangling
    1. Thought Experiment: Learning by Example
      1. Why Won’t Linear Regression Work for Filtering Spam?
      2. How About k-nearest Neighbors?
    2. Naive Bayes
      1. Bayes Law
      2. A Spam Filter for Individual Words
      3. A Spam Filter That Combines Words: Naive Bayes
    3. Fancy It Up: Laplace Smoothing
    4. Comparing Naive Bayes to k-NN
    5. Sample Code in bash
    6. Scraping the Web: APIs and Other Tools
    7. Jake’s Exercise: Naive Bayes for Article Classification
      1. Sample R Code for Dealing with the NYT API
  6. 5. Logistic Regression
    1. Thought Experiments
    2. Classifiers
      1. Runtime
      2. You
      3. Interpretability
      4. Scalability
    3. M6D Logistic Regression Case Study
      1. Click Models
      2. The Underlying Math
      3. Estimating α and β
      4. Newton’s Method
      5. Stochastic Gradient Descent
      6. Implementation
      7. Evaluation
    4. Media 6 Degrees Exercise
      1. Sample R Code
  7. 6. Time Stamps and Financial Modeling
    1. Kyle Teague and GetGlue
    2. Timestamps
      1. Exploratory Data Analysis (EDA)
      2. Metrics and New Variables or Features
      3. What’s Next?
    3. Cathy O’Neil
    4. Thought Experiment
    5. Financial Modeling
      1. In-Sample, Out-of-Sample, and Causality
      2. Preparing Financial Data
      3. Log Returns
      4. Example: The S&P Index
      5. Working out a Volatility Measurement
      6. Exponential Downweighting
      7. The Financial Modeling Feedback Loop
      8. Why Regression?
      9. Adding Priors
      10. A Baby Model
      11. Exercise: GetGlue and Timestamped Event Data
      12. Exercise: Financial Data
  8. 7. Extracting Meaning from Data
    1. William Cukierski
      1. Background: Data Science Competitions
      2. Background: Crowdsourcing
    2. The Kaggle Model
      1. A Single Contestant
      2. Their Customers
    3. Thought Experiment: What Are the Ethical Implications of a Robo-Grader?
    4. Feature Selection
      1. Example: User Retention
      2. Filters
      3. Wrappers
      4. Embedded Methods: Decision Trees
      5. Entropy
      6. The Decision Tree Algorithm
      7. Handling Continuous Variables in Decision Trees
      8. Random Forests
      9. User Retention: Interpretability Versus Predictive Power
    5. David Huffaker: Google’s Hybrid Approach to Social Research
      1. Moving from Descriptive to Predictive
      2. Social at Google
      3. Privacy
      4. Thought Experiment: What Is the Best Way to Decrease Concern and Increase Understanding and Control?
  9. 8. Recommendation Engines: Building a User-Facing Data Product at Scale
    1. A Real-World Recommendation Engine
      1. Nearest Neighbor Algorithm Review
      2. Some Problems with Nearest Neighbors
      3. Beyond Nearest Neighbor: Machine Learning Classification
      4. The Dimensionality Problem
      5. Singular Value Decomposition (SVD)
      6. Important Properties of SVD
      7. Principal Component Analysis (PCA)
      8. Alternating Least Squares
      9. Fix V and Update U
      10. Last Thoughts on These Algorithms
    2. Thought Experiment: Filter Bubbles
    3. Exercise: Build Your Own Recommendation System
      1. Sample Code in Python
  10. 9. Data Visualization and Fraud Detection
    1. Data Visualization History
      1. Gabriel Tarde
      2. Mark’s Thought Experiment
    2. What Is Data Science, Redux?
      1. Processing
      2. Franco Moretti
    3. A Sample of Data Visualization Projects
    4. Mark’s Data Visualization Projects
      1. New York Times Lobby: Moveable Type
      2. Project Cascade: Lives on a Screen
      3. Cronkite Plaza
      4. eBay Transactions and Books
      5. Public Theater Shakespeare Machine
      6. Goals of These Exhibits
    5. Data Science and Risk
      1. About Square
      2. The Risk Challenge
      3. The Trouble with Performance Estimation
      4. Model Building Tips
    6. Data Visualization at Square
    7. Ian’s Thought Experiment
    8. Data Visualization for the Rest of Us
      1. Data Visualization Exercise
  11. 10. Social Networks and Data Journalism
    1. Social Network Analysis at Morningside Analytics
      1. Case-Attribute Data versus Social Network Data
    2. Social Network Analysis
    3. Terminology from Social Networks
      1. Centrality Measures
      2. The Industry of Centrality Measures
    4. Thought Experiment
    5. Morningside Analytics
      1. How Visualizations Help Us Find Schools of Fish
    6. More Background on Social Network Analysis from a Statistical Point of View
      1. Representations of Networks and Eigenvalue Centrality
      2. A First Example of Random Graphs: The Erdos-Renyi Model
      3. A Second Example of Random Graphs: The Exponential Random Graph Model
    7. Data Journalism
      1. A Bit of History on Data Journalism
      2. Writing Technical Journalism: Advice from an Expert
  12. 11. Causality
    1. Correlation Doesn’t Imply Causation
      1. Asking Causal Questions
      2. Confounders: A Dating Example
    2. OK Cupid’s Attempt
    3. The Gold Standard: Randomized Clinical Trials
    4. A/B Tests
    5. Second Best: Observational Studies
      1. Simpson’s Paradox
      2. The Rubin Causal Model
      3. Visualizing Causality
      4. Definition: The Causal Effect
    6. Three Pieces of Advice
  13. 12. Epidemiology
    1. Madigan’s Background
    2. Thought Experiment
    3. Modern Academic Statistics
    4. Medical Literature and Observational Studies
    5. Stratification Does Not Solve the Confounder Problem
      1. What Do People Do About Confounding Things in Practice?
    6. Is There a Better Way?
    7. Research Experiment (Observational Medical Outcomes Partnership)
    8. Closing Thought Experiment
  14. 13. Lessons Learned from Data Competitions: Data Leakage and Model Evaluation
    1. Claudia’s Data Scientist Profile
      1. The Life of a Chief Data Scientist
      2. On Being a Female Data Scientist
    2. Data Mining Competitions
    3. How to Be a Good Modeler
    4. Data Leakage
      1. Market Predictions
      2. Amazon Case Study: Big Spenders
      3. A Jewelry Sampling Problem
      4. IBM Customer Targeting
      5. Breast Cancer Detection
      6. Pneumonia Prediction
    5. How to Avoid Leakage
    6. Evaluating Models
      1. Accuracy: Meh
      2. Probabilities Matter, Not 0s and 1s
    7. Choosing an Algorithm
    8. A Final Example
    9. Parting Thoughts
  15. 14. Data Engineering: MapReduce, Pregel, and Hadoop
    1. About David Crawshaw
    2. Thought Experiment
    3. MapReduce
    4. Word Frequency Problem
      1. Enter MapReduce
    5. Other Examples of MapReduce
      1. What Can’t MapReduce Do?
    6. Pregel
    7. About Josh Wills
    8. Thought Experiment
    9. On Being a Data Scientist
      1. Data Abundance Versus Data Scarcity
      2. Designing Models
    10. Economic Interlude: Hadoop
      1. A Brief Introduction to Hadoop
      2. Cloudera
    11. Back to Josh: Workflow
    12. So How to Get Started with Hadoop?
  16. 15. The Students Speak
    1. Process Thinking
    2. Naive No Longer
    3. Helping Hands
    4. Your Mileage May Vary
    5. Bridging Tunnels
    6. Some of Our Work
  17. 16. Next-Generation Data Scientists, Hubris, and Ethics
    1. What Just Happened?
    2. What Is Data Science (Again)?
    3. What Are Next-Gen Data Scientists?
      1. Being Problem Solvers
      2. Cultivating Soft Skills
      3. Being Question Askers
    4. Being an Ethical Data Scientist
    5. Career Advice
  18. Index

Product information

  • Title: Doing Data Science
  • Author(s): Cathy O'Neil, Rachel Schutt
  • Release date: October 2013
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781449358655