Think Stats, 2nd Edition
Exploratory Data Analysis
Publisher: O'Reilly Media
Release Date: October 2014
Pages: 225
If you know how to program, you have the skills to turn data into knowledge, using tools of probability and statistics. This concise introduction shows you how to perform statistical analysis computationally, rather than mathematically, with programs written in Python.
By working with a single case study throughout this thoroughly revised book, you’ll learn the entire process of exploratory data analysis—from collecting data and generating statistics to identifying patterns and testing hypotheses. You’ll explore distributions, rules of probability, visualization, and many other tools and concepts.
New chapters on regression, time series analysis, survival analysis, and analytic methods will enrich your discoveries.
 Develop an understanding of probability and statistics by writing and testing code
 Run experiments to test statistical behavior, such as generating samples from several distributions
 Use simulations to understand concepts that are hard to grasp mathematically
 Import data from most sources with Python, rather than rely on data that’s cleaned and formatted for statistics tools
 Use statistical inference to answer questions about real-world data
Table of Contents

Chapter 1 Exploratory Data Analysis

A Statistical Approach

The National Survey of Family Growth

Importing the Data

DataFrames

Variables

Transformation

Validation

Interpretation

Exercises

Glossary


Chapter 2 Distributions

Representing Histograms

Plotting Histograms

NSFG Variables

Outliers

First Babies

Summarizing Distributions

Variance

Effect Size

Reporting Results

Exercises

Glossary


Chapter 3 Probability Mass Functions

Pmfs

Plotting PMFs

Other Visualizations

The Class Size Paradox

DataFrame Indexing

Exercises

Glossary


Chapter 4 Cumulative Distribution Functions

The Limits of PMFs

Percentiles

CDFs

Representing CDFs

Comparing CDFs

Percentile-Based Statistics

Random Numbers

Comparing Percentile Ranks

Exercises

Glossary


Chapter 5 Modeling Distributions

The Exponential Distribution

The Normal Distribution

Normal Probability Plot

The Lognormal Distribution

The Pareto Distribution

Generating Random Numbers

Why Model?

Exercises

Glossary


Chapter 6 Probability Density Functions

PDFs

Kernel Density Estimation

The Distribution Framework

Hist Implementation

Pmf Implementation

Cdf Implementation

Moments

Skewness

Exercises

Glossary


Chapter 7 Relationships Between Variables

Scatter Plots

Characterizing Relationships

Correlation

Covariance

Pearson’s Correlation

Nonlinear Relationships

Spearman’s Rank Correlation

Correlation and Causation

Exercises

Glossary


Chapter 8 Estimation

The Estimation Game

Guess the Variance

Sampling Distributions

Sampling Bias

Exponential Distributions

Exercises

Glossary


Chapter 9 Hypothesis Testing

Classical Hypothesis Testing

HypothesisTest

Testing a Difference in Means

Other Test Statistics

Testing a Correlation

Testing Proportions

Chi-Squared Tests

First Babies Again

Errors

Power

Replication

Exercises

Glossary


Chapter 10 Linear Least Squares

Least Squares Fit

Implementation

Residuals

Estimation

Goodness of Fit

Testing a Linear Model

Weighted Resampling

Exercises

Glossary


Chapter 11 Regression

StatsModels

Multiple Regression

Nonlinear Relationships

Data Mining

Prediction

Logistic Regression

Estimating Parameters

Implementation

Accuracy

Exercises

Glossary


Chapter 12 Time Series Analysis

Importing and Cleaning

Plotting

Linear Regression

Moving Averages

Missing Values

Serial Correlation

Autocorrelation

Prediction

Further Reading

Exercises

Glossary


Chapter 13 Survival Analysis

Survival Curves

Hazard Function

Estimating Survival Curves

Kaplan-Meier Estimation

The Marriage Curve

Estimating the Survival Function

Confidence Intervals

Cohort Effects

Extrapolation

Expected Remaining Lifetime

Exercises

Glossary


Chapter 14 Analytic Methods

Normal Distributions

Sampling Distributions

Representing Normal Distributions

Central Limit Theorem

Testing the CLT

Applying the CLT

Correlation Test

Chi-Squared Test

Discussion

Exercises
