Think Stats, 2nd Edition

Book description

If you know how to program, you have the skills to turn data into knowledge, using tools of probability and statistics. This concise introduction shows you how to perform statistical analysis computationally, rather than mathematically, with programs written in Python.

By working with a single case study throughout this thoroughly revised book, you’ll learn the entire process of exploratory data analysis—from collecting data and generating statistics to identifying patterns and testing hypotheses. You’ll explore distributions, rules of probability, visualization, and many other tools and concepts.

New chapters on regression, time series analysis, survival analysis, and analytic methods will enrich your discoveries.

  • Develop an understanding of probability and statistics by writing and testing code
  • Run experiments to test statistical behavior, such as generating samples from several distributions
  • Use simulations to understand concepts that are hard to grasp mathematically
  • Import data from most sources with Python, rather than rely on data that’s cleaned and formatted for statistics tools
  • Use statistical inference to answer questions about real-world data
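
As a rough illustration of the kind of experiment the list above describes, here is a minimal sketch (not taken from the book) that draws repeated samples from several distributions and compares how their sample means behave. It uses only Python's standard library. The helper name sample_means, the choice of distributions, and the sample sizes are assumptions made for this example.

```python
import random
import statistics


def sample_means(draw, n, trials, seed=0):
    """Return the means of `trials` samples, each of size `n`.

    `draw` is a function that takes a random.Random instance and
    returns one random value. (Hypothetical helper for this sketch.)
    """
    rng = random.Random(seed)
    return [
        statistics.fmean(draw(rng) for _ in range(n))
        for _ in range(trials)
    ]


# Three distributions, each parameterized so its true mean is 1.0.
distributions = {
    "exponential": lambda rng: rng.expovariate(1.0),
    "uniform": lambda rng: rng.uniform(0.0, 2.0),
    "normal": lambda rng: rng.gauss(1.0, 1.0),
}

results = {}
for name, draw in distributions.items():
    means = sample_means(draw, n=100, trials=1000)
    # Center and spread of the simulated sampling distribution of the mean.
    results[name] = (statistics.fmean(means), statistics.stdev(means))
    center, spread = results[name]
    print(f"{name:12s} center={center:.3f} spread={spread:.3f}")
```

In each case the sample means should cluster near 1.0, with a spread close to the distribution's standard deviation divided by the square root of the sample size. Changing n or trials is an easy way to see how sample size affects that spread.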

Table of contents

  1. Preface
    1. How I Wrote This Book
    2. Using the Code
    3. Contributor List
    4. Safari® Books Online
    5. How to Contact Us
  2. 1. Exploratory Data Analysis
    1. A Statistical Approach
    2. The National Survey of Family Growth
    3. Importing the Data
    4. DataFrames
    5. Variables
    6. Transformation
    7. Validation
    8. Interpretation
    9. Exercises
    10. Glossary
  3. 2. Distributions
    1. Representing Histograms
    2. Plotting Histograms
    3. NSFG Variables
    4. Outliers
    5. First Babies
    6. Summarizing Distributions
    7. Variance
    8. Effect Size
    9. Reporting Results
    10. Exercises
    11. Glossary
  4. 3. Probability Mass Functions
    1. Pmfs
    2. Plotting PMFs
    3. Other Visualizations
    4. The Class Size Paradox
    5. DataFrame Indexing
    6. Exercises
    7. Glossary
  5. 4. Cumulative Distribution Functions
    1. The Limits of PMFs
    2. Percentiles
    3. CDFs
    4. Representing CDFs
    5. Comparing CDFs
    6. Percentile-Based Statistics
    7. Random Numbers
    8. Comparing Percentile Ranks
    9. Exercises
    10. Glossary
  6. 5. Modeling Distributions
    1. The Exponential Distribution
    2. The Normal Distribution
    3. Normal Probability Plot
    4. The lognormal Distribution
    5. The Pareto Distribution
    6. Generating Random Numbers
    7. Why Model?
    8. Exercises
    9. Glossary
  7. 6. Probability Density Functions
    1. PDFs
    2. Kernel Density Estimation
    3. The Distribution Framework
    4. Hist Implementation
    5. Pmf Implementation
    6. Cdf Implementation
    7. Moments
    8. Skewness
    9. Exercises
    10. Glossary
  8. 7. Relationships Between Variables
    1. Scatter Plots
    2. Characterizing Relationships
    3. Correlation
    4. Covariance
    5. Pearson’s Correlation
    6. Nonlinear Relationships
    7. Spearman’s Rank Correlation
    8. Correlation and Causation
    9. Exercises
    10. Glossary
  9. 8. Estimation
    1. The Estimation Game
    2. Guess the Variance
    3. Sampling Distributions
    4. Sampling Bias
    5. Exponential Distributions
    6. Exercises
    7. Glossary
  10. 9. Hypothesis Testing
    1. Classical Hypothesis Testing
    2. HypothesisTest
    3. Testing a Difference in Means
    4. Other Test Statistics
    5. Testing a Correlation
    6. Testing Proportions
    7. Chi-Squared Tests
    8. First Babies Again
    9. Errors
    10. Power
    11. Replication
    12. Exercises
    13. Glossary
  11. 10. Linear Least Squares
    1. Least Squares Fit
    2. Implementation
    3. Residuals
    4. Estimation
    5. Goodness of Fit
    6. Testing a Linear Model
    7. Weighted Resampling
    8. Exercises
    9. Glossary
  12. 11. Regression
    1. StatsModels
    2. Multiple Regression
    3. Nonlinear Relationships
    4. Data Mining
    5. Prediction
    6. Logistic Regression
    7. Estimating Parameters
    8. Implementation
    9. Accuracy
    10. Exercises
    11. Glossary
  13. 12. Time Series Analysis
    1. Importing and Cleaning
    2. Plotting
    3. Linear Regression
    4. Moving Averages
    5. Missing Values
    6. Serial Correlation
    7. Autocorrelation
    8. Prediction
    9. Further Reading
    10. Exercises
    11. Glossary
  14. 13. Survival Analysis
    1. Survival Curves
    2. Hazard Function
    3. Estimating Survival Curves
    4. Kaplan-Meier Estimation
    5. The Marriage Curve
    6. Estimating the Survival Function
    7. Confidence Intervals
    8. Cohort Effects
    9. Extrapolation
    10. Expected Remaining Lifetime
    11. Exercises
    12. Glossary
  15. 14. Analytic Methods
    1. Normal Distributions
    2. Sampling Distributions
    3. Representing Normal Distributions
    4. Central Limit Theorem
    5. Testing the CLT
    6. Applying the CLT
    7. Correlation Test
    8. Chi-Squared Test
    9. Discussion
    10. Exercises
  16. Index
  17. Colophon
  18. Copyright

Product information

  • Title: Think Stats, 2nd Edition
  • Author(s): Allen B. Downey
  • Release date: October 2014
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491907368