Head First Data Analysis
A learner's guide to big numbers, statistics, and good decisions
Publisher: O'Reilly Media
Release Date: July 2009
Pages: 484
Whether you're a product developer researching the market viability of a new product or service, a marketing manager gauging or predicting the effectiveness of a campaign, a salesperson who needs data to support product presentations, or a lone entrepreneur responsible for all of these data-intensive functions and more, the unique approach in Head First Data Analysis is by far the most efficient way to learn what you need to know to convert raw data into a vital business tool.
You'll learn how to:
 Determine which data sources to use for collecting information
 Assess data quality and distinguish signal from noise
 Build basic data models to illuminate patterns, and assimilate new information into the models
 Cope with ambiguous information
 Design experiments to test hypotheses and draw conclusions
 Use segmentation to organize your data within discrete market groups
 Visualize data distributions to reveal new relationships and persuade others
 Predict the future with sampling and probability models
 Clean your data to make it useful
 Communicate the results of your analysis to your audience
Using the latest research in cognitive science and learning theory to craft a multisensory learning experience, Head First Data Analysis uses a visually rich format designed for the way your brain works, not a text-heavy approach that puts you to sleep.
Table of Contents

Chapter 1 Introduction to Data Analysis: Break it down

Acme Cosmetics needs your help

The CEO wants data analysis to help increase sales

Data analysis is careful thinking about evidence

Define the problem

Your client will help you define your problem

Acme’s CEO has some feedback for you

Break the problem and data into smaller pieces

Now take another look at what you know

Evaluate the pieces

Analysis begins when you insert yourself

Make a recommendation

Your report is ready

The CEO likes your work

An article just came across the wire

You let the CEO’s beliefs take you down the wrong path

Your assumptions and beliefs about the world are your mental model

Your statistical model depends on your mental model

Mental models should always include what you don’t know

The CEO tells you what he doesn’t know

Acme just sent you a huge list of raw data

Time to drill further into the data

General American Wholesalers confirms your impression

Here’s what you did

Your analysis led your client to a brilliant decision


Chapter 2 Experiments: Test your theories

It’s a coffee recession!

The Starbuzz board meeting is in three months

The Starbuzz Survey

Always use the method of comparison

Comparisons are key for observational data

Could value perception be causing the revenue decline?

A typical customer’s thinking

Observational studies are full of confounders

How location might be confounding your results

Manage confounders by breaking the data into chunks

It’s worse than we thought!

You need an experiment to say which strategy will work best

The Starbuzz CEO is in a big hurry

Starbuzz drops its prices

One month later...

Control groups give you a baseline

Not getting fired 101

Let’s experiment for real!

One month later...

Confounders also plague experiments

Avoid confounders by selecting groups carefully

Randomization selects similar groups

Your experiment is ready to go

The results are in

Starbuzz has an empirically tested sales strategy


Chapter 3 Optimization: Take it to the max

You’re now in the bath toy game

Constraints limit the variables you control

Decision variables are things you can control

You have an optimization problem

Find your objective with the objective function

Your objective function

Show product mixes with your other constraints

Plot multiple constraints on the same chart

Your good options are all in the feasible region

Your new constraint changed the feasible region

Your spreadsheet does optimization

Solver crunched your optimization problem in a snap

Profits fell through the floor

Your model only describes what you put into it

Calibrate your assumptions to your analytical objectives

Watch out for negatively linked variables

Your new plan is working like a charm

Your assumptions are based on an ever-changing reality


Chapter 4 Data Visualization: Pictures make you smarter

New Army needs to optimize their website

The results are in, but the information designer is out

The last information designer submitted these three infographics

What data is behind the visualizations?

Show the data!

Here’s some unsolicited advice from the last designer

Too much data is never your problem

Making the data pretty isn’t your problem either

Data visualization is all about making the right comparisons

Your visualization is already more useful than the rejected ones

Use scatterplots to explore causes

The best visualizations are highly multivariate

Show more variables by looking at charts together

The visualization is great, but the web guru’s not satisfied yet

Good visual designs help you think about causes

The experiment designers weigh in

The experiment designers have some hypotheses of their own

The client is pleased with your work

Orders are coming in from everywhere!


Chapter 5 Hypothesis Testing: Say it ain’t so

Gimme some skin...

When do we start making new phone skins?

PodPhone doesn’t want you to predict their next move

Here’s everything we know

ElectroSkinny’s analysis does fit the data

ElectroSkinny obtained this confidential strategy memo

Variables can be negatively or positively linked

Causes in the real world are networked, not linear

Hypothesize PodPhone’s options

You have what you need to run a hypothesis test

Falsification is the heart of hypothesis testing

Diagnosticity helps you find the hypothesis with the least disconfirmation

You can’t rule out all the hypotheses, but you can say which is strongest

You just got a picture message...

It’s a launch!


Chapter 6 Bayesian Statistics: Get past first base

The doctor has disturbing news

Let’s take the accuracy analysis one claim at a time

How common is lizard flu really?

You’ve been counting false positives

All these terms describe conditional probabilities

You need to count

1 percent of people have lizard flu

Your chances of having lizard flu are still pretty low

Do complex probabilistic thinking with simple whole numbers

Bayes’ rule manages your base rates when you get new data

You can use Bayes’ rule over and over

Your second test result is negative

The new test has different accuracy statistics

New information can change your base rate

What a relief!


Chapter 7 Subjective Probabilities: Numerical belief

Backwater Investments needs your help

Their analysts are at each other’s throats

Subjective probabilities describe expert beliefs

Subjective probabilities might show no real disagreement after all

The analysts responded with their subjective probabilities

The CEO doesn’t see what you’re up to

The CEO loves your work

The standard deviation measures how far points are from the average

You were totally blindsided by this news

Bayes’ rule is great for revising subjective probabilities

The CEO knows exactly what to do with this new information

Russian stock owners rejoice!


Chapter 8 Heuristics: Analyze like a human

LitterGitters submitted their report to the city council

The LitterGitters have really cleaned up this town

The LitterGitters have been measuring their campaign’s effectiveness

The mandate is to reduce the tonnage of litter

Tonnage is unfeasible to measure

Give people a hard question, and they’ll answer an easier one instead

Littering in Dataville is a complex system

You can’t build and implement a unified litter-measuring model

Heuristics are a middle ground between going with your gut and optimization

Use a fast and frugal tree

Is there a simpler way to assess LitterGitters’ success?

Stereotypes are heuristics

Your analysis is ready to present

Looks like your analysis impressed the city council members


Chapter 9 Histograms: The shape of numbers

Your annual review is coming up

Going for more cash could play out in a bunch of different ways

Here’s some data on raises

Histograms show frequencies of groups of numbers

Gaps between bars in a histogram mean gaps among the data points

Install and run R

Load data into R

R creates beautiful histograms

Make histograms from subsets of your data

Negotiation pays

What will negotiation mean for you?


Chapter 10 Regression: Prediction

What are you going to do with all this money?

An analysis that tells people what to ask for could be huge

Behold... the Raise Reckoner!

Inside the algorithm will be a method to predict raises

Scatterplots compare two variables

A line could tell your clients where to aim

Predict values in each strip with the graph of averages

The regression line predicts what raises people will receive

The line is useful if your data shows a linear correlation

You need an equation to make your predictions precise

Tell R to create a regression object

The regression equation goes hand in hand with your scatterplot

The regression equation is the Raise Reckoner algorithm

Your raise predictor didn’t work out as planned...


Chapter 11 Error: Err Well

Your clients are pretty ticked off

What did your raise prediction algorithm do?

The segments of customers

The guy who asked for 25% went outside the model

How to handle the client who wants a prediction outside the data range

The guy who got fired because of extrapolation has cooled off

You’ve only solved part of the problem

What does the data for the screwy outcomes look like?

Chance errors are deviations from what your model predicts

Error is good for you and your client

Specify error quantitatively

Quantify your residual distribution with Root Mean Squared error

Your model in R already knows the R.M.S. error

R’s summary of your linear model shows your R.M.S. error

Segmentation is all about managing error

Good regressions balance explanation and prediction

Your segmented models manage error better than the original model

Your clients are returning in droves


Chapter 12 Relational Databases: Can you relate?

The Dataville Dispatch wants to analyze sales

Here’s the data they keep to track their operations

You need to know how the data tables relate to each other

A database is a collection of data with well-specified relations to each other

Trace a path through the relations to make the comparison you need

Create a spreadsheet that goes across that path

Your summary ties article count and sales together

Looks like your scatterplot is going over really well

Copying and pasting all that data was a pain

Relational databases manage relations for you

Dataville Dispatch built an RDBMS with your relationship diagram

Dataville Dispatch extracted your data using the SQL language

Comparison possibilities are endless if your data is in a RDBMS

You’re on the cover


Chapter 13 Cleaning Data: Impose order

Just got a client list from a defunct competitor

The dirty secret of data analysis

Head First Head Hunters wants the list for their sales team

Cleaning messy data is all about preparation

Once you’re organized, you can fix the data itself

Use the # sign as a delimiter

Excel split your data into columns using the delimiter

Use SUBSTITUTE to replace the carat character

You cleaned up all the first names

The last name pattern is too complex for SUBSTITUTE

Handle complex patterns with nested text formulas

R can use regular expressions to crunch complex data patterns

The sub command fixed your last names

Now you can ship the data to your client

Maybe you’re not quite done yet...

Sort your data to show duplicate values together

The data is probably from a relational database

Remove duplicate names

You created nice, clean, unique records

Head First Head Hunters is recruiting like gangbusters!

Leaving town...

It’s been great having you here in Dataville!


Appendix Leftovers: The Top Ten Things (we didn’t cover)

#1: Everything else in statistics

#2: Excel skills

#3: Edward Tufte and his principles of visualization

#4: PivotTables

#5: The R community

#6: Nonlinear and multiple regression

#7: Null-alternative hypothesis testing

#8: Randomness

#9: Google Docs

#10: Your expertise


Appendix Install R: Start R up!

Get started with R


Appendix Install Excel Analysis Tools: The ToolPak

Install the data analysis tools in Excel
