Data Analysis with Open Source Tools
A Hands-On Guide for Programmers and Data Scientists
Publisher: O'Reilly Media
Release Date: November 2010
Pages: 540
Collecting data is relatively easy, but turning raw information into something useful requires that you know how to extract precisely what you need. With this insightful book, intermediate to experienced programmers interested in data analysis will learn techniques for working with data in a business environment. You'll learn how to look at data to discover what it contains, how to capture those ideas in conceptual models, and then feed your understanding back into the organization through business plans, metrics dashboards, and other applications.
Along the way, you'll experiment with concepts through hands-on workshops at the end of each chapter. Above all, you'll learn how to think about the results you want to achieve, rather than rely on tools to think for you.
 Use graphics to describe data with one, two, or dozens of variables
 Develop conceptual models using back-of-the-envelope calculations, as well as scaling and probability arguments
 Mine data with computationally intensive methods such as simulation and clustering
 Make your conclusions understandable through reports, dashboards, and other metrics programs
 Understand financial calculations, including the time value of money
 Use dimensionality reduction techniques or predictive analytics to conquer challenging data analysis situations
 Become familiar with different open source programming environments for data analysis
"Finally, a concise reference for understanding how to conquer piles of data." - Austin King, Senior Web Developer, Mozilla
"An indispensable text for aspiring data scientists." - Michael E. Driscoll, CEO/Founder, Dataspora
Table of Contents

Chapter 1 Introduction

Data Analysis

What’s in This Book

What’s with the Workshops?

What’s with the Math?

What You’ll Need

What’s Missing


Graphics: Looking at Data

Chapter 2 A Single Variable: Shape and Distribution
 Dot and Jitter Plots
 Histograms and Kernel Density Estimates
 The Cumulative Distribution Function
 Rank-Order Plots and Lift Charts
 Only When Appropriate: Summary Statistics and Box Plots
 Workshop: NumPy
 Further Reading

Chapter 3 Two Variables: Establishing Relationships
 Scatter Plots
 Conquering Noise: Smoothing
 Logarithmic Plots
 Banking
 Linear Regression and All That
 Showing What’s Important
 Graphical Analysis and Presentation Graphics
 Workshop: matplotlib
 Further Reading

Chapter 4 Time As a Variable: Time-Series Analysis
 Examples
 The Task
 Smoothing
 Don’t Overlook the Obvious!
 The Correlation Function
 Optional: Filters and Convolutions
 Workshop: scipy.signal
 Further Reading

Chapter 5 More Than Two Variables: Graphical Multivariate Analysis
 False-Color Plots
 A Lot at a Glance: Multiplots
 Composition Problems
 Novel Plot Types
 Interactive Explorations
 Workshop: Tools for Multivariate Graphics
 Further Reading

Chapter 6 Intermezzo: A Data Analysis Session
 A Data Analysis Session
 Workshop: gnuplot
 Further Reading


Analytics: Modeling Data

Chapter 7 Guesstimation and the Back of the Envelope
 Principles of Guesstimation
 How Good Are Those Numbers?
 Optional: A Closer Look at Perturbation Theory and Error Propagation
 Workshop: The Gnu Scientific Library (GSL)
 Further Reading

Chapter 8 Models from Scaling Arguments
 Models
 Arguments from Scale
 Mean-Field Approximations
 Common Time-Evolution Scenarios
 Case Study: How Many Servers Are Best?
 Why Modeling?
 Workshop: Sage
 Further Reading

Chapter 9 Arguments from Probability Models
 The Binomial Distribution and Bernoulli Trials
 The Gaussian Distribution and the Central Limit Theorem
 Power-Law Distributions and Non-Normal Statistics
 Other Distributions
 Optional: Case Study—Unique Visitors over Time
 Workshop: Power-Law Distributions
 Further Reading

Chapter 10 What You Really Need to Know About Classical Statistics
 Genesis
 Statistics Defined
 Statistics Explained
 Controlled Experiments Versus Observational Studies
 Optional: Bayesian Statistics—The Other Point of View
 Workshop: R
 Further Reading

Chapter 11 Intermezzo: Mythbusting—Bigfoot, Least Squares, and All That
 How to Average Averages
 The Standard Deviation
 Least Squares
 Further Reading


Computation: Mining Data

Chapter 12 Simulations
 A Warm-Up Question
 Monte Carlo Simulations
 Resampling Methods
 Workshop: Discrete Event Simulations with SimPy
 Further Reading

Chapter 13 Finding Clusters
 What Constitutes a Cluster?
 Distance and Similarity Measures
 Clustering Methods
 Pre- and Postprocessing
 Other Thoughts
 A Special Case: Market Basket Analysis
 A Word of Warning
 Workshop: Pycluster and the C Clustering Library
 Further Reading

Chapter 14 Seeing the Forest for the Trees: Finding Important Attributes
 Principal Component Analysis
 Visual Techniques
 Kohonen Maps
 Workshop: PCA with R
 Further Reading

Chapter 15 Intermezzo: When More Is Different
 A Horror Story
 Some Suggestions
 What About Map/Reduce?
 Workshop: Generating Permutations
 Further Reading


Applications: Using Data

Chapter 16 Reporting, Business Intelligence, and Dashboards
 Business Intelligence
 Corporate Metrics and Dashboards
 Data Quality Issues
 Workshop: Berkeley DB and SQLite
 Further Reading

Chapter 17 Financial Calculations and Modeling
 The Time Value of Money
 Uncertainty in Planning and Opportunity Costs
 Cost Concepts and Depreciation
 Should You Care?
 Is This All That Matters?
 Workshop: The Newsvendor Problem
 Further Reading

Chapter 18 Predictive Analytics
 Topics in Predictive Analytics
 Some Classification Terminology
 Algorithms for Classification
 The Process
 The Secret Sauce
 The Nature of Statistical Learning
 Workshop: Two Do-It-Yourself Classifiers
 Further Reading

Chapter 19 Epilogue: Facts Are Not Reality


Appendix Programming Environments for Scientific Computation and Data Analysis

Software Tools

A Catalog of Scientific Software

Writing Your Own

Further Reading


Appendix Results from Calculus

Common Functions

Calculus

Useful Tricks

Notation and Basic Math

Where to Go from Here

Further Reading


Appendix Working with Data

Sources for Data

Cleaning and Conditioning

Sampling

Data File Formats

The Care and Feeding of Your Data Zoo

Skills

Terminology

Further Reading


Appendix About the Author

Colophon