What is bad data? Some people consider it a technical phenomenon, like missing values or malformed records, but bad data includes a lot more. In this handbook, data expert Q. Ethan McCallum has gathered 19 colleagues from every corner of the data arena to reveal how they’ve recovered from nasty data problems.
From cranky storage to poor representation to misguided policy, there are many paths to bad data. Bottom line? Bad data is data that gets in the way. This book explains effective ways to get around it.
Among the many topics covered, you’ll discover how to:
Test drive your data to see if it’s ready for analysis
Work spreadsheet data into a usable form
Handle encoding problems that lurk in text data
Develop a successful web-scraping effort
Use NLP tools to reveal the real sentiment of online reviews
Address cloud computing issues that can impact your analysis effort
Avoid policies that create data analysis roadblocks
Take a systematic approach to data quality analysis
Chapter 1 Setting the Pace: What Is Bad Data?
Chapter 2 Is It Just Me, or Does This Data Smell Funny?
Understand the Data Structure
Physical Interpretation of Simple Statistics
Keyword PPC Example
Search Referral Example
Time Series Data
Chapter 3 Data Intended for Human Consumption, Not Machine Consumption
The Problem: Data Formatted for Human Consumption
The Solution: Writing Code
Chapter 4 Bad Data Lurking in Plain Text
Which Plain Text Encoding?
Guessing Text Encoding
Problem: Application-Specific Characters Leaking into Plain Text
Text Processing with Python
Chapter 5 (Re)Organizing the Web’s Data
Can You Get That?
General Workflow Example
The Real Difficulties
The Dark Side
Chapter 6 Detecting Liars and the Confused in Contradictory Online Reviews
Training a Classifier
Validating the Classifier
Designing with Data
Chapter 7 Will the Bad Data Please Stand Up?
Example 1: Defect Reduction in Manufacturing
Example 2: Who’s Calling?
Example 3: When “Typical” Does Not Mean “Average”
Will This Be on the Test?
Chapter 8 Blood, Sweat, and Urine
A Very Nerdy Body Swap Comedy
How Chemists Make Up Numbers
All Your Database Are Belong to Us
Live Fast, Die Young, and Leave a Good-Looking Corpse Code Repository
Rehab for Chemists (and Other Spreadsheet Abusers)
Chapter 9 When Data and Reality Don’t Match
Whose Ticker Is It Anyway?
Splits, Dividends, and Rescaling
Chapter 10 Subtle Sources of Bias and Error
Imputation Bias: General Issues
Reporting Errors: General Issues
Other Sources of Bias
Chapter 11 Don’t Let the Perfect Be the Enemy of the Good: Is Bad Data Really Bad?
But First, Let’s Reflect on Graduate School …
Moving On to the Professional World
Moving into Government Work
Government Data Is Very Real
Service Call Data as an Applied Example
Lessons Learned and Looking Ahead
Chapter 12 When Databases Attack: A Guide for When to Stick to Files
Consider Files as Your Datastore
A Web Framework Backed by Files
Chapter 13 Crouching Table, Hidden Network
A Relational Cost Allocations Model
The Delicate Sound of a Combinatorial Explosion…
The Hidden Network Emerges
Storing the Graph
Navigating the Graph with Gremlin
Finding Value in Network Properties
Think in Terms of Multiple Data Models and Use the Right Tool for the Job
Chapter 14 Myths of Cloud Computing
Introduction to the Cloud
What Is “The Cloud”?
The Cloud and Big Data
At First Everything Is Great
They Put 100% of Their Infrastructure in the Cloud
As Things Grow, They Scale Easily at First
Then Things Start Having Trouble
They Need to Improve Performance
Higher IO Becomes Critical
A Major Regional Outage Causes Massive Downtime
Higher IO Comes with a Cost
Data Sizes Increase
Geo Redundancy Becomes a Priority
Horizontal Scale Isn’t as Easy as They Hoped
Costs Increase Dramatically
Myth 1: Cloud Is a Great Solution for All Infrastructure Components
Myth 2: Cloud Will Save Us Money
Myth 3: Cloud IO Performance Can Be Improved to Acceptable Levels Through Software RAID
Myth 4: Cloud Computing Makes Horizontal Scaling Easy
Conclusion and Recommendations
Chapter 15 The Dark Side of Data Science
Avoid These Pitfalls
Know Nothing About Thy Data
Thou Shalt Provide Your Data Scientists with a Single Tool for All Tasks
Thou Shalt Analyze for Analysis’ Sake Only
Thou Shalt Compartmentalize Learnings
Thou Shalt Expect Omnipotence from Data Scientists
Chapter 16 How to Feed and Care for Your Machine-Learning Experts
Define the Problem
Fake It Before You Make It
Create a Training Set
Pick the Features
Encode the Data
Split Into Training, Test, and Solution Sets
Describe the Problem
Respond to Questions
Integrate the Solutions
Chapter 17 Data Traceability
Immutability: Borrowing an Idea from Functional Programming
Chapter 18 Social Media: Erasable Ink?
Social Media: Whose Data Is This Anyway?
Expectations Around Communication and Expression
Technical Implications of New End User Expectations
What Does the Industry Do?
What Should End Users Do?
How Do We Work Together?
Chapter 19 Data Quality Analysis Demystified: Knowing When Your Data Is Good Enough
Framework Introduction: The Four Cs of Data Quality Analysis
Q. Ethan McCallum is a consultant, writer, and technology enthusiast, though perhaps not in that order. His work has appeared online on the O’Reilly Network and Java.net, and in print publications such as C/C++ Users Journal, Dr. Dobb’s Journal, and Linux Magazine. In his professional roles, he helps companies make smart decisions about data and technology.
The animal on the cover of Bad Data Handbook is Ross’s goose (Chen rossii or Anser rossii), a North American species that gets its name from Bernard R. Ross, a Hudson’s Bay Company factor at Fort Resolution in Canada’s Northwest Territories. Northmen coined other names for this species, including “galoot” and “scabby-nosed wavey.” There is debate about whether these geese belong in the “white” goose genus Chen or the traditional “gray” goose genus Anser. Their plumage is primarily white with black wing tips, reminiscent of the white-phase Snow Goose, though about 40% smaller.
No matter their technical genus, these birds breed in northern Canada and the central Arctic (primarily in the Queen Maud Gulf Migratory Bird Sanctuary), wintering far south in the southern United States, central California, and sometimes northern Mexico. In Western Europe, these birds are kept mostly in wildfowl collections, but escaped or feral birds are often encountered with other feral geese (Canada Goose, Greylag Goose, Barnacle Goose).
The cover font is Adobe ITC Garamond. The text font is Linotype Birka; the heading font is Adobe Myriad Condensed; and the code font is LucasFont’s TheSansMonoCondensed.
This book covers an interesting subject: the difficulties and traps that come with real-world data. Overall, the book provides some useful insights on how to tackle common issues, with case studies and examples drawn from areas ranging from web data to chemistry.
The level of detail varies across chapters. A couple of chapters go into depth and provide code snippets, while many others cover their topic in a more superficial fashion.
Some chapters stretch the "bad data" theme, in the sense that they address not data problems themselves but poor choices of technology for storage or analysis.
Bottom Line: Yes, I would recommend this to a friend
Bad data is a fact of life. Coping with bad data is a valuable, learned skill. Bad Data Handbook offers insights from over 20 authors based on their years of personal experience managing ill-defined, often chaotic and incomplete data. We begin with an exploration of what is meant by *bad data* and what checks we can perform to help us understand data quality as a prerequisite to data analysis.
Kevin Fink offers suggestions on approaching data critically in order to ensure that we understand what we're working with before we begin to try to manipulate it. Fink offers useful scripts in shell and Perl that can be used to inspect data and perform basic sanity checks. Paul Murrell tackles the problem of scraping data from sources formatted for human consumption into a format more amenable for algorithmic analysis using R. And on and on.
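To give a flavor of the kind of basic sanity checks Fink advocates, here is a minimal shell sketch (not taken from the book; the file name, delimiter, and fields are hypothetical): count the rows, verify that every row has the same number of fields as the header, and look for duplicate keys.

```shell
#!/bin/sh
# Toy tab-separated file for illustration (hypothetical data).
f=example.tsv
printf 'id\tvalue\n1\t10\n2\t20\n2\t30\n' > "$f"

# How big is the file, roughly?
echo "rows (incl. header): $(wc -l < "$f")"

# Every row should have the same number of tab-separated fields as the header.
awk -F'\t' 'NR==1 {n=NF} NF!=n {bad++} END {print "rows with wrong field count:", bad+0}' "$f"

# Keys in column 1 should be unique; report any that repeat.
cut -f1 "$f" | tail -n +2 | sort | uniq -d | awk '{print "duplicate key:", $1}'
```

Checks like these catch truncated rows, delimiter surprises, and accidental duplication before any real analysis begins, which is exactly the spirit of the chapter.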
Each chapter addresses a critical concern in the data life-cycle: identifying, annotating, capturing, archiving, versioning, manipulating, analyzing, and deriving actionable information from imperfect or incomplete data. The advice offered is both powerful and immediately useful to seasoned data scientists and newcomers to the field alike, and for me it has spurred several ideas for how to approach teaching statistics.
Given the number of authors who contributed to this volume, it should come as no surprise that the tone, writing styles, and tools used vary greatly among the chapters, sometimes wandering into technical minutiae, but only infrequently. Regardless, the book holds together remarkably well and was a pleasure to read.
Bottom Line: Yes, I would recommend this to a friend