Finding patterns in massive event streams can be difficult, but learning how to find them doesn’t have to be. This unique hands-on guide shows you how to solve this and many other problems in large-scale data processing with simple, fun, and elegant tools that leverage Apache Hadoop. You’ll gain a practical, actionable view of big data by working with real data and real problems.
Perfect for beginners, this book’s approach will also appeal to experienced practitioners who want to brush up on their skills. Part I explains how Hadoop and MapReduce work, while Part II covers many analytic patterns you can use to process any data. As you work through several exercises, you’ll also learn how to use Apache Pig to process data.
Learn the necessary mechanics of working with Hadoop, including how data and computation move around the cluster
Dive into map/reduce mechanics and build your first map/reduce job in Python
Understand how to run chains of map/reduce jobs in the form of Pig scripts
Use a real-world dataset—baseball performance statistics—throughout the book
Work with examples of several analytic patterns, and learn when and where you might use them
Introduction: Theory and Tools
Chapter 1: Hadoop Basics
Chimpanzee and Elephant Start a Business
Map-Only Jobs: Process Records Individually
Pig Latin Map-Only Job
Setting Up a Docker Hadoop Cluster
Chimpanzee and Elephant Save Christmas
Pygmy Elephants Carry Each Toy Form to the Appropriate Workbench
Example: Reindeer Games
Hadoop Versus Traditional Databases
The MapReduce Haiku
Chapter 3: A Quick Look into Baseball
Acronyms and Terminology
The Rules and Goals
Chapter 4: Introduction to Pig
Pig Helps Hadoop Work with Tables, Not Records
Fundamental Data Operations
LOAD Locates and Describes Your Data
STORE Writes Data to Disk
Development Aid Commands
Tactics: Analytic Patterns
Chapter 5: Map-Only Operations
Pattern in Use
Selecting Records That Satisfy a Condition: FILTER and Friends
Project Only Chosen Columns by Name
Operations That Break One Table into Many
Operations That Treat the Union of Several Tables as One
Chapter 6: Grouping Operations
Grouping Records into a Bag by Key
Group and Aggregate
Calculating the Distribution of Numeric Values with a Histogram
The Summing Trick
Chapter 7: Joining Tables
Matching Records Between Tables (Inner Join)
How a Join Works
Enumerating a Many-to-Many Relationship
Joining a Table with Itself (Self-Join)
Joining Records Without Discarding Nonmatches (Outer Join)
Selecting Only Records That Lack a Match in Another Table (Anti-Join)
Selecting Only Records That Possess a Match in Another Table (Semi-Join)
Flip is the founder and CTO at Infochimps.com, a big data platform that makes acquiring, storing, and analyzing massive data streams transformatively easier. He enjoys bowling, Scrabble, working on old cars or new wood, and rooting for the Red Sox.
Russell Jurney cut his data teeth in casino gaming, building web apps to analyze the performance of slot machines in the US and Mexico. After dabbling in entrepreneurship, interactive media, and journalism, he moved to Silicon Valley to build analytics applications at scale at Ning and LinkedIn. He lives on the ocean in Pacifica, California, with his wife Kate and two fuzzy dogs.
The animal on the cover of Big Data for Chimps is a chimpanzee. In casual usage, the name "chimpanzee" now more often designates only the common chimpanzee, or Pan troglodytes, rather than the entire Pan genus, to which the bonobo, or Pan paniscus, also belongs. Chimps, as their name is often shortened, are the human species's closest living relative, having diverged from the evolutionary line along which Homo sapiens developed between 4 and 6 million years ago. Indeed, the remarkable sophistication of the chimpanzee, according to the standard of those same Homo sapiens, extends to the chimp's capacity for making and using tools, for interacting with other members of its species in complex social and political formations, and for displaying emotions, among other things. On January 31, 1961, a common chimp later named "Ham" even preceded his human counterparts into space by a full 10 weeks.
Chimpanzees can associate in stable groups of up to 100, a number that comprises smaller groups of a handful or more that may separate from the main group for periods of time. Male chimps may hunt together, and the distribution of meat from such expeditions may be used to establish and maintain social alliances. Well-documented accounts of sustained aggression between groups of chimpanzees have made them less attractive analogs for human potential, in recent years, than the chimp's more promiscuous, frugivorous, and possibly more matriarchal bonobo cousins.
Many of the animals on O'Reilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to animals.oreilly.com.
The cover image is from Lydekker's Royal Natural History. The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag's Ubuntu Mono.
This book leaves you with two things: physical intuition for how data moves through the system in a Hadoop job, and a practical cookbook for the full range of database constructs needed by the practicing data scientist. It's a great resource for both beginning and intermediate practitioners.
It covers the big data toolkit from the outside, maximizing programmer efficiency and maintainability over raw performance and sophistication. It emphasizes high-level tools that get the job done rather than grinding through the minutiae of the primitive Java APIs.
Bottom line: Yes, I would recommend this to a friend.