Big Data for Chimps

Book description

Finding patterns in massive event streams can be difficult, but learning how to find them doesn’t have to be. This unique hands-on guide shows you how to solve this and many other problems in large-scale data processing with simple, fun, and elegant tools that leverage Apache Hadoop. You’ll gain a practical, actionable view of big data by working with real data and real problems.

Perfect for beginners, this book’s approach will also appeal to experienced practitioners who want to brush up on their skills. Part I explains how Hadoop and MapReduce work, while Part II covers many analytic patterns you can use to process any data. As you work through several exercises, you’ll also learn how to use Apache Pig to process data.

  • Learn the necessary mechanics of working with Hadoop, including how data and computation move around the cluster
  • Dive into MapReduce mechanics and build your first MapReduce job in Python
  • Understand how to run chains of MapReduce jobs in the form of Pig scripts
  • Use a real-world dataset—baseball performance statistics—throughout the book
  • Work with examples of several analytic patterns, and learn when and where you might use them

Table of contents

  1. Preface
    1. What This Book Covers
    2. Who This Book Is For
    3. Who This Book Is Not For
    4. What This Book Does Not Cover
    5. Theory: Chimpanzee and Elephant
    6. Practice: Hadoop
    7. Example Code
    8. A Note on Python and MrJob
    9. Helpful Reading
    10. Feedback
    11. Conventions Used in This Book
    12. Using Code Examples
    13. Safari® Books Online
    14. How to Contact Us
  2. I. Introduction: Theory and Tools
  3. 1. Hadoop Basics
    1. Chimpanzee and Elephant Start a Business
    2. Map-Only Jobs: Process Records Individually
    3. Pig Latin Map-Only Job
    4. Setting Up a Docker Hadoop Cluster
      1. Run the Job
    5. Wrapping Up
  4. 2. MapReduce
    1. Chimpanzee and Elephant Save Christmas
      1. Trouble in Toyland
      2. Chimpanzees Process Letters into Labeled Toy Forms
    2. Pygmy Elephants Carry Each Toy Form to the Appropriate Workbench
    3. Example: Reindeer Games
      1. UFO Data
      2. Group the UFO Sightings by Reporting Delay
      3. Mapper
      4. Reducer
      5. Plot the Data
      6. Reindeer Conclusion
    4. Hadoop Versus Traditional Databases
    5. The MapReduce Haiku
      1. Map Phase, in Light Detail
      2. Group-Sort Phase, in Light Detail
      3. Reduce Phase, in Light Detail
    6. Wrapping Up
  5. 3. A Quick Look into Baseball
    1. The Data
    2. Acronyms and Terminology
    3. The Rules and Goals
    4. Performance Metrics
    5. Wrapping Up
  6. 4. Introduction to Pig
    1. Pig Helps Hadoop Work with Tables, Not Records
      1. Wikipedia Visitor Counts
    2. Fundamental Data Operations
      1. Control Operations
      2. Pipelinable Operations
      3. Structural Operations
    3. LOAD Locates and Describes Your Data
      1. Simple Types
      2. Complex Type 1, Tuples: Fixed-Length Sequence of Typed Fields
      3. Complex Type 2, Bags: Unbounded Collection of Tuples
      4. Defining the Schema of a Transformed Record
    4. STORE Writes Data to Disk
    5. Development Aid Commands
      1. DESCRIBE
      2. DUMP
      3. SAMPLE
      4. ILLUSTRATE
      5. EXPLAIN
    6. Pig Functions
    7. Piggybank
    8. Apache DataFu
    9. Wrapping Up
  7. II. Tactics: Analytic Patterns
  8. 5. Map-Only Operations
    1. Pattern in Use
    2. Eliminating Data
    3. Selecting Records That Satisfy a Condition: FILTER and Friends
      1. Selecting Records That Satisfy Multiple Conditions
      2. Selecting or Rejecting Records with a null Value
      3. Selecting Records That Match a Regular Expression (MATCHES)
      4. Matching Records Against a Fixed List of Lookup Values
    4. Project Only Chosen Columns by Name
      1. Using a FOREACH to Select, Rename, and Reorder fields
      2. Extracting a Random Sample of Records
      3. Extracting a Consistent Sample of Records by Key
      4. Sampling Carelessly by Only Loading Some part- Files
      5. Selecting a Fixed Number of Records with LIMIT
      6. Other Data Elimination Patterns
    5. Transforming Records
      1. Transforming Records Individually Using FOREACH
      2. A Nested FOREACH Allows Intermediate Expressions
      3. Formatting a String According to a Template
      4. Assembling Literals with Complex Types
      5. Manipulating the Type of a Field
      6. Ints and Floats and Rounding, Oh My!
      7. Calling a User-Defined Function from an External Package
    6. Operations That Break One Table into Many
      1. Directing Data Conditionally into Multiple Dataflows (SPLIT)
    7. Operations That Treat the Union of Several Tables as One
      1. Treating Several Pig Relation Tables as a Single Table (Stacking Rowsets)
    8. Wrapping Up
  9. 6. Grouping Operations
    1. Grouping Records into a Bag by Key
      1. Pattern in Use
      2. Counting Occurrences of a Key
      3. Representing a Collection of Values with a Delimited String
      4. Representing a Complex Data Structure with a Delimited String
      5. Representing a Complex Data Structure with a JSON-Encoded String
    2. Group and Aggregate
      1. Aggregating Statistics of a Group
      2. Completely Summarizing a Field
      3. Summarizing Aggregate Statistics of a Full Table
      4. Summarizing a String Field
    3. Calculating the Distribution of Numeric Values with a Histogram
      1. Pattern in Use
      2. Binning Data for a Histogram
      3. Choosing a Bin Size
      4. Interpreting Histograms and Quantiles
      5. Binning Data into Exponentially Sized Buckets
      6. Creating Pig Macros for Common Stanzas
      7. Distribution of Games Played
      8. Extreme Populations and Confounding Factors
      9. Don’t Trust Distributions at the Tails
      10. Calculating a Relative Distribution Histogram
      11. Reinjecting Global Values
      12. Calculating a Histogram Within a Group
      13. Dumping Readable Results
    4. The Summing Trick
      1. Counting Conditional Subsets of a Group—The Summing Trick
      2. Summarizing Multiple Subsets of a Group Simultaneously
      3. Testing for Absence of a Value Within a Group
    5. Wrapping Up
    6. References
  10. 7. Joining Tables
    1. Matching Records Between Tables (Inner Join)
      1. Joining Records in a Table with Directly Matching Records from Another Table (Direct Inner Join)
    2. How a Join Works
      1. A Join Is a COGROUP+FLATTEN
      2. A Join Is a MapReduce Job with a Secondary Sort on the Table Name
      3. Handling nulls and Nonmatches in Joins and Groups
    3. Enumerating a Many-to-Many Relationship
    4. Joining a Table with Itself (Self-Join)
    5. Joining Records Without Discarding Nonmatches (Outer Join)
      1. Pattern in Use
      2. Joining Tables That Do Not Have a Foreign-Key Relationship
      3. Joining on an Integer Table to Fill Holes in a List
    6. Selecting Only Records That Lack a Match in Another Table (Anti-Join)
    7. Selecting Only Records That Possess a Match in Another Table (Semi-Join)
      1. An Alternative to Anti-Join: Using a COGROUP
    8. Wrapping Up
  11. 8. Ordering Operations
    1. Preparing Career Epochs
    2. Sorting All Records in Total Order
      1. Sorting by Multiple Fields
      2. Sorting on an Expression (You Can’t)
      3. Sorting Case-Insensitive Strings
      4. Dealing with nulls When Sorting
      5. Floating Values to the Top or Bottom of the Sort Order
    3. Sorting Records Within a Group
      1. Pattern in Use
      2. Selecting Rows with the Top-K Values for a Field
      3. Top K Within a Group
    4. Numbering Records in Rank Order
      1. Finding Records Associated with Maximum Values
      2. Shuffling a Set of Records
    5. Wrapping Up
  12. 9. Duplicate and Unique Records
    1. Handling Duplicates
      1. Eliminating Duplicate Records from a Table
      2. Eliminating Duplicate Records from a Group
      3. Eliminating All But One Duplicate Based on a Key
      4. Selecting Records with Unique (or with Duplicate) Values for a Key
    2. Set Operations
      1. Set Operations on Full Tables
      2. Distinct Union
      3. Distinct Union (Alternative Method)
      4. Set Intersection
      5. Set Difference
      6. Symmetric Set Difference: (A–B)+(B–A)
      7. Set Equality
      8. Set Operations Within Groups
      9. Constructing a Sequence of Sets
      10. Set Operations Within a Group
    3. Wrapping Up
  13. Index

Product information

  • Title: Big Data for Chimps
  • Author(s): Philip Kromer, Russell Jurney
  • Release date: September 2015
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491923900