MapReduce Design Patterns

Book description

Until now, design patterns for the MapReduce framework have been scattered among various research papers, blogs, and books. This handy guide brings together a unique collection of valuable MapReduce patterns that will save you time and effort regardless of the domain, language, or development framework you’re using.

Each pattern is explained in context, with pitfalls and caveats clearly identified to help you avoid common design mistakes when modeling your big data architecture. This book also provides a complete overview of MapReduce that explains its origins and implementations, and why design patterns are so important. All code examples are written for Hadoop.

  • Summarization patterns: get a top-level view by summarizing and grouping data
  • Filtering patterns: view data subsets such as records generated from one user
  • Data organization patterns: reorganize data to work with other systems, or to make MapReduce analysis easier
  • Join patterns: analyze different datasets together to discover interesting relationships
  • Metapatterns: piece together several patterns to solve multi-stage problems, or to perform several analytics in the same job
  • Input and output patterns: customize the way you use Hadoop to load or store data

"A clear exposition of MapReduce programs for common data processing patterns—this book is indispensable for anyone using Hadoop."

--Tom White, author of Hadoop: The Definitive Guide

Table of contents

  1. Dedication
  2. Preface
    1. Intended Audience
    2. Pattern Format
    3. The Examples in This Book
    4. Conventions Used in This Book
    5. Using Code Examples
    6. Safari® Books Online
    7. How to Contact Us
    8. Acknowledgments
  3. Chapter 1. Design Patterns and MapReduce
    1. Design Patterns
    2. MapReduce History
    3. MapReduce and Hadoop Refresher
    4. Hadoop Example: Word Count
    5. Pig and Hive
  4. Chapter 2. Summarization Patterns
    1. Numerical Summarizations
      1. Pattern Description
        1. Intent
        2. Motivation
        3. Applicability
        4. Structure
        5. Consequences
        6. Known uses
        7. Resemblances
        8. Performance analysis
      2. Numerical Summarization Examples
        1. Minimum, maximum, and count example
          1. MinMaxCountTuple code
          2. Mapper code
          3. Reducer code
          4. Combiner optimization
          5. Data flow diagram
        2. Average example
          1. Mapper code
          2. Reducer code
          3. Combiner optimization
          4. Data flow diagram
        3. Median and standard deviation
          1. Mapper code
          2. Reducer code
          3. Combiner optimization
        4. Memory-conscious median and standard deviation
          1. Mapper code
          2. Reducer code
          3. Combiner optimization
          4. Data flow diagram
    2. Inverted Index Summarizations
      1. Pattern Description
        1. Intent
        2. Motivation
        3. Applicability
        4. Structure
        5. Consequences
        6. Performance analysis
      2. Inverted Index Example
        1. Wikipedia reference inverted index
          1. Mapper code
          2. Reducer code
          3. Combiner optimization
    3. Counting with Counters
      1. Pattern Description
        1. Intent
        2. Motivation
        3. Applicability
        4. Structure
        5. Consequences
        6. Known uses
        7. Performance analysis
      2. Counting with Counters Example
        1. Number of users per state
          1. Mapper code
          2. Driver code
  5. Chapter 3. Filtering Patterns
    1. Filtering
      1. Pattern Description
        1. Intent
        2. Motivation
        3. Applicability
        4. Structure
        5. Consequences
        6. Known uses
        7. Resemblances
        8. Performance analysis
      2. Filtering Examples
        1. Distributed grep
          1. Mapper code
        2. Simple Random Sampling
          1. Mapper code
    2. Bloom Filtering
      1. Pattern Description
        1. Intent
        2. Motivation
        3. Applicability
        4. Structure
        5. Consequences
        6. Known uses
        7. Resemblances
        8. Performance analysis
      2. Bloom Filtering Examples
        1. Hot list
          1. Bloom filter training
          2. Mapper code
        2. HBase Query using a Bloom filter
          1. Mapper code
    3. Top Ten
      1. Pattern Description
        1. Intent
        2. Motivation
        3. Applicability
        4. Structure
        5. Consequences
        6. Known uses
        7. Resemblances
        8. Performance analysis
      2. Top Ten Examples
        1. Top ten users by reputation
          1. Mapper code
          2. Reducer code
    4. Distinct
      1. Pattern Description
        1. Intent
        2. Motivation
        3. Applicability
        4. Structure
        5. Consequences
        6. Known uses
        7. Resemblances
        8. Performance analysis
      2. Distinct Examples
        1. Distinct user IDs
          1. Mapper code
          2. Reducer code
          3. Combiner optimization
  6. Chapter 4. Data Organization Patterns
    1. Structured to Hierarchical
      1. Pattern Description
        1. Intent
        2. Motivation
        3. Applicability
        4. Structure
        5. Consequences
        6. Known uses
        7. Resemblances
        8. Performance analysis
      2. Structured to Hierarchical Examples
        1. Post/comment building on StackOverflow
          1. Driver code
          2. Mapper code
          3. Reducer code
        2. Question/answer building on StackOverflow
          1. Mapper code
          2. Reducer code
    2. Partitioning
      1. Pattern Description
        1. Intent
        2. Motivation
        3. Applicability
        4. Structure
        5. Consequences
        6. Known uses
        7. Resemblances
        8. Performance analysis
      2. Partitioning Examples
        1. Partitioning users by last access date
          1. Driver code
          2. Mapper code
          3. Partitioner code
          4. Reducer code
    3. Binning
      1. Pattern Description
        1. Intent
        2. Motivation
        3. Structure
        4. Consequences
        5. Resemblances
        6. Performance analysis
      2. Binning Examples
        1. Binning by Hadoop-related tags
          1. Driver code
          2. Mapper code
    4. Total Order Sorting
      1. Pattern Description
        1. Intent
        2. Motivation
        3. Applicability
        4. Structure
        5. Consequences
        6. Resemblances
        7. Performance analysis
      2. Total Order Sorting Examples
        1. Sort users by last visit
          1. Driver code
          2. Analyze mapper code
          3. Order mapper code
          4. Order reducer code
    5. Shuffling
      1. Pattern Description
        1. Intent
        2. Motivation
        3. Structure
        4. Consequences
        5. Resemblances
        6. Performance analysis
      2. Shuffle Examples
        1. Anonymizing StackOverflow comments
          1. Mapper code
          2. Reducer code
  7. Chapter 5. Join Patterns
    1. A Refresher on Joins
    2. Reduce Side Join
      1. Pattern Description
        1. Intent
        2. Motivation
        3. Applicability
        4. Structure
        5. Consequences
        6. Resemblances
        7. Performance analysis
      2. Reduce Side Join Example
        1. User and comment join
          1. Driver code
          2. User mapper code
          3. Comment mapper code
          4. Reducer code
          5. Combiner optimization
      3. Reduce Side Join with Bloom Filter
        1. Reputable user and comment join
          1. User mapper code
          2. Comment mapper code
    3. Replicated Join
      1. Pattern Description
        1. Intent
        2. Motivation
        3. Applicability
        4. Structure
        5. Consequences
        6. Resemblances
        7. Performance analysis
      2. Replicated Join Examples
        1. Replicated user comment example
          1. Mapper code
    4. Composite Join
      1. Pattern Description
        1. Intent
        2. Motivation
        3. Applicability
        4. Structure
        5. Consequences
        6. Performance analysis
      2. Composite Join Examples
        1. Composite user comment join
          1. Driver code
          2. Mapper code
          3. Reducer and combiner
    5. Cartesian Product
      1. Pattern Description
        1. Intent
        2. Motivation
        3. Applicability
        4. Structure
        5. Consequences
        6. Resemblances
        7. Performance analysis
      2. Cartesian Product Examples
        1. Comment Comparison
          1. Input format code
          2. Driver code
          3. Record reader code
          4. Mapper code
  8. Chapter 6. Metapatterns
    1. Job Chaining
      1. With the Driver
      2. Job Chaining Examples
        1. Basic job chaining
          1. Job one mapper
          2. Job one reducer
          3. Job two mapper
          4. Driver code
        2. Parallel job chaining
          1. Mapper code
          2. Reducer code
          3. Driver code
      3. With Shell Scripting
        1. Bash example
          1. Bash script
          2. Sample run
      4. With JobControl
        1. Job control example
          1. Main method
          2. Helper methods
    2. Chain Folding
      1. The ChainMapper and ChainReducer Approach
      2. Chain Folding Example
        1. Bin users by reputation
          1. Parsing mapper code
          2. Replicated join mapper code
          3. Reducer code
          4. Binning mapper code
          5. Driver code
    3. Job Merging
      1. Job Merging Examples
        1. Anonymous comments and distinct users
          1. TaggedText WritableComparable
          2. Merged mapper code
          3. Merged reducer code
          4. Driver code
  9. Chapter 7. Input and Output Patterns
    1. Customizing Input and Output in Hadoop
      1. InputFormat
      2. RecordReader
      3. OutputFormat
      4. RecordWriter
    2. Generating Data
      1. Pattern Description
        1. Intent
        2. Motivation
        3. Structure
        4. Consequences
        5. Resemblances
        6. Performance analysis
      2. Generating Data Examples
        1. Generating random StackOverflow comments
          1. Driver code
          2. InputSplit code
          3. InputFormat code
          4. RecordReader code
    3. External Source Output
      1. Pattern Description
        1. Intent
        2. Motivation
        3. Structure
        4. Consequences
        5. Performance analysis
      2. External Source Output Example
        1. Writing to Redis instances
          1. OutputFormat code
          2. RecordWriter code
          3. Mapper code
          4. Driver code
    4. External Source Input
      1. Pattern Description
        1. Intent
        2. Motivation
        3. Structure
        4. Consequences
        5. Performance analysis
      2. External Source Input Example
        1. Reading from Redis Instances
          1. InputSplit code
          2. InputFormat code
          3. RecordReader code
          4. Driver code
    5. Partition Pruning
      1. Pattern Description
        1. Intent
        2. Motivation
        3. Structure
        4. Consequences
        5. Resemblances
        6. Performance analysis
      2. Partition Pruning Examples
        1. Partitioning by last access date to Redis instances
          1. Custom WritableComparable code
          2. OutputFormat code
          3. RecordWriter code
          4. Mapper code
          5. Driver code
        2. Querying for user reputation by last access date
          1. InputSplit code
          2. InputFormat code
          3. RecordReader code
          4. Driver code
  10. Chapter 8. Final Thoughts and the Future of Design Patterns
    1. Trends in the Nature of Data
      1. Images, Audio, and Video
      2. Streaming Data
    2. The Effects of YARN
    3. Patterns as a Library or Component
    4. How You Can Help
  11. Appendix A. Bloom Filters
    1. Overview
    2. Use Cases
      1. Representing a Data Set
      2. Reduce Queries to External Database
      3. Google BigTable
    3. Downsides
    4. Tweaking Your Bloom Filter
  12. Index
  13. About the Authors
  14. Colophon
  15. Copyright

Product information

  • Title: MapReduce Design Patterns
  • Author(s): Donald Miner, Adam Shook
  • Release date: December 2012
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781449327170