Hadoop: The Definitive Guide, 3rd Edition

Book description

Ready to unlock the power of your data? With this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

You’ll find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This third edition covers recent changes to Hadoop, including material on the new MapReduce API, as well as MapReduce 2 and its more flexible execution model (YARN).

  • Store large datasets with the Hadoop Distributed File System (HDFS)
  • Run distributed computations with MapReduce
  • Use Hadoop’s data and I/O building blocks for compression, data integrity, serialization (including Avro), and persistence
  • Discover common pitfalls and advanced features for writing real-world MapReduce programs
  • Design, build, and administer a dedicated Hadoop cluster—or run Hadoop in the cloud
  • Load data from relational databases into HDFS, using Sqoop
  • Perform large-scale data processing with the Pig query language
  • Analyze datasets with Hive, Hadoop’s data warehousing system
  • Take advantage of HBase for structured and semi-structured data, and ZooKeeper for building distributed systems

Table of contents

  1. Hadoop: The Definitive Guide
  2. Dedication
  3. Foreword
  4. Preface
    1. Administrative Notes
    2. What’s in This Book?
    3. What’s New in the Second Edition?
    4. What’s New in the Third Edition?
    5. Conventions Used in This Book
    6. Using Code Examples
    7. Safari® Books Online
    8. How to Contact Us
    9. Acknowledgments
  5. 1. Meet Hadoop
    1. Data!
    2. Data Storage and Analysis
    3. Comparison with Other Systems
      1. Relational Database Management System
      2. Grid Computing
      3. Volunteer Computing
    4. A Brief History of Hadoop
    5. Apache Hadoop and the Hadoop Ecosystem
    6. Hadoop Releases
      1. What’s Covered in This Book
        1. Configuration names
        2. MapReduce APIs
      2. Compatibility
  6. 2. MapReduce
    1. A Weather Dataset
      1. Data Format
    2. Analyzing the Data with Unix Tools
    3. Analyzing the Data with Hadoop
      1. Map and Reduce
      2. Java MapReduce
        1. A test run
        2. The old and the new Java MapReduce APIs
    4. Scaling Out
      1. Data Flow
      2. Combiner Functions
        1. Specifying a combiner function
      3. Running a Distributed MapReduce Job
    5. Hadoop Streaming
      1. Ruby
      2. Python
    6. Hadoop Pipes
      1. Compiling and Running
  7. 3. The Hadoop Distributed Filesystem
    1. The Design of HDFS
    2. HDFS Concepts
      1. Blocks
      2. Namenodes and Datanodes
      3. HDFS Federation
      4. HDFS High-Availability
        1. Failover and fencing
    3. The Command-Line Interface
      1. Basic Filesystem Operations
    4. Hadoop Filesystems
      1. Interfaces
        1. HTTP
        2. C
        3. FUSE
    5. The Java Interface
      1. Reading Data from a Hadoop URL
      2. Reading Data Using the FileSystem API
        1. FSDataInputStream
      3. Writing Data
        1. FSDataOutputStream
      4. Directories
      5. Querying the Filesystem
        1. File metadata: FileStatus
        2. Listing files
        3. File patterns
        4. PathFilter
      6. Deleting Data
    6. Data Flow
      1. Anatomy of a File Read
      2. Anatomy of a File Write
      3. Coherency Model
        1. Consequences for application design
    7. Data Ingest with Flume and Sqoop
    8. Parallel Copying with distcp
      1. Keeping an HDFS Cluster Balanced
    9. Hadoop Archives
      1. Using Hadoop Archives
      2. Limitations
  8. 4. Hadoop I/O
    1. Data Integrity
      1. Data Integrity in HDFS
      2. LocalFileSystem
      3. ChecksumFileSystem
    2. Compression
      1. Codecs
        1. Compressing and decompressing streams with CompressionCodec
        2. Inferring CompressionCodecs using CompressionCodecFactory
        3. Native libraries
          1. CodecPool
      2. Compression and Input Splits
      3. Using Compression in MapReduce
        1. Compressing map output
    3. Serialization
      1. The Writable Interface
        1. WritableComparable and comparators
      2. Writable Classes
        1. Writable wrappers for Java primitives
        2. Text
          1. Indexing
          2. Unicode
          3. Iteration
          4. Mutability
          5. Resorting to String
        3. BytesWritable
        4. NullWritable
        5. ObjectWritable and GenericWritable
        6. Writable collections
      3. Implementing a Custom Writable
        1. Implementing a RawComparator for speed
        2. Custom comparators
      4. Serialization Frameworks
        1. Serialization IDL
    4. Avro
      1. Avro Data Types and Schemas
      2. In-Memory Serialization and Deserialization
        1. The specific API
      3. Avro Datafiles
      4. Interoperability
        1. Python API
        2. C API
      5. Schema Resolution
      6. Sort Order
      7. Avro MapReduce
      8. Sorting Using Avro MapReduce
      9. Avro MapReduce in Other Languages
    5. File-Based Data Structures
      1. SequenceFile
        1. Writing a SequenceFile
        2. Reading a SequenceFile
        3. Displaying a SequenceFile with the command-line interface
        4. Sorting and merging SequenceFiles
        5. The SequenceFile format
      2. MapFile
        1. Writing a MapFile
        2. Reading a MapFile
        3. MapFile variants
        4. Converting a SequenceFile to a MapFile
  9. 5. Developing a MapReduce Application
    1. The Configuration API
      1. Combining Resources
      2. Variable Expansion
    2. Setting Up the Development Environment
      1. Managing Configuration
      2. GenericOptionsParser, Tool, and ToolRunner
    3. Writing a Unit Test with MRUnit
      1. Mapper
      2. Reducer
    4. Running Locally on Test Data
      1. Running a Job in a Local Job Runner
        1. Fixing the mapper
      2. Testing the Driver
    5. Running on a Cluster
      1. Packaging a Job
        1. The client classpath
        2. The task classpath
        3. Packaging dependencies
        4. Task classpath precedence
      2. Launching a Job
      3. The MapReduce Web UI
        1. The jobtracker page
        2. The job page
      4. Retrieving the Results
      5. Debugging a Job
        1. The tasks page
        2. The task details page
        3. Handling malformed data
      6. Hadoop Logs
      7. Remote Debugging
    6. Tuning a Job
      1. Profiling Tasks
        1. The HPROF profiler
        2. Other profilers
    7. MapReduce Workflows
      1. Decomposing a Problem into MapReduce Jobs
      2. JobControl
      3. Apache Oozie
        1. Defining an Oozie workflow
        2. Packaging and deploying an Oozie workflow application
        3. Running an Oozie workflow job
  10. 6. How MapReduce Works
    1. Anatomy of a MapReduce Job Run
      1. Classic MapReduce (MapReduce 1)
        1. Job submission
        2. Job initialization
        3. Task assignment
        4. Task execution
          1. Streaming and pipes
        5. Progress and status updates
        6. Job completion
      2. YARN (MapReduce 2)
        1. Job submission
        2. Job initialization
        3. Task assignment
        4. Task execution
        5. Progress and status updates
        6. Job completion
    2. Failures
      1. Failures in Classic MapReduce
        1. Task failure
        2. Tasktracker failure
        3. Jobtracker failure
      2. Failures in YARN
        1. Task failure
        2. Application master failure
        3. Node manager failure
        4. Resource manager failure
    3. Job Scheduling
      1. The Fair Scheduler
      2. The Capacity Scheduler
    4. Shuffle and Sort
      1. The Map Side
      2. The Reduce Side
      3. Configuration Tuning
    5. Task Execution
      1. The Task Execution Environment
        1. Streaming environment variables
      2. Speculative Execution
      3. Output Committers
        1. Task side-effect files
      4. Task JVM Reuse
      5. Skipping Bad Records
  11. 7. MapReduce Types and Formats
    1. MapReduce Types
      1. The Default MapReduce Job
        1. The default Streaming job
        2. Keys and values in Streaming
    2. Input Formats
      1. Input Splits and Records
        1. FileInputFormat
        2. FileInputFormat input paths
        3. FileInputFormat input splits
        4. Small files and CombineFileInputFormat
        5. Preventing splitting
        6. File information in the mapper
        7. Processing a whole file as a record
      2. Text Input
        1. TextInputFormat
        2. KeyValueTextInputFormat
        3. NLineInputFormat
        4. XML
      3. Binary Input
        1. SequenceFileInputFormat
        2. SequenceFileAsTextInputFormat
        3. SequenceFileAsBinaryInputFormat
      4. Multiple Inputs
      5. Database Input (and Output)
    3. Output Formats
      1. Text Output
      2. Binary Output
        1. SequenceFileOutputFormat
        2. SequenceFileAsBinaryOutputFormat
        3. MapFileOutputFormat
      3. Multiple Outputs
        1. An example: Partitioning data
        2. MultipleOutputs
      4. Lazy Output
      5. Database Output
  12. 8. MapReduce Features
    1. Counters
      1. Built-in Counters
        1. Task counters
        2. Job counters
      2. User-Defined Java Counters
        1. Dynamic counters
        2. Readable counter names
        3. Retrieving counters
          1. Using the new MapReduce API
      3. User-Defined Streaming Counters
    2. Sorting
      1. Preparation
      2. Partial Sort
        1. An application: Partitioned MapFile lookups
      3. Total Sort
      4. Secondary Sort
        1. Java code
        2. Streaming
    3. Joins
      1. Map-Side Joins
      2. Reduce-Side Joins
    4. Side Data Distribution
      1. Using the Job Configuration
      2. Distributed Cache
        1. Usage
        2. How it works
        3. The distributed cache API
    5. MapReduce Library Classes
  13. 9. Setting Up a Hadoop Cluster
    1. Cluster Specification
      1. Network Topology
        1. Rack awareness
    2. Cluster Setup and Installation
      1. Installing Java
      2. Creating a Hadoop User
      3. Installing Hadoop
      4. Testing the Installation
    3. SSH Configuration
    4. Hadoop Configuration
      1. Configuration Management
        1. Control scripts
        2. Master node scenarios
      2. Environment Settings
        1. Memory
        2. Java
        3. System logfiles
        4. SSH settings
      3. Important Hadoop Daemon Properties
        1. HDFS
        2. MapReduce
      4. Hadoop Daemon Addresses and Ports
      5. Other Hadoop Properties
        1. Cluster membership
        2. Buffer size
        3. HDFS block size
        4. Reserved storage space
        5. Trash
        6. Job scheduler
        7. Reduce slow start
        8. Task memory limits
      6. User Account Creation
    5. YARN Configuration
      1. Important YARN Daemon Properties
        1. Memory
      2. YARN Daemon Addresses and Ports
    6. Security
      1. Kerberos and Hadoop
        1. An example
      2. Delegation Tokens
      3. Other Security Enhancements
    7. Benchmarking a Hadoop Cluster
      1. Hadoop Benchmarks
        1. Benchmarking HDFS with TestDFSIO
        2. Benchmarking MapReduce with Sort
        3. Other benchmarks
      2. User Jobs
    8. Hadoop in the Cloud
      1. Apache Whirr
        1. Setup
        2. Launching a cluster
        3. Configuration
        4. Running a proxy
        5. Running a MapReduce job
        6. Shutting down a cluster
  14. 10. Administering Hadoop
    1. HDFS
      1. Persistent Data Structures
        1. Namenode directory structure
        2. The filesystem image and edit log
        3. Secondary namenode directory structure
        4. Datanode directory structure
      2. Safe Mode
        1. Entering and leaving safe mode
      3. Audit Logging
      4. Tools
        1. dfsadmin
        2. Filesystem check (fsck)
          1. Finding the blocks for a file
        3. Datanode block scanner
        4. Balancer
    2. Monitoring
      1. Logging
        1. Setting log levels
        2. Getting stack traces
      2. Metrics
        1. FileContext
        2. GangliaContext
        3. NullContextWithUpdateThread
        4. CompositeContext
      3. Java Management Extensions
    3. Maintenance
      1. Routine Administration Procedures
        1. Metadata backups
        2. Data backups
        3. Filesystem check (fsck)
        4. Filesystem balancer
      2. Commissioning and Decommissioning Nodes
        1. Commissioning new nodes
        2. Decommissioning old nodes
      3. Upgrades
        1. HDFS data and metadata upgrades
          1. Start the upgrade
          2. Wait until the upgrade is complete
          3. Check the upgrade
          4. Roll back the upgrade (optional)
          5. Finalize the upgrade (optional)
  15. 11. Pig
    1. Installing and Running Pig
      1. Execution Types
        1. Local mode
        2. MapReduce mode
      2. Running Pig Programs
      3. Grunt
      4. Pig Latin Editors
    2. An Example
      1. Generating Examples
    3. Comparison with Databases
    4. Pig Latin
      1. Structure
      2. Statements
      3. Expressions
      4. Types
      5. Schemas
        1. Validation and nulls
        2. Schema merging
      6. Functions
      7. Macros
    5. User-Defined Functions
      1. A Filter UDF
        1. Leveraging types
      2. An Eval UDF
        1. Dynamic invokers
      3. A Load UDF
        1. Using a schema
    6. Data Processing Operators
      1. Loading and Storing Data
      2. Filtering Data
        1. FOREACH...GENERATE
        2. STREAM
      3. Grouping and Joining Data
        1. JOIN
        2. COGROUP
        3. CROSS
        4. GROUP
      4. Sorting Data
      5. Combining and Splitting Data
    7. Pig in Practice
      1. Parallelism
      2. Parameter Substitution
        1. Dynamic parameters
        2. Parameter substitution processing
  16. 12. Hive
    1. Installing Hive
      1. The Hive Shell
    2. An Example
    3. Running Hive
      1. Configuring Hive
        1. Logging
      2. Hive Services
        1. Hive clients
      3. The Metastore
    4. Comparison with Traditional Databases
      1. Schema on Read Versus Schema on Write
      2. Updates, Transactions, and Indexes
    5. HiveQL
      1. Data Types
        1. Primitive types
        2. Complex types
      2. Operators and Functions
        1. Conversions
    6. Tables
      1. Managed Tables and External Tables
      2. Partitions and Buckets
        1. Partitions
        2. Buckets
      3. Storage Formats
        1. The default storage format: Delimited text
        2. Binary storage formats: Sequence files, Avro datafiles and RCFiles
        3. An example: RegexSerDe
      4. Importing Data
        1. Inserts
        2. Multitable insert
        3. CREATE TABLE...AS SELECT
      5. Altering Tables
      6. Dropping Tables
    7. Querying Data
      1. Sorting and Aggregating
      2. MapReduce Scripts
      3. Joins
        1. Inner joins
        2. Outer joins
        3. Semi joins
        4. Map joins
      4. Subqueries
      5. Views
    8. User-Defined Functions
      1. Writing a UDF
      2. Writing a UDAF
        1. A more complex UDAF
  17. 13. HBase
    1. HBasics
      1. Backdrop
    2. Concepts
      1. Whirlwind Tour of the Data Model
        1. Regions
        2. Locking
      2. Implementation
        1. HBase in operation
    3. Installation
      1. Test Drive
    4. Clients
      1. Java
        1. MapReduce
      2. Avro, REST, and Thrift
        1. REST
        2. Thrift
        3. Avro
    5. Example
      1. Schemas
      2. Loading Data
        1. Optimization notes
      3. Web Queries
    6. HBase Versus RDBMS
      1. Successful Service
      2. HBase
      3. Use Case: HBase at Streamy.com
        1. Very large items tables
        2. Very large sort merges
        3. Life with HBase
    7. Praxis
      1. Versions
      2. HDFS
      3. UI
      4. Metrics
      5. Schema Design
        1. Joins
        2. Row keys
      6. Counters
      7. Bulk Load
  18. 14. ZooKeeper
    1. Installing and Running ZooKeeper
    2. An Example
      1. Group Membership in ZooKeeper
      2. Creating the Group
      3. Joining a Group
      4. Listing Members in a Group
        1. ZooKeeper command-line tools
      5. Deleting a Group
    3. The ZooKeeper Service
      1. Data Model
        1. Ephemeral znodes
        2. Sequence numbers
        3. Watches
      2. Operations
        1. Multiupdate
        2. APIs
        3. Watch triggers
        4. ACLs
      3. Implementation
      4. Consistency
      5. Sessions
        1. Time
      6. States
    4. Building Applications with ZooKeeper
      1. A Configuration Service
      2. The Resilient ZooKeeper Application
        1. InterruptedException
        2. KeeperException
          1. State exceptions
          2. Recoverable exceptions
          3. Unrecoverable exceptions
        3. A reliable configuration service
      3. A Lock Service
        1. The herd effect
        2. Recoverable exceptions
        3. Unrecoverable exceptions
        4. Implementation
      4. More Distributed Data Structures and Protocols
        1. BookKeeper and Hedwig
    5. ZooKeeper in Production
      1. Resilience and Performance
      2. Configuration
  19. 15. Sqoop
    1. Getting Sqoop
    2. Sqoop Connectors
    3. A Sample Import
      1. Text and Binary File Formats
    4. Generated Code
      1. Additional Serialization Systems
    5. Imports: A Deeper Look
      1. Controlling the Import
      2. Imports and Consistency
      3. Direct-mode Imports
    6. Working with Imported Data
      1. Imported Data and Hive
    7. Importing Large Objects
    8. Performing an Export
    9. Exports: A Deeper Look
      1. Exports and Transactionality
      2. Exports and SequenceFiles
  20. 16. Case Studies
    1. Hadoop Usage at Last.fm
      1. Last.fm: The Social Music Revolution
      2. Hadoop at Last.fm
      3. Generating Charts with Hadoop
      4. The Track Statistics Program
        1. Calculating the number of unique listeners
          1. UniqueListenersMapper
          2. UniqueListenersReducer
        2. Summing the track totals
          1. SumMapper
          2. SumReducer
        3. Merging the results
          1. MergeListenersMapper
          2. IdentityMapper
          3. SumReducer
      5. Summary
    2. Hadoop and Hive at Facebook
      1. Hadoop at Facebook
        1. History
        2. Use cases
        3. Data architecture
        4. Hadoop configuration
      2. Hypothetical Use Case Studies
        1. Advertiser insights and performance
        2. Ad hoc analysis and product feedback
        3. Data analysis
      3. Hive
        1. Data organization
        2. Query language
        3. Data pipelines using Hive
      4. Problems and Future Work
        1. Fair sharing
        2. Space management
        3. Scribe-HDFS integration
        4. Improvements to Hive
    3. Nutch Search Engine
      1. Data Structures
        1. CrawlDb
        2. LinkDb
        3. Segments
      2. Selected Examples of Hadoop Data Processing in Nutch
        1. Link inversion
        2. Generation of fetchlists
          1. Step 1: Select, sort by score, limit by URL count per host
          2. Step 2: Invert, partition by host, sort randomly
        3. Fetcher: A multithreaded MapRunner in action
        4. Indexer: Using custom OutputFormat
      3. Summary
    4. Log Processing at Rackspace
      1. Requirements/The Problem
        1. Logs
      2. Brief History
      3. Choosing Hadoop
      4. Collection and Storage
        1. Log collection
        2. Log storage
      5. MapReduce for Logs
        1. Processing
          1. Phase 1: Map
          2. Phase 1: Reduce
          3. Phase 2: Map
          4. Phase 2: Reduce
        2. Merging for near-term search
          1. Sharding
          2. Search results
        3. Archiving for analysis
    5. Cascading
      1. Fields, Tuples, and Pipes
      2. Operations
      3. Taps, Schemes, and Flows
      4. Cascading in Practice
      5. Flexibility
      6. Hadoop and Cascading at ShareThis
      7. Summary
    6. TeraByte Sort on Apache Hadoop
    7. Using Pig and Wukong to Explore Billion-edge Network Graphs
      1. Measuring Community
      2. Everybody’s Talkin’ at Me: The Twitter Reply Graph
        1. Edge pairs versus adjacency list
        2. Degree
      3. Symmetric Links
      4. Community Extraction
        1. Get neighbors
        2. Community metrics and the 1 million × 1 million problem
        3. Local properties at global scale
  21. A. Installing Apache Hadoop
    1. Prerequisites
    2. Installation
    3. Configuration
      1. Standalone Mode
      2. Pseudodistributed Mode
        1. Configuring SSH
        2. Formatting the HDFS filesystem
        3. Starting and stopping the daemons (MapReduce 1)
        4. Starting and stopping the daemons (MapReduce 2)
      3. Fully Distributed Mode
  22. B. Cloudera’s Distribution Including Apache Hadoop
  23. C. Preparing the NCDC Weather Data
  24. Index
  25. About the Author
  26. Colophon
  27. Copyright

Product information

  • Title: Hadoop: The Definitive Guide, 3rd Edition
  • Author(s): Tom White
  • Release date: May 2012
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781449311520