Mastering Hadoop

Book description

Go beyond the basics and master the next generation of Hadoop data processing platforms

In Detail

Hadoop is synonymous with Big Data processing. Its simple programming model, "code once and deploy at any scale" paradigm, and ever-growing ecosystem make Hadoop an all-encompassing platform for programmers with different levels of expertise.

This book explores industry guidelines for optimizing MapReduce jobs and higher-level abstractions such as Pig and Hive in Hadoop 2.0. It then dives deep into Hadoop 2.0-specific features such as YARN and HDFS Federation.

This book is a step-by-step guide that focuses on advanced Hadoop concepts and aims to take your Hadoop knowledge and skill set to the next level. The data processing flow dictates the order of the concepts in each chapter, and each chapter is illustrated with code fragments or schematic diagrams.

What You Will Learn

  • Understand the changes involved in moving from Hadoop 1.0 to Hadoop 2.0
  • Customize and optimize MapReduce jobs in Hadoop 2.0
  • Explore Hadoop I/O and different data formats
  • Dive into YARN and Storm and use YARN to integrate Storm with Hadoop
  • Deploy Hadoop on Amazon Elastic MapReduce
  • Discover HDFS replacements and learn about HDFS Federation
  • Get to grips with Hadoop's main security aspects
  • Utilize Mahout and RHadoop for Hadoop analytics

Table of contents

  1. Mastering Hadoop
    1. Table of Contents
    2. Mastering Hadoop
    3. Credits
    4. About the Author
    5. Acknowledgments
    6. About the Reviewers
    7. www.PacktPub.com
      1. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
    8. Preface
      1. What this book covers
      2. What you need for this book?
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    9. 1. Hadoop 2.X
      1. The inception of Hadoop
      2. The evolution of Hadoop
        1. Hadoop's genealogy
          1. Hadoop-0.20-append
          2. Hadoop-0.20-security
          3. Hadoop's timeline
      3. Hadoop 2.X
        1. Yet Another Resource Negotiator (YARN)
          1. Architecture overview
        2. Storage layer enhancements
          1. High availability
          2. HDFS Federation
          3. HDFS snapshots
          4. Other enhancements
        3. Support enhancements
      4. Hadoop distributions
        1. Which Hadoop distribution?
          1. Performance
          2. Scalability
          3. Reliability
          4. Manageability
        2. Available distributions
          1. Cloudera Distribution of Hadoop (CDH)
          2. Hortonworks Data Platform (HDP)
          3. MapR
          4. Pivotal HD
      5. Summary
    10. 2. Advanced MapReduce
      1. MapReduce input
        1. The InputFormat class
        2. The InputSplit class
      2. The RecordReader class
      3. Hadoop's "small files" problem
      4. Filtering inputs
      5. The Map task
        1. The dfs.blocksize attribute
        2. Sort and spill of intermediate outputs
        3. Node-local Reducers or Combiners
        4. Fetching intermediate outputs – Map-side
      6. The Reduce task
        1. Fetching intermediate outputs – Reduce-side
        2. Merge and spill of intermediate outputs
      7. MapReduce output
        1. Speculative execution of tasks
      8. MapReduce job counters
      9. Handling data joins
        1. Reduce-side joins
        2. Map-side joins
      10. Summary
    11. 3. Advanced Pig
      1. Pig versus SQL
      2. Different modes of execution
      3. Complex data types in Pig
      4. Compiling Pig scripts
        1. The logical plan
        2. The physical plan
        3. The MapReduce plan
      5. Development and debugging aids
        1. The DESCRIBE command
        2. The EXPLAIN command
        3. The ILLUSTRATE command
      6. The advanced Pig operators
        1. The advanced FOREACH operator
          1. The FLATTEN operator
          2. The nested FOREACH operator
          3. The COGROUP operator
          4. The UNION operator
          5. The CROSS operator
        2. Specialized joins in Pig
          1. The Replicated join
          2. Skewed joins
          3. The Merge join
      7. User-defined functions
        1. The evaluation functions
          1. The aggregate functions
            1. The Algebraic interface
            2. The Accumulator interface
          2. The filter functions
        2. The load functions
        3. The store functions
      8. Pig performance optimizations
        1. The optimization rules
        2. Measurement of Pig script performance
        3. Combiners in Pig
        4. Memory for the Bag data type
        5. Number of reducers in Pig
        6. The multiquery mode in Pig
      9. Best practices
        1. The explicit usage of types
        2. Early and frequent projection
        3. Early and frequent filtering
        4. The usage of the LIMIT operator
        5. The usage of the DISTINCT operator
        6. The reduction of operations
        7. The usage of Algebraic UDFs
        8. The usage of Accumulator UDFs
        9. Eliminating nulls in the data
        10. The usage of specialized joins
        11. Compressing intermediate results
        12. Combining smaller files
      10. Summary
    12. 4. Advanced Hive
      1. The Hive architecture
        1. The Hive metastore
        2. The Hive compiler
        3. The Hive execution engine
        4. The supporting components of Hive
      2. Data types
      3. File formats
        1. Compressed files
        2. ORC files
        3. The Parquet files
      4. The data model
        1. Dynamic partitions
          1. Semantics for dynamic partitioning
        2. Indexes on Hive tables
      5. Hive query optimizers
      6. Advanced DML
        1. The GROUP BY operation
        2. ORDER BY versus SORT BY clauses
        3. The JOIN operator and its types
          1. Map-side joins
        4. Advanced aggregation support
        5. Other advanced clauses
      7. UDF, UDAF, and UDTF
      8. Summary
    13. 5. Serialization and Hadoop I/O
      1. Data serialization in Hadoop
        1. Writable and WritableComparable
        2. Hadoop versus Java serialization
      2. Avro serialization
        1. Avro and MapReduce
        2. Avro and Pig
        3. Avro and Hive
        4. Comparison – Avro versus Protocol Buffers / Thrift
      3. File formats
        1. The Sequence file format
          1. Reading and writing Sequence files
        2. The MapFile format
        3. Other data structures
      4. Compression
        1. Splits and compressions
        2. Scope for compression
      5. Summary
    14. 6. YARN – Bringing Other Paradigms to Hadoop
      1. The YARN architecture
        1. Resource Manager (RM)
        2. Application Master (AM)
        3. Node Manager (NM)
        4. YARN clients
      2. Developing YARN applications
        1. Writing YARN clients
        2. Writing the Application Master entity
      3. Monitoring YARN
      4. Job scheduling in YARN
        1. CapacityScheduler
        2. FairScheduler
      5. YARN commands
        1. User commands
        2. Administration commands
      6. Summary
    15. 7. Storm on YARN – Low Latency Processing in Hadoop
      1. Batch processing versus streaming
      2. Apache Storm
        1. Architecture of an Apache Storm cluster
        2. Computation and data modeling in Apache Storm
        3. Use cases for Apache Storm
        4. Developing with Apache Storm
        5. Apache Storm 0.9.1
      3. Storm on YARN
        1. Installing Apache Storm-on-YARN
          1. Prerequisites
        2. Installation procedure
      4. Summary
    16. 8. Hadoop on the Cloud
      1. Cloud computing characteristics
      2. Hadoop on the cloud
      3. Amazon Elastic MapReduce (EMR)
        1. Provisioning a Hadoop cluster on EMR
      4. Summary
    17. 9. HDFS Replacements
      1. HDFS – advantages and drawbacks
      2. Amazon AWS S3
        1. Hadoop support for S3
      3. Implementing a filesystem in Hadoop
      4. Implementing an S3 native filesystem in Hadoop
      5. Summary
    18. 10. HDFS Federation
      1. Limitations of the older HDFS architecture
      2. Architecture of HDFS Federation
        1. Benefits of HDFS Federation
        2. Deploying federated NameNodes
      3. HDFS high availability
        1. Secondary NameNode, Checkpoint Node, and Backup Node
        2. High availability – edits sharing
        3. Useful HDFS tools
        4. Three-layer versus four-layer network topology
      4. HDFS block placement
        1. Pluggable block placement policy
      5. Summary
    19. 11. Hadoop Security
      1. The security pillars
      2. Authentication in Hadoop
        1. Kerberos authentication
        2. The Kerberos architecture and workflow
        3. Kerberos authentication and Hadoop
        4. Authentication via HTTP interfaces
      3. Authorization in Hadoop
        1. Authorization in HDFS
          1. Identity of an HDFS user
          2. Group listings for an HDFS user
          3. HDFS APIs and shell commands
          4. Specifying the HDFS superuser
          5. Turning off HDFS authorization
        2. Limiting HDFS usage
          1. Name quotas in HDFS
          2. Space quotas in HDFS
        3. Service-level authorization in Hadoop
      4. Data confidentiality in Hadoop
        1. HTTPS and encrypted shuffle
          1. SSL configuration changes
          2. Configuring the keystore and truststore
      5. Audit logging in Hadoop
      6. Summary
    20. 12. Analytics Using Hadoop
      1. Data analytics workflow
      2. Machine learning
      3. Apache Mahout
      4. Document analysis using Hadoop and Mahout
        1. Term frequency
        2. Document frequency
        3. Term frequency – inverse document frequency
        4. Tf-Idf in Pig
        5. Cosine similarity distance measures
        6. Clustering using k-means
        7. K-means clustering using Apache Mahout
      5. RHadoop
      6. Summary
    21. A. Hadoop for Microsoft Windows
      1. Deploying Hadoop on Microsoft Windows
        1. Prerequisites
        2. Building Hadoop
        3. Configuring Hadoop
        4. Deploying Hadoop
      2. Summary
    22. Index

Product information

  • Title: Mastering Hadoop
  • Author(s): Sandeep Karanth
  • Release date: December 2014
  • Publisher(s): Packt Publishing
  • ISBN: 9781783983643