Book description
Unlock the power of your data with the Hadoop 2.X ecosystem and its data warehousing techniques for large data sets
About This Book
- Conquer the mountain of data using Hadoop 2.X tools
- Gain a clear context for Hadoop and its ecosystem
- Hands-on examples and recipes that give you the bigger picture and help you master Hadoop 2.X data processing platforms
- Overcome challenging data processing problems with this exhaustive Hadoop 2.X course
Who This Book Is For
This course is for Java developers who know some scripting and want to make a career shift into the Hadoop and Big Data segment of the IT industry. Whether you are a Hadoop novice or an expert, this course will take you to the most advanced levels of Hadoop 2.X.
What You Will Learn
- Best practices for setup and configuration of Hadoop clusters, tailoring the system to the problem at hand
- Integration with relational databases, using Hive for SQL queries and Sqoop for data transfer
- Installing and maintaining a Hadoop 2.X cluster and its ecosystem
- Advanced data analysis using Hive, Pig, and MapReduce programs
- Machine learning principles with libraries such as Mahout, and batch and stream data processing using Apache Spark
- Understand the changes involved in moving from Hadoop 1.0 to Hadoop 2.0
- Dive into YARN and Storm and use YARN to integrate Storm with Hadoop
- Deploy Hadoop on Amazon Elastic MapReduce, discover HDFS replacements, and learn about HDFS Federation
In Detail
Marc Andreessen famously said that "software is eating the world"; today, in the age of Big Data, it is data that is eating the world. Businesses produce data in huge volumes every day, and this rising tide of data needs to be organized and analyzed securely. With proper and effective use of Hadoop, you can build new, improved models, and on that basis make the right decisions.
The first module, Hadoop Beginner's Guide, walks you through understanding and using Hadoop with very detailed instructions. Commands are explained in sections called "What just happened?" for greater clarity and understanding.
The second module, Hadoop Real-World Solutions Cookbook, Second Edition, is an essential tutorial for effectively implementing a big data warehouse in your business, with detailed practice in the latest technologies such as YARN and Spark.
Big data has become a key basis of competition and of new waves of productivity growth. Once you are familiar with the basics and have implemented end-to-end big data use cases, you will move on to the third module, Mastering Hadoop.
If you need to take your Hadoop skill set to the next level after nailing the basic and advanced concepts, this course is indispensable. When you finish it, you will be able to tackle real-world scenarios and become a big data expert using the tools and knowledge gained from its step-by-step tutorials and recipes.
Style and approach
This course covers everything from the basic concepts of Hadoop to the advanced mechanisms you must master to become a big data expert. The goal is to help you learn the essentials through step-by-step tutorials, then move on to recipes offering real-world solutions. It covers all the important aspects of Hadoop, from system design and configuration to machine learning principles with various libraries, with chapters illustrated by code fragments and schematic diagrams. It is a compendious course that explores Hadoop from the basics to the most advanced techniques available in Hadoop 2.X.
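For a flavour of those code fragments, here is a minimal WordCount sketch — the "Hello World of MapReduce" that the first module walks through — written against the classic org.apache.hadoop.mapreduce Java API. This is an illustrative sketch rather than the book's exact listing; the class names are ours.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative sketch of the classic WordCount job (Hadoop 2.X MapReduce API).
public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the partial counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The reducer doubles as a combiner because summation is associative;
    // the first module's combiner sections explain when this is safe.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```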
Table of contents
- Hadoop: Data Processing and Modelling
- Table of Contents
- Hadoop: Data Processing and Modelling
- Credits
- Preface
- 1. Module 1
- 1. What It's All About
- 2. Getting Hadoop Up and Running
- Hadoop on a local Ubuntu host
- Time for action – checking the prerequisites
- Time for action – downloading Hadoop
- Time for action – setting up SSH
- Time for action – using Hadoop to calculate Pi
- Time for action – configuring the pseudo-distributed mode
- Time for action – changing the base HDFS directory
- Time for action – formatting the NameNode
- Time for action – starting Hadoop
- Time for action – using HDFS
- Time for action – WordCount, the Hello World of MapReduce
- Using Elastic MapReduce
- Time for action – WordCount on EMR using the management console
- Comparison of local versus EMR Hadoop
- Summary
- 3. Understanding MapReduce
- Key/value pairs
- The Hadoop Java API for MapReduce
- Writing MapReduce programs
- Time for action – setting up the classpath
- Time for action – implementing WordCount
- Time for action – building a JAR file
- Time for action – running WordCount on a local Hadoop cluster
- Time for action – running WordCount on EMR
- Time for action – WordCount the easy way
- Walking through a run of WordCount
- Startup
- Splitting the input
- Task assignment
- Task startup
- Ongoing JobTracker monitoring
- Mapper input
- Mapper execution
- Mapper output and reduce input
- Partitioning
- The optional partition function
- Reducer input
- Reducer execution
- Reducer output
- Shutdown
- That's all there is to it!
- Apart from the combiner…maybe
- Time for action – WordCount with a combiner
- Time for action – fixing WordCount to work with a combiner
- Hadoop-specific data types
- Time for action – using the Writable wrapper classes
- Input/output
- Summary
- 4. Developing MapReduce Programs
- Using languages other than Java with Hadoop
- Time for action – implementing WordCount using Streaming
- Analyzing a large dataset
- Time for action – summarizing the UFO data
- Time for action – summarizing the shape data
- Time for action – correlating sighting duration to UFO shape
- Time for action – performing the shape/time analysis from the command line
- Time for action – using ChainMapper for field validation/analysis
- Time for action – using the Distributed Cache to improve location output
- Counters, status, and other output
- Time for action – creating counters, task states, and writing log output
- Summary
- 5. Advanced MapReduce Techniques
- Simple, advanced, and in-between
- Joins
- Time for action – reduce-side join using MultipleInputs
- Graph algorithms
- Time for action – representing the graph
- Time for action – creating the source code
- Time for action – the first run
- Time for action – the second run
- Time for action – the third run
- Time for action – the fourth and last run
- Using language-independent data structures
- Time for action – getting and installing Avro
- Time for action – defining the schema
- Time for action – creating the source Avro data with Ruby
- Time for action – consuming the Avro data with Java
- Time for action – generating shape summaries in MapReduce
- Time for action – examining the output data with Ruby
- Time for action – examining the output data with Java
- Summary
- 6. When Things Break
- Failure
- Time for action – killing a DataNode process
- Time for action – the replication factor in action
- Time for action – intentionally causing missing blocks
- Time for action – killing a TaskTracker process
- Time for action – killing the JobTracker
- Time for action – killing the NameNode process
- What just happened?
- Starting a replacement NameNode
- The role of the NameNode in more detail
- File systems, files, blocks, and nodes
- The single most important piece of data in the cluster – fsimage
- DataNode startup
- Safe mode
- SecondaryNameNode
- So what to do when the NameNode process has a critical failure?
- BackupNode/CheckpointNode and NameNode HA
- Hardware failure
- Host failure
- Host corruption
- The risk of correlated failures
- Task failure due to software
- What just happened?
- Time for action – causing task failure
- Time for action – handling dirty data by using skip mode
- Summary
- 7. Keeping Things Running
- A note on EMR
- Hadoop configuration properties
- Time for action – browsing default properties
- Setting up a cluster
- Time for action – examining the default rack configuration
- Time for action – adding a rack awareness script
- Cluster access control
- Time for action – demonstrating the default security
- Managing the NameNode
- Time for action – adding an additional fsimage location
- Time for action – swapping to a new NameNode host
- Managing HDFS
- MapReduce management
- Time for action – changing job priorities and killing a job
- Scaling
- Summary
- 8. A Relational View on Data with Hive
- Overview of Hive
- Setting up Hive
- Time for action – installing Hive
- Using Hive
- Time for action – creating a table for the UFO data
- Time for action – inserting the UFO data
- Time for action – validating the table
- Time for action – redefining the table with the correct column separator
- Time for action – creating a table from an existing file
- Time for action – performing a join
- Time for action – using views
- Time for action – exporting query output
- Time for action – making a partitioned UFO sighting table
- Time for action – adding a new User Defined Function (UDF)
- Hive on Amazon Web Services
- Time for action – running UFO analysis on EMR
- Summary
- 9. Working with Relational Databases
- Common data paths
- Setting up MySQL
- Time for action – installing and setting up MySQL
- Time for action – configuring MySQL to allow remote connections
- Time for action – setting up the employee database
- Getting data into Hadoop
- Time for action – downloading and configuring Sqoop
- Time for action – exporting data from MySQL to HDFS
- Time for action – exporting data from MySQL into Hive
- Time for action – a more selective import
- Time for action – using a type mapping
- Time for action – importing data from a raw query
- Getting data out of Hadoop
- Time for action – importing data from Hadoop into MySQL
- Time for action – importing Hive data into MySQL
- Time for action – fixing the mapping and re-running the export
- AWS considerations
- Summary
- 10. Data Collection with Flume
- A note about AWS
- Data data everywhere...
- Time for action – getting web server data into Hadoop
- Introducing Apache Flume
- Time for action – installing and configuring Flume
- Time for action – capturing network traffic in a log file
- Time for action – logging to the console
- Time for action – capturing the output of a command to a flat file
- Time for action – capturing a remote file in a local flat file
- Time for action – writing network traffic onto HDFS
- Time for action – adding timestamps
- Time for action – multi-level Flume networks
- Time for action – writing to multiple sinks
- The bigger picture
- Summary
- 11. Where to Go Next
- A. Pop Quiz Answers
- 2. Module 2
- 1. Getting Started with Hadoop 2.X
- Introduction
- Installing a single-node Hadoop cluster
- Installing a multi-node Hadoop cluster
- Adding new nodes to existing Hadoop clusters
- Executing the balancer command for uniform data distribution
- Entering and exiting from the safe mode in a Hadoop cluster
- Decommissioning DataNodes
- Performing benchmarking on a Hadoop cluster
- 2. Exploring HDFS
- Introduction
- Loading data from a local machine to HDFS
- Exporting HDFS data to a local machine
- Changing the replication factor of an existing file in HDFS
- Setting the HDFS block size for all the files in a cluster
- Setting the HDFS block size for a specific file in a cluster
- Enabling transparent encryption for HDFS
- Importing data from another Hadoop cluster
- Recycling deleted data from trash to HDFS
- Saving compressed data in HDFS
- 3. Mastering Map Reduce Programs
- Introduction
- Writing the Map Reduce program in Java to analyze web log data
- Executing the Map Reduce program in a Hadoop cluster
- Adding support for a new writable data type in Hadoop
- Implementing a user-defined counter in a Map Reduce program
- Map Reduce program to find the top X
- Map Reduce program to find distinct values
- Map Reduce program to partition data using a custom partitioner
- Writing Map Reduce results to multiple output files
- Performing Reduce side Joins using Map Reduce
- Unit testing the Map Reduce code using MRUnit
- 4. Data Analysis Using Hive, Pig, and HBase
- Introduction
- Storing and processing Hive data in a sequential file format
- Storing and processing Hive data in the RC file format
- Storing and processing Hive data in the ORC file format
- Storing and processing Hive data in the Parquet file format
- Performing FILTER By queries in Pig
- Performing Group By queries in Pig
- Performing Order By queries in Pig
- Performing JOINS in Pig
- Writing a user-defined function in Pig
- Analyzing web log data using Pig
- Performing HBase operations in the CLI
- Performing HBase operations in Java
- Executing MapReduce programming with an HBase table
- 5. Advanced Data Analysis Using Hive
- Introduction
- Processing JSON data in Hive using JSON SerDe
- Processing XML data in Hive using XML SerDe
- Processing Hive data in the Avro format
- Writing a user-defined function in Hive
- Performing table joins in Hive
- Executing map side joins in Hive
- Performing context Ngram in Hive
- Call Data Record Analytics using Hive
- Twitter sentiment analysis using Hive
- Implementing Change Data Capture using Hive
- Multiple table inserting using Hive
- 6. Data Import/Export Using Sqoop and Flume
- Introduction
- Importing data from RDBMS to HDFS using Sqoop
- Exporting data from HDFS to RDBMS
- Using query operator in Sqoop import
- Importing data using Sqoop in compressed format
- Performing Atomic export using Sqoop
- Importing data into Hive tables using Sqoop
- Importing data into HDFS from Mainframes
- Incremental import using Sqoop
- Creating and executing Sqoop job
- Importing data from RDBMS to HBase using Sqoop
- Importing Twitter data into HDFS using Flume
- Importing data from Kafka into HDFS using Flume
- Importing web logs data into HDFS using Flume
- 7. Automation of Hadoop Tasks Using Oozie
- Introduction
- Implementing a Sqoop action job using Oozie
- Implementing a Map Reduce action job using Oozie
- Implementing a Java action job using Oozie
- Implementing a Hive action job using Oozie
- Implementing a Pig action job using Oozie
- Implementing an e-mail action job using Oozie
- Executing parallel jobs using Oozie (fork)
- Scheduling a job in Oozie
- 8. Machine Learning and Predictive Analytics Using Mahout and R
- Introduction
- Setting up the Mahout development environment
- Creating an item-based recommendation engine using Mahout
- Creating a user-based recommendation engine using Mahout
- Using Predictive analytics on Bank Data using Mahout
- Clustering text data using K-Means
- Performing Population Data Analytics using R
- Performing Twitter Sentiment Analytics using R
- Performing Predictive Analytics using R
- 9. Integration with Apache Spark
- Introduction
- Running Spark standalone
- Running Spark on YARN
- Olympics Athletes analytics using the Spark Shell
- Creating Twitter trending topics using Spark Streaming
- Twitter trending topics using Spark Streaming
- Analyzing Parquet files using Spark
- Analyzing JSON data using Spark
- Processing graphs using GraphX
- Conducting predictive analytics using Spark MLlib
- 10. Hadoop Use Cases
- 3. Module 3
- 1. Hadoop 2.X
- 2. Advanced MapReduce
- 3. Advanced Pig
- Pig versus SQL
- Different modes of execution
- Complex data types in Pig
- Compiling Pig scripts
- Development and debugging aids
- The advanced Pig operators
- User-defined functions
- Pig performance optimizations
- Best practices
- The explicit usage of types
- Early and frequent projection
- Early and frequent filtering
- The usage of the LIMIT operator
- The usage of the DISTINCT operator
- The reduction of operations
- The usage of Algebraic UDFs
- The usage of Accumulator UDFs
- Eliminating nulls in the data
- The usage of specialized joins
- Compressing intermediate results
- Combining smaller files
- Summary
- 4. Advanced Hive
- 5. Serialization and Hadoop I/O
- 6. YARN – Bringing Other Paradigms to Hadoop
- 7. Storm on YARN – Low Latency Processing in Hadoop
- 8. Hadoop on the Cloud
- 9. HDFS Replacements
- 10. HDFS Federation
- 11. Hadoop Security
- 12. Analytics Using Hadoop
- 13. Hadoop for Microsoft Windows
- A. Bibliography
- Index
Product information
- Title: Hadoop: Data Processing and Modelling
- Author(s):
- Release date: August 2016
- Publisher(s): Packt Publishing
- ISBN: 9781787125162