The MapReduce algorithmic pattern may be Google's secret weapon for dealing with enormous quantities of data, but many programmers find it intimidating and obscure. In this video master class, data expert Pete Warden shows you how to build simple MapReduce jobs, using concrete use cases and descriptive examples to demystify the approach. All you need to get started is basic knowledge of Python and the Unix shell.
Warden demonstrates what happens when 500 million records are loaded into a database the traditional way: performance falls off dramatically once the working set is larger than memory. Discover how to solve the problem by introducing a sorting step—the method that lies at the heart of MapReduce.
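The principle behind that sorting step can be sketched in a few lines of Python (a simplified illustration with made-up data, not the course's actual code): instead of updating a store with random-access lookups, sort the records by key so every occurrence of a key arrives together, then total each key in one sequential pass.

```python
# Hypothetical records: (key, count) pairs, in arrival order.
records = [("apple", 1), ("pear", 1), ("apple", 1), ("fig", 1), ("pear", 1)]

# Sorting groups all occurrences of each key together...
records.sort(key=lambda kv: kv[0])

# ...so one sequential pass can total each key with no random lookups,
# which is the idea behind MapReduce's shuffle/sort phase.
totals = []
current_key, current_sum = None, 0
for key, value in records:
    if key != current_key:
        if current_key is not None:
            totals.append((current_key, current_sum))
        current_key, current_sum = key, 0
    current_sum += value
totals.append((current_key, current_sum))

print(totals)  # [('apple', 2), ('fig', 1), ('pear', 2)]
```

Because the pass is purely sequential, it works just as well when the sorted records stream from disk and the working set no longer fits in memory.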
In this video, you learn how to:
- Tackle a "Hello World" example for MapReduce. Count word frequencies in a large body of text, then split the script into separate map and reduce stages and run it on the command line.
- Run a job in Hadoop using Amazon’s Elastic MapReduce service. Set up a streaming job—upload scripts and data, debug run-time problems, and grab the results.
- Prepare for very large data sets. Redesign scripts to find the most frequent words in 17GB of Wikipedia data.
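The "Hello World" word count described above can be sketched as a single script with separate map and reduce stages (a hypothetical file name and simplified logic, not the course's code), with the Unix `sort` command standing in for Hadoop's shuffle phase when you run it on the command line:

```python
#!/usr/bin/env python
# wordcount.py -- sketch of a streaming-style word count, runnable as:
#
#   cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
import sys

def map_stage(lines):
    # Emit one "word<TAB>1" pair per word, lowercased.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()

def reduce_stage(lines):
    # Input arrives sorted by word, so equal words are adjacent and
    # can be totaled in a single sequential pass.
    current_word, count = None, 0
    for line in lines:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                yield "%s\t%d" % (current_word, count)
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        yield "%s\t%d" % (current_word, count)

if __name__ == "__main__" and len(sys.argv) > 1:
    stage = map_stage if sys.argv[1] == "map" else reduce_stage
    for out in stage(sys.stdin):
        print(out)
```

The same two stages are what a Hadoop Streaming job would run as its mapper and reducer commands, with the framework performing the sort between them.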