The Web is getting faster, and the data it delivers is getting bigger. How can you handle everything efficiently? This book introduces Spark, an open source cluster computing system that makes data analytics fast to run and fast to write. You’ll learn how to run programs faster, using primitives for in-memory cluster computing. With Spark, your job can load data into memory and query it repeatedly much quicker than with disk-based systems like Hadoop MapReduce.
Written by the developers of Spark, this book will have you up and running in no time. You’ll learn how to express MapReduce jobs with just a few simple lines of Spark code, instead of spending extra time and effort working with Hadoop’s raw Java API.
Quickly dive into Spark capabilities such as collect, count, reduce, and save
Use one programming paradigm instead of mixing and matching tools such as Hive, Hadoop, Mahout, and S4/Storm
Learn how to run interactive, iterative, and incremental analyses
Integrate with Scala to manipulate distributed datasets like local collections
Tackle partitioning issues, data locality, default hash partitioning, user-defined partitioners, and custom serialization
Use other languages by means of pipe() to achieve the equivalent of Hadoop streaming
Holden Karau, Andy Kowinski, Matei Zaharia
Safari Books Online
Early Release Ebook
February 2015 (est.)
| ISBN 10:
Rough Cut ISBN:
| ISBN 10:
Early Release Ebook ISBN:
| ISBN 10:
Holden Karau is a software development engineer at Databricks and is active in open source. She is the author of an earlier Spark book. Prior to Databricks she worked on a variety of search and classification problems at Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelors of Mathematics in Computer Science. Outside of software she enjoys paying with fire, welding, and hula hooping.
Most recently, Andy Konwinski co-founded Databricks. Before that he was a PhD student and then postdoc in the AMPLab at UC Berkeley, focused on large scale distributed computing and cluster scheduling. He co-created and is a committer on the Apache Mesos project. He also worked with systems engineers and researchers at Google on the design of Omega, their next generation cluster scheduling system. More recently, he developed and led the AMP Camp Big Data Bootcamps and first Spark Summit, and has been contributing to the Spark project.
Matei Zaharia is a PhD student in the AMP Lab at UC Berkeley, working on topics in computer systems, cloud computing and big data. He is also a committer on Apache Hadoop and Apache Mesos. At Berkeley, he leads the development of the Spark cluster computing framework, and has also worked on projects including Mesos, the Hadoop Fair Scheduler, Hadoop's straggler detection algorithm, Shark, and multi-resource sharing. Matei got his undergraduate degree at the University of Waterloo in Canada.