Spark Cookbook

Book description

Over 60 recipes on Spark, covering Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX libraries

In Detail

By keeping intermediate data in memory, Apache Spark avoids writing it to the filesystem between processing steps, which can increase processing speed by up to 100 times.

This book focuses on how to analyze large and complex data sets. You start by installing and configuring Apache Spark with various cluster managers, then set up development environments. Next come recipes for running interactive queries with Spark SQL and for real-time streaming from sources such as Twitter and Apache Kafka. You then turn to machine learning, including supervised learning, unsupervised learning, and recommendation engine algorithms. After graph processing with GraphX, the book closes with recipes for cluster optimization and troubleshooting.

What You Will Learn

  • Install and configure Apache Spark with various cluster managers
  • Set up development environments
  • Perform interactive queries using Spark SQL
  • Get to grips with real-time streaming analytics using Spark Streaming
  • Master supervised learning and unsupervised learning using MLlib
  • Build a recommendation engine using MLlib
  • Develop common types of applications and projects, along with solutions to complex big data problems
  • Use Apache Spark as your single big data compute platform and master its libraries

Table of contents

  1. Spark Cookbook
    1. Table of Contents
    2. Spark Cookbook
    3. Credits
    4. About the Author
    5. About the Reviewers
    6. www.PacktPub.com
      1. Support files, eBooks, discount offers, and more
        1. Why Subscribe?
        2. Free Access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Sections
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      5. Conventions
      6. Reader feedback
      7. Customer support
        1. Downloading the color images of this book
        2. Errata
        3. Piracy
        4. Questions
    8. 1. Getting Started with Apache Spark
      1. Introduction
      2. Installing Spark from binaries
        1. Getting ready
        2. How to do it...
      3. Building the Spark source code with Maven
        1. Getting ready
        2. How to do it...
      4. Launching Spark on Amazon EC2
        1. Getting ready
        2. How to do it...
          1. See also
      5. Deploying on a cluster in standalone mode
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. See also
      6. Deploying on a cluster with Mesos
        1. How to do it...
      7. Deploying on a cluster with YARN
        1. Getting ready
        2. How to do it...
        3. How it works…
      8. Using Tachyon as an off-heap storage layer
        1. How to do it...
        2. See also
    9. 2. Developing Applications with Spark
      1. Introduction
      2. Exploring the Spark shell
        1. How to do it...
      3. Developing Spark applications in Eclipse with Maven
        1. Getting ready
        2. How to do it...
      4. Developing Spark applications in Eclipse with SBT
        1. How to do it...
      5. Developing a Spark application in IntelliJ IDEA with Maven
        1. How to do it...
      6. Developing a Spark application in IntelliJ IDEA with SBT
        1. How to do it...
    10. 3. External Data Sources
      1. Introduction
      2. Loading data from the local filesystem
        1. How to do it...
      3. Loading data from HDFS
        1. How to do it...
        2. There's more…
      4. Loading data from HDFS using a custom InputFormat
        1. How to do it...
      5. Loading data from Amazon S3
        1. How to do it...
      6. Loading data from Apache Cassandra
        1. How to do it...
        2. There's more...
          1. Merge strategies in sbt-assembly
      7. Loading data from relational databases
        1. Getting ready
        2. How to do it...
        3. How it works…
    11. 4. Spark SQL
      1. Introduction
      2. Understanding the Catalyst optimizer
        1. How it works…
          1. Analysis
          2. Logical plan optimization
          3. Physical planning
          4. Code generation
      3. Creating HiveContext
        1. Getting ready
        2. How to do it...
      4. Inferring schema using case classes
        1. How to do it...
      5. Programmatically specifying the schema
        1. How to do it...
        2. How it works…
      6. Loading and saving data using the Parquet format
        1. How to do it...
        2. How it works…
        3. There's more…
      7. Loading and saving data using the JSON format
        1. How to do it...
        2. How it works…
        3. There's more…
      8. Loading and saving data from relational databases
        1. Getting ready
        2. How to do it...
      9. Loading and saving data from an arbitrary source
        1. How to do it...
        2. There's more…
    12. 5. Spark Streaming
      1. Introduction
      2. Word count using Streaming
        1. How to do it...
      3. Streaming Twitter data
        1. How to do it...
      4. Streaming using Kafka
        1. Getting ready
        2. How to do it...
        3. There's more…
    13. 6. Getting Started with Machine Learning Using MLlib
      1. Introduction
      2. Creating vectors
        1. How to do it…
        2. How it works...
      3. Creating a labeled point
        1. How to do it…
      4. Creating matrices
        1. How to do it…
      5. Calculating summary statistics
        1. How to do it…
      6. Calculating correlation
        1. Getting ready
        2. How to do it…
      7. Doing hypothesis testing
        1. How to do it…
      8. Creating machine learning pipelines using ML
        1. Getting ready
        2. How to do it…
    14. 7. Supervised Learning with MLlib – Regression
      1. Introduction
      2. Using linear regression
        1. Getting ready
        2. How to do it…
      3. Understanding cost function
      4. Doing linear regression with lasso
        1. How to do it…
      5. Doing ridge regression
        1. How to do it…
    15. 8. Supervised Learning with MLlib – Classification
      1. Introduction
      2. Doing classification using logistic regression
        1. Getting ready
        2. How to do it…
      3. Doing binary classification using SVM
        1. How to do it…
      4. Doing classification using decision trees
        1. Getting ready
        2. How to do it…
        3. How it works…
      5. Doing classification using Random Forests
        1. Getting ready
        2. How to do it…
        3. How it works…
      6. Doing classification using Gradient Boosted Trees
        1. Getting ready
        2. How to do it…
      7. Doing classification with Naïve Bayes
        1. Getting ready
        2. How to do it…
    16. 9. Unsupervised Learning with MLlib
      1. Introduction
      2. Clustering using k-means
        1. Getting ready
        2. How to do it…
      3. Dimensionality reduction with principal component analysis
        1. Getting ready
        2. How to do it…
      4. Dimensionality reduction with singular value decomposition
        1. Getting ready
        2. How to do it…
    17. 10. Recommender Systems
      1. Introduction
      2. Collaborative filtering using explicit feedback
        1. Getting ready
        2. How to do it…
      3. Collaborative filtering using implicit feedback
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
    18. 11. Graph Processing Using GraphX
      1. Introduction
      2. Fundamental operations on graphs
        1. Getting ready
        2. How to do it…
      3. Using PageRank
        1. Getting ready
        2. How to do it…
      4. Finding connected components
        1. Getting ready
        2. How to do it…
      5. Performing neighborhood aggregation
        1. Getting ready
        2. How to do it…
    19. 12. Optimizations and Performance Tuning
      1. Introduction
      2. Optimizing memory
      3. Using compression to improve performance
      4. Using serialization to improve performance
        1. How to do it…
      5. Optimizing garbage collection
        1. How to do it…
      6. Optimizing the level of parallelism
        1. How to do it…
      7. Understanding the future of optimization – project Tungsten
        1. Manual memory management by leveraging application semantics
        2. Using algorithms and data structures
          1. Code generation
    20. Index

Product information

  • Title: Spark Cookbook
  • Author(s): Rishi Yadav
  • Release date: July 2015
  • Publisher(s): Packt Publishing
  • ISBN: 9781783987061