High Performance Spark

Book description

Apache Spark is amazing when everything clicks. But if you haven’t seen the performance improvements you expected, or still don’t feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes while using fewer resources.

Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you’ll also learn how to make it sing.

With this book, you’ll explore:

  • How Spark SQL’s new interfaces improve performance over Spark’s RDD data structure (see the sketch after this list)
  • The choice between data joins in Core Spark and Spark SQL
  • Techniques for getting the most out of standard RDD transformations
  • How to work around performance issues in Spark’s key/value pair paradigm
  • Writing high-performance Spark code without Scala or the JVM
  • How to test for functionality and performance when applying suggested improvements
  • Using Spark MLlib and Spark ML machine learning libraries
  • Spark’s Streaming components and external community packages
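
To make the first bullet concrete, here is a minimal sketch (not from the book) that computes an average per key twice: once with the RDD API, where the aggregation logic is opaque to Spark, and once with the DataFrame API, where the Catalyst optimizer and the Tungsten execution engine can see and optimize the whole plan. The SparkSession setup and the purchases data are hypothetical stand-ins for illustration.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.avg

    object RddVsDataFrame {
      def main(args: Array[String]): Unit = {
        // Hypothetical local session for illustration only.
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("rdd-vs-dataframe")
          .getOrCreate()
        import spark.implicits._

        // Hypothetical (user, purchase amount) records.
        val purchases = Seq(("alice", 10.0), ("bob", 20.0), ("alice", 30.0))

        // RDD version: we hand-write the sum/count bookkeeping, and Spark
        // executes it as written; there is no query plan to optimize.
        val rddAvg = spark.sparkContext.parallelize(purchases)
          .mapValues(amount => (amount, 1L))
          .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
          .mapValues { case (sum, count) => sum / count }

        // DataFrame version: the same aggregation expressed relationally,
        // so Spark SQL can plan, code-generate, and run it off-heap.
        val dfAvg = purchases.toDF("user", "amount")
          .groupBy("user")
          .agg(avg("amount"))

        rddAvg.collect().foreach(println)
        dfAvg.show()
        spark.stop()
      }
    }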

Table of contents

  1. Preface
    1. First Edition Notes
    2. Supporting Books and Materials
    3. Conventions Used in This Book
    4. Using Code Examples
    5. O’Reilly Safari
    6. How to Contact the Authors
    7. How to Contact Us
    8. Acknowledgments
  2. 1. Introduction to High Performance Spark
    1. What Is Spark and Why Performance Matters
    2. What You Can Expect to Get from This Book
    3. Spark Versions
    4. Why Scala?
      1. To Be a Spark Expert You Have to Learn a Little Scala Anyway
      2. The Spark Scala API Is Easier to Use Than the Java API
      3. Scala Is More Performant Than Python
      4. Why Not Scala?
      5. Learning Scala
    5. Conclusion
  3. 2. How Spark Works
    1. How Spark Fits into the Big Data Ecosystem
      1. Spark Components
    2. Spark Model of Parallel Computing: RDDs
      1. Lazy Evaluation
      2. In-Memory Persistence and Memory Management
      3. Immutability and the RDD Interface
      4. Types of RDDs
      5. Functions on RDDs: Transformations Versus Actions
      6. Wide Versus Narrow Dependencies
    3. Spark Job Scheduling
      1. Resource Allocation Across Applications
      2. The Spark Application
    4. The Anatomy of a Spark Job
      1. The DAG
      2. Jobs
      3. Stages
      4. Tasks
    5. Conclusion
  4. 3. DataFrames, Datasets, and Spark SQL
    1. Getting Started with the SparkSession (or HiveContext or SQLContext)
    2. Spark SQL Dependencies
      1. Managing Spark Dependencies
      2. Avoiding Hive JARs
    3. Basics of Schemas
    4. DataFrame API
      1. Transformations
      2. Multi-DataFrame Transformations
      3. Plain Old SQL Queries and Interacting with Hive Data
    5. Data Representation in DataFrames and Datasets
      1. Tungsten
    6. Data Loading and Saving Functions
      1. DataFrameWriter and DataFrameReader
      2. Formats
      3. Save Modes
      4. Partitions (Discovery and Writing)
    7. Datasets
      1. Interoperability with RDDs, DataFrames, and Local Collections
      2. Compile-Time Strong Typing
      3. Easier Functional (RDD-like) Transformations
      4. Relational Transformations
      5. Multi-Dataset Relational Transformations
      6. Grouped Operations on Datasets
    8. Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
    9. Query Optimizer
      1. Logical and Physical Plans
      2. Code Generation
      3. Large Query Plans and Iterative Algorithms
    10. Debugging Spark SQL Queries
    11. JDBC/ODBC Server
    12. Conclusion
  5. 4. Joins (SQL and Core)
    1. Core Spark Joins
      1. Choosing a Join Type
      2. Choosing an Execution Plan
    2. Spark SQL Joins
      1. DataFrame Joins
      2. Dataset Joins
    3. Conclusion
  6. 5. Effective Transformations
    1. Narrow Versus Wide Transformations
      1. Implications for Performance
      2. Implications for Fault Tolerance
      3. The Special Case of coalesce
    2. What Type of RDD Does Your Transformation Return?
    3. Minimizing Object Creation
      1. Reusing Existing Objects
      2. Using Smaller Data Structures
    4. Iterator-to-Iterator Transformations with mapPartitions
      1. What Is an Iterator-to-Iterator Transformation?
      2. Space and Time Advantages
      3. An Example
    5. Set Operations
    6. Reducing Setup Overhead
      1. Shared Variables
      2. Broadcast Variables
      3. Accumulators
    7. Reusing RDDs
      1. Cases for Reuse
      2. Deciding if Recompute Is Inexpensive Enough
      3. Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files
      4. Alluxio (née Tachyon)
      5. LRU Caching
      6. Noisy Cluster Considerations
      7. Interaction with Accumulators
    8. Conclusion
  7. 6. Working with Key/Value Data
    1. The Goldilocks Example
      1. Goldilocks Version 0: Iterative Solution
      2. How to Use PairRDDFunctions and OrderedRDDFunctions
    2. Actions on Key/Value Pairs
    3. What’s So Dangerous About the groupByKey Function
      1. Goldilocks Version 1: groupByKey Solution
    4. Choosing an Aggregation Operation
      1. Dictionary of Aggregation Operations with Performance Considerations
    5. Multiple RDD Operations
      1. Co-Grouping
    6. Partitioners and Key/Value Data
      1. Using the Spark Partitioner Object
      2. Hash Partitioning
      3. Range Partitioning
      4. Custom Partitioning
      5. Preserving Partitioning Information Across Transformations
      6. Leveraging Co-Located and Co-Partitioned RDDs
      7. Dictionary of Mapping and Partitioning Functions in PairRDDFunctions
    7. Dictionary of OrderedRDDFunctions
      1. Sorting by Two Keys with sortByKey
    8. Secondary Sort and repartitionAndSortWithinPartitions
      1. Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function
      2. How Not to Sort by Two Orderings
      3. Goldilocks Version 2: Secondary Sort
      4. A Different Approach to Goldilocks
      5. Goldilocks Version 3: Sort on Cell Values
    9. Straggler Detection and Unbalanced Data
      1. Back to Goldilocks (Again)
      2. Goldilocks Version 4: Reduce to Distinct on Each Partition
    10. Conclusion
  8. 7. Going Beyond Scala
    1. Beyond Scala within the JVM
    2. Beyond Scala, and Beyond the JVM
      1. How PySpark Works
      2. How SparkR Works
      3. Spark.jl (Julia Spark)
      4. How EclairJS Works
      5. Spark on the Common Language Runtime (CLR)—C# and Friends
    3. Calling Other Languages from Spark
      1. Using Pipe and Friends
      2. JNI
      3. Java Native Access (JNA)
      4. Underneath Everything Is FORTRAN
      5. Getting to the GPU
    4. The Future
    5. Conclusion
  9. 8. Testing and Validation
    1. Unit Testing
      1. General Spark Unit Testing
      2. Mocking RDDs
    2. Getting Test Data
      1. Generating Large Datasets
      2. Sampling
    3. Property Checking with ScalaCheck
      1. Computing RDD Difference
    4. Integration Testing
      1. Choosing Your Integration Testing Environment
    5. Verifying Performance
      1. Spark Counters for Verifying Performance
      2. Projects for Verifying Performance
    6. Job Validation
    7. Conclusion
  10. 9. Spark MLlib and ML
    1. Choosing Between Spark MLlib and Spark ML
    2. Working with MLlib
      1. Getting Started with MLlib (Organization and Imports)
      2. MLlib Feature Encoding and Data Preparation
      3. Feature Scaling and Selection
      4. MLlib Model Training
      5. Predicting
      6. Serving and Persistence
      7. Model Evaluation
    3. Working with Spark ML
      1. Spark ML Organization and Imports
      2. Pipeline Stages
      3. Explain Params
      4. Data Encoding
      5. Data Cleaning
      6. Spark ML Models
      7. Putting It All Together in a Pipeline
      8. Training a Pipeline
      9. Accessing Individual Stages
      10. Data Persistence and Spark ML
      11. Extending Spark ML Pipelines with Your Own Algorithms
      12. Model and Pipeline Persistence and Serving with Spark ML
    4. General Serving Considerations
    5. Conclusion
  11. 10. Spark Components and Packages
    1. Stream Processing with Spark
      1. Sources and Sinks
      2. Batch Intervals
      3. Data Checkpoint Intervals
      4. Considerations for DStreams
      5. Considerations for Structured Streaming
      6. High Availability Mode (or Handling Driver Failure or Checkpointing)
    2. GraphX
    3. Using Community Packages and Libraries
      1. Creating a Spark Package
    4. Conclusion
  12. A. Tuning, Debugging, and Other Things Developers Like to Pretend Don’t Exist
    1. Spark Tuning and Cluster Sizing
      1. How to Adjust Spark Settings
      2. How to Determine the Relevant Information About Your Cluster
    2. Basic Spark Core Settings: How Many Resources to Allocate to the Spark Application?
      1. Calculating Executor and Driver Memory Overhead
      2. How Large to Make the Spark Driver
      3. A Few Large Executors or Many Small Executors?
      4. Allocating Cluster Resources and Dynamic Allocation
      5. Dividing the Space Within One Executor
      6. Number and Size of Partitions
    3. Serialization Options
      1. Kryo
    4. Some Additional Debugging Techniques
  13. Index

Product information

  • Title: High Performance Spark
  • Author(s): Holden Karau, Rachel Warren
  • Release date: May 2017
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491943151