Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing

Table of contents

  1. Cover
  2. Title
  3. Copyright
  4. Dedication
  5. Contents at a Glance
  6. Contents
  7. About the Author
  8. About the Technical Reviewers
  9. Acknowledgments
  10. Introduction
  11. Chapter 1: Big Data Technology Landscape
    1. Hadoop
      1. HDFS (Hadoop Distributed File System)
      2. MapReduce
      3. Hive
    2. Data Serialization
      1. Avro
      2. Thrift
      3. Protocol Buffers
      4. SequenceFile
    3. Columnar Storage
      1. RCFile
      2. ORC
      3. Parquet
    4. Messaging Systems
      1. Kafka
      2. ZeroMQ
    5. NoSQL
      1. Cassandra
      2. HBase
    6. Distributed SQL Query Engine
      1. Impala
      2. Presto
      3. Apache Drill
    7. Summary
  12. Chapter 2: Programming in Scala
    1. Functional Programming (FP)
      1. Functions
      2. Immutable Data Structures
      3. Everything Is an Expression
    2. Scala Fundamentals
      1. Getting Started
      2. Basic Types
      3. Variables
      4. Functions
      5. Classes
      6. Singletons
      7. Case Classes
      8. Pattern Matching
      9. Operators
      10. Traits
      11. Tuples
      12. Option Type
      13. Collections
    3. A Standalone Scala Application
    4. Summary
  13. Chapter 3: Spark Core
    1. Overview
      1. Key Features
      2. Ideal Applications
    2. High-level Architecture
      1. Workers
      2. Cluster Managers
      3. Driver Programs
      4. Executors
      5. Tasks
    3. Application Execution
      1. Terminology
      2. How an Application Works
    4. Data Sources
    5. Application Programming Interface (API)
      1. SparkContext
      2. Resilient Distributed Datasets (RDD)
      3. Creating an RDD
      4. RDD Operations
      5. Saving an RDD
    6. Lazy Operations
      1. Action Triggers Computation
    7. Caching
      1. RDD Caching Methods
      2. RDD Caching Is Fault Tolerant
      3. Cache Memory Management
    8. Spark Jobs
    9. Shared Variables
      1. Broadcast Variables
      2. Accumulators
    10. Summary
  14. Chapter 4: Interactive Data Analysis with Spark Shell
    1. Getting Started
      1. Download
      2. Extract
      3. Run
    2. REPL Commands
    3. Using the Spark Shell as a Scala Shell
    4. Number Analysis
    5. Log Analysis
    6. Summary
  15. Chapter 5: Writing a Spark Application
    1. Hello World in Spark
    2. Compiling and Running the Application
      1. sbt (Simple Build Tool)
      2. Compiling the Code
      3. Running the Application
    3. Monitoring the Application
    4. Debugging the Application
    5. Summary
  16. Chapter 6: Spark Streaming
    1. Introducing Spark Streaming
      1. Spark Streaming Is a Spark Add-on
      2. High-Level Architecture
      3. Data Stream Sources
      4. Receiver
      5. Destinations
    2. Application Programming Interface (API)
      1. StreamingContext
      2. Basic Structure of a Spark Streaming Application
      3. Discretized Stream (DStream)
      4. Creating a DStream
      5. Processing a Data Stream
      6. Output Operations
      7. Window Operation
    3. A Complete Spark Streaming Application
    4. Summary
  17. Chapter 7: Spark SQL
    1. Introducing Spark SQL
      1. Integration with Other Spark Libraries
      2. Usability
      3. Data Sources
      4. Data Processing Interface
      5. Hive Interoperability
    2. Performance
      1. Reduced Disk I/O
      2. Partitioning
      3. Columnar Storage
      4. In-Memory Columnar Caching
      5. Skip Rows
      6. Predicate Pushdown
      7. Query Optimization
    3. Applications
      1. ETL (Extract Transform Load)
      2. Data Virtualization
      3. Distributed JDBC/ODBC SQL Query Engine
      4. Data Warehousing
    4. Application Programming Interface (API)
      1. Key Abstractions
      2. Creating DataFrames
      3. Processing Data Programmatically with SQL/HiveQL
      4. Processing Data with the DataFrame API
      5. Saving a DataFrame
    5. Built-in Functions
      1. Aggregate
      2. Collection
      3. Date/Time
      4. Math
      5. String
      6. Window
    6. UDFs and UDAFs
    7. Interactive Analysis Example
    8. Interactive Analysis with Spark SQL JDBC Server
    9. Summary
  18. Chapter 8: Machine Learning with Spark
    1. Introducing Machine Learning
      1. Features
      2. Labels
      3. Models
      4. Training Data
      5. Test Data
      6. Machine Learning Applications
      7. Machine Learning Algorithms
      8. Hyperparameter
      9. Model Evaluation
      10. Machine Learning High-level Steps
    2. Spark Machine Learning Libraries
    3. MLlib Overview
      1. Integration with Other Spark Libraries
      2. Statistical Utilities
      3. Machine Learning Algorithms
    4. The MLlib API
      1. Data Types
      2. Algorithms and Models
      3. Model Evaluation
    5. An Example MLlib Application
      1. Dataset
      2. Goal
      3. Code
    6. Spark ML
      1. ML Dataset
      2. Transformer
      3. Estimator
      4. Pipeline
      5. PipelineModel
      6. Evaluator
      7. Grid Search
      8. CrossValidator
    7. An Example Spark ML Application
      1. Dataset
      2. Goal
      3. Code
    8. Summary
  19. Chapter 9: Graph Processing with Spark
    1. Introducing Graphs
      1. Undirected Graphs
      2. Directed Graphs
      3. Directed Multigraphs
      4. Property Graphs
    2. Introducing GraphX
    3. GraphX API
      1. Data Abstractions
      2. Creating a Graph
      3. Graph Properties
      4. Graph Operators
    4. Summary
  20. Chapter 10: Cluster Managers
    1. Standalone Cluster Manager
      1. Architecture
      2. Setting Up a Standalone Cluster
      3. Running a Spark Application on a Standalone Cluster
    2. Apache Mesos
      1. Architecture
      2. Setting Up a Mesos Cluster
      3. Running a Spark Application on a Mesos Cluster
    3. YARN
      1. Architecture
      2. Running a Spark Application on a YARN Cluster
    4. Summary
  21. Chapter 11: Monitoring
    1. Monitoring a Standalone Cluster
      1. Monitoring a Spark Master
      2. Monitoring a Spark Worker
    2. Monitoring a Spark Application
      1. Monitoring Jobs Launched by an Application
      2. Monitoring Stages in a Job
      3. Monitoring Tasks in a Stage
      4. Monitoring RDD Storage
      5. Monitoring Environment
      6. Monitoring Executors
      7. Monitoring a Spark Streaming Application
      8. Monitoring Spark SQL Queries
      9. Monitoring Spark SQL JDBC/ODBC Server
    3. Summary
  22. Bibliography
  23. Index

Product information

  • Title: Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing
  • Publisher(s): Apress