Storm Blueprints: Patterns for Distributed Real-time Computation

Book description

One of the best ways of getting to grips with the world's most popular framework for real-time processing is to study real-world projects. This books lets you do just that, resulting in a sound understanding of the fundamentals.

In Detail

Storm is the most popular framework for real-time stream processing. Storm provides the fundamental primitives and guarantees required for fault-tolerant distributed computing in high-volume, mission critical applications. It is both an integration technology as well as a data flow and control mechanism, making it the core of many big data platforms. Storm is essential if you want to deploy, operate, and develop data processing flows capable of processing billions of transactions.

"Storm: Distributed Real-time Computation Blueprints" covers a broad range of distributed computing topics, including not only design and integration patterns, but also domains and applications to which the technology is immediately useful and commonly applied. This book introduces you to Storm using real-world examples, beginning with simple Storm topologies. The examples increase in complexity, introducing advanced Storm concepts as well as more sophisticated approaches to deployment and operational concerns.

"Storm: Distributed Real-time Computation Blueprints" covers a broad range of distributed computing topics, including not only design and integration patterns, but also domains and applications to which the technology is immediately useful and commonly applied. This book introduces you to Storm using real-world examples, beginning with simple Storm topologies. The examples increase in complexity, introducing advanced Storm concepts as well as more sophisticated approaches to deployment and operational concerns.

This book covers the domains of real-time log processing, sensor data analysis, collective and artificial intelligence, financial market analysis, Natural Language Processing (NLP), graph analysis, polyglot persistence and online advertising. While exploring distributed computing applications in each of those domains, the book covers advanced Storm topics such as Trident and Distributed State, as well as integration patterns for Druid and Titan. Simultaneously, the book also describes the deployment of Storm to YARN and the Amazon infrastructure, as well as other key operational concerns such as centralized logging.

By the end of the book, you will have gained an understanding of the fundamentals of Storm and Trident and be able to identify and apply those fundamentals to any suitable problem.

What You Will Learn

  • Learn the fundamentals of Storm
  • Install and configure storm in pseudo-distributed and fully-distributed mode
  • Familiarize yourself with the fundamentals of Trident and distributed state
  • Design patterns for data flows in a distributed system
  • Create integration patterns for persistence mechanisms such as Titan
  • Deploy and run Storm clusters by leveraging YARN
  • Achieve continuous availability and fault tolerance through distributed storage
  • Recognize centralized logging mechanisms and processing
  • Implement polyglot persistence and distributed transactions
  • Calculate the effectiveness of a campaign using click-through analysis

Table of contents

  1. Storm Blueprints: Patterns for Distributed Real-time Computation
    1. Table of Contents
    2. Storm Blueprints: Patterns for Distributed Real-time Computation
    3. Credits
    4. About the Authors
    5. About the Reviewers
    6. www.PacktPub.com
      1. Support files, eBooks, discount offers and more
        1. Why Subscribe?
        2. Free Access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    8. 1. Distributed Word Count
      1. Introducing elements of a Storm topology – streams, spouts, and bolts
        1. Streams
        2. Spouts
        3. Bolts
      2. Introducing the word count topology data flow
        1. Sentence spout
          1. Introducing the split sentence bolt
          2. Introducing the word count bolt
          3. Introducing the report bolt
      3. Implementing the word count topology
        1. Setting up a development environment
        2. Implementing the sentence spout
        3. Implementing the split sentence bolt
        4. Implementing the word count bolt
        5. Implementing the report bolt
        6. Implementing the word count topology
      4. Introducing parallelism in Storm
        1. WordCountTopology parallelism
          1. Adding workers to a topology
          2. Configuring executors and tasks
      5. Understanding stream groupings
      6. Guaranteed processing
        1. Reliability in spouts
        2. Reliability in bolts
        3. Reliable word count
      7. Summary
    9. 2. Configuring Storm Clusters
      1. Introducing the anatomy of a Storm cluster
        1. Understanding the nimbus daemon
        2. Working with the supervisor daemon
        3. Introducing Apache ZooKeeper
        4. Working with Storm's DRPC server
        5. Introducing the Storm UI
      2. Introducing the Storm technology stack
        1. Java and Clojure
        2. Python
      3. Installing Storm on Linux
        1. Installing the base operating system
        2. Installing Java
        3. ZooKeeper installation
        4. Storm installation
        5. Running the Storm daemons
        6. Configuring Storm
        7. Mandatory settings
        8. Optional settings
        9. The Storm executable
        10. Setting up the Storm executable on a workstation
        11. The daemon commands
          1. Nimbus
          2. Supervisor
          3. UI
          4. DRPC
        12. The management commands
          1. Jar
          2. Kill
          3. Deactivate
          4. Activate
          5. Rebalance
          6. Remoteconfvalue
        13. Local debug/development commands
          1. REPL
          2. Classpath
          3. Localconfvalue
      4. Submitting topologies to a Storm cluster
      5. Automating the cluster configuration
      6. A rapid introduction to Puppet
        1. Puppet manifests
        2. Puppet classes and modules
        3. Puppet templates
        4. Managing environments with Puppet Hiera
        5. Introducing Hiera
      7. Summary
    10. 3. Trident Topologies and Sensor Data
      1. Examining our use case
      2. Introducing Trident topologies
      3. Introducing Trident spouts
      4. Introducing Trident operations – filters and functions
        1. Introducing Trident filters
        2. Introducing Trident functions
      5. Introducing Trident aggregators – Combiners and Reducers
        1. CombinerAggregator
        2. ReducerAggregator
        3. Aggregator
      6. Introducing the Trident state
        1. The Repeat Transactional state
        2. The Opaque state
      7. Executing the topology
      8. Summary
    11. 4. Real-time Trend Analysis
      1. Use case
      2. Architecture
        1. The source application
        2. The logback Kafka appender
        3. Apache Kafka
        4. Kafka spout
        5. The XMPP server
      3. Installing the required software
        1. Installing Kafka
        2. Installing OpenFire
      4. Introducing the sample application
        1. Sending log messages to Kafka
      5. Introducing the log analysis topology
        1. Kafka spout
        2. The JSON project function
        3. Calculating a moving average
        4. Adding a sliding window
        5. Implementing the moving average function
        6. Filtering on thresholds
        7. Sending notifications with XMPP
      6. The final topology
      7. Running the log analysis topology
      8. Summary
    12. 5. Real-time Graph Analysis
      1. Use case
      2. Architecture
        1. The Twitter client
        2. Kafka spout
        3. A titan-distributed graph database
      3. A brief introduction to graph databases
        1. Accessing the graph – the TinkerPop stack
        2. Manipulating the graph with the Blueprints API
        3. Manipulating the graph with the Gremlin shell
      4. Software installation
        1. Titan installation
      5. Setting up Titan to use the Cassandra storage backend
        1. Installing Cassandra
        2. Starting Titan with the Cassandra backend
      6. Graph data model
      7. Connecting to the Twitter stream
        1. Setting up the Twitter4J client
        2. The OAuth configuration
          1. The TwitterStreamConsumer class
          2. The TwitterStatusListener class
      8. Twitter graph topology
        1. The JSONProjectFunction class
      9. Implementing GraphState
        1. GraphFactory
        2. GraphTupleProcessor
        3. GraphStateFactory
        4. GraphState
        5. GraphUpdater
      10. Implementing GraphFactory
      11. Implementing GraphTupleProcessor
      12. Putting it all together – the TwitterGraphTopology class
        1. The TwitterGraphTopology class
      13. Querying the graph with Gremlin
      14. Summary
    13. 6. Artificial Intelligence
      1. Designing for our use case
      2. Establishing the architecture
        1. Examining the design challenges
        2. Implementing the recursion
          1. Accessing the function's return values
          2. Immutable tuple field values
          3. Upfront field declaration
          4. Tuple acknowledgement in recursion
          5. Output to multiple streams
          6. Read-before-write
        3. Solving the challenges
      3. Implementing the architecture
        1. The data model
        2. Examining the recursive topology
        3. The queue interaction
        4. Functions and filters
        5. Examining the Scoring Topology
          1. Addressing read-before-write
            1. Distributed locking
            2. Retry when stale
            3. Executing the topology
          2. Enumerating the game tree
        6. Distributed Remote Procedure Call (DRPC)
          1. Remote deployment
      4. Summary
    14. 7. Integrating Druid for Financial Analytics
      1. Use case
      2. Integrating a non-transactional system
      3. The topology
        1. The spout
        2. The filter
        3. The state design
      4. Implementing the architecture
        1. DruidState
        2. Implementing the StormFirehose object
        3. Implementing the partition status in ZooKeeper
      5. Executing the implementation
      6. Examining the analytics
      7. Summary
    15. 8. Natural Language Processing
      1. Motivating a Lambda architecture
      2. Examining our use case
      3. Realizing a Lambda architecture
      4. Designing the topology for our use case
      5. Implementing the design
        1. TwitterSpout/TweetEmitter
        2. Functions
          1. TweetSplitterFunction
          2. WordFrequencyFunction
          3. PersistenceFunction
      6. Examining the analytics
      7. Batch processing / historical analysis
      8. Hadoop
        1. An overview of MapReduce
        2. The Druid setup
          1. HadoopDruidIndexer
      9. Summary
    16. 9. Deploying Storm on Hadoop for Advertising Analysis
      1. Examining the use case
      2. Establishing the architecture
        1. Examining HDFS
        2. Examining YARN
      3. Configuring the infrastructure
        1. The Hadoop infrastructure
        2. Configuring HDFS
          1. Configuring the NameNode
          2. Configuring the DataNode
          3. Configuring YARN
            1. Configuring the ResourceManager
          4. Configuring the NodeManager
      4. Deploying the analytics
        1. Performing a batch analysis with the Pig infrastructure
        2. Performing a real-time analysis with the Storm-YARN infrastructure
      5. Performing the analytics
        1. Executing the batch analysis
        2. Executing real-time analysis
      6. Deploying the topology
      7. Executing the topology
      8. Summary
    17. 10. Storm in the Cloud
      1. Introducing Amazon Elastic Compute Cloud (EC2)
        1. Setting up an AWS account
        2. The AWS Management Console
          1. Creating an SSH key pair
        3. Launching an EC2 instance manually
          1. Logging in to the EC2 instance
      2. Introducing Apache Whirr
        1. Installing Whirr
      3. Configuring a Storm cluster with Whirr
        1. Launching the cluster
      4. Introducing Whirr Storm
        1. Setting up Whirr Storm
          1. Cluster configuration
          2. Customizing Storm's configuration
          3. Customizing firewall rules
      5. Introducing Vagrant
        1. Installing Vagrant
        2. Launching your first virtual machine
          1. The Vagrantfile and shared filesystem
          2. Vagrant provisioning
          3. Configuring multimachine clusters with Vagrant
      6. Creating Storm-provisioning scripts
        1. ZooKeeper
        2. Storm
        3. Supervisord
          1. The Storm Vagrantfile
          2. Launching the Storm cluster
      7. Summary
    18. Index

Product information

  • Title: Storm Blueprints: Patterns for Distributed Real-time Computation
  • Author(s): P. Taylor Goetz, Brian O'Neill
  • Release date: March 2014
  • Publisher(s): Packt Publishing
  • ISBN: 9781782168294