Using Flume

Book description

How can you get your data from frontend servers to Hadoop in near real time? With this complete reference guide, you’ll learn Flume’s rich set of features for collecting, aggregating, and writing large amounts of streaming data to the Hadoop Distributed File System (HDFS), Apache HBase, SolrCloud, Elasticsearch, and other systems.

Using Flume shows operations engineers how to configure, deploy, and monitor a Flume cluster, and teaches developers how to write Flume plugins and custom components for their specific use cases. You’ll learn about Flume’s design and implementation, as well as the features that make it highly scalable, flexible, and reliable. Code examples and exercises are available on GitHub.

  • Learn how Flume provides a steady rate of flow by acting as a buffer between data producers and consumers
  • Dive into key Flume components, including sources that accept data and sinks that write and deliver it
  • Write custom plugins that change the way Flume receives, modifies, formats, and writes data
  • Explore APIs for sending data to Flume agents from your own applications
  • Plan and deploy Flume in a scalable and flexible way—and monitor your cluster once it’s running

Table of contents

  1. Foreword
  2. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. Safari® Books Online
    4. How to Contact Us
    5. Acknowledgments
  3. Chapter 1. Apache Hadoop and Apache HBase: An Introduction
    1. HDFS
      1. HDFS Data Formats
      2. Processing Data on HDFS
    2. Apache HBase
    3. Summary
    4. References
  4. Chapter 2. Streaming Data Using Apache Flume
    1. The Need for Flume
    2. Is Flume a Good Fit?
    3. Inside a Flume Agent
    4. Configuring Flume Agents
    5. Getting Flume Agents to Talk to Each Other
    6. Complex Flows
    7. Replicating Data to Various Destinations
    8. Dynamic Routing
    9. Flume’s No Data Loss Guarantee, Channels, and Transactions
      1. Transactions in Flume Channels
    10. Agent Failure and Data Loss
    11. The Importance of Batching
    12. What About Duplicates?
    13. Running a Flume Agent
    14. Summary
    15. References
  5. Chapter 3. Sources
    1. Lifecycle of a Source
    2. Sink-to-Source Communication
      1. Avro Source
      2. Thrift Source
      3. Failure Handling in RPC Sources
    3. HTTP Source
      1. Writing Handlers for the HTTP Source*
    4. Spooling Directory Source
      1. Reading Custom Formats Using Deserializers*
      2. Spooling Directory Source Performance
    5. Syslog Sources
    6. Exec Source
    7. JMS Source
      1. Converting JMS Messages into Flume Events*
    8. Writing Your Own Sources*
      1. Event-Driven and Pollable Sources
    9. Summary
    10. References
  6. Chapter 4. Channels
    1. Transaction Workflow
    2. Channels Bundled with Flume
      1. Memory Channel
      2. File Channel
    3. Summary
    4. References
  7. Chapter 5. Sinks
    1. Lifecycle of a Sink
    2. Optimizing the Performance of Sinks
    3. Writing to HDFS: The HDFS Sink
      1. Understanding Buckets
      2. Configuring the HDFS Sink
      3. Controlling the Data Format Using Serializers*
    4. HBase Sinks
      1. Translating Flume Events to HBase Puts and Increments Using Serializers*
    5. RPC Sinks
      1. Avro Sink
      2. Thrift Sink
    6. Morphline Solr Sink
    7. Elastic Search Sink
      1. Customizing the Data Format*
    8. Other Sinks: Null Sink, Rolling File Sink, Logger Sink
    9. Writing Your Own Sink*
    10. Summary
    11. References
  8. Chapter 6. Interceptors, Channel Selectors, Sink Groups, and Sink Processors
    1. Interceptors
      1. Timestamp Interceptor
      2. Host Interceptor
      3. Static Interceptor
      4. Regex Filtering Interceptor
      5. Morphline Interceptor
      6. UUID Interceptor
      7. Writing Interceptors*
    2. Channel Selectors
      1. Replicating Channel Selector
      2. Multiplexing Channel Selector
      3. Custom Channel Selectors*
    3. Sink Groups and Sink Processors
      1. Load-Balancing Sink Processor
      2. Failover Sink Processor
    4. Summary
    5. References
  9. Chapter 7. Getting Data into Flume*
    1. Building Flume Events
    2. Flume Client SDK
      1. Building Flume RPC Clients
      2. RPC Client Interface
      3. Configuration Parameters Common to All RPC Clients
      4. Default RPC Client
      5. Load-Balancing RPC Client
      6. Failover RPC Client
      7. Thrift RPC Client
    3. Embedded Agent
      1. Configuring an Embedded Agent
    4. log4j Appenders
      1. Load-Balancing log4j Appender
    5. Summary
    6. References
  10. Chapter 8. Planning, Deploying, and Monitoring Flume
    1. Planning a Flume Deployment
      1. Time to Repair
      2. How Much Capacity Do I Need in My Flume Channels?
      3. How Many Tiers?
      4. Sending Data over Cross–Data Center Links
      5. Sharding Tiers
    2. Deploying Flume
      1. Deploying Custom Code
    3. Monitoring Flume
      1. Reporting Metrics from Custom Components
    4. Summary
    5. References
  11. Index

Product information

  • Title: Using Flume
  • Author(s): Hari Shreedharan
  • Release date: September 2014
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491905333