Learning Apache Drill

Book description

Get up to speed with Apache Drill, an extensible distributed SQL query engine that reads massive datasets in many popular file formats such as Parquet, JSON, and CSV. Drill reads data in HDFS or in cloud-native storage such as S3 and works with Hive metastores along with distributed databases such as HBase, MongoDB, and relational databases. Drill works everywhere: on your laptop or in your largest cluster.

In this practical book, Drill committers Charles Givre and Paul Rogers show analysts and data scientists how to query and analyze raw data using this powerful tool. Data scientists today spend about 80% of their time just gathering and cleaning data. With this book, you’ll learn how Drill helps you analyze data more effectively to drive down time to insight.

  • Use Drill to clean, prepare, and summarize delimited data for further analysis
  • Query file types including logfiles, Parquet, JSON, and other complex formats
  • Query Hadoop, relational databases, MongoDB, and Kafka with standard SQL
  • Connect to Drill programmatically using a variety of languages
  • Use Drill even with challenging or ambiguous file formats
  • Perform sophisticated analysis by extending Drill’s functionality with user-defined functions
  • Facilitate data analysis for network security, image metadata, and machine learning

Publisher resources

View/Submit Errata

Table of contents

  1. Preface
    1. Who Should Read This Book
    2. Why We Wrote This Book
    3. Navigating This Book
    4. Online Resources
    5. Conventions Used in This Book
    6. Using Code Examples
    7. O’Reilly Safari
    8. How to Contact Us
    9. Acknowledgments
    10. Special Thanks from Charles
    11. Special Thanks from Paul
  2. 1. Introduction to Apache Drill
    1. What Is Apache Drill?
      1. Drill Is Versatile
      2. Drill Is Easy to Use
      3. A Word About Drill’s Performance
      4. A Very Brief History of Big Data
      5. Drill in the Big Data Ecosystem
      6. Comparing Drill with Similar Tools
  3. 2. Installing and Running Drill
    1. Preparing Your Machine for Drill
      1. Special Configuration Instructions for Windows Installations
    2. Installing Drill on Windows
      1. Starting Drill on a Windows Machine
    3. Installing Drill in Embedded Mode on macOS or Linux
      1. Starting Drill on macOS or Linux in Embedded Mode
    4. Installing Drill in Distributed Mode on macOS or Linux
      1. Preparing Your Cluster for Drill
      2. Starting Drill in Distributed Mode
    5. Connecting to the Cluster
    6. Conclusion
  4. 3. Overview of Apache Drill
    1. The Apache Hadoop Ecosystem
      1. Drill Is a Low-Latency Query Engine
      2. Distributed Processing with HDFS
      3. Elements of a Drill System
      4. Drill Operation: The 30,000-Foot View
      5. Drill Is a Query Engine, Not a Database
    2. Drill Operation Overview
      1. Drill Components
      2. SQL Session State
      3. Statement Preparation
      4. Statement Execution
      5. Low-Latency Features
    3. Conclusion
  5. 4. Querying Delimited Data
    1. Ways of Querying Data with Drill
      1. Other Interfaces
    2. Drill SQL Query Format
      1. Choosing a Data Source
      2. Defining a Workspace
      3. Specifying a Default Data Source
      4. Accessing Columns in a Query
      5. Delimited Data with Column Headers
      6. Table Functions
      7. Querying Directories
    3. Understanding Drill Data Types
    4. Cleaning and Preparing Data Using String Manipulation Functions
      1. Complex Data Conversion Functions
    5. Working with Dates and Times in Drill
      1. Converting Strings to Dates
      2. Reformatting Dates
      3. Date Arithmetic and Manipulation
      4. Date and Time Functions in Drill
    6. Creating Views
    7. Data Analysis Using Drill
      1. Summarizing Data with Aggregate Functions
    8. Common Problems in Querying Delimited Data
      1. Spaces in Column Names
      2. Illegal Characters in Column Headers
      3. Reserved Words in Column Names
    9. Conclusion
  6. 5. Analyzing Complex and Nested Data
    1. Arrays and Maps
      1. Arrays in Drill
      2. Accessing Maps (Key–Value Pairs) in Drill
      3. Querying Nested Data
    2. Analyzing Log Files with Drill
      1. Configuring Drill to Read HTTPD Web Server Logs
      2. Querying Web Server Logs
      3. Other Log Analysis with Drill
    3. Conclusion
  7. 6. Connecting Drill to Data Sources
    1. Querying Multiple Data Sources
      1. Configuring a New Storage Plug-in
      2. Connecting Drill to a Relational Database
      3. Querying Data in Hadoop from Drill
      4. Connecting to and Querying HBase from Drill
      5. Querying Hive Data from Drill
      6. Connecting to and Querying Streaming Data with Drill and Kafka
      7. Connecting to and Querying Kudu
      8. Connecting to and Querying MongoDB from Drill
      9. Connecting Drill to Cloud Storage
      10. Querying Time Series Data from Drill and OpenTSDB
    2. Conclusion
  8. 7. Connecting to Drill
    1. Understanding Drill’s Interfaces
      1. JDBC and Drill
      2. ODBC and Drill
      3. Drill’s REST Interface
    2. Connecting to Drill with Python
      1. Using drillpy to Query Drill
      2. Connecting to Drill Using pydrill
      3. Other Ways of Connecting to Drill from Python
    3. Connecting to Drill Using R
      1. Querying Drill from R Using sergeant
    4. Connecting to Drill Using Java
    5. Querying Drill with PHP
      1. Using the Connector
      2. Querying Drill from PHP
      3. Interacting with Drill from PHP
    6. Querying Drill Using Node.js
    7. Using Drill as a Data Source in BI Tools
      1. Exploring Data with Apache Zeppelin and Drill
      2. Exploring Data with Apache Superset
    8. Conclusion
  9. 8. Data Engineering with Drill
    1. Schema-on-Read
      1. The SQL Relational Model
      2. Data Life Cycle: Data Exploration to Production
      3. Schema Inference
    2. Data Source Inference
      1. Storage Plug-ins
      2. Storage Configurations
      3. Workspaces
      4. Querying Directories
      5. Default Schema
    3. File Type Inference
      1. Format Plug-ins and Format Configuration
      2. Format Inference
      3. File Format Variations
    4. Schema Inference Overview
    5. Distributed File Scans
      1. Schema Inference for Delimited Data
      2. CSV Summary
      3. Schema Inference for JSON
      4. Ambiguous Numeric Schemas
    6. Aligning Schemas Across Files
    7. JSON Objects
      1. JSON Lists in Drill
      2. JSON Summary
    8. Using Drill with the Parquet File Format
      1. Schema Evolution in Parquet
    9. Partitioning Data Directories
      1. Defining a Table Workspace
    10. Working with Queries in Production
      1. Capturing Schema Mapping in Views
      2. Running Challenging Queries in Scripts
    11. Conclusion
  10. 9. Deploying Drill in Production
    1. Installing Drill
      1. Prerequisites
      2. Production Installation
      3. Configuring ZooKeeper
      4. Configuring Memory
      5. Configuring Logging
      6. Testing the Installation
      7. Distributing Drill Binaries and Configuration
      8. Starting the Drill Cluster
    2. Configuring Storage
      1. Working with Apache Hadoop HDFS
      2. Working with Amazon S3
    3. Admission Control
    4. Additional Configuration
      1. User-Defined Functions and Custom Plug-ins
      2. Security
      3. Logging Levels
      4. Controlling CPU Usage
    5. Monitoring
      1. Monitoring the Drill Process
      2. Monitoring JMX Metrics
      3. Monitoring Queries
    6. Other Deployment Options
      1. MapR Installer
      2. Drill-on-YARN
      3. Docker
    7. Conclusion
  11. 10. Setting Up Your Development Environment
    1. Installing Maven
    2. Creating the Drill Build Environment
      1. Setting Up Git and Getting the Source Code
      2. Building Drill from Source
    3. Installing the IDE
    4. Conclusion
  12. 11. Writing Drill User-Defined Functions
    1. Use Case: Finding and Filtering Valid Credit Card Numbers
    2. How User-Defined Functions Work in Drill
    3. Structure of a Simple Drill UDF
      1. The pom.xml File
      2. The Function File
      3. The Simple Function API
      4. Putting It All Together
    4. Building and Installing Your UDF
      1. Statically Installing a UDF
      2. Dynamically Installing a UDF
    5. Complex Functions: UDFs That Return Maps or Arrays
      1. Example: Extracting User Agent Metadata
      2. The ComplexWriter
    6. Writing Aggregate User-Defined Functions
      1. The Aggregate Function API
      2. Example Aggregate UDF: Kendall’s Rank Correlation Coefficient
    7. Conclusion
  13. 12. Writing a Format Plug-in
    1. The Example Regex Format Plug-in
    2. Creating the “Easy” Format Plug-in
      1. Creating the Maven pom.xml File
      2. Creating the Plug-in Package
      3. Drill Module Configuration
      4. Format Plug-in Configuration
      5. Cautions Before Getting Started
    3. Creating the Regex Plug-in Configuration Class
      1. Copyright Headers and Code Format
      2. Testing the Configuration
      3. Fixing Configuration Problems
      4. Troubleshooting
    4. Creating the Format Plug-in Class
      1. Creating a Test File
      2. Configuring RAT
      3. Efficient Debugging
      4. Creating the Unit Test
      5. How Drill Finds Your Plug-in
    5. The Record Reader
      1. Testing the Reader Shell
      2. Logging
      3. Error Handling
      4. Setup
      5. Regex Parsing
      6. Defining Column Names
      7. Projection
      8. Column Projection Accounting
      9. Project None
      10. Project All
      11. Project Some
      12. Opening the File
      13. Record Batches
      14. Drill’s Columnar Structure
      15. Defining Vectors
      16. Reading Data
      17. Loading Data into Vectors
      18. Releasing Resources
    6. Testing the Reader
      1. Testing the Wildcard Case
      2. Testing Explicit Projection
      3. Testing Empty Projection
      4. Scaling Up
    7. Additional Details
      1. File Chunks
      2. Default Format Configuration
      3. Next Steps
      4. Production Build
      5. Contributing to Drill: The Pull Request
      6. Maintaining Your Branch
      7. Create a Plug-In Project
    8. Conclusion
  14. 13. Unique Uses of Drill
    1. Finding Photos Taken Within a Geographic Region
    2. Drilling Excel Files
      1. The pom.xml File
      2. The Excel Custom Record Reader
      3. Using the Excel Format Plug-in
    3. Network Packet Analysis (PCAP) with Drill
      1. Examples of Queries Using PCAP Data Files
    4. Analyzing Twitter Data with Drill
    5. Using Drill in a Machine Learning Pipeline
      1. Making Predictions Within Drill
      2. Building and Serializing a Model
      3. Writing the UDF Wrapper
      4. Making Predictions Using the UDF
    6. Conclusion
  15. A. List of Drill Functions
    1. Aggregate and Window Functions
      1. Window Functions
    2. Cryptological and Hashing Functions
    3. Data Conversion Functions
    4. Geospatial Functions
    5. Math and Trigonometric Functions
    6. Networking Functions
    7. Null Handling Functions
    8. String Manipulation Functions
    9. Approximate String Matching Functions
      1. Phonetic Functions
      2. String Distance Functions
  16. B. Drill Formatting Strings
  17. Index

Product information

  • Title: Learning Apache Drill
  • Author(s): Charles Givre, Paul Rogers
  • Release date: November 2018
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492032793