Programming Hive

Book description

Need to move a relational database application to Hadoop? This comprehensive guide introduces you to Apache Hive, Hadoop’s data warehouse infrastructure. You’ll quickly learn how to use Hive’s SQL dialect—HiveQL—to summarize, query, and analyze large datasets stored in Hadoop’s distributed filesystem.

This example-driven guide shows you how to set up and configure Hive in your environment, provides a detailed overview of Hadoop and MapReduce, and demonstrates how Hive works within the Hadoop ecosystem. You’ll also find real-world case studies that describe how companies have used Hive to solve unique problems involving petabytes of data.

  • Use Hive to create, alter, and drop databases, tables, views, functions, and indexes
  • Customize data formats and storage options, from files to external databases
  • Load and extract data from tables—and use queries, grouping, filtering, joining, and other conventional query methods
  • Gain best practices for creating user defined functions (UDFs)
  • Learn Hive patterns you should use and anti-patterns you should avoid
  • Integrate Hive with other data processing programs
  • Use storage handlers for NoSQL databases and other datastores
  • Learn the pros and cons of running Hive on Amazon’s Elastic MapReduce

Publisher resources

View/Submit Errata

Table of contents

  1. Programming Hive
  2. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. Safari® Books Online
    4. How to Contact Us
    5. What Brought Us to Hive?
      1. Edward Capriolo
      2. Dean Wampler
      3. Jason Rutherglen
    6. Acknowledgments
  3. 1. Introduction
    1. An Overview of Hadoop and MapReduce
      1. MapReduce
    2. Hive in the Hadoop Ecosystem
      1. Pig
      2. HBase
      3. Cascading, Crunch, and Others
    3. Java Versus Hive: The Word Count Algorithm
    4. What’s Next
  4. 2. Getting Started
    1. Installing a Preconfigured Virtual Machine
    2. Detailed Installation
      1. Installing Java
        1. Linux-specific Java steps
        2. Mac OS X−specific Java steps
      2. Installing Hadoop
      3. Local Mode, Pseudodistributed Mode, and Distributed Mode
      4. Testing Hadoop
      5. Installing Hive
    3. What Is Inside Hive?
    4. Starting Hive
    5. Configuring Your Hadoop Environment
      1. Local Mode Configuration
      2. Distributed and Pseudodistributed Mode Configuration
      3. Metastore Using JDBC
    6. The Hive Command
      1. Command Options
    7. The Command-Line Interface
      1. CLI Options
      2. Variables and Properties
      3. Hive “One Shot” Commands
      4. Executing Hive Queries from Files
      5. The .hiverc File
      6. More on Using the Hive CLI
        1. Autocomplete
      7. Command History
      8. Shell Execution
      9. Hadoop dfs Commands from Inside Hive
      10. Comments in Hive Scripts
      11. Query Column Headers
  5. 3. Data Types and File Formats
    1. Primitive Data Types
    2. Collection Data Types
    3. Text File Encoding of Data Values
    4. Schema on Read
  6. 4. HiveQL: Data Definition
    1. Databases in Hive
    2. Alter Database
    3. Creating Tables
      1. Managed Tables
      2. External Tables
    4. Partitioned, Managed Tables
      1. External Partitioned Tables
      2. Customizing Table Storage Formats
    5. Dropping Tables
    6. Alter Table
      1. Renaming a Table
      2. Adding, Modifying, and Dropping a Table Partition
      3. Changing Columns
      4. Adding Columns
      5. Deleting or Replacing Columns
      6. Alter Table Properties
      7. Alter Storage Properties
      8. Miscellaneous Alter Table Statements
  7. 5. HiveQL: Data Manipulation
    1. Loading Data into Managed Tables
    2. Inserting Data into Tables from Queries
      1. Dynamic Partition Inserts
    3. Creating Tables and Loading Them in One Query
    4. Exporting Data
  8. 6. HiveQL: Queries
    1. SELECT … FROM Clauses
      1. Specify Columns with Regular Expressions
      2. Computing with Column Values
      3. Arithmetic Operators
      4. Using Functions
        1. Mathematical functions
        2. Aggregate functions
        3. Table generating functions
        4. Other built-in functions
      5. LIMIT Clause
      6. Column Aliases
      7. Nested SELECT Statements
      8. CASE … WHEN … THEN Statements
      9. When Hive Can Avoid MapReduce
    2. WHERE Clauses
      1. Predicate Operators
      2. Gotchas with Floating-Point Comparisons
      3. LIKE and RLIKE
    3. GROUP BY Clauses
      1. HAVING Clauses
    4. JOIN Statements
      1. Inner JOIN
      2. Join Optimizations
      3. LEFT OUTER JOIN
      4. OUTER JOIN Gotcha
      5. RIGHT OUTER JOIN
      6. FULL OUTER JOIN
      7. LEFT SEMI-JOIN
      8. Cartesian Product JOINs
      9. Map-side Joins
    5. ORDER BY and SORT BY
    6. DISTRIBUTE BY with SORT BY
    7. CLUSTER BY
    8. Casting
      1. Casting BINARY Values
    9. Queries that Sample Data
      1. Block Sampling
      2. Input Pruning for Bucket Tables
    10. UNION ALL
  9. 7. HiveQL: Views
    1. Views to Reduce Query Complexity
    2. Views that Restrict Data Based on Conditions
    3. Views and Map Type for Dynamic Tables
    4. View Odds and Ends
  10. 8. HiveQL: Indexes
    1. Creating an Index
      1. Bitmap Indexes
    2. Rebuilding the Index
    3. Showing an Index
    4. Dropping an Index
    5. Implementing a Custom Index Handler
  11. 9. Schema Design
    1. Table-by-Day
    2. Over Partitioning
    3. Unique Keys and Normalization
    4. Making Multiple Passes over the Same Data
    5. The Case for Partitioning Every Table
    6. Bucketing Table Data Storage
    7. Adding Columns to a Table
    8. Using Columnar Tables
      1. Repeated Data
      2. Many Columns
    9. (Almost) Always Use Compression!
  12. 10. Tuning
    1. Using EXPLAIN
    2. EXPLAIN EXTENDED
    3. Limit Tuning
    4. Optimized Joins
    5. Local Mode
    6. Parallel Execution
    7. Strict Mode
    8. Tuning the Number of Mappers and Reducers
    9. JVM Reuse
    10. Indexes
    11. Dynamic Partition Tuning
    12. Speculative Execution
    13. Single MapReduce MultiGROUP BY
    14. Virtual Columns
  13. 11. Other File Formats and Compression
    1. Determining Installed Codecs
    2. Choosing a Compression Codec
    3. Enabling Intermediate Compression
    4. Final Output Compression
    5. Sequence Files
    6. Compression in Action
    7. Archive Partition
    8. Compression: Wrapping Up
  14. 12. Developing
    1. Changing Log4J Properties
    2. Connecting a Java Debugger to Hive
    3. Building Hive from Source
      1. Running Hive Test Cases
      2. Execution Hooks
    4. Setting Up Hive and Eclipse
    5. Hive in a Maven Project
    6. Unit Testing in Hive with hive_test
    7. The New Plugin Developer Kit
  15. 13. Functions
    1. Discovering and Describing Functions
    2. Calling Functions
    3. Standard Functions
    4. Aggregate Functions
    5. Table Generating Functions
    6. A UDF for Finding a Zodiac Sign from a Day
    7. UDF Versus GenericUDF
    8. Permanent Functions
    9. User-Defined Aggregate Functions
      1. Creating a COLLECT UDAF to Emulate GROUP_CONCAT
    10. User-Defined Table Generating Functions
      1. UDTFs that Produce Multiple Rows
      2. UDTFs that Produce a Single Row with Multiple Columns
      3. UDTFs that Simulate Complex Types
    11. Accessing the Distributed Cache from a UDF
    12. Annotations for Use with Functions
      1. Deterministic
      2. Stateful
      3. DistinctLike
    13. Macros
  16. 14. Streaming
    1. Identity Transformation
    2. Changing Types
    3. Projecting Transformation
    4. Manipulative Transformations
    5. Using the Distributed Cache
    6. Producing Multiple Rows from a Single Row
    7. Calculating Aggregates with Streaming
    8. CLUSTER BY, DISTRIBUTE BY, SORT BY
    9. GenericMR Tools for Streaming to Java
    10. Calculating Cogroups
  17. 15. Customizing Hive File and Record Formats
    1. File Versus Record Formats
    2. Demystifying CREATE TABLE Statements
    3. File Formats
      1. SequenceFile
      2. RCFile
      3. Example of a Custom Input Format: DualInputFormat
    4. Record Formats: SerDes
    5. CSV and TSV SerDes
    6. ObjectInspector
    7. Think Big Hive Reflection ObjectInspector
    8. XML UDF
    9. XPath-Related Functions
    10. JSON SerDe
    11. Avro Hive SerDe
      1. Defining Avro Schema Using Table Properties
      2. Defining a Schema from a URI
      3. Evolving Schema
    12. Binary Output
  18. 16. Hive Thrift Service
    1. Starting the Thrift Server
    2. Setting Up Groovy to Connect to HiveService
    3. Connecting to HiveServer
    4. Getting Cluster Status
    5. Result Set Schema
    6. Fetching Results
    7. Retrieving Query Plan
    8. Metastore Methods
      1. Example Table Checker
        1. Finding tables not marked as external
    9. Administrating HiveServer
      1. Productionizing HiveService
      2. Cleanup
    10. Hive ThriftMetastore
      1. ThriftMetastore Configuration
      2. Client Configuration
  19. 17. Storage Handlers and NoSQL
    1. Storage Handler Background
    2. HiveStorageHandler
    3. HBase
    4. Cassandra
      1. Static Column Mapping
      2. Transposed Column Mapping for Dynamic Columns
      3. Cassandra SerDe Properties
    5. DynamoDB
  20. 18. Security
    1. Integration with Hadoop Security
    2. Authentication with Hive
    3. Authorization in Hive
      1. Users, Groups, and Roles
      2. Privileges to Grant and Revoke
      3. Partition-Level Privileges
      4. Automatic Grants
  21. 19. Locking
    1. Locking Support in Hive with Zookeeper
    2. Explicit, Exclusive Locks
  22. 20. Hive Integration with Oozie
    1. Oozie Actions
      1. Hive Thrift Service Action
    2. A Two-Query Workflow
    3. Oozie Web Console
    4. Variables in Workflows
    5. Capturing Output
    6. Capturing Output to Variables
  23. 21. Hive and Amazon Web Services (AWS)
    1. Why Elastic MapReduce?
    2. Instances
    3. Before You Start
    4. Managing Your EMR Hive Cluster
    5. Thrift Server on EMR Hive
    6. Instance Groups on EMR
    7. Configuring Your EMR Cluster
      1. Deploying hive-site.xml
      2. Deploying a .hiverc Script
        1. Deploying .hiverc using a config step
        2. Deploying a .hiverc using a bootstrap action
      3. Setting Up a Memory-Intensive Configuration
    8. Persistence and the Metastore on EMR
    9. HDFS and S3 on EMR Cluster
    10. Putting Resources, Configs, and Bootstrap Scripts on S3
    11. Logs on S3
    12. Spot Instances
    13. Security Groups
    14. EMR Versus EC2 and Apache Hive
    15. Wrapping Up
  24. 22. HCatalog
    1. Introduction
    2. MapReduce
      1. Reading Data
      2. Writing Data
    3. Command Line
    4. Security Model
    5. Architecture
  25. 23. Case Studies
    1. m6d.com (Media6Degrees)
      1. Data Science at M6D Using Hive and R
      2. M6D UDF Pseudorank
      3. M6D Managing Hive Data Across Multiple MapReduce Clusters
        1. Cross deployment queries with Hive
        2. Replicating Hive data between deployments
    2. Outbrain
      1. In-Site Referrer Identification
        1. Cleaning up the URLs
        2. Determining referrer type
        3. Multiple URLs
      2. Counting Uniques
        1. Why this is a problem
        2. Load a temp table
        3. Querying the temp table
      3. Sessionization
        1. Setting it up
        2. Finding origin pageviews
        3. Bucketing PVs to origins
        4. Aggregating on origins
        5. Aggregating on origin type
        6. Measure engagement
    3. NASA’s Jet Propulsion Laboratory
      1. The Regional Climate Model Evaluation System
      2. Our Experience: Why Hive?
      3. Some Challenges and How We Overcame Them
        1. Conclusion
    4. Photobucket
      1. Big Data at Photobucket
      2. What Hardware Do We Use for Hive?
      3. What’s in Hive?
      4. Who Does It Support?
    5. SimpleReach
    6. Experiences and Needs from the Customer Trenches
      1. A Karmasphere Perspective
      2. Introduction
      3. Use Case Examples from the Customer Trenches
        1. Customer trenches #1: Optimal data formatting for Hive
        2. Customer trenches #2: Partitions and performance
        3. Customer trenches #3: Text analytics with Regex, Lateral View Explode, Ngram, and other UDFs
          1. Apache Hive in production: Incremental needs and capabilities
          2. About Karmasphere
  26. Glossary
  27. A. References
  28. Index
  29. About the Authors
  30. Colophon
  31. Copyright

Product information

  • Title: Programming Hive
  • Author(s): Edward Capriolo, Dean Wampler, Jason Rutherglen
  • Release date: September 2012
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781449326975