Integrating data from multiple sources is essential in the age of big data, but it can be a challenging and time-consuming task. This handy cookbook provides dozens of ready-to-use recipes for using Apache Sqoop, the command-line interface application that optimizes data transfers between relational databases and Hadoop.
Sqoop is both powerful and bewildering, but with this cookbook’s problem-solution-discussion format, you’ll quickly learn how to deploy and then apply Sqoop in your environment. The authors provide MySQL, Oracle, and PostgreSQL database examples on GitHub that you can easily adapt for SQL Server, Netezza, Teradata, or other relational systems.
Transfer data from a single database table into your Hadoop ecosystem
Keep table data and Hadoop in sync by importing data incrementally
Import data from more than one database table
Customize transferred data by calling various database functions
Export generated, processed, or backed-up data from Hadoop to your database
Run Sqoop within Oozie, Hadoop’s specialized workflow scheduler
Load data into Hadoop’s data warehouse (Hive) or database (HBase)
Handle installation, connection, and syntax issues common to specific database vendors
Chapter 1 Getting Started
Downloading and Installing Sqoop
Installing JDBC Drivers
Installing Specialized Connectors
Getting Help with Sqoop
Chapter 2 Importing Data
Transferring an Entire Table
Specifying a Target Directory
Importing Only a Subset of Data
Protecting Your Password
Using a File Format Other Than CSV
Compressing Imported Data
Speeding Up Transfers
Overriding Type Mapping
Encoding NULL Values
Importing All Your Tables
Chapter 3 Incremental Import
Importing Only New Data
Incrementally Importing Mutable Data
Preserving the Last Imported Value
Storing Passwords in the Metastore
Overriding the Arguments to a Saved Job
Sharing the Metastore Between Sqoop Clients
Chapter 4 Free-Form Query Import
Importing Data from Two Tables
Using Custom Boundary Queries
Renaming Sqoop Job Instances
Importing Queries with Duplicated Columns
Chapter 5 Export
Transferring Data from Hadoop
Inserting Data in Batches
Exporting with All-or-Nothing Semantics
Updating an Existing Data Set
Updating or Inserting at the Same Time
Using Stored Procedures
Exporting into a Subset of Columns
Encoding the NULL Value Differently
Exporting Corrupted Data
Chapter 6 Hadoop Ecosystem Integration
Scheduling Sqoop Jobs with Oozie
Specifying Commands in Oozie
Using Property Parameters in Oozie
Installing JDBC Drivers in Oozie
Importing Data Directly into Hive
Using Partitioned Hive Tables
Replacing Special Delimiters During Hive Import
Using the Correct NULL String in Hive
Importing Data into HBase
Importing All Rows into HBase
Improving Performance When Importing into HBase
Chapter 7 Specialized Connectors
Overriding Imported boolean Values in PostgreSQL Direct Import
Importing a Table Stored in Custom Schema in PostgreSQL
Exporting into PostgreSQL Using pg_bulkload
Connecting to MySQL
Using Direct MySQL Import into Hive
Using the upsert Feature When Exporting into MySQL
Kathleen Ting is currently a Customer Operations Engineering Manager at Cloudera where she helps customers deploy and use the Hadoop ecosystem in production. She has spoken on Hadoop, ZooKeeper, and Sqoop at many Big Data conferences including Hadoop World, ApacheCon, and OSCON. She's contributed to several projects in the open source community and is a Committer and PMC Member on Sqoop.
Jarek Jarcec Cecho is currently a Software Engineer at Cloudera where he develops software to help customers better access and integrate with the Hadoop ecosystem. He has led the Sqoop community in the architecture of the next generation of Sqoop, known as Sqoop 2. He's contributed to several projects in the open source community and is a Committer and PMC Member on Sqoop, Flume, and MRUnit.
The animal on the cover of Apache Sqoop Cookbook is the Great White Pelican (Pelecanus onocrotalus). The cover image is from Meyers Kleines. The cover font is Adobe ITC Garamond. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag’s Ubuntu Mono.
Although just 75 pages long, this book is both a great overview and a valuable reference. It focuses on what's important rather than trying to cover every possible detail of Sqoop. Both authors are involved in the development and leadership of Sqoop and their knowledge is extensive. This shines through in the explanations, which I found both helpful and technically accurate.
Sqoop is a powerful tool with lots of options. Beginners are often unaware of its capabilities and wind up doing things the hard way. Even those who have used Sqoop for years might not know about some of its newer features, such as how to use saved jobs to track incremental imports. I'd recommend this book to either group, because spending just an hour or two reading it now could save you a lot more time later.
Since Sqoop is used to get data into and out of a Hadoop cluster, it is typically the first or last step in a much larger data processing workflow. In other words, everything else depends on your ability to use Sqoop quickly and correctly. That makes the "cookbook" format of this book all the more valuable -- it lets you flip right to the page you need and read a concise explanation that shows you exactly how to get the job done.
The content of the book is great, but I am giving this book only four stars due to a problem with the printing itself. The ink used in this book is shiny and actually causes a glare that can make it difficult to read in certain lighting. I have noticed this problem with a few newer O'Reilly books and I hope it's something that they'll fix soon.
Disclosure: Both authors are co-workers of mine at Cloudera. I volunteered to serve as a technical reviewer for this book and the publisher sent me a free copy after it went to press.
Bottom Line: Yes, I would recommend this to a friend
Images shared by Tom Wheeler
This picture illustrates the ink glare I described
Overall, I really liked the organization and information presented in the book. I wish the installation section had a little more detail. Once I got past the install, the other recipes worked very well.
Bottom Line: Yes, I would recommend this to a friend