Enterprise Data Workflows with Cascading
Streamlined Enterprise Data Management and Analysis
Publisher: O'Reilly Media
Released: July 2013
Pages: 170

There is an easier way to build Hadoop applications. With this hands-on book, you’ll learn how to use Cascading, the open source abstraction framework for Hadoop that lets you easily create and manage powerful enterprise-grade data processing applications—without having to learn the intricacies of MapReduce.

Working with sample apps based on Java and other JVM languages, you’ll quickly learn Cascading’s streamlined approach to data processing, data filtering, and workflow optimization. This book demonstrates how this framework can help your business extract meaningful information from large amounts of distributed data.

  • Start working on Cascading example projects right away
  • Model and analyze unstructured data in any format, from any source
  • Build and test applications with familiar constructs and reusable components
  • Work with the Scalding and Cascalog Domain-Specific Languages
  • Easily deploy applications to Hadoop, regardless of cluster location or data size
  • Build workflows that integrate several big data frameworks and processes
  • Explore common use cases for Cascading, including features and tools that support them
  • Examine a case study that uses a dataset from the Open Data Initiative
Customer Reviews

Average rating: 3.0 (based on 3 reviews)

Ratings distribution:

  • 5 stars (1)
  • 4 stars (0)
  • 3 stars (1)
  • 2 stars (0)
  • 1 star (1)

67% of respondents would recommend this to a friend.

Reviewer profile: Developer (3)

        Rating: 3.0

        Gentle introduction to Cascading

        By Abe Taha, from San Francisco, CA
        About me: Developer (Verified Reviewer)

        Pros:

        • Easy to understand
        • Helpful examples

        Best uses:

        • Novice

        Comments about Enterprise Data Workflows with Cascading:

          For people interested in developing Hadoop analytic applications, there is a plethora of options. They range from writing low-level, hand-tuned Java MapReduce code to manipulating the data in a higher-level language such as Pig or Hive. There are pros and cons for each option. With the former, the code becomes complex for anything other than the canonical word-count example; with the latter, to do anything meaningful, you almost always end up augmenting the higher-level language with user-defined functions written in a different language to regain power and flexibility, causing maintenance nightmares. A happy medium is to use one of the data-flow libraries for Hadoop, of which Cascading is one.

          Since Cascading has been around for some time, the online documentation is relatively mature, and includes a gentle introduction to the library, with example source code, and a well-written user's guide. However, this does not obviate the need for a book that describes the library and walks the reader gently through its usage and subtleties. "Enterprise Data Workflows with Cascading" is such a book.

          The book starts with a simple example of copying a file on Hadoop, and introduces the concepts of taps for data sources and data sinks, as well as the data pipes that connect them. It then graduates to the canonical word count example, using it as a vehicle to explain flows and the operations that can be performed on them through the use of functions and aggregation functions.

          Next come more complex tasks that require joins. The book starts with HashJoins, and then progresses to LeftJoins and distributed joins. The book then uses a meaty example of a text analytics pipeline to calculate term frequency-inverse document frequency (TF-IDF) for a text corpus, and uses that as a vehicle to walk through splits, merges, and more complex joins.

          By then, the reader has become familiar and comfortable with Cascading, and the author walks the reader through the benefits of developing applications in a data-flow language instead of the other options available to Hadoop developers. One of these benefits is the ability to test the code before deployment, and the author walks through an example of a TDD pipeline. Others include using a consistent pattern language to describe the workflows, and having a single deployable JAR that can be used in dev/test/production environments.

          Toward the end, the author lists other language bindings for Cascading, such as Scalding (Scala) and Cascalog (Clojure). The later chapters contain good references for further reading on TDD/Scala/Clojure. The book closes with an open-data use case.

          Throughout the book, the author provides ample links to the source code and code gists on GitHub, as well as alternate implementations in different languages.

          I liked the style of the book: it is a gentle introduction to Cascading, interspersed with some good advice on doing TDD for enterprise applications, the use of a pattern language for describing data flows, and an introduction to other language bindings for Cascading.

          (4 of 5 customers found this review helpful)

           
          Rating: 1.0

          It's the FREE user doc, only for a price

          By asarkar, from Ohio
          About me: Developer (Verified Reviewer)

          Pros:

          • None

          Cons:

          • Copied From User Doc
          • Difficult to understand

          Comments about Enterprise Data Workflows with Cascading:

            I needed to learn Cascading fast for a new project, and since this was the only book on the market, I purchased it without a second thought. Big mistake: I should've looked at the Cascading website first! It's everything that's FREELY available in the user documentation, only for a price. The author didn't even bother to change the code; every single line is copied from the documentation. The same can be said about the content of the book. As far as quality goes, he doesn't take the time to explain why he writes the code the way he writes it. If, like me, you're new to Cascading, and I'm guessing you are since you're looking at this book, subtle nuances will leave you frustrated and with failing code. For example, I spent an hour debugging why an Assertion on a bad input record failed my program instead of making it to the Trap when, apparently, my code was similar to the example given in the book (cf. Example 5). It turned out that, for the program to work, the Trap needed to be connected to the Pipe that produced the bad record and not to anything else.
            Chapter 3 is named "Test-Driven Development," but the only things it has are a unit test copied verbatim from the user doc and some defensive programming strategies (Assertions and Trapping), which are good to know but not unit tests by any means. To make things worse, the Cascading framework itself does not seem to have good support for unit tests. First of all, the only class they ask you to extend (CascadingTestCase) is conspicuously absent from the API (JavaDoc). Like me, you are left to dig into the source code to find out what capabilities this class offers. If you are expecting this book to shed some light, you're in for a disappointment.
            Don't kill a tree by buying this book; read the online documentation and save the money for a good dining experience.

            (4 of 6 customers found this review helpful)

             
            Rating: 5.0

            Excellent coverage of the subject matter

            By Matt PC, from Scottsdale, AZ
            About me: Designer, Developer, Educator, Maker, Sys Admin (Verified Reviewer)

            Pros:

            • Accurate
            • Concise
            • Easy to understand
            • Helpful examples
            • Well-written

            Best uses:

            • Expert
            • Intermediate
            • Novice
            • Student

            Comments about Enterprise Data Workflows with Cascading:

              I have used Cascading in large-scale production systems since 2009 and have contributed to Cascading.Avro. This book covers the essentials for successfully building enterprise-grade data processing systems with Hadoop and Cascading.

              Buying Options
              Ebook: $27.99
              Formats:  ePub, Mobi, PDF
              Print & Ebook: $38.49
              Print: $34.99