Parallel R

Book description

It’s tough to argue with R as a high-quality, cross-platform, open source statistical software product—unless you’re in the business of crunching Big Data. This concise book introduces you to several strategies for using R to analyze large datasets, including three chapters on using R and Hadoop together. You’ll learn the basics of Snow, Multicore, Parallel, Segue, RHIPE, and Hadoop Streaming, including how to find them, how to use them, when they work well, and when they don’t.

With these packages, you can overcome R’s single-threaded nature by spreading work across multiple CPUs, or get past R’s memory barrier by offloading work to multiple machines.

  • Snow: works well in a traditional cluster environment
  • Multicore: popular for multiprocessor and multicore computers
  • Parallel: part of the upcoming R 2.14.0 release
  • R+Hadoop: provides low-level access to a popular form of cluster computing
  • RHIPE: uses Hadoop’s power with R’s language and interactive shell
  • Segue: lets you use Elastic MapReduce as a backend for lapply-style operations
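To give a rough sense of what "lapply-style operations" means in practice, here is a minimal sketch (not taken from the book) that runs the same work serially with lapply and then across two local worker processes with the parallel package. The function slow_square, the input vector, and the worker count of 2 are all hypothetical choices made for illustration.

```r
# Sketch only: compares a serial lapply() call with a parallel counterpart
# from the parallel package (bundled with R since version 2.14.0).
library(parallel)

# Hypothetical stand-in for an expensive per-item computation.
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

inputs <- 1:8

# Serial baseline: a single R process handles every element in turn.
serial <- lapply(inputs, slow_square)

# Parallel version: start two local worker processes (an arbitrary choice),
# hand the same work to them, and always shut the cluster down afterwards.
cl <- makeCluster(2)
par_result <- tryCatch(
  parLapply(cl, inputs, slow_square),
  finally = stopCluster(cl)
)

# Both approaches return the same list; only the wall-clock time differs.
stopifnot(identical(serial, par_result))
```

The call shape stays the same as lapply (a list of inputs and a function applied to each), which is why the packages above can swap in different backends, from local cores to a cluster, without changing how the work is expressed.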

Table of contents

  1. Parallel R
  2. SPECIAL OFFER: Upgrade this ebook with O’Reilly
  3. A Note Regarding Supplemental Files
  4. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. Safari® Books Online
    4. How to Contact Us
    5. Acknowledgments
      1. Q. Ethan McCallum
      2. Stephen Weston
  5. 1. Getting Started
    1. Why R?
    2. Why Not R?
    3. The Solution: Parallel Execution
    4. A Road Map for This Book
      1. What We’ll Cover
      2. Looking Forward…
      3. What We’ll Assume You Already Know
    5. In a Hurry?
      1. snow
      2. multicore
      3. parallel
      4. R+Hadoop
      5. RHIPE
      6. Segue
    6. Summary
  6. 2. snow
    1. Quick Look
    2. How It Works
    3. Setting Up
    4. Working with It
      1. Creating Clusters with makeCluster
      2. Parallel K-Means
      3. Initializing Workers
      4. Load Balancing with clusterApplyLB
      5. Task Chunking with parLapply
      6. Vectorizing with clusterSplit
      7. Load Balancing Redux
      8. Functions and Environments
      9. Random Number Generation
      10. snow Configuration
      11. Installing Rmpi
      12. Executing snow Programs on a Cluster with Rmpi
      13. Executing snow Programs with a Batch Queueing System
      14. Troubleshooting snow Programs
    5. When It Works…
    6. …And When It Doesn’t
    7. The Wrap-up
  7. 3. multicore
    1. Quick Look
    2. How It Works
    3. Setting Up
    4. Working with It
      1. The mclapply Function
      2. The mc.cores Option
      3. The mc.set.seed Option
      4. Load Balancing with mclapply
      5. The pvec Function
      6. The parallel and collect Functions
      7. Using collect Options
      8. Parallel Random Number Generation
      9. The Low-Level API
    5. When It Works…
    6. …And When It Doesn’t
    7. The Wrap-up
  8. 4. parallel
    1. Quick Look
    2. How It Works
    3. Setting Up
    4. Working with It
      1. Getting Started
      2. Creating Clusters with makeCluster
      3. Parallel Random Number Generation
    5. Summary of Differences
    6. When It Works…
    7. …And When It Doesn’t
    8. The Wrap-up
  9. 5. A Primer on MapReduce and Hadoop
    1. Hadoop at Cruising Altitude
    2. A MapReduce Primer
    3. Thinking in MapReduce: Some Pseudocode Examples
      1. Calculate Average Call Length for Each Date
      2. Number of Calls by Each User, on Each Date
      3. Run a Special Algorithm on Each Record
    4. Binary and Whole-File Data: SequenceFiles
    5. No Cluster? No Problem! Look to the Clouds…
    6. The Wrap-up
  10. 6. R+Hadoop
    1. Quick Look
    2. How It Works
    3. Setting Up
    4. Working with It
      1. Simple Hadoop Streaming (All Text)
      2. Streaming, Redux: Indirectly Working with Binary Data
      3. The Java API: Binary Input and Output
      4. Processing Related Groups (the Full Map and Reduce Phases)
    5. When It Works…
    6. …And When It Doesn’t
    7. The Wrap-up
  11. 7. RHIPE
    1. Quick Look
    2. How It Works
    3. Setting Up
    4. Working with It
      1. Phone Call Records, Redux
      2. Tweet Brevity
      3. More Complex Tweet Analysis
    5. When It Works…
    6. …And When It Doesn’t
    7. The Wrap-up
  12. 8. Segue
    1. Quick Look
    2. How It Works
    3. Setting Up
    4. Working with It
      1. Model Testing: Parameter Sweep
    5. When It Works…
    6. …And When It Doesn’t
    7. The Wrap-up
  13. 9. New and Upcoming
    1. doRedis
    2. RevoScale R and RevoConnectR (RHadoop)
    3. cloudNumbers.com
  14. About the Authors
  15. SPECIAL OFFER: Upgrade this ebook with O’Reilly
  16. Copyright

Product information

  • Title: Parallel R
  • Author(s): Q. Ethan McCallum, Stephen Weston
  • Release date: October 2011
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781449320331