Bad Data Handbook
Mapping the World of Data Problems
Publisher: O'Reilly Media
Final Release Date: November 2012
Pages: 264

What is bad data? Some people consider it a technical phenomenon, like missing values or malformed records, but bad data includes a lot more. In this handbook, data expert Q. Ethan McCallum has gathered 19 colleagues from every corner of the data arena to reveal how they’ve recovered from nasty data problems.

From cranky storage to poor representation to misguided policy, there are many paths to bad data. Bottom line? Bad data is data that gets in the way. This book explains effective ways to get around it.

Among the many topics covered, you’ll discover how to:

  • Test drive your data to see if it’s ready for analysis
  • Work spreadsheet data into a usable form
  • Handle encoding problems that lurk in text data
  • Develop a successful web-scraping effort
  • Use NLP tools to reveal the real sentiment of online reviews
  • Address cloud computing issues that can impact your analysis effort
  • Avoid policies that create data analysis roadblocks
  • Take a systematic approach to data quality analysis
Customer Reviews

REVIEW SNAPSHOT® by PowerReviews

Bad Data Handbook

4.4 (based on 5 reviews)

Ratings Distribution

  • 5 Stars (2)
  • 4 Stars (3)
  • 3 Stars (0)
  • 2 Stars (0)
  • 1 Star (0)

100% of respondents would recommend this to a friend.

Pros

  • Helpful examples (5)
  • Concise (3)

Cons

    Best Uses

    • Intermediate (4)
    • Novice (4)

    Reviewed by 5 customers

     
    5.0

    Best practices for real world data

    By ahmetRasit

    from Ankara, Turkey

    About Me: Developer, Educator

    Verified Reviewer

    Pros

    • Concise
    • Easy to understand
    • Helpful examples

    Cons

      Best Uses

      • Expert
      • Intermediate
      • Novice
      • Student

      Comments about oreilly Bad Data Handbook:

      I have been dealing with life-science-oriented big data for a long time, and I felt alone with the variety of problems in the data until I read this book. I've realized that most of the problems I face are real-world problems, and they're well summarized in each chapter. The possible solutions and actions are also well thought out. Almost all of the data-related books I've encountered deal with so-called ideal data, but this book differs in its focus on real-world dirty data. I believe this book will not lose its importance over time.

      Some of the chapters might seem unnecessary or less helpful. However, even if the book contained only half of its actual content, I'd still think it deserves its price.

       
      5.0

      Great Health IT book

      By ftrotter

      from Houston, TX

      About Me: Developer

      Pros

      • Concise
      • Helpful examples

      Cons

        Best Uses

        • Intermediate
        • Novice
        • Student

        Comments about oreilly Bad Data Handbook:

        This book has a specific chapter on lab work, which is awesome for anyone with a Health IT background. I wish I had written it.

        This really should be required reading for anyone thinking about doing Big Data in Healthcare.

        (0 of 1 customers found this review helpful)

         
        4.0

        A good book on real world data

        By Erik

        from Brussels, BELGIUM

        Verified Reviewer

        Pros

        • Helpful examples
        • Well-written

        Cons

        • Too basic

        Best Uses

        • Intermediate
        • Novice

        Comments about oreilly Bad Data Handbook:

        This book covers an interesting subject: the difficulties and traps that come with real-world data. Overall, the book provides some interesting insights on how to tackle common issues. Case studies and examples come from various areas, from web data to chemistry.

        The level of detail is not consistent across chapters. For instance, a couple of chapters go into detail and provide code snippets, while many others cover their topic only superficially.

        Some chapters stretch the "bad data" topic, in the sense that they relate not to data problems but rather to poor choices of technology for storage or analysis.

         
        4.0

        Well worth reading -- twice!

        By Wil

        from Saratoga Springs, NY

        About Me: Educator

        Verified Reviewer

        Pros

        • Concise
        • Helpful examples

        Cons

          Best Uses

          • Intermediate
          • Novice

          Comments about oreilly Bad Data Handbook:

          Bad data is a fact of life. Coping with bad data is a valuable, learned skill. Bad Data Handbook offers insights from over 20 authors based on their years of personal experience managing ill-defined, often chaotic and incomplete data. We begin with an exploration of what is meant by *bad data* and what checks we can perform to help us understand data quality as a prerequisite to data analysis.

          Kevin Fink offers suggestions on approaching data critically in order to ensure that we understand what we're working with before we begin to try to manipulate it. Fink offers useful scripts in shell and Perl that can be used to inspect data and perform basic sanity checks. Paul Murrell tackles the problem of scraping data from sources formatted for human consumption into a format more amenable for algorithmic analysis using R. And on and on.

          Each chapter addresses a critical concern in the data life-cycle: identifying, annotating, capturing, archiving, versioning, manipulating, analyzing, and deriving actionable information from imperfect or incomplete data. The advice offered is both powerful and immediately useful to data scientists and newcomers to the field alike, and for me it has spurred several ideas for how to approach teaching statistics.

          Given the number of authors who contributed to this volume, it should come as no surprise that the tone, writing styles, and tools used vary greatly among the chapters, sometimes wandering into technical minutiae, but only infrequently. The book holds together remarkably well, regardless, and was a pleasure to read.

          (1 of 2 customers found this review helpful)

           
          4.0

          Nice Compilation

          By yanyo

          from Santo Domingo

          Verified Reviewer

          Pros

          • Helpful examples

          Cons

            Best Uses

              Comments about oreilly Bad Data Handbook:

              Nice compilation about dealing with the ugly part of data projects.

              Buying Options
              Ebook: $33.99
              Formats:  DAISY, ePub, Mobi, PDF
              Print & Ebook: $43.99
              Print: $39.99