Data Wrangling with Python

Book description

How do you take your data analysis skills beyond Excel to the next level? By learning just enough Python to get stuff done. This hands-on guide shows non-programmers like you how to process information that’s initially too messy or difficult to access. You don’t need to know a thing about the Python programming language to get started.

Through various step-by-step exercises, you’ll learn how to acquire, clean, analyze, and present data efficiently. You’ll also discover how to automate your data process, schedule file-editing and clean-up tasks, process larger datasets, and create compelling stories with data you obtain.

  • Quickly learn basic Python syntax, data types, and language concepts
  • Work with both machine-readable and human-consumable data
  • Scrape websites and APIs to find a bounty of useful information
  • Clean and format data to eliminate duplicates and errors in your datasets
  • Learn when to standardize data and when to test and script data cleanup
  • Explore and analyze your datasets with new Python libraries and techniques
  • Use Python solutions to automate your entire data-wrangling process

Table of contents

  1. Preface
    1. Who Should Read This Book
    2. Who Should Not Read This Book
    3. How This Book Is Organized
    4. What Is Data Wrangling?
    5. What to Do If You Get Stuck
    6. Conventions Used in This Book
    7. Using Code Examples
    8. O’Reilly Safari
    9. How to Contact Us
    10. Acknowledgments
  2. 1. Introduction to Python
    1. Why Python
    2. Getting Started with Python
      1. Which Python Version
      2. Setting Up Python on Your Machine
      3. Test Driving Python
      4. Install pip
      5. Install a Code Editor
      6. Optional: Install IPython
    3. Summary
  3. 2. Python Basics
    1. Basic Data Types
      1. Strings
      2. Integers and Floats
    2. Data Containers
      1. Variables
      2. Lists
      3. Dictionaries
    3. What Can the Various Data Types Do?
      1. String Methods: Things Strings Can Do
      2. Numerical Methods: Things Numbers Can Do
      3. List Methods: Things Lists Can Do
      4. Dictionary Methods: Things Dictionaries Can Do
    4. Helpful Tools: type, dir, and help
      1. type
      2. dir
      3. help
    5. Putting It All Together
    6. What Does It All Mean?
    7. Summary
  4. 3. Data Meant to Be Read by Machines
    1. CSV Data
      1. How to Import CSV Data
      2. Saving the Code to a File; Running from Command Line
    2. JSON Data
      1. How to Import JSON Data
    3. XML Data
      1. How to Import XML Data
    4. Summary
  5. 4. Working with Excel Files
    1. Installing Python Packages
    2. Parsing Excel Files
    3. Getting Started with Parsing
    4. Summary
  6. 5. PDFs and Problem Solving in Python
    1. Avoid Using PDFs!
    2. Programmatic Approaches to PDF Parsing
      1. Opening and Reading Using slate
      2. Converting PDF to Text
    3. Parsing PDFs Using pdfminer
    4. Learning How to Solve Problems
      1. Exercise: Use Table Extraction, Try a Different Library
      2. Exercise: Clean the Data Manually
      3. Exercise: Try Another Tool
    5. Uncommon File Types
    6. Summary
  7. 6. Acquiring and Storing Data
    1. Not All Data Is Created Equal
    2. Fact Checking
    3. Readability, Cleanliness, and Longevity
    4. Where to Find Data
      1. Using a Telephone
      2. US Government Data
      3. Government and Civic Open Data Worldwide
      4. Organization and Non-Government Organization (NGO) Data
      5. Education and University Data
      6. Medical and Scientific Data
      7. Crowdsourced Data and APIs
    5. Case Studies: Example Data Investigation
      1. Ebola Crisis
      2. Train Safety
      3. Football Salaries
      4. Child Labor
    6. Storing Your Data: When, Why, and How?
    7. Databases: A Brief Introduction
      1. Relational Databases: MySQL and PostgreSQL
      2. Non-Relational Databases: NoSQL
      3. Setting Up Your Local Database with Python
    8. When to Use a Simple File
      1. Cloud-Storage and Python
      2. Local Storage and Python
    9. Alternative Data Storage
    10. Summary
  8. 7. Data Cleanup: Investigation, Matching, and Formatting
    1. Why Clean Data?
    2. Data Cleanup Basics
      1. Identifying Values for Data Cleanup
      2. Formatting Data
      3. Finding Outliers and Bad Data
      4. Finding Duplicates
      5. Fuzzy Matching
      6. RegEx Matching
      7. What to Do with Duplicate Records
    3. Summary
  9. 8. Data Cleanup: Standardizing and Scripting
    1. Normalizing and Standardizing Your Data
    2. Saving Your Data
    3. Determining What Data Cleanup Is Right for Your Project
    4. Scripting Your Cleanup
    5. Testing with New Data
    6. Summary
  10. 9. Data Exploration and Analysis
    1. Exploring Your Data
      1. Importing Data
      2. Exploring Table Functions
      3. Joining Numerous Datasets
      4. Identifying Correlations
      5. Identifying Outliers
      6. Creating Groupings
      7. Further Exploration
    2. Analyzing Your Data
      1. Separating and Focusing Your Data
      2. What Is Your Data Saying?
      3. Drawing Conclusions
      4. Documenting Your Conclusions
    3. Summary
  11. 10. Presenting Your Data
    1. Avoiding Storytelling Pitfalls
      1. How Will You Tell the Story?
      2. Know Your Audience
    2. Visualizing Your Data
      1. Charts
      2. Time-Related Data
      3. Maps
      4. Interactives
      5. Words
      6. Images, Video, and Illustrations
    3. Presentation Tools
    4. Publishing Your Data
      1. Using Available Sites
      2. Open Source Platforms: Starting a New Site
      3. Jupyter (Formerly Known as IPython Notebooks)
    5. Summary
  12. 11. Web Scraping: Acquiring and Storing Data from the Web
    1. What to Scrape and How
    2. Analyzing a Web Page
      1. Inspection: Markup Structure
      2. Network/Timeline: How the Page Loads
      3. Console: Interacting with JavaScript
      4. In-Depth Analysis of a Page
    3. Getting Pages: How to Request on the Internet
    4. Reading a Web Page with Beautiful Soup
    5. Reading a Web Page with LXML
      1. A Case for XPath
    6. Summary
  13. 12. Advanced Web Scraping: Screen Scrapers and Spiders
    1. Browser-Based Parsing
      1. Screen Reading with Selenium
      2. Screen Reading with Ghost.Py
    2. Spidering the Web
      1. Building a Spider with Scrapy
      2. Crawling Whole Websites with Scrapy
    3. Networks: How the Internet Works and Why It’s Breaking Your Script
    4. The Changing Web (or Why Your Script Broke)
    5. A (Few) Word(s) of Caution
    6. Summary
  14. 13. APIs
    1. API Features
      1. REST Versus Streaming APIs
      2. Rate Limits
      3. Tiered Data Volumes
      4. API Keys and Tokens
    2. A Simple Data Pull from Twitter’s REST API
    3. Advanced Data Collection from Twitter’s REST API
    4. Advanced Data Collection from Twitter’s Streaming API
    5. Summary
  15. 14. Automation and Scaling
    1. Why Automate?
    2. Steps to Automate
    3. What Could Go Wrong?
    4. Where to Automate
    5. Special Tools for Automation
      1. Using Local Files, argv, and Config Files
      2. Using the Cloud for Data Processing
      3. Using Parallel Processing
      4. Using Distributed Processing
    6. Simple Automation
      1. CronJobs
      2. Web Interfaces
      3. Jupyter Notebooks
    7. Large-Scale Automation
      1. Celery: Queue-Based Automation
      2. Ansible: Operations Automation
    8. Monitoring Your Automation
      1. Python Logging
      2. Adding Automated Messaging
      3. Uploading and Other Reporting
      4. Logging and Monitoring as a Service
    9. No System Is Foolproof
    10. Summary
  16. 15. Conclusion
    1. Duties of a Data Wrangler
    2. Beyond Data Wrangling
      1. Become a Better Data Analyst
      2. Become a Better Developer
      3. Become a Better Visual Storyteller
      4. Become a Better Systems Architect
    3. Where Do You Go from Here?
  17. A. Comparison of Languages Mentioned
    1. C, C++, and Java Versus Python
    2. R or MATLAB Versus Python
    3. HTML Versus Python
    4. JavaScript Versus Python
    5. Node.js Versus Python
    6. Ruby and Ruby on Rails Versus Python
  18. B. Python Resources for Beginners
    1. Online Resources
    2. In-Person Groups
  19. C. Learning the Command Line
    1. Bash
      1. Navigation
      2. Modifying Files
      3. Executing Files
      4. Searching with the Command Line
      5. More Resources
    2. Windows CMD/Power Shell
      1. Navigation
      2. Modifying Files
      3. Executing Files
      4. Searching with the Command Line
      5. More Resources
  20. D. Advanced Python Setup
    1. Step 1: Install GCC
    2. Step 2: (Mac Only) Install Homebrew
    3. Step 3: (Mac Only) Tell Your System Where to Find Homebrew
    4. Step 4: Install Python 2.7
    5. Step 5: Install virtualenv (Windows, Mac, Linux)
    6. Step 6: Set Up a New Directory
    7. Step 7: Install virtualenvwrapper
      1. Installing virtualenvwrapper (Mac and Linux)
      2. Installing virtualenvwrapper-win (Windows)
      3. Testing Your Virtual Environment (Windows, Mac, Linux)
    8. Learning About Our New Environment (Windows, Mac, Linux)
    9. Advanced Setup Review
  21. E. Python Gotchas
    1. Hail the Whitespace
    2. The Dreaded GIL
    3. = Versus == Versus is, and When to Just Copy
    4. Default Function Arguments
    5. Python Scope and Built-Ins: The Importance of Variable Names
    6. Defining Objects Versus Modifying Objects
    7. Changing Immutable Objects
    8. Type Checking
    9. Catching Multiple Exceptions
    10. The Power of Debugging
  22. F. IPython Hints
    1. Why Use IPython?
    2. Getting Started with IPython
    3. Magic Functions
    4. Final Thoughts: A Simpler Terminal
  23. G. Using Amazon Web Services
    1. Spinning Up an AWS Server
      1. AWS Step 1: Choose an Amazon Machine Image (AMI)
      2. AWS Step 2: Choose an Instance Type
      3. AWS Step 7: Review Instance Launch
      4. AWS Extra Question: Select an Existing Key Pair or Create a New One
    2. Logging into an AWS Server
      1. Get the Public DNS Name of the Instance
      2. Prepare Your Private Key
      3. Log into Your Server
      4. Summary
  24. Index

Product information

  • Title: Data Wrangling with Python
  • Author(s): Jacqueline Kazil, Katharine Jarmul
  • Release date: February 2016
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491948774