How do you take your data analysis skills beyond Excel to the next level? By learning just enough Python to get stuff done. This hands-on guide shows non-programmers like you how to process information that’s initially too messy or difficult to access. You don't need to know a thing about the Python programming language to get started.
Through various step-by-step exercises, you’ll learn how to acquire, clean, analyze, and present data efficiently. You’ll also discover how to automate your data process, schedule file-editing and cleanup tasks, process larger datasets, and create compelling stories with the data you obtain.
Quickly learn basic Python syntax, data types, and language concepts
Work with both machine-readable and human-consumable data
Scrape websites and APIs to find a bounty of useful information
Clean and format data to eliminate duplicates and errors in your datasets
Learn when to standardize data and when to test and script data cleanup
Explore and analyze your datasets with new Python libraries and techniques
Use Python solutions to automate your entire data-wrangling process
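The acquire, clean, and analyze steps listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not an excerpt from the book; the column names and sample rows are hypothetical.

```python
# A tiny acquire -> clean -> analyze loop using only the standard library.
import csv
import io

# "Acquire": in practice this data might come from a file, an API, or a scrape.
raw = io.StringIO("city,population\nParis, 2140000\nParis,2140000\nBerlin,3645000\n")
rows = list(csv.DictReader(raw))

# "Clean": strip stray whitespace and drop exact duplicate rows.
cleaned = []
seen = set()
for row in rows:
    row = {key.strip(): value.strip() for key, value in row.items()}
    fingerprint = tuple(sorted(row.items()))
    if fingerprint not in seen:
        seen.add(fingerprint)
        cleaned.append(row)

# "Analyze": a simple aggregate over the cleaned rows.
total = sum(int(row["population"]) for row in cleaned)
print(len(cleaned), total)  # 2 unique rows, combined population 5785000
```

Real datasets need more care than exact-duplicate removal, which is why the book devotes two chapters to cleanup.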
Chapter 1. Introduction to Python
Getting Started with Python
Chapter 2. Python Basics
Basic Data Types
What Can the Various Data Types Do?
Helpful Tools: type, dir, and help
Putting It All Together
What Does It All Mean?
Chapter 3. Data Meant to Be Read by Machines
Chapter 4. Working with Excel Files
Installing Python Packages
Parsing Excel Files
Getting Started with Parsing
Chapter 5. PDFs and Problem Solving in Python
Avoid Using PDFs!
Programmatic Approaches to PDF Parsing
Parsing PDFs Using pdfminer
Learning How to Solve Problems
Uncommon File Types
Chapter 6. Acquiring and Storing Data
Not All Data Is Created Equal
Readability, Cleanliness, and Longevity
Where to Find Data
Case Studies: Example Data Investigation
Storing Your Data: When, Why, and How?
Databases: A Brief Introduction
When to Use a Simple File
Alternative Data Storage
Chapter 7. Data Cleanup: Investigation, Matching, and Formatting
Why Clean Data?
Data Cleanup Basics
Chapter 8. Data Cleanup: Standardizing and Scripting
Normalizing and Standardizing Your Data
Saving Your Data
Determining What Data Cleanup Is Right for Your Project
Scripting Your Cleanup
Testing with New Data
Chapter 9. Data Exploration and Analysis
Exploring Your Data
Analyzing Your Data
Chapter 10. Presenting Your Data
Avoiding Storytelling Pitfalls
Visualizing Your Data
Publishing Your Data
Chapter 11. Web Scraping: Acquiring and Storing Data from the Web
What to Scrape and How
Analyzing a Web Page
Getting Pages: How to Request on the Internet
Reading a Web Page with Beautiful Soup
Reading a Web Page with LXML
Chapter 12. Advanced Web Scraping: Screen Scrapers and Spiders
Spidering the Web
Networks: How the Internet Works and Why It’s Breaking Your Script
The Changing Web (or Why Your Script Broke)
A (Few) Word(s) of Caution
Chapter 13. APIs
A Simple Data Pull from Twitter’s REST API
Advanced Data Collection from Twitter’s REST API
Advanced Data Collection from Twitter’s Streaming API
Chapter 14. Automation and Scaling
Steps to Automate
What Could Go Wrong?
Where to Automate
Special Tools for Automation
Monitoring Your Automation
No System Is Foolproof
Chapter 15. Conclusion
Duties of a Data Wrangler
Beyond Data Wrangling
Where Do You Go from Here?
Appendix Comparison of Languages Mentioned
C, C++, and Java Versus Python
R or MATLAB Versus Python
HTML Versus Python
Node.js Versus Python
Ruby and Ruby on Rails Versus Python
Appendix Python Resources for Beginners
Appendix Learning the Command Line
Windows CMD/PowerShell
Appendix Advanced Python Setup
Step 1: Install GCC
Step 2: (Mac Only) Install Homebrew
Step 3: (Mac Only) Tell Your System Where to Find Homebrew
Step 4: Install Python 2.7
Step 5: Install virtualenv (Windows, Mac, Linux)
Step 6: Set Up a New Directory
Step 7: Install virtualenvwrapper
Learning About Our New Environment (Windows, Mac, Linux)
Advanced Setup Review
Appendix Python Gotchas
Hail the Whitespace
The Dreaded GIL
= Versus == Versus is, and When to Just Copy
Default Function Arguments
Python Scope and Built-Ins: The Importance of Variable Names
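One of the gotchas the appendix covers, mutable default function arguments, can be shown in a short sketch. This is an illustrative example of the pitfall, not code from the book; the function names are made up.

```python
# The mutable-default-argument gotcha: a default list is created once,
# at function definition time, and shared across every call.
def append_bad(item, bucket=[]):
    bucket.append(item)
    return bucket

# The idiomatic fix: default to None and create a fresh list per call.
def append_good(item, bucket=None):
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

first = append_bad(1)
second = append_bad(2)   # same list object as `first`!
print(first, second)     # both print [1, 2]
print(append_good(1), append_good(2))  # [1] [2]
```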
Jacqueline Kazil is a data lover. In her career, she has worked in technology, focusing on finance, government, and journalism. Most notably, she is a former Presidential Innovation Fellow and cofounded 18F, a technology organization within the US government. Her career has included many data science and wrangling projects, among them GeoQ, an open source mapping workflow tool; the Congress.gov remake; and Top Secret America. She is active in the Python and data communities, including the Python Software Foundation, PyLadies, Women Data Science DC, and more. She teaches Python in Washington, D.C., at meetups, conferences, and mini bootcamps. She often pair programs with her sidekick, Ellie (@ellie_the_brave). You can find her on Twitter @jackiekazil or follow her blog, The coderSnorts (https://medium.com/coder-snorts).
Katharine Jarmul is a Python developer who enjoys data analysis and acquisition, web scraping, teaching Python, and all things Unix. She worked at small and large startups before starting her consulting career overseas. Originally from Los Angeles, she learned Python while working at the Washington Post in 2008. As one of the founders of PyLadies (http://pyladies.org/), Katharine hopes to promote diversity in Python and other open source languages through education and training. She has led numerous workshops and tutorials on Python topics ranging from beginner to advanced. For more information on upcoming trainings, reach out to her on Twitter (http://twitter.com/kjam) or her website (http://kjamistan.com/).
The animal on the cover of Data Wrangling with Python is a blue-lipped tree lizard (Plica umbra). Members of the Plica genus are of moderate size and, though they belong to a family commonly known as neotropical ground lizards, live mainly in trees in South America and the Caribbean. Blue-lipped tree lizards predominantly consume ants and are the only species in their genus not characterized by bunches of spines on the neck.
Many of the animals on O'Reilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to animals.oreilly.com.
The cover image is from Lydekker's Natural History. The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag's Ubuntu Mono.
Reader comments about O'Reilly's Data Wrangling with Python:
This book provides a great intro to getting and munging data from a wide variety of sources: text files, Excel files, PDFs, and web scraping. It also introduces many valuable tools for working with data using Python, both as part of the main text and in the appendices. This book is written to be approachable for everyone from beginners to novice Python users. While I have been using Python in science for four years, I found the sections on web scraping to be a great resource.
Bottom line: Yes, I would recommend this to a friend.