Web Scraping with Python

Book description

Learn web scraping and crawling techniques to access data from almost any web source, in almost any format. With this practical guide, you’ll learn how to use Python scripts and web APIs to gather and process data from thousands, or even millions, of web pages at once.

Ideal for programmers, security professionals, and web administrators familiar with Python, this book not only teaches basic web scraping mechanics, but also delves into more advanced topics, such as analyzing raw data or using scrapers for frontend website testing. Code samples are available to help you understand the concepts in practice.

Table of contents

  1. Preface
    1. What Is Web Scraping?
    2. Why Web Scraping?
    3. About This Book
    4. Conventions Used in This Book
    5. Using Code Examples
    6. Safari® Books Online
    7. How to Contact Us
    8. Acknowledgments
  2. I. Building Scrapers
  3. 1. Your First Web Scraper
    1. Connecting
    2. An Introduction to BeautifulSoup
      1. Installing BeautifulSoup
      2. Running BeautifulSoup
      3. Connecting Reliably
  4. 2. Advanced HTML Parsing
    1. You Don’t Always Need a Hammer
    2. Another Serving of BeautifulSoup
      1. find() and findAll() with BeautifulSoup
      2. Other BeautifulSoup Objects
      3. Navigating Trees
    3. Regular Expressions
    4. Regular Expressions and BeautifulSoup
    5. Accessing Attributes
    6. Lambda Expressions
    7. Beyond BeautifulSoup
  5. 3. Starting to Crawl
    1. Traversing a Single Domain
    2. Crawling an Entire Site
      1. Collecting Data Across an Entire Site
    3. Crawling Across the Internet
    4. Crawling with Scrapy
  6. 4. Using APIs
    1. How APIs Work
    2. Common Conventions
      1. Methods
      2. Authentication
    3. Responses
      1. API Calls
    4. Echo Nest
      1. A Few Examples
    5. Twitter
      1. Getting Started
      2. A Few Examples
    6. Google APIs
      1. Getting Started
      2. A Few Examples
    7. Parsing JSON
    8. Bringing It All Back Home
    9. More About APIs
  7. 5. Storing Data
    1. Media Files
    2. Storing Data to CSV
    3. MySQL
      1. Installing MySQL
      2. Some Basic Commands
      3. Integrating with Python
      4. Database Techniques and Good Practice
      5. “Six Degrees” in MySQL
    4. Email
  8. 6. Reading Documents
    1. Document Encoding
    2. Text
      1. Text Encoding and the Global Internet
    3. CSV
      1. Reading CSV Files
    4. PDF
    5. Microsoft Word and .docx
  9. II. Advanced Scraping
  10. 7. Cleaning Your Dirty Data
    1. Cleaning in Code
      1. Data Normalization
    2. Cleaning After the Fact
      1. OpenRefine
  11. 8. Reading and Writing Natural Languages
    1. Summarizing Data
    2. Markov Models
      1. Six Degrees of Wikipedia: Conclusion
    3. Natural Language Toolkit
      1. Installation and Setup
      2. Statistical Analysis with NLTK
      3. Lexicographical Analysis with NLTK
    4. Additional Resources
  12. 9. Crawling Through Forms and Logins
    1. Python Requests Library
    2. Submitting a Basic Form
    3. Radio Buttons, Checkboxes, and Other Inputs
    4. Submitting Files and Images
    5. Handling Logins and Cookies
      1. HTTP Basic Access Authentication
    6. Other Form Problems
  13. 10. Scraping JavaScript
    1. A Brief Introduction to JavaScript
      1. Common JavaScript Libraries
    2. Ajax and Dynamic HTML
      1. Executing JavaScript in Python with Selenium
    3. Handling Redirects
    4. A Final Note on JavaScript
  14. 11. Image Processing and Text Recognition
    1. Overview of Libraries
      1. Pillow
      2. Tesseract
      3. NumPy
    2. Processing Well-Formatted Text
      1. Scraping Text from Images on Websites
    3. Reading CAPTCHAs and Training Tesseract
      1. Training Tesseract
    4. Retrieving CAPTCHAs and Submitting Solutions
  15. 12. Avoiding Scraping Traps
    1. A Note on Ethics
    2. Looking Like a Human
      1. Adjust Your Headers
      2. Handling Cookies
      3. Timing Is Everything
    3. Common Form Security Features
      1. Hidden Input Field Values
      2. Avoiding Honeypots
    4. The Human Checklist
  16. 13. Testing Your Website with Scrapers
    1. An Introduction to Testing
      1. What Are Unit Tests?
    2. Python unittest
      1. Testing Wikipedia
    3. Testing with Selenium
      1. Interacting with the Site
    4. Unittest or Selenium?
  17. 14. Scraping Remotely
    1. Why Use Remote Servers?
      1. Avoiding IP Address Blocking
      2. Portability and Extensibility
    2. Tor
      1. PySocks
    3. Remote Hosting
      1. Running from a Website Hosting Account
      2. Running from the Cloud
    4. Additional Resources
    5. Moving Forward
  18. A. Python at a Glance
    1. Installation and “Hello, World!”
  19. B. The Internet at a Glance
  20. C. The Legalities and Ethics of Web Scraping
    1. Trademarks, Copyrights, Patents, Oh My!
      1. Copyright Law
    2. Trespass to Chattels
    3. The Computer Fraud and Abuse Act
    4. robots.txt and Terms of Service
    5. Three Web Scrapers
      1. eBay versus Bidder’s Edge and Trespass to Chattels
      2. United States v. Auernheimer and The Computer Fraud and Abuse Act
      3. Field v. Google: Copyright and robots.txt
  21. Index

Product information

  • Title: Web Scraping with Python
  • Author(s): Ryan Mitchell
  • Release date: July 2015
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491910290