Getting Started with Beautiful Soup

Book description

Learn how to extract information from websites using Beautiful Soup and the Python urllib2 module. This practical, hands-on guide covers everything you need to know to get a head start in website scraping.

In Detail

Beautiful Soup is a Python library designed for quick-turnaround projects like screen scraping. It provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn't take much code to write an application using Beautiful Soup.
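As a rough illustration of the navigating, searching, and modifying described above, here is a minimal sketch. It assumes the bs4 package (Beautiful Soup 4) on Python 3 and uses the built-in html.parser; the sample markup and its class names are made up for this example and are not taken from the book.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, made up for this sketch.
html = """
<ul id="producers">
  <li class="producer"><div class="name">plants</div></li>
  <li class="producer"><div class="name">algae</div></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching: find() returns the first match, find_all() returns every match.
first = soup.find("li", class_="producer")
names = [div.get_text() for div in soup.find_all("div", class_="name")]

# Navigating: step down to a child tag by name, and up to the parent tag.
first_name = first.div.string
parent_id = first.parent["id"]

# Modifying: rename a tag and replace an attribute value in place.
first.div.name = "span"
first["class"] = "primary-producer"

print(names)
print(first_name, parent_id)
print(first)
```

The whole round trip, from parsing to searching to editing the tree, fits in a handful of lines, which is the point the description makes about not needing much code.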

Getting Started with Beautiful Soup is a practical guide to Beautiful Soup using Python. The book starts by walking you through installation and then covers each feature of Beautiful Soup using simple examples, with sample Python code and, where helpful, diagrams and screenshots. It also addresses the problem of how exactly you can get data out of a website and works through a solution using a real website and sample code.

Getting Started with Beautiful Soup goes over the different methods to install Beautiful Soup on both Linux and Windows systems. You will then learn about searching, navigating, content modification, encoding support, and output formatting, with sample Python code for each example so that you can try it out yourself. This book is a practical guide for scraping information from websites. If you want to learn how to efficiently scrape pages from websites, then this book is for you.
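To give a feel for the encoding support and output formatting mentioned above, the following is a small sketch rather than an excerpt from the book. It assumes Beautiful Soup 4 on Python 3 with the built-in html.parser, and the input bytes are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical input: UTF-8 encoded bytes, made up for this sketch.
raw = "<html><body><p>caf\u00e9 &amp; bar</p></body></html>".encode("utf-8")

# from_encoding tells Beautiful Soup how to decode the bytes instead of guessing.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

detected = soup.original_encoding             # encoding used to decode the input
pretty = soup.prettify()                      # formatted (indented) output
compact = str(soup)                           # unformatted output
minimal = soup.p.decode(formatter="minimal")  # escapes only &, < and >
html_fmt = soup.p.decode(formatter="html")    # uses named HTML entities where possible
text = soup.get_text()                        # text only, with tags stripped
latin1_bytes = soup.encode("latin-1")         # output re-encoded as bytes

print(detected)
print(minimal)
print(html_fmt)
print(text)
```

Comparing the minimal and html results shows how the formatter choice changes the way non-ASCII characters such as the accented letter are written out.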

What You Will Learn

  • Learn how to scrape HTML pages from websites
  • Implement a simple method to scrape any website with the help of developer tools, the Python urllib2 module, and Beautiful Soup
  • Learn how to search for information within an HTML/XML page
  • Modify the contents of an HTML tree
  • Understand encoding support in Beautiful Soup
  • Learn about the different types of output formatting
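
The scraping workflow in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the book's code: it targets Python 3, where urllib.request takes the place of the urllib2 module the book uses, and the helper names fetch_html and extract_links as well as the sample markup are hypothetical.

```python
from urllib.request import Request, urlopen  # Python 3 successor to urllib2

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw bytes (hypothetical helper)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return response.read()


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)]


# Offline demonstration on a made-up snippet; no network access is needed.
sample = '<p><a href="/books/1">Book One</a> <a name="x">no link</a></p>'
links = extract_links(sample)
print(links)
```

To scrape a live page, the same parsing function can be fed the downloaded bytes, for example extract_links(fetch_html("https://example.com/")). In practice, the browser's developer tools help identify which tags and attributes to search for.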

Table of contents

  1. Getting Started with Beautiful Soup
    1. Table of Contents
    2. Getting Started with Beautiful Soup
    3. Credits
    4. About the Author
    5. About the Reviewers
    6. www.PacktPub.com
      1. Support files, eBooks, discount offers and more
        1. Why Subscribe?
        2. Free Access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    8. 1. Installing Beautiful Soup
      1. Installing Beautiful Soup
        1. Installing Beautiful Soup in Linux
          1. Installing Beautiful Soup using package manager
          2. Installing Beautiful Soup using pip or easy_install
          3. Installing Beautiful Soup using pip
          4. Installing Beautiful Soup using easy_install
        2. Installing Beautiful Soup in Windows
          1. Verifying Python path in Windows
        3. Installing Beautiful Soup using setup.py
      2. Using Beautiful Soup without installation
      3. Verifying the installation
      4. Quick reference
      5. Summary
    9. 2. Creating a BeautifulSoup Object
      1. Creating a BeautifulSoup object
        1. Creating a BeautifulSoup object from a string
        2. Creating a BeautifulSoup object from a file-like object
        3. Creating a BeautifulSoup object for XML parsing
          1. Understanding the features argument
      2. Tag
        1. Accessing the Tag object from BeautifulSoup
        2. Name of the Tag object
        3. Attributes of a Tag object
      3. The NavigableString object
      4. Quick reference
      5. Summary
    10. 3. Search Using Beautiful Soup
      1. Searching in Beautiful Soup
        1. Searching with find()
          1. Finding the first producer
          2. Explaining find()
            1. Searching for tags
            2. Searching for text
            3. Searching based on regular expressions
            4. Searching based on attribute values of a tag
              1. Finding the first primary consumer
              2. Searching based on custom attributes
              3. Searching based on the CSS class
            5. Searching using functions defined
            6. Applying searching methods in combination
        2. Searching with find_all()
          1. Finding all tertiary consumers
          2. Understanding parameters used with find_all()
        3. Searching for Tags in relation
          1. Searching for the parent tags
          2. Searching for siblings
          3. Searching for next
          4. Searching for previous
      2. Using search methods to scrape information from a web page
      3. Quick reference
      4. Summary
    11. 4. Navigation Using Beautiful Soup
      1. Navigation using Beautiful Soup
        1. Navigating down
          1. Using the name of the child tag
          2. Using predefined attributes
            1. The .contents attribute
            2. The .children attribute
            3. The .descendants attribute
          3. Special attributes for navigating down
            1. The .string attribute
            2. The .strings attribute
        2. Navigating up
          1. The .parent attribute
          2. The .parents attribute
        3. Navigating sideways to the siblings
          1. The .next_sibling attribute
          2. The .previous_sibling attribute
        4. Navigating to the previous and next objects parsed
      2. Quick reference
      3. Summary
    12. 5. Modifying Content Using Beautiful Soup
      1. Modifying Tag using Beautiful Soup
        1. Modifying the name property of Tag
        2. Modifying the attribute values of Tag
          1. Updating the existing attribute value of Tag
          2. Adding new attribute values to Tag
        3. Deleting the tag attributes
        4. Adding a new tag
          1. Adding a new producer using new_tag() and append()
          2. Creating a new tag using new_tag()
          3. Adding a new tag using append()
          4. Adding a new div tag to the li tag using insert()
      2. Modifying string contents
        1. Using .string to modify the string content
        2. Adding strings using .append(), insert(), and new_string()
      3. Deleting tags from the HTML document
        1. Deleting the producer using decompose()
        2. Deleting the producer using extract()
        3. Deleting the contents of a tag using Beautiful Soup
      4. Special functions to modify content
      5. Quick reference
      6. Summary
    13. 6. Encoding Support in Beautiful Soup
      1. Encoding in Beautiful Soup
        1. Understanding the original encoding of the HTML document
        2. Specifying the encoding of the HTML document
      2. Output encoding
      3. Quick reference
      4. Summary
    14. 7. Output in Beautiful Soup
      1. Formatted printing
      2. Unformatted printing
      3. Output formatters in Beautiful Soup
        1. The minimal formatter
        2. The html formatter
        3. The None formatter
        4. The function formatter
      4. Using get_text()
      5. Quick reference
      6. Summary
    15. 8. Creating a Web Scraper
      1. Getting book details from PacktPub.com
        1. Finding pages with a list of books
        2. Finding book details
      2. Getting selling prices from Amazon
      3. Getting the selling price from Barnes and Noble
      4. Summary
    16. Index

Product information

  • Title: Getting Started with Beautiful Soup
  • Author(s): Vineeth G. Nair
  • Release date: January 2014
  • Publisher(s): Packt Publishing
  • ISBN: 9781783289554