Table of Contents

  1. Chapter 1 Walking Softly

    1. Hacks #1-7

    2. A Crash Course in Spidering and Scraping

    3. Best Practices for You and Your Spider

    4. Anatomy of an HTML Page

    5. Registering Your Spider

    6. Preempting Discovery

    7. Keeping Your Spider Out of Sticky Situations

    8. Finding the Patterns of Identifiers

  2. Chapter 2 Assembling a Toolbox

    1. Hacks #8-32

    2. Perl Modules

    3. Resources You May Find Helpful

    4. Installing Perl Modules

    5. Simply Fetching with LWP::Simple

    6. More Involved Requests with LWP::UserAgent

    7. Adding HTTP Headers to Your Request

    8. Posting Form Data with LWP

    9. Authentication, Cookies, and Proxies

    10. Handling Relative and Absolute URLs

    11. Secured Access and Browser Attributes

    12. Respecting Your Scrapee's Bandwidth

    13. Respecting robots.txt

    14. Adding Progress Bars to Your Scripts

    15. Scraping with HTML::TreeBuilder

    16. Parsing with HTML::TokeParser

    17. WWW::Mechanize 101

    18. Scraping with WWW::Mechanize

    19. In Praise of Regular Expressions

    20. Painless RSS with Template::Extract

    21. A Quick Introduction to XPath

    22. Downloading with curl and wget

    23. More Advanced wget Techniques

    24. Using Pipes to Chain Commands

    25. Running Multiple Utilities at Once

    26. Utilizing the Web Scraping Proxy

    27. Being Warned When Things Go Wrong

    28. Being Adaptive to Site Redesigns

  3. Chapter 3 Collecting Media Files

    1. Hacks #33-42

    2. Detective Case Study: Newgrounds

    3. Detective Case Study: iFilm

    4. Downloading Movies from the Library of Congress

    5. Downloading Images from Webshots

    6. Downloading Comics with dailystrips

    7. Archiving Your Favorite Webcams

    8. News Wallpaper for Your Site

    9. Saving Only POP3 Email Attachments

    10. Downloading MP3s from a Playlist

    11. Downloading from Usenet with nget

  4. Chapter 4 Gleaning Data from Databases

    1. Hacks #43-89

    2. Archiving Yahoo! Groups Messages with yahoo2mbox

    3. Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups

    4. Gleaning Buzz from Yahoo!

    5. Spidering the Yahoo! Catalog

    6. Tracking Additions to Yahoo!

    7. Scattersearch with Yahoo! and Google

    8. Yahoo! Directory Mindshare in Google

    9. Weblog-Free Google Results

    10. Spidering, Google, and Multiple Domains

    11. Scraping Product Reviews

    12. Receive an Email Alert for Newly Added Reviews

    13. Scraping Customer Advice

    14. Publishing Associates Statistics

    15. Sorting Recommendations by Rating

    16. Related Products with Alexa

    17. Scraping Alexa's Competitive Data with Java

    18. Finding Album Information with FreeDB and

    19. Expanding Your Musical Tastes

    20. Saving Daily Horoscopes to Your iPod

    21. Graphing Data with RRDTOOL

    22. Stocking Up on Financial Quotes

    23. Super Author Searching

    24. Mapping O'Reilly Best Sellers to Library Popularity

    25. Using All Consuming to Get Book Lists

    26. Tracking Packages with FedEx

    27. Checking Blogs for New Comments

    28. Aggregating RSS and Posting Changes

    29. Using the Link Cosmos of Technorati

    30. Finding Related RSS Feeds

    31. Automatically Finding Blogs of Interest

    32. Scraping TV Listings

    33. What's Your Visitor's Weather Like?

    34. Trendspotting with Geotargeting

    35. Getting the Best Travel Route by Train

    36. Geographic Distance and Back Again

    37. Super Word Lookup

    38. Word Associations with Lexical Freenet

    39. Reformatting Bugtraq Reports

    40. Keeping Tabs on the Web via Email

    41. Publish IE's Favorites to Your Web Site

    42. Spidering Game Prices

    43. Bargain Hunting with PHP

    44. Aggregating Multiple Search Engine Results

    45. Robot Karaoke

    46. Searching the Better Business Bureau

    47. Searching for Health Inspections

    48. Filtering for the Naughties

  5. Chapter 5 Maintaining Your Collections

    1. Hacks #90-93

    2. Using cron to Automate Tasks

    3. Scheduling Tasks Without cron

    4. Mirroring Web Sites with wget and rsync

    5. Accumulating Search Results Over Time

  6. Chapter 6 Giving Back to the World

    1. Hacks #94-100

    2. Using XML::RSS to Repurpose Data

    3. Placing RSS Headlines on Your Site

    4. Making Your Resources Scrapable with Regular Expressions

    5. Making Your Resources Scrapable with a REST Interface

    6. Making Your Resources Scrapable with XML-RPC

    7. Creating an IM Interface

    8. Going Beyond the Book

  7. Colophon