Spidering Hacks

Book description

The Internet, with its profusion of information, has made us hungry for ever more, ever better data. Out of necessity, many of us have become pretty adept with search engine queries, but there are times when even the most powerful search engines aren't enough. If you've ever wanted your data in a different form than it's presented, or wanted to collect data from several sites and see it side-by-side without the constraints of a browser, then Spidering Hacks is for you.

Spidering Hacks takes you to the next level in Internet data retrieval--beyond search engines--by showing you how to create spiders and bots to retrieve information from your favorite sites and data sources. You'll no longer feel constrained by the way host sites think you want to see their data presented--you'll learn how to scrape and repurpose raw data so you can view it in a way that's meaningful to you.

Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content. By the time you finish Spidering Hacks, you'll be able to:

  • Aggregate and associate data from disparate locations, then store and manipulate the data as you like
  • Gain a competitive edge in business by knowing when competitors' products are on sale, and comparing sales ranks and product placement on e-commerce sites
  • Integrate third-party data into your own applications or web sites
  • Make your own site easier to scrape and more usable to others
  • Keep up-to-date with your favorite comic strips, news stories, stock tips, and more without visiting the site every day

Like the other books in O'Reilly's popular Hacks series, Spidering Hacks brings you 100 industrial-strength tips and tools from the experts to help you master this technology. If you're interested in data retrieval of any type, this book provides a wealth of data for finding a wealth of data.

Table of contents

  1. A Note Regarding Supplemental Files
  2. Credits
    1. About the Authors
    2. Contributors
      1. Acknowledgments
        1. Kevin
        2. Tara
  3. Preface
    1. Why Spidering Hacks?
    2. How This Book Is Organized
    3. How to Use This Book
    4. Conventions Used in This Book
    5. How to Contact Us
    6. Got a Hack?
  4. 1. Walking Softly
    1. Hacks #1-7
    2. Hack #1. A Crash Course in Spidering and Scraping
      1. Why Spider?
    3. Hack #2. Best Practices for You and Your Spider
      1. Be Liberal in What You Accept
      2. Don’t Limit Your Dataset
      3. Don’t Reinvent the Wheel
      4. Best Practices for You
        1. Choose the most structured format available
        2. If you must scrape HTML, do so sparingly
        3. Use the right tool for the job
        4. Don’t go where you’re not wanted
        5. Choose a good identifier
        6. Make information on your spider readily available
        7. Don’t demand unlimited site access or support
      5. Best Practices for Your Spider
        1. Respect robots.txt
        2. Go light on the bandwidth
        3. Take just enough, and don’t take too often
    4. Hack #3. Anatomy of an HTML Page
      1. Anatomy of an HTML Page
      2. Header Information with the H Tags
      3. List Information with Special HTML Tags
      4. Non-HTML Files
    5. Hack #4. Registering Your Spider
      1. Naming Your Spider
      2. A Web Page About Your Spider
      3. Places to Register Your Spider
    6. Hack #5. Preempting Discovery
      1. Making Contact
      2. Making the Arguments for Your Spider
      3. Making Your Spider Easy to Find and Learn About
      4. Considering Legal Issues
    7. Hack #6. Keeping Your Spider Out of Sticky Situations
      1. Bad Spider, No Biscuit!
      2. Violating Copyright
      3. Aggregating Data
      4. Competitive Intelligence
      5. Possible Consequences of Misbehaving Spiders
      6. Tracking Legal Issues
    8. Hack #7. Finding the Patterns of Identifiers
      1. Arbitrary Classification Systems Within a Collection
      2. Classification Systems that Use an Established Universal Taxonomy Within a Collection
      3. Classification Systems that Identify Documents Across a Wide Number of Collections
      4. Some Large Collections with ID Numbers
  5. 2. Assembling a Toolbox
    1. Hacks #8-32
    2. Perl Modules
    3. Resources You May Find Helpful
    4. Hack #8. Installing Perl Modules
      1. Example: Installing LWP
        1. Unix and Mac OS X installation via CPAN
        2. Unix and Mac OS X installation by hand
        3. Windows installation via PPM
    5. Hack #9. Simply Fetching with LWP::Simple
    6. Hack #10. More Involved Requests with LWP::UserAgent
    7. Hack #11. Adding HTTP Headers to Your Request
    8. Hack #12. Posting Form Data with LWP
    9. Hack #13. Authentication, Cookies, and Proxies
      1. Authentication
      2. Enabling Cookies
      3. Using Proxies
    10. Hack #14. Handling Relative and Absolute URLs
    11. Hack #15. Secured Access and Browser Attributes
      1. Other Browser Attributes
    12. Hack #16. Respecting Your Scrapee’s Bandwidth
      1. If-Modified-Since
      2. ETags
      3. Compressed Data
    13. Hack #17. Respecting robots.txt
    14. Hack #18. Adding Progress Bars to Your Scripts
      1. The Code
    15. Hack #19. Scraping with HTML::TreeBuilder
      1. Hacking the Hack
    16. Hack #20. Parsing with HTML::TokeParser
      1. The Code
      2. Running the Hack
      3. See Also
    17. Hack #21. WWW::Mechanize 101
      1. Introducing WWW::Mechanize
      2. Using Mech’s Navigation Tools
      3. The Code
      4. Running the Hack
    18. Hack #22. Scraping with WWW::Mechanize
      1. The Code
      2. Running the Hack
    19. Hack #23. In Praise of Regular Expressions
      1. Using Modules to Parse HTML
      2. Watching the Printers: Score One for Regular Expressions
      3. The Code
      4. Not Fragile, but Probably Not Permanent Either
    20. Hack #24. Painless RSS with Template::Extract
    21. Hack #25. A Quick Introduction to XPath
      1. Using LibXML’s xmllint
      2. The Code
      3. Running the Hack
    22. Hack #26. Downloading with curl and wget
    23. Hack #27. More Advanced wget Techniques
    24. Hack #28. Using Pipes to Chain Commands
      1. Browsing for Links with lynx
      2. grepping for Patterns
      3. wgetting the Files
      4. Hacking the Hack
    25. Hack #29. Running Multiple Utilities at Once
      1. Shell Scripts
      2. Perl Equivalence
    26. Hack #30. Utilizing the Web Scraping Proxy
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    27. Hack #31. Being Warned When Things Go Wrong
    28. Hack #32. Being Adaptive to Site Redesigns
  6. 3. Collecting Media Files
    1. Hacks #33-42
    2. Hack #33. Detective Case Study: Newgrounds
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    3. Hack #34. Detective Case Study: iFilm
      1. The Code
      2. Running the Hack
    4. Hack #35. Downloading Movies from the Library of Congress
      1. Directory Indexes
      2. An Example: Origins of American Animation
      3. Another Example: America at Work, America at Leisure
    5. Hack #36. Downloading Images from Webshots
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
        1. Starting on a given page
        2. Downloading from other areas
        3. Modifying filenames
        4. Bypassing the adult content warning
    6. Hack #37. Downloading Comics with dailystrips
      1. Getting the Code
      2. Running the Hack
      3. Hacking the Hack
        1. Defining strips by URL
        2. Finding strips with a search
        3. Gathering strips into a group
    7. Hack #38. Archiving Your Favorite Webcams
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    8. Hack #39. News Wallpaper for Your Site
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
        1. Picture limits
        2. RSS version
        3. Image::Size
    9. Hack #40. Saving Only POP3 Email Attachments
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
        1. Changing the hardcoded file extensions
        2. Shortening or eliminating the subject line
        3. Saving attachments to the current directory
        4. Specifying the size of saved messages
    10. Hack #41. Downloading MP3s from a Playlist
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    11. Hack #42. Downloading from Usenet with nget
  7. 4. Gleaning Data from Databases
    1. Hacks #43-89
    2. Hack #43. Archiving Yahoo! Groups Messages with yahoo2mbox
      1. Running the Hack
      2. Hacking the Hack
    3. Hack #44. Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    4. Hack #45. Gleaning Buzz from Yahoo!
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    5. Hack #46. Spidering the Yahoo! Catalog
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
      4. See Also
    6. Hack #47. Tracking Additions to Yahoo!
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    7. Hack #48. Scattersearch with Yahoo! and Google
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    8. Hack #49. Yahoo! Directory Mindshare in Google
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    9. Hack #50. Weblog-Free Google Results
      1. The Code
      2. Hacking the Hack
    10. Hack #51. Spidering, Google, and Multiple Domains
      1. Example: Top 20 Searching on Google
      2. The Code
      3. Running the Hack
      4. Hacking the Hack
    11. Hack #52. Scraping Amazon.com Product Reviews
      1. The Code
      2. Running the Hack
      3. See Also
    12. Hack #53. Receive an Email Alert for Newly Added Amazon.com Reviews
      1. The Code
      2. Running the Hack
      3. See Also
    13. Hack #54. Scraping Amazon.com Customer Advice
      1. The Code
      2. Running the Hack
      3. See Also
    14. Hack #55. Publishing Amazon.com Associates Statistics
      1. The Code
      2. Running the Hack
      3. See Also
    15. Hack #56. Sorting Amazon.com Recommendations by Rating
      1. The Code
      2. Running the Hack
      3. See Also
    16. Hack #57. Related Amazon.com Products with Alexa
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    17. Hack #58. Scraping Alexa’s Competitive Data with Java
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    18. Hack #59. Finding Album Information with FreeDB and Amazon.com
      1. Getting Started
      2. Checking Your Disc ID
      3. Digging Up the FreeDB Details
      4. Rocking with Amazon.com
      5. Presenting the Results
      6. Hacking the Hack
    19. Hack #60. Expanding Your Musical Tastes
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
        1. Changing the number of results returned
        2. Looking up artists
      4. See Also
    20. Hack #61. Saving Daily Horoscopes to Your iPod
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
      4. See Also
    21. Hack #62. Graphing Data with RRDTOOL
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    22. Hack #63. Stocking Up on Financial Quotes
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    23. Hack #64. Super Author Searching
      1. Gathering Tools
      2. Hacking the Library of Congress
      3. Perusing Project Gutenberg
      4. Navigating the Amazon
      5. Presenting the Results
      6. Running the Hack
      7. Hacking the Hack
    24. Hack #65. Mapping O’Reilly Best Sellers to Library Popularity
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    25. Hack #66. Using All Consuming to Get Book Lists
      1. The SOAP Code
        1. Most-mentioned lists
        2. Personal book lists
        3. Book metadata and weblog mentions
        4. Friends and recommendations
      2. The REST Code
        1. Most-mentioned lists
        2. Personal book lists
        3. Book metadata and weblog mentions
        4. Friends and recommendations
      3. Running the Hack
      4. The XML Results
      5. Hacking the Hack
    26. Hack #67. Tracking Packages with FedEx
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    27. Hack #68. Checking Blogs for New Comments
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    28. Hack #69. Aggregating RSS and Posting Changes
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
      4. See Also
    29. Hack #70. Using the Link Cosmos of Technorati
      1. Need Some REST?
      2. A Skeleton Key for Words
    30. Hack #71. Finding Related RSS Feeds
      1. Filling Up the Toolbox
      2. Getting the Dirt on Feeds
      3. Reporting on Our Findings
      4. Hacking the Hack
    31. Hack #72. Automatically Finding Blogs of Interest
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    32. Hack #73. Scraping TV Listings
      1. The Code
      2. Running the Hack
    33. Hack #74. What’s Your Visitor’s Weather Like?
      1. The Code
      2. Running the Hack
      3. Using and Hacking the Hack
    34. Hack #75. Trendspotting with Geotargeting
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    35. Hack #76. Getting the Best Travel Route by Train
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    36. Hack #77. Geographic Distance and Back Again
      1. The Latitude/Longitude Question
      2. Hacking the Latitude Out of MapPoint
      3. The Code
      4. Running the Hack
      5. Hacking the Hack
    37. Hack #78. Super Word Lookup
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
        1. Using specific dictionaries
        2. Clarifying the thesaurus
    38. Hack #79. Word Associations with Lexical Freenet
      1. The Code
      2. Running the Hack
    39. Hack #80. Reformatting Bugtraq Reports
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    40. Hack #81. Keeping Tabs on the Web via Email
      1. Planning for Change
      2. Calling In Outside Help
      3. Send Out the News
      4. Hacking the Hack
    41. Hack #82. Publish IE’s Favorites to Your Web Site
      1. IE’s Favorites
      2. What It Does and How It Works
      3. The Code
      4. Running the Hack
      5. Hacking the Hack
    42. Hack #83. Spidering GameStop.com Game Prices
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
        1. GameStop by keyword
        2. Putting the results in a different format
    43. Hack #84. Bargain Hunting with PHP
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    44. Hack #85. Aggregating Multiple Search Engine Results
      1. The Code
      2. Running the Hack
    45. Hack #86. Robot Karaoke
      1. The Code
      2. Running the Hack
    46. Hack #87. Searching the Better Business Bureau
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    47. Hack #88. Searching for Health Inspections
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
    48. Hack #89. Filtering for the Naughties
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
  8. 5. Maintaining Your Collections
    1. Hacks #90-93
    2. Hack #90. Using cron to Automate Tasks
      1. See Also
    3. Hack #91. Scheduling Tasks Without cron
      1. Do You Really Need Anything cron-Like?
      2. Running Scripts on the Client Side
      3. Using Perl’s sleep Function
      4. Scheduling with Something Besides cron
      5. Using Hosted cron Services
    4. Hack #92. Mirroring Web Sites with wget and rsync
      1. Mirroring via the Web
      2. Mirroring Directly with the Server
      3. Hacking the Hack
    5. Hack #93. Accumulating Search Results Over Time
      1. The Code
      2. Running the Hack
      3. Hacking the Hack
      4. See Also
  9. 6. Giving Back to the World
    1. Hacks #94-100
    2. Hack #94. Using XML::RSS to Repurpose Data
      1. See Also
    3. Hack #95. Placing RSS Headlines on Your Site
      1. The Code
      2. Running the Hack
    4. Hack #96. Making Your Resources Scrapable with Regular Expressions
      1. The Challenge of Web Scraping
        1. Navigating between web resources
        2. Extracting specific information
      2. How to Be Nicer to Scrapers
        1. Make resources easier to locate and acquire
        2. Making data easier to extract
      3. Hacking the Hack
    5. Hack #97. Making Your Resources Scrapable with a REST Interface
      1. Navigating One URI at a Time
      2. Negotiating Better Content
      3. See Also
    6. Hack #98. Making Your Resources Scrapable with XML-RPC
      1. Enter Web Services
        1. Building the service
        2. Making the service useful
        3. Using the service from the client side
        4. Hacking a scrape together with a service
      2. Hacking the Hack
    7. Hack #99. Creating an IM Interface
      1. The Code
      2. Running the Hack
    8. Hack #100. Going Beyond the Book
      1. Using Google and Other Search Engines
      2. Mailing Lists
      3. Web Sites
  10. Index
  11. About the Authors
  12. Colophon
  13. Copyright

Product information

  • Title: Spidering Hacks
  • Author(s): Morbus Iff, Tara Calishain
  • Release date: October 2003
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491951675