The Internet, with its profusion of information, has made us hungry for ever more, ever better data. Out of necessity, many of us have become pretty adept with search engine queries, but there are times when even the most powerful search engines aren't enough. If you've ever wanted your data in a different form than it's presented, or wanted to collect data from several sites and see it side by side without the constraints of a browser, then Spidering Hacks is for you.
Spidering Hacks takes you to the next level in Internet data retrieval--beyond search engines--by showing you how to create spiders and bots to retrieve information from your favorite sites and data sources. You'll no longer feel constrained by the way host sites think you want to see their data presented--you'll learn how to scrape and repurpose raw data so you can view it in a way that's meaningful to you.
Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content. By the time you finish Spidering Hacks, you'll be able to:
Aggregate and associate data from disparate locations, then store and manipulate the data as you like
Gain a competitive edge in business by knowing when competitors' products are on sale, and comparing sales ranks and product placement on e-commerce sites
Integrate third-party data into your own applications or web sites
Make your own site easier to scrape and more usable to others
Keep up to date with your favorite comic strips, news stories, stock tips, and more without visiting the sites every day
Like the other books in O'Reilly's popular Hacks series, Spidering Hacks brings you 100 industrial-strength tips and tools from the experts to help you master this technology. If you're interested in data retrieval of any type, this book provides a wealth of data for finding a wealth of data.
Chapter 1 Walking Softly
A Crash Course in Spidering and Scraping
Best Practices for You and Your Spider
Anatomy of an HTML Page
Registering Your Spider
Keeping Your Spider Out of Sticky Situations
Finding the Patterns of Identifiers
Chapter 2 Assembling a Toolbox
Resources You May Find Helpful
Installing Perl Modules
Simply Fetching with LWP::Simple
More Involved Requests with LWP::UserAgent
Adding HTTP Headers to Your Request
Posting Form Data with LWP
Authentication, Cookies, and Proxies
Handling Relative and Absolute URLs
Secured Access and Browser Attributes
Respecting Your Scrapee's Bandwidth
Adding Progress Bars to Your Scripts
Scraping with HTML::TreeBuilder
Parsing with HTML::TokeParser
Scraping with WWW::Mechanize
In Praise of Regular Expressions
Painless RSS with Template::Extract
A Quick Introduction to XPath
Downloading with curl and wget
More Advanced wget Techniques
Using Pipes to Chain Commands
Running Multiple Utilities at Once
Utilizing the Web Scraping Proxy
Being Warned When Things Go Wrong
Being Adaptive to Site Redesigns
Chapter 3 Collecting Media Files
Detective Case Study: Newgrounds
Detective Case Study: iFilm
Downloading Movies from the Library of Congress
Downloading Images from Webshots
Downloading Comics with dailystrips
Archiving Your Favorite Webcams
News Wallpaper for Your Site
Saving Only POP3 Email Attachments
Downloading MP3s from a Playlist
Downloading from Usenet with nget
Chapter 4 Gleaning Data from Databases
Archiving Yahoo! Groups Messages with yahoo2mbox
Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups
Gleaning Buzz from Yahoo!
Spidering the Yahoo! Catalog
Tracking Additions to Yahoo!
Scattersearch with Yahoo! and Google
Yahoo! Directory Mindshare in Google
Weblog-Free Google Results
Spidering, Google, and Multiple Domains
Scraping Amazon.com Product Reviews
Receive an Email Alert for Newly Added Amazon.com Reviews
Scraping Amazon.com Customer Advice
Publishing Amazon.com Associates Statistics
Sorting Amazon.com Recommendations by Rating
Related Amazon.com Products with Alexa
Scraping Alexa's Competitive Data with Java
Finding Album Information with FreeDB and Amazon.com
Expanding Your Musical Tastes
Saving Daily Horoscopes to Your iPod
Graphing Data with RRDTOOL
Stocking Up on Financial Quotes
Super Author Searching
Mapping O'Reilly Best Sellers to Library Popularity
Using All Consuming to Get Book Lists
Tracking Packages with FedEx
Checking Blogs for New Comments
Aggregating RSS and Posting Changes
Using the Link Cosmos of Technorati
Finding Related RSS Feeds
Automatically Finding Blogs of Interest
Scraping TV Listings
What's Your Visitor's Weather Like?
Trendspotting with Geotargeting
Getting the Best Travel Route by Train
Geographic Distance and Back Again
Super Word Lookup
Word Associations with Lexical Freenet
Reformatting Bugtraq Reports
Keeping Tabs on the Web via Email
Publish IE's Favorites to Your Web Site
Spidering GameStop.com Game Prices
Bargain Hunting with PHP
Aggregating Multiple Search Engine Results
Searching the Better Business Bureau
Searching for Health Inspections
Filtering for the Naughties
Chapter 5 Maintaining Your Collections
Using cron to Automate Tasks
Scheduling Tasks Without cron
Mirroring Web Sites with wget and rsync
Accumulating Search Results Over Time
Chapter 6 Giving Back to the World
Using XML::RSS to Repurpose Data
Placing RSS Headlines on Your Site
Making Your Resources Scrapable with Regular Expressions
Making Your Resources Scrapable with a REST Interface
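The toolbox chapter above centers on Perl's LWP and HTML parsers (LWP::Simple, HTML::TokeParser, WWW::Mechanize). As an illustrative stand-in--the book's own examples are Perl, not Python--here is the same fetch-and-extract pattern sketched with Python's standard library, run on an inline HTML snippet so it needs no network access:

```python
from html.parser import HTMLParser

# Inline sample page, standing in for a fetched document so the
# sketch runs without touching the network.
SAMPLE = '<html><body><a href="/a.html">A</a><a href="/b.html">B</a></body></html>'

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, the classic first scraping task."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
parser.feed(SAMPLE)
print(parser.links)  # → ['/a.html', '/b.html']
```

In a real spider, the `SAMPLE` string would come from an HTTP fetch (LWP::Simple's `get()` in the book's Perl, `urllib.request` in Python); the event-driven parse shown here mirrors how HTML::TokeParser walks a token stream.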
Kevin Hemenway, coauthor of Mac OS X Hacks, is better known as Morbus Iff, the creator of disobey.com, which bills itself as "content for the discontented." Publisher and developer of more home cooking than you could ever imagine, he'd love to give you a Fry Pan of Intellect upside the head. Politely, of course. And with love.
Our look is the result of reader comments, our own experimentation, and feedback from distribution channels. Distinctive covers complement our distinctive approach to technical topics, breathing personality and life into potentially dry subjects.
The tool on the cover of Spidering Hacks is a flex scraper. Flex scrapers are sometimes referred to as putty knives or push scrapers. These rugged tools are commonly used for light-duty construction or home projects, such as wallpapering, painting, or woodworking. Flex scrapers are usually three inches wide, with steel blades ground thinner than a typical putty knife's to give maximum flexibility. Thus, they are the perfect choice for applying lighter compounds over broader areas and at a faster rate than putty knives. High-end flex scrapers have ergonomic handles designed to fit the hand and reduce fatigue. Just as a well-designed flex scraper gives improved blade control, so too does a well-designed spidering or scraping hack give greater control and flexibility when gathering information from the Web and automating and speeding complex tasks.
Genevieve d'Entremont was the production editor for Spidering Hacks. Brian Sawyer was the copyeditor. Matt Hutchinson proofread the book. Derek Di Matteo, Marlowe Shaeffer, and Claire Cloutier provided quality control. Julie Hawks wrote the index.
Emma Colby designed the cover of this book, based on a series design by Edie Freedman. The cover image is an original photograph by Emma Colby. Emma Colby produced the cover layout with QuarkXPress 4.1 using Adobe's Helvetica Neue and ITC Garamond fonts. David Futato designed the interior layout. This book was converted from Microsoft Word to FrameMaker 5.5.6 by Andrew Savikas. The text font is Linotype Birka; the heading font is Adobe Helvetica Neue Condensed; and the code font is LucasFont's TheSans Mono Condensed.
The illustrations that appear in the book were produced by Robert Romano and Jessamyn Read using Macromedia FreeHand 9 and Adobe Photoshop 6. This colophon was written by Derek Di Matteo.
Considering this book is from 2003, I have still found it very useful in ramping up my knowledge when working programmatically with web sites. The basics it covers, along with detailed examples, provide an understanding of the core concepts while arming you with plenty of good example scripts. There is more up-to-date information out on the Internet; however, reading this book has helped me find the info I need more efficiently.
That said, it would still be great to see an updated version!
Bottom Line: Yes, I would recommend this to a friend.
Overview: This book is a mashup of scripts that can be used to gather information from a number of resources on the web and put it in a format of your choosing. The book relies heavily on the Perl scripting language. As such, many of these scripts are not very complicated, because they use Perl modules to do the heavy lifting in their various tasks; the scripts become front ends to more complicated processes. As a member of a club, I was given a copy of the book to read and review. My goal in reading the book was to look for ways spidering could be used in the corporate world--specifically, ways I could aggregate data from various reports and management services to provide data in a format that was more useful to me.
Pros: The breadth of scripts is impressive. A reader would be hard-pressed to come up with a scenario that involves getting data off of a website that is not covered in this book at some level. The examples are fairly generic, but the author not only explains how you might use each one in your situation but, in many cases, gives advanced tips and examples that go beyond the basic ideas presented. Since most of the scripts are based on one or more Perl modules, the scripts are fairly simple. A (trained) beginner's level of understanding is all that is necessary to copy a few of these scripts and modify a few key lines to make them work in other situations.
Cons: The book is starting to get a little dated. That being said, the basic technologies that I could identify remain applicable. Still, while aggregating web details is a nice idea, most of the aggregation ideas it suggests have already been implemented by someone somewhere on the web. If you have a particular need, I would suggest a serious Google search for a turnkey solution before embarking on one of these projects.
The main issue I have with this book is the flip side of one of its strengths. By using Perl, most of the complicated work is done in the background by modules that are hidden from the view of the user. If you are planning to use any other language, you are suddenly faced with not only translating the basic script functionality presented in the book, but dissecting and replicating the Perl modules as well. Experienced programmers can figure out how to replicate these modules in other scripting languages, but that is a fairly advanced task. It turns a fairly simple-to-moderate project into a daunting one.
As I mentioned above, my focus was using these techniques in a corporate environment. Since I deal exclusively with MS OSes, I use PowerShell as my script language of choice. I was able to use PowerShell to replicate one of the more basic ideas, but due to the differences in code, I basically had to start from scratch. Beyond this one project, the task of replicating many of the Perl services has been too complicated for me to do in my limited time. I was looking for quick and simple ways of repackaging data, and that is not what I found, given the code-language translation issues.
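The reviewer's point about hidden heavy lifting is worth making concrete: even a one-line LWP::Simple `get()` wraps chores like composing the raw HTTP request, which a port to another language must redo by hand. A minimal Python sketch of just that first chore (the host and path here are placeholders, not from the book):

```python
def build_get_request(host, path):
    """Compose the raw HTTP/1.1 GET request that a high-level
    fetch call assembles for you behind the scenes."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "User-Agent: example-spider/0.1\r\n"
        "Connection: close\r\n"
        "\r\n"
    )

request = build_get_request("www.example.com", "/index.html")
print(request.splitlines()[0])  # → GET /index.html HTTP/1.1
```

And this is only the request line and headers; a full replacement for a fetch module must also open the socket, follow redirects, and handle chunked responses--the invisible work the reviewer had to recreate in PowerShell.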
Bottom Line: Yes, I would recommend this to a friend.
I enjoy the Hacks series a load! The toys you can use immediately are great fun. I immediately borrowed the idea from the "Automatically Finding Blogs of Interest" hack and modified it to find "friends of friends" for a blog-happy girlfriend.
What I liked most about the book is that it really broadened my Perl horizons--especially the "Assembling a Toolbox" chapter, a great start to using some Perl modules that help you get the job done fast.
As someone who has built a variety of spiders and scrapers, I appreciated the insight from the authors, and I appreciate finding the info in a concise, condensed reference--something unknown to builders (and would-be builders) of crawlers in the past.
4.5 stars (on a 5-star scale). This book is not perfect; the authors may have tried to cover too much material. The material is very time-sensitive, and hence the book needed to be rushed together; it will have little value in five years. I wanted to give the book a higher rating, and I tried to think of a better way to present the material in 400 pages, but couldn't. There are just too many rough edges for a 5-star book.
As a member of O'Reilly's "Hacks" series, "Spidering Hacks" is different from the typical O'Reilly book. This book presents breadth of topic rather than depth. The format is 100 hacks (mostly Perl on Linux, with the odd Python, Java, or Windows hack), some written by Hemenway and Calishain and many by guest authors, organized into six chapters. The number of authors leads to a variety of styles in both English and Perl. If you treat the book as a super magazine (time-sensitive short articles), you won't be disappointed.
Chapter 1 – Walking Softly (Hacks 1-7)
Chapter 1 provides general guidelines on spider/scraper etiquette and good practices, which the rest of the book seems to ignore.
Chapter 2 – Assembling a Toolbox (Hacks 8-32)
An overview of several modules and techniques with working examples. More experienced Perl mongers may find this material remedial.
Chapter 3 – Collecting Media Files (Hacks 33-42)
The hacks on POP3 attachments and Usenet may be worth the price of the book for those trying to solve a particular problem.
Chapter 4 – Gleaning Data from Databases (Hacks 43-89)
Over half the book is dedicated to this chapter. Initially these appear to be very specific solutions for a narrow audience, but closer reading reveals a variety of techniques that can be used in many circumstances.
Chapter 5 – Maintaining Your Collections (Hacks 90-93)
Not much here; cron is covered much better in other works.
Chapter 6 – Giving Back to the World (Hacks 94-100)
Essentially how to be nice to spiders. Why Net::AIM is covered here seems arbitrary. Hack #100 "Going beyond the book" is nothing but fluff.
An example of how I used the book may be illustrative. I wanted to scrape TV listings, but hack #73, "Scraping TV Listings," has been made obsolete by a modification to tvguide.com. I was able to quickly use the toolkit presented in Chapter 2 to scrape one of the many other web sites with TV listings. I expect this to be typical: sites change, and spiders and scrapers need to adapt.
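One habit that keeps that kind of adaptation cheap is isolating the site-specific pattern in a single, clearly marked spot, so a redesign means editing one line rather than rewriting the script. A hypothetical sketch in Python (the listing markup shown is invented, not tvguide.com's):

```python
import re

# Site-specific knowledge lives in exactly one place. When the site
# redesigns, only this pattern needs updating; the rest of the
# script is untouched. (The markup format here is hypothetical.)
LISTING_PATTERN = re.compile(r'<td class="show">([^<]+)</td>')

# Inline sample standing in for a fetched listings page.
SAMPLE = ('<tr><td class="show">Evening News</td></tr>'
          '<tr><td class="show">Late Movie</td></tr>')

shows = LISTING_PATTERN.findall(SAMPLE)
print(shows)  # → ['Evening News', 'Late Movie']
```

Switching to a different listings site then means swapping one regex, which is exactly the "modify a few key lines" experience the reviewers describe.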
"Spidering Hacks" is an odd collection of articles that covers the remedial to intermediate skill range. Nobody will benefit from all 100 hacks, but most of us will find $24.95 of value in the hacks that cause us to go "How cool!"
I have been trying to find a Java book that offered me tips and tricks on how to scrape the Internet, glean the tastiest bits of it, and put them to good use. I ran across "Spidering Hacks", by Kevin Hemenway and Tara Calishain, which was exactly what I wanted--only its base language is Perl.
To my delight, the authors' writing is so lucid, their support and encouragement so welcome, and their examples so closely matched to my needs that I immediately picked up this book and dove headlong into the vast and beautiful world that is Perl.
Despite my preference for programming in Java for Internet-related tasks, I highly recommend this book, even for those unfamiliar with the Perl programming language, as it is written so well that you can get up and running purely on the strength of the authors' talents. I am very impressed with this book.
Excellent job in explaining real-world solutions for spidering, scraping, and manipulating data. I have educated the Internet community about the positive benefits of bots for years, and this book does an extraordinary job of giving industrial-strength tips, tools, and hacks, highlighted in an easy-to-understand format with concrete step-by-step instructions on the code, running the hack, and hacking the hack. Great job, Kevin and Tara!