Spidering Hacks
100 Industrial-Strength Tips & Tools
Publisher: O'Reilly Media
Final Release Date: October 2003
Pages: 428

The Internet, with its profusion of information, has made us hungry for ever more, ever better data. Out of necessity, many of us have become pretty adept with search engine queries, but there are times when even the most powerful search engines aren't enough. If you've ever wanted your data in a different form than it's presented, or wanted to collect data from several sites and see it side-by-side without the constraints of a browser, then Spidering Hacks is for you.

Spidering Hacks takes you to the next level in Internet data retrieval--beyond search engines--by showing you how to create spiders and bots to retrieve information from your favorite sites and data sources. You'll no longer feel constrained by the way host sites think you want to see their data presented--you'll learn how to scrape and repurpose raw data so you can view it in a way that's meaningful to you.

Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content. By the time you finish Spidering Hacks, you'll be able to:

  • Aggregate and associate data from disparate locations, then store and manipulate the data as you like
  • Gain a competitive edge in business by knowing when competitors' products are on sale, and comparing sales ranks and product placement on e-commerce sites
  • Integrate third-party data into your own applications or web sites
  • Make your own site easier to scrape and more usable to others
  • Keep up-to-date with your favorite comic strips, news stories, stock tips, and more without visiting the site every day
Like the other books in O'Reilly's popular Hacks series, Spidering Hacks brings you 100 industrial-strength tips and tools from the experts to help you master this technology. If you're interested in data retrieval of any type, this book provides a wealth of data for finding a wealth of data.
Customer Reviews

Average rating: 4.1 (based on 7 reviews)

Ratings Distribution

  • 5 Stars (2)
  • 4 Stars (4)
  • 3 Stars (1)
  • 2 Stars (0)
  • 1 Star (0)

 
4.0

Dated, But Still Relevant and Helpful

By LK the Web Mistress :-)

from Central New Jersey

About Me Developer, Sys Admin

Verified Buyer

Pros

  • Accurate
  • Helpful examples
  • Well-written

    Best Uses

    • Intermediate

    Comments about oreilly Spidering Hacks:

    Considering this book is from 2003, I have still found it very useful in ramping up my knowledge when working programmatically with web sites. The basics included, as well as detailed examples, help provide an understanding of the basic concepts, while arming you with plenty of good example scripts. There is more updated information out on the Internet; however, reading this book has helped me find the info I need more efficiently.

    That said, it would still be great to see an updated version!!!

    (1 of 1 customers found this review helpful)

     
    3.0

    Not as helpful as I would like

    By garthm9

    from Atlanta, GA

    About Me Sys Admin

    Verified Reviewer

    Pros

    • Accurate
    • Concise
    • Easy to understand
    • Helpful examples

      Best Uses

      • Expert
      • Intermediate

      Comments about oreilly Spidering Hacks:

      Overview:
      This book is a mashup of scripts that can be used to gather information from a number of resources on the web and put them in a format of your choosing. The book relies heavily on the Perl scripting language. As such, many of these scripts are not very complicated because they use Perl modules to do the heavy lifting in their various tasks, so the scripts become front-ends to more complicated processes. As a member of a club, I was given a copy of the book to read and review. My goal in reading the book was to look for ways spidering could be used in the corporate world. Specifically, I wanted ways to aggregate data from various reports and management services and present it in a format that was more useful to me.

      Pros:
      The breadth of scripts is impressive. A reader would be hard pressed to come up with a scenario that involves getting data off of a website that is not covered in this book at some level. The examples are fairly generic, but the author not only explains how you might use it in your situation, but in many cases, the author gives advanced tips and examples that go beyond the basic ideas that he presented. Since most of the scripts are based on one or more Perl modules, the scripts are fairly simple. A (trained) beginner's level of understanding is all that is necessary to copy a few of these scripts and modify a few key lines to make it work in other situations.

      Cons:
      The book is starting to get a little dated. That being said, the basic technologies that I could identify remain applicable. Still, while aggregating web details is a nice idea, most of the aggregating suggestions they used have already been done by someone somewhere on the web. If you have a particular need, I would suggest a serious Google search for a turnkey solution before embarking on one of these projects. The main issue I have with this book is the flip side of one of its strengths. By using Perl, most of the complicated work is done in the background by modules that are hidden from the view of the user. If you are planning to use any other language, you are suddenly faced with not only translating the basic script functionality presented in the book, but also dissecting and replicating the Perl modules. Experienced programmers can figure out how to replicate these modules in other scripting languages, but that is a fairly advanced task. This turns a fairly simple to moderate project into a daunting one. As I mentioned above, my focus was using these techniques in a corporate environment. Since I deal exclusively with MS OSes, I use PowerShell as my script language of choice. I was able to use PowerShell to replicate one of the more basic ideas, but due to the differences in code, I basically had to start from scratch. Beyond this one project, the task of replicating many of the Perl services has been too complicated for me to do in my limited time. I was looking for quick and simple ways of repackaging data, and that is not what I found, given the code-language translation issues.

       
      5.0

      Good book

      By garyamort

      from Undisclosed

      Comments about oreilly Spidering Hacks:

      Gives a lot of great ideas for spidering. Emphasis is on Perl, with some occasional diversions into other languages for specific functions.

      Personally, I'd prefer either a broader mix of languages, or restriction to one language. Still, overall a great book to give you a lot of ideas.

       
      4.0

      Spidering Hacks Review

      By Doug Smith

      from Undisclosed

      Comments about oreilly Spidering Hacks:

      I enjoy the hacks series a load! The toys you can use immediately are great fun. I immediately borrowed the idea from the "automatically find blogs of your interest" chapter, and modified it to find "friends of friends" for a blog-happy girlfriend.

      What I liked most about the book is that it really broadened my Perl horizons, especially the section "building a toolkit". A great start to using some Perl modules that help you get the job done -fast-.

      Being someone who has built a variety of spiders/scrapers, I appreciated the insight from the authors, and appreciate finding the info in a concise, condensed reference... something unknown to builders (and would-be builders) of crawlers in the past.

       
      4.0

      Spidering Hacks Review

      By Bill Day

      from Undisclosed

      Comments about oreilly Spidering Hacks:

      Spidering Hacks

      Authors: Kevin Hemenway & Tara Calishain

      Publisher: O'Reilly & Associates

      Price: $24.95

      Pages: 402

      Web site:

      Reviewed by Bill Day,

      Grand Rapids (Michigan) PerlMongers

      4.5 stars (5 star scale). This book is not perfect; the authors may have tried to cover too much material. The material is very time-sensitive, hence the book needed to be rushed together, and it will have little value in 5 years. I wanted to give the book a higher rating; I tried to think of a better way to present the material in 400 pages and couldn't. There are just too many rough edges for a 5 star book.

      As a member of O'Reilly's "Hacks" series, "Spidering Hacks" is different than the typical O'Reilly book. This book presents breadth of topic rather than depth. The format is 100 hacks (mostly Perl on Linux with an odd Python, Java, or Windows hack), some written by Hemenway & Calishain and many written by guest authors, organized into 6 chapters. The number of authors leads to a variety of styles in both English and Perl. If you treat the book as a super magazine (time-sensitive short articles), you won't be disappointed.

      Chapter 1 – Walking Softly (Hacks 1-7)

      Chapter 1 provides general guidelines on spider/scraper etiquette and good practices, which the rest of the book seems to ignore.

      Chapter 2 – Assembling a toolkit (Hacks 8-32)

      An overview of several modules and techniques with working examples. More experienced Perl mongers may find this material remedial.

      Chapter 3 – Collecting media files (Hacks 33-42)

      The hacks on POP3 attachments and Usenet may be worth the price of the book for those trying to solve a particular problem.

      Chapter 4 – Gleaning data from databases (Hacks 43-89)

      Over 1/2 the book is dedicated to this chapter. Initially it appears that these are very specific solutions for a narrow audience. Closer reading reveals a variety of techniques that can be used in many circumstances.

      Chapter 5 – Maintaining your collections (Hacks 90-93)

      Not much here. Cron is covered much better in other works.

      Chapter 6 – Giving back to the world (Hacks 94-100)

      Essentially how to be nice to spiders. Why Net::AIM is covered here seems arbitrary. Hack #100 "Going beyond the book" is nothing but fluff.

      An example of how I used the book may be illustrative. I wanted to scrape TV listings, but hack #73 "Scraping TV Listings" has been made obsolete by a modification to tvguide.com. I was able to quickly use the toolkit presented in chapter 2 to scrape one of the many other web sites with TV listings. I expect this to be typical: sites change, and spiders and scrapers need to adapt.

      Spidering Hacks is an odd collection of articles that seem to cover the remedial to intermediate skill ranges. Nobody will benefit from all 100 hacks, but most of us will find $24.95 of value in the hacks that cause us to go "How cool!".

       
      4.0

      Spidering Hacks Review

      By Mike Sipin

      from Undisclosed

      Comments about oreilly Spidering Hacks:

      I have been trying to find a Java book that offered me tips and tricks on how to scrape the Internet, glean the most tasty bits of it, and put them to good use. I ran across "Spidering Hacks", by Kevin Hemenway and Tara Calishain, which was exactly what I wanted - only its base language is Perl.

      To my delight, the authors' writing is so lucid, their support and encouragement so welcome, and their examples so closely matched to my needs - that I immediately picked up this book, and dove headlong into the vast and beautiful world that is Perl.

      Despite my preference for programming in Java for Internet-related tasks, I highly recommend this book, even for those unfamiliar with the Perl programming language, as this book is written so well that you can get up and running purely on the strength of the authors' talents. I am very impressed with this book.

      Kudos to the authors.

       
      5.0

      Spidering Hacks Review

      By Marcus P. Zillman, M.S., A.M.H.A.

      from Undisclosed

      Comments about oreilly Spidering Hacks:

      Excellent job in explaining the real-world solutions to data spidering, scraping and manipulation of the data. I have educated the Internet community about the positive benefits of bots for years, and this book does an extraordinary job of giving industrial-strength tips, tools and hacks highlighted in an easy-to-understand format with concrete step-by-step instructions on the code, running the hack and hacking the hack. Great job Kevin and Tara!

      Buying Options
      Ebook: $23.99
      Formats:  DAISY, PDF
      Print & Ebook: $32.99
      Print: $29.99