Spidering Hacks
| Spidering Hacks | |
| author | Kevin Hemenway and Tara Calishain | 
| pages | 402 | 
| publisher | O'Reilly | 
| rating | 8 | 
| reviewer | Jeff Martin | 
| ISBN | 0596005776 | 
| summary | A wide-ranging collection of hacks detailing how to be more productive in Internet research and data retrieval | 
Introduction
Spidering Hacks (SH), by Kevin Hemenway and Tara Calishain, is a practical guide to performing Internet research that goes beyond a simple Google search. SH demonstrates how scripting and other techniques can increase the power and efficiency of your Internet searching, letting the computer obtain the data and leaving the user free to spend more time on analysis.
SH's language of choice is Perl, and while there are a few guest appearances by Java and Python, some basic Perl fluency will serve the reader well in reading the hacks' source code. However, regardless of your language preference, SH is still a useful resource. The authors discuss ethics and guidelines for writing polite and properly behaved spiders, as well as the concepts and reasoning behind the scripts they present. For this reason, non-Perl coders can still learn a lot of useful tips that will help them with their own projects.
Overview
Chapter 1, "Walking Softly," covers the basics of spiders and scrapers, and includes tips on proper etiquette for Web robots as well as some resources for identifying and registering the many Web robots/spiders that exist on the Internet. Hemenway and Calishain should be credited for taking the time to be civically responsible and giving their readers an appreciation for the power they are using.
Chapter 2, "Assembling a Toolbox," covers how to obtain the Perl modules used by the book, respecting robots.txt, and various topics (Perl's LWP and WWW::Mechanize modules, for example) that will provide the reader with a solid foundation throughout the rest of the book. SH does a great job introducing some topics that not all members of its target audience may be familiar with (e.g., regular expressions, the use of pipes, XPath).
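To make the robots.txt point concrete, here is a minimal sketch of a politeness check. It uses Python's standard urllib.robotparser rather than the Perl modules the book relies on, and the robots.txt content, the example.com URLs, and the user-agent string are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would download this
# from the target site before requesting anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

AGENT = "ExampleSpider/0.1"  # hypothetical user-agent string


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask before fetching: is this agent allowed to request each URL?
allowed_public = parser.can_fetch(AGENT, "http://example.com/index.html")
allowed_private = parser.can_fetch(AGENT, "http://example.com/private/data.html")

# Seconds the site asks robots to wait between requests (None if unspecified).
delay = parser.crawl_delay(AGENT)
```

A spider built on this would skip any URL for which can_fetch returns False and sleep for the crawl delay between requests.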
Chapter 3, "Collecting Media Files," deals with obtaining files from POP3 email attachments, the Library of Congress, and Web cams, among other sources. While individual sites described here may not appeal to everyone, the idea is to provide a specific example demonstrating each of certain general concepts, which can be applied to sites of the reader's choosing.
Chapter 4, "Gleaning Data from Databases," approaches various online databases. There are some interesting hacks here, such as those that leverage Google and Yahoo together. This chapter is the longest, and provides the greatest variety of hacks. It also discusses locating, manipulating, and generating RSS feeds, as well as other miscellaneous tasks such as downloading horoscopes to an iPod.
Hack #48, Super Word Lookup, is a good example of why SH is so intriguing. While utilizing a dictionary or thesaurus via a browser is simple, having the ability to do so with a command-line program allows the user an automated approach, reducing distractions.
Chapter 5, "Maintaining Your Collections," discusses ways to automate retrieval using cron and practical alternatives for Windows users.
Chapter 6, "Giving Back to the World," ends SH by covering practical ways the reader can give back to the Internet and avoid the ignominious leech designation. This chapter provides information on creating public RSS feeds, making an organization's resources available for easy retrieval by spiders, and using instant messaging with a spider.
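As a rough illustration of what creating a public RSS feed involves, the sketch below builds a small RSS 2.0 document with Python's standard xml.etree.ElementTree. It is not taken from the book; the build_rss helper, the feed title, and the example.com links are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, items):
    """Return an RSS 2.0 document as a string.

    items is a list of (item_title, item_link) pairs.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for item_title, item_link in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_link
    # ElementTree escapes special characters such as "&" when serializing.
    return ET.tostring(rss, encoding="unicode")


# Hypothetical feed contents.
feed_xml = build_rss(
    "Example Q&A feed",
    "http://example.com/",
    "Recent questions",
    [
        ("First question", "http://example.com/q/1"),
        ("Second question", "http://example.com/q/2"),
    ],
)
```

Building the feed through an XML library rather than string concatenation keeps the output well-formed even when titles contain characters such as "&" or "<".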
Conclusion
There are extensive links provided throughout the book, and this indirectly contributes to SH's worth. The usual O'Reilly site for source code is available and Hemenway also provides some additional code on his site. A detailed listing of the hacks covered in SH is also available online from SH's table of contents.
The Hacks series is a relatively new genre for O'Reilly, but it is rapidly maturing and this growth is reflected in Spidering Hacks. Hemenway and Calishain have done good work in assembling a wide variety of tips that cover a broad spectrum of interests and applications. This is a solid effort, and I can easily recommend it to those looking to perform more effective Internet research as well as those looking for new scripting projects to undertake.
You can purchase Spidering Hacks from bn.com. Slashdot welcomes readers' book reviews -- to submit a review for consideration, read the book review guidelines, then visit the submission page.
Why spider when you can deepweb? (Score:2, Informative)
That's not what this is used for... (Score:5, Informative)
You want to track your rank on www.alexa.com and the ranking of some of your key competitors. You build a spider that goes out each night, scrapes the info you want, and stores it locally. Now you have a history of your ranking and your competitors' rankings over time.
This way you can see that when your traffic is down so is your competitor's, or maybe when yours is down theirs is up...
This also happens to be one of the examples in the book.
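The "store it locally and keep a history" step might look something like the sketch below. It is only an illustration: it uses Python's sqlite3 instead of the book's Perl, leaves the nightly scraping out entirely, and the table name, function names, site names, and rank figures are all invented.

```python
import sqlite3


def open_history(path=":memory:"):
    """Open (or create) the local history database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rank_history ("
        " site TEXT NOT NULL,"
        " day TEXT NOT NULL,"
        " ranking INTEGER NOT NULL,"
        " PRIMARY KEY (site, day))"
    )
    return conn


def record_rank(conn, site, day, ranking):
    """Store one night's result; rerunning for the same day overwrites it."""
    conn.execute(
        "INSERT OR REPLACE INTO rank_history (site, day, ranking) VALUES (?, ?, ?)",
        (site, day, ranking),
    )
    conn.commit()


def history(conn, site):
    """Return [(day, ranking), ...] for one site, oldest first."""
    rows = conn.execute(
        "SELECT day, ranking FROM rank_history WHERE site = ? ORDER BY day",
        (site,),
    )
    return rows.fetchall()


# Made-up figures standing in for what a nightly spider would scrape.
conn = open_history()
record_rank(conn, "example.com", "2003-12-01", 1500)
record_rank(conn, "example.com", "2003-12-02", 1420)
record_rank(conn, "rival.example", "2003-12-01", 900)
my_history = history(conn, "example.com")
```

Keying the table on (site, day) means a rerun of the nightly job replaces that day's row instead of duplicating it.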
Re:That's not what this is used for... (Score:4, Informative)
Re:That's not what this is used for... (Score:2)
I agree with you to some extent, but if I am grabbing a dozen freely available pages from a site and storing the information for my own use (not selling it or publishing it), no foul. Otherwise they could go after people who print/write down/copy-paste the info under the same act.
PS: Thanks for showing me that, now my head hurts.
Re:That's not what this is used for... (Score:1)
Section 3(a) describes "a quantitatively substantial part of the information in a database". The information gathered from a single web page would almost certainly not be the "substantial part" of data held by Alexa or any other indexing service.
Also, I doubt that it could be seriously, let alone successfully, argued that making a record of publicly available information falls under the definition in Section 3(a)(2) that talks abou
Re:That's not what this is used for... (Score:2)
you're misusing this buzzword. what you just described is called 'collecting data'
Re:That's not what this is used for... (Score:1)
Re:Why spider when you can deepweb? (Score:1)
I bet I find some real dirt.
Table of content is packed with great stuff! (Score:5, Informative)
I wonder if Tracking Packages with FedEx [google.com] is using the new google feature. That would be too simple  :)
Does anyone know the name of a small utility to query search engines on the command line? I think it was a 2-letter program, but I couldn't find it anymore  :(
Re:Table of content is packed with great stuff! (Score:2)
The funny thing is, the first page of results for the sample patent search they give is a bunch of pages about Google's ability to search via patent numbers.
Time to rethink a ranking algorithm there...
Re:Table of content is packed with great stuff! (Score:2)
Re:Table of content is packed with great stuff! (Score:5, Informative)
Re:Table of content is packed with great stuff! (Score:3, Interesting)
Looking at the page, and I'm pretty sure it's the little program I had lost. Thanks for finding it again!
Pssst! Mod parent up!
cousin of spam? (Score:4, Interesting)
I don't have any solutions in mind. I don't want anti-spidering legislation, for example, because *I* want to be able to spider. I just don't want *you* to do it.
Really, I'm just observing that as the Web evolves we could see another spam-like problem emerge, at least for the more interesting sites.
Re:cousin of spam? (Score:2)
And like spam, the perp's natural first step in the battle is to start using anonymous proxies to help avoid detection / retribution.
Re:cousin of spam? (Score:3, Insightful)
This might be obvious or just a non-issue, but ignore IMG tags in your bots (it saves on bandwidth costs). You're probably not affecting their bandwidth much by downloading text.
Incidentally, most spammers are glorified script kiddies, not data miners or AI people. The kind of "hard-earned" money in data mining isn't the kind of money spammers are looking for.
The real problem with data mining is increased server load. Perhaps running your
Re:cousin of spam? (Score:2)
I once thought about how neat it would be to start a spider running that would just go and go and go. It didn't take me long to get bored with it, just thinking about it. I do automate a lot of HTTP with Perl and LWP, and it's incredibly useful
Re:cousin of spam? (Score:2, Insightful)
If "bad" spiders became so common that businesses began needing to weigh the pros of page ranking against the cons of data theft then the indexing services (those that wanted to remain relevant) would develop other methods for accessing web content.
On a side note: I actually bought this book a couple of weeks ago
Ok I admit it (Score:5, Funny)
Oh the shame  ...
Re:Ok I admit it (Score:5, Funny)
MOD UP! (Score:2, Funny)
Re:Ok I admit it (Score:5, Funny)
Re:Ok I admit it (Score:1)
RE: Booble (Score:5, Informative)
http://www.booble.com/ [booble.com]
Re: Booble (Score:2, Funny)
Re: Booble (Score:2)
Re:Ok I admit it (Score:2)
Click on the video tab.
Turn Family Filter off.
Fap away.
Yeecchhh! (Score:4, Funny)
That thing's going to build one nasty sticky web!
Re:Ok I admit it (Score:5, Informative)
Re:Ok I admit it (Score:2, Funny)
What kind of pr0n have YOU been looking at....
Re:Ok I admit it (Score:1)
> I wish there was more adult open-sores software
If adults have open-sores from harvesting pr0n, then I think they need medical (or possibly psychological) attention, not software. At least buy yourself some lotion, buddy!
Re:Ok I admit it (Score:2)
Wow! Automated porn collection is one thing, but actually automating porn consumption - that's something!
Re:Ok I admit it (Score:1)
I sure don't!
Error in post. (Score:1, Redundant)
XML interop? (Score:5, Interesting)
However, looking at these hacks:
68. Checking Blogs for New Comments
69. Aggregating RSS and Posting Changes
70. Using the Link Cosmos of Technorati
71. Finding Related RSS Feeds
Do they offer any hacks on working with XML, perhaps XML::RSS or other parsing engines from CPAN? Or is most of the XML handled through regexp?
Re:XML interop? (Score:4, Informative)
Re:XML interop? (Score:5, Interesting)
It's actually a good read. They try to stay away from regex parsing as it tends to be fragile. They do cover it in one of the hacks though.
Most of the hacks have to do with using various methods to walk the doc tree to look for what you want like a certain cell in a table (think header with names) then jumping up one to get that row then grabbing the next row to get your data cells.
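The "find the header row, then grab the next row" approach can be sketched roughly as follows. This is not the book's code: it uses Python's standard html.parser instead of Perl, and the TableRows class, the row_after_header helper, and the sample table are made up for the example.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect every table row as a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def row_after_header(html, header_text):
    """Return the row that follows the first row containing header_text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    for index, row in enumerate(parser.rows):
        if header_text in row and index + 1 < len(parser.rows):
            return parser.rows[index + 1]
    return None


# Hypothetical page fragment.
SAMPLE = """
<table>
  <tr><td colspan="2">Rankings</td></tr>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>alpha</td><td>42</td></tr>
</table>
"""

data_row = row_after_header(SAMPLE, "Name")
```

Because the lookup is anchored on the header text instead of a fixed position, it keeps working if the site adds a banner row above the table body.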
Re:XML interop? (Score:4, Informative)
There are basically two styles of XML parser, event-based (SAX) and document-based (DOM). I find DOM-types easier to use.
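For anyone who hasn't seen the two styles side by side, here is a small sketch that pulls item titles out of the same feed both ways, using Python's standard xml.sax and xml.dom.minidom in place of the CPAN modules. The feed text and the function names are invented for the example.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical feed.
FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>First post</title></item>
<item><title>Second post</title></item>
</channel></rss>"""


class ItemTitles(xml.sax.ContentHandler):
    """Event-based: react to tags as they stream past, tracking state by hand."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_item = False
        self._buffer = None

    def startElement(self, name, attrs):
        if name == "item":
            self._in_item = True
        elif name == "title" and self._in_item:
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None
        elif name == "item":
            self._in_item = False


def titles_sax(xml_text):
    handler = ItemTitles()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.titles


def titles_dom(xml_text):
    """Document-based: load the whole tree, then query it."""
    doc = minidom.parseString(xml_text)
    titles = []
    for item in doc.getElementsByTagName("item"):
        for node in item.getElementsByTagName("title"):
            text = "".join(
                child.data
                for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            )
            titles.append(text)
    return titles


sax_titles = titles_sax(FEED)
dom_titles = titles_dom(FEED)
```

The DOM version is shorter because the tree does the bookkeeping; the SAX version never holds the whole document in memory, which matters more for very large files.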
Re:XML interop? (Score:1)
One of the nicest features of the book is that it promotes the use of appropriate parsers over random regexing.
Use of "hacker" (Score:3, Insightful)
Be sure to get LWP & Perl (Score:5, Informative)
Re:Be sure to get LWP & Perl (Score:2)
perldoc LWP (Score:3, Informative)
Tracking yahoo popularity. (Score:3, Interesting)
This [yahoo.com] page is used to source the data.
Is LWP the correct/new way to do this kind of stuff? I started with curl and hacked regex's to get the data.
Re:Tracking yahoo popularity. (Score:3, Informative)
In general, most people use LWP, and if you write very many programs that use the web, you're going
Re:Tracking yahoo popularity. (Score:2, Informative)
To get the data from the page you can either use a bunch of regexps (as you've done, apparently) or a parser like HTML::TokeParser::Simple. The advantage of a parser is that it makes it more robust and immune to site changes. You also get higher quality data, for example if something subtle changes in the site's html sourc
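A small sketch of why a parser tends to survive site changes better than a regexp: the two markup samples below differ only in attribute order, which is enough to break a pattern written against the first one. This uses Python's re and html.parser as stand-ins for Perl regexps and HTML::TokeParser::Simple, and the markup, class name, and function names are made up.

```python
import re
from html.parser import HTMLParser

# The same link before and after a hypothetical site redesign.
OLD_MARKUP = '<a href="/story/1" class="headline">First story</a>'
NEW_MARKUP = '<a class="headline" href="/story/1">First story</a>'

# A pattern written against the old markup, attribute order included.
HEADLINE_RE = re.compile(r'<a href="([^"]+)" class="headline">')


def links_by_regex(html):
    return HEADLINE_RE.findall(html)


class HeadlineLinks(HTMLParser):
    """Collect href values of <a class="headline"> tags, in any attribute order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("class") == "headline" and "href" in attributes:
            self.links.append(attributes["href"])


def links_by_parser(html):
    parser = HeadlineLinks()
    parser.feed(html)
    parser.close()
    return parser.links
```

Run against OLD_MARKUP both functions find /story/1; run against NEW_MARKUP only the parser version still does.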
Spidering and exceeding ISP bandwidth limits (Score:5, Insightful)
Re:Spidering and exceeding ISP bandwidth limits (Score:5, Insightful)
More likely though, you leave the big jobs to the big boys, and you want to do very specific things, maybe even building on top of google... eg. find porn movies, copying edmunds' database [edmunds.com] so you can sort cars by their power/weight ratio (or list all RWD cars, or find the lightest RWD car, or...), or make your own third-party feed of slashdot from their homepage since they watch you like a hawk when you download their  .rss too often, but not when you download their homepage too often.
Little custom jobs like that can take a minimal amount of code (especially if you're a regex wizard), take minimal bandwidth, and take enough skill that target sites aren't likely to track you down because there's only three of you doing it.
Re:Spidering and exceeding ISP bandwidth limits (Score:2)
More likely though, you leave the big jobs to the big boys, and you want to do very specific things, maybe even building on top of google.
Very good point. You are right that many people will use spiders in a naturally limited way -- a one-shot or infrequently repeated project to gather information on a very limited domain or l
Alternative ways of searching and spidering (Score:3, Informative)
Re:Alternative ways of searching and spidering (Score:2)
Agents, anyone? (Score:5, Interesting)
The problem comes more in the last assertion of the story: that pulling in all of this data will free up more time for people to spend on the work of analysis. I want to say this isn't accurate, but it probably boils down to what you call "analysis" work.
The problem with spiders, agents, and their like -- yes, even those that are going out and fetching porn -- is that they are able to provide content without context, much as a modern search engine does. I can take Google and get super specific with a query (say, `pirates carribean history -movie -"johnny depp"`). That will probably fetch me back some data that has my keywords in it, much as any script or agent could do.
Unfortunately, while the engine could rank based on keyword visibility and recurrence, as well as applying some algorithms to try and guess whether the data might be good or not (encyclopedias look this way, weblogs about Johnny Depp look that way), the engine itself still has no way to physically read the information and decide if it's at all useful. A high-school website's page with a tidbit of information and some cute animated
The most tedious part of data analysis these days is not providing content (as spiders, scripts, and search engines all do)
What comes after that sorting process - the assimilation of good data and the drawing of conclusions therefrom - that's what I call data analysis. A shame that scripts, spiders, agents, and robots haven't found a way to do that for us.
Re:Agents, anyone? (Score:5, Insightful)
I know that my Popfile spam filter is getting pretty good (with 35,000 messages processed) at not only spam vs. ham type comparisons, but also work vs. personal and other categories.
Bayesian filters are just one type of learning algorithm, but they work fairly well for textual comparisons. I've personally been toying with seeing how well a toolbar/proxy combination would work for predicting the relative "value" of a site to me. Run all browsing through a Bayesian web proxy that analyses all sites visited. Then, with a browser toolbar, sites can be moderated into a series of categories.
That same database could be used by spiders to look for new content, and, if it fits into a "positive" category according to the analysis, add it to a personal content page of some sort that could be used as a browser's home page.
With sufficient data sources (and with a book like this, it shows that there ARE plenty of sources), it could really bring the content you want to read together.
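For the curious, the core of a Bayesian text categorizer of the kind described above fits in a few lines. The sketch below is a plain naive Bayes classifier with add-one smoothing, not Popfile's actual implementation; the class name and the training snippets are invented for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Word-count naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> documents seen
        self.vocabulary = set()

    def train(self, category, text):
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, -math.inf
        for category in self.doc_counts:
            # Log prior plus the sum of log likelihoods for each word.
            score = math.log(self.doc_counts[category] / total_docs)
            total_words = sum(self.word_counts[category].values())
            for word in words:
                count = self.word_counts[category][word]
                score += math.log((count + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Made-up training data standing in for moderated pages or messages.
classifier = NaiveBayes()
classifier.train("work", "meeting agenda project deadline")
classifier.train("work", "project budget report")
classifier.train("personal", "dinner party weekend")
classifier.train("personal", "weekend hiking trip photos")

work_guess = classifier.classify("project report deadline")
personal_guess = classifier.classify("weekend trip")
```

A spider could run each newly fetched page through classify and keep only the pages that land in a category the user has marked as interesting.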
Re:Agents, anyone? (Score:1)
Microsoft jumped on this idea and invented the Office Assistants like Clippy.
Re:Agents, anyone? (Score:1)
Given the initial failure of the Newton, would you want $1 for every PDA sold in the last 3 years? If the Newton was your indicator, the answer would be no.
Re:Agents, anyone? (Score:2)
In other news... (Score:1, Funny)
Re:In other news... (Score:5, Funny)
Sample hacks (Score:3, Informative)
Perl script to query the library (Score:4, Interesting)
I got tired of having to go to all 3 websites to see what to take back each day, so I wrote a small bash/curl script so I could do it at the command line.
There are *lots* of things like this that could be done if the web were more semantic.
An alternative (Score:2, Interesting)
buying it now (Score:1)
all music dot com (Score:2)
WWW::Mechanize is your friend (Score:5, Informative)
- #21: WWW::Mechanize 101
- #22: Scraping with WWW::Mechanize
- #36: Downloading Images from Webshots
- #44: Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups (which uses Mech)
- #64: Super Author Searching
- #73: Scraping TV Listings
Here are some other online resources to look at:
A random bunch of examples submitted by users, included with the Mechanize distribution.
Chris Ball's article about using WWW::Mechanize for scraping TV listings. (repurposed into hack #73 above)
Randal Schwartz's article on scraping Yahoo News for images.
WWW::Mechanize on the Perl Advent Calendar 2002, by Mark Fowler.
How much can you screen-scrape legally ? (Score:5, Interesting)
In the USA, trading information that has cost somebody else time and money to build up can be caught under a doctrine of "misappropriation of trade values" or "unfair competition", dating from the INS case in 1918.
Meanwhile here in Europe, a collection of data has full authorial copyright (life + 70) under the EU Database Directive (1996), if the collecting involved personal intellectual creativity; or special database rights (last update + 15 years) if it did not.
I've done a little screen-scraping for a "one name" family history project. Presumably that is in the clear, as it was for personal non-commercial research, or (at most) quite limited private circulation.
But where are the limits ?
How much screen-scraping can one do (or advertise), before legally it becomes a "significant taking" ?
Simple (Score:2)
If it doesn't allow you to gather information, then don't.
Re:Simple (Score:2)
I'm not sure that's the whole answer.
Many sites may not have a robots.txt file, yet may still value their copyright and/or database rights.
On the other hand, for some purposes it may be legitimate to take some amount of data (obviously not the whole site), even in contravention of the wishes of a robots.txt
So I think the question is deeper than just "look for robots.txt"
Spidering Google Illegal? (Score:3, Interesting)
No Automated Querying You may not send automated queries of any sort to Google's system without express permission in advance from Google. Note that "sending automated queries" includes, among other things:
using any software which sends queries to Google to determine how a website or webpage "ranks" on Google for various queries; "meta-searching" Google; and performing "offline" searches on Google.
Please do not write to Google to request permission to "meta-search" Google for a research project, as such requests will not be granted.
Re:Spidering Google Illegal? (Score:2)
Re:Spidering Google Illegal? (Score:2)
Google doesn't check referring urls, btw.
Re:Spidering Google Illegal? (Score:1)
Re:Spidering Google Illegal? (Score:2, Interesting)
Re:Spidering Google Illegal? (Score:2, Informative)
idkey = "insert your key here!"
AFAIK, this is standard practice for most sites with API access. (If you're interested, do it yourself at google.com/apis [google.com].) If you try to pull Google info down with an HTTP object programmatically, Google will just return a 403 and tell you to read its terms of service. (Unless you spoof the header, but that requires do
Is there an OSS search engine/WWW snapshot (Score:1)
The biggest problem is th
Re:Is there an OSS search engine/WWW snapshot (Score:1)
They've received money from some high-profile backers such as Mitch Kapor and Overture.
Same people that created the open source indexer Lucene. Haven't downloaded the code yet but I am following the project closely.
Man Holmes
Not Likely (Score:2, Funny)
Maybe the spammers will read the ethics section and have a change of heart!
Re:Techniques used by spammers? (Score:3, Insightful)
There's a lot more information on the Web than just e-mail addresses. Besides, why be reliant on search engines when you can do it yourself?
Re:Techniques used by spammers? (Score:5, Informative)
1. Archiving data on the web
2. Getting your files back when you forget your FTP password
3. Researching the link structure of the Internet and how it changes over time
4. Playing a joke on a friend by scraping his site and reposting the content, filtered in your favorite dialect [rinkworks.com]
5. Reading your favorite site in an RSS reader, even if they don't provide an RSS feed
6. Counting how often certain words are used on the net
7. Checking to see if you have any broken links on your site
8. Testing to make sure every link is reachable on your site, and finding out how deep the deepest link is
9. Taking data from a public website and compiling useful statistics, such as GPA calculations, average completion times for cross country races, or the total number of points scored last night in the NHL.
10. Showing people that the Internet can be more than just a web browser
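Items 7 and 8 in the list above (broken links, and how deep the deepest link is) come down to a breadth-first walk of the site's link graph. Here is a minimal sketch that runs against an in-memory stand-in for a site, so nothing is fetched over the network; the SITE dictionary, its pages, and the function names are all hypothetical.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical site: path -> page HTML. A real checker would fetch each page.
SITE = {
    "/": '<a href="/about">About</a> <a href="/docs">Docs</a>',
    "/about": '<a href="/">Home</a>',
    "/docs": '<a href="/docs/intro">Intro</a> <a href="/missing">Gone</a>',
    "/docs/intro": '<a href="/docs">Back</a>',
}


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def crawl(site, start="/"):
    """Breadth-first walk: returns (depth of each reachable page, broken links)."""
    depth = {start: 0}
    broken = []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in extract_links(site[page]):
            if target not in site:
                broken.append((page, target))  # link points at a missing page
            elif target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth, broken


depths, broken_links = crawl(SITE)
deepest = max(depths.values())
unreachable = sorted(set(SITE) - set(depths))
```

Breadth-first order is what makes the recorded depth the minimum number of clicks from the start page, and any page in SITE that never shows up in depths is unreachable by following links.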
Re:Techniques used by spammers? (Score:1)
http://redheadedleague.com/df.html
The Double Feature Finder goes to moviefone and finds movies in a row you can see.
Enjoy
Uncle Highbrow
Re:New book: Hacking your way into a Spider Hole (Score:2, Offtopic)