How To Build a Web Spider On Linux
IdaAshley writes, "Web spiders are software agents that traverse the Internet gathering, filtering, and potentially aggregating information for a user. This article shows you how to build spiders and scrapers for Linux that crawl a Web site and gather information (in this case, stock data). Using common scripting languages and their collection of Web modules, you can easily develop Web spiders."
Quality of article? (Score:2, Insightful)
"Iterate through response hash"
Why would somebody want to do that?
A quick net search reveals a simpler way: resp["server"] is all you need.
Maybe the article was meant to be posted on thedailywtf.com?
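For comparison, here is the same one-header lookup as a minimal Python sketch, using only the standard library's urllib (slashdot.org is the article's example target):

import urllib.request

# Send a HEAD request and print only the Server header
req = urllib.request.Request("http://slashdot.org/", method="HEAD")
with urllib.request.urlopen(req) as resp:
    print(resp.headers.get("Server"))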
Re-inventing a square wheel (Score:5, Insightful)
Basically, the article gives you Ruby and Python examples of how to fetch web pages and (badly) parse them for information. It's the same thing everyone has been doing for at least a decade with Perl and the appropriate modules, or whatever other tools, except that most people know how to do it correctly.
The first script is merely ridiculous: 12 lines of code (not counting empty and comment lines) to do:
HEAD slashdot.org | grep 'Server: '
But it gets worse. To extract a stock quote from a page, the second script just looks for the first occurrence of 'class="price">' and takes the 10 characters that follow. You don't need to know Ruby to see what it does, or how quickly it will break and start spitting out rubbish. The author has obviously never used that sort of thing for more than a couple of days.
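To make the failure mode concrete, here is a rough Python sketch of both approaches; the class name "price" comes from the article, and the HTML snippet is made up:

from html.parser import HTMLParser

html = '<span class="price">$1,234.56</span>'

# Fragile approach (what the article's Ruby script amounts to):
marker = 'class="price">'
start = html.find(marker) + len(marker)
print(html[start:start + 10])  # '$1,234.56<' -- 10 raw characters, markup and all

# Parser-based approach: collect the text inside the matching tag instead.
class PriceParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price = None
    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == "price":
            self.in_price = True
    def handle_data(self, data):
        if self.in_price and self.price is None:
            self.price = data.strip()
            self.in_price = False

p = PriceParser()
p.feed(html)
print(p.price)  # '$1,234.56', however long the price happens to be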
Finally, there is a Python script. At first glance it looks slightly better: it uses what appears to be the Python equivalent of HTML::Parse to extract links. But a closer look reveals that, to find a link, it just takes the first attribute of any a tag and uses its value, never mind whether that first attribute actually is "href".
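The fix is a one-liner: look the attribute up by name instead of grabbing the first one. A minimal sketch using Python's standard html.parser (the input HTML is invented for illustration):

from html.parser import HTMLParser

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; look up "href" by name
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

p = LinkParser()
p.feed('<a name="top"></a><a href="http://slashdot.org/">news</a>')
print(p.links)  # ['http://slashdot.org/'] -- the name-only anchor is skipped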
I suppose the only point of that article was the IBM links at the end.
And that is in a section for Linux developers on the IBM site? Maybe they did copy stuff from SCO after all?...
Re:Re-inventing a square wheel (Score:5, Insightful)
It's a (Perl) script that comes with libwww-perl [linpro.no], which is either part of the standard Perl distribution by now or installed by default in any decent Linux distribution.
If you don't have HEAD, you can type a bit more and get the server with LWP::Simple's head() method (then you don't need grep):
$ perl -MLWP::Simple -e '$s = (head "http://slashdot.org/")[4]; print $s'
Either way is better than those useless 12 lines of Ruby. (I'm sure Ruby can do the same in a similarly simple way, but that author just doesn't have a clue.)