How To Build a Web Spider On Linux
IdaAshley writes, "Web spiders are software agents that traverse the Internet gathering, filtering, and potentially aggregating information for a user. This article shows you how to build spiders and scrapers for Linux that crawl a Web site and gather information (in this case, stock data). Using common scripting languages and their collection of Web modules, you can easily develop Web spiders."
Quality of article? (Score:2, Insightful)
"Iterate through response hash"
Why would somebody want to do that?
A quick net search reveals a simpler way: resp["server"] is all you need.
Maybe the article was meant to be posted on thedailywtf.com?
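For comparison, here is the same one-header lookup as a minimal Python sketch, using only the standard library's urllib (slashdot.org is the article's example target):

import urllib.request

# Send a HEAD request and print only the Server header
req = urllib.request.Request("http://slashdot.org/", method="HEAD")
with urllib.request.urlopen(req) as resp:
    print(resp.headers.get("Server"))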
Re-inventing a square wheel (Score:5, Insightful)
Basically, the article gives you Ruby and Python examples of how to fetch web pages and (badly) parse them for information. It's the same thing everyone has been doing for at least a decade with Perl and the appropriate modules, or whatever other tools, except that most people know how to do it correctly.
The first script is merely ridiculous: 12 lines of code (not counting empty and comment lines) to do:
HEAD slashdot.org | grep 'Server: '
But it gets worse. To extract a stock quote from a page, the second script just looks for the first occurrence of 'class="price">' and takes the 10 characters that follow. You don't need to know Ruby to see what it does, or how quickly it will break and start spitting out rubbish. The author has obviously never used that sort of thing for more than a couple of days.
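To make the failure mode concrete, here is a rough Python sketch of both approaches; the class name "price" comes from the article, and the HTML snippet is made up:

from html.parser import HTMLParser

html = '<span class="price">$1,234.56</span>'

# Fragile approach (what the article's Ruby script amounts to):
marker = 'class="price">'
start = html.find(marker) + len(marker)
print(html[start:start + 10])  # '$1,234.56<' -- 10 raw characters, markup and all

# Parser-based approach: collect the text inside the matching tag instead.
class PriceParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price = None
    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == "price":
            self.in_price = True
    def handle_data(self, data):
        if self.in_price and self.price is None:
            self.price = data.strip()
            self.in_price = False

p = PriceParser()
p.feed(html)
print(p.price)  # '$1,234.56', however long the price happens to be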
Finally, there is a Python script. At first glance it looks slightly better: it uses what appears to be the Python equivalent of HTML::Parse to extract links. But a closer look reveals that, to find a link, it just takes the first attribute of any a tag and uses its value, never mind whether that first attribute actually is "href".
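The fix is a one-liner: look the attribute up by name instead of grabbing the first one. A minimal sketch using Python's standard html.parser (the input HTML is invented for illustration):

from html.parser import HTMLParser

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; look up "href" by name
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

p = LinkParser()
p.feed('<a name="top"></a><a href="http://slashdot.org/">news</a>')
print(p.links)  # ['http://slashdot.org/'] -- the name-only anchor is skipped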
I suppose the only point of that article was the IBM links at the end.
And that is in a section for Linux developers on the IBM site? Maybe they did copy stuff from SCO after all?...
Re:Re-inventing a square wheel (Score:5, Insightful)
It's a (Perl) script that comes with libwww-perl [linpro.no], which is either part of the standard Perl distribution by now or installed by default in any decent Linux distribution.
If you don't have HEAD, you can type a bit more and get the server with LWP::Simple's head() method (then you don't need grep):
$ perl -MLWP::Simple -e '$s = (head "http://slashdot.org/")[4]; print $s'
Either way is better than those useless 12 lines of Ruby. (I'm sure Ruby can do the same in a similarly simple way, but that author just doesn't have a clue.)