How To Build a Web Spider On Linux
IdaAshley writes, "Web spiders are software agents that traverse the Internet gathering, filtering, and potentially aggregating information for a user. This article shows you how to build spiders and scrapers for Linux to crawl a Web site and gather information (in this case, stock data). Using common scripting languages and their collections of Web modules, you can easily develop Web spiders."
Hmm... (Score:5, Funny)
Re: (Score:2)
Re: (Score:2)
And I generally write 'em in PHP. Makes 'em nice and lightweight to redistribute (php.exe and php5ts.dll are usually all that's needed. Sometimes php_http.dll as well.)
Re: (Score:2)
A modern crawler has to overcome very annoying problems like DNS lookup delays and network lag caused by third parties. If you can write it in a threaded environment, good for you. If you can drop the "single scope" approach altogether and go for a select()-based or, better yet, an epoll-based version that can crawl a thousand sites at a time, even better.
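To make the select/epoll point concrete, here is a minimal sketch in Python (asyncio's event loop sits on epoll on Linux). The raw HTTP/1.0 HEAD requests and the host list are only illustrative assumptions, not anything from the article:

import asyncio

async def fetch_head(host):
    # One coroutine per site; a slow DNS lookup or server only stalls this coroutine.
    reader, writer = await asyncio.open_connection(host, 80)
    writer.write(f"HEAD / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
    await writer.drain()
    status_line = await reader.readline()      # e.g. b'HTTP/1.1 200 OK\r\n'
    writer.close()
    await writer.wait_closed()
    return host, status_line.decode().strip()

async def main(hosts):
    # All connections share one event loop in one process.
    for result in asyncio.as_completed([fetch_head(h) for h in hosts]):
        host, status = await result
        print(host, status)

asyncio.run(main(["slashdot.org", "www.ibm.com", "curl.haxx.se"]))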
For simple tasks even Perl's ithreads would do. But I'd suggest a language that supports
Re: (Score:1)
What do you mean "natively"? Ruby 1.8, at least, doesn't use OS threads. Perl ithreads map to native threads, where they're available.
Re: (Score:2)
Perl's threads may be native, but they don't use the "share everything" model; they only share certain types, and that makes them a pain to use. I'm subscribed to the ithreads mailing list and it's deja vu all the time. Thankfully there is Liz who is still helping the people o
Re: (Score:2)
You can make any program arbitrarily hard by increasing the generality and performance requirements, but there's nothing inherently difficult about screen-scraping a web site. I've written a few scripts to extract data from web sites, and they're quite simple if your aims are modest. The first crawler I wrote was also my first Perl project, my first time using HTTP, and my first time dealing with HTML. Given a date, it generated a URL,
Re: (Score:2)
Still, they do take a while to run. 8 hours for the basketball box score scraper, and that's with cheater threading (using start
Re: (Score:1)
For pulling data from Web sites when testing, I would typically use Wget (Win32) and parse the output with findstr using a b
Re: (Score:3, Interesting)
The PHP interpreter is over 5 megabytes in size. And it isn't thread-safe. That's a lot of memory overhead for a program that's going to be blocking on I/O most of the time, seeing how you'll have to fork() a new process for each new "thread" you want.
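As a rough illustration of that point (many blocking fetches in one interpreter, no fork() per URL), here is a sketch with Python's standard thread pool; the URLs and pool size are placeholders:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    # Each worker blocks on I/O; threads are cheap compared to forked interpreters.
    with urlopen(url, timeout=10) as resp:
        return url, resp.status, len(resp.read())

urls = ["http://slashdot.org/", "http://www.ibm.com/", "http://curl.haxx.se/"]
with ThreadPoolExecutor(max_workers=20) as pool:
    for url, status, size in pool.map(fetch, urls):
        print(url, status, size)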
Also, languages like Perl and Python have binaries that are about 1 megabyte in size. Now, while they'll probably need to load in extra files for most practical applications, these extra files are typically small. Most importantly, Per
Re: (Score:2)
Crawling efficiently (Score:5, Informative)
Better to use an associative array to cache the links, since lookup is O(1). The queue's lookup time is O(n), and as n gets large, so does the lookup time; since you check every link this way, the worst case for the whole crawl is O(n^2). A hash (associative array) does the same checks in O(n) total.
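A rough sketch of what that looks like in Python: the frontier stays a queue, but the "seen it already?" check goes against a set. extract_links() is a hypothetical stand-in for whatever parser you use:

from collections import deque

def crawl(start_url, extract_links, max_pages=1000):
    visited = {start_url}          # hash: O(1) "have I seen this?" checks
    frontier = deque([start_url])  # FIFO queue of pages still to fetch
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        for link in extract_links(url):   # extract_links is a placeholder
            if link not in visited:       # set lookup, not a scan of the queue
                visited.add(link)
                frontier.append(link)
    return visited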
Re: (Score:2, Insightful)
Re: (Score:3, Interesting)
Re: (Score:2, Funny)
If you're surprised about programmers not knowing/caring about efficiency, do you actually use a computer?
Re: (Score:1)
Before your queue gets big enough for lookup/insertion time to become an issue, you'll first have to worry about bigger hard disks and more bandwidth.
Re: (Score:2)
No, not really interested in the answer, as I'm just pointing out that the code suddenly becomes (unnecessarily) much more complicated.
Re: (Score:1)
add hash, [currentURL]
append array, [currentURL]
That wasn't so hard, was it?
Re: (Score:2)
My favorite method is to use PHP as a backend for mshta; you can be guaranteed it'll run on any Windows machine, and you have the benefit that a Linux machine will at least be able to run the backend.
The 90s called (Score:5, Funny)
Re:Obligatory (Score:1)
Re: (Score:2, Funny)
Re: (Score:1)
Re: (Score:1)
Re: (Score:1)
Next question?
What's the point? (Score:2)
Re: (Score:1)
Actually... (Score:4, Interesting)
Regardless, I do, in fact, build spiders. For instance, in an MMO I play, all users can have webpages, so it's very useful to have a spider as part of a clan/guild/whatever to crawl the webpages looking for users who have illegal items and such. More generally, there is a third-party site that collects vital statistics from everyone who puts them on their user page, so you can get lists of the most powerful people in the game, the richest people, etc.
Re: (Score:1)
Re: (Score:2)
So, it's actually much more efficient to scan for a specific string that I know will be there for a particular item -- it's literally impossible for them to try to mask it with, say, leetspeak.
Re: (Score:2)
Re: (Score:1)
In fact, I'm going to have to do this fairly soon. I've already written a search for articles, but now customers are complaining that they can't search for "customer service." Bah!
Unfortunately, IBM's spider example is pretty pathetic.
Re: (Score:2)
You're right. Web 2.0 changes everything. Some people are just conservative, though. My parents are still using bookshelves even though maglev trains made bookshelves obsolete decades ago.
You mean follow links and *gasp* do something with t
downloads (Score:5, Informative)
For those of us who don't have them, here are the basics:
Wget: http://www.gnu.org/software/wget/ [gnu.org]
Curl: http://curl.haxx.se/ [curl.haxx.se]
yes, I did RTFA (Score:2)
Re:yes, I did RTFA (Score:4, Funny)
Re: (Score:1)
A partially better alternative is httrack; it has more features but also tends to be less stable
Re: (Score:1)
Hardly linux-specific (Score:5, Insightful)
some points (Score:5, Interesting)
Re: (Score:1)
[sarcasm] Why? Google doesn't. [/sarcasm]
And once I even repeatedly voted on an online poll and changed the course of history.
So did I! Back in 2000 I got the Underwear Gnomes episode of South Park aired.
I think the best use of a spider in an online poll was by whatever Red Sox fan voted a million times for Nomar Garciaparra to make the All-Star team back in 2000.
Re: (Score:2)
Firefox's automation capabilities don't need to match those of IE, for pretty much the same reason. The only thing Mechanize can't do is JavaScript, and there are vague plans about that.
Re: (Score:2)
Re: (Score:1)
Re:some points (Score:4, Informative)
It's always had it. Look up XUL some day. The entire browser is written in XUL.
Re: (Score:1)
You have that wrong. The question is when IE's capabilities (automation and otherwise) will catch up with Firefox's.
Re: (Score:2)
Yeah, 'cos I really miss having my machine automatically turned into a Zombie.......
Re: (Score:2)
Now. http://www.openqa.org/selenium-ide/ [openqa.org]
'Steve? Send the web spiders.' (Score:2)
Re: (Score:1)
It's a web spider, man; not a killer robot spider, but I'll tell you it's a web spider from South Jersey if it'll make you feel any better.
KFG
MORE CORN!!! (Score:2)
Oh sweet Jesus! (Score:3, Insightful)
crawling is not so trivial (Score:2, Interesting)
That reminds me. (Score:3, Informative)
Unfortunately, many web developers still ignore the inevitable, leaving their sites vulnerable to the dreaded Googlebot "attack". While most of the spider developer manuals (TFA included) stress the importance of being polite (respect robots.txt & friend
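For what it's worth, the polite check is only a few lines with Python's standard robotparser module; the user agent string and URLs below are just examples:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://slashdot.org/robots.txt")
rp.read()

if rp.can_fetch("ExampleSpider/0.1", "http://slashdot.org/faq/"):
    print("allowed to fetch")
else:
    print("robots.txt says keep out")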
Quality of article? (Score:2, Insightful)
"Iterate through response hash"
Why would somebody want to do that?
A quick net search "reveals": A simple resp["server"] is all you need.
Maybe the article was meant to be posted on thedailywtf.com?
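The Python equivalent of that shortcut, for anyone who wants it; the URL is only an example, and header lookup is case-insensitive:

from urllib.request import urlopen

resp = urlopen("http://slashdot.org/")
print(resp.headers["Server"])   # no need to iterate the whole header hash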
Re: (Score:1)
KFG
Re-inventing a square wheel (Score:5, Insightful)
Basically, the article gives you Ruby and Python examples of how to get web pages and (badly) parse them for information. It's the same thing everyone has been doing for at least a decade with Perl and the appropriate modules, or whatever other tools, except that most people know how to do it correctly.
The first script is merely ridiculous: 12 lines of code (not counting empty and comment lines) to do:
HEAD slashdot.org | grep 'Server: '
But it gets worse. To extract a quote from a page, the second script suggests this:
You don't need to know Ruby to see what it does: it looks for the first occurrence of 'class="price">' and just takes the 10 characters that follow. The author obviously never used that sort of thing for more than a couple of days, or he would know how quickly it breaks and spits out rubbish.
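For comparison, a sketch of a sturdier approach using only Python's standard HTML parser: match the element by its class attribute and take its whole text, instead of a fixed 10 characters. The class name and sample markup are assumptions for illustration, not the article's actual page:

from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Enter "capture" mode when an element carries class="price".
        if "price" in dict(attrs).get("class", "").split():
            self.in_price = True

    def handle_endtag(self, tag):
        self.in_price = False

    def handle_data(self, data):
        if self.in_price and data.strip():
            self.prices.append(data.strip())

parser = PriceExtractor()
parser.feed('<td><span class="price">2,279.10</span></td>')
print(parser.prices)   # ['2,279.10'], however long the value happens to be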
Finally, there is a Python script. At first glance, it looks slightly better. It uses what appears to be the Python equivalent of HTML::Parser to get links. But a closer look reveals that, to find links, it just takes the first attribute of any a tag and uses that as the link. Never mind if the first attribute doesn't happen to be "href".
I suppose the only point of that article was the IBM links at the end:
And that is in a section for Linux developers on the IBM site? Maybe they did copy stuff from SCO after all?...
Re: (Score:1)
Re: (Score:1)
Re:Re-inventing a square wheel (Score:5, Insightful)
It's a (Perl) script that comes with libwww-perl [linpro.no], which either is now part of the standard Perl distribution or is installed by default in any decent Linux distribution.
If you don't have HEAD, you can type a bit more and get the server with LWP::Simple's head() method (then you don't need grep):
$ perl -MLWP::Simple -e '$s=(head "http://slashdot.org/" )[4]; print $s'
Either way is better than those useless 12 lines of Ruby. (I'm sure Ruby can also do the same in a similarly simple way, but that author just doesn't have a clue.)
Re: (Score:2)
I always used lynx -source -head http://slashdot.org/ [slashdot.org], which is a lot more typing...
Thanks,
X.
Okay kids... (Score:5, Informative)
Example to find all links in a document: Yes, it's that simple. For a URL opener that also handles proxies, cookies, HTTP auth, SSL, and so on, look into the urllib2 module that ships natively with Python.
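A minimal sketch of the kind of example being described, since the code sample itself didn't survive Slashdot's formatting; written here with Python 3 names (urllib.request and html.parser in place of urllib2 and the older parser modules), and the URL is only an example:

from html.parser import HTMLParser
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")   # look href up by name, not position
            if href:
                print(href)

page = urlopen("http://slashdot.org/").read().decode("utf-8", "replace")
LinkCollector().feed(page)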
Re: (Score:2)
For example:
Re:Okay kids...(in Ruby) (Score:2, Interesting)
I couldn't resist - in Ruby, using the beautiful (but much underrated) hpricot [whytheluckystiff.net] library:
doc = Hpricot(open(html_document))
(doc/"a").each { |a| puts a.attributes['href'] }
Check it out - I've been using it for a project, and it's really fast and really easy to use (supports both XPath and CSS for parsing links). For spidering you should check out the Ruby mechanize [rubyforge.org] library (which is like Perl's WWW::Mechanize, but also uses hpricot, making parsing the returned document much easier).
Re: (Score:2)
What bugs me the most about this article is that the author keeps using the most generic libraries he can find instead of something written for this exact task. He should h
Re: (Score:1)
Re: (Score:2)
This code won't catch 404s and other errors. Theirs will. Furthermore, assuming the Ruby library is conformant, their code can deal with multi-line headers, while yours would break.
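A small sketch of the difference, using Python's standard library; the URL is made up:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    resp = urlopen("http://slashdot.org/no-such-page")
    print(resp.status, resp.headers.get("Content-Type"))
except HTTPError as err:     # 404s, 500s and friends arrive here
    print("server said:", err.code)
except URLError as err:      # DNS failures, refused connections, timeouts
    print("could not connect:", err.reason)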
Things like grep aren't suitable for parsing HTTP responses. You might get results for simple cases, but there are all kinds of corner cases out there that require a proper script
Re: (Score:1)
Re: (Score:2)
It's a trap! (Score:2, Funny)
1. Post a decoy article to Slashdot (it includes Linux in the subject) with new spam tricks
2. Watch whether spam increases 30% over the next few days
3. Bribe Cowboy Neal with 10G of midget lesbian pr0n and get the IP addresses of the article's readers
4. Load shotgun and make the world a better place!
Been there, done that (Score:1)
I guess most male CS students will have coded something similar at least once to D/L pr0n.
I did one in shell and one in Tcl/Tk.
Re: (Score:2)
User-Agent (Score:1, Troll)
A similar application (Score:2)
Checking links with LinkCheck
http://world.std.com/~swmcd/steven/perl/pm/lc/lin
Reinventing the wheel (Score:1, Interesting)
Incorrect Title (Score:2)
Should be: "How Not
I don't think I am alone in my thinking
No Starch Press book (Score:1)
Nutch (Score:2)
http://lucene.apache.org/nutch/ [apache.org]
Walk the DOM directly (Score:1)
screen-scraper (Score:1)
I did similar things in college... (Score:2)
I did similar things in college with Perl. (shudders*) The programs were OS-neutral; I think I developed mine in Windows under Cygwin.
*Yes, I know Slashdot is written in Perl.
Re: (Score:3, Informative)
Re: (Score:2)