How To Build a Web Spider On Linux
IdaAshley writes, "Web spiders are software agents that traverse the Internet gathering, filtering, and potentially aggregating information for a user. This article shows you how to build spiders and scrapers for Linux that crawl a Web site and gather information (in this case, stock data). Using common scripting languages and their collections of Web modules, you can easily develop Web spiders."
Crawling efficiently (Score:5, Informative)
Better to use an associative array to cache the links, since lookup is O(1). A queue's lookup time is O(n), and if n gets large, so does each lookup; since you check every link against it, the worst case for the whole crawl is O(n^2). A hash (associative array) does each check in O(1), so all n checks take O(n) total.
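To illustrate, here is a minimal sketch of a crawl loop in Python that keeps the visited-link cache in a set (Python's hash-based container); the frontier queue, URL, and extract_links helper are illustrative, not from the article:

    from collections import deque

    to_visit = deque(["http://www.example.com/"])  # frontier of URLs still to crawl
    visited = set()                                # hash-based cache: O(1) average membership test

    while to_visit:
        url = to_visit.popleft()
        if url in visited:    # with a list, this check would cost O(n) per link
            continue
        visited.add(url)
        # fetch the page here and enqueue any newly discovered links, e.g.:
        # to_visit.extend(link for link in extract_links(url) if link not in visited)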
downloads (Score:5, Informative)
For those of us who don't have them, here are the basics:
Wget: http://www.gnu.org/software/wget/ [gnu.org].
Curl: http://curl.haxx.se/ [curl.haxx.se].
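If you just want to see them in action from a script, here is a minimal sketch that shells out to each tool; the URL, output filenames, and flag choices are illustrative, and both commands must be installed:

    import subprocess

    url = "http://www.example.com/"

    # wget: -q is quiet, -O names the output file
    subprocess.call(["wget", "-q", "-O", "page_wget.html", url])

    # curl: -s silences the progress meter, -o names the output file
    subprocess.call(["curl", "-s", "-o", "page_curl.html", url])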
That reminds me. (Score:3, Informative)
For a good chuckle, see The Spider of Doom [thedailywtf.com] on the Daily WTF.
And please use robots.txt; a minimal compliance check is sketched at the end of this comment.
And go see Google Webmaster tools [google.com].
And don't wear socks with sandals. Well, ok, this one is optional.
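On the robots.txt point: here is a minimal sketch of checking it before fetching, using the robotparser module from the Python 2 standard library (urllib.robotparser in Python 3); the user-agent string and URLs are made up:

    import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()

    # ask whether our spider may fetch a given page
    if rp.can_fetch("MySpider/1.0", "http://www.example.com/private/page.html"):
        print "allowed to crawl"
    else:
        print "disallowed by robots.txt"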
Re:some points (Score:4, Informative)
It's always had it. Look up XUL some day. The entire browser UI is written in XUL.
Okay kids... (Score:5, Informative)
Example to find all links in a document: see the sketch below. Yes, it's that simple. For a URL opener that also handles proxies, cookies, HTTP auth, SSL, and so on, look into the urllib2 module that ships natively with Python.
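A minimal sketch of such a link extractor, assuming Python 2's urllib2 plus the standard HTMLParser module (the comment only names urllib2; the parser choice and the URL are assumptions):

    import urllib2
    from HTMLParser import HTMLParser

    class LinkCollector(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.links = []

        def handle_starttag(self, tag, attrs):
            # record the href attribute of every anchor tag
            if tag == "a":
                for name, value in attrs:
                    if name == "href":
                        self.links.append(value)

    html = urllib2.urlopen("http://www.example.com/").read()
    parser = LinkCollector()
    parser.feed(html)
    for link in parser.links:
        print link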