How Hard Is It To Write Your Own Search Engine?
kha0z writes "Anna Patterson, from Stanford University, gives an overview of the difficulties that have to be overcome when developing or implementing a search engine in this article in ACM Queue magazine. The article covers many issues, from data sources to indexing to ranking. How does Google make it look so easy?"
Well (Score:1)
Search engine != entire web (Score:5, Insightful)
While writing a local search engine isn't trivial, it's a lot easier than writing a web search engine since all the scaling issues disappear -- I know: I wrote one [mac.com].
Re:Search engine != entire web (Score:2, Interesting)
it is not complicated at all.
How does Google make it look so easy? (Score:4, Interesting)
Google has hundreds of millions of dollars. Google treats their 2000 (2500+?) employees pretty well, so those employees work hard and smart and put in 40-60 hour work weeks. Google started earlier than the other modern engines, and had some very good ideas.
They focus on a small set of goals: make it easy to search through a ton of information.
Compare this to Yahoo and MSN, where search is really just one part of their business model (there is no Google Singles! or Google Games).
Re:How does Google make it look so easy? (Score:2)
Originally this was the case, but recently even Google has started forays into other services. Just click the "more>>" link above the search field on their front page.
Froogle (shopping), Groups (newsgroups), Blogger, and of course there is the ever popular
Unfortunately it seems they may be losing focus.
Re:How does Google make it look so easy? (Score:4, Insightful)
Froogle is still a search product, but with a focus on shopping.
Groups is mostly still a search product (you can also post, so it's about creating information too). The service has been around for years (I think it's their second big project after web search). If I have a technical question, I often find the answers in Google Groups. Blogger is new, but is similar to Groups in its goals.
Gmail is also largely about search. With search they can place ads in your email.
Actually, I guess you can really say that Google is about using a good search technology to place highly targeted ads with the information.
Re:How does Google make it look so easy? (Score:2)
Groups is mostly still a search product (You can post also, so it's also about creating information). The service has been around for years (I think it's their second big project after web search).
Actually, Google Groups used to be DejaNews, and they bought the technology.
Re:How does Google make it look so easy? (Score:2)
This is an excellent opportunity to point out that 8 of those hours every week, mandatory, must be spent on a personal project not related to Google's line of business.
That one perk is the absolute best.
Picture here. (Score:3, Offtopic)
I mean, be for real - who gives a damn about the article?
Re:Picture here. (Score:1)
--HC
Have you Read ast's Computer Networks Book? (Score:2, Interesting)
Ask Tim Bray (Score:4, Informative)
Thanks for the link (Score:2, Funny)
Re:Thanks for the link (Score:2, Funny)
Here's the FreeCache [freecache.org] version in case of slashdotting.
not too hard (Score:1)
The crawl is hard, too (Score:5, Insightful)
You have to deal with 404s, robots.txt, politeness (don't bring down someone's site by crawling too fast), redirects, and content you can't handle (Flash, JavaScript).
The list goes on.
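To make two of those items concrete (robots.txt and politeness), here is a rough Python sketch. It is only an illustration, not how any real crawler does it: the PoliteScheduler class, the "example-bot" user agent and the one-second default delay are all made-up names and values. It uses the standard library's urllib.robotparser to check rules, and does no fetching itself.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class PoliteScheduler:
    """Tracks per-host robots.txt rules and the earliest next-fetch time.

    Hypothetical helper for illustration; it performs no network I/O.
    """

    def __init__(self, user_agent="example-bot", default_delay=1.0):
        self.user_agent = user_agent        # assumed bot name
        self.default_delay = default_delay  # assumed fallback delay, seconds
        self.rules = {}    # host -> RobotFileParser
        self.next_ok = {}  # host -> earliest time we may fetch again

    def add_robots(self, host, robots_txt):
        """Register the text of a host's robots.txt (fetched elsewhere)."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def allowed(self, url):
        """True if robots.txt permits the URL, or if we have no rules for the host."""
        parser = self.rules.get(urlparse(url).netloc)
        return True if parser is None else parser.can_fetch(self.user_agent, url)

    def wait_time(self, url, now=None):
        """Seconds to wait before the URL's host may be hit again."""
        now = time.monotonic() if now is None else now
        return max(0.0, self.next_ok.get(urlparse(url).netloc, 0.0) - now)

    def record_fetch(self, url, now=None):
        """Note that we just fetched from this host and schedule the next slot."""
        now = time.monotonic() if now is None else now
        host = urlparse(url).netloc
        parser = self.rules.get(host)
        delay = parser.crawl_delay(self.user_agent) if parser else None
        if delay is None:
            delay = self.default_delay
        self.next_ok[host] = now + delay
```

A crawl loop would call allowed() before each request, sleep for wait_time(), then call record_fetch(). Redirect loops, 404 handling and unparseable content are all still left out, which is rather the point of the comment above.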
Re:The crawl is hard, too (Score:2)
I wrote my own crawler once. A just-for-fun-how-do-the-spammers-do-it kind of thing...
I got around redirects and content by going straight via the network socket and looking at the response as pure text. Anything that fit the regex pattern for an email address got harvested.
As far as politeness... I kept a circular growable queue (technically a linked list) of sites to visit. Eac
Nutch (Score:1)
There is one open source search engine that seems to be up-and-coming. Nutch [nutch.org] is now powering Mozdex [mozdex.com], and it looks fairly impressive so far.
Now, instead of the previous free-will donations [mozdex.com], you can support the project by purchasing very cheap sponsored listings [mozdex.com] that appear to the right of the results (similar to Google).
Author @ Google? (Score:1)
4 parts to a search engine (Score:2)
1) The crawler - goes out and retrieves pages
2) The parser - parses the pages, finding links and text.
3) The indexer - indexes the pages.
4) The searcher - interprets the index in the context of a user's search request.
None of these are especially hard to do simple versions of, but all of them are hard to do well.
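For what it's worth, the "simple versions" of parts 2 through 4 fit in a few dozen lines of Python. This is a toy sketch under loud assumptions: the crawler is replaced by a dict of pages already in memory, the names (PageParser, build_index, search) are invented for the example, and the searcher does a plain AND match with no ranking at all.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """The parser: collects outgoing links and text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Naive: this also picks up script and style contents.
        self.text.append(data)


def tokenize(text):
    """Lowercase and split into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """The indexer. pages maps url -> HTML (stands in for the crawler's output).

    Returns (inverted index: word -> set of urls, link map: url -> list of hrefs).
    """
    index = defaultdict(set)
    links = {}
    for url, html in pages.items():
        parser = PageParser()
        parser.feed(html)
        parser.close()
        links[url] = parser.links
        for word in tokenize(" ".join(parser.text)):
            index[word].add(url)
    return index, links


def search(index, query):
    """The searcher: URLs containing every query word, sorted for stable output."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)
```

Feeding it two tiny pages and calling search(index, "hard") returns every page containing that word. Everything that makes it "hard to do well" (ranking, scale, spam, incremental updates) is exactly what this leaves out.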