How to Build a Search Engine
CowboyRobot writes "Three years ago, former Infoseek developer Matt Wells decided to go solo and build his own search engine, Gigablast.
In this article, Infoseek founder Steve Kirsch interviews his former employee about the process and challenges of creating a modern, scalable search engine. From the article: 'Search is a fiercely competitive arena, even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast. It's a tight little community, and a lot of the people know and watch each other. Microsoft is also coming to the party, and everyone's a little bit nervous to see what it's bringing.'"
Lol (Score:5, Funny)
P2P? (Score:5, Interesting)
That way, I could share the load with people with similar interests as myself.
For example, I would like a search engine that was more up-to-date crawling the PR of my competitors, but couldn't care less about most other companies. If I were running my own node of a P2P engine, I could set my node to focus on that, and anyone else who shared my interests could tap into it.
Re:P2P? (Score:4, Informative)
One benefit of it is that you can keep the index of your website up to the minute if you really want. I guess they just never got enough people running the indexing software.
Re:P2P? (Score:2, Interesting)
Re:P2P? (Score:5, Interesting)
Personally, I've wanted a Google toolbar that indexes the sites that you surf, and adds additional positive weight to the sites that you linger on. It may not know what you liked there, but it knows that you liked it.
Completely offtopic, but does anyone know of a screensaver on Windows that displays random (or spidered) web pages? I've been looking for an equivalent to the XWindows version for years.
Re:P2P? (Score:5, Interesting)
Now if there were only a way to open said site and continue reading in non-screensaver mode...
Re:P2P? (Score:2, Informative)
Re:Lol (Score:5, Insightful)
Re:Lol (Score:2)
Both DogPile [dogpile.com] and MetaCrawler [metacrawler.com] are owned by InfoSpace [infospaceinc.com]. There may be more than five companies, but not as much diversity as one would think.
Re:Lol (Score:4, Informative)
Gigablast... (Score:4, Interesting)
Re:Gigablast... (Score:2, Insightful)
I never heard of them either, but I've heard of all the others there.
What a load of shit. This guy works on one search engine, then compares his engine to the other top 4 competitors. What about alltheweb.com, for instance? I've at least heard of that one, and it ain't there.
It's like Linux One (remember them?) claiming there are four main Linux distributions: Red Hat, Debian, Slackware, and Linux One.
Re:Gigablast... (Score:2, Informative)
You don't even have to RTFA. Read the summary.
Even SearchKing (Score:2)
Gigablast... 2 years old and nobody's heard of it! (Score:5, Interesting)
It's a fairly old-school engine: indexes whatever it can and favors pages that are keyword-heavy. It's almost too easy to spam. I don't think there's anything PageRank-like in the algorithm, otherwise, it wouldn't be able to add pages to the index "instantly". (PageRank is too computationally intensive for that.) Gigablast still thinks meta-tags are a great idea! While the hardware setup might be innovative (I'll leave that to the hardware experts to decide), the engine software itself seems about ten years behind the times.
Like many posters here, I doubt a one-man outfit is going to take down Google (although many search engine optimizers would like it to). Gigablast has had two years to make an impression, and it hasn't. A company on an acquisition binge might be crazy enough to buy it, but I wouldn't hold my breath.
Re:Gigablast... (Score:2)
Not this year at least. I remember when no one thought that a bunch of college students with this Google thing would be able to unseat Yahoo.
LK
Re:Gigablast... (Score:2, Informative)
Re:Gigablast... (Score:2)
Not *that* complicated (Score:5, Funny)
Re:SCORE -1 ORACLE SYNTAX (Score:2)
Nope, SQL Server handles that syntax just fine. However, unlike C, the ; is unnecessary unless you're stringing multiple commands together on the same line. This is not SQL Server syntax, but ANSI SQL syntax. Most (all?) SQL developers don't bother with semicolons unless they're doing multiple commands on a single line. And since any good DB developer is not writing dynamic SQL (ie, "SELECT * from foo" from PHP, ASP, Perl, etc), but calli
Hmmm.... (Score:5, Interesting)
Google: "Searching 4,285,199,774 web pages" That's quite a big difference.
Re:Hmmm.... (Score:5, Informative)
Re:Hmmm....Talk About Stealing... (Score:2)
Re:Hmmm....Talk About Stealing...Completed Post (Score:2)
How long before other search engines start considering this stealing? I mean, I could have a search engine running tomorrow, if all it did was link to Google and return hits to my own bannered page.
Re:Hmmm....Talk About Stealing...Completed Post (Score:2)
That's a poetic way to put it.
My way of looking at it is that for the first time, everyman has a microphone and a soapbox from which to speak to anyone in the entire world who wishes to listen. While I realize that absolutely has to upset many people entrenched in power, I feel it is the finest example of free, equal, and unfettered sp
Re:Hmmm.... (Score:4, Funny)
Google: "Searching 4,285,199,774 web pages" That's quite a big difference.
At least this Gigablast name is closer to the truth. They are only exaggerating their page count by a factor of 3.7 : 1.
By my math, Google comes up short by 2.3x10^90 : 1.
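For anyone checking the math: the joke appears to rest on "Google" being a play on googol (10^100), with the page count taken from the quoted banner. A minimal sketch of the arithmetic follows; the implied Gigablast page count is only what a 3.7 : 1 ratio against "giga" (10^9) would suggest, not a figure stated anywhere in this thread.

```python
# Assumption: the "short by 2.3x10^90" figure compares a googol to the quoted page count.
GOOGOL = 10 ** 100
GOOGLE_PAGES = 4_285_199_774  # "Searching 4,285,199,774 web pages"

shortfall = GOOGOL / GOOGLE_PAGES
print(f"{shortfall:.1e}")  # 2.3e+90

# Assumption: "giga" means 10**9 pages, so a 3.7 : 1 exaggeration
# implies roughly 270 million pages actually indexed.
implied_gigablast_pages = round(10 ** 9 / 3.7)
print(implied_gigablast_pages)
```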
Re:Hmmm.... (Score:4, Informative)
Re:Hmmm.... (Score:2)
And Yahoo Search returns 309,000 results.
It's not the number of results, it's how they are arranged.
One guy, eight computers (Score:2)
Given that, plus the fact that he's spidered my worthless blog, I'm pretty impressed. Definitely something to watch.
Re:Hmmm.... (Score:2)
Google: "Searching 4,285,199,774 web pages" That's quite a big difference.
Yes, and noticeable to me. I tried to search for a site I know, and regardless of how many terms I entered, it didn't spot it... In the end, the results were down to 2 hits (with only three common keywords) and it wasn't among the sites.
Heck, it doesn't put www.slashdot.org first when searching for Slashdot.
Re:Given his resources, (Score:2)
I notice he doesn't mention the speed of the internet connection (40 queries a second isn't a lot - I expect google handles 10 times that at least).
That list makes no sense (Score:5, Insightful)
Re:That list makes no sense (Score:2)
Whatever happened to (Score:5, Interesting)
ahh the dotcomfallout
at least www.cowboynealsproncollection.com is doing well
Re:Whatever happened to (Score:2)
Re:Whatever happened to (Score:2)
only 5? (Score:5, Informative)
and to the post above this.. what do 2 trillion hits matter against 2 million if they can't get what you really need up onto the first page
Re:only 5? (Score:2)
It doesn't matter one bit. That is, until you want page 2 million + 1. Then, all of a sudden, having a few more billion pages to index is a good thing.
Humph (Score:5, Funny)
Money. Lots and lots of money.
Re:Humph (Score:2, Funny)
Not in every instance. Some lawyers suck so much the money comes right to them.
Microsoft at the party (Score:2, Funny)
"Pass the dip, guys!"
Isn't yahoo powered by google? (Score:2, Insightful)
Re:Isn't yahoo powered by google? (Score:5, Informative)
Re:Isn't yahoo powered by google? (Score:2)
Yahoo used to use Google, but they bought Inktomi and have switched to their search engine. MSN also uses the Inktomi search engine, but tweak the results.
Right before google became a big name, yahoo used inktomi (inktomi used to be a really big name in the search engine industry). Yahoo used to use google more recently, but now they don't.
It looks like yahoo bought inktomi about a year ago [silicon.com], so I guess that's what they are using again.
Re:Isn't yahoo powered by google? (Score:3, Informative)
Matt's a good guy (Score:3, Informative)
Re:Matt's a good guy (Score:5, Funny)
Voting methods and search engines.... (Score:5, Informative)
For example, there is the Kemeny order (named after the same guy who came up with BASIC, John G. Kemeny). Using a version of ranked ballots and sorting websites by the mean Kemeny order gives you a method that is surprisingly good at putting authoritative sites at the top and spam sites at the bottom. For those of you who like in-depth analysis and don't mind math, the following is a good site:
http://www10.org/cdrom/papers/577/
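To make the idea concrete, here is a minimal sketch of Kemeny-style rank aggregation: treat each source ranking as a ballot and pick the ordering that disagrees with the ballots on the fewest pairs. The site names and ballots are made up for illustration, and the brute-force search over every permutation is only workable for a handful of items. It is not the method from the linked paper, which deals with doing this at web scale.

```python
from itertools import combinations, permutations


def kendall_tau(order_a, order_b):
    """Count the pairs of items that the two orderings rank in opposite order."""
    pos_a = {item: i for i, item in enumerate(order_a)}
    pos_b = {item: i for i, item in enumerate(order_b)}
    return sum(
        1
        for x, y in combinations(order_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def kemeny_order(ballots):
    """Brute-force the ordering with the least total disagreement with the ballots."""
    items = sorted(ballots[0])
    return min(
        permutations(items),
        key=lambda candidate: sum(kendall_tau(candidate, b) for b in ballots),
    )


# Hypothetical ballots: three rankings of the same three sites.
ballots = [
    ["docs.example", "blog.example", "spam.example"],
    ["blog.example", "docs.example", "spam.example"],
    ["docs.example", "spam.example", "blog.example"],
]
print(kemeny_order(ballots))  # ('docs.example', 'blog.example', 'spam.example')
```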
Other Search Engines (Score:2, Insightful)
In my opinion..... (Score:5, Funny)
Whoa, hold on. Wrong site. Never mind.
Thinking you were... (Score:2)
BOOBLE! (Score:4, Funny)
Heh... (Score:5, Interesting)
Then again, it hardly needs to most of the time...
Competition, in this case... a good thing (Score:4, Insightful)
I think the guy just expanded his database (Score:2, Interesting)
(("Slashdot serves 50 million pages per month [slashdot.org]"/(# users actually checking out this story))*number of searches tried) + a residual amount that might actually use this search engine more
And what they might be interested in.
Re:I think the guy just expanded his database (Score:2)
Think about how much that number has changed in four years.
AV (Score:2, Informative)
Re:AV (Score:2)
What about patents? (Score:5, Interesting)
So how do you make a search engine and not get sued for infringement, or at least be able to win in the lawsuits?
Re:What about patents? (Score:2)
what timing for this /. article! (Score:5, Informative)
just as I'm pulling an all-nighter at this moment trying to embed a custom search engine into an app for use on an intranet.
Actually what is more interesting is Nutch [nutch.org] and Mozdex [mozdex.com], which seems to be based around Lucene [apache.org] (what I am using to build my own search engine embedded into a Horde [horde.org] framework app). Although probably a lot simpler than the industrial grade stuff, for someone who has been used to throwing a word at an input screen and magically getting back results, the insight into the inner workings of search engines is very interesting.
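For a taste of those inner workings, the core data structure is an inverted index: a map from each term to the set of documents containing it. The sketch below is a toy in Python, not Lucene's API (Lucene is a Java library with analyzers, scoring, and on-disk segments); the document ids and texts are invented for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical intranet documents.
docs = {
    "wiki/1": "How to build a search engine",
    "wiki/2": "Search tips for the intranet",
    "wiki/3": "Build instructions for the engine room",
}
index = build_index(docs)
print(sorted(search(index, "search engine")))  # ['wiki/1']
print(sorted(search(index, "build")))          # ['wiki/1', 'wiki/3']
```

A real engine adds tokenization, stemming, and ranking on top, but the lookup is the same intersection of posting lists.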
Searching from the server's perspective (Score:5, Insightful)
I'd just prefer it if search engines would have enhanced rules for the robots.txt file so a webmaster could tell them more specifically how they want to be searched.
Yes, I know you can put in a delay between page searches, and you can deny access to parts or all of the site, and you can even tell some or all crawlers to take a flying leap, but I'd like to tell them at the front door, "Search on Wednesday, make it fast, do a thorough job, and don't come back for a week."
Too much to ask, right?
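For reference, here is roughly what the existing controls look like from the crawler's side, using Python's standard urllib.robotparser. The robots.txt content, the "ExampleBot" name, and the example.com URLs are made up. Crawl-delay is a non-standard extension that only some crawlers honor, and nothing in this sketch can express "Search on Wednesday" or "don't come back for a week".

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a delay between fetches and one blocked directory.
robots_txt = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A polite crawler checks each URL before fetching and waits between requests.
print(parser.can_fetch("ExampleBot", "http://www.example.com/private/notes.html"))  # False
print(parser.can_fetch("ExampleBot", "http://www.example.com/index.html"))          # True
print(parser.crawl_delay("ExampleBot"))                                             # 10
```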
Re:Searching from the server's perspective (Score:2)
A few years ago, a number of Scientology-critical sites were getting hammered by bots from machoproducts.com, which seems like a weird link. (Rumours of a martial-arts cult, but no direct Hubbard connect
Re:Searching from the server's perspective (Score:2)
Uhm No (Score:5, Insightful)
Oh yeah real nervous. They're getting on the bandwagon late; too late to monopolize this particular free (as in shut the fuck up) service. If by some miracle they produce something 'threatening', it will be because it's good or because the others have slacked off.
Re:Uhm No (Score:3, Insightful)
Re:Uhm No (Score:3, Insightful)
Usually they try to buy a competing company or hire the brains behind it.
Re:Uhm No (Score:2)
Gigabooo (Score:2, Funny)
Google : My site is the first !!!
And of course I refuse to believe that anyone in the world would be interested in anything but my home page.
Re:Gigabooo (Score:2)
The value of pagerank (Score:5, Interesting)
He said that his internal tests at Infoseek showed that pagerank didn't substantially improve the value of searches over simpler link analysis algorithms. I find that interesting, because I've worked with that algorithm and I know it's a stone bitch to compute.
He might well be right. I like Google over the other search engines because the interface is simple and clean, and I find it pleasant to use. I'm reminded of Donald Norman's book on Emotional Design, about how we can get really attached to things that work for us.
Google sells itself on pagerank, but at the very least it's insufficient against "search engine spam". If pagerank is less important than speed and utility, maybe I'll have something else programmed in to my Firefox search bar. But not today.
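To show where the cost comes from, here is a minimal sketch of the textbook power-iteration form of pagerank on a made-up four-page graph. The damping factor of 0.85 and the fixed 50 iterations are conventional illustrative choices, not anything Google or Infoseek is known to use. Every iteration touches every link, so on a web-sized graph each pass is a full sweep over billions of edges.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph; every link target also appears as a key.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # c
```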
Obligatory..... (Score:2)
How to build a search engine? (Score:3, Funny)
Search engines could replace a query language? (Score:4, Interesting)
However, I think that search engines, if they index XML properly, will have a good shot at replacing SQL.
Discuss.
Microsoft Party Crashing (Score:4, Insightful)
Everybody knows what Microsoft is bringing. Well almost everybody. Okay, I'll spell it out:
1: Bring lots of money.
2: Buy out a competitor.
3: Rename it Microsoft Search.
4: Attempt to trademark the word "Search".
5: Bind it tightly into Windows as an essential service.
6: Don't get it right until version 3.0.
7: Profit!
less commercialism (Score:5, Interesting)
I've noticed lately that Google seems to be filling up with websites wanting to sell you stuff (even if they don't use spamming techniques). Perhaps these little guys can put the pressure on Google to get some better algorithms. Or perhaps it's time for Google to fade into the past like Altavista did a couple years ago and make way for the new.
Re:less commercialism (Score:3, Interesting)
C'mon, yes, Google's interface is cool and stuff, but Google's success isn't just its interface. Their search algorithms are rock solid, they are continually improving them, and Google returns the most relevant
What I'd Like To See In A Search Engine (Score:5, Insightful)
You may license my patent on this idea for reasonable terms in exchange for shares of your company's stock.
Re:What I'd Like To See In A Search Engine (Score:3, Insightful)
Respidering a website doesn't just take up bandwidth but also a lot of CPU cycles. That's especially true if you're running extensive algorithm-based computations (like Google does) and not just doing a quick-and-dirty instant-add to a database. It would also allow webmasters to cheat the system: temporarily mirror some relevant, high-traffic site, have it reindexed, change the contents (porn, spam, you name it). After a while, the bot will rein
Re:What I'd Like To See In A Search Engine (Score:3, Informative)
The index is usually updated only once every couple of weeks. Recomputing PageRank (or whatever everybody else uses) takes its time. That's why more or less immediate updates are reserved only to the best-known sites.
You can report 'false' results with the Dissatisfied? [google.com] link at the bott
In other news... (Score:5, Funny)
Only five? (Score:5, Interesting)
I am a Chinese speaker, and East Asian writing is traditionally character-based, with no alphabet. That means we don't separate words with blank spaces but rather dosomethinglikethis. This characteristic causes many problems for search services, because you never know whether splitting that into "dos ome thingli keth is" is right or not. We have to introduce a dictionary into the search engine, which is different from many Western languages. So I don't believe there are only five search engine providers in this world. At least I know of a list of other search engines developed to support East Asian languages.
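One common dictionary-based approach is forward maximum matching: at each position, take the longest dictionary word that fits. The sketch below uses the run-together English example instead of real Chinese text, and the tiny word list is invented; it is an illustration of the technique, not what any particular search engine does.

```python
def segment(text, dictionary):
    """Greedy forward maximum matching; unknown characters become single tokens."""
    max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical dictionary for the example string.
dictionary = {"do", "some", "something", "thing", "like", "this", "is"}
print(segment("dosomethinglikethis", dictionary))
# ['do', 'something', 'like', 'this']

# Adding one misleading entry shows how easily the greedy split goes wrong.
print(segment("dosomethinglikethis", dictionary | {"dos"}))
```

The second call starts with "dos" and then stumbles over the leftover characters, which is the ambiguity problem in miniature.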
Re:Only five? (Score:4, Funny)
I'm ready to change (Score:5, Interesting)
I keep getting high rankings from sites like bizrate and kelkoo, which don't have any content whatsoever, but have convinced google to show pages that say "search for best prices on xxxx" where xxxx is my search term. Often the problem is so bad that I don't see any sites with content until page 2 of google.
Another issue is with searches for song lyrics. There are dozens of identikit advert sites which drown a tiny (and often inaccurate) text payload in a swarm of adverts. Finding a site written by someone who cares about accuracy is getting impossible.
What I want is sites ranked by volume of relevant content, with a negative ranking element for duplicate sites and a stronger negative ranking for multiple adverts.
Oh, and what I would also find useful is a 'go (after blocking adservers)' button instead of a 'go' button.
Re:I'm ready to change (Score:2)
I wonder if they pay to get these results, or if this is just a confusion of Google caused by the fact that kelkoo has similar sites in many different domains that all link to each other.
So, Google thinks there are lots of links to a certain page and thus gives a high ranking.
Check out my search engine! (Score:2, Funny)
Make it like a human brain... (Score:5, Funny)
The protocol used in the brain? That can't be a good direction to go. I mean, if it's anything like my memory and honestly, the memory of most people I know, it's definitely going to be a step backwards. Human brains can hold a lot of information, but retrieval is definitely not their specialty. I can see it now. Type in my search terms and the engine comes back with, "ummm, it's right on the tip of my tongue. Okay, I don't have a tongue, but I just about remember it. Give me just a minute to think about it. umm... umm... Nope, it's gone. Nevermind."
How to create a web search engine... (Score:3, Funny)
2. ???
3. Profit!
a9.com is Amazon's web search entry (Score:3, Informative)
Re:Lycos anyone (Score:5, Interesting)
Re:Interesting (Score:3, Insightful)
Right now, one difference between Gigablast and Google is that Gigablast doesn't seem to index PDF files. This makes me sad, since I run a web site whose sole purpose is to serve up big PDF files.
There are also some minor usability problems compared to Google. If your search returns more than 10 results, you can't tell how many there are. You have to understand how to do "+keyword" a
Re:Interesting (Score:3, Informative)
except of course, for the advanced search [gigablast.com] form
Re:Interesting (Score:3, Interesting)
http://www.gigablast.com/search?k3v=898090&s=10
Pretty nice if you ask me. I hate opening PDF links by accident. Sometimes in google I accidentally click them before I realize they are going to be opened by some stupid browser plugin or (more often than not) Adobe's bloated Reader.
Re:Interesting (Score:5, Interesting)
Not really. I was impressed with the power of a good slashdotting until we made the slashdot frontpage a few weeks ago (we also made it to the frontpage a few years ago but at that time we were serving static htmls).
An article was pulled out of a mysql database, xsl transformed, sent to the webserver via SOAP, and finally sent to the user as about 150k of html and images. Repeat 80,000 times over a 5 hour period.
This is hardly an impressive feat. I expected more, but it turns out that slashdot really only sends about 20-30k unique visitors to your site.
Yes, I used to be impressed with the power of a slashdotting, but now I realize that it's just the result of very crappy sites run on very crappy desktop machines pretending to be servers.
So, no, them withstanding a slashdot link isn't a good sign, it's the very least we can expect of a commercial entity.
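A back-of-the-envelope check of those numbers, assuming "150k" means 150 kilobytes per page view and the 80,000 requests were spread evenly over the 5 hours (real traffic would be burstier):

```python
REQUESTS = 80_000
HOURS = 5
PAGE_BYTES = 150 * 1000  # assumption: "150k" = 150 kilobytes of html and images

seconds = HOURS * 3600
requests_per_second = REQUESTS / seconds
total_gigabytes = REQUESTS * PAGE_BYTES / 1e9
megabits_per_second = REQUESTS * PAGE_BYTES * 8 / seconds / 1e6

print(f"{requests_per_second:.2f} requests/s")  # 4.44 requests/s
print(f"{total_gigabytes:.1f} GB total")        # 12.0 GB total
print(f"{megabits_per_second:.2f} Mbit/s")      # 5.33 Mbit/s
```

Averaged out, that is under five requests a second, which is the point being made above.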
Just wait until the duplicate... (Score:2)
Re:Interesting (Score:2)
Friday, 1PM MST, 304 comments.
Re:Open Source Search Engine? (Score:5, Informative)
Re:The Key to Winning the Web (Score:2)
and if I want "www.groklaw.net" rather than "www.google.com" it's "gr" rather than "go".
So I doubt I'd ever really type "www.vivisimo.com" in its entirety but only "v" or "vi".
Re:Nice (Score:2)
I often wonder if this is sometimes done on purpose. One example is the band Live, given that searching for "Live" of course turns up millions of unrelated results... though Live was around (and so-named) quite a bit before the P2P thing exploded...
Re:originality (Score:2)
Looks ok in IE, but in other browsers looks like crap.