How to Build a Search Engine

Posted by michael on Saturday April 17, 2004 @11:30PM from the some-assembly-required dept.

CowboyRobot writes "Three years ago, former Infoseek developer Matt Wells decided to go solo and build his own search engine, Gigablast. In this article, Infoseek founder Steve Kirsch interviews his former employee about the process and challenges of creating a modern, scalable search engine. From the article: 'Search is a fiercely competitive arena, even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast. It's a tight little community, and a lot of the people know and watch each other. Microsoft is also coming to the party, and everyone's a little bit nervous to see what it's bringing.'"

This discussion has been archived. No new comments can be posted.

Lol (Score:5, Funny)

by SugoiMonkey ( 648879 ) writes: on Saturday April 17, 2004 @11:32PM (#8895303) Homepage Journal

"even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast " Gigawho? You silly goose.

- P2P? (Score:5, Interesting)
  
  by ron_ivi ( 607351 ) writes: <sdotno@NOsPAM.cheapcomplexdevices.com> on Saturday April 17, 2004 @11:39PM (#8895340)
  
  I always thought P2P would be a good infrastructure for a search engine.
  That way, I could share the load with people with similar interests as myself.
  For example, I would like a search engine that was more up-to-date crawling the PR of my competitors, but couldn't care less about most other companies. If I were running my own node of a P2P engine, I could set my node to focus on that, and anyone else who shared my interests could tap into it.
  
  - Re:P2P? (Score:4, Informative)
    
    by lakeland ( 218447 ) writes: <lakeland@acm.org> on Saturday April 17, 2004 @11:53PM (#8895407) Homepage
    
    There was one a while back. Everybody installed a program kinda like glimpse on their server and indexed their own web site and a few others. IIRC it would automatically work out by IP address any sites that were nearby and not already over-indexed. They all then kinda pooled the results.
    
    One benefit of it is that you can keep the index of your website up to the minute if you really want. I guess they just never got enough people running the indexing software.
    
  - Re:P2P? (Score:2, Interesting)
    
    by Anonymous Coward writes:
    
    how about an "open" search engine? any takers? post below....
  - Re:P2P? (Score:5, Interesting)
    
    by cgenman ( 325138 ) writes: on Sunday April 18, 2004 @01:47AM (#8895755) Homepage
    
    The closest thing to what you're talking about is Grub [grub.org], which is run by Looksmart [looksmart.com] as a dead-link checker and also feeds to WiseNut [wisenut.com]. While it doesn't allow you to crawl sites that you don't have control over, it does allow you to crawl your own site.
    
    Personally, I've wanted a Google toolbar that indexes the sites that you surf, and adds additional positive weight to the sites that you linger on. It may not know what you liked there, but it knows that you liked it.
    
    Completely offtopic, but does anyone know of a screensaver on Windows that displays random (or spidered) web pages? I've been looking for an equivalent to the XWindows version for years.
    
    - Re:P2P? (Score:5, Interesting)
      
      by cgenman ( 325138 ) writes: on Sunday April 18, 2004 @02:31AM (#8895867) Homepage
      
      ...Just answered my own question. Combining A+ Web Screensaver (nonfree) with a random web page URL (www.uroulette.com/visit.php) gets a random web page display on idle. Yay! Now I'll never know if I'm going to a Polynesian community church or a poorly written Raiders fansite.
      
      Now if there were only a way to open said site and continue reading in non-screensaver mode...
      
  - Re:P2P? (Score:2, Informative)
    
    by toddler99 ( 626625 ) writes:
    
    There has been work in this direction already at Lehigh University; check it out here: http://wume.cse.lehigh.edu/
- Re:Lol (Score:5, Insightful)
  
  by SphericalCrusher ( 739397 ) writes: on Saturday April 17, 2004 @11:49PM (#8895393) Journal
  
  That sounds a lot like self-advertisement to me. And there are A LOT more than just five companies! Take MetaCrawler and DogPile for instance -- they aren't on his list.
  
  - Re:Lol (Score:2)
    
    by great throwdini ( 118430 ) writes:
    
    Take MetaCrawler and DogPile for instance -- they aren't on his list.
    
    Both DogPile [dogpile.com] and MetaCrawler [metacrawler.com] are owned by InfoSpace [infospaceinc.com]. There may be more than five companies, but not as much diversity as one would think.
    - Re:Lol (Score:4, Informative)
      
      by Nasarius ( 593729 ) writes: on Sunday April 18, 2004 @01:55AM (#8895768)
      
      And they're not search engines. They're just meta-search engines that compile the results of Google, Yahoo, etc.
      
Gigablast... (Score:4, Interesting)

by vosbert ( 544192 ) writes: on Saturday April 17, 2004 @11:34PM (#8895318)

Am I the only one who's never heard of Gigablast... but then not too many years ago, I remember a time when I'd never heard of Google. Kinda makes one wonder how secure a lead over its competition any search engine can ever hope to obtain, and what kind of chance Microsoft stands of usurping the search engine market.

- Re:Gigablast... (Score:2, Insightful)
  
  by Anonymous Coward writes:
  
  Am I the only one who's never heard of Gigablast...
  
  i never heard of them either, but heard of all others there.
  
  what a load of shit, this guy works on one search engine, then compares his engine to the other top 4 competitors. What about alltheweb.com, for instance? I've at least heard of that one, it ain't there.
  
  It's like Linux One (remember them?) claiming there are four main Linux distributions: Red Hat, Debian, Slackware, and Linux One.
  - Re:Gigablast... (Score:2, Informative)
    
    by cliffy2000 ( 185461 ) writes:
    
    "Search is a fiercely competitive arena, even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast."
    You don't even have to RTFA. Read the summary.
  - Even SearchKing (Score:2)
    
    by PlatinumInitiate ( 768660 ) writes:
    
    Even SearchKing [searchking.com] is better known than Gigablast... and SearchKing pretty much faded into obscurity after the Google/SearchKing problems a while back.
- Gigablast... 2 years old and nobody's heard of it! (Score:5, Interesting)
  
  by mbauser2 ( 75424 ) writes: on Sunday April 18, 2004 @05:21AM (#8896161) Homepage
  
  I have heard of Gigablast, but I've never been impressed by it. (I wrote a review [bauser.com] back in 2002.) Most search engine optimizers love Gigablast, however, because it's such an easy engine to game.
  
  It's a fairly old-school engine: indexes whatever it can and favors pages that are keyword-heavy. It's almost too easy to spam. I don't think there's anything PageRank-like in the algorithm, otherwise, it wouldn't be able to add pages to the index "instantly". (PageRank is too computationally intensive for that.) Gigablast still thinks meta-tags are a great idea! While the hardware setup might be innovative (I'll leave that to the hardware experts to decide), the engine software itself seems about ten years behind the times.
  
  Like many posters here, I doubt a one-man outfit is going to take down Google (although many search engine optimizers would like it to). Gigablast has had two years to make an impression, and it hasn't. A company on an acquisition binge might be crazy enough to buy it, but I wouldn't hold my breath.
  
- - Re:Gigablast... (Score:2)
    
    by Lord Kano ( 13027 ) writes:
    
    Gigablast is in no way going to usurp Google.
    
    Not this year at least. I remember when no one thought that a bunch of college students with this Google thing would be able to unseat Yahoo.
    
    LK
- - Re:Gigablast... (Score:2, Informative)
    
    by mikis ( 53466 ) writes:
    
    A9 serves Google results, so you can't quite call them a "search company". But I'm sure there are at least a dozen as big and famous as "Gigablast"
    - Re:Gigablast... (Score:2)
      
      by funky womble ( 518255 ) writes:
      
      But the URL format is so much quicker to type if you want to do a quick search from a machine with no fast way to search google.
Not *that* complicated (Score:5, Funny)

by Anonymous Coward writes: on Saturday April 17, 2004 @11:35PM (#8895322)

This will cover about 50% of your job:

select * from internet where keywords like '%asian sex free pics%';

- - - Re:SCORE -1 ORACLE SYNTAX (Score:2)
      
      by Osty ( 16825 ) writes:
      
      surprise, surprise: seems like SQL server is the odd one out.
      
      Nope, SQL Server handles that syntax just fine. However, unlike C, the ; is unnecessary unless you're stringing multiple commands together on the same line. This is not SQL Server syntax, but ANSI SQL syntax. Most (all?) SQL developers don't bother with semicolons unless they're doing multiple commands on a single line. And since any good DB developer is not writing dynamic SQL (ie, "SELECT * from foo" from PHP, ASP, Perl, etc), but calli
Hmmm.... (Score:5, Interesting)

by elid ( 672471 ) writes: <eli.ipod@gmai[ ]om ['l.c' in gap]> on Saturday April 17, 2004 @11:35PM (#8895323)

Gigablast: "273,384,720 pages indexed"
Google: "Searching 4,285,199,774 web pages" That's quite a big difference.

- Re:Hmmm.... (Score:5, Informative)
  
  by ixplodestuff8 ( 699898 ) writes: on Saturday April 17, 2004 @11:44PM (#8895368)
  
  I've never heard of gigablast either, but it seems to have some interesting features: it links to the Wayback Machine's page on the site so you can see past versions of the site, and it also shows the most common phrases in which the search term was found. It also archives pages like Google and goes as far as to link to OTHER search engines to help out your search
  
  - Re:Hmmm....Talk About Stealing... (Score:2)
    
    by Nom du Keyboard ( 633989 ) writes:
    
    goes as far as to link to OTHER search engines to help out your search
    - Re:Hmmm....Talk About Stealing...Completed Post (Score:2)
      
      by Nom du Keyboard ( 633989 ) writes:
      
      goes as far as to link to OTHER search engines to help out your search
      How long before other search engines start considering this stealing? I mean, I could have a search engine running tomorrow, if all it did was link to Google and return hits to my own bannered page.
      - Re:Hmmm....Talk About Stealing...Completed Post (Score:2)
        
        by Nom du Keyboard ( 633989 ) writes:
        
        These days too many companies benefiting from access to our Internet are like guests who, having brought passable wine, wish to claim credit for the success of the feast.
        That's a poetic way to put it.
        My way of looking at it is that for the first time, everyman has a microphone and a soapbox from which to speak to anyone in the entire world who wishes to listen. While I realize that absolutely has to upset many people entrenched in power, I feel it is the finest example of free, equal, and unfettered sp
- Re:Hmmm.... (Score:4, Funny)
  
  by Waffle Iron ( 339739 ) writes: on Sunday April 18, 2004 @12:12AM (#8895469)
  
  Gigablast: "273,384,720 pages indexed"
  Google: "Searching 4,285,199,774 web pages" That's quite a big difference.
  At least this Gigablast name is closer to the truth. They are only exaggerating their page count by a factor of 3.7 : 1.
  By my math, Google comes up short by 2.3x10^90 : 1.
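
  Both ratios check out if "giga" is read as 10^9 and "Google" as a play on googol, 10^100 (those readings are the assumption behind the joke, not anything either site claims):

  $$\frac{10^{9}}{273{,}384{,}720} \approx 3.66 \qquad\qquad \frac{10^{100}}{4{,}285{,}199{,}774} \approx 2.33 \times 10^{90}$$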
  
- Re:Hmmm.... (Score:4, Informative)
  
  by trenton ( 53581 ) writes: <trentonlNO@SPAMgmail.com> on Sunday April 18, 2004 @12:15AM (#8895478) Homepage
  
  Have you tried searching, though? Google pulls back more (quantity and accuracy) than Gigablast for the same terms. For example, search for "larry wall interview" and get 77,300 [google.com] vs 9,759 [gigablast.com]. I'm certainly not saying Google doesn't have its share of problems (seems to steadily be declining in quality). And I do like the categories/tags that Gigablast provides, but overall quality I'll give to Google.
  
  - Re:Hmmm.... (Score:2)
    
    by RzUpAnmsCwrds ( 262647 ) writes:
    
    ...
    
    And Yahoo Search returns 309,000 results.
    
    It's not the number of results, it's how they are arranged.
- One guy, eight computers (Score:2)
  
  by crisco ( 4669 ) writes:
  
  Did you read the article? Gigablast is one guy with eight computers. He thinks he can approach the size of Google's index (5 billion pages) this year if he invests all of his earnings into hardware and bandwidth. He's also well aware of the search engine spam problem and has built anti abuse features into it.
  Given that, plus the fact that he's spidered my worthless blog, I'm pretty impressed. Definitely something to watch.
- Re:Hmmm.... (Score:2)
  
  by Jugalator ( 259273 ) writes:
  
  Gigablast: "273,384,720 pages indexed"
  Google: "Searching 4,285,199,774 web pages" That's quite a big difference.
  
  Yes, and noticeable to me. I tried to search for a site I know, and regardless of how many terms I entered, it didn't spot it... In the end, the results were down to 2 hits (with only three common keywords) and it wasn't among the sites.
  
  Heck, it doesn't put www.slashdot.org first when searching for Slashdot. :-P Actually, I couldn't even find a link to the main page when searching for Slashdot.
- - Re:Given his resources, (Score:2)
    
    by Tony Hoyle ( 11698 ) writes:
    
    No RAID, no failover, relatively slow processors.
    I notice he doesn't mention the speed of the internet connection (40 queries a second isn't a lot - I expect google handles 10 times that at least).
That list makes no sense (Score:5, Insightful)

by jonman_d ( 465049 ) writes: <nemilar@optonlin[ ]et ['e.n' in gap]> on Saturday April 17, 2004 @11:36PM (#8895325) Homepage Journal

I have to say, that list makes no sense. Maybe if you'd switch "Gigablast" with "MSN", you'd have a list of some of the major search engines, but it sounds like this guy is just tooting his own horn (and without the proper credentials).

- Re:That list makes no sense (Score:2)
  
  by K-Man ( 4117 ) writes:
  
  He said "search engine companies", not search engines. Companies which do other things don't qualify. MSN, for instance, is affiliated with some company that makes computer mice.
Whatever happened to (Score:5, Interesting)

by nevek ( 196925 ) writes: on Saturday April 17, 2004 @11:36PM (#8895326) Homepage

Hotbot, Lycos, Mamma.com, Iwon.com, wisenut.com, looksmart.com, teoma.com, alltheweb.com, deja.com, directhit.com, excite.com, go.com, infoseek.com, invisibleweb.com, flipper.com, messageking.com, magellan.com, nbci.com, snap.com, northernlight.com, openfind.com, webcrawler.com

ahh the dotcomfallout

at least www.cowboynealsproncollection.com is doing well

- Re:Whatever happened to (Score:2)
  
  by Cyno01 ( 573917 ) writes:
  
  Don't forget dogpile, heh...
  - Re:Whatever happened to (Score:2)
    
    by cubic6 ( 650758 ) writes:
    
    Dogpile has to be the worst name for anything.
only 5? (Score:5, Informative)

by micker ( 668555 ) writes: on Saturday April 17, 2004 @11:38PM (#8895334) Homepage

The poster left out vivisimo.... lately it's been all I use...

and to the post above this.. what does 2 trillion hits matter against 2 million if they can't get what you really need up onto the first page

- Re:only 5? (Score:2)
  
  by nacturation ( 646836 ) writes:
  
  and to the post above this.. what does 2 trillion hits matter against 2 million if they can't get what you really need up onto the first page
  
  It doesn't matter one bit. That is, until you want page 2 million + 1. Then, all of a sudden, having a few more billion pages to index is a good thing.
Humph (Score:5, Funny)

by SlamMan ( 221834 ) writes: on Saturday April 17, 2004 @11:39PM (#8895336)

"and everyone's a little bit nervous to see what it's bringing.'"

Money. Lots and lots of money.

- - - Re:Humph (Score:2, Funny)
      
      by Anonymous Coward writes:
      
      >> It's implied. Lawyers go wherever the money is.
      
      Not in every instance. Some lawyers suck so much the money comes right to them.
Microsoft at the party (Score:2, Funny)

by bigberk ( 547360 ) writes:

Microsoft at the party would probably look something like this [somethingawful.com]
"Pass the dip, guys!"
Isn't yahoo powered by google? (Score:2, Insightful)

by Toxygen ( 738180 ) writes:

I mean, I know they're different sites and all, but isn't the yahoo site just the google search bar with all those category links added?
- Re:Isn't yahoo powered by google? (Score:5, Informative)
  
  by levram2 ( 701042 ) writes: on Saturday April 17, 2004 @11:59PM (#8895428)
  
  Yahoo used to use Google, but they bought Inktomi and have switched to their search engine. MSN also uses the Inktomi search engine, but tweaks the results.
  
  - Re:Isn't yahoo powered by google? (Score:2)
    
    by endx7 ( 706884 ) writes:
    
    Yahoo used to use Google, but they bought Inktomi and have switched to their search engine. MSN also uses the Inktomi search engine, but tweaks the results.
    
    Right before google became a big name, yahoo used inktomi (inktomi used to be a really big name in the search engine industry). Yahoo used to use google more recently, but now they don't.
    It looks like yahoo bought inktomi about a year ago [silicon.com], so I guess that's what they are using again.
- Re:Isn't yahoo powered by google? (Score:3, Informative)
  
  by tvh2k ( 738947 ) writes:
  
  Nah, they dropped google on Feb 17th. Get with the program :-D
Matt's a good guy (Score:3, Informative)

by Thanatopsis ( 29786 ) writes: <despain.brian@ g m a i l . c om> on Saturday April 17, 2004 @11:46PM (#8895378) Homepage

We use Gigablast as a back fill for one of our search engines. His stuff is very speedy and he's good guy to work with.

- Re:Matt's a good guy (Score:5, Funny)
  
  by cybermace5 ( 446439 ) writes: <g.ryan@macetech.com> on Sunday April 18, 2004 @12:06AM (#8895453) Homepage Journal
  
  I'm glad you told everyone he's a good guy, for a minute there I just assumed he was an evil, scheming villain.
  
Voting methods and search engines.... (Score:5, Informative)

by Anonymous Coward writes: on Saturday April 17, 2004 @11:47PM (#8895384)

...have a lot in common. Different search engines allow sites to "vote" on which ones are the most authoritative, and the best methods in one field can give insight into the best method in the other.

For example, there is the Kemeny order (named after the same guy who came up with BASIC, John G. Kemeny). Using a version of ranked ballots and sorting websites by the mean Kemeny order gives you a method that is surprisingly good at putting authoritative sites at the top and spam sites at the bottom. For those of you who like in-depth analysis and don't mind math, the following is a good site:

http://www10.org/cdrom/papers/577/
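
To make the idea concrete, here is a minimal brute-force sketch of Kemeny-style rank aggregation: each "ballot" is one engine's ordering of the same sites, and the winner is the ordering with the fewest pairwise disagreements against all ballots. The site names and ballots are made up for illustration, and trying every permutation only works for a handful of items, so treat this as a toy rather than what any real engine runs.

```python
from itertools import combinations, permutations


def kendall_tau(a, b):
    """Count pairs of items that rankings a and b place in opposite order."""
    pos_a = {item: i for i, item in enumerate(a)}
    pos_b = {item: i for i, item in enumerate(b)}
    return sum(
        1
        for x, y in combinations(a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def kemeny_order(ballots):
    """Brute-force the ordering with the smallest total distance to all ballots."""
    items = sorted(ballots[0])  # assumes every ballot ranks the same items
    return min(
        permutations(items),
        key=lambda order: sum(kendall_tau(order, b) for b in ballots),
    )


# Hypothetical ballots: three engines ranking the same three sites.
ballots = [
    ["a.example", "b.example", "spam.example"],
    ["b.example", "a.example", "spam.example"],
    ["a.example", "spam.example", "b.example"],
]
best = kemeny_order(ballots)
print(best)
```

The spam site ends up last here because two of the three ballots already put it there; one engine being fooled is not enough to lift it.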

Other Search Engines (Score:2, Insightful)

by GSPride ( 763993 ) writes:

I know that other people must use search engines other than Google, but who? And why? I could see Netscape, because it's the default homepage for many browsers, and maybe Ask Jeeves due to the easy syntax, but why would people go out of their way to use Gigablast or Looksmart? Who's even heard of those two?
In my opinion..... (Score:5, Funny)

by Kenja ( 541830 ) writes: on Saturday April 17, 2004 @11:51PM (#8895398)

In my opinion the best search engine is a Ford T-Block. Put that into a light weight steel frame and we can search them down and kill em in the street like wild animals.
Whoa, hold on. Wrong site. Never mind.

- Thinking you were... (Score:2)
  
  by Cyno01 ( 573917 ) writes:
  
  here [mallninja.com] maybe?
BOOBLE! (Score:4, Funny)

by the MaD HuNGaRIaN ( 311517 ) writes: on Saturday April 17, 2004 @11:53PM (#8895404)

What about BOOBLE [booble.com].

- Heh... (Score:5, Interesting)
  
  by Xenographic ( 557057 ) writes: on Sunday April 18, 2004 @01:43AM (#8895739) Journal
  
  I've often wondered why Google doesn't put up an "unsafe" image search option? (e.g. leave out all the images it deems "safe").
  
  Then again, it hardly needs to most of the time...
  
Competition, in this case... a good thing (Score:4, Insightful)

by Jtoxification ( 678057 ) writes: on Sunday April 18, 2004 @12:05AM (#8895451) Homepage Journal

We all win. With the increasing number of sites, content, web services, spam, popup attacks, and "please allow us to rape your computer" certificates to download (the main reason I use Firefox when on Windows now: you can't tell I.E. not to accept those damned installation certificates, nor to block requests to change the homepage), it becomes ever more difficult to find what you're looking for via Google's site ranking technology [google.com], especially when it's not something that everyone else looks for. Because they fight to be the best, we get cool things like ftp searches, grep and regexp searching of dmoz.org, video, image, and music searches, even linux [google.com] and bsd [google.com] search-specific pages. gMail [slashdot.org], Microsoft's entry, and now Gigablast are all rewards we get to reap from each company attempting to set its roots deeper into the Internet, like weeds vying for the same piece of dirt. We are extremely lucky, but then I doubt more than a handful of search engines will ever hold top ranks at one time, because they are so specialized in what they do. Just hope Gigablast and Google don't decide to create a new IM service, too.

I think the guy just expanded his database (Score:2, Interesting)

by Anonymous Coward writes:

By placing this on /. he got:

(("Slashdot serves 50 million pages per month [slashdot.org]"/(# users actually checking out this story))*number of searches tried) + a residual amount that might actually use this search engine more

And what they might be interested in.
- Re:I think the guy just expanded his database (Score:2)
  
  by Ieshan ( 409693 ) writes:
  
  That 50 mil # is from FOUR YEARS AGO.
  
  Think about how much that number has changed in four years.
AV (Score:2, Informative)

by TSNV ( 725282 ) writes:

I like AV because it's the only one (that I know of) that supports advanced embedded Boolean. Many a time Google fails to produce, and a well-built AV search will pop out what I'm looking for - albeit from a smaller selection.
- Re:AV (Score:2)
  
  by CAIMLAS ( 41445 ) writes:
  
  AV? What is that? ArdVark search engine?
What about patents? (Score:5, Interesting)

by enosys ( 705759 ) writes: on Sunday April 18, 2004 @12:26AM (#8895508) Homepage

What about patents? A lot of the stuff that goes into a search engine must be patented by now. I'm sure that if you create a search engine you'll end up infringing a bunch of these patents. Yes, I'm sure that in many cases it's obvious, and there's probably prior art, but I expect that the patents are still there and it's like a minefield of patents.
So how do you make a search engine and not get sued for infringement, or at least be able to win in the lawsuits?

- Re:What about patents? (Score:2)
  
  by AmVidia HQ ( 572086 ) writes:
  
  Yes and no. You can't patent "a search engine", only parts of it. Only specific techniques and algorithms are patentable (or at least that's how patents are supposed to work). Google's patented PageRank for example.
what timing for this /. article! (Score:5, Informative)

by whowho ( 706277 ) writes: on Sunday April 18, 2004 @12:34AM (#8895530)

just as I'm pulling an all-nighter at this moment trying to embed a custom search engine into an app for use on an intranet.

Actually what is more interesting is Nutch [nutch.org] and Mozdex [mozdex.com], which seem to be based around Lucene [apache.org] (what I am using to build my own search engine embedded into a Horde [horde.org] framework app). Although probably a lot simpler than the industrial grade stuff, for someone who has been used to throwing a word at an input screen and magically getting back results, the insight into the inner workings of search engines is very interesting.
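
For anyone curious what those inner workings look like at their simplest, the core structure behind Lucene-style engines is an inverted index: a map from each term to the documents containing it. The sketch below is plain Python, not Lucene's API; the function names and sample documents are made up, and it skips everything that makes a real engine good (stemming, scoring, phrase queries).

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query (AND search)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical intranet documents keyed by id.
docs = {
    1: "Gigablast search engine",
    2: "Google search",
    3: "Lucene index engine",
}
index = build_index(docs)
print(search(index, "search engine"))
```

Lookup cost depends on how many documents contain the query terms rather than on the size of the whole collection, which is why this beats scanning every page per query.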

Searching from the server's perspective (Score:5, Insightful)

by no longer myself ( 741142 ) writes: on Sunday April 18, 2004 @12:34AM (#8895533)

Having a webserver hobby, I see the search engines crawl through my site daily. Of course in the beginning they hungrily tripped through the pages, taking in as much as could be found. Of course as time went on it seemed like some of the search engines had a new method of just grabbing a page or two every hour or so. I imagine this was to prevent over-taxing my box, but it made the first glance at my logs look artificially inflated as if people were visiting the site instead of just a crawler working its way through... slowly and painfully.
I'd just prefer it if search engines would have enhanced rules for the robots.txt file so a webmaster could tell them more specifically how they want to be searched.
Yes, I know you can put in a delay between page searches, and you can deny access to parts or all of the site, and you can even tell some or all crawlers to take a flying leap, but I'd like to tell them at the front door, "Search on Wednesday, make it fast, do a thorough job, and don't come back for a week."
Too much to ask, right?
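
For reference, the delay and deny rules mentioned above can be read from the crawler's side with Python's standard `urllib.robotparser`. This is a minimal sketch: the robots.txt content, the `ExampleBot` name, and the example.com URLs are invented for illustration, `Crawl-delay` is an extension that only some crawlers honour, and the "search on Wednesday" wish is not modelled here at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask every crawler to wait 10 seconds between
# requests and to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# What a well-behaved crawler would check before fetching.
delay = parser.crawl_delay("ExampleBot")
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/notes.html")

print(delay, allowed, blocked)
```

Nothing here forces a crawler to comply; the file is a request, which is exactly the limitation the comment is complaining about.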

- Re:Searching from the server's perspective (Score:2)
  
  by AndroidCat ( 229562 ) writes:
  
  There are lists of the various bots used by search engines, and who's naughty/nice. (I've seen one list recently, just don't remember where it was.) You might want to see who the persistent ones belong to. There are also some that check for copyright/trademark violations, and their bots don't always behave.
  A few years ago, a number of Scientology-critical sites were getting hammered by bots from machoproducts.com, which seems like a weird link. (Rumours of a martial-arts cult, but no direct Hubbard connect
- Re:Searching from the server's perspective (Score:2)
  
  by Sirch ( 82595 ) writes:
  
  I'd like to know why robots.txt isn't protected from showing up in results from Google? Search for robot.txt [google.com] on Google and you get a load of actual robots.txt files, which seems to negate its usefulness.
Uhm No (Score:5, Insightful)

by Tedium Unleased ( 764661 ) writes: on Sunday April 18, 2004 @12:36AM (#8895540)

Microsoft is also coming to the party, and everyone's a little bit nervous to see what it's bringing.

Oh yeah real nervous. They're getting on the bandwagon late; too late to monopolize this particular free (as in shut the fuck up) service. If by some miracle they produce something 'threatening', it will be because it's good or because the others have slacked off.

- Re:Uhm No (Score:3, Insightful)
  
  by Anonymous Coward writes:
  
  yeah! they only do well if they are first, you know, like with excel, and internet explorer, and a graphical user interface.
- Re:Uhm No (Score:3, Insightful)
  
  by Keith McClary ( 14340 ) writes:
  
  If by some miracle they produce something 'threatening', it will be because it's good or because the others have slacked off.
  
  Usually they try to buy a competing company or hire the brains behind it.
- Re:Uhm No (Score:2)
  
  by HeghmoH ( 13204 ) writes:
  
  Yeah, they're getting on the search bandwagon late the same way they got on the PC bandwagon late, the office suite bandwagon, the browser bandwagon, the input devices bandwagon, the server OS bandwagon, and the gaming system bandwagon. Obviously they have to hope.
Gigabooo (Score:2, Funny)

by vinit79 ( 740464 ) writes:

Gigablast sucks : Proof - I entered my name and Gigablast says "no results". Did u mean "something thats not my name". No thanx I did not

Google : My site is the first !!!

And of course I refuse to believe that anyone in the world would be interested in anything but my home page.
- Re:Gigabooo (Score:2)
  
  by Rudy Rodarte ( 597418 ) writes:
  
  Me too. Before my page comes up, the page listing me as one of Bethanie's fans [gigablast.com] comes up. What's up with that? Oh, plus it suggests I search for something other than my name. But, with Google [google.com], all is well again.
The value of pagerank (Score:5, Interesting)

by jfengel ( 409917 ) writes: on Sunday April 18, 2004 @01:03AM (#8895617) Homepage Journal

The most interesting assertion in the article was that Pagerank was useless. He says Google's real win is its ability to cache a copy of the page and show you a summary including your search terms. I do use that a lot to quickly exclude irrelevant pages.

He said that his internal tests at Infoseek showed that pagerank didn't substantially improve the value of searches over simpler link analysis algorithms. I find that interesting, because I've worked with that algorithm and I know it's a stone bitch to compute.

He might well be right. I like Google over the other search engines because the interface is simple and clean, and I find it pleasant to use. I'm reminded of Donald Norman's book on Emotional Design, about how we can get really attached to things that work for us.

Google sells itself on pagerank, but at the very least it's insufficient against "search engine spam". If pagerank is less important than speed and utility, maybe I'll have something else programmed into my Firefox search bar. But not today.
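
To show where the cost comes from, here is a textbook power-iteration sketch of PageRank on a four-page toy graph. It is not Google's implementation: the graph is made up, the 0.85 damping factor is the commonly cited default, and the fixed iteration count is an arbitrary choice. Every iteration touches every link, and on a web-scale graph that pass has to be repeated many times.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict of page -> list of pages it links to.

    Assumes every link target also appears as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a baseline share from the "random jump".
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: three pages point at "c", nothing points at "d".
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Simpler link analysis, such as just counting inbound links, needs a single pass over the same data, which is the trade-off the article's claim turns on.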

Obligatory..... (Score:2)

by CastrTroy ( 595695 ) writes:
1. Write program that crawls the web.
2. Store text of web pages in Access Database
3. Make web interface that allows text of pages to be searched in linear fashion.
4. Host on a Pentium 2, on Personal Web Server, on Windows 98.
5. ......
6. Profit
How to build a search engine? (Score:3, Funny)

by bakawally ( 637407 ) writes: on Sunday April 18, 2004 @01:13AM (#8895653)

I dunno. I better google it.

Search engines could replace a query language? (Score:4, Interesting)

by JusTyler ( 707210 ) writes: on Sunday April 18, 2004 @01:28AM (#8895695) Homepage

Fave quote from that article..

However, I think that search engines, if they index XML properly, will have a good shot at replacing SQL.

Discuss.

Microsoft Party Crashing (Score:4, Insightful)

by Nom du Keyboard ( 633989 ) writes: on Sunday April 18, 2004 @01:43AM (#8895743)

Microsoft is also coming to the party, and everyone's a little bit nervous to see what it's bringing.
Everybody knows what Microsoft is bringing. Well almost everybody. Okay, I'll spell it out:
1: Bring lots of money.
2: Buy out a competitor.
3: Rename it Microsoft Search.
4: Attempt to trademark the word "Search".
5: Bind it tightly into Windows as an essential service.
6: Don't get it right until version 3.0.
7: Profit!

less commercialism (Score:5, Interesting)

by dj245 ( 732906 ) writes: on Sunday April 18, 2004 @01:49AM (#8895759)

I did a quick search on Gigablast for "Radio control speed controler". Now normally, on Google, you would get a couple million pages of websites wanting to sell you a speed controller. On Gigablast, however, the first 10 results were pretty much information about speed controllers, and/or battlebot sites that explained what you would need them for.
I've noticed lately that Google seems to be filling up with websites wanting to sell you stuff (even if they don't use spamming techniques). Perhaps these little guys can put the pressure on Google to get some better algorithms. Or perhaps it's time for Google to fade into the past like Altavista did a couple years ago and make way for the new.

- Re:less commercialism (Score:3, Interesting)
  
  by a.ameri ( 665846 ) writes:
  
  I actually was looking for some daily ISO snapshots of the Debian sid repository. Never heard of Gigablast before, so I gave it a shot and searched for 'daily sid snapshot iso'. Gigablast found no results, Google found 785, and looking at the first 10 results, I was easily able to find what I was looking for.
  
  C'mon, yes Google's interface is cool and stuff, but Google's success isn't just its interface. Their search algorithms are rock solid, they are continually improving them, and Google returns the most relevant
What I'd Like To See In A Search Engine (Score:5, Insightful)

by Nom du Keyboard ( 633989 ) writes: on Sunday April 18, 2004 @02:04AM (#8895793)

What I'd like to see in a search engine is a page kill or broken link feature to keep it current. If I click a link that is broken or vastly changed (e.g. the link to ancient Chinese pottery is now a porn site), I could back up to the search results page and click a link to have them immediately re-crawl that page. I think it would make for better results, and am surprised that it's not already common.
You may license my patent on this idea for reasonable terms in exchange for shares of your company's stock.
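
A rough sketch of how such a button might feed a re-crawl queue without letting a single click (or a single webmaster) trigger a crawl. Everything here is hypothetical: the `RecrawlQueue` class, the "several distinct reporters" rule and the per-URL cooldown are assumptions for illustration, not how any real engine does it.

```python
from collections import deque


class RecrawlQueue:
    """Queue a URL for re-crawl only after enough distinct users report it."""

    def __init__(self, min_reports=3, min_interval=3600.0):
        self.min_reports = min_reports    # distinct reporters required
        self.min_interval = min_interval  # seconds between re-crawls of a URL
        self.reports = {}                 # url -> set of reporter ids
        self.last_crawl = {}              # url -> time of last re-crawl
        self.queue = deque()

    def report(self, url, reporter, now):
        """Record a broken/changed-link report; return True if url got queued."""
        last = self.last_crawl.get(url)
        if last is not None and now - last < self.min_interval:
            return False  # crawled too recently, ignore the report
        voters = self.reports.setdefault(url, set())
        voters.add(reporter)  # repeat clicks by one reporter count once
        if len(voters) >= self.min_reports and url not in self.queue:
            self.queue.append(url)
            return True
        return False

    def pop(self, now):
        """Hand the next URL to the crawler, or None if nothing is waiting."""
        if not self.queue:
            return None
        url = self.queue.popleft()
        self.last_crawl[url] = now
        self.reports.pop(url, None)
        return url
```

The thresholds trade freshness against crawl load: a lower `min_reports` reacts faster but is easier to game.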

- Re:What I'd Like To See In A Search Engine (Score:3, Insightful)
  
  by igrp ( 732252 ) writes:
  
  Well, I think the potential for abuse would just be too great.
  Respidering a website doesn't just take up bandwidth but also a lot of CPU cycles. That's especially true if you're running extensive algorithm-based computations (like Google does) and not just doing a quick-and-dirty instant-add to a database. It would also allow webmasters to cheat the system: temporarily mirror some relevant, high-traffic site, have it reindexed, change the contents (porn, spam, you name it). After a while, the bot will rein
- Re:What I'd Like To See In A Search Engine (Score:3, Informative)
  
  by harmonica ( 29841 ) writes:
  
  If I click a link that is broken or vastly changed (e.g. the link to ancient Chinese pottery is now a porn site), I could back up to the search results page and click a link to have them immediately re-crawl that page.
  
  The index is usually updated only once every couple of weeks. Recomputing PageRank (or whatever everybody else uses) takes time. That's why more or less immediate updates are reserved for the best-known sites.
  
  You can report 'false' results with the Dissatisfied? [google.com] link at the bott
In other news... (Score:5, Funny)

by sydbarrett74 ( 74307 ) writes: <sydbarrett74.gmail@com> on Sunday April 18, 2004 @03:44AM (#8896001)

'In other news, Google announced the buy-out of Gigablast. The newly-formed company will be called Giggle.'

Only five? (Score:5, Interesting)

by adriantam ( 566025 ) writes: on Sunday April 18, 2004 @04:44AM (#8896092)

Search is a fiercely competitive arena, even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast.

I am a Chinese speaker, and East Asian writing is traditionally character-based, with no alphabet. That means we don't separate words with blank spaces but rather dosomethinglikethis. This characteristic causes many problems for search services, because you never know whether interpreting that as "dos ome thingli keth is" is right or not. We have to build a dictionary into the search engine, which is different from many Western languages. So I don't believe there are only five search engine providers in this world. At least, I know of a number of other search engines developed to support East Asian languages.
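
To make the dictionary idea concrete, here is a minimal sketch of one common approach, greedy "forward maximum matching": at each position, take the longest dictionary word that fits. It is only an illustration, not what any particular engine uses. The `segment` function and the tiny word list are made up, and the run-together English string stands in for unspaced text.

```python
def segment(text, dictionary, max_len=None):
    """Greedy forward maximum matching against a set of known words."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    words = {"do", "something", "some", "thing", "like", "this"}
    print(segment("dosomethinglikethis", words))
```

The greedy choice is also where it goes wrong: add "dos" to the word list and the first token becomes "dos", which derails the rest of the split. Real segmenters have to weigh alternatives instead of taking the first long match.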

- Re:Only five? (Score:4, Funny)
  
  by pe1chl ( 90186 ) writes: on Sunday April 18, 2004 @07:26AM (#8896328)
  
  When an American writes "there are only five companies that..." he really means: "there are only five companies IN THE USA that...".
  
I'm ready to change (Score:5, Interesting)

by Andy_R ( 114137 ) writes: on Sunday April 18, 2004 @06:40AM (#8896270) Homepage Journal

Wonderful as Google is, I'm finding more and more searches don't produce useful results.

I keep getting high rankings from sites like bizrate and kelkoo, which don't have any content whatsoever, but have convinced Google to show pages that say "search for best prices on xxxx" where xxxx is my search term. Often the problem is so bad that I don't see any sites with content until page 2 of Google.

Another issue is with searches for song lyrics. There are dozens of identikit advert sites which drown a tiny (and often inaccurate) text payload in a swarm of adverts. Finding a site written by someone who cares about accuracy is getting impossible.

What I want is sites ranked by volume of relevant content, with a negative ranking element for duplicate sites and a stronger negative ranking for multiple adverts.

Oh, and what I would also find useful is a 'go (after blocking adservers)' button instead of a 'go' button.
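
The ranking wish above could be sketched roughly like this. The `Page` fields, the two penalty weights and the idea of counting "relevant words" are all assumptions invented for the example. The only part taken from the wish list is the shape: reward content, penalize duplicates, penalize multiple adverts harder.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    relevant_words: int    # hypothetical measure of relevant content volume
    duplicate_copies: int  # how many near-identical copies exist elsewhere
    ad_count: int          # adverts on the page


def score(page, duplicate_penalty=50.0, ad_penalty=100.0):
    """Content volume minus penalties; adverts cost more than duplicates."""
    extra_ads = max(page.ad_count - 1, 0)  # only *multiple* adverts are penalized
    return (page.relevant_words
            - duplicate_penalty * page.duplicate_copies
            - ad_penalty * extra_ads)


def rank(pages):
    """Return pages ordered best first."""
    return sorted(pages, key=score, reverse=True)
```

With these made-up weights, a lyrics clone with lots of text, several copies and a wall of adverts drops below a smaller original page with one advert.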

- Re:I'm ready to change (Score:2)
  
  by pe1chl ( 90186 ) writes:
  
  I noticed the "kelkoo problem" too.
  I wonder if they pay to get these results, or if this is just a confusion of Google caused by the fact that kelkoo has similar sites in many different domains that all link to each other.
  So, Google thinks there are lots of links to a certain page and thus gives a high ranking.
Check out my search engine! (Score:2, Funny)

by Milton Waddams ( 739213 ) writes:

function search() { grep -- "$1" < The_Internet; }
Make it like a human brain... (Score:5, Funny)

by Pedrito ( 94783 ) writes: on Sunday April 18, 2004 @08:44AM (#8896465)

I liked this quote: "Now that the Internet is very large, it makes for some well-developed memory. I would suppose that the amount of information stored on the Internet is around the level of the adult human brain. Now we just need some higher-order functionality to really take advantage of it. At one point we may even discover the protocol used in the brain and extend it with an interface to an Internet search engine."

The protocol used in the brain? That can't be a good direction to go. I mean, if it's anything like my memory and, honestly, the memory of most people I know, it's definitely going to be a step backwards. Human brains can hold a lot of information, but retrieval is definitely not their specialty. I can see it now. Type in my search terms and the engine comes back with, "ummm, it's right on the tip of my tongue. Okay, I don't have a tongue, but I just about remember it. Give me just a minute to think about it. umm... umm... Nope, it's gone. Never mind."

How to create a web search engine... (Score:3, Funny)

by fizban ( 58094 ) writes: <fizban@umich.edu> on Sunday April 18, 2004 @10:27AM (#8896870) Homepage

1. Buy license for existing web search engine.
2. ???
3. Profit!

a9.com is Amazon's web search entry (Score:3, Informative)

by squashed ( 664265 ) writes: on Sunday April 18, 2004 @12:11PM (#8897395)

Have a look at a9.com [a9.com], which is Amazon's new search entry. Aside from a good web search engine, it provides a "history" of your previous searches and other innovative features.

- Re:Lycos anyone (Score:5, Interesting)
  
  by Thanatopsis ( 29786 ) writes: <despain.brian@ g m a i l . c om> on Saturday April 17, 2004 @11:56PM (#8895423) Homepage
  
  Lycos search no longer runs its own crawler. Matt's talking about people with their own crawler and algo.
  
- Re:Interesting (Score:3, Insightful)
  
  by bcrowell ( 177657 ) writes:
  
  I guess one can evaluate Gigablast based on what it can do now, or based on what foundation they're building for the future.
  Right now, one difference between Gigablast and Google is that Gigablast doesn't seem to index PDF files. This makes me sad, since I run a web site whose sole purpose is to serve up big PDF files.
  There are also some minor usability problems compared to Google. If your search returns more than 10 results, you can't tell how many there are. You have to understand how to do "+keyword" a
  - Re:Interesting (Score:3, Informative)
    
    by ixplodestuff8 ( 699898 ) writes:
    
    "there doesn't seem to be a form you can fill in like Google's "advanced search" form."
    
    except of course, for the advanced search [gigablast.com] form
  - Re:Interesting (Score:3, Interesting)
    
    by Gherald ( 682277 ) writes:
    
    Here's an example of a search that turned up a PDF link. It is very clearly labeled PDF on a red background:
    
    http://www.gigablast.com/search?k3v=898090&s=10&q=%22preston+alexander%22+-%22victoria+ashley%22
    
    Pretty nice if you ask me. I hate opening PDF links by accident. Sometimes in Google I accidentally click them before I realize they are going to be opened by some stupid browser plugin or (more often than not) Adobe's bloated Reader.
- Re:Interesting (Score:5, Interesting)
  
  by prockcore ( 543967 ) writes: on Sunday April 18, 2004 @01:06AM (#8895626)
  
  If it can survive ./, thats a good sign.
  
  Not really. I was impressed with the power of a good slashdotting until we made the Slashdot front page a few weeks ago (we also made it to the front page a few years ago, but at that time we were serving static HTML).
  
  Each article was pulled out of a MySQL database, XSL-transformed, sent to the web server via SOAP, and finally delivered to the user as about 150k of HTML and images. Repeat 80,000 times over a 5 hour period.
  
  This is hardly an impressive feat. I expected more, but it turns out that slashdot really only sends about 20-30k unique visitors to your site.
  
  Yes, I used to be impressed with the power of a slashdotting, but now I realize that it's just the result of very crappy sites run on very crappy desktop machines pretending to be servers.
  
  So, no, them withstanding a slashdot link isn't a good sign, it's the very least we can expect of a commercial entity.
  
  - Just wait until the duplicate... (Score:2)
    
    by Ayanami Rei ( 621112 ) * writes:
    
    is posted on a weekday next week just before lunch hour eastern time.
  - - Re:Interesting (Score:2)
      
      by prockcore ( 543967 ) writes:
      
      at what time during what day was the story posted?
      
      Friday, 1PM MST, 304 comments.
- Re:Open Source Search Engine? (Score:5, Informative)
  
  by idiotfromia ( 657688 ) writes: <chad@NoSPAM.chadbrandos.com> on Sunday April 18, 2004 @01:44AM (#8895746) Homepage
  
  I don't believe it's actually being used in practice, but Nutch [nutch.org] is developing rapidly. The largest test crawl they've completed has been about a hundred million pages. They're asking for donations [nutch.org] to develop a larger demo system.
  
- Re:The Key to Winning the Web (Score:2)
  
  by bhima ( 46039 ) writes:
  
  But in Firefox, the difference between the BBC news service and the Google news service is "news.b" compared to "news.g" (plus the tab & return)
  and if I want "www.groklaw.net" rather than "www.google.com" it's "gr" rather than "go".
  So I doubt I'd ever really type "www.vivisimo.com" in its entirety but only "v" or "vi".
- - Re:Nice (Score:2)
    
    by sfe_software ( 220870 ) * writes:
    
    I've been giving some thought to search engine referencing, and how XxX was a huge mistake, because searching for it would be difficult.
    
    I often wonder if this is sometimes done on purpose. One example is the band Live, given that searching for "Live" of course turns up millions of unrelated results... though Live was around (and so-named) quite a bit before the P2P thing exploded... but I do still wonder if this is something that is considered these days when naming a band/movie/whatever (it's searcha
- Re:originality (Score:2)
  
  by Tony Hoyle ( 11698 ) writes:
  
  Not *exactly* like google... he's using a shitty courier font throughout the site.
  
  Looks ok in IE, but in other browsers looks like crap.
