Databases Programming Software IT

Open Source Search Engine Benchmarks 62

Sean Fargo writes "This article has benchmarks for the latest versions of Lucene, Xapian, zettair, sqlite, and sphinx. It tests them by indexing Twitter and Medical Journals, providing comparative system stats and relevancy scores. All the benchmark code is open source."
This discussion has been archived. No new comments can be posted.

  • by MosesJones ( 55544 ) on Monday July 06, 2009 @09:43AM (#28593795) Homepage

    Okay, so the fastest engine is using Lucene, a Java search engine, and this is neither tuned nor horizontally scaled (which it can do very well).

    C++ and C both fail to deliver the same level of performance as the Java virtual machine.

    Oh wait hang on... does this mean that for complex applications the most important performance piece is normally actually the efficiency of the code rather than the efficiency of the base platform and therefore having a language in which it is easier to write efficient code is better than just having the one that is fastest to execute a for loop?

    But hell this is Slashdot and Java is Slooooooow...

  • Re:k (Score:5, Informative)

    by Lord Grey ( 463613 ) * on Monday July 06, 2009 @09:51AM (#28593877)

    Really? Am I the only person that found it interesting that Lucene, the only non-C/C++ implementation, gave some pretty impressive stats? I mean, it's written in Java, and although it has a slower index time, its search time, index size, and relevancy are impressive.

    Lucene is a great search tool. As TFA pointed out, however, if you're looking for a "search solution" rather than "search engine" then you should check out Solr [apache.org] instead. Lucene is a toolkit that you build on top of, not something you really want to deploy by itself. Solr is that thing built on top of Lucene.

    Be aware that while Lucene/Solr has made terrific progress, it is not quite in the "enterprise search" category. For superscale implementations you'll still likely need to look at a high-priced product like FAST [microsoft.com].

  • by Roy van Rijn ( 919696 ) on Monday July 06, 2009 @10:27AM (#28594257) Homepage

    Hrm, this had absolutely nothing to do with the language. It has almost everything to do with the algorithms.

    It's very hard to compare languages. Maybe if you use the languages to implement the exact same algorithm and let it run for a long while... But that still doesn't really compare them well enough.

    Like somebody already said: bubble sort in C++ is (almost) always slower than a quicksort in Java.
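
To make the algorithm-versus-language point concrete, here is a minimal sketch in Python (not the thread's Java or C++, and not taken from the benchmark) that counts how many element comparisons each sort performs on the same input. The function names and the way comparisons are counted are illustrative assumptions; the quicksort counts one comparison per element examined against the pivot.

```python
import random


def bubble_sort(items):
    """Return (sorted copy, comparison count) using bubble sort, O(n^2) on average."""
    data = list(items)
    comparisons = 0
    for end in range(len(data) - 1, 0, -1):
        swapped = False
        for i in range(end):
            comparisons += 1
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                swapped = True
        if not swapped:
            # No swaps in a full pass means the list is already sorted.
            break
    return data, comparisons


def quicksort(items):
    """Return (sorted copy, comparison count) using quicksort, O(n log n) on average."""
    comparisons = 0

    def sort(data):
        nonlocal comparisons
        if len(data) <= 1:
            return data
        pivot = data[len(data) // 2]
        less, equal, greater = [], [], []
        for x in data:
            comparisons += 1  # one pivot comparison counted per element
            if x < pivot:
                less.append(x)
            elif x == pivot:
                equal.append(x)
            else:
                greater.append(x)
        return sort(less) + equal + sort(greater)

    return sort(list(items)), comparisons


if __name__ == "__main__":
    rng = random.Random(0)
    values = [rng.randint(0, 10**6) for _ in range(2000)]
    _, bubble_count = bubble_sort(values)
    _, quick_count = quicksort(values)
    print("bubble sort comparisons:", bubble_count)
    print("quicksort comparisons:  ", quick_count)
```

The gap in comparison counts grows with input size, which is why a constant-factor speedup from a faster language runtime rarely rescues a worse algorithm.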

  • Re:k (Score:5, Informative)

    by tealwarrior ( 534667 ) on Monday July 06, 2009 @11:37AM (#28595065)
    Solr/Lucene power a number of sites that would be in the enterprise search category (Apple, Netflix, C-Net). Where I work, we index 5 million docs in Solr/Lucene and serve out millions of search requests a day. It's not Google scale, but most people don't need that. The markets where one needs a FAST are dwindling quickly.
  • by bobv-pillars-net ( 97943 ) <bobvin@pillars.net> on Monday July 06, 2009 @12:59PM (#28596187) Homepage Journal

    Last time I had to implement an indexing and searching solution, swish++ [sourceforge.net] was by far the performance winner.

  • Re:k (Score:3, Informative)

    by tealwarrior ( 534667 ) on Monday July 06, 2009 @03:02PM (#28597993)
    Solr/Lucene real-time search (or near real-time) is one of its weaker points. I think it could keep up with the updates but making them appear in the index immediately and having the caching still perform can be tricky.

    We have one index that's updated every 20 minutes, but it only has about 50k documents, and a combination of Solr cache auto-warming and Squid's stale-while-revalidate logic works there.

    In another system where updates need to be faster, we had to do some custom work to make it perform: there is an in-memory index for recent changes, an on-disk index of previous changes, and a process for moving documents from one to the other. Hopefully these improvements will make their way back to Lucene in the future.
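
The two-tier layout described above can be sketched roughly as follows. This is a hypothetical Python illustration, not the poster's implementation and not the Lucene or Solr API: `TwoTierIndex`, `flush_threshold`, and the whitespace tokenizer are all assumptions, and a second in-memory dictionary stands in for the on-disk index.

```python
from collections import defaultdict


class TwoTierIndex:
    """Toy inverted index with a small 'recent' tier and a large 'main' tier."""

    def __init__(self, flush_threshold=1000):
        self.recent = defaultdict(set)  # term -> doc ids, in-memory tier for new changes
        self.main = defaultdict(set)    # stand-in for the on-disk index of older changes
        self.recent_docs = 0
        self.flush_threshold = flush_threshold  # assumed tuning knob

    def add(self, doc_id, text):
        """Index a document into the recent tier so it is searchable immediately."""
        for term in text.lower().split():
            self.recent[term].add(doc_id)
        self.recent_docs += 1
        if self.recent_docs >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Move everything from the recent tier into the main tier."""
        for term, doc_ids in self.recent.items():
            self.main[term] |= doc_ids
        self.recent.clear()
        self.recent_docs = 0

    def search(self, term):
        """Return doc ids matching the term in either tier."""
        term = term.lower()
        return self.recent.get(term, set()) | self.main.get(term, set())


if __name__ == "__main__":
    index = TwoTierIndex(flush_threshold=2)
    index.add(1, "open source search")
    index.add(2, "search engine benchmarks")  # reaching the threshold triggers a flush
    index.add(3, "engine tuning")
    print(sorted(index.search("engine")))
```

Queries consult both tiers, so a new document is visible before it reaches the main index; in a real system the flush would run in the background and any caches over the main tier would only need refreshing at flush time.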
  • Re:k (Score:3, Informative)

    by JorDan Clock ( 664877 ) <jordanclock@gmail.com> on Monday July 06, 2009 @04:49PM (#28599473)
    Kind of like... CLucene [sourceforge.net]?
  • Re:k (Score:5, Informative)

    by johannesg ( 664142 ) on Monday July 06, 2009 @05:31PM (#28600031)

    Ah, thank you. So indeed, an implementation of the same algorithm turns out to be _three times_ as fast in C++ as it is in Java (see here [sourceforge.net]).

    I wonder if eldavojohn wishes to comment on that?

  • by johannesg ( 664142 ) on Monday July 06, 2009 @05:36PM (#28600093)

    Okay, so the fastest engine is using Lucene, a Java search engine, and this is neither tuned nor horizontally scaled (which it can do very well).

    C++ and C both fail to deliver the same level of performance as the Java virtual machine.

    Oh wait hang on... does this mean that for complex applications the most important performance piece is normally actually the efficiency of the code rather than the efficiency of the base platform and therefore having a language in which it is easier to write efficient code is better than just having the one that is fastest to execute a for loop?

    But hell this is Slashdot and Java is Slooooooow...

    Actually if you check here [sourceforge.net], you will find that an implementation of the exact same Lucene done in C++ is about three times faster than Java.

    Sorry for spoiling your moment there...
