
Open Source Search Engine Benchmarks

Sean Fargo writes "This article has benchmarks for the latest versions of Lucene, Xapian, zettair, sqlite, and sphinx. It tests them by indexing Twitter posts and medical journals, providing comparative system stats and relevancy scores. All the benchmark code is open source."
This discussion has been archived. No new comments can be posted.

  • It may be a bit faster at searching, but it takes ~5 times as long to generate the index and uses twice as much memory when searching, so it may just be a different trade-off between index time and search time.

    And it's a bad search test, because the total search time is less than 2 seconds, which doesn't include the cost of the GC for Java.

    A hint to people doing benchmarks: when benchmarking a component that uses GC or similar memory-management methods, make the test dataset large enough to trigger enough GC cycles that the cost of any single cycle is noise.

    And to be fair to the GC'd language, set minimum memory = maximum memory, so it uses as much memory as you allow and doesn't waste time growing the heap.

    GC is more efficient the more memory you allow it, because the runtime cost of GC mostly depends on the number of live objects, not the number of allocated objects.
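    The heap-pinning advice above can be sketched for a HotSpot JVM. This is a minimal illustration, not part of the benchmarked code: the class name and allocation counts are made up, and the point is only that the run is launched with `-Xms` equal to `-Xmx` and churns enough short-lived garbage to force several GC cycles.

    ```java
    // Sketch: launch with the heap pinned so the JVM never pauses to
    // grow it mid-run, e.g.:  java -Xms2g -Xmx2g BenchSketch
    public class BenchSketch {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            // With -Xms == -Xmx these two values are (roughly) equal,
            // so no time is spent resizing the heap during the run.
            System.out.println("initial heap: " + rt.totalMemory());
            System.out.println("max heap:     " + rt.maxMemory());

            // Allocate enough short-lived garbage to force several GC
            // cycles, so any single collection is noise in the total.
            long allocated = 0;
            for (int i = 0; i < 1_000_000; i++) {
                byte[] junk = new byte[1024]; // dies immediately
                allocated += junk.length;
            }
            System.out.println("allocated ~" + allocated / (1024 * 1024) + " MB of garbage");
        }
    }
    ```

    A real benchmark would time only the indexing/search loop itself, after a warm-up pass, but the same two rules apply: pin the heap and make the run long enough to average over many collections.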

  • Re:k (Score:1, Interesting)

    by Anonymous Coward on Monday July 06, 2009 @12:34PM (#28595829)

    Solr/Lucene power a number of sites that would fall into the enterprise-search category (Apple, Netflix, CNET). Where I work, we index 5 million docs in Solr/Lucene and serve out millions of search requests a day. It's not Google scale, but most people don't need that. The markets where one needs something like FAST are dwindling quickly.

    I work in a shop that uses FAST, despite pressure from some to move to Solr. As I understand it, Solr can't keep up with the volume of changes we need to make to our data. I'm talking millions of documents with 100+ fields changed per day, with any given change visible to the customer within a short timeframe (10 minutes). Solr can index that much data easily, but it can't keep up with that rate of updates. That's what I've been told, anyway.
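    The trade-off being described, absorbing a high update volume while keeping changes visible within a bounded window, is usually handled by batching updates and committing on a size-or-age threshold. Here is a generic, self-contained sketch of that idea; the `BatchingIndexer` class and its thresholds are hypothetical and do not correspond to Solr's or FAST's actual APIs.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical batching indexer: buffers document updates and
    // flushes ("commits") once either a size or an age threshold is
    // hit, so every change becomes searchable within a bounded window.
    class BatchingIndexer {
        private final int maxBatch;        // flush after this many docs...
        private final long maxAgeMillis;   // ...or after this much time
        private final List<String> buffer = new ArrayList<>();
        private long oldest = -1;          // arrival time of first buffered doc
        private int commits = 0;

        BatchingIndexer(int maxBatch, long maxAgeMillis) {
            this.maxBatch = maxBatch;
            this.maxAgeMillis = maxAgeMillis;
        }

        // 'now' is passed in to keep the sketch deterministic.
        void update(String docId, long now) {
            if (buffer.isEmpty()) oldest = now;
            buffer.add(docId);
            if (buffer.size() >= maxBatch || now - oldest >= maxAgeMillis) {
                commit();
            }
        }

        private void commit() {
            // Real system: write the batch to the index, reopen searchers.
            commits++;
            buffer.clear();
            oldest = -1;
        }

        int commitCount() { return commits; }
        int pending()     { return buffer.size(); }
    }

    public class BatchDemo {
        public static void main(String[] args) {
            // Commit every 1000 docs or every 10 minutes, whichever first.
            BatchingIndexer idx = new BatchingIndexer(1000, 10 * 60 * 1000L);
            for (int i = 0; i < 2500; i++) {
                idx.update("doc" + i, i); // 2500 rapid-fire updates
            }
            System.out.println("commits=" + idx.commitCount()
                    + " pending=" + idx.pending());
        }
    }
    ```

    The bottleneck the commenter describes isn't raw index size but commit frequency: each commit has a fixed cost (flush, merge, searcher reopen), so at millions of wide-document updates per day the batch size and commit interval, not total corpus size, decide whether the engine keeps up.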
