Databases Programming Software IT

Open Source Search Engine Benchmarks 62

Posted by CmdrTaco
from the welcome-to-the-monday dept.
Sean Fargo writes "This article has benchmarks for the latest versions of Lucene, Xapian, zettair, sqlite, and sphinx. It tests them by indexing Twitter and Medical Journals, providing comparative system stats and relevancy scores. All the benchmark code is open source."
This discussion has been archived. No new comments can be posted.

  • by MosesJones (55544) on Monday July 06, 2009 @08:43AM (#28593795) Homepage

    Okay so the fastest engine is using Lucene, a Java search engine, and this is neither tuned nor horizontally scaled (which it can do very well).

    C++ and C both fail to deliver the same level of performance as the Java virtual machine.

    Oh wait hang on... does this mean that for complex applications the most important performance piece is normally actually the efficiency of the code rather than the efficiency of the base platform and therefore having a language in which it is easier to write efficient code is better than just having the one that is fastest to execute a for loop?

    But hell this is Slashdot and Java is Slooooooow...

    • by zappepcs (820751)

      You beat me to the comment. I'm sort of surprised that the reaction so far has been the sound of crickets and loud yawning... meh

      • I'm sort of surprised that the reaction so far has been the sound of crickets and loud yawning... meh

        Well, the OP certainly got a loud yawn from me for the remark about indexing twitter posts. They might just as well index cockroach farts.
    • Re: (Score:2, Interesting)

      by TheSunborn (68004)

      It may be a bit faster at searching, but it takes ~5 times as long to generate the index and uses twice as much memory when searching, so it may just be a different trade-off between index time and search time.

      And it's a bad search test, because the total search time is less than 2 seconds, thus not including the cost of the GC for Java.

      Hint to people doing benchmarks: when benchmarking a component which uses GC or similar memory handling methods, remember to have the test dataset be large enough that you ca

      • Hint to people doing benchmarks: when benchmarking a component which uses GC or similar memory handling methods, remember to have the test dataset be large enough that you cause enough GC cycles to make the performance of any single cycle noise.

        I have an even better idea. Why don't we just model the benchmark on the real world usage scenarios, and let those decide whether garbage collection and allocation even matter?
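To make the advice in this sub-thread concrete, here is a minimal sketch of a timing harness that warms up, repeats the workload, and records how many garbage-collector runs happened during each timed pass. It is an illustration only: `benchmark` and `toy_workload` are hypothetical names, and Python's collector is used as a stand-in, since it behaves differently from the JVM's.

```python
import gc
import statistics
import time


def benchmark(workload, repeats=5, warmup=1):
    """Time workload() several times and count GC runs during each pass."""
    for _ in range(warmup):
        workload()  # untimed pass so one-off start-up costs are excluded

    timings = []
    gc_runs = []
    for _ in range(repeats):
        before = sum(gen["collections"] for gen in gc.get_stats())
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
        after = sum(gen["collections"] for gen in gc.get_stats())
        gc_runs.append(after - before)

    return {
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "gc_runs": gc_runs,
    }


def toy_workload(n=200_000):
    # Hypothetical stand-in for an indexing run: allocates many small
    # container objects so the collector has something to track.
    return [[i] for i in range(n)]


if __name__ == "__main__":
    print(benchmark(toy_workload))
```

If `gc_runs` stays at zero or one per pass, the run is too short for collection cost to show up in the timings, which is the objection raised about the sub-2-second search test. Swapping `toy_workload` for a replay of real queries covers the "model it on real usage" suggestion.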

    • Finding it easier to code well in Java than C is like finding it easier to drive Automatic than Manual. I stopped driving automatic, it stopped almost getting me into accidents.
      • by Atzanteol (99067)
        Driving an automatic almost got you into accidents? You must *suck* at driving dude.
        • I can't seem to move into a lane of faster traffic in heavy traffic situations on the highway without being able to immediately accelerate. I can't seem to shift into a lower gear without using the knockdown mechanism, which requires me to depress the accelerator the whole way and wait a second for everything to engage. In a manual, I can downshift to fourth or third and control my speed, enter an opening, and accelerate quickly without fear that backing off the accelerator a little (you try keeping contr

      • Re: (Score:1, Insightful)

        by Anonymous Coward

        I stopped driving automatic, it stopped almost getting me into accidents.

        You're a fucking idiot. Get off my road.

    • by cpghost (719344)
      Granted, bubble sort is slower in C/C++ than Quicksort in Java. Then again, we do have qsort(3) in C and std::sort() in C++/STL, and slow C++ code is usually the result of developer newbies misunderstanding the copy semantics of parameter passing.
    • by Roy van Rijn (919696) on Monday July 06, 2009 @09:27AM (#28594257) Homepage

      Hrm, this had absolutely nothing to do with the language. It has almost everything to do with the algorithms.

      It's very hard to compare languages, maybe if you use the languages to implement the exact same algorithm and let it run for a long while... But that still doesn't really compare it well enough.

      Like somebody already said: Bubble sort in C++ is (almost) always slower than a quicksort in Java.
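The bubble sort versus quicksort point can be sketched by counting comparisons instead of wall-clock time, which takes the language out of the picture. This is a rough illustration with hypothetical helper names, not code from any of the benchmarked engines.

```python
import random


def bubble_sort(items):
    """O(n^2) bubble sort. Returns (sorted_list, comparison_count)."""
    data = list(items)
    comparisons = 0
    for end in range(len(data) - 1, 0, -1):
        swapped = False
        for i in range(end):
            comparisons += 1
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                swapped = True
        if not swapped:
            break  # already sorted, stop early
    return data, comparisons


def quicksort(items):
    """Average O(n log n) quicksort. Returns (sorted_list, comparison_count)."""
    data = list(items)
    if len(data) <= 1:
        return data, 0
    pivot = data[len(data) // 2]
    smaller, equal, larger = [], [], []
    comparisons = 0
    for value in data:
        comparisons += 1  # one three-way comparison against the pivot
        if value < pivot:
            smaller.append(value)
        elif value > pivot:
            larger.append(value)
        else:
            equal.append(value)
    left, left_count = quicksort(smaller)
    right, right_count = quicksort(larger)
    return left + equal + right, comparisons + left_count + right_count


if __name__ == "__main__":
    random.seed(0)
    values = [random.randrange(1_000_000) for _ in range(2000)]
    print("bubble sort comparisons:", bubble_sort(values)[1])
    print("quicksort comparisons:  ", quicksort(values)[1])
```

On random input the gap in comparison counts grows with the input size, so a constant-factor advantage from the implementation language is eventually swamped by the choice of algorithm.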

    • C++ and C both fail to deliver the same level of performance as the Java virtual machine.

      Oh wait hang on...

      As was pointed out above, the search engines spend >90% of their time in DB/file I/O code.

      In other words, the implementation language plays little role - it is the I/O optimization algorithms which play the bigger role.

      From my experience with a number of C/C++ projects, the efficiency of the languages/compilers allows developers to remain ignorant. In Java that approach simply doesn't work. Thus I more often see better algorithms in less efficient languages.

      Like I recently found in one program people used

    • Re: (Score:1, Flamebait)

      by Wovel (964431)

      Java is slow. If you took the same algorithm and coded it in an efficient compiled language, it would be faster. Much faster.

    • Java is fine for plenty of applications, but there are certain situations where it simply doesn't cut it. Heavy GUI oriented applications tend to take a massive performance hit because all of the objects are dynamically generated at run time -- just load up Eclipse and see how long it takes to start. Scientific and Mathematical applications, as well, rely on high-speed languages like C/FORTRAN. That doesn't mean Java is so slow it's useless -- in many cases the aided clarity and simplicity is worth it.

      T
    • by Lisandro (799651)

      Oh wait hang on... does this mean that for complex applications the most important performance piece is normally actually the efficiency of the code rather than the efficiency of the base platform...

      Yes.

      ...and therefore having a language in which it is easier to write efficient code is better than just having the one that is fastest to execute a for loop?

      No.

    • Re: (Score:3, Informative)

      by johannesg (664142)

      Okay so the fastest engine is using Lucene, a Java search engine, and this is neither tuned nor horizontally scaled (which it can do very well).

      C++ and C both fail to deliver the same level of performance as the Java virtual machine.

      Oh wait hang on... does this mean that for complex applications the most important performance piece is normally actually the efficiency of the code rather than the efficiency of the base platform and therefore having a language in which it is easier to write efficient code is better than just having the one that is fastest to execute a for loop?

      But hell this is Slashdot and Java is Slooooooow...

      Actually if you check here [sourceforge.net], you will find that an implementation of the exact same Lucene done in C++ is about three times faster than Java.

      Sorry for spoiling your moment there...

  • ...have used it on several projects and always gotten good results. Setting it up is easy and the Ruby API is solid, although I needed a tiny bit of additional code for special character escaping [blogs.com]. Highly recommended!

  • Oh wait - seems TFA is saying a lot of sites just use an SQL DB and use LIKE '%FOO%' as a "search engine"...

    Ok, this is reasonable, however, I don't see why anyone would choose SQLite as a benchmark. If you are trying to compare search engines, and consider an RDBMS to be a 'search engine' category, then you at least need to include 4 or 5 of the most popular open source RDBMSs in the benchmark (SQLite, PostgreSQL, MySQL, Derby, Firebird), not just one.
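For readers unfamiliar with the difference, here is a small sketch contrasting a `LIKE '%term%'` query with a hand-rolled inverted index. The table name `docs`, the sample documents, and the helper functions are all made up for illustration, and the whitespace tokenizer is far simpler than what a real engine does.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample corpus, keyed by document id.
DOCS = {
    1: "open source search engine benchmarks",
    2: "indexing twitter posts with lucene",
    3: "sqlite is an embedded relational database",
    4: "xapian and sphinx are search engines written in c++",
}


def like_search(conn, term):
    """Substring match. The leading wildcard rules out an ordinary index
    lookup, so the database has to examine every row."""
    rows = conn.execute(
        "SELECT id FROM docs WHERE body LIKE ? ORDER BY id", (f"%{term}%",)
    )
    return [row[0] for row in rows]


def build_inverted_index(docs):
    """Map each whitespace-separated word to the set of ids containing it."""
    index = defaultdict(set)
    for doc_id, body in docs.items():
        for word in body.split():
            index[word].add(doc_id)
    return index


def index_search(index, term):
    """Exact-token match: a single dictionary lookup, no scan."""
    return sorted(index.get(term, set()))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?)", DOCS.items())
index = build_inverted_index(DOCS)

print(like_search(conn, "engine"))    # substring semantics
print(index_search(index, "engine"))  # whole-token semantics
```

The two approaches also answer slightly different questions: the substring query matches "engines" when asked for "engine", while the token index only matches the exact word unless stemming is added.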

  • CLucene (Score:5, Insightful)

    by drac667 (878093) on Monday July 06, 2009 @09:12AM (#28594121)
    All the other search engines except lucene are written in C/C++. Why didn't Vik Singh test also CLucene (http://sourceforge.net/projects/clucene/)?

    Here is CLucene's description on SourceForge: "CLucene is a C++ port of Lucene: the high-performance, full-featured text search engine written in Java. CLucene is faster than lucene as it is written in C++."
    • Re: (Score:3, Insightful)

      by samkass (174571)

      CLucene is faster than lucene as it is written in C++.

      XXX is better than YYY as it is written in [my favorite language].

      Haven't we explored this one to death already? Java isn't slow, and there's nothing magic about C/C++. Badly written C/C++ gets trounced by Java any day, and algorithmic efficiency trounces both of those when it comes to complex functions like indexed searches.

      • Re: (Score:3, Insightful)

        by caramelcarrot (778148)

        But if it's a direct port of Lucene presumably it's using the same algorithms and has similar code quality - hence it provides a good direct comparison of the language speeds and such a comment is legit.

      • Haven't we explored this one to death already? Java isn't slow, and there's nothing magic about C/C++. Badly written C/C++ gets trounced by Java any day, and algorithmic efficiency trounces both of those when it comes to complex functions like indexed searches.

        Actually on synthetic benchmarks C/C++ implementation might outperform the Java implementation. Some benchmarks are crafted to essentially test memory bandwidth, where C/C++ easily wins.

        And still, well written C/C++ code scales orders of magnitude better than Java code. Resource management is a bitch. I have seen that win a number of deals.

      • CLucene is faster, and uses less memory, from what is basically a direct port. The README includes some benchmarks:

        There are 250 HTML files under $JAVA_HOME/docs/api/java/util for about
        6108kb of HTML text.
        org.apache.lucene.demo.IndexFiles with java and gcj:
        on mac os x 10.3.1 (panther) powerbook g4 1ghz 1gb:
        . running with java 1.4.1_01-99 : 20379 ms
        . running with gcj 3.3.2 -O2 : 17842 ms
        . running clucene 0.8.9's demo : 9930 ms

        I recently did some more tests and came up with these rough tests:
        663mb (797 files) of Gutenberg texts
        on a Pentium 4 running Windows XP with 1 GB of RAM. Indexing max 100,000 fields
        . Jlucene: 646453ms. peak mem usage ~72mb, avg ~14mb ram
        . Clucene: 232141. peak mem usage ~60, avg ~4mb ram

        Searching indexing using 10,000 single word queries
        . Jlucene: ~60078ms and used ~13mb ram
        . Clucene: ~48359ms and used ~4.2mb ram
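As a quick sanity check on the README figures quoted above, the speedup ratios can be worked out directly. The sketch below assumes the unlabeled CLucene indexing figure (232141) is in milliseconds like the others; the dictionary labels and the `speedup` helper are made up for illustration.

```python
# Timings copied from the README excerpt: (Java Lucene, CLucene).
# Assumption: the unlabeled 232141 figure is in milliseconds.
README_TIMINGS_MS = {
    "index 250 HTML files": (20379, 9930),
    "index 663mb of Gutenberg texts": (646453, 232141),
    "10,000 single word queries": (60078, 48359),
}


def speedup(java_ms, clucene_ms):
    """How many times faster CLucene was than Java Lucene."""
    return java_ms / clucene_ms


for label, (java_ms, clucene_ms) in README_TIMINGS_MS.items():
    print(f"{label}: {speedup(java_ms, clucene_ms):.2f}x")
```

By these numbers the advantage is largest for indexing (roughly 2x to 2.8x) and much smaller for searching (about 1.2x), so a "three times faster" summary applies to the indexing runs at best.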

    • by jawahar (541989)
      Wish Vik Singh tested Open Source Implementation of PageRank [aspseek.org]
  • Does anybody know? That'd be a great comparison.
  • Please, can we avoid the "java vs C/C++" thread again?
  • the lucene based nutch has been a big help to our group. we currently index 60 sites across the company, dive through PDF files and even shockwave flash and powerpoint with ease. the searches are extremely fast and the results are so accurate they've blown our corporate engine completely out of the water.
  • by bobv-pillars-net (97943) <bobvin@pillars.net> on Monday July 06, 2009 @11:59AM (#28596187) Homepage Journal

    Last time I had to implement an indexing and searching solution, swish++ [sourceforge.net] was by far the performance winner.

  • DBSight uses Lucene's inverted index, and beats any database based B-tree search. And it's dead simple to use. Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes [dbsight.com]
