Open Source Search Engine Benchmarks
Sean Fargo writes "This article has benchmarks for the latest versions of Lucene, Xapian, zettair, sqlite, and sphinx. It tests them by indexing Twitter and Medical Journals, providing comparative system stats and relevancy scores. All the benchmark code is open source."
Re:k (Score:5, Insightful)
Nothing else to say, really
Really? Am I the only person that found it interesting that Lucene, the only non C/C++ implementation, gave some pretty impressive stats? I mean, it's written in Java and although it has a slower index time its search time, index size and relevancy are impressive.
I may have to poke around in the Lucene code after work tonight to figure out what kind of strange majick those Apache developers employ. Hopefully I'll walk away with some extra spells in my bag.
Re: (Score:2)
Re:k (Score:5, Insightful)
Really? Am I the only person that found it interesting that Lucene, the only non C/C++ implementation, gave some pretty impressive stats?
Is it really that big a surprise? Given that some of the largest, most information-heavy sites on the Internet (e.g. Wikipedia) use it for their internal search?
Re:k (Score:5, Insightful)
But Wikipedia's internal search is the suckiest thing that ever sucked! Seriously, does anyone use it, instead of just sticking "wikipedia" into their Google search?
Re: (Score:2)
Sticking "wiki" into it usually suffices. :)
Re: (Score:3, Insightful)
Meh, look at any /. article about Java and you'll see somebody complain about the speed of Java, and a reply explaining that Java isn't particularly slow. It has some weaknesses that mean it isn't as optimal as really good C, but it also has some
Re:k (Score:5, Informative)
Really? Am I the only person that found it interesting that Lucene, the only non C/C++ implementation, gave some pretty impressive stats? I mean, it's written in Java and although it has a slower index time its search time, index size and relevancy are impressive.
Lucene is a great search tool. As TFA pointed out, however, if you're looking for a "search solution" rather than "search engine" then you should check out Solr [apache.org] instead. Lucene is a toolkit that you build on top of, not something you really want to deploy by itself. Solr is that thing built on top of Lucene.
Be aware that while Lucene/Solr has made terrific progress, it is not quite in the "enterprise search" category. For superscale implementations you'll still likely need to look at a high-priced product like FAST [microsoft.com].
Re:k (Score:5, Informative)
Re: (Score:1, Interesting)
Solr/Lucene power a number of sites that would be in the enterprise search category (Apple, Netflix, C-Net). Where I work, we index 5 million docs in Solr/Lucene and serve out millions of search requests a day. It's not Google scale, but most people don't need that. The markets where one needs a FAST are dwindling quickly.
I work in a shop that uses FAST, despite pressure from some to move to Solr. As I understand it, Solr can't keep up with the volume of changes we need to make to our data. I'm talking millions of documents with 100+ fields each changed per day, with any given change visible to the customer within a short timeframe (10 minutes). Solr can index that much data easily, but it can't keep up with that kind of change volume. That's what I've been told anyway.
Re: (Score:3, Informative)
We have one index that's updated every 20 minutes, but it only has about 50k documents, and a combination of Solr cache auto-warming and squid's stale-while-revalidate logic works there.
In another system where updates need to be faster, we had to do some custom work to make
Re: (Score:2)
Really? Am I the only person that found it interesting that Lucene, the only non C/C++ implementation, gave some pretty impressive stats? I mean, it's written in Java and although it has a slower index time its search time, index size and relevancy are impressive.
Of course you are, fool! Everyone else on slashdot knows exactly how Lucene and sqlite's indexing systems work. I don't know why they bothered to take the benchmarks at all, anyone with half a clue has integrated a Java engine running Lucene into sqlite and hooked it into MyISAM already..
Re: (Score:1)
I mean, it's written in Java and although it has a slower index time its search time, index size and relevancy are impressive.
In the "benchmark," it wasn't just impressive in those areas: it had the lowest search time, the smallest index, and the highest relevance. That makes top honors, in my book.
Re: (Score:2)
Far more likely to be because of the choice of algorithms and the resources behind the project. Would be interesting to see how CLucene [sourceforge.net] performs.
Java is not slow (anymore) (Score:1)
Java can't seem to get past its reputation for being slow, which quite simply is no longer true. Java can match and even exceed the speed of C/C++ implementations. This often seems like an impossible, even outrageous claim to many C/C++ developers. What they fail to see is that Java's HotSpot compiler compiles critical code sections at runtime on the client computer. This has the advantage over C/C++ programs that the compiler has detailed info about the system it's running on and therefore can perform sp
Re: (Score:2)
Nothing else to say, really
Really? Am I the only person that found it interesting that Lucene, the only non C/C++ implementation, gave some pretty impressive stats? I mean, it's written in Java and although it has a slower index time its search time, index size and relevancy are impressive.
Yes, that's pretty much just you. Different algorithms, therefore different performance. Reimplement Lucene in C++, then see what the differences are in terms of speed (and if you care, code size, complexity, etc.). Until then the comparison is totally meaningless.
And gee, what's with the defensive attitude...
Re: (Score:3, Informative)
Re:k (Score:5, Informative)
Ah, thank you. So indeed, an implementation of the same algorithm turns out to be _three times_ as fast in C++ as it is in Java (see here [sourceforge.net]).
I wonder if eldavojohn wishes to comment on that?
Re: (Score:2)
It's no surprise to me. Java has long since been the best technology for all things internet. Streaming servers, forum software, indexing/archiving, Web2.0 sites; it's only several dozen times faster than Ruby or PHP, with similar memory usage. And I'm not talking applets here - I mean the backend. Tomcat is even significantly faster than mod_php or fastCGI with their C backends.
Keep in mind that anything Java based has VM overhead. If they included that in the Lucene graphs, then it performed the best whil
Re: (Score:2)
Hear the heads exploding - Java is fastest (Score:5, Informative)
Okay so the fastest engine is using Lucene, a Java search engine, and this is neither tuned nor horizontally scaled (which it can do very well).
C++ and C both fail to deliver the same level of performance as the Java virtual machine.
Oh wait hang on... does this mean that for complex applications the most important performance piece is normally actually the efficiency of the code rather than the efficiency of the base platform and therefore having a language in which it is easier to write efficient code is better than just having the one that is fastest to execute a for loop?
But hell this is Slashdot and Java is Slooooooow...
Re: (Score:2)
You beat me to the comment. I'm sort of surprised that the reaction so far has been the sound of crickets and loud yawning... meh
Re: (Score:1)
Well, the OP certainly got a loud yawn from me for the remark about indexing twitter posts. They might just as well index cockroach farts.
Re: (Score:2, Interesting)
It may be a bit faster at searching, but it takes ~5 times as long to generate the index and uses twice as much memory when searching, so it may just be a different trade-off between index time and search time.
And it's a bad search test, because the total search time is less than 2 seconds, thus not including the cost of the GC for Java.
Hint to people doing benchmarks: When benchmarking a component which uses GC or similar memory handling methods, remember to have the test dataset be large enough that you ca
Re: (Score:2)
Hint to people doing benchmarks: When benchmarking a component which uses GC or similar memory handling methods, remember to have the test dataset be large enough that you cause enough GC cycles to make the performance of any single cycle noise.
I have an even better idea. Why don't we just model the benchmark on the real world usage scenarios, and let those decide whether garbage collection and allocation even matter?
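For what it's worth, both suggestions can live in the same harness. Here is a minimal sketch in Python (not the benchmark code from TFA): it discards a few warm-up runs, leaves the collector enabled so that pauses land in the samples, and reports how many collections happened during the timed runs. The workload function `allocate_heavily` is a hypothetical stand-in for a realistic query mix, and CPython's collector behaves quite differently from the JVM's, so treat this as an illustration of the method only.

```python
import gc
import statistics
import time


def benchmark(fn, warmup=3, repeats=20):
    """Time fn() repeatedly and return per-run wall-clock seconds.

    Warm-up runs are discarded. The garbage collector stays enabled, so
    any collection pauses are included in the samples rather than hidden.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def allocate_heavily(n=20000):
    # Hypothetical workload: builds many small container objects so the
    # collector has work to do. Replace with a real usage scenario.
    return [{"id": i, "tokens": [i, i + 1]} for i in range(n)]


def total_collections():
    # Sum of collection counts across all GC generations so far.
    return sum(gen["collections"] for gen in gc.get_stats())


collections_before = total_collections()
samples = benchmark(allocate_heavily)
summary = {
    "runs": len(samples),
    "median_s": statistics.median(samples),
    "max_s": max(samples),
    "gc_collections_during_runs": total_collections() - collections_before,
}
print(summary)
```

If the collection count during the timed runs is close to zero, the benchmark is probably too small to say anything about collector cost, which is the point being argued above. A large gap between the median and the maximum is a hint that a pause landed inside a single run.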
Re: (Score:2)
I can't seem to move into a lane of faster traffic in heavy traffic situations on the highway without being able to immediately accelerate. I can't seem to shift into a lower gear without using the knockdown mechanism, which requires me to depress the accelerator the whole way and wait a second for everything to engage. In a manual, I can downshift to fourth or third and control my speed, enter an opening, and accelerate quickly without fear that backing off the accelerator a little (you try keeping contr
Re: (Score:1, Insightful)
I stopped driving automatic, it stopped almost getting me into accidents.
You're a fucking idiot. Get off my road.
Re: (Score:2)
Re:Hear the heads exploding - Java is fastest (Score:4, Informative)
Hrm, this had absolutely nothing to do with the language. It has almost everything to do with the algorithms.
It's very hard to compare languages. Maybe if you use the languages to implement the exact same algorithm and let it run for a long while... But that still doesn't really compare them well enough.
Like somebody already said: Bubble sort in C++ is (almost) always slower than a quicksort in Java.
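To put a rough number on the bubble sort versus quicksort point, here is a small sketch in Python that counts comparisons instead of timing anything, so the language drops out of the picture entirely. Both functions are textbook versions written for this illustration, and the quicksort is the simple list-building variant, not an in-place one.

```python
import random


def bubble_sort(items):
    """Return (sorted copy, comparison count). Always n*(n-1)/2 comparisons."""
    data = list(items)
    comparisons = 0
    for end in range(len(data) - 1, 0, -1):
        for i in range(end):
            comparisons += 1
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data, comparisons


def quicksort(items):
    """Return (sorted copy, comparison count). About n*log(n) on average."""
    if len(items) <= 1:
        return list(items), 0
    pivot = items[len(items) // 2]
    smaller, equal, larger = [], [], []
    comparisons = 0
    for x in items:
        comparisons += 1  # one three-way comparison against the pivot
        if x < pivot:
            smaller.append(x)
        elif x > pivot:
            larger.append(x)
        else:
            equal.append(x)
    left, left_count = quicksort(smaller)
    right, right_count = quicksort(larger)
    return left + equal + right, comparisons + left_count + right_count


rng = random.Random(42)  # fixed seed so runs are repeatable
values = [rng.randrange(1_000_000) for _ in range(1000)]
bubble_result, bubble_comparisons = bubble_sort(values)
quick_result, quick_comparisons = quicksort(values)
print("bubble sort comparisons:", bubble_comparisons)
print("quicksort comparisons:  ", quick_comparisons)
```

For 1000 items the bubble sort always does 499,500 comparisons, while the quicksort on random input needs a small fraction of that. A constant-factor advantage from the implementation language has a hard time making up a gap that keeps growing with input size.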
Re: (Score:2)
C++ and C both fail to deliver the same level of performance as the Java virtual machine.
Oh wait hang on...
As was pointed out above, the search engines spend >90% of their time in DB/file I/O code.
In other words, implementation language plays little role; it is the I/O optimization algorithms which play the bigger role.
From my experience with a number of C/C++ projects, the efficiency of the languages/compilers allows developers to remain ignorant. In Java that approach simply doesn't work. Thus I more often see better algorithms in less efficient languages.
Like I recently found in one program people used
Re: (Score:1, Flamebait)
Java is slow. If you took the same algorithm and coded it in an efficient compiled language, it would be faster. Much faster.
Re: (Score:1)
T
Re: (Score:2)
Oh wait hang on... does this mean that for complex applications the most important performance piece is normally actually the efficiency of the code rather than the efficiency of the base platform...
Yes.
...and therefore having a language in which it is easier to write efficient code is better than just having the one that is fastest to execute a for loop?
No.
Re: (Score:3, Informative)
Okay so the fastest engine is using Lucene, a Java search engine, and this is neither tuned nor horizontally scaled (which it can do very well).
C++ and C both fail to deliver the same level of performance as the Java virtual machine.
Oh wait hang on... does this mean that for complex applications the most important performance piece is normally actually the efficiency of the code rather than the efficiency of the base platform and therefore having a language in which it is easier to write efficient code is better than just having the one that is fastest to execute a for loop?
But hell this is Slashdot and Java is Slooooooow...
Actually if you check here [sourceforge.net], you will find that an implementation of the exact same Lucene done in C++ is about three times faster than Java.
Sorry for spoiling your moment there...
I've been very happy with Sphinx.... (Score:2)
...have used it on several projects and always gotten good results. Setting it up is easy and the Ruby API is solid, although I needed a tiny bit of additional code for special character escaping [blogs.com]. Highly recommended!
SQLite is a search engine?!??! (Score:3, Insightful)
Oh wait - seems TFA is saying a lot of sites just use an SQL DB and use like '%FOO%' as a "search engine"....
Ok, this is reasonable, however, I don't see why anyone would choose SQLite as a benchmark. If you are trying to compare search engines, and consider an RDBMS to be a 'search engine' category, then you at least need to include 4 or 5 of the most popular open source RDBMSs in the benchmark (SQLite, PostgreSQL, MySQL, Derby, Firebird), not just one.
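For anyone who hasn't seen the like '%FOO%' approach next to what a search engine does, here is a minimal sketch in Python using the standard `sqlite3` module and a toy in-memory inverted index. The three documents and the table name `docs` are made up for the example. The toy index only illustrates the general idea of mapping terms to documents; it is not how Lucene, Sphinx, or any of the benchmarked engines actually store their data.

```python
import sqlite3
from collections import defaultdict

# Made-up sample documents, keyed by id.
docs = {
    1: "lucene is a search library written in java",
    2: "sphinx and xapian are written in c++",
    3: "sqlite is an embedded sql database",
}

# Approach 1: substring matching with LIKE. A leading wildcard means the
# database has to examine the body of every row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?)", docs.items())


def like_search(fragment):
    rows = conn.execute(
        "SELECT id FROM docs WHERE body LIKE ? ORDER BY id",
        (f"%{fragment}%",),
    )
    return [row[0] for row in rows]


# Approach 2: a toy inverted index mapping each term to the set of
# documents containing it. Lookup is by whole term, not by substring.
index = defaultdict(set)
for doc_id, body in docs.items():
    for term in body.split():
        index[term].add(doc_id)


def index_search(term):
    return sorted(index.get(term, set()))


print(like_search("written"), index_search("written"))
print(like_search("sea"), index_search("sea"))
```

Both approaches find documents 1 and 2 for "written". For the fragment "sea", LIKE still matches document 1 because "search" contains it, while the term index returns nothing. That difference (substring scan versus term lookup) is one reason the two categories are hard to benchmark against each other fairly.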
Re: (Score:1)
Although they might have full text indexing and searching, databases and search engines/libraries work differently.
E.g. you go to an online DVD shop and search for "Tom Criuse" (hint: misspelled surname). Every decent search engine (including the Lucene library, not sure about the others evaluated here) would yield a result despite the misspelling. I am not sure whether a database fulltext feature would return anything at all. It's simply built to do a different job, that's it.
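The "Tom Criuse" case is easy to sketch. The Python below uses the standard library's `difflib` as a stand-in for approximate matching; it is not Lucene's fuzzy query implementation, and the list of names and the 0.8 cutoff are assumptions picked for the example.

```python
import difflib

# Made-up catalogue of names for the example.
names = ["Tom Cruise", "Tom Hanks", "Penelope Cruz", "Russell Crowe"]


def exact_search(query, candidates):
    """Case-insensitive exact match, similar in spirit to a plain equality lookup."""
    return [c for c in candidates if c.lower() == query.lower()]


def fuzzy_search(query, candidates, cutoff=0.8):
    """Approximate match using difflib's similarity ratio (assumed cutoff)."""
    by_lowercase = {c.lower(): c for c in candidates}
    close = difflib.get_close_matches(
        query.lower(), list(by_lowercase), n=3, cutoff=cutoff
    )
    return [by_lowercase[c] for c in close]


print(exact_search("Tom Criuse", names))
print(fuzzy_search("Tom Criuse", names))
```

The exact lookup comes back empty, while the approximate lookup returns "Tom Cruise". A real engine would do this against its term dictionary instead of a Python list, but the user-visible behaviour is the one described above.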
CLucene (Score:5, Insightful)
Here is CLucene's description on SourceForge: "CLucene is a C++ port of Lucene: the high-performance, full-featured text search engine written in Java. CLucene is faster than lucene as it is written in C++."
Re: (Score:3, Insightful)
CLucene is faster than lucene as it is written in C++.
XXX is better than YYY as it is written in [my favorite language].
Haven't we explored this one to death already? Java isn't slow, and there's nothing magic about C/C++. Badly written C/C++ gets trounced by Java any day, and algorithmic efficiency trounces both of those when it comes to complex functions like indexed searches.
Re: (Score:3, Insightful)
But if it's a direct port of Lucene presumably it's using the same algorithms and has similar code quality - hence it provides a good direct comparison of the language speeds and such a comment is legit.
Re: (Score:2)
Haven't we explored this one to death already? Java isn't slow, and there's nothing magic about C/C++. Badly written C/C++ gets trounced by Java any day, and algorithmic efficiency trounces both of those when it comes to complex functions like indexed searches.
Actually, on synthetic benchmarks a C/C++ implementation might outperform the Java implementation. Some benchmarks are crafted to essentially test memory bandwidth, where C/C++ easily wins.
And still, well written C/C++ code scales orders of magnitude better than Java code. Resource management is a bitch. I have seen that win a number of deals.
Re: (Score:2)
CLucene is faster, and uses less memory, from what is basically a direct port. The README includes some benchmarks:
There are 250 HTML files under $JAVA_HOME/docs/api/java/util for about
6108kb of HTML text.
org.apache.lucene.demo.IndexFiles with java and gcj:
on mac os x 10.3.1 (panther) powerbook g4 1ghz 1gb:
. running with java 1.4.1_01-99 : 20379 ms
. running with gcj 3.3.2 -O2 : 17842 ms
. running clucene 0.8.9's demo : 9930 ms
I recently did some more tests and came up with these rough results:
663mb (797 files) of Gutenberg texts
on a Pentium 4 running Windows XP with 1 GB of RAM. Indexing max 100,000 fields
-> Jlucene: 646453ms. peak mem usage ~72mb, avg ~14mb ram
-> Clucene: 232141ms. peak mem usage ~60mb, avg ~4mb ram
Searching the index using 10,000 single word queries
-> Jlucene: ~60078ms and used ~13mb ram
-> Clucene: ~48359ms and used ~4.2mb ram
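Since "three times faster" keeps coming up in this thread, here is the arithmetic on the README figures quoted above, as a small Python sketch. The numbers are copied from the quote, and it assumes the CLucene indexing figure of 232141 is in milliseconds like the others. The dictionary names are just labels for this example.

```python
# Timings in milliseconds, copied from the README figures quoted above.
indexing_powerbook = {"java 1.4.1": 20379, "gcj 3.3.2 -O2": 17842, "clucene 0.8.9": 9930}
indexing_gutenberg = {"jlucene": 646453, "clucene": 232141}  # unit assumed to be ms
searching_gutenberg = {"jlucene": 60078, "clucene": 48359}


def speedup(slow_ms, fast_ms):
    """How many times faster the second timing is than the first."""
    return slow_ms / fast_ms


ratios = {
    "indexing (PowerBook, java vs clucene)": speedup(
        indexing_powerbook["java 1.4.1"], indexing_powerbook["clucene 0.8.9"]
    ),
    "indexing (Gutenberg, jlucene vs clucene)": speedup(
        indexing_gutenberg["jlucene"], indexing_gutenberg["clucene"]
    ),
    "searching (Gutenberg, jlucene vs clucene)": speedup(
        searching_gutenberg["jlucene"], searching_gutenberg["clucene"]
    ),
}

for label, ratio in ratios.items():
    print(f"{label}: {ratio:.2f}x")
```

By these figures CLucene indexes roughly 2x to 2.8x faster, but searches only about 1.2x faster. So the "three times" claim is close for indexing on the larger test and well off for searching, at least on these old versions and this hardware.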
Re: (Score:1)
How do these compare to Oracle? (Score:2)
PLEASE (Score:1)
Re: (Score:2)
Swish++ not mentioned? (Score:3, Informative)
Last time I had to implement an indexing and searching solution, swish++ [sourceforge.net] was by far the performance winner.
Try DBSight -- Lucene based Database Search (Score:1)