Vector Space Search Engines in Perl
cascadefx writes "There is a great article on writing a Vector Space Search Engine in Perl over at the O'Reilly Network's Perl.com site. Most search engines, such as Google, use an inverted keyword index. The article, complete with code, proves yet again that there is (definitely) more than one way to do it. Try your hand. Add improvements and you may have a handy search engine for your site or... the next Google-killer."
Wait till... (Score:1, Interesting)
Yes, but... (Score:2)
no good for large collections of documents (Score:5, Insightful)
It's a pity the article doesn't make this clearer (instead, it just says "Searches take place in RAM, so there's no disk or database access", which makes it sound as if it would be faster!)
There's a reason people invented indexing schemes!
Re:no good for large collections of documents (Score:5, Insightful)
Since when is linear scaling abysmal? Keep in mind that for very large collections, you can use clustering or other techniques to reduce the number of comparisons. You essentially layer an indexing step on top of things, so that you only search the most promising vectors.
Keep in mind also that vector comparisons are very, very, very fast. People have run vector space search engines on collections in the millions without running into performance issues. And the claim that vector search is faster than a database search is true, for as long as you can fit all the document vectors into RAM.
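To make the in-RAM comparison concrete, here is a rough sketch (not the article's actual code) of the core step: score every document vector against a query vector with cosine similarity. The documents and weights below are made up for illustration.

#!/usr/bin/perl
use strict;
use warnings;

# Toy document vectors: doc id => { term => weight }.
# A real engine would build these during an indexing pass.
my %docs = (
    doc1 => { perl => 2, search => 1, vector => 3 },
    doc2 => { golf => 4, tiger => 2 },
    doc3 => { perl => 1, vector => 1, space => 2 },
);

my %query = ( vector => 1, search => 1 );

sub cosine {
    my ($x, $y) = @_;
    my ($dot, $nx, $ny) = (0, 0, 0);
    $dot += ($x->{$_} || 0) * $y->{$_} for keys %$y;
    $nx  += $_ ** 2 for values %$x;
    $ny  += $_ ** 2 for values %$y;
    return 0 unless $nx && $ny;
    return $dot / (sqrt($nx) * sqrt($ny));
}

# A linear scan over every document vector -- exactly the scaling
# behaviour being discussed above.
my @ranked = sort { $b->[1] <=> $a->[1] }
             map  { [ $_, cosine($docs{$_}, \%query) ] }
             keys %docs;

printf "%-6s %.3f\n", @$_ for @ranked;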
Re:no good for large collections of documents (Score:3, Insightful)
Also, I've worked on SOMs previously, and there is no way you can hold all the document vectors in RAM. Think Google, where you're doing a search over 3 million documents. The part that really hurts is the number of possible keywords.
Re:no good for large collections of documents (Score:4, Informative)
You're completely right about web-size collections being too big to hold in RAM. It's a question of defining the problem space. My article was targeted at the hobbyist, who is far more likely to be building a search engine for a website (thousands of pages) than for the entire web.
Presumably, if you're trying to index the web, you won't be writing your search code in Perl.
That's not to say you can't scale vector-space search engines, but it makes things complex. There is a lot of handwaving in the article to try and keep things simple. One example, as you point out, is neglecting to mention practical ways of reducing vector size (the number of keywords). That problem goes away if you use latent semantic analysis, which reduces the dimensionality of the vectors by several orders of magnitude. As a nice side effect, you get a search engine with better recall, one that can give you relevant results even without a keyword match.
But LSI has its own scaling issues, and those scaling issues have their own workarounds, and so on and so on. Put it all in an introductory article, and it gets overwhelming.
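To illustrate the folding step without the SVD machinery, here is a toy sketch; the term-by-concept projection matrix is invented for the example (in a real LSI system it would come from the SVD of the term-document matrix, truncated to a few hundred dimensions).

use strict;
use warnings;

# Invented term-by-concept projection (2 "concepts"), standing in for
# the truncated SVD factors a real LSI engine would compute.
my %project = (
    golf   => [ 0.9, 0.1 ],
    tiger  => [ 0.8, 0.2 ],
    woods  => [ 0.7, 0.1 ],
    perl   => [ 0.1, 0.9 ],
    vector => [ 0.2, 0.8 ],
);

# Fold a bag of words into concept space by summing its terms' rows.
sub fold {
    my %terms = @_;
    my @v = (0, 0);
    for my $t (keys %terms) {
        next unless $project{$t};
        $v[$_] += $terms{$t} * $project{$t}[$_] for 0 .. $#v;
    }
    return \@v;
}

my $query = fold(woods => 1);            # the query never says "golf"
my $doc   = fold(golf => 2, tiger => 1); # the document never says "woods"

my ($dot, $nq, $nd) = (0, 0, 0);
for my $i (0 .. 1) {
    $dot += $query->[$i] * $doc->[$i];
    $nq  += $query->[$i] ** 2;
    $nd  += $doc->[$i]  ** 2;
}
printf "similarity with no shared keyword: %.3f\n",
       $dot / (sqrt($nq) * sqrt($nd));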
Still, I think there's room for novel algorithmic approaches in the 100-100K document range. To give you one example, we run an LSI search engine on a database of old Civil War articles for the University of Virginia. There's no way the search could scale to ten million documents, but it doesn't have to. The collection is bounded.
I suspect other people have interesting collections in that kind of size range. Not everything has to scale to the billions of documents to be useful.
Re:no good for large collections of documents (Score:2, Insightful)
As far as SOM stuff goes, Kohonen [cis.hut.fi] is the man.
There are indeed many things that can be done with SOMs, but to get really good, human-like, results is quite difficult. The problem expands when you need to consider different words for the same thing (e.g., cat/feline), and exclude words that appear identical, but aren't. Then the need to consider context becomes increasingly important (e.g., this article is not about cats). The complexity builds quite quickly.
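One crude workaround for the cat/feline problem is to expand query terms through a synonym table before building the query vector. This is only a sketch: the table here is made up, and a real system would use a thesaurus or learn the associations (which is roughly what LSI buys you).

use strict;
use warnings;

# Made-up synonym table for illustration only.
my %synonyms = (
    cat => [ qw(feline kitty) ],
    car => [ qw(automobile auto) ],
);

# Build a query vector, giving synonyms a smaller weight than the
# literal query terms so they broaden the search without dominating it.
sub expand_query {
    my @terms = @_;
    my %vec;
    for my $t (@terms) {
        $vec{$t} += 1;
        $vec{$_} += 0.5 for @{ $synonyms{$t} || [] };
    }
    return \%vec;
}

my $qvec = expand_query(qw(cat pictures));
print "$_ => $qvec->{$_}\n" for sort keys %$qvec;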
Re:no good for large collections of documents (Score:2)
Now that I think of it, to some extent, both the Google PageRank methodology and some of the emerging weblog and news-tracking tools utilize hyperlink information very effectively to build a sort of loose semantic grouping between pieces of web content. However, from what I understand of the PageRank algorithm, at least, the conceptual model is much closer to basic flow graphs than vector spaces, and it (arguably unfairly) more or less completely ignores 'island' sites or pages, no matter how closely they match a particular query.
I'm actually working on a simple search engine implementation for a couple of small websites, and may have to tinker a bit with some of this.
One thing I'm not sure about, though: are there any standard, efficient *persistent* data stores/structures especially suitable for this kind of vector data? I'm hoping to work in Ruby, which unfortunately lacks a complete mathematical data library like PDL, but can easily hook in either a simple C library or a native implementation of any appropriate structures.
Re:no good for large collections of documents (Score:1)
I believe islands of pages are ignored because of abuse: you can basically write some scripts that create an assload of interconnected web pages, thereby driving the PageRank up. Another way to think of it is that if the pages are all correlated (i.e., from one source), then each page doesn't add as much information to the equation as an independently authored page would, so you should discard them.
To me the biggest problem is the lack of contextual understanding. If the search engine could recognize computer-generated gibberish, it could eliminate most of the abuse going on.
My other pet peeve is doing a search and just getting some guy's list of random links that contain my search terms but have no relation to each other.
I am unfamiliar with the Porter algorithm. Got a good link handy?
Re:no good for large collections of documents (Score:2)
As far as spoofing is concerned, though, I'm afraid that LSI and other "semantic" search methods will only outpace spammed results for a while, since the same basic algorithms that let you determine the relevance of a document could quite conceivably be used to generate the "ideal" match for any search you can think of.
For the porter stemming algorithm, check out the following:
http://www.tartarus.org/~martin/PorterStemmer/
There are implementations for just about any language I could imagine someone using for serious web development.
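For Perl specifically, a Porter-style stemmer is available on CPAN. A minimal sketch, assuming the Lingua::Stem module's functional interface (check its docs, since the interface may differ):

use strict;
use warnings;
use Lingua::Stem qw(stem);   # Porter-style stemming from CPAN

# stem() takes a list of words and returns a reference to a list of stems.
my @words = qw(searching searches searched relevance relevant);
my $stems = stem(@words);

print "$words[$_] -> $stems->[$_]\n" for 0 .. $#words;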
Re:no good for large collections of documents (Score:2, Informative)
One killer feature of this is...
Quote: "Looking for articles about Tiger Woods in the same database brings up many stories about the golfer, followed by articles about major golf tournaments that don't mention his name. Constraining the search to days when no articles were written about Tiger Woods still brings up stories about golf tournaments and well-known players." -- bloody amazing and relatively simple to do.
And it seems to be scaling very well at the moment.
Re:no good for large collections of documents (Score:2)
Disclaimer: I work in text retrieval for a living.
This (the claim that vector search beats an indexed database search as long as the vectors fit in RAM) is not true.
Recent research [rmit.edu.au] has shown it's quicker to access a compressed inverted file than it is to access an uncompressed inverted file, even if the whole index fits in RAM, provided your compression method is fast on your processor.
The reason for this counter-intuitive result is entirely due to cache performance. For a vector search, say, you have to consult the whole index, so your working set is the whole memory. For an inverted file-based search, you only need to consult the terms of interest. Not only is your working set much smaller, it's mostly accessing contiguous blocks of memory in a coherent manner.
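For contrast, here is a toy sketch of the inverted-file side of that comparison: the index maps each term to a postings list, so a query only reads the short, contiguous lists for its own terms instead of touching every document vector. The data is made up for illustration.

use strict;
use warnings;

# Toy inverted file: term => sorted list of ids of documents containing it.
my %postings = (
    perl   => [ 1, 4, 7 ],
    vector => [ 4, 7, 9 ],
    golf   => [ 2, 3 ],
);

# Conjunctive (AND) query: only the postings lists for the query terms
# are ever read; the rest of the collection is never scanned.
sub and_query {
    my @terms = @_;
    my %count;
    for my $t (@terms) {
        $count{$_}++ for @{ $postings{$t} || [] };
    }
    return grep { $count{$_} == @terms } sort { $a <=> $b } keys %count;
}

print join(", ", and_query(qw(perl vector))), "\n";   # prints: 4, 7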
Re:no good for large collections of documents (Score:1)
-Kevin
Vector-based search partially exists already (Score:1)
applications in spellchecking? (Score:2)
Just taking the counts of the different letters in the word keeps closely-spelled words together, but it also maps other, far less likely words to the same (or a close) point in the vector space. For example, "101111001000" and "010000110111" would map to the same point, although it's unlikely that someone trying to type one would accidentally type the other (unless the typist was dyslexic or something, in which case the spellchecker doesn't really have to care).
So we need something that gives a bit more information than simply positioning the word in a 2-dimensional vector space. We can notice that similar words often have similar sequences of letters, as well as a similar total count of letters. That is, count how many times each letter ("0" or "1") occurs after each letter ("0" or "1"). For example, in "10011101" and "10011001": the number of "0"s followed by "0"s is 1 in the first, 2 in the second; the number of "0"s followed by "1"s is 2 in both; the number of "1"s followed by "0"s is 2 in both; and the number of "1"s followed by "1"s is 2 in the first, 1 in the second. So the words map to the points (1, 2, 2, 2) and (2, 2, 2, 1) respectively, in the 4-dimensional vector space over the naturals (N x N x N x N). The distance between these two points is sqrt(2), which is pretty close.
We can combine the first approach and the second: for each misspelled word, do a search for other words in both the 2D space and the 4D space, and accept all words that fall within a certain distance threshold of the misspelled word in both spaces. Potentially you could extend the spaces so that you consider 3 consecutive letters (an 8D space for binary strings), but as the dimensions get higher, you have to put more work into optimizing the searches. If we move from binary strings to actual English words, it's not a 2D and 4D space anymore but a 26D and 676D space. Still, it's possible to come up with some nice sparse data structures that make searches reasonably efficient in both space and time.
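Here is a rough Perl sketch of the bigram idea applied to ordinary words instead of bit strings; the dictionary, distance function, and ranking are arbitrary choices for illustration.

use strict;
use warnings;

# Count letter bigrams in a word and return them as a sparse vector,
# e.g. "felne" => { fe => 1, el => 1, ln => 1, ne => 1 }.
sub bigram_vector {
    my ($word) = @_;
    my %v;
    $v{ substr($word, $_, 2) }++ for 0 .. length($word) - 2;
    return \%v;
}

# Euclidean distance between two sparse bigram vectors.
sub distance {
    my ($x, $y) = @_;
    my %keys = map { $_ => 1 } keys %$x, keys %$y;
    my $sum = 0;
    $sum += (($x->{$_} || 0) - ($y->{$_} || 0)) ** 2 for keys %keys;
    return sqrt($sum);
}

my @dictionary = qw(feline affine field fell line);
my $typo       = 'felne';
my $typo_vec   = bigram_vector($typo);

# Rank dictionary words by bigram distance to the misspelling; a real
# spellchecker would cut this off at some distance threshold.
for my $word (sort { distance(bigram_vector($a), $typo_vec)
                 <=> distance(bigram_vector($b), $typo_vec) } @dictionary) {
    printf "%-8s %.3f\n", $word, distance(bigram_vector($word), $typo_vec);
}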
I'm not exactly sure why I'm posting this.. but I blazed last night and the shit was good.. but my coat got stolen.. and I've got cobwebs this morning.. and I tend to think of CS things a lot when I have cobwebs in my mind after a night of battling the green ninja.
-Laxitive
Re:applications in spellchecking? (Score:2)
I can't work out if you're trolling or seriously thinking about this. There are some massive flaws in the theory as stated, but having given it more thought myself, I think you're on to a great idea.
Binary is not the way to go with words, though. After all, the letter A is always 01000001; how does that connect it to any other letter that happens to have two 1s in its bit pattern (say B, which is 01000010)?
I think you need to use a multidimensional arrangement and somehow use numbers that better represent the word, perhaps common letter combinations or some such.
Still, I'm too busy with work to think about this in any depth, but it sounds like there may be some mileage in your rough idea.