Perl Programming

Vector Space Search Engines in Perl

cascadefx writes "There is a great article on writing a Vector Space Search Engine in Perl over at the O'Reilly Network's Perl.com site. Most search engines, such as Google, use an inverted keyword index. The article, complete with code, proves yet again that there is (definitely) more than one way to do it. Try your hand. Add improvements and you may have a handy search engine for your site or... the next Google-killer."
  • Wait till... (Score:1, Interesting)

    by Anonymous Coward
    ... Quantum(tm) Computers become available! They can search through Hilbert space in polytime! Beat that with your puny Turing machines, ha!
  • by rpeppe ( 198035 ) on Wednesday February 26, 2003 @12:57PM (#5387427)
    As far as I can see, this algorithm runs in time proportional to the number of documents, which might be fine for a small number of documents, but would be abysmal in large environments.

    It's a pity the article doesn't make this clearer (instead, it just says "Searches take place in RAM, so there's no disk or database access", which makes it sound as if it would be faster!)

    There's a reason people invented indexing schemes!

    • [disclaimer: I wrote the article]

      Since when is linear scaling abysmal? Keep in mind that for very large collections, you can use clustering or other techniques to reduce the number of comparisons. You essentially layer an indexing step on top of things, so that you only search the most promising vectors.

      Keep in mind also that vector comparisons are very, very, very fast. People have run vector space search engines on collections in the millions without running into performance issues. And the claim that vector search is faster than a database search is true, for as long as you can fit all the document vectors into RAM.
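      To make that concrete, here is a minimal sketch of the comparison step (not the article's actual code; it assumes sparse term vectors stored as plain Perl hashes):

        use strict;
        use warnings;

        # Each document is a sparse vector: { term => weight }.
        sub norm {
            my ($v) = @_;
            my $sum = 0;
            $sum += $_ ** 2 for values %$v;
            return sqrt $sum;
        }

        # Cosine similarity = dot product / (norm(x) * norm(y)).
        sub cosine {
            my ( $x, $y ) = @_;
            ( $x, $y ) = ( $y, $x ) if keys %$y < keys %$x;   # walk the sparser vector
            my $dot = 0;
            for my $term ( keys %$x ) {
                $dot += $x->{$term} * $y->{$term} if exists $y->{$term};
            }
            return $dot ? $dot / ( norm($x) * norm($y) ) : 0;
        }

        # Ranking a query is one cosine() call per document vector -- hence
        # the "time proportional to the number of documents" point above.
        my %query = ( civil => 1, war => 1 );
        my %doc   = ( civil => 2, war => 3, battle => 1 );
        printf "%.3f\n", cosine( \%query, \%doc );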
      • Skimming the article, it seems to be quite similar to Self-Organizing Maps or Vector Quantizers. Care to comment?

        Also, I've worked on SOMs previously, and there is no way you can hold all the document vectors in RAM. Think Google, where you are searching over 3 million documents. The part that really hurts is the number of possible keywords.

        • I hadn't heard of either technique - if you have some links, I'd love to look at them.

          You're completely right about web-size collections being too big to hold in RAM. It's a question of defining the problem space. My article was targeted for the hobbyist, who is far more likely to be building a search engine for a website (thousands of pages) than for the entire web.

          Presumably, if you're trying to index the web, you won't be writing your search code in Perl ;-)

          That's not to say you can't scale vector-space search engines. But it makes things complex. There is a lot of handwaving in the article, to try and keep things simple. One example, as you point out, is neglecting to mention practical ways of reducing vector size (number of keywords). That problem goes away if you use latent semantic analysis, which reduces the dimensionality of the vectors by several orders of magnitude. And as a nice side effect, you get a search engine with better recall, that can give you relevant results without a keyword match.

          But LSI has its own scaling issues, and those scaling issues have their own workarounds, and so on and so on. Put it all in an introductory article, and it gets overwhelming.
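          (On the "reducing vector size" point above: as a much simpler stand-in for LSI, plain document-frequency pruning already shrinks the keyword space a lot. A rough sketch, not from the article:)

            use strict;
            use warnings;

            # %docs maps doc id => { term => count }; prune terms that are too rare
            # or too common to be useful, shrinking every vector's dimensionality.
            sub prune_terms {
                my ( $docs, $min_df, $max_frac ) = @_;
                my %df;
                for my $vec ( values %$docs ) {
                    $df{$_}++ for keys %$vec;
                }
                my $n      = scalar keys %$docs;
                my $max_df = $max_frac * $n;
                for my $vec ( values %$docs ) {
                    for my $term ( keys %$vec ) {
                        if ( $df{$term} < $min_df || $df{$term} > $max_df ) {
                            delete $vec->{$term};
                        }
                    }
                }
            }

            my %docs = (
                d1 => { the => 5, civil => 2, war   => 3 },
                d2 => { the => 4, war   => 1, golf  => 2 },
                d3 => { the => 7, golf  => 3, tiger => 1 },
            );
            prune_terms( \%docs, 2, 0.9 );   # keep terms with df >= 2 and df <= 90% of docs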

          Still, I think there's room for novel algorithmic approaches in the 100-100K document range. To give you one example, we run an LSI search engine on a database of old Civil War articles for the University of Virginia. There's no way the search could scale to ten million documents, but it doesn't have to. The collection is bounded.

          I suspect other people have interesting collections in that kind of size range. Not everything has to scale to the billions of documents to be useful.

          • I hadn't heard of either technique
            Sadly, that is often the case. History is riddled with examples of mathematicians duplicating discoveries. That is my motivation for studying these kinds of techniques: a way to better compare scientific papers and root out similarities. The history of the wavelet is a good example; the theory was nearly hit upon several times, in slightly different forms, but the researchers involved never knew that others were making similar discoveries.

            As far as SOM stuff, Kohonen [cis.hut.fi] is the man.

            There are indeed many things that can be done with SOMs, but getting really good, human-like results is quite difficult. The problem expands when you need to treat different words for the same thing as equivalent (e.g., cat/feline) and to separate words that appear identical but aren't. Then the need to consider context becomes increasingly important (e.g., this article is not about cats). The complexity builds quite quickly.

          • One potential extension not raised in the article (though it was perhaps hinted at in the discussion of attributes attached to each document) is the use of explicit hypertext link information to affect the vector value of each document in a collection. One could even 'stem' the target URLs to the top-level domain, effecting the same kind of term-space reduction as the Porter algorithm does for English text.

            Now that I think of it, to some extent, both the Google PageRank methodology and some of the emerging weblog and news-tracking tools utilize hyperlink information very effectively to build a sort of loose semantic grouping between pieces of web content. However, from what I understand of the PageRank algorithm, at least, the conceptual model is much closer to basic flow graphs than vector spaces, and it (arguably unfairly) more or less completely ignores 'island' sites or pages, no matter how closely they match a particular query.
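            A rough sketch of how link targets might be folded into a document's term vector (hypothetical code; it uses the CPAN URI module for host extraction and a naive href regex rather than a real HTML parser):

              use strict;
              use warnings;
              use URI;

              # Add the host of each outbound link as an extra "term" in the document's
              # vector, so documents that link to the same sites pull closer together.
              sub add_link_terms {
                  my ( $html, $vector ) = @_;
                  while ( $html =~ /href\s*=\s*["']([^"']+)["']/gi ) {
                      my $uri  = URI->new($1);
                      my $host = eval { $uri->host } or next;   # skip relative/mailto links
                      $host =~ s/^www\.//;                      # crude "stemming" of the hostname
                      $vector->{"link:$host"}++;
                  }
              }

              my %vector = ( perl => 3, search => 2 );
              add_link_terms( '<a href="http://www.perl.com/">Perl.com</a>', \%vector );
              # %vector now also contains  "link:perl.com" => 1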

            I'm actually working on a simple search engine implementation for a couple of small websites, and may have to tinker a bit with some of this.

            One thing I'm not sure about, though: are there any standard, efficient *persistent* data stores/structures especially suitable for this kind of vector data? I'm hoping to work in Ruby, which unfortunately lacks a complete mathematical data library like PDL, but can easily hook in either a simple C library or a native implementation of any appropriate structures.
            • In my experience, the document vectors holding the terms need to be stored sparsely. The reason is that as new terms get added, you would otherwise have to grow all existing vectors, only to fill them with zeros.

              I believe the islands of pages are ignored because of abuse: you can basically write some scripts that create an assload of interconnected webpages, thereby making the PageRank go up. Another way to think of it is that if the pages are all correlated (i.e., one source), then each page doesn't add as much information to the equation as an independently authored page would, so you should discard them.

              To me the biggest problem is the lack of contextual understanding. If the search engine could recognize computer-generated gibberish, it could eliminate most of the abuse going on.

              My other pet peeve is doing a search and just getting some guy's list of random links that contain my search terms but bear no relation to each other.

              I am unfamiliar with the Porter algorithm. Got a good link handy?

              • The in-memory data structures for the vector search aren't a big deal at the scale I'm working with -- I'm basically able to just use hashes for all the vectors at indexing time, which lets them grow with the collection. What I'm more concerned with is an efficient on-disk structure for repeated queries of the same index, so that I don't have to re-read the entire index into memory every time I run a simple lookup.

                As far as spoofing is concerned, though, I'm afraid that LSI and other "semantic" search methods will only outpace spammed results for a while, since the same basic algorithms that let you determine the relevance of a document could quite conceivably be used to generate the "ideal" match for any search you can think of.

                For the Porter stemming algorithm, check out the following:

                http://www.tartarus.org/~martin/PorterStemmer/

                There are implementations for just about any language I could imagine someone using for serious web development.
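                As for keeping the index on disk between queries, one low-tech sketch in Perl (the core Storable module; Ruby's Marshal or a Berkeley DB binding would play the same role) is to write the finished hashes once at indexing time and read them back per query:

                  use strict;
                  use warnings;
                  use Storable qw(nstore retrieve);

                  # At indexing time: build the sparse vectors, then write them once.
                  my %vectors = (
                      'doc1.html' => { civil => 2, war   => 3 },
                      'doc2.html' => { golf  => 4, tiger => 1 },
                  );
                  nstore( \%vectors, 'vectors.db' );

                  # At query time: one read instead of re-indexing the whole collection.
                  my $vectors = retrieve('vectors.db');
                  print scalar keys %$vectors, " documents loaded\n";

                Note that retrieve() still slurps the whole structure back in at once; tying the hash to a DBM file (DB_File and friends) would let a query pull in only the terms it needs, at the cost of more bookkeeping.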
      • Hmm, I suggest you all have a read of this layman's [middlebury.edu] guide to latent semantic indexing, "the vector model on steroids" as the author puts it.

        One killer feature of this is...

        Quote: "Looking for articles about Tiger Woods in the same database brings up many stories about the golfer, followed by articles about major golf tournaments that don't mention his name. Constraining the search to days when no articles were written about Tiger Woods still brings up stories about golf tournaments and well-known players." -- bloody amazing, and relatively simple to do.

        And it seems to be scaling very well at the moment.
      • after "putting a layer of indexing on top of things", I can still see the usefulness of allowing some boolean operators (OR and NOT come to mind) in the queries
      • Disclaimer: I work in text retrieval for a living.

        And the claim that vector search is faster than a database search is true, for as long as you can fit all the document vectors into RAM.

        This is not true.

        Recent research [rmit.edu.au] has shown it's quicker to access a compressed inverted file than it is to access an uncompressed inverted file, even if the whole index fits in RAM, provided your compression method is fast on your processor.

        The reason for this counter-intuitive result is entirely due to cache performance. For a vector search, say, you have to consult every document vector, so your working set is the whole index. For an inverted-file-based search, you only need to consult the postings for the terms of interest. Not only is your working set much smaller, it's mostly accessing contiguous blocks of memory in a coherent manner.
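        To illustrate the working-set point, a simplified sketch of an (uncompressed) inverted file in Perl: a query walks only the postings lists for its own terms, and never touches the rest of the index.

          use strict;
          use warnings;

          # Inverted file: term => { doc_id => term_frequency }.
          my %index = (
              civil => { d1 => 2, d4 => 1 },
              war   => { d1 => 3, d2 => 1, d4 => 2 },
              golf  => { d3 => 4 },
          );

          # Score documents by walking only the query terms' postings; documents
          # and terms that don't match are never touched at all.
          sub search {
              my ( $index, @terms ) = @_;
              my %score;
              for my $term (@terms) {
                  my $postings = $index->{$term} or next;
                  $score{$_} += $postings->{$_} for keys %$postings;
              }
              return sort { $score{$b} <=> $score{$a} } keys %score;
          }

          print join( ', ', search( \%index, 'civil', 'war' ) ), "\n";   # d1, d4, d2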

    • I agree. This is not a good system for millions of documents. It's proportional to #terms * #documents, which obviously won't scale. The simple vector space model is fundamentally flawed for ranking in that it is biased towards returning short documents containing your terms. I still think this was a good article though.

      -Kevin

  • If you have visited www.kartoo.com, you would notice that it is based on the same vector concept when searching for data. Of course, it has the 3D GUI that makes it easier to pick what you are looking for. Personally, I can't wait for Google to implement and provide an option to search the same way kartoo.com does; in some instances when I was cross-searching for information, KartOO's interface sped things up. So three cheers for vector-based search engines... (Vector-based mapping is used in voice recognition technologies; many small 2.5G and 3G companies attempting to provide the next must-have mobile voice services are using vector-based mapping for recognition, but of course mapping doesn't do the whole job by itself. ;)
  • I thought of something similar to this when trying to come up with a good spellchecking algorithm back in high school. To provide a simple example, we can start with trying to spellcheck binary strings. We can start off naively, representing each "word" as a point in a two-dimensional vector space over the naturals (N x N). Each word occupies a point in the space, calculated by summing the 1s and 0s in the word. "101000110" would occupy point (5, 4) - five 0s, four 1s. So given a misspelled word (let's say "101001110"), we first calculate that word's position in the vector space (in this case (4, 5)), and do a geographic search for neighboring words.
    Just taking the sums of the different letters in the word for different domains keeps closely-spelled words together, but it also maps other, far less likely words to the same (or close) point in the vector space. For example, "101111001000" and "010000110111" would map to the same point, although it's unlikely that one would be trying to type one and accidentally type the other (unless the typer was dyslexic or something, in which case the spellchecker doesn't really have to care).
    So we need something that gives a bit more information than simply positioning the word in a 2-dimensional vector space. We can notice that similar words often have similar sequences of letters, as well as a similar sum total of letters - i.e., we can count the number of times "0" or "1" occurs immediately after a "0" or a "1". For example, in "10011101" and "10011001": the number of "0"s followed by "0"s is 1 in the first, 2 in the second. The number of "0"s followed by "1"s is 2 in both. The number of "1"s followed by "0"s is 2 in both, and the number of "1"s followed by "1"s is 2 in the first, 1 in the second. So the words would map to points (1, 2, 2, 2) and (2, 2, 2, 1) respectively, in the 4-dimensional vector space over the naturals (N x N x N x N). The distance between these two points is sqrt(2), which is pretty close.
    We can combine the first approach and the second, and for each misspelled word, do a search for other words in both the 2D space and the 4D space, and allow all words within a certain distance threshold of the misspelled word in both spaces. Potentially you could extend the spaces so that you consider 3 consecutive letters (an 8D space for binary strings), but as the dimensions get higher, you have to put more work into optimizing the searches. If we move from binary strings to actual English words, it's not a 2D and 4D space anymore, but a 26D and 676D space. But it's possible to come up with some nice sparse data structures that keep searches reasonably efficient in both space and time.
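    For anyone who wants to play with this, a quick sketch of the bigram-count mapping for binary strings and the distance check (plain hashes, nothing clever; the numbers match the worked example above):

      use strict;
      use warnings;

      # Map a binary string to its bigram counts: "00", "01", "10", "11".
      sub bigram_vector {
          my ($word) = @_;
          my %v;
          $v{ substr( $word, $_, 2 ) }++ for 0 .. length($word) - 2;
          return \%v;
      }

      # Euclidean distance between two bigram-count vectors.
      sub distance {
          my ( $x, $y ) = @_;
          my %keys = map { $_ => 1 } ( keys %$x, keys %$y );
          my $sum  = 0;
          for my $k ( keys %keys ) {
              my $d = ( $x->{$k} || 0 ) - ( $y->{$k} || 0 );
              $sum += $d * $d;
          }
          return sqrt $sum;
      }

      printf "%.3f\n", distance( bigram_vector('10011101'), bigram_vector('10011001') );
      # prints 1.414, i.e. sqrt(2), matching the worked example above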

    I'm not exactly sure why I'm posting this.. but I blazed last night and the shit was good.. but my coat got stolen.. and I've got cobwebs this morning.. and I tend to think of CS things a lot when I have cobwebs in my mind after a night of battling the green ninja.

    -Laxitive
    • This post was so long I had it linked on my desktop for a few days. I finally got round to reading it.

      I can't work out if you're trolling or if you were seriously thinking about this. There are absolutely massive flaws in the theory on this, but I gave it more thought myself and I think you're on to a great idea.

      Binary is not the way to go with words, though. After all, the letter A is always 01000001 - how does that connect it to any other letter that has two 1s in its binary pattern (say B, which is 01000010)?

      I think you need to use a multidimensional arrangement, and somehow use numbers that better represent the word - perhaps common letter combinations or such.

      Still, I'm too busy with work to think about this in any depth, but it sounds like there may be some mileage in your rough idea.

"What man has done, man can aspire to do." -- Jerry Pournelle, about space flight

Working...