
Learning About Full-text Search

An anonymous reader writes "Tim Bray, who's known for XML and has been /.'ed once or twice for that kind of stuff, actually seems to be a search geek and has been writing this endless series of essays on search technology since summer. He says he's finished now - it's like a textbook on searching."
  • Salute (Score:2, Funny)

    by grub ( 11606 )

    ..and has been /.'ed once or twice..

    You mean two or three times now.
    • poor guy (Score:5, Informative)

      by understyled ( 714291 ) on Thursday December 18, 2003 @10:27AM (#7753656) Homepage Journal
      i cringe at the bandwidth demands a slashdotting can bring with it. here's google's cache of the page [google.com].
      • Re:poor guy (Score:5, Insightful)

        by martingunnarsson ( 590268 ) <martin&snarl-up,com> on Thursday December 18, 2003 @10:52AM (#7753870) Homepage
        If Google can cache pages and put them online, so should Slashdot. People say copyright issues would be a problem, but in that case, why is Google's online cache any better?
        • Google is a search engine used and respected by virtually everyone. Slashdot is, well, Slashdot.

          Also, I believe that Google respects instructions in the robots.txt not to cache their page.
          • So could Slashdot. Hehe, it's actually kind of funny! The webmaster's choices would be:
            1) Allow Slashdot to cache the site
            2) Get the site slashdotted back to the stone age

            Nothing wrong with some mafia methods every now and then!
            • I think the main problem is that the guys who run slashdot would probably need to get permission beforehand to cache the linked page, and it would take too much time out of their day to email back and forth to every linked site. Sure, J Random Hacker wouldn't mind being cached, but CNN, News.com.com.com.com.com, and the New York Times just might. And they would have enough bandwidth to handle the Slashdotting.
              • Too much out of their day? Out of the 15 sites they link every day, they can't be bothered with asking because of *time constraints*?!

                Apparently it's more acceptable to them to knowingly blow sites out of the water (they even joked about it in this post) than to spend the time to fire off an email. The fact is, they don't even want to try.
              • Google isn't asking for permission. Again, Slashdot could obey the rules in robots.txt.
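
                (For what it's worth, obeying robots.txt before mirroring a page is only a few lines of code. A minimal sketch in Python - the hostname and bot name are invented, and note that robots.txt governs fetching, while Google's no-cache opt-out is actually the NOARCHIVE meta tag.)

                    from urllib.robotparser import RobotFileParser

                    # Check the target site's robots.txt before fetching a page to mirror.
                    robots = RobotFileParser()
                    robots.set_url("http://www.example.com/robots.txt")
                    robots.read()

                    page = "http://www.example.com/ongoing/OnSearchTOC"
                    if robots.can_fetch("SlashdotCacheBot", page):
                        print("allowed to fetch and mirror", page)
                    else:
                        print("robots.txt says hands off", page)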
                • Re:poor guy (Score:2, Insightful)

                  by ihummel ( 154369 )
                  Google is Google and Slashdot is Slashdot.

                  But even if the issue of liability were taken off the table, they would still have to get off of their metaphorical butts and set up a caching system. I don't know if there is any usable open-source system currently in existence, but if not, they would either have to code it themselves or adapt something already out there that doesn't serve their needs. Disk space isn't really an issue, as the commenting system takes a lot more space than the cache would (assuming
        • Re:poor guy (Score:5, Informative)

          by Arslan ibn Da'ud ( 636514 ) on Thursday December 18, 2003 @11:33AM (#7754287) Homepage
          • Re:poor guy (Score:4, Insightful)

            by davew2040 ( 300953 ) on Thursday December 18, 2003 @11:43AM (#7754390) Journal
            And they considered incorrectly.
            • Re:poor guy (Score:3, Informative)

              I don't know about that. There seem to be too many problems associated with caching. One that comes to my mind is the extra bandwidth that they would have to worry about. An article [alistapart.com] about the design of the site mentions that just changing over to CSS made a grand savings of 3-14 GB a day, equalling something like $3,600.00 in the end. Now that's just by cutting 2-9KB off every page request. Now, think about them serving (possibly) huge pages from other sites that may not optimize their code... That's a lot o
              • Re:poor guy (Score:2, Funny)

                by davew2040 ( 300953 )
                Well then, I guess slashdot would learn firsthand about the slashdot effect!
                • Umm, I think they already do, when you consider the number of people who come here. Remember how often people say RTFA? If only those who view the articles cause "the slashdot effect", imagine how much traffic Slashdot already gets.
          • Their concern is that commercial sites will feel cheated out of ad revenue. But this problem is trivial to avoid: Don't cache pages initially, but have a system for caching them quickly if the webmaster asks. The stories wouldn't be delayed, but when they are accepted, a notification would be sent and a copy made. When the webmaster asks to be relieved, the links in the story would be changed to the cache.
        • by Anonymous Coward
          it's geared for public consumption,
          such is the nature of websites,
          so as long as you don't pretend you wrote it,
          it's abundantly clear where the original came from,
          go ahead and mirror (by mirror i mean take a snapshot).

          only if a copyright holder says don't do that should you remove it.
      • Re:poor guy (Score:5, Informative)

        by johnteslade ( 182250 ) on Thursday December 18, 2003 @12:33PM (#7754909)
        The site is still slashdotted. Each of his papers is on a separate page, so here are the Google caches of the individual papers:

        I have to write some crap at the end here so i can get past the "Your comment has too few characters per line" error message.

    • Re:Salute (Score:5, Interesting)

      by antarctican ( 301636 ) on Thursday December 18, 2003 @01:16PM (#7755355) Homepage
      ..and has been /.'ed once or twice..

      You mean two or three times now.


      And it's my poor server that has to bear the burden.... but so far it's held up fairly well each time. Pretty good for a celeron 1.7GHz w/ 256M. :)

      However this time was particularly bad because of it being a series of essays. I just increased the number of instances of Apache by 66% and doubled the number of requests before a child dies. That seems to have brought some responsiveness back.

      Funny thing is I didn't even know he was /.'ed until he emailed me. I went to check my email (via pine) and the console was as responsive as usual.

      For the geeks who enjoy technical detail... it's running on an Inspire cube PC, one of those little cubes with a mini-ATX in it. Shows you don't need a lot of horsepower to serve static content. :)
  • by Savatte ( 111615 ) on Thursday December 18, 2003 @10:18AM (#7753571) Homepage Journal
    He writes about searching technology, but you can't easily search through his writings.
    • by Dreadlord ( 671979 ) on Thursday December 18, 2003 @10:23AM (#7753615) Journal
      Too bad his pages are [w3.org] valid XHTML documents; it would have made an excellent +5 funny comment :(
    • Re:web page irony (Score:3, Interesting)

      by arrogance ( 590092 )
      Well, especially when it's been slashdotted. Here's a google cache hit to part of his writings [google.ca].

      I agree that it doesn't look to be easy to search around, at least when all you have is a URL to go on (http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC) and Google to find reachable material. I'm also not too sure about using dates as folder names, but that's just a personal thing: I think Tim Berners-Lee recommended it at one point in an article, "Cool URIs don't change" [w3.org]. He does recommend using
    • Re:web page irony (Score:5, Informative)

      by Schwarzchild ( 225794 ) on Thursday December 18, 2003 @10:45AM (#7753822)
      He writes about searching technology, but you can't easily search through his writings.

      Really? How about search site:tbray.org [google.com]?

  • by arvindn ( 542080 ) on Thursday December 18, 2003 @10:19AM (#7753582) Homepage Journal
    ...has been writing this endless series of essays on search technology since summer. He says he's finished now...

    Finished an endless series?

  • by peter303 ( 12292 )
    Try Knuth Vol 3.
    • by Anonymous Coward on Thursday December 18, 2003 @11:44AM (#7754396)
      Try reading the articles/essays. Knuth's vol 3 is about comparison search, not full-text search.
    • Maybe search technology has changed a lot since Knuth's day. If one cursorily glances through the last couple of journals on information search and retrieval, one cannot help noticing the heavy influence of PageRank (Google's own technology). Thankfully the algorithm is well known. On the flip side, critics have often asked whether such algorithms should be published. The bloggers have demonstrated that even Google rankings can be rigged... Personally, I would choose the open-architecture philosophy, due to parallels with th
      • by Anonymous Coward
        You have that backwards. PageRank was heavily influenced by other systems, like Harvest. And full-text search has changed very little since Knuth. For instance, the basic exact string matching algorithms haven't advanced at all.
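
        (For the curious: the kind of exact matching meant here is the sort of thing covered by Knuth's own Knuth-Morris-Pratt algorithm, essentially unchanged since the 1970s. A rough, purely illustrative Python sketch:)

            def build_failure(pattern):
                # failure[i] = length of the longest proper prefix of pattern[:i+1]
                # that is also a suffix of it
                failure = [0] * len(pattern)
                k = 0
                for i in range(1, len(pattern)):
                    while k > 0 and pattern[i] != pattern[k]:
                        k = failure[k - 1]
                    if pattern[i] == pattern[k]:
                        k += 1
                    failure[i] = k
                return failure

            def kmp_search(text, pattern):
                # Return the index of the first occurrence of pattern in text, or -1.
                if not pattern:
                    return 0
                failure = build_failure(pattern)
                k = 0
                for i, ch in enumerate(text):
                    while k > 0 and ch != pattern[k]:
                        k = failure[k - 1]
                    if ch == pattern[k]:
                        k += 1
                    if k == len(pattern):
                        return i - k + 1
                return -1

            print(kmp_search("full-text search technology", "search"))  # -> 10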
  • by clifgriffin ( 676199 ) on Thursday December 18, 2003 @10:34AM (#7753710) Homepage
    Though, I'm unaware of how to apply this to my life. I think I'll take it and put it in the "Unaware of How to Apply This to My Life" Stack with The Simpsons and The Internet.


    • I'm unaware of how to apply this to my life. I think I'll take it and put it in the "Unaware of How to Apply This to My Life" Stack with The Simpsons and The Internet

      But what if your stack grows big and you need to search through the stack?
  • Anti-XML (Score:5, Interesting)

    by MattRog ( 527508 ) on Thursday December 18, 2003 @10:38AM (#7753753)
    Whether there's going to be a lot of XML around in repositories to search. XML these days is more used in interchange rather than archival applications.


    Why the fascination with XML? Well, I certainly know the reason why *Tim* is fascinated, but I want to know why he's seriously contemplating reinventing the wheel - namely using XML as data storage when we already have gobs and gobs of systems (think SQL DBMS products) that do this in a much faster, more compact, safer, better way.

    Also, most SQL DBMS (Oracle, Sybase ASE, MS SQL, etc.) come with full-text indexing built in, so all it would take would be to chop up HTML pages and stick them in the DBMS, then you can perform rich-text queries [sybase.com] on them with minimal effort.
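
    (As a toy illustration of the "let the DBMS do the full-text work" idea, here is a hedged Python sketch using SQLite's FTS5 module, simply because it is self-contained; a commercial DBMS would use its own syntax (CONTAINS, MATCH ... AGAINST, and so on), and the sample rows are invented.)

        import sqlite3

        # Assumes a SQLite build with the FTS5 extension compiled in (most are).
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE VIRTUAL TABLE pages USING fts5(url, body)")
        conn.executemany(
            "INSERT INTO pages (url, body) VALUES (?, ?)",
            [
                ("/ongoing/search-1", "full-text search with inverted indexes"),
                ("/ongoing/xml-1", "XML as an interchange format"),
            ],
        )
        for (url,) in conn.execute(
            "SELECT url FROM pages WHERE pages MATCH ? ORDER BY rank", ("search",)
        ):
            print(url)  # -> /ongoing/search-1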
    • I thought Longhorn was going to use some sort of XML file system? Or at least there were thoughts about it?
    • namely using XML as data storage when we already have gobs and gobs of systems (think SQL DBMS products) that do this in a much faster, more compact, safer, better way.

      I'm with ya there buddy.. If it wasn't for a corporate buyout, my OS/2 box with REXX scripts would still be ftp'ing files (I was really hoping for 10 years - but I've been gone for 3 now).

      Now they'll do it in some xxx.Net, because it's all new and cool. Whatever, at least my stuff was readable with 'edit'.

    • Re:Anti-XML (Score:5, Informative)

      by phurley ( 65499 ) on Thursday December 18, 2003 @10:55AM (#7753891) Homepage
      I agree to a point, but if we are talking about a mixed environment where you are using Oracle, I am using DB2, our friend Bob has his data in a legacy ISAM setup, and a customer wants to integrate a search system across the three systems, they are going to have to write a lot of custom glue.

      If an XML aspect of the data is available from each of us (you can still keep it all in Oracle - just provide a "view" of it in XML), common search tools and methods can be utilized.
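
      (A rough sketch of that "XML view over whatever store you already have" idea, in Python; the table and element names are made up for illustration.)

          import sqlite3
          import xml.etree.ElementTree as ET

          # Keep the data wherever it already lives (here: a throwaway SQLite table)
          # and emit a common XML view of it for shared search tools to consume.
          conn = sqlite3.connect(":memory:")
          conn.execute("CREATE TABLE docs (id INTEGER, title TEXT, body TEXT)")
          conn.execute("INSERT INTO docs VALUES (1, 'On Search', 'full-text search notes')")

          root = ET.Element("documents")
          for doc_id, title, body in conn.execute("SELECT id, title, body FROM docs"):
              doc = ET.SubElement(root, "document", id=str(doc_id))
              ET.SubElement(doc, "title").text = title
              ET.SubElement(doc, "body").text = body

          print(ET.tostring(root, encoding="unicode"))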
      • I don't think writing your own DBMS engine (with query, data management, concurrency, etc. support) is going to be 'less' work than simply either ensuring that your SQL works with different vendors or writing small data pieces to talk to a number of DBMS products.

        You could, of course, bundle an existing DBMS product into the application which would remove the limitation of being forced to use the customer's DBMS product.
    • Re:Anti-XML (Score:4, Interesting)

      by arrogance ( 590092 ) on Thursday December 18, 2003 @11:02AM (#7753939)
      He even goes so far as to mention that Index Server will search your website, but fails to mention that it does full-text searching on your entire file system.

      Unfortunately his site (http://www.tbray.org/ongoing/) seems to be sufficiently disorganized that I have trouble finding out what his real points are, or whether he's addressed all of the issues: for example, I saw no mention of the Semantic Web [w3.org] if his concern is searchability on web documents.

      As a side note, MS SQL is going more and more toward XML, as is the whole .NET framework. This results in richer (read: fatter) data but it does mean that you can store whatever metadata you want along with it.
    • I don't really get the advantages of XML Data storage either, but when it comes to emitting data in a generic, interoperable, self-describing format, XML works quite nicely, even if it is a tad verbose.

      Which (slightly OT) reminds me: has anyone here used an XML compression tool, that they'd like to share opinions on? I've looked at XMLPPM briefly but not worked with it yet. Any others?

      • "Which (slightly OT) reminds me: has anyone here used an XML compression tool"

        I've looked at a few, but frankly, haven't seen the point. Several generic compression types (e.g. zip) are based on finding sequences in the data (e.g. "<SomeTagName") that are repeated, and hence they do very well with XML. I had some really big XML doc that whatever zip compression lib I was using for other stuff, with default options got down to ~15%, while some XML-specific compressor, after a bit of configuration boug
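
        (A quick way to sanity-check that claim yourself - a small Python sketch compressing an invented, highly repetitive XML document with plain zlib:)

            import zlib

            # Generic DEFLATE already exploits the repeated tag names in XML.
            xml_doc = ("<records>"
                       + "<record><name>example</name><value>42</value></record>" * 1000
                       + "</records>").encode("utf-8")

            compressed = zlib.compress(xml_doc, 9)
            print(len(xml_doc), len(compressed),
                  round(len(compressed) / len(xml_doc), 3))  # ratio is tiny for this artificial doc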
    • Re:Anti-XML (Score:4, Insightful)

      by anomalous cohort ( 704239 ) on Thursday December 18, 2003 @11:17AM (#7754120) Homepage Journal

      From the google cache...

      searching for words isn't really what you want to do. You'd like to search for ideas, for concepts, for solutions, for answers.

      That is why he is going in an XML direction. The relational approach is rectilinear and requires that the information be framed in a highly normalized fashion. Generalized semantic searching is highly non-normalized because, well, humans are highly non-normalized.

      I think that he should look at some work by a different Tim, the Semantic Web [w3c.org].

      • The relational approach is rectilinear and requires that the information be framed in a highly normalized fashion. Generalized semantic searching is highly non-normalized ...

        That makes absolutely no sense.
        • Hmmm, perhaps a visit to a dictionary [dictionary.com] is in order. Once you read the definitions for rectilinear and normalized, I think you'll find the sense of the post.

          This is a sound strategy any time you run into a message that makes no sense. Simply look up the definitions of the words that you don't know.

          • It doesn't make any sense because it's meaningless. Try and provide reasoning why you think this sort of information can't be modeled relationally.
            • Because you'd (for example) have to provide a relational model for the semantics of the English language. And even that wouldn't meet the criterion of "generalized", because, ehm, it's specialized for the English language.

            • The problem isn't that the information can't be modeled in a relational manner, you could easily use a relational database for your data store.

              The problem is retrieving information to index. You pull information from existing data sources that have never heard of your data model and don't care. XML provides a simple way to map your existing content to some standard design that you come up with. That's the "normalization" step, and one of the harder parts of indexing.
      • Re:Anti-XML (Score:3, Insightful)

        by gorilla ( 36491 )
        Call me stupid if you like, but I don't see how the representation of the data helps to search for ideas, concepts, etc. Regardless of how the text is stored, unless you have a human do a lot of markup on the text, you're going to have a problem in extracting the ideas from the text. And by markup I don't mean <heading>Heading</heading>, I mean someone entering what the ideas, concepts, etc. are for each part of the text - which can be done equally easily in a traditional database as in an XML docum
        • You can use stemmers, term frequencies and relative location in a document to provide some general gist of what a document is about. The whole point of creating advanced information retrieval tools is to make information processing a more automated task.
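
          (A toy illustration of the term-frequency half of that idea; the crude suffix-stripping "stemmer" here is a stand-in for a real algorithm like Porter's.)

              import re
              from collections import Counter

              def stem(word):
                  # Extremely naive stemming, for illustration only.
                  for suffix in ("ing", "ed", "es", "s"):
                      if word.endswith(suffix) and len(word) > len(suffix) + 2:
                          return word[: -len(suffix)]
                  return word

              def term_frequencies(text):
                  words = re.findall(r"[a-z]+", text.lower())
                  return Counter(stem(w) for w in words)

              print(term_frequencies("Searching searches the searched search logs").most_common(2))
              # -> [('search', 4), ('the', 1)]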
          • Yep, but what difference does it make if the text is stored in XML or in a database?
            • The XML part comes in when you are extracting content from an existing data store. You can use a relational database for a backend store, but when you're going through the step of mapping existing content to the info your indexing engine wants (the normalization step), XML is very handy.
        • True, the problem is that HTML became such a beast mixing semantic markup with visual markup that it is really hard to find well-marked up documents.

          Still, while it is possible to convert any form of data into a relational database, does that mean that the relational database is the best fit for all types of data? One of the things that XML does well but relational databases don't do well (without a lot of violent shuffling around) is arbitrary parent-child relationships. So for example, a typical paper
    • SQL DBs might come with full-text indexing, but the power of information retrieval really comes into play when you can start clustering, using stemmers to find people/places, etc. DB full-text indexing feels more like a feature checkbox than a real information retrieval system.

      XML can be useful because you can take data from disparate sources (an Exchange server, SQL db, etc.) and normalize the meta data (the document author, date the document was created, etc.).

      I agree there's an overwhelming "silver-bul
    • Re:Anti-XML (Score:3, Interesting)

      by I8TheWorm ( 645702 )
      I tend to get on an XML soap (no pun intended) box when I see articles about it, so here goes...

      XML is great for sharing data between non-congruous systems. It's horrible, however, for storing data in any large quantity, and even more horrible for treating as a searchable text file. It's inherently large and full of ASCII/ANSI/UTF characters that are completely unnecessary when performing byte-by-byte text searches. For large amounts of data, you're right... RDBMS is the current way to go... maybe OODB
      • XML is almost ideal for storing structured text in large quantities. Storing non-textual data, not so much. (This is one reason why XML gets a bad reputation for data representation; people are using it for tasks which are not textual markup-related.) For byte-by-byte searching... true enough, it sucks for that. But surely if you have text in large quantities, you're hardly going to search it using "grep". That would be insane whether it's stored in XML or plain text.

    • Re:Anti-XML (Score:3, Interesting)

      by DrVomact ( 726065 )

      The reason why XML is widely used today for a multitude of purposes (e.g., data interchange between otherwise incompatible systems, configuration files, technical documents, command protocols that communicate with servers, etc. etc.) and why it will be used for even more stuff in the future is that it is centered on a very simple and powerful idea: self-documenting data. That is, the data is structured by internal markers that give information about the type of information contained in each logical element

      • I have no idea whether the databases of the future will store their data in XML form or not

        Not likely. XML is designed to solve the data identification problem, not the data storage problem.

        Due to the hierarchical nature of XML, a validating parser must read the entire document before returning any results. Given the way that most parsers are designed, the entire document will be read into memory and first parsed, then validated. Which, of course, limits the size of your database to the machine's m
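
        (To be fair, the everything-in-memory behavior belongs to DOM-style parsing rather than to XML itself; a streaming, non-validating parse keeps memory flat. A small Python sketch with ElementTree's iterparse over an invented document:)

            import io
            import xml.etree.ElementTree as ET

            data = io.BytesIO(b"<docs>" + b"<doc>some text</doc>" * 1000 + b"</docs>")

            count = 0
            for event, elem in ET.iterparse(data, events=("end",)):
                if elem.tag == "doc":
                    count += 1
                    elem.clear()  # discard the subtree we just processed
            print(count)  # -> 1000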

    • I didn't get the impression from the article that he was considering XML as data storage. I saw the point as being that we don't know how much XML a search system will have to process. If your data consists of a large number of OpenOffice, DocBook, XHTML or Framemaker documents, then it might just be easier to keep things in XML rather than to split the data apart into a bunch of atomic chunks.

      I love using RDBMSes, but for some applications, creating a normalized database is a pain in the rear. Bibliograp
    • Relational databases and full-text indexing are a poor fit once you have a lot of text to store. Yes, I know. Most SQL DBMS come with full-text indexing. That's not enough. Read on for the reason why.

      Think about how a relational DBMS works. Internally, the major data structure is the "stream of tuples". A tuple is a virtual record which is made up of a number of fields, each of which has data in it.

      When you search, you get back a stream of tuples, which is usually some projection of the record store
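
      (For contrast, here is the data structure full-text engines typically build instead of scanning tuple streams: a minimal inverted index, sketched in Python with invented documents.)

          from collections import defaultdict

          documents = {
              1: "full text search over large document collections",
              2: "relational engines stream tuples through query plans",
          }

          # word -> set of document IDs containing it
          index = defaultdict(set)
          for doc_id, text in documents.items():
              for word in text.lower().split():
                  index[word].add(doc_id)

          print(sorted(index["search"]))  # -> [1]
          print(sorted(index["tuples"]))  # -> [2]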

  • Everything beyond the TOC (which I loaded onto my browser) is slashdotted. The problem with the links to the different articles is that it's not part of a tree hierarchy; I can't just say "wget all pages beyond point X", nor can I make a guess and do a regex download of all URLs with "search" in them, because some articles [tbray.org] do not conform to that pattern.

    A tarball for offline browsing would be nice; didn't see it on the page, though. Save you part of a slashdotting, Tim... how about it? :)

  • by Pathetic Coward ( 33033 ) on Thursday December 18, 2003 @10:46AM (#7753829)
    Search technology. Hmmm. Wasn't that outsourced to India last month? Or was that last year? I just can't keep up with IT today.
  • by leoaugust ( 665240 ) <leoaugust@[ ]il.com ['gma' in gap]> on Thursday December 18, 2003 @10:46AM (#7753832) Journal

    I plan to conclude with a description of the next search engine, which doesn't exist yet but someone ought to start building.

    "Someone" ought start building ... I wonder why this someone isn't Tim Bray. He is one of the most well known names in XML, has experience under his belt with another Search Engine Project Antartica [antarctica.net] .....

    I just mean it in the sense that if he is having trouble getting his own ideas off the ground, what a challenge it will be for someone else to do so.

    Mr. Bray should get the thing going like Linus did, and call in help from the Open Source Community. If he is waiting for someone with moneybags to take the bait and bring him onto the project as a highly paid consultant, maybe the approach needs to be modified.

    Go Open Source Tim ... and get the ball rolling. You will be surprised how much help you will get ...

    • Go Open Source Tim ... and get the ball rolling. You will be surprised how much help you will get ...


      I thought that was just a myth [slashdot.org]?
    • From the article (On Search: Backgrounder), on using Open Source tools:

      Each of the ones I've looked at has a problem (lightly/poorly maintained, scalability problems, lack of internationalization, awkward API).

      Good luck convincing him to go Open Source!

    • "This is the last in my series of On Search essays. I've written these pieces because I care about search and because the lessons of experience are worth writing down; but also because I'd like to change this part of the world. In short, I'd like to arrange for basically every serious computer in the world to come with fast, efficient, easy-to-manage search software that Just Works. This essay is about what that software should look like. Early next year I'll write something on how it might get built.

      Nami

      • Sure we did RTFA. Can you Read Between The F* Lines (RBTFL)?

        Here is what Tim says:

        This essay is about what that software should look like. Early next year I'll write something on how it MIGHT get built.

        So BRF is going to be open-source.

        I plan to conclude with a description of the next search engine, which doesn't exist yet but someone ought to start building.

        And if the following is not Consultant-Speak I don't know what is - Consultants are great at telling you why you should not be doing what

    • Go Open Source Tim ... and get the ball rolling.
      The ball is already rolling. Check out Lucene [apache.org] or Nutch [nutch.org]. Either of these could be enhanced to support Tim's ideas. Volunteers? (I'm already working on it.)
    • Tim Bray was one of the founders of Open Text Corporation ... they INVENTED the search engine.
      Digital (with whom they were working) "stole" the idea and opened AltaVista 3 months before their IPO.
      I worked for Open Text for a year, but after Tim left (just about the time the 1.0 draft of the XML spec appeared).
  • here [216.239.41.104]
  • by Glass of Water ( 537481 ) on Thursday December 18, 2003 @11:32AM (#7754280) Journal
    But there is some good stuff out there; for example, Slashdot's search engine seems to run smooth, clean, and fast, but some poking around failed to reveal what it is: I wouldn't be surprised if it's just the MySQL search facility.
    Anybody know the answer to this one?
  • by JPMH ( 100614 ) on Thursday December 18, 2003 @11:48AM (#7754442)
    An interesting counterpoint to this story in the Register today:

    "A Quantum Theory of Internet Value" [theregister.co.uk] by Andrew Orlowski
    -- why librarians are better at finding the book you want than Google.

  • Mirror (Score:5, Informative)

    by Door-opening Fascist ( 534466 ) <skylar@cs.earlham.edu> on Thursday December 18, 2003 @12:11PM (#7754675) Homepage
    Since the site looks bogged down from the /.'ing, I've made a few mirrors:

    Mirror #1 [earlham.edu]

    Mirror #2 [earlham.edu]

    Mirror #3 [dhs.org]

    • Excellent work. Not sure how you were able to circumnavigate the /. takedown.
      One problem, however: It's just the front page. The meat of the information is still hiding on his server.
      • One problem, however: It's just the front page. The meat of the information is still hiding on his server.
        It was originally just the front page. I decided to get that up fast to get the load off the original server. I've just updated the mirror with the important links, but those took a little longer to fetch.
  • Also, Google claims that links from pages that themselves have a lot of incoming links count for more, but I'm not actually sure they'd need to do that to get the results they do.
    Is Google's PageRank algorithm really that mysterious? I know they fiddle with it in secret ways now and then to discourage abuse, but I heard the fundamental algorithm was basically pretty simple: something like finding the eigenvectors and eigenvalues of the matrix of links. (Not sure exactly what they do with these -- associat
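
    (The textbook version really is that simple: repeated multiplication by the damped link matrix converges on its principal eigenvector. A rough Python sketch over an invented three-page link graph - purely illustrative, not Google's actual implementation:)

        links = {            # page -> pages it links to
            "a": ["b", "c"],
            "b": ["c"],
            "c": ["a"],
        }
        damping = 0.85
        rank = {page: 1.0 / len(links) for page in links}

        for _ in range(50):  # power iteration
            new_rank = {page: (1 - damping) / len(links) for page in links}
            for page, outgoing in links.items():
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            rank = new_rank

        print({page: round(score, 3) for page, score in rank.items()})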
    • (Replying to myself): Here [iprcom.com] is a site that claims to explain Google page rank completely. I found it by doing a Google search on 'google "page rank"', and I assume it's pretty authoritative, because it had the highest page rank :-)
    • google broken [google-watch.org]? (www.google-watch.org)

      "... unique ID for each page stored as ansi c, 4 bytes on Linux system (~4yo) gives theoretical limit of 4.2 billion pages. ..."

      discusses the move to 5 bytes and suggests how this move may be the cause of weird search results on Google searches this year - of course the other reason may be Google foiling search queue jumpers [webmasterworld.com].
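
      (The 4.2 billion figure is just the range of an unsigned 32-bit ID; quick arithmetic:)

          print(2 ** 32)  # 4294967296 pages with a 4-byte ID
          print(2 ** 40)  # ~1.1 trillion with a 5-byte ID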

  • by Anonymous Coward
    It just has a new name, and it's being developed by librarians.
    http://www.dlxs.org/products/xpat.html
  • I've been looking all over...
  • More search-related functions should be available in PHP and Perl, and built into them... MySQL too...
  • In the post-Google world, UIs like the General Interface [geninterface.com] will appear. Check out their demo at Integrated Web Services [geninterface.com]. And no, I don't work there; I just like the direction they are going in.
    • If they're so general, how come I get this when I try to view the sample apps?

      Sample Applications

      General Interface Objects currently supports Internet Explorer 5.5 and later browsers running on Windows. For access to the sample applications please use another browser.

      I guess "general" means "IE only".

      Sean

  • From the site: "It has fifteen instalments not including this table of contents."

    Last I searched the dictionary, it was "installments."

    I guess alphabetical searching is best after all.
