Learning About Full-text Search (140 comments)

Posted by michael on Thursday December 18, 2003 @10:15AM from the looks-easy-but-isn't dept.

An anonymous reader writes "Tim Bray, who's known for XML and has been /.'ed once or twice for that kind of stuff, actually seems to be a search geek and has been writing this endless series of essays on search technology since summer. He says he's finished now - it's like a textbook on searching."

This discussion has been archived. No new comments can be posted.

Salute (Score:2, Funny)

by grub ( 11606 ) writes:

..and has been /.'ed once or twice..

You mean two or three times now.
- poor guy (Score:5, Informative)
  
  by understyled ( 714291 ) writes: on Thursday December 18, 2003 @10:27AM (#7753656) Homepage Journal
  
  i cringe at the bandwidth demands a slashdotting can bring with it. here's google's cache of the page [google.com].
  
  - Re:poor guy (Score:5, Insightful)
    
    by martingunnarsson ( 590268 ) writes: <martin&snarl-up,com> on Thursday December 18, 2003 @10:52AM (#7753870) Homepage
    
    If Google can cache pages and put them online, so should Slashdot. People say copyright issues would be a problem, but in that case, why is Google's online cache any better?
    
    - Re:poor guy (Score:1)
      
      by ihummel ( 154369 ) writes:
      
      Google is a search engine used and respected by virtually everyone. Slashdot is, well, Slashdot.
      
      Also, I believe that Google respects instructions in the robots.txt not to cache their page.
      - Re:poor guy (Score:2)
        
        by martingunnarsson ( 590268 ) writes:
        
        So could Slashdot. Hehe, it's actually kind of funny! The webmaster's choices would be:
        1) Allow Slashdot to cache the site
        2) Get the site slashdotted back to the Stone Age
        
        Nothing wrong with some mafia methods every now and then!
        
        Re:poor guy (Score:1)
        
        by ihummel ( 154369 ) writes:
        
        I think the main problem is that the guys who run slashdot would probably need to get permission beforehand to cache the linked page, and it would take too much time out of their day to email back and forth to every linked site. Sure, J Random Hacker wouldn't mind being cached, but CNN, News.com.com.com.com.com, and the New York Times just might. And they would have enough bandwidth to handle the Slashdotting.
        
        Re:poor guy (Score:1)
        
        by davew2040 ( 300953 ) writes:
        
        Too much out of their day? Out of the 15 sites they link every day, they can't be bothered with asking because of *time constraints*?!
        
        Apparently it's more acceptable to them to knowingly blow sites out of the water (they even joked about it in this post) than to spend the time to fire off an email. The fact is, they don't even want to try.
        
        Re:poor guy (Score:3, Offtopic)
        
        by martingunnarsson ( 590268 ) writes:
        
        Google isn't asking for permission. Again, Slashdot could obey the rules in robots.txt.
        
        Re:poor guy (Score:2, Insightful)
        
        by ihummel ( 154369 ) writes:
        
        Google is Google and Slashdot is Slashdot.
        
        But even if the issue of liability were taken off the table, they would still have to get off of their metaphorical butts and set up a caching system. I don't know if there is any usable open-source system currently in existence, but if not, they would either have to code it themselves or adapt something already out there that doesn't serve their needs. Disk space isn't really an issue, as the commenting system takes a lot more space than the cache would (assuming
    - Re:poor guy (Score:5, Informative)
      
      by Arslan ibn Da'ud ( 636514 ) writes: on Thursday December 18, 2003 @11:33AM (#7754287) Homepage
      
      Slashdot has already considered this. [slashdot.org] RTFFAQ
      
      - Re:poor guy (Score:4, Insightful)
        
        by davew2040 ( 300953 ) writes: on Thursday December 18, 2003 @11:43AM (#7754390) Journal
        
        And they considered incorrectly.
        
        
        Re:poor guy (Score:3, Informative)
        
        by spectre_240sx ( 720999 ) writes:
        
        I don't know about that. There seem to be too many problems associated with caching. One that comes to my mind is the extra bandwidth that they would have to worry about. An article [alistapart.com] about the design of the site mentions that just changing over to CSS made a grand savings of 3-14 GB a day, equalling something like $3,600.00 in the end. Now that's just by cutting 2-9KB off every page request. Now, think about them serving (possibly) huge pages from other sites that may not optimize their code... That's a lot o
        
        Re:poor guy (Score:2, Funny)
        
        by davew2040 ( 300953 ) writes:
        
        Well then, I guess slashdot would learn firsthand about the slashdot effect!
        
        Re:poor guy (Score:1)
        
        by spectre_240sx ( 720999 ) writes:
        
        Umm, I think they already do when you consider the number of people who come here. Remember how often people say RTFA? If only those who view the articles cause "the slashdot effect", imagine how much traffic slashdot already gets.
      - Re:poor guy (Score:2)
        
        by Nucleon500 ( 628631 ) writes:
        
        Their concern is that commercial sites will feel cheated out of ad revenue. But this problem is trivial to avoid: Don't cache pages initially, but have a system for caching them quickly if the webmaster asks. The stories wouldn't be delayed, but when they are accepted, a notification would be sent and a copy made. When the webmaster asks to be relieved, the links in the story would be changed to the cache.
    - it's geared for public consumption (Score:1, Insightful)
      
      by Anonymous Coward writes:
      
      it's geared for public consumption,
      such is the nature of websites,
      so as long as you don't pretend you wrote it,
      it's abundantly clear where the original came from,
      go ahead and mirror (by mirror i mean take a snapshot).
      
      only if a copyright holder says don't do that should you remove it.
  - Re:poor guy (Score:5, Informative)
    
    by johnteslade ( 182250 ) writes: on Thursday December 18, 2003 @12:33PM (#7754909)
    The site is still slashdotted. Each of his papers is on a separate page, so here are the google caches of the individual papers:
    
    Table of Contents [216.239.59.104]
    
    Chapter 1: Backgrounder [216.239.59.104]
    
    Chapter 2: The Users [216.239.59.104]
    
    Chapter 3: Basic Basics [216.239.59.104]
    
    Chapter 4: Precision and Recall [216.239.59.104]
    
    Chapter 5: Intelligence [216.239.59.104]
    
    Chapter 6: Squirmy Words [216.239.59.104]
    
    Chapter 7: UI Archeology (No Cache)
    
    Chapter 8: Stopwords [216.239.59.104]
    
    Chapter 9: Metadata [216.239.59.104]
    
    Chapter 10: I18n [216.239.59.104]
    
    Chapter 11: Result Ranking [216.239.59.104]
    
    Chapter 12: Interfaces [216.239.59.104]
    
    Chapter 13: XML [216.239.59.104]
    
    Chapter 14: Robots [216.239.59.104]
    
    I have to write some crap at the end here so i can get past the "Your comment has too few characters per line" error message.
- Re:Salute (Score:5, Interesting)
  
  by antarctican ( 301636 ) writes: on Thursday December 18, 2003 @01:16PM (#7755355) Homepage
  
  ..and has been /.'ed once or twice..
  
  You mean two or three times now.
  
  And it's my poor server that has to bear the burden.... but so far it's held up fairly well each time. Pretty good for a celeron 1.7GHz w/ 256M. :)
  
  However this time was particularly bad because of it being a series of essays. I just increased the number of instances of Apache by 66% and doubled the number of requests before a child dies. That seems to have brought some responsiveness back.
  
  Funny thing is I didn't even know he was /.'ed until he emailed me. I went to check my email (via pine) and the console was as responsive as usual.
  
  For the geeks who enjoy technical detail... it's running on an Inspire cube PC, one of those little cubes with a mini-ATX in it. Shows you don't need a lot of horsepower to serve static content. :)
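
Several replies in this thread suggest that a caching service could simply obey robots.txt. As a rough illustration of what that check involves, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot name `ExampleCacheBot`, the helper `may_fetch`, and the example.org URLs are all made up for illustration. It only tests crawl permission and says nothing about how Google or Slashdot actually decide what to cache.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real bot would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.org/ongoing/essay",
        "http://example.org/private/notes",
    ):
        print(url, may_fetch(ROBOTS_TXT, "ExampleCacheBot", url))
```

A cache that ran a check like this before copying a page would skip anything the webmaster had disallowed, which is the opt-out behaviour the thread is arguing about.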
  
web page irony (Score:3, Funny)

by Savatte ( 111615 ) writes: on Thursday December 18, 2003 @10:18AM (#7753571) Homepage Journal

He writes about searching technology, but you can't easily search through his writings.

- Re:web page irony (Score:5, Funny)
  
  by Dreadlord ( 671979 ) writes: on Thursday December 18, 2003 @10:23AM (#7753615) Journal
  
  too bad his pages are [w3.org] valid XHTML documents, it would have made an excellent +5 funny comment :(
  
  - - Re:web page irony (Score:2, Informative)
      
      by Anonymous Coward writes:
      
      they don't, but the parent post is about finding some conflict between the author's pages and articles.
      He's got an article about searching and his pages aren't searchable, and he's got articles about XML, so having non-valid XHTML pages would definitely have been ironic...
      - Re:web page irony (Score:2, Funny)
        
        by Anonymous Coward writes:
        
        Tee hee, I get it now. It reminds me of this time that something didn't happen, but if it had happened, it would have been funny. Ha ha, that still cracks me up. Yes, most amusing.
- Re:web page irony (Score:3, Interesting)
  
  by arrogance ( 590092 ) writes:
  
  Well, especially when it's been slashdotted. Here's a google cache hit to part of his writings [google.ca].
  
  I agree that it doesn't look to be easy to search around, at least when all you have is a URL to go on (http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC) and Google to find reachable material. I'm also not too sure about using dates as folder names but that's just a personal thing: I think Tim Berners-Lee recommended it at one point in an article "Cool URIs don't change" [w3.org]. He does recommend using
- Re:web page irony (Score:5, Informative)
  
  by Schwarzchild ( 225794 ) writes: on Thursday December 18, 2003 @10:45AM (#7753822)
  
  He writes about searching technology, but you can't easily search through his writings.
  Really? How about search site:tbray.org [google.com]?
  
Hold on there (Score:5, Funny)

by arvindn ( 542080 ) writes: on Thursday December 18, 2003 @10:19AM (#7753582) Homepage Journal

...has been writing this endless series of essays on search technology since summer. He says he's finished now...
Finished an endless series?

- Re:Hold on there (Score:5, Funny)
  
  by MooCows ( 718367 ) writes: on Thursday December 18, 2003 @10:22AM (#7753609)
  
  The maximum number of results have been returned.
  
- Re:Hold on there (Score:1)
  
  by Dreadlord ( 671979 ) writes:
  
  Maybe we have just hit +Infinity of time?
  Time flies when we're sitting in front of our comps, reading /.
- Bray's theorem (Score:4, Funny)
  
  by KoolDude ( 614134 ) writes: on Thursday December 18, 2003 @10:42AM (#7753787)
  
  The essay series converges to a textbook as time tends to infinity. Proof is left as an exercise to the reader.
  
- ObHutz (Score:3, Funny)
  
  by sharkey ( 16670 ) writes:
  
  Mr. Simpson, this is the most blatant case of fraudulent advertising since my suit against the film, "The Never-Ending Story".
- Re:Hold on there (Score:2)
  
  by Lozzer ( 141543 ) writes:
  
  In the first three months he wrote a page, in the next month and a half he wrote another page, in the next (scratching of head) three quarters of a month he wrote another page, and so on. Now after six months he has written an endless amount of stuff, simple (yet amazing) really.
- Re:Hold on there (Score:1)
  
  by haystor ( 102186 ) writes:
  
  He wrote them in a circle.
re-inventing the wheel (Score:2, Interesting)

by peter303 ( 12292 ) writes:

Try Knuth Vol 3.
- Re:re-inventing the wheel (Score:4, Insightful)
  
  by Anonymous Coward writes: on Thursday December 18, 2003 @11:44AM (#7754396)
  
  Try reading the articles/essays. Knuth's vol 3 is about comparison search, not full-text search.
  
- Re:re-inventing the wheel (Score:2, Insightful)
  
  by getarun_vr ( 524827 ) writes:
  
  Maybe search technology has changed a lot since Knuth's day. If one cursorily glances through the last coupla journals on Information Search and Retrieval, one cannot help noticing the heavy influence of PageRank (Google's own technology). Thankfully the algorithm is well known. On the flip side, critics have often asked whether such algorithms should be published. The bloggers have demonstrated that even Google rankings can be rigged... Personally, I would choose the open architecture philosophy, due to parallels with th
  - Re:re-inventing the wheel (Score:1, Insightful)
    
    by Anonymous Coward writes:
    
    You have that backwards. PageRank was heavily influenced by other systems, like Harvest. And full-text search has changed very little since Knuth. For instance, the basic exact string matching algorithms haven't advanced at all.
Interesting stuff! (Score:3, Funny)

by clifgriffin ( 676199 ) writes: on Thursday December 18, 2003 @10:34AM (#7753710) Homepage

Though, I'm unaware of how to apply this to my life. I think I'll take it and put it in the "Unaware of How to Apply This to My Life" Stack with The Simpsons and The Internet.

- Re:Interesting stuff! (Score:2, Funny)
  
  by KoolDude ( 614134 ) writes:
  
  I'm unaware of how to apply this to my life. I think I'll take it and put it in the "Unaware of How to Apply This to My Life" Stack with The Simpsons and The Internet
  
  But what if your stack grows big and you need to search through the stack ?
  - - Re:Interesting stuff! (Score:2, Offtopic)
      
      by stoborrobots ( 577882 ) writes:
      
      Actually no... one of the interesting things is that it is far more efficient to "scan" through a stack than to pop it if you're looking for something... (Assuming you have an in-memory stack, which is easily manipulated by memory operations as well as stack ops.)
      
      It breaks the abstraction, but the improvement may actually be worth it sometimes...
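
The scan-versus-pop point at the end of this thread can be sketched as follows. This is only an illustration, assuming an in-memory stack held in a Python list with the top at the end. The function names `find_by_scanning` and `find_by_popping` are invented here and do not come from the thread.

```python
def find_by_scanning(stack, target):
    """Return how far below the top target sits (0 = top), or -1 if absent.

    Looks at the underlying list directly and never modifies the stack.
    """
    for depth, item in enumerate(reversed(stack)):
        if item == target:
            return depth
    return -1


def find_by_popping(stack, target):
    """Same answer using only push and pop, restoring the stack afterwards."""
    popped = []
    depth = -1
    while stack:
        item = stack.pop()
        popped.append(item)
        if item == target:
            depth = len(popped) - 1
            break
    # Put everything back in its original order.
    while popped:
        stack.append(popped.pop())
    return depth


if __name__ == "__main__":
    stack = ["The Simpsons", "The Internet", "Full-text search"]
    print(find_by_scanning(stack, "The Internet"))
    print(find_by_popping(stack, "The Internet"))
```

Both functions return the same depth, but the popping version has to move every examined item twice, which is the extra work the scan avoids at the cost of reaching past the stack interface.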
Anti-XML (Score:5, Interesting)

by MattRog ( 527508 ) writes: on Thursday December 18, 2003 @10:38AM (#7753753)

Whether there's going to be a lot of XML around in repositories to search. XML these days is more used in interchange rather than archival applications.

Why the fascination with XML? Well, I certainly know the reason why *Tim* is fascinated, but I want to know why he's seriously contemplating reinventing the wheel - namely using XML as data storage when we already have gobs and gobs of systems (think SQL DBMS products) that do this in a much faster, more compact, safer, better way.

Also, most SQL DBMS (Oracle, Sybase ASE, MS SQL, etc.) come with full-text indexing built in, so all it would take would be to chop up HTML pages and stick them in the DBMS, then you can perform rich-text queries [sybase.com] on them with minimal effort.
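
To make the "chop up HTML pages and stick them in the DBMS" idea concrete, here is a minimal sketch using Python's built-in `sqlite3` and `html.parser`. It deliberately builds a plain inverted-index table by hand instead of calling any vendor's full-text feature, so it does not show Oracle, Sybase ASE, or MS SQL syntax. The table name `postings`, the function names, and the sample pages are assumptions made up for illustration.

```python
import re
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and discards the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """pages maps a document name to its HTML; returns an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE postings (term TEXT, doc TEXT, PRIMARY KEY (term, doc))"
    )
    for name, html in pages.items():
        terms = set(tokenize(html_to_text(html)))
        conn.executemany(
            "INSERT INTO postings VALUES (?, ?)", [(t, name) for t in terms]
        )
    return conn


def search(conn, query):
    """Return the documents that contain every term in the query."""
    terms = sorted(set(tokenize(query)))
    if not terms:
        return []
    placeholders = ",".join("?" * len(terms))
    rows = conn.execute(
        f"SELECT doc FROM postings WHERE term IN ({placeholders}) "
        "GROUP BY doc HAVING COUNT(*) = ? ORDER BY doc",
        (*terms, len(terms)),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_index({
        "a.html": "<p>Full-text <b>search</b> is hard</p>",
        "b.html": "<p>XML search</p>",
    })
    print(search(conn, "full text"))
```

A built-in full-text index does roughly this bookkeeping for you and adds ranking, stopwords, and phrase queries on top, which is where the products differ.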

- Re:Anti-XML (Score:1)
  
  by SillySnake ( 727102 ) writes:
  
  I thought Longhorn was going to use some sort of XML file system? Or at least there were thoughts about it?
- Re:Anti-XML (Score:2)
  
  by Havokmon ( 89874 ) writes:
  
  namely using XML as data storage when we already have gobs and gobs of systems (think SQL DBMS products) that do this in a much faster, more compact, safer, better way.
  I'm with ya there buddy.. If it wasn't for a corporate buyout, my OS/2 box with REXX scripts would still be ftp'ing files (I was really hoping for 10 years - but I've been gone for 3 now).
  Now they'll do it in some xxx.Net, because it's all new and cool. Whatever, at least my stuff was readable with 'edit'.
- Re:Anti-XML (Score:5, Informative)
  
  by phurley ( 65499 ) writes: on Thursday December 18, 2003 @10:55AM (#7753891) Homepage
  
  I agree to a point, but if we are talking about a mixed environment where you are using Oracle, I am using DB2, our friend Bob has his data in a legacy ISAM setup and a customer wants to integrate a search system across the three systems, they are going to have to write a lot of custom glue.
  
  If an XML aspect of the data is available (you can still keep it all in Oracle - just provide a "view" of it in XML) from each of us - common search tools and methods can be utilized.
  
  - Re:Anti-XML (Score:2)
    
    by MattRog ( 527508 ) writes:
    
    I don't think writing your own DBMS engine (with query, data management, concurrency, etc. support) is going to be 'less' work than simply either ensuring that your SQL works with different vendors or writing small data pieces to talk to a number of DBMS products.
    
    You could, of course, bundle an existing DBMS product into the application which would remove the limitation of being forced to use the customer's DBMS product.
- Re:Anti-XML (Score:4, Interesting)
  
  by arrogance ( 590092 ) writes: on Thursday December 18, 2003 @11:02AM (#7753939)
  
  He even goes so far as to mention that Index Server will search your website: but fails to mention that it does full text searching on your entire file system.
  
  Unfortunately his site (http://www.tbray.org/ongoing/) seems to be sufficiently disorganized that I have trouble finding out what his real points are, or whether he's addressed all of the issues: for example, I saw no mention of the Semantic Web [w3.org] if his concern is searchability on web documents.
  
  As a side note, MS SQL is going more and more toward XML, as is the whole .NET framework. This results in richer (read: fatter) data but it does mean that you can store whatever metadata you want along with it.
  
  - Re:Anti-XML (Score:1)
    
    by radio4fan ( 304271 ) writes:
    
    for example, I saw no mention of the Semantic Web
    
    Try here [tbray.org] on the page about metadata.
- Re:Anti-XML (Score:2)
  
  by Hayzeus ( 596826 ) writes:
  
  I don't really get the advantages of XML Data storage either, but when it comes to emitting data in a generic, interoperable, self-describing format, XML works quite nicely, even if it is a tad verbose.
  Which (slightly OT) reminds me: has anyone here used an XML compression tool, that they'd like to share opinions on? I've looked at XMLPPM briefly but not worked with it yet. Any others?
  - Re:Anti-XML (Score:2)
    
    by 2short ( 466733 ) writes:
    
    "Which (slightly OT) reminds me: has anyone here used an XML compression tool"
    
    I've looked at a few, but frankly, haven't seen the point. Several generic compression types (e.g. zip) are based on finding sequences in the data (e.g. "<SomeTagName") that are repeated, and hence they do very well with XML. I had some really big XML doc that whatever zip compression lib I was using for other stuff, with default options got down to ~15%, while some XML-specific compressor, after a bit of configuration boug
- Re:Anti-XML (Score:4, Insightful)
  
  by anomalous cohort ( 704239 ) writes: on Thursday December 18, 2003 @11:17AM (#7754120) Homepage Journal
  
  From the google cache...
  searching for words isn't really what you want to do. You'd like to search for ideas, for concepts, for solutions, for answers.
  That is why he is going in an XML direction. The relational approach is rectilinear and requires that the information be framed in a highly normalized fashion. Generalized semantic searching is highly non-normalized because, well, humans are highly non-normalized.
  
  I think that he should look at some work by a different Tim, the Semantic Web [w3c.org].
  
  - Re:Anti-XML (Score:1)
    
    by MattRog ( 527508 ) writes:
    
    The relational approach is rectilinear and requires that the information be framed in a highly normalized fashion. Generalized semantic searching is highly non-normalized ...
    
    That makes absolutely no sense.
    - Re:Anti-XML (Score:3, Funny)
      
      by anomalous cohort ( 704239 ) writes:
      
      Hmmm, perhaps a visit to a dictionary [dictionary.com] is in order. Once you read the definitions for rectilinear and normalized, I think you'll find the sense of the post.
      
      This is a sound strategy any time you run into a message that makes no sense. Simply look up the definitions of the words that you don't know.
      - Re:Anti-XML (Score:1)
        
        by MattRog ( 527508 ) writes:
        
        It doesn't make any sense because it's meaningless. Try and provide reasoning why you think this sort of information can't be modeled relationally.
        
        Re:Anti-XML (Score:2)
        
        by platypus ( 18156 ) writes:
        
        Because you'd (for example) have to provide a relational model for the semantics of the English language. And even that wouldn't meet the criterion of "generalized", because, ehm, it's specialized for the English language.
        
        Re:Anti-XML (Score:2)
        
        by platypus ( 18156 ) writes:
        
        Err, missed some context in this thread, clearly, XML as opposed to relational wouldn't help here either.
        
        Re:Anti-XML (Score:1)
        
        by cthrall ( 19889 ) writes:
        
        The problem isn't that the information can't be modeled in a relational manner, you could easily use a relational database for your data store.
        
        The problem is retrieving information to index. You pull information from existing data sources that have never heard of your data model and don't care. XML provides a simple way to map your existing content to some standard design that you come up with. That's the "normalization" step, and one of the harder parts of indexing.
  - Re:Anti-XML (Score:3, Insightful)
    
    by gorilla ( 36491 ) writes:
    
    Call me stupid if you like, but I don't see how the representation of the data helps to search for ideas, concepts, etc. Regardless of how the text is stored, unless you have a human do a lot of markup on the text, you're going to have a problem extracting the ideas from the text. And by markup I don't mean <heading>Heading</heading>; I mean someone entering what the ideas, concepts, etc. are for each part of the text - which can be done equally easily in a traditional database as in a XML docum
    - Re:Anti-XML (Score:1)
      
      by cthrall ( 19889 ) writes:
      
      You can use stemmers, term frequencies and relative location in a document to provide some general gist of what a document is about. The whole point of creating advanced information retrieval tools is to make information processing a more automated task.
      - Re:Anti-XML (Score:2)
        
        by gorilla ( 36491 ) writes:
        
        Yep, but what difference does it make if the text is stored in XML or in a database?
        
        Re:Anti-XML (Score:1)
        
        by cthrall ( 19889 ) writes:
        
        The XML part comes in when you are extracting content from an existing data store. You can use a relational database for a backend store, but when you're going through the step of mapping existing content to the info your indexing engine wants (the normalization step), XML is very handy.
    - Re:Anti-XML (Score:2)
      
      by kirkjobsluder ( 520465 ) writes:
      
      True, the problem is that HTML became such a beast mixing semantic markup with visual markup that it is really hard to find well-marked up documents.
      
      Still, while it is possible to convert any form of data into a relational database, does that mean that the relational database is the best fit for all types of data? One of the things that XML does well but relational databases don't do well (without a lot of violent shuffling around) is arbitrary parent-child relationships. So for example, a typical paper
- Re:Anti-XML (Score:1)
  
  by cthrall ( 19889 ) writes:
  
  SQL dbs might come with full-text indexing, but the power of information retrieval really comes into play when you can start clustering, using stemmers to find people/places, etc. Db full-text indexing feels more like a feature checkbox than a real information retrieval system.
  
  XML can be useful because you can take data from disparate sources (an Exchange server, SQL db, etc.) and normalize the meta data (the document author, date the document was created, etc.).
  
  I agree there's an overwhelming "silver-bul
- Re:Anti-XML (Score:3, Interesting)
  
  by I8TheWorm ( 645702 ) writes:
  
  I tend to get on an XML soap (no pun intended) box when I see articles about it, so here goes...
  
  XML is great for sharing data between non-congruous systems. It's horrible, however, for storing data in any large quantity, and even more horrible for treating as a searchable text file. It's inherently large and full of ascii/ansi/utf characters that are completely unnecessary when performing byte by byte text searches. For large amounts of data, you're right... RDBMS is the current way to go... maybe OODB
  - Re:Anti-XML (Score:2)
    
    by Pseudonym ( 62607 ) writes:
    
    XML is almost ideal for storing structured text in large quantities. Storing non-textual data, not so much. (This is one reason why XML gets a bad reputation for data representation; people are using it for tasks which are not textual markup-related.) For byte-by-byte searching... true enough, it sucks for that. But surely if you have text in large quantities, you're hardly going to search it using "grep". That would be insane whether it's stored in XML or plain text.
- Re:Anti-XML (Score:3, Interesting)
  
  by DrVomact ( 726065 ) writes:
  
  The reason why XML is widely used today for a multitude of purposes (e.g., data interchange between otherwise incompatible systems, configuration files, technical documents, command protocols that communicate with servers, etc. etc.) and why it will be used for even more stuff in the future is that it is centered on a very simple and powerful idea: self-documenting data. That is, the data is structured by internal markers that give information about the type of information contained in each logical element
  - Re:Anti-XML (Score:2)
    
    by gillbates ( 106458 ) writes:
    
    I have no idea whether the databases of the future will store their data in XML form or not
    Not likely. XML is designed to solve the data identification problem, not the data storage problem.
    
    Due to the hierarchical nature of XML, a validating parser must read the entire document before returning any results. Given the way that most parsers are designed, the entire document will be read into memory and first parsed, then validated. Which, of course, limits the size of your database to the machine's m
- Re:Anti-XML (Score:2)
  
  by kirkjobsluder ( 520465 ) writes:
  
  I didn't get the impression from the article that he was considering XML as data storage. I saw the point as being that we don't know how much XML a search system will have to process. If your data consists of a large number of OpenOffice, DocBook, XHTML or FrameMaker documents, then it might just be easier to keep things in XML rather than to split the data apart into a bunch of atomic chunks.
  
  I love using RDBMS but for some applications, creating a normalized database is a pain in the rear. Bibliograp
- Re:Anti-XML (Score:2)
  
  by Pseudonym ( 62607 ) writes:
  
  Relational databases and full-text indexing are a poor fit once you have a lot of text to store. Yes, I know. Most SQL DBMS come with full-text indexing. That's not enough. Read on for the reason why.
  
  Think about how a relational DBMS works. Internally, the major data structure is the "stream of tuples". A tuple is a virtual record which is made up of a number of fields, each of which has data in it.
  When you search, you get back a stream of tuples, which is usually some projection of the record store
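
Several replies in this thread mention stemmers and term frequencies as the first step beyond plain word matching. The sketch below shows the idea in its simplest form. The suffix list and the function names `naive_stem` and `term_frequencies` are assumptions made up for illustration, and the suffix stripping is a crude stand-in, not the Porter stemmer or any algorithm a real engine uses.

```python
import re
from collections import Counter

# Crude, illustrative suffix list; real stemmers use much richer rules.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_frequencies(text):
    """Count how often each stemmed term occurs in text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(word) for word in words)


if __name__ == "__main__":
    sample = "Searching searches searched search"
    print(term_frequencies(sample).most_common(3))
```

Counting stems rather than raw words lets "searching" and "searched" add to the same total, which is what makes term frequency a usable hint about what a document is about.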
mirrors ? anyone ? (Score:2)

by psycho_tinman ( 313601 ) writes:

Everything beyond the TOC (which I loaded onto my browser) is slashdotted. The problem with the links to the different articles is that it's not part of a tree hierarchy, I can't just say "wget all pages beyond point X", nor can I make a guess and do a regex download of all URLs with "search" in them, because some articles [tbray.org] do not conform to that pattern.
A tarball for offline browsing would be nice ? didn't see it on the page, though. Save you a part of a slashdotting, Tim.. how about it ? :)
- - Re:mirrors ? anyone ? (Score:1, Informative)
    
    by Anonymous Coward writes:
    
    http://developers.slashdot.org/faq/suggestions.shtml#su900 [slashdot.org]
    
    Thank you, drive through.
This technology still exists? (Score:3, Funny)

by Pathetic Coward ( 33033 ) writes: on Thursday December 18, 2003 @10:46AM (#7753829)

Search technology. Hmmm. Wasn't that outsourced to India last month? Or was that last year? I just can't keep up with IT today.

- Re:This technology still exists? (Score:3, Insightful)
  
  by smittyoneeach ( 243267 ) writes:
  
  It will thrive until the Next Big Thing(tm) arrives, to "save us from the sad shortcomings of XML".
  
  XML's only real fault is that it's been oversold, not unlike Object Oriented Programming and Java before it.
Why isn't "someone" Tim Bray (Score:5, Interesting)

by leoaugust ( 665240 ) writes: <leoaugust@gmail.cBLUEom minus berry> on Thursday December 18, 2003 @10:46AM (#7753832) Journal

I plan to conclude with a description of the next search engine, which doesn't exist yet but someone ought to start building.

"Someone" ought to start building ... I wonder why this someone isn't Tim Bray. He is one of the most well known names in XML, has experience under his belt with another Search Engine Project Antarctica [antarctica.net] .....
I just mean it in the sense that if he is having trouble getting his own ideas himself off the ground, what a challenge it will be for someone else to do so.
Mr. Bray should get the thing going like Linus did, and call in help from the Open Source Community. If he is waiting for someone with moneybags to catch the bait, and call him on the project as a highly paid consultant, maybe the approach needs to be modified.
Go Open Source Tim ... and get the ball rolling. You will be surprised how much help you will get ...

- Re:Why isn't "someone" Tim Bray (Score:3, Informative)
  
  by wizarddc ( 105860 ) writes:
  
  Go Open Source Tim ... and get the ball rolling. You will be surprised how much help you will get ...
  
  I thought that was just a myth [slashdot.org]?
- Re:Why isn't "someone" Tim Bray (Score:1)
  
  by veecee_veecee ( 694455 ) writes:
  
  From the article (On Search: Backgrounder), on using Open Source tools:
  Each of the ones I've looked at has a problem (lightly/poorly maintained, scalability problems, lack of internationalization, awkward API).
  Good luck convincing him to go Open Source!
- Re:Why isn't "someone" Tim Bray (Score:3, Informative)
  
  by mbrinkm ( 699240 ) writes:
  
  "This is the last in my series of On Search essays. I've written these pieces because I care about search and because the lessons of experience are worth writing down; but also because I'd like to change this part of the world. In short, I'd like to arrange for basically every serious computer in the world to come with fast, efficient, easy-to-manage search software that Just Works. This essay is about what that software should look like. Early next year I'll write something on how it might get built.
  Nami
  - RBTFL Re:Why isn't "someone" Tim Bray (Score:2)
    
    by leoaugust ( 665240 ) writes:
    
    Sure we did RTFA. Can you Read Between The F* Lines RBTFL ?
    Here is what Tim says:
    This essay is about what that software should look like. Early next year I'll write something on how it MIGHT get built.
    So BRF is going to be open-source.
    I plan to conclude with a description of the next search engine, which doesn't exist yet but someone ought to start building.
    And if the following is not Consultant-Speak I don't know what is - Consultants are great at telling you why you should not be doing what
- Re:Why isn't "someone" Tim Bray (Score:1)
  
  by cutting ( 534901 ) writes:
  
  Go Open Source Tim ... and get the ball rolling.
  The ball is already rolling. Check out Lucene [apache.org] or Nutch [nutch.org]. Either of these could be enhanced to support Tim's ideas. Volunteers? (I'm already working on it.)
- Re:Why isn't "someone" Tim Bray (Score:2, Informative)
  
  by gwhulbert ( 534218 ) writes:
  
  Tim Bray was one of the founders of Open Text Corporation ... they INVENTED the search engine.
  Digital (with whom they were working) "stole" the idea and opened AltaVista 3 months before their IPO.
  I worked for Open Text for a year, but that was after Tim left (just about the time the 1.0 draft of the XML spec appeared).
google cache (Score:1)

by aarku ( 151823 ) writes:

here [216.239.41.104]
Slashdot search question (Score:3, Interesting)

by Glass of Water ( 537481 ) writes: on Thursday December 18, 2003 @11:32AM (#7754280) Journal

But there is some good stuff out there; for example Slashdot's search engine seems to run smooth, clean, and fast, but some poking around failed to reveal what it is: I wouldn't be surprised if it's just the MySQL search facility.
Anybody know the answer to this one?

- Re:Slashdot search question (Score:2)
  
  by stoborrobots ( 577882 ) writes:
  
  I don't know... <a href="http://www.slashcode.com/" >see for yourself</a>, then come and tell us...
  
  The <a href="http://ask.slashcode.com/article.pl?sid=02/02/09/183217&mode=thread&tid=4" >comment on this page</a> suggests that you are right...
  - Yeah, I know... Preview.... (Score:4, Informative)
    
    by stoborrobots ( 577882 ) writes: on Thursday December 18, 2003 @12:09PM (#7754657)
    
    I don't know... see for yourself [slashcode.com], then come and tell us... The comment on this page [slashcode.com] suggests that you are right...
    
  - Re:Slashdot search question (Score:1)
    
    by utopyr ( 621354 ) writes:
    
    There are more suggestions here [slashcode.com] that you might be right.
    
    Is this like Frequently-Asked-Magic-8-Ball?
- Re:Slashdot search question (Score:1)
  
  by ddilling ( 82850 ) writes:
  
  Smooth, clean, fast... and kinda stupid.
  
  Enter the precise title of this very article ("Learning about full text search" -- it will strip the hyphen anyway), and order by date: Your top hit will be "A.I. Helicopters" with this article hit #2.
  Even better, order by score: your top hit will be "C++ Answers From Bjarne Stroustrup" -- this article doesn't even appear on the first page of 30 hits.
  Okay, you say... maybe it's not searching the titles, but the article bodies only. Let's try "Tim bray XML search"..
Or instead, talk to a librarian (the Register) (Score:3, Interesting)

by JPMH ( 100614 ) writes: on Thursday December 18, 2003 @11:48AM (#7754442)

An interesting counterpoint to this story in the Register today:
"A Quantum Theory of Internet Value" [theregister.co.uk] by Andrew Orlowski
-- why librarians are better at finding the book you want than Google.

Mirror (Score:5, Informative)

by Door-opening Fascist ( 534466 ) writes: <skylar@cs.earlham.edu> on Thursday December 18, 2003 @12:11PM (#7754675) Homepage

Since the site looks bogged down from the /.'ing, I've made a few mirrors:
Mirror #1 [earlham.edu]
Mirror #2 [earlham.edu]
Mirror #3 [dhs.org]

- Re:Mirror (Score:1)
  
  by quartertone ( 567439 ) writes:
  
  Excellent work. Not sure how you were able to get around the /. takedown.
  One problem, however: It's just the front page. The meat of the information is still hiding on his server.
  - Re:Mirror (Score:2)
    
    by Door-opening Fascist ( 534466 ) writes:
    
    One problem, however: It's just the front page. The meat of the information is still hiding on his server.
    It was originally just the front page. I decided to get that up fast to get the load off the original server. I've just updated the mirror with the important links, but those took a little longer to fetch.
page rank algorithm (Score:2)

by bcrowell ( 177657 ) writes:

Also, Google claims that links from pages that themselves have a lot of incoming links count for more, but I'm not actually sure they'd need to do that to get the results they do.
Is Google's page rank algorithm really that mysterious? I know they fiddle with it in secret ways now and then to discourage abuse, but I heard the fundamental algorithm was basically pretty simple: something like finding the eigenvectors and eigenvalues of the matrix of links. (Not sure exactly what they do with these -- associat
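For what it's worth, the "eigenvector of the link matrix" idea can be sketched in a few lines. This is only a toy power iteration over a made-up four-page web, not Google's actual implementation: the function name `pagerank`, the `toy_web` graph, and the damping factor of 0.85 are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by power iteration.

    links maps each page to the list of pages it links to.
    damping=0.85 is an illustrative assumption, not a known production value.
    """
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution

    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump" term).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" has the most incoming links, "d" has none.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

The point of the sketch is that a link from a highly ranked page passes on more weight than a link from an obscure one, which is the claim quoted above. Repeating the update until the numbers stop changing is one way of finding the dominant eigenvector of the (damped) link matrix.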
- Re:page rank algorithm (Score:1)
  
  by bcrowell ( 177657 ) writes:
  
  (Replying to myself): Here [iprcom.com] is a site that claims to explain Google page rank completely. I found it by doing a Google search on 'google "page rank"', and I assume it's pretty authoritative, because it had the highest page rank :-)
- another pagerank discussion (Score:2)
  
  by goon ( 2774 ) writes:
  
  google broken [google-watch.org]? (www.google-watch.org)
  "... unique ID for each page stored as ansi c, 4 bytes on Linux system (~4yo) gives theoretical limit of 4.2 billion pages. ..."
  discusses the move to 5 bytes and suggests how this move may be the cause of weird search results on Google searches this year - of course the other reason may be Google foiling search queue jumpers [webmasterworld.com].
"long departed Open Text index?" Not (Score:2, Informative)

by Anonymous Coward writes:

It just has a new name, and it's being developed by librarians.
http://www.dlxs.org/products/xpat.html
So, where can I find it? (Score:1)

by mod_parent_down ( 692943 ) writes:

I've been looking all over...
searching using php perl and mysql (Score:2, Interesting)

by chrisranjana.com ( 630682 ) writes:

More search-related functions should be available to PHP and Perl and built into them ... and into MySQL too.
- Re:searching using php perl and mysql (Score:2)
  
  by Pseudonym ( 62607 ) writes:
  
  Both Perl [z3950.org] and PHP [zend.com] already have Z39.50 [loc.gov] support to connect to full-text search engines [indexdata.dk].
UI you say - check out www.geninterface.com (Score:1)

by wheatking ( 608436 ) writes:

in the post-Google world, UIs like the General Interface [geninterface.com] will appear. check out their demo at Integrated Web Services [geninterface.com] and no, i don't work there. i just like the direction they are going in.
- General Interface? (Score:1)
  
  by sean.peters ( 568334 ) writes:
  
  If they're so general, how come I get this when I try to view the sample apps?
  
  Sample Applications
  
  General Interface Objects currently supports Internet Explorer 5.5 and later browsers running on Windows. For access to the sample applications please use another browser.
  
  I guess "general" means "IE only".
  
  Sean
I think his searching technique needs some work (Score:2)

by cbreaker ( 561297 ) writes:

From the site: "It has fifteen instalments not including this table of contents."

Last I searched the dictionary, it was "installments."

I guess alphabetical searching is best after all.
- Re:Like...wow. (Score:2)
  
  by no reason to be here ( 218628 ) writes:
  
  Actually, this is one of the few times that someone used "like" correctly. The linked documents are not a textbook on searching; however, they are similar to a textbook on searching. It is, therefore, appropriate to use the preposition "like," since the linked essays are, in fact, like a textbook on searching.
  - Re:Like...wow. (Score:1)
    
    by khamar ( 667861 ) writes:
    
    From what I can see through the war-haze of /.ing, these articles are more like a blog. Are we confusing "essay" and "like a textbook" with some random ideas?
    I really like this guy's comments, but would not confuse them with a textbook.
    Favorite idea: 'Turn on Search' built into Apache. This should be a standard feature.
    Of course, others have already started working on a flash version [ilovedaemon.net] before this blog was written.
- Re:Searching and Sorting (Score:1)
  
  by alw53 ( 702722 ) writes:
  
  And maybe discuss the actual algorithms
  instead of the UI.
