Learning About Full-text Search 140
An anonymous reader writes "Tim Bray who's known for XML and has been /.'ed once or twice for that kind of stuff, actually seems to be a search geek and has been writing this endless series of essays on search technology since summer. He says he's finished now - it's like a textbook on searching."
Salute (Score:2, Funny)
You mean two or three times now.
poor guy (Score:5, Informative)
Re:poor guy (Score:5, Insightful)
Re:poor guy (Score:1)
Also, I believe that Google respects instructions in the robots.txt not to cache their page.
Re:poor guy (Score:2)
1) Allow Slashdot to cache the site
2) Get the site slashdotted back to the stoneage
Nothing wrong with some mafia methods every now and then!
Re:poor guy (Score:1)
Re:poor guy (Score:1)
Apparently it's more acceptable to them to knowingly blow sites out of the water (they even joked about it in this post) than to spend the time to fire off an email. The fact is, they don't even want to try.
Re:poor guy (Score:3, Offtopic)
Re:poor guy (Score:2, Insightful)
But even if the issue of liability were taken off the table, they would still have to get off of their metaphorical butts and set up a caching system. I don't know if there is any usable open-source system currently in existence, but if not, they would either have to code it themselves or adapt something already out there that doesn't serve their needs. Disk space isn't really an issue, as the commenting system takes a lot more space than the cache would (assuming
Re:poor guy (Score:5, Informative)
Re:poor guy (Score:4, Insightful)
Re:poor guy (Score:3, Informative)
Re:poor guy (Score:2, Funny)
Re:poor guy (Score:1)
Re:poor guy (Score:2)
it's geared for public consumption (Score:1, Insightful)
such is the nature of websites,
so as long as you don't pretend you wrote it,
it's abundantly clear where the original came from,
go ahead and mirror (by mirror i mean take a snapshot).
only if a copyright holder says don't do that should you remove it.
Re:poor guy (Score:5, Informative)
I have to write some crap at the end here so i can get past the "Your comment has too few characters per line" error message.
Re:Salute (Score:5, Interesting)
You mean two or three times now.
And it's my poor server that has to bear the burden... but so far it's held up fairly well each time. Pretty good for a Celeron 1.7GHz w/ 256M.
However this time was particularly bad because of it being a series of essays. I just increased the number of instances of Apache by 66% and doubled the number of requests before a child dies. That seems to have brought some responsiveness back.
Funny thing is I didn't even know he was
For the geeks who enjoy technical detail... it's running on an Inspire cube PC, one of those little cubes with a mini-ATX in it. Shows you don't need a lot of horsepower to serve static content.
web page irony (Score:3, Funny)
Re:web page irony (Score:5, Funny)
Re:web page irony (Score:2, Informative)
He's got an article about searching and his pages aren't searchable, and he's got articles about XML, so having non-valid XHTML pages would definitely have been ironic...
Re:web page irony (Score:2, Funny)
Re:web page irony (Score:3, Interesting)
I agree that it doesn't look easy to search around, at least when all you have is a URL to go on (http://www.tbray.org/ongoing/When/200x/2003/07/30
Re:web page irony (Score:5, Informative)
Really? How about search site:tbray.org [google.com]?
Hold on there (Score:5, Funny)
Finished an endless series?
Re:Hold on there (Score:5, Funny)
Re:Hold on there (Score:1)
Time flies when we're sitting in front of our comps, reading
Bray's theorem (Score:4, Funny)
The essay series converges to text book when time tends to infinity. Proof is left as an exercise to the reader.
ObHutz (Score:3, Funny)
Re:Hold on there (Score:2)
In the first three months he wrote a page, in the next month and a half he wrote another page, in the next (scratching of head) three quarters of a month he wrote another page, and so on. Now after six months he has written an endless amount of stuff, simple (yet amazing) really.
Re:Hold on there (Score:1)
re-inventing the wheel (Score:2, Interesting)
Re:re-inventing the wheel (Score:4, Insightful)
Re:re-inventing the wheel (Score:2, Insightful)
Re:re-inventing the wheel (Score:1, Insightful)
Interesting stuff! (Score:3, Funny)
Re:Interesting stuff! (Score:2, Funny)
I'm unaware of how to apply this to my life. I think I'll take it and put it in the "Unaware of How to Apply This to My Life" Stack with The Simpsons and The Internet
But what if your stack grows big and you need to search through the stack ?
Re:Interesting stuff! (Score:2, Offtopic)
It breaks the abstraction, but the improvement may actually be worth it sometimes...
Anti-XML (Score:5, Interesting)
Why the fascination with XML? Well, I certainly know the reason why *Tim* is fascinated, but I want to know why he's seriously contemplating reinventing the wheel - namely using XML as data storage when we already have gobs and gobs of systems (think SQL DBMS products) that do this in a much faster, more compact, safer, better way.
Also, most SQL DBMS (Oracle, Sybase ASE, MS SQL, etc.) come with full-text indexing built in, so all it would take would be to chop up HTML pages and stick them in the DBMS, then you can perform rich-text queries [sybase.com] on them with minimal effort.
Re:Anti-XML (Score:1)
Re:Anti-XML (Score:2)
I'm with ya there buddy.. If it wasn't for a corporate buyout, my OS/2 box with REXX scripts would still be ftp'ing files (I was really hoping for 10 years - but I've been gone for 3 now).
Now they'll do it in some xxx.Net, because it's all new and cool. Whatever, at least my stuff was readable with 'edit'.
Re:Anti-XML (Score:5, Informative)
If an XML aspect of the data is available (you can still keep it all in Oracle - just provide a "view" of it in XML) from each of us - common search tools and methods can be utilized.
Re:Anti-XML (Score:2)
You could, of course, bundle an existing DBMS product into the application which would remove the limitation of being forced to use the customer's DBMS product.
Re:Anti-XML (Score:4, Interesting)
Unfortunately his site (http://www.tbray.org/ongoing/) seems to be sufficiently disorganized that I have trouble finding out what his real points are, or whether he's addressed all of the issues: for example, I saw no mention of the Semantic Web [w3.org] if his concern is searchability on web documents.
As a side note, MS SQL is going more and more toward XML, as is the whole
Re:Anti-XML (Score:1)
Try here [tbray.org] on the page about metadata.
Re:Anti-XML (Score:2)
Which (slightly OT) reminds me: has anyone here used an XML compression tool, that they'd like to share opinions on? I've looked at XMLPPM briefly but not worked with it yet. Any others?
Re:Anti-XML (Score:2)
I've looked at a few, but frankly, haven't seen the point. Several generic compression types (e.g. zip) are based on finding sequences in the data (e.g. "<SomeTagName") that are repeated, and hence they do very well with XML. I had some really big XML doc that whatever zip compression lib I was using for other stuff, with default options got down to ~15%, while some XML-specific compressor, after a bit of configuration boug
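The point about generic compressors and repeated tag names is easy to check for yourself. This is only a rough sketch: the sample document is made up (hypothetical tag names and values), and the ratio you get on real data will differ.

```python
import zlib

# Hypothetical sample document: many records that repeat the same tag names.
records = "".join(
    f'<SomeTagName id="{i}"><Value>{i * 7}</Value></SomeTagName>\n'
    for i in range(1000)
)
xml_doc = f"<Root>\n{records}</Root>\n".encode("utf-8")

# Plain DEFLATE (the algorithm behind zip/gzip) at the default level.
compressed = zlib.compress(xml_doc)
ratio = len(compressed) / len(xml_doc)

print(f"original: {len(xml_doc)} bytes")
print(f"compressed: {len(compressed)} bytes ({ratio:.1%} of original)")
```

The repeated markup is exactly the kind of redundancy a dictionary-based compressor picks up, which is why an XML-specific tool has to work hard to beat it.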
Re:Anti-XML (Score:4, Insightful)
From the google cache...
searching for words isn't really what you want to do. You'd like to search for ideas, for concepts, for solutions, for answers. That is why he is going in an XML direction. The relational approach is rectilinear and requires that the information be framed in a highly normalized fashion. Generalized semantic searching is highly non-normalized because, well, humans are highly non-normalized.
I think that he should look at some work by a different Tim, the Semantic Web [w3c.org].
Re:Anti-XML (Score:1)
That makes absolutely no sense.
Re:Anti-XML (Score:3, Funny)
Hmmm, perhaps a visit to a dictionary [dictionary.com] is in order. Once you read the definitions for rectilinear and normalized, I think you'll find the sense of the post.
This is a sound strategy any time you run into a message that makes no sense. Simply look up the definitions of the words that you don't know.
Re:Anti-XML (Score:1)
Re:Anti-XML (Score:2)
Re:Anti-XML (Score:2)
Re:Anti-XML (Score:1)
The problem is retrieving information to index. You pull information from existing data sources that have never heard of your data model and don't care. XML provides a simple way to map your existing content to some standard design that you come up with. That's the "normalization" step, and one of the harder parts of indexing.
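To make that mapping step concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. The two source records, the field maps, and the common element names (author, created, title) are all invented for illustration; a real indexer would have its own schema.

```python
import xml.etree.ElementTree as ET

# Two hypothetical sources that describe a document with different field names.
mail_record = {"From": "alice@example.com", "Date": "2003-07-30", "Subject": "On Search"}
db_row = {"author_name": "bob", "created": "2003-08-01", "title": "Indexing notes"}

# Hypothetical per-source mappings onto one made-up common schema.
FIELD_MAPS = {
    "mail": {"From": "author", "Date": "created", "Subject": "title"},
    "db": {"author_name": "author", "created": "created", "title": "title"},
}


def to_common_xml(source, record):
    """Return a <document> element whose children use the common field names."""
    doc = ET.Element("document", {"source": source})
    for src_field, common_field in FIELD_MAPS[source].items():
        child = ET.SubElement(doc, common_field)
        child.text = record[src_field]
    return doc


for source, record in (("mail", mail_record), ("db", db_row)):
    print(ET.tostring(to_common_xml(source, record), encoding="unicode"))
```

Once everything looks like the same kind of <document>, the indexer only has to understand one shape, whatever the original source was.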
Re:Anti-XML (Score:3, Insightful)
Re:Anti-XML (Score:1)
Re:Anti-XML (Score:2)
Re:Anti-XML (Score:1)
Re:Anti-XML (Score:2)
Still, while it is possible to convert any form of data into a relational database, does that mean that the relational database is the best fit for all types of data?. One of the things that XML does well but relational databases don't do well (without a lot of violent shuffling around) is arbitrary parent-child relationships. So for example, a typical paper
Re:Anti-XML (Score:1)
XML can be useful because you can take data from disparate sources (an Exchange server, SQL db, etc.) and normalize the meta data (the document author, date the document was created, etc.).
I agree there's an overwhelming "silver-bul
Re:Anti-XML (Score:3, Interesting)
XML is great for sharing data between non-congruous systems. It's horrible, however, for storing data in any large quantity, and even more horrible for treating as a searchable text file. It's inherently large and full of ASCII/ANSI/UTF characters that are completely unnecessary when performing byte-by-byte text searches. For large amounts of data, you're right... RDBMS is the current way to go... maybe OODB
Re:Anti-XML (Score:2)
XML is almost ideal for storing structured text in large quantities. Storing non-textual data, not so much. (This is one reason why XML gets a bad reputation for data representation; people are using it for tasks which are not textual markup-related.) For byte-by-byte searching... true enough, it sucks for that. But surely if you have text in large quantities, you're hardly going to search it using "grep". That would be insane whether it's stored in XML or plain text.
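For anyone wondering what the alternative to grep looks like: the usual answer is an inverted index, built once and then consulted per query. A toy sketch follows, with made-up documents and a deliberately naive tokenizer (lowercase runs of letters and digits); real engines do far more.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Naive tokenizer: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical documents.
docs = {
    1: "Tim Bray writes about XML",
    2: "Full-text search with an inverted index",
    3: "XML search and structured text",
}
index = build_index(docs)
print(sorted(search(index, "xml search")))  # [3]
```

A query touches only the posting sets for its words instead of scanning every byte of every document, and that holds whether the text was stored as XML or plain text.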
Re:Anti-XML (Score:3, Interesting)
The reason why XML is widely used today for a multitude of purposes (e.g., data interchange between otherwise incompatible systems, configuration files, technical documents, command protocols that communicate with servers, etc. etc.) and why it will be used for even more stuff in the future is that it is centered on a very simple and powerful idea: self-documenting data. That is, the data is structured by internal markers that give information about the type of information contained in each logical element
Re:Anti-XML (Score:2)
Not likely. XML is designed to solve the data identification problem, not the data storage problem.
Due to the hierarchical nature of XML, a validating parser must read the entire document before returning any results. Given the way that most parsers are designed, the entire document will be read into memory and first parsed, then validated. Which, of course, limits the size of your database to the machine's m
Re:Anti-XML (Score:2)
I love using RDBMS but for some applications, creating a normalized database is a pain in the rear. Bibliograp
Re:Anti-XML (Score:2)
Relational databases and full-text indexing are a poor fit once you have a lot of text to store. Yes, I know. Most SQL DBMS come with full-text indexing. That's not enough. Read on for the reason why.
Think about how a relational DBMS works. Internally, the major data structure is the "stream of tuples". A tuple is a virtual record which is made up of a number of fields, each of which has data in it.
When you search, you get back a stream of tuples, which is usually some projection of the record store
mirrors ? anyone ? (Score:2)
Everything beyond the TOC (which I loaded onto my browser) is slashdotted. The problem with the links to the different articles is that they're not part of a tree hierarchy, so I can't just say "wget all pages beyond point X", nor can I make a guess and do a regex download of all URLs with "search" in them, because some articles [tbray.org] do not conform to that pattern.
A tarball for offline browsing would be nice. I didn't see one on the page, though. It would save you part of a slashdotting, Tim... how about it? :)
Re:mirrors ? anyone ? (Score:1, Informative)
Thank you, drive through.
This technology still exists? (Score:3, Funny)
Re:This technology still exists? (Score:3, Insightful)
XML's only real fault is that it's been oversold, not unlike Object Oriented Programming and Java before it.
Why isn't "someone" Tim Bray (Score:5, Interesting)
"Someone" ought to start building ... I wonder why this someone isn't Tim Bray. He is one of the best-known names in XML, and has experience under his belt with another search engine project, Antarctica [antarctica.net] .....
I just mean it in the sense that if he is having trouble getting his own ideas himself off the ground, what a challenge it will be for someone else to do so.
Mr. Bray should get the thing going like Linus did, and call in help from the Open Source Community. If he is waiting for someone with moneybags to catch the bait, and call him on the project as a highly paid consultant, maybe the approach needs to be modified.
Go Open Source Tim ... and get the ball rolling. You will be surprised how much help you will get ...
Re:Why isn't "someone" Tim Bray (Score:3, Informative)
I thought that was just a myth [slashdot.org]?
Re:Why isn't "someone" Tim Bray (Score:1)
Each of the ones I've looked at has a problem (lightly/poorly maintained, scalability problems, lack of internationalization, awkward API).
Good luck convincing him to go Open Source!
Re:Why isn't "someone" Tim Bray (Score:3, Informative)
"This is the last in my series of On Search essays. I've written these pieces because I care about search and because the lessons of experience are worth writing down; but also because I'd like to change this part of the world. In short, I'd like to arrange for basically every serious computer in the world to come with fast, efficient, easy-to-manage search software that Just Works. This essay is about what that software should look like. Early next year I'll write something on how it might get built.
Nami
RBTFL Re:Why isn't "someone" Tim Bray (Score:2)
Sure we did RTFA. Can you Read Between The F* Lines RBTFL ?
Here is what Tim says:
This essay is about what that software should look like. Early next year I'll write something on how it MIGHT get built.
So BRF is going to be open-source.
I plan to conclude with a description of the next search engine, which doesn't exist yet but someone ought to start building.
And if the following is not Consultant-Speak I don't know what is - Consultants are great at telling you why you should not be doing what
Re:Why isn't "someone" Tim Bray (Score:1)
Re:Why isn't "someone" Tim Bray (Score:2, Informative)
Digital (with whom they were working) "stole" the idea and opened Altavista 3 months before their IPO.
I worked for Open Text for a year but after Tim left (just about the time the 1.0 draft of the XML spec appeared).
google cache (Score:1)
Slashdot search question (Score:3, Interesting)
Re:Slashdot search question (Score:2)
The <a href="http://ask.slashcode.com/article.pl?sid=02/
Yeah, I know... Preview.... (Score:4, Informative)
Re:Slashdot search question (Score:1)
Is this like Frequently-Asked-Magic-8-Ball?
Re:Slashdot search question (Score:1)
Smooth, clean, fast... and kinda stupid.
Enter the precise title of this very article ("Learning about full text search" -- it will strip the hyphen anyway), and order by date: Your top hit will be "A.I. Helicopters" with this article hit #2.
Even better, order by score: your top hit will be "C++ Answers From Bjarne Stroustrup" -- this article doesn't even appear on the first page of 30 hits.
Okay, you say... maybe it's not searching the titles, but the article bodies only. Let's try "Tim bray XML search"..
Or instead, talk to a librarian (the Register) (Score:3, Interesting)
"A Quantum Theory of Internet Value" [theregister.co.uk] by Andrew Orlowski
-- why librarians are better at finding the book you want than Google.
Mirror (Score:5, Informative)
Mirror #1 [earlham.edu]
Mirror #2 [earlham.edu]
Mirror #3 [dhs.org]
Re:Mirror (Score:1)
One problem, however: It's just the front page. The meat of the information is still hiding on his server.
Re:Mirror (Score:2)
page rank algorithm (Score:2)
Is Google's page rank algorithm really that mysterious? I know they fiddle with it in secret ways now and then to discourage abuse, but I heard the fundamental algorithm was basically pretty simple: something like finding the eigenvectors and eigenvalues of the matrix of links. (Not sure exactly what they do with these -- associat
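For what it's worth, the published version of PageRank is indeed the principal eigenvector of a damped link matrix, and the textbook way to approximate it is power iteration. The sketch below is that textbook method, not whatever Google actually runs in production; the four-page link graph is made up, and 0.85 is the damping factor from the original paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to; every link
    target is assumed to appear as a key as well.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets the "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Each pass redistributes rank along the links, and the vector it settles on is the eigenvector in question: pages with more (and better-ranked) inbound links end up with higher scores.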
Re:page rank algorithm (Score:1)
another pagerank discussion (Score:2)
google broken [google-watch.org]? (www.google-watch.org)
"... unique ID for each page stored as ansi c, 4 bytes on Linux system (~4yo) gives theoretical limit of 4.2 billion pages. ..."
discusses the move to 5 bytes and suggests that this move may be the cause of weird results in Google searches this year. Of course, the other reason may be Google foiling search queue jumpers [webmasterworld.com].
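The arithmetic behind the "4.2 billion" figure is just the number of distinct values an unsigned ID of a given width can hold. A quick check (the function name is made up for illustration):

```python
def max_ids(num_bytes):
    """Number of distinct values an unsigned integer of num_bytes bytes can hold."""
    return 2 ** (8 * num_bytes)


print(max_ids(4))  # 4294967296, i.e. roughly 4.29 billion
print(max_ids(5))  # 1099511627776, i.e. roughly 1.1 trillion
```

So one extra byte multiplies the ID space by 256.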
"long departed Open Text index?" Not (Score:2, Informative)
http://www.dlxs.org/products/xpat.ht
So, where can I find it? (Score:1)
searching using php perl and mysql (Score:2, Interesting)
Re:searching using php perl and mysql (Score:2)
Both Perl [z3950.org] and PHP [zend.com] already have Z39.50 [loc.gov] support to connect to full-text search engines [indexdata.dk].
UI you say - check out www.geninterface.com (Score:1)
General Interface? (Score:1)
If they're so general, how come I get this when I try to view the sample apps?
Sample Applications
General Interface Objects currently supports Internet Explorer 5.5 and later browsers running on Windows. For access to the sample applications please use another browser.
I guess "general" means "IE only".
Sean
I think his searching technique needs some work (Score:2)
Last I searched the dictionary, it was "installments."
I guess alphabetical searching is best after all.
Re:Like...wow. (Score:2)
Re:Like...wow. (Score:1)
I really like this guy's comments, but would not confuse them with a textbook.
Favorite idea: 'Turn on Search' built-in to Apache. This should be a standard feature.
Of course, others have already started working on a flash version [ilovedaemon.net] before this blog was written.
Re:Searching and Sorting (Score:1)
And maybe discuss the actual algorithms
instead of the UI.