Slashdot Log In
Streaming a Database in Real Time
Posted by
michael
on Fri Jan 21, 2005 06:20 PM
from the never-query-in-the-same-river-twice dept.
from the never-query-in-the-same-river-twice dept.
Roland Piquepaille writes "Michael Stonebraker is well-known in the database business, and for good reasons. He was the computer science professor behind Ingres and Postgres. Eighteen months ago, he started a new company, StreamBase, with another computer science professor, Stan Zdonik, with the goal of speeding access to relational databases. In 'Data On The Fly,' Forbes.com reports that the company software, also named StreamBase, is reading TCP/IP streams and using asynchronous messaging. Streaming data without storing it on disk gives them a tremendous speed advantage. The company claims it can process 140,000 messages per second on a $1,500 PC, when its competitors can only deal with 900 messages per second. Too good to be true? This overview contains more details and references."
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Full
Abbreviated
Hidden
Loading... please wait.
Seriously, Michael (Score:4, Insightful)
It must be alot since the pay for play is so obvious.
THE TRUTH ABOUT ROLAND PIQUEPAILLE (Score:4, Informative)
I think most of you are aware of the controversy surrounding regular Slashdot article submitter Roland Piquepaille. For those of you who don't know, please allow me to bring forth all the facts. Roland Piquepaille has an online journal (I refuse to use the word "blog") located at http://www.primidi.com/ [primidi.com]. It is titled "Roland Piquepaille's Technology Trends". It consists almost entirely of content, both text and pictures, taken from reputable news websites and online technical journals. He does give credit to the other websites, but it wasn't always so. Only after many complaints were raised by the Slashdot readership did he start giving credit where credit was due. However, this is not what the controversy is about.
Roland Piquepaille's Technology Trends serves online advertisements through a service called Blogads, located at www.blogads.com. Blogads is not your traditional online advertiser; rather than base payments on click-throughs, Blogads pays a flat fee based on the level of traffic your online journal generates. This way Blogads can guarantee that an advertisement on a particular online journal will reach a particular number of users. So advertisements on high traffic online journals are appropriately more expensive to buy, but the advertisement is guaranteed to be seen by a large amount of people. This, in turn, encourages people like Roland Piquepaille to try their best to increase traffic to their journals in order to increase the going rates for advertisements on their web pages. But advertisers do have some flexibility. Blogads serves two classes of advertisements. The premium ad space that is seen at the top of the web page by all viewers is reserved for "Special Advertisers"; it holds only one advertisement. The secondary ad space is located near the bottom half of the page, so that the user must scroll down the window to see it. This space can contain up to four advertisements and is reserved for regular advertisers, or just "Advertisers". Visit Roland Piquepaille's Technology Trends (http://www.primidi.com/ [primidi.com]) to see it for yourself.
Before we talk about money, let's talk about the service that Roland Piquepaille provides in his journal. He goes out and looks for interesting articles about new and emerging technologies. He provides a very brief overview of the articles, then copies a few choice paragraphs and the occasional picture from each article and puts them up on his web page. Finally, he adds a minimal amount of original content between the copied-and-pasted text in an effort to make the journal entry coherent and appear to add value to the original articles. Nothing more, nothing less.
Now let's talk about money. Visit http://www.blogads.com/order_html?adstrip_category =tech&politics= [blogads.com] to check the following facts for yourself. As of today, December XX 2004, the going rate for the premium advertisement space on Roland Piquepaille's Technology Trends is $375 for one month. One of the four standard advertisements costs $150 for one month. So, the maximum advertising space brings in $375 x 1 + $150 x 4 = $975 for one month. Obviously not all $975 will go directly to Roland Piquepaille, as Blogads gets a portion of that as a service fee, but he will receive the majority of it. According to the FAQ, Blogads takes 20%. So Roland Piquepaille gets 80% of $975, a maximum of $780 each month. www.primidi.com is hosted by clara.net (look it up at http://www.networksolutions.com/en_US/whois/index. jhtml [networksolutions.com]). Browsing clara.net's hosting solutions, the most expensive hosting service is their Clarahost Advanced (http://www.uk.clara.net/clarahost/advanced.php [clara.net]) priced at £69.99 GBP. This is
Parent
For the record--Taco's response to this (Score:5, Informative)
I then told him about the controversy over it in posters' minds, and he said it was just a "new successful troll meme." Good luck getting through to Slashdot's editors, because clearly Malda does not consider this anything to take seriously.
Parent
Queue dancin' (Score:3, Funny)
speed focus (Score:3, Insightful)
Re:speed focus (Score:3, Informative)
Re:speed focus (Score:4, Interesting)
One question might be...why write the data directly to a database initially? Why not utilize a faster format, then write to the DB when things have slowed down (i.e. caching)?
Admittedly I haven't read the article, but I am familar with 200+G databases, and there are ways to deal with performance with current DB tech.
I do welcome any new competition, but there are ways of querying data in memory already. Heck, put the whole thing on a RAM Drive...how much data can there be for stock tickers?
Parent
Re:speed focus (Score:3, Funny)
Yeah, this will outright kill mysql, I'm swapping tomorrow, got any cash to spare?
Re:speed focus (Score:5, Informative)
Its called IRC.
Parent
Re:speed focus (Score:4, Informative)
Parent
Re:speed focus (Score:5, Funny)
you start with /join, of course...
Parent
I call foul (Score:4, Insightful)
So they manage to do their analysis without even touching main memory? Nifty! What do they do, make it all fit in the L1 data cache? OK, maybe the guy was misquoted - I trust reporters about as far as I can throw them - but the whole thing just smells funny to me. I'm betting that the massive speedup they report is only for carefully selected, pre-groomed data sets. I agree that analyzing data as it comes in rather than storing it up to recrunch later is the smart thing to do, but that insight isn't a breakthrough of the kind the article is spinning this as.
Re:I call foul (Score:5, Interesting)
You have a "standing query". So you can ask things, like, what's the rolling average for the last 60 seconds for this ticker name. What's the minimum price for this commodity.
You can ask to correlate things. Store the last 90 minutes worth of transactions on these commodities. Search for these types of patterns.
It sounds like what they have done is build an OLAP cube that builds its dataset on the fly by processing messages coming over a streaming interface.
It's much smarter to do that, then write every last transaction to disk, and then query the transactions after the fact. That'd be the natural way to thing about it if you used a Relational database.
Essentially, it sure sounds like he's written a generalized packet filter, that can compute interesting functions on the data. Think snort, think ethereal, think iptables, think policy routing. Now apply those kinds of technology to "The price of this stock", "the location of that soldier", where those values are embedded in a network packet frame somewhere.
While each single application of this sounds trivial to implement, if he has done it in a generalized way, that can keep pay with larger systems, bully for him.
The irony of all this for me is that at a former job, I used to process medical data exactly this way. It sounds like the HL7 interface issues we used to have. You couldn't possibly take a full HL7 stream and process it, so you'd filter it down to just the patients that this department was interested in. Then only process messages about those patients.
There were rows that even about those patients you weren't interested in that you had to filter out. You spent a bunch of time filtering, and re-filtering.
We wrote the raw messages to disk, and spooled them to ensure we didn't miss messages due database problems (if the database was down, you had to spool until the database came back up, it was unacceptable to miss patient records for database maintience).
Kirby
Parent
Has nothing to do with relational databases (Score:5, Insightful)
Re:Has nothing to do with relational databases (Score:5, Funny)
Uh oh... You dared to slam Rolly. Prepare for the wrath of Mikey and his infinite mod points.
Lube thy anus.
Parent
Re:Has nothing to do with relational databases (Score:5, Insightful)
Parent
Read the article before posting (Score:5, Insightful)
Re:Read the article before posting (Score:5, Insightful)
1. if you decide to add a new analytic you have to start with new data - you can't deploy a new analtyical component and against historical data.
2. if your machine crashes - it takes all your accumulated analytical data along with it. Maintaining a distribution of activity calculated every 5 minutes over 90 days? Great, but after the server comes back up your data starts all over.
3. if your analtyical component needs to run against a lot of history each time (ex: total number of unique telephone numbers accessed by day, calculate rolling median) then you'll have to maintain that detail data in memory. As you can imagine - you can *easily* identify calculations that will exceed your memory. So, to tune you'll be forced to keep your calculations to relatively recent data only.
ken
Parent
Data IS written to disk/backed-up. (Score:4, Informative)
Every time you get some thing interesting, you save that on disk too - but separately, into a much smaller db. This way state is also saved, and since state is going to be much smaller than the data, there will be no speed issues.
Now the clever thing to do would be to link this flowing-state dbms (FSDBMS) to a standard rdbms working from the disk. Then you could verify the information from the FSDBMS, and ensure that things aren't screwed up. Also, based on patterns seen by the rdbms with long term data, new queries could be generated on the FSDBMS, allowing it to generate results from the data on the wire.
Sounds like it would have applications primarily where response time is at a premium, and long history is not such a large component of the information.
So in the case of military info, where a HumVee could be in trouble (a situ someone else has mentioned), the FSDBMS would raise the alarm, and some other process would then follow up and ensure that the alarm was taken care of.(The data itself would be backed up for future analysis, such as whether the query was correctly handled).
Dynamic queries in such a situ could be - get the id of the closest Apache reporting in, or closest loaded bomber en-route to some other target. Then the alarm handling program would re-route the bomber/apache to the humvee for support. While querying the disk database may be time intensive, the FSDBMS would have delivered a sub-optimal FAST solution.
So imagine the FSDBMS as a filter, giving different bits of information to different people. With the option that you could change the filter on the fly. And the filter could be complex, based on previous history etc., just like a DB query.
Parent
A Better Solution (Score:4, Informative)
Another option is EPL server by ispheres [ispheres.com]. Unlike the product mentioned here, which seems to be just some extra code thrown on top of a database EPL server is built from the ground up for this sort of application.
For sensor networks (Score:4, Interesting)
seems similar to his Auroa project... stonebraker has a history of turning his university research projects into successful startups.
ACID? (Score:3, Insightful)
Seems kinda silly to me. (Score:4, Insightful)
Sooner or later you have to put something somewhere. Let's say you monitor a battalion in battle in realtime. All of these messages are streaming in and being analyzed. Great. But now what? So something triggers an alert, say. Well, what's tracking the status of the alert? Wouldn't you want to track the status of an alert saying "this Humvee is off course"? Wouldn't you want to track whether someone had acknowledged the alert, and what they did about it?
And don't forget there are liability issues, historical issues, and more. You're a stock trader, all of these messages are coming and being analyzed. You get an alert...one of your triggers tripped. You make a trade as a result, only to find out 30 minutes later that the trigger was WRONG and your trade was WRONG and you (or your company) is out $10 million. How do you prove that you made the trade based on the trigger like you were supposed to and not because you f**ked up? The trigger, and the data that caused it to trip, is long gone. What do you do now?
Eventually something has to be written (stored) somewhere, sometime. I guess I can see the need for summarizing data and only storing what StreamBase says is "important" but how would you know if everything was OK if the actual data driving everything was long gone?
This isn't streaming, this is message queuing... (Score:5, Informative)
I'm sure this is a great product, but both the submitter and the writer of the story seem to not grok what makes it great.
Re:Duh (Score:4, Insightful)
At no time is the data 'stored' in any way
You cannot query anything that happened in the past, because the program doesn't remember it.
Parent