Streaming a Database in Real Time

Roland Piquepaille writes "Michael Stonebraker is well-known in the database business, and for good reason. He was the computer science professor behind Ingres and Postgres. Eighteen months ago, he started a new company, StreamBase, with another computer science professor, Stan Zdonik, with the goal of speeding access to relational databases. In 'Data On The Fly,' Forbes.com reports that the company's software, also named StreamBase, reads TCP/IP streams and uses asynchronous messaging. Streaming data without storing it on disk gives them a tremendous speed advantage. The company claims it can process 140,000 messages per second on a $1,500 PC, while its competitors can only deal with 900 messages per second. Too good to be true? This overview contains more details and references."
  • Seriously, Michael (Score:4, Insightful)

    by Anonymous Coward on Friday January 21, 2005 @06:22PM (#11437267)
    How much does Roland Piquepaille pay you to link to his shitty articles?

    It must be a lot, since the pay-for-play is so obvious.
    • by Anonymous Coward on Friday January 21, 2005 @06:27PM (#11437319)
      Roland Piquepaille and Slashdot: Is there a connection?

      I think most of you are aware of the controversy surrounding regular Slashdot article submitter Roland Piquepaille. For those of you who don't know, please allow me to bring forth all the facts. Roland Piquepaille has an online journal (I refuse to use the word "blog") located at http://www.primidi.com/ [primidi.com]. It is titled "Roland Piquepaille's Technology Trends". It consists almost entirely of content, both text and pictures, taken from reputable news websites and online technical journals. He does give credit to the other websites, but it wasn't always so. Only after many complaints were raised by the Slashdot readership did he start giving credit where credit was due. However, this is not what the controversy is about.

      Roland Piquepaille's Technology Trends serves online advertisements through a service called Blogads, located at www.blogads.com. Blogads is not your traditional online advertiser; rather than base payments on click-throughs, Blogads pays a flat fee based on the level of traffic your online journal generates. This way Blogads can guarantee that an advertisement on a particular online journal will reach a particular number of users. So advertisements on high-traffic online journals are appropriately more expensive to buy, but the advertisement is guaranteed to be seen by a large number of people. This, in turn, encourages people like Roland Piquepaille to try their best to increase traffic to their journals in order to increase the going rates for advertisements on their web pages. But advertisers do have some flexibility. Blogads serves two classes of advertisements. The premium ad space that is seen at the top of the web page by all viewers is reserved for "Special Advertisers"; it holds only one advertisement. The secondary ad space is located near the bottom half of the page, so that the user must scroll down the window to see it. This space can contain up to four advertisements and is reserved for regular advertisers, or just "Advertisers". Visit Roland Piquepaille's Technology Trends (http://www.primidi.com/ [primidi.com]) to see it for yourself.

      Before we talk about money, let's talk about the service that Roland Piquepaille provides in his journal. He goes out and looks for interesting articles about new and emerging technologies. He provides a very brief overview of the articles, then copies a few choice paragraphs and the occasional picture from each article and puts them up on his web page. Finally, he adds a minimal amount of original content between the copied-and-pasted text in an effort to make the journal entry coherent and appear to add value to the original articles. Nothing more, nothing less.

      Now let's talk about money. Visit http://www.blogads.com/order_html?adstrip_category=tech&politics= [blogads.com] to check the following facts for yourself. As of today, December XX 2004, the going rate for the premium advertisement space on Roland Piquepaille's Technology Trends is $375 for one month. One of the four standard advertisements costs $150 for one month. So the maximum advertising space brings in $375 x 1 + $150 x 4 = $975 for one month. Obviously not all $975 will go directly to Roland Piquepaille, as Blogads gets a portion of that as a service fee, but he will receive the majority of it. According to the FAQ, Blogads takes 20%. So Roland Piquepaille gets 80% of $975, a maximum of $780 each month. www.primidi.com is hosted by clara.net (look it up at http://www.networksolutions.com/en_US/whois/index.jhtml [networksolutions.com]). Browsing clara.net's hosting solutions, the most expensive hosting service is their Clarahost Advanced (http://www.uk.clara.net/clarahost/advanced.php [clara.net]), priced at £69.99. This is
      • by bonch ( 38532 ) on Friday January 21, 2005 @11:43PM (#11439029)
        I asked him why so many Roland articles get accepted, and he said he doesn't even look at the submitter's name and that Roland must be submitting good articles.

        I then told him about the controversy over it in posters' minds, and he said it was just a "new successful troll meme." Good luck getting through to Slashdot's editors, because clearly Malda does not consider this anything to take seriously.
      • Wow, you're right, we can't let this guy with a fishy-sounding French name, "Roland Piquepaille", get away with this scheme! To the pitchforks, men!
      • This comment seems fishy: "Visit Roland Piquepaille's Technology Trends (http://www.primidi.com/ [primidi.com]) to see it for yourself." Why would someone AGAINST primidi.com getting tons of money from per view /. floods suggest, constantly, on /. that others should go check it out? Wouldn't he be encouraging people to NOT go and visit?
      • Perhaps you should stop worrying about how much he is making off of his online journal and instead put that time into a competitive "online journal" that will net you $1200 a month!! Go for 20 accepted submissions and sit back and watch the cash come in by the truckloads...
  • by RenHoek ( 101570 ) on Friday January 21, 2005 @06:22PM (#11437271) Homepage
    From what I hear, Blizzard should think about hiring this guy ;)
  • speed focus (Score:3, Insightful)

    by Random Web Developer ( 776291 ) on Friday January 21, 2005 @06:22PM (#11437272) Homepage
    If they're so focused on speed, couldn't this be the MySQL killer for web applications that don't need funky features, but where concurrency and speed are important?
    • Re:speed focus (Score:3, Informative)

      According to the article what makes Streambase different is that it's able to query new data that is coming in at an extremely fast rate. Instead of writing the new data to disk before a query can be executed against it, the database is able to query it as soon as it is streamed into memory. According to the article the current customers testing the software are financial services companies who need to be able to analyze stock ticker information which comes in at an extremely high rate of speed. The $100
      • Re:speed focus (Score:4, Interesting)

        by airjrdn ( 681898 ) on Friday January 21, 2005 @07:21PM (#11437708) Homepage
        SQL Server table variables, and to a certain extent derived tables, share the same basic premise... it's in RAM, not on disk.

        One question might be...why write the data directly to a database initially? Why not utilize a faster format, then write to the DB when things have slowed down (i.e. caching)?

        Admittedly I haven't read the article, but I am familiar with 200+GB databases, and there are ways to deal with performance with current DB tech.

        I do welcome any new competition, but there are ways of querying data in memory already. Heck, put the whole thing on a RAM Drive...how much data can there be for stock tickers?
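
        One rough sketch of that caching idea in Python - sqlite3 standing in for the real DB, and all the names here are made up:

          import sqlite3, threading

          class WriteBehindCache:
              """Buffer incoming rows in RAM; flush to the DB when things slow down."""
              def __init__(self, db_path="ticks.db"):
                  self.buffer = []
                  self.lock = threading.Lock()
                  self.db = sqlite3.connect(db_path, check_same_thread=False)
                  self.db.execute(
                      "CREATE TABLE IF NOT EXISTS ticks (symbol TEXT, price REAL)")

              def insert(self, symbol, price):
                  with self.lock:          # fast path: in-memory append only
                      self.buffer.append((symbol, price))

              def flush(self):             # call periodically, or when load drops
                  with self.lock:
                      rows, self.buffer = self.buffer, []
                  if rows:                 # one bulk write instead of many small ones
                      self.db.executemany("INSERT INTO ticks VALUES (?, ?)", rows)
                      self.db.commit()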
        • Re: (Score:3, Insightful)

          Comment removed based on user account deletion
      • Re:speed focus (Score:3, Informative)

        by jedidiah ( 1196 )
        This is also how Oracle works by default. You can have a database entirely resident in memory just due to the fact that Oracle will try to aggressively cache as much as it can. This is obviously not limited to Oracle or SQLServer.

        What distinguishes RDBMS systems is the fact that their storage is permanent and engineered to perform crash recovery. This means that even a memory-resident Oracle database will be doing synchronous writes to its transaction logs. This ensures that any transaction can be regener
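
        To illustrate the parent's point about synchronous log writes, here is the bare-bones write-ahead-log idea in Python (illustrative only - nothing like Oracle's actual implementation):

          import os

          log = open("redo.log", "ab")

          def commit(record: bytes) -> None:
              # A transaction isn't acknowledged until its log record is
              # physically on disk; this fsync() is the synchronous write
              # that bounds the whole system's transaction rate.
              log.write(record + b"\n")
              log.flush()             # push Python's buffer to the OS
              os.fsync(log.fileno())  # force the OS to hit the platters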
    • by Anonymous Coward
      "Streambase charges customers annual subscriptions for its software, setting prices based on how many CPUs a customer uses to power the software. Typical deals so far have ranged from $100,000 to $300,000 a year"

      Yeah, this will outright kill mysql, I'm swapping tomorrow, got any cash to spare?
    • Re:speed focus (Score:5, Informative)

      by epiphani ( 254981 ) <(epiphani) (at) (dal.net)> on Friday January 21, 2005 @07:01PM (#11437571)
      The idea sounds a lot like the software I develop. We sit on a server-peer network, and process messages - without ever hitting disk. We can query state information out of the network, even though most traffic is dynamic and not stored past initial processing and resending. Two parts to our software, I guess: state data and traffic. Pretty impressive piece of software, I think. Maintaining the network state is far more difficult than most people realize. We generally keep around 100 megs of state in RAM, more depending on the traffic levels. My software has been around, in various incarnations, since the 80s.

      It's called IRC.
    • Re:speed focus (Score:2, Insightful)

      by jonadab ( 583620 )
      > if they are so much focused on speed, couldn't this be the mysql killer
      > for web applications that don't need funky features but where concurrency
      > and speed are important

      As near as I can make out from the (somewhat nontechnical) article, this
      is not a traditional database in any normal sense; it's more like a query
      engine for streaming data. It doesn't permanently store all the data in
      the stream that's passing through it. What it does store, I take it, is
      query results. So I guess basically you
  • by Anonymous Coward
    Streaming data? Data must have some correlation, otherwise it's useless. I doubt all of that can be kept in memory alone, so a permanent storage medium (disk, DAT, or holographic cubes) must be used.

    I used to work with a mySQL variant which facilitated queries by using a RAMDisk and an optimized version of Watcom Pascal to enhance query functionality. We made it open source, but last I heard, the last administrator had converted it into an MP3-labelling shareware package.
    • I doubt that all that can be kept in memory alone

      The dropping cost of memory wipes out your practical concerns. You can have all of the logical correlations that you want in memory. We tend to think that we have to write data to disk to make it organized because today's operating systems and programming languages give us very little direct control of memory, but they give us a great deal of control over what we write to the disk.

      If we had more operating systems that gave us direct control of memory, o

      • The dropping cost of memory wipes out your practical concerns.

        You're right there. At about $150/Gig, and not using disk space, that $1500 PC system could possibly be an Athlon 64 with 8 Gig of memory.

  • WWGT (Score:2, Funny)

    by Mikmorg ( 624030 )
    What Would Google Think?
  • I trust forbes.com about as much as I would trust donaldtrump.com! http://www.trump.com/ [trump.com]
  • Duh (Score:2, Informative)

    Any of the enterprise databases will, given gobs of memory, end up caching the entire database in memory.

    As long as it's read only, the disk won't be touched.

    A writeable database that doesn't need to be written to disk is not a database, it's called a nonpersistent cache.
    • Re:Duh (Score:4, Insightful)

      by Anonymous Coward on Friday January 21, 2005 @06:33PM (#11437376)
      You've possibly misunderstood the point of this software.

      At no time is the data 'stored' in any way. As it's collected (or INSERTed) it passes through a collection of preconfigured SELECT statements, and then disappears. There are no tables full of data, only tables as defined structures for handling incoming and outgoing data.

      You cannot query anything that happened in the past, because the program doesn't remember it.
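
      Roughly like this hypothetical Python sketch of the standing-query idea (not StreamBase's actual interface - the message fields are invented):

        # Preregistered queries see each message exactly once; nothing is retained.
        standing_queries = []

        def register(predicate, action):
            standing_queries.append((predicate, action))

        def on_message(msg):          # called for every message on the stream
            for predicate, action in standing_queries:
                if predicate(msg):
                    action(msg)
            # msg goes out of scope here -- there is no table to query later

        register(lambda m: m["symbol"] == "IBM" and m["price"] > 100.0,
                 lambda m: print("alert:", m))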
      • This sounds pretty much like data-flow computing. Not really new, just a new name. I'm sure it's quite fast, if all it does is look at an incoming stream and decide which output stream it goes to. Databases have always been slow compared with in-memory tables. Of course, I'm sure this system doesn't have to deal with all the record locking and synchronization issues that an actual database would.
      • Re:Duh (Score:2, Insightful)

        Sounds more like a messaging queue than a database. Of course, I work with Oracle DBs all day, so I have a rather targeted perspective on the topic. The big dogs have messaging queues and data-streaming technology built into the database; is this perhaps a way for it to come to the more "vanilla" MySQL/Postgres world?
    • Re:Duh (Score:3, Interesting)

      by dubl-u ( 51156 ) *
      As others have pointed out, the article is talking about something completely different than what you had in mind. Even so:

      Any of the enterprise databases will with gobs of memory end up caching the entire database in memory.

      That's still much slower than in-memory approaches that don't use a database at all. For apps that are amenable to the stick-it-all-in-RAM approach, serializing all your data access is a performance killer.

      A writeable database that doesn't need to be written to disk is not a datab
  • What does it do? (Score:2, Redundant)

    by metalhed77 ( 250273 )
    I'm curious as to exactly what this does. The article is rather vague.
    • by Anonymous Coward
      It sounds to me that the application is to apply a set of rules to data as it comes running into the system. Imagine a database with "triggers" but no tables. Obviously, the rules are all cached in RAM, and they're not persisting the data stream at all (at least on this box, perhaps someplace else).

      That's just a SWAG, but from the article that's what it sounds like to me.
      • Ok, that's what I thought. But how this product is being compared to a database is odd. A database provides persistence. This sounds like it needs a constant stream of data.
        • >But how this product is being compared to a database is odd. A database provides persistence.

          I guess the idea is that you can run SQL queries on those in-memory tables (as opposed to searching memory in some non-standard way).

          >This sounds like it needs a constant stream of data.

          It doesn't _need_ a constant stream of data - data streams are there anyway.
          It replaces disk-based databases, which are apparently useless for real-time decision support systems that must process huge constant streams of data.

  • Scientific programming question: Anybody have any experience with the Data Space Transfer Protocol [dataspaceweb.net]? Also known as the "Data Socket Transfer Protocol"? National Instruments [NI] wrote a DSTP front end into LabVIEW [ni.com], but if any major vendors have a DSTP back end, I haven't discovered it.

    Or does anyone have any experience with any other methods of moving large amounts of [strongly-typed] data across the wire so that it comes to rest in a central repository in some sort of a coherent fashion?

    Thanks!

    • No, Data Space Transfer Protocol is not "also known as" Data Socket Transfer Protocol. DataSocket is a National Instruments server that can reside on your test machine and enable streaming the test data across the Internet. So if you have a measurement test stand on one end, and a LabView front-end on the other, DataSocket will take care of gluing the two together.

      • No, Data Space Transfer Protocol is not "also known as" Data Socket Transfer Protocol.

        First of all, Grossman's group at UIC [dataspaceweb.net] tends to call it Data Space Transfer Protocol. On the other hand, the promotional and marketing material at National Instruments tends to call it Data Socket Transfer Protocol.

        Second, there seems to be some confusion as to what is meant by a backend. I want some sort of a server [something traditional, like Oracle/DB2/SQLServer, or something a little new-fangled, like Objectivity/

    • I've written a large data-handling system for ESA that takes data from a variety of sources (thermocouples, PT100s, PLCs, vacuum gauges, ...) and stores it on a central server. From there the data is transferred on to presentation and control modules. We are geared towards large numbers of channels, fairly slow data updates (once a minute or so, although it will also work at much quicker rates), and large numbers of acquisition, presentation, and control stations.

      I've written my own wire protocol + packer


      • I've written my own wire protocol + packers and unpackers. I tag every data value with its type (number, time, string, ...) and message position (this I use to selectively leave out values under specific circumstances, i.e. to send partial messages). This arrangement works just fine: the wire format is machine independent, and quick to read and write. The coding overhead for message packing and unpacking is limited to pretty much a single function per message type (to identify the various fields), and conv
        • This is precisely what I feared: You had to write the whole thing from the ground up.

          Yeah. I suppose I could have used Corba, but now that I have the basic infrastructure in place there isn't really any advantage to doing so since the effort involved in remote function calls is now as small as it will ever get.

          Besides, I can think of at least one major (multi-million euro) software package that is considered almost too slow to be useable precisely because it is attempting to use Corba to shift serious

  • I wonder how this is different from MySQL Cluster [mysql.com], an in-memory-only DB. From my own comparisons of regular MySQL versus MySQL Cluster, I didn't see much of a performance increase. But I guess it wasn't "streaming" either. I didn't see too many technical specs for their new DB, but I didn't really look either. I wonder how they handle saving stuff to disk? Or do they not even bother, and hope that the generator holds out until the power is restored?
  • I call foul (Score:4, Insightful)

    by RFC959 ( 121594 ) on Friday January 21, 2005 @06:30PM (#11437345) Journal
    I call foul. This quote from the article was what got to me:

    Traditional systems bog down because they first store data on hard drives or in main memory and then query it, Stonebraker says.

    So they manage to do their analysis without even touching main memory? Nifty! What do they do, make it all fit in the L1 data cache? OK, maybe the guy was misquoted - I trust reporters about as far as I can throw them - but the whole thing just smells funny to me. I'm betting that the massive speedup they report is only for carefully selected, pre-groomed data sets. I agree that analyzing data as it comes in, rather than storing it up to recrunch later, is the smart thing to do, but that insight isn't the kind of breakthrough the article spins it as.
    • Re:I call foul (Score:5, Interesting)

      by ComputerSlicer23 ( 516509 ) on Friday January 21, 2005 @06:50PM (#11437501)
      Hmmm. My guess is that they have implemented something akin to SQL for data streams. You define a message format. Think of each message as a row in the table. The message format is the table schema.

      You have a "standing query". So you can ask things, like, what's the rolling average for the last 60 seconds for this ticker name. What's the minimum price for this commodity.

      You can ask to correlate things. Store the last 90 minutes worth of transactions on these commodities. Search for these types of patterns.

      It sounds like what they have done is build an OLAP cube that builds its dataset on the fly by processing messages coming over a streaming interface.

      It's much smarter to do that than to write every last transaction to disk and then query the transactions after the fact. That'd be the natural way to think about it if you used a relational database.

      Essentially, it sure sounds like he's written a generalized packet filter, that can compute interesting functions on the data. Think snort, think ethereal, think iptables, think policy routing. Now apply those kinds of technology to "The price of this stock", "the location of that soldier", where those values are embedded in a network packet frame somewhere.

      While each single application of this sounds trivial to implement, if he has done it in a generalized way that can keep pace with larger systems, bully for him.

      The irony of all this for me is that at a former job, I used to process medical data exactly this way. It sounds like the HL7 interface issues we used to have. You couldn't possibly take a full HL7 stream and process it, so you'd filter it down to just the patients that this department was interested in. Then only process messages about those patients.

      Even among those patients, there were rows you weren't interested in that had to be filtered out. You spent a bunch of time filtering, and re-filtering.

      We wrote the raw messages to disk, and spooled them to ensure we didn't miss messages due to database problems (if the database was down, you had to spool until the database came back up; it was unacceptable to miss patient records for database maintenance).

      Kirby

    • I agree... how in the hell is TCP/IP going to be faster than going to memory? This kind of sounds like a cross between "Cold-Fusion-will-change-the-world" hype and "Make-everything-Internet-based" hype to me.
  • by Wesley Felter ( 138342 ) <wesley@felter.org> on Friday January 21, 2005 @06:30PM (#11437353) Homepage
    If Roland had RTFA, he'd have realized that this StreamBase thing is not a relational database and does not do the job of a traditional relational database. The whole point is that it uses a different architecture to solve problems that don't map well to relational databases.
  • by bigtallmofo ( 695287 ) on Friday January 21, 2005 @06:33PM (#11437379)
    Before another dozen people post about how in-memory databases have been done before, please read the article. They're specifically not talking about in-memory or on-disk databases. They're reading the data and analyzing it in real time as it flows through the network. For everyone asking how they're going to back such data up, you don't need to back up data that is useless 1 second after it has flowed through your network.
    • by kpharmer ( 452893 ) * on Friday January 21, 2005 @07:05PM (#11437604)
      Right, and this solution has its own limitations in this context, namely that you're crunching your data in real time rather than reading it from a data store:

      1. if you decide to add a new analytic you have to start with new data - you can't deploy a new analytical component and run it against historical data.

      2. if your machine crashes - it takes all your accumulated analytical data along with it. Maintaining a distribution of activity calculated every 5 minutes over 90 days? Great, but after the server comes back up your data starts all over.

      3. if your analytical component needs to run against a lot of history each time (ex: total number of unique telephone numbers accessed by day, calculate rolling median) then you'll have to maintain that detail data in memory. As you can imagine, you can *easily* identify calculations that will exceed your memory. So, to tune, you'll be forced to keep your calculations to relatively recent data only.

      ken
      • Some financials company is using this software to check incoming stock feeds for problems. It takes thousands of messages per second, and if certain stocks don't come in at least once in 5 seconds, it counts a miss. For others it's 1 in 30 seconds.

        If a given provider is consistently slow, it sounds a low-level alarm against the provider, flagging that its data shouldn't be trusted because it's slow. Similarly for various markets, and probably other groupings too. It probably does other processing on the data.

        This data is
      • Why do distributions and such on the live data set? Stream through this system at high speed, and drop the data onto a data warehouse, whose *entire purpose in life* is to do historical crap.
    • by univgeek ( 442857 ) on Friday January 21, 2005 @10:05PM (#11438606)
      It's just that if you start querying AFTER you store it on disk, the I/O makes it much slower. So what you do is pick up some of the information from the flowing data, and some other system behind yours saves the data.

      Every time you get something interesting, you save that on disk too - but separately, into a much smaller db. This way state is also saved, and since state is going to be much smaller than the data, there will be no speed issues.

      Now the clever thing to do would be to link this flowing-state dbms (FSDBMS) to a standard rdbms working from the disk. Then you could verify the information from the FSDBMS, and ensure that things aren't screwed up. Also, based on patterns seen by the rdbms with long term data, new queries could be generated on the FSDBMS, allowing it to generate results from the data on the wire.

      Sounds like it would have applications primarily where response time is at a premium, and long history is not such a large component of the information.

      So in the case of military info, where a HumVee could be in trouble (a situ someone else has mentioned), the FSDBMS would raise the alarm, and some other process would then follow up and ensure that the alarm was taken care of.(The data itself would be backed up for future analysis, such as whether the query was correctly handled).

      Dynamic queries in such a situ could be - get the id of the closest Apache reporting in, or closest loaded bomber en-route to some other target. Then the alarm handling program would re-route the bomber/apache to the humvee for support. While querying the disk database may be time intensive, the FSDBMS would have delivered a sub-optimal FAST solution.

      So imagine the FSDBMS as a filter, giving different bits of information to different people. With the option that you could change the filter on the fly. And the filter could be complex, based on previous history etc., just like a DB query.
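
      A crude sketch of that filter-plus-small-DB split in Python (hypothetical - the trigger condition and schema are invented):

        import sqlite3

        events = sqlite3.connect("alerts.db")
        events.execute("CREATE TABLE IF NOT EXISTS alerts (unit TEXT, detail TEXT)")

        def interesting(msg):
            return msg.get("status") == "off_course"   # invented trigger condition

        def on_message(msg):
            if interesting(msg):                       # tiny fraction of the stream
                events.execute("INSERT INTO alerts VALUES (?, ?)",
                               (msg["unit"], msg["detail"]))
                events.commit()
            # everything else flows past without ever touching disk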
  • A Better Solution (Score:4, Informative)

    by logicnazi ( 169418 ) <gerdesNO@SPAMinvariant.org> on Friday January 21, 2005 @06:37PM (#11437402) Homepage
    Just to let everyone know, this is not the only product, or even the first product, to do this.

    Another option is EPL Server by iSpheres [ispheres.com]. Unlike the product mentioned here, which seems to be just some extra code thrown on top of a database, EPL Server is built from the ground up for this sort of application.
  • For sensor networks (Score:4, Interesting)

    by Anonymous Coward on Friday January 21, 2005 @06:41PM (#11437431)
    So this is mostly for sensor networks, where you have hundreds (or thousands) of small, cheap sensors sending data to a nearby controller. The controller doesn't need to store every bit of data it receives; it just calculates some prespecified queries (histograms, running sums, checking for trigger conditions, etc.) on them, and might store some small window of data for ad hoc queries. These systems are more similar to dataflow applications than traditional databases.

    Seems similar to his Aurora project... Stonebraker has a history of turning his university research projects into successful startups.
  • ACID? (Score:3, Insightful)

    by plopez ( 54068 ) on Friday January 21, 2005 @06:44PM (#11437459) Journal
    How do they deal with the durability aspect of ACID? If the system crashes without any data in a durable data store, it disappears forever. It sounds more like high-speed data analysis than a true database, which implies longer-term storage.
    • Re:ACID? (Score:3, Informative)

      by ray-auch ( 454705 )
      Difficult to tell from the vague article, but my guess is they don't, and they throw the data away after analysis. They might map some kind of database schema to the incoming data and provide some form of SQL for querying, but there's still no real database anywhere.

      So, throw out ACID (if problem domain doesn't require it) and get performance increases, wow! Probably they are now patenting it because no one had thought of that before...

    • I don't know how they're handling it, but personally I'd consider having the "streaming" database analysis machine on the same network as a file-based server, with an Ethernet card set to promiscuous mode sniffing the packets aimed at the file server. (With the switch set to route the packets to both machines, of course.) That way you could have multiple file servers (assuming your flow of data was so great that it could bog down a single server) and have the real-time server analyzing the incoming flow of
  • This reminds me of cyberpunk-esque network traffic. More specifically, I'm talking about those futures when bandwidth is so cheap that it becomes affordable (even necessary?) to have a constant flow of data coming and going from a datacenter.

    Seems to me that something like this would be incredibly useful for that: when the data from a couple seconds ago is now obsolete, you definitely need to be able to parse your queue as fast as you can.
  • > Streaming data without storing it on disk gives
    > them a tremendous speed advantage.

    There's a reason people generally don't do this, and that's because memory is expensive.

    > The company claims it can process 140,000
    > messages per second on a $1,500 PC, when its
    > competitors can only deal with 900 messages per
    > second.

    But I bet you its competitors can serve huge web-sites at 900 messages per second, whereas StreamBase can serve fits-in-memory-only web-sites at 140,000 messages per secon
  • by G4from128k ( 686170 ) on Friday January 21, 2005 @06:53PM (#11437522)
    Classifier Systems [ai.tsi.lv] are a genetic-algorithm analog for this type of streaming data/pattern analysis. With classifier systems, a stream of incoming messages interacts with a constantly evolving population of classifier rules and an internally changing pool of working messages to create a stream of outputs. A reward/feedback loop drives adaptation of the rule system, reinforcing it when it creates "good" outputs. The entire classifier system concept is analogous to the mammalian immune system, in the way that neural nets are analogous to brains and genetic algorithms are analogous to Darwinian evolution.

    With a high enough stream processing speed (using StreamBase's methods), classifier systems might be useful for AI/adaptive learning scenarios.
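
    For the curious, a toy classifier-system step might look like this in Python - heavily stripped down, just to show the match/reward loop:

      import random

      rules = [
          {"match": lambda m: m > 0, "action": "buy",  "strength": 1.0},
          {"match": lambda m: m < 0, "action": "sell", "strength": 1.0},
      ]

      def step(message, reward_fn):
          matched = [r for r in rules if r["match"](message)]
          if not matched:
              return None
          # matching rules compete in proportion to their strength
          rule = random.choices(matched,
                                weights=[r["strength"] for r in matched])[0]
          rule["strength"] += reward_fn(rule["action"])   # reinforce winners
          return rule["action"]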
    • Check out this diagram [mit.edu] of a classifier system. It's taken from The Computational Beauty of Nature [mit.edu]. The website isn't really up to date nowadays, but the full source code for everything in the book is available in both Linux and Windows downloads, and there's a Java applet of all the examples too.
      The material covered in the book is also still very relevant, and the book's a joy to read.
      You should buy it :^) Not astroturfing, I just really enjoyed the book myself.
  • This seems more like Message Oriented Middleware than a Database...

  • Why not use the echo port? Write data out to an echo port, then tee it off to your echo port. Then you can drink from the never-ending stream of data bouncing between your box and the remote box.

    Simple, lots of space, and secure... until a power failure.
  • If the article is correct, the only thing that distinguishes this DBMS from more traditional ones is that it doesn't serialize its writes to the disk. If that's true, I don't know what the selling point is. Both MS SQL Server and Oracle have the capacity to run a database in commitless mode, in which changes aren't recorded to the disk (they can optionally be serialized on a timed interval). The military applications they talk about being difficult with traditional DBMSes are already largely implemented today
  • More Information (Score:2, Informative)

    by adesai9 ( 563061 )
    The DB group @ Stanford is doing some stream projects as well. In case anyone is interested in more technical information, check out: http://www-db.stanford.edu/stream/
  • It kinda makes me think of what you'd get if you crossed sed with SQL.
  • For a minute there I thought they were trying to store a large database by just forwarding all the little bits of it around the net constantly, then grabbing them when they came back around, to save disk space... but that's a thought!

    This idea really doesn't seem that new, though. It's just real-time DSP on text-based data, with a front-end that pretends to be a database.
  • Given that we are able to get ~50k entries/second with tethereal output parsed via lex/yacc -> postgresql on a moderate PC, I would be more amazed at what level of analysis they are providing. Also, the data does tend to have some importance over time for those transient issues. Add a hash to your parser and you can just aggregate the data to reduce the load on the db.
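
    The aggregate-before-insert trick looks roughly like this in Python (hypothetical - the table and fields are invented):

      from collections import Counter
      import sqlite3

      counts = Counter()

      def on_packet(src_ip):
          counts[src_ip] += 1      # O(1) in-memory update, no DB round trip

      def flush(db: sqlite3.Connection):
          # one row per key instead of one row per packet
          db.executemany("INSERT INTO traffic (src, packets) VALUES (?, ?)",
                         list(counts.items()))
          db.commit()
          counts.clear()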
  • by boodaman ( 791877 ) on Friday January 21, 2005 @07:28PM (#11437738)
    OK, I get what they're trying to do, but my question: so what?

    Sooner or later you have to put something somewhere. Let's say you monitor a battalion in battle in real time. All of these messages are streaming in and being analyzed. Great. But now what? Say something triggers an alert. Well, what's tracking the status of the alert? Wouldn't you want to track the status of an alert saying "this Humvee is off course"? Wouldn't you want to track whether someone had acknowledged the alert, and what they did about it?

    And don't forget there are liability issues, historical issues, and more. You're a stock trader; all of these messages are coming in and being analyzed. You get an alert... one of your triggers tripped. You make a trade as a result, only to find out 30 minutes later that the trigger was WRONG and your trade was WRONG and you (or your company) are out $10 million. How do you prove that you made the trade based on the trigger like you were supposed to, and not because you f**ked up? The trigger, and the data that caused it to trip, is long gone. What do you do now?

    Eventually something has to be written (stored) somewhere, sometime. I guess I can see the need for summarizing data and only storing what StreamBase says is "important" but how would you know if everything was OK if the actual data driving everything was long gone?
    • 'Wouldn't you want to track the status of an alert saying "this Humvee is off course"?'

      It depends on your application; unless you're running a black box, the best course of action would be to relay the message to the driver of the Humvee.

      It's also real handy when you get asked to produce the data in court.
    • Just because you can find situations where this doesn't fit (not that I agree with your examples) does not mean it is a bad idea. In fact, storage of historical data would not be difficult: simply tie another system to this one and have it store data as it is asynchronously sent. At first this seems a little hokey - why use two systems when one would do the job - but you could set up a fairly nice alert-based historical system with something like this and a standard db.
      Here's my logic: if this system can ha
  • by X ( 1235 ) <x@xman.org> on Friday January 21, 2005 @07:36PM (#11437784) Homepage Journal
    This isn't streaming, it's standard message queuing. Most messaging products allow you to have non-persistent queues and allow you to extract data based on arbitrary queries. There's well over a decade's worth of products for doing this kind of stuff.

    I'm sure this is a great product, but both the submitter and the writer of the story seem to not grok what makes it great.
  • But the idea of a query engine in front of those messages is interesting.

    But then, what is LabVIEW? We've been processing live real-time data streams for years.

    I still don't get the scope of it. It seems on one hand to be a lot of the same. This idea that they need this type of software to process data from remote sensors doesn't click. I process data from remote sensors in real time all the time (no pun intended). There is no need to store it in a DBMS and then query it in order for the data to be use
  • by murr ( 214674 ) on Friday January 21, 2005 @08:42PM (#11438170)
    ... if the software costs $300K
  • When I was a project manager at ECONZ http://www.econz.co.nz/ [econz.co.nz] in 1999 I did a high level design for a product similar to this but we merged it with a relational database (Oracle in this instance).

    Other posts are correct that what is talked about here is a message queuing mechanism to some degree. What I had designed and built was what we called an event server.

    Basically how it worked was that you sent what SQL statement you wanted registered and then you got the initial data set back and then any change
  • This is old news (Score:3, Insightful)

    by Chitlenz ( 184283 ) <chitlenz.chitlenz@com> on Friday January 21, 2005 @11:09PM (#11438823) Homepage
    I remember seeing a RAM-caching scheme for Oracle a few years ago that made the same claims. In actuality, Microsoft, for all the love they'll get here, allows you to do this exact thing in a DataSet object within .NET. There are several solutions to this kind of problem, but the .NET way is the one I'll focus on here.

    The CommandBehavior.SequentialAccess behavior of the SelectCommand class in C# can be assigned in a way that allows binary objects (or other data) to 'stream' back and forth in real time within the relational DataSet objects created at app instantiation. Essentially, .NET allows for the same type of action by instantiating a 'database' within the client-side apps by building a schema of sorts, up to and including relational references such as foreign keys. At this point, we have a 'database' in RAM (a DataSet) that can now be resynched via ports to any other client or server using the same architecture.

    I do this today to provide a distribution network for doctors who need access from several places to a pool of active patient data. This is a data volume of several terabytes per location, so I assure you we are discussing the same scale here as the article.

    Consequently, the TPC benchmarks show 3,210,540 tpmC as the current posted record for AIX on a Big Blue machine, so their numbers are skewed if not wrong. Most processes, including those using binaries, can be proceduralized at the back end anyway, making the flow: call -> server -> stored_procedure -> return(), with all data living inside RAM, sorts happening in 'real time' (that is, from a pinned table into another location in memory at the server layer), and results returning into a dataset kept in RAM on the client.

    I don't really see anything revolutionary about all this; correct me if I'm missing something.

    -chitlenz

  • Forgive my ignorance, but last time I checked, "database" implied "persistence" of some sort. It's great that it can **process** 140,000 messages per second, but how many can it **store**?! Show me something that can store 140,000 items per second and I'll be duly impressed. Until then, let's compare apples with apples and keep everybody honest.
  • by mshurpik ( 198339 ) on Saturday January 22, 2005 @12:02AM (#11439138)
    This press release says a lot about analyzing streams and nothing about altering them. Most of the weight of a database is in manipulating a permanent record. INSERTs are slow; StreamBase may not have any.

  • by enewhuis ( 852237 ) on Saturday January 22, 2005 @12:11AM (#11439177)
    My first reaction is: he's late to the game. Check out www.kx.com. They've already done this, and this kind of thing has been used for years to analyze real-time stock and commodities trading data as the trades occur. I've deployed several systems that are essentially streaming databases like this. Or did I miss something here?
  • An RDBMS does so much more: availability and redundancy. What happens in a power outage? If you have several gigs in memory, your UPS had better be able to stand up long enough for everything to be backed up to disk. Primary memory is not the place for important data stores, unless you're trying to lose your job. Poof - a power outage, or a software patch leaks some memory, and you're screwed. RAID 10 Oracle, or all my critical data sitting on a $100 DIMM? I'll choose the slow Oracle.
