
Database Bigwigs Lead Stealthy Open Source Startup

BobB writes "Michael Stonebraker, who cooked up the Ingres and Postgres database management systems, is back with a stealthy startup called Vertica. And not just him: he has recruited former Oracle bigwigs Ray Lane and Jerry Held to give the company a boost before its software leaves beta testing. The promise: a Linux-based system that handles queries 100 times faster than traditional relational database management systems."

  • Partners (Score:5, Informative)

    by stoolpigeon ( 454276 ) * <bittercode@gmail> on Wednesday February 14, 2007 @05:22PM (#18016580) Homepage Journal
    The article mentions that redhat and hp are listed among their partners. i'm not surprised by red hat or informatica (another partner though they aren't mentioned in the article) but i was a little surprised by hp - since they have been trying to get the word out [hp.com] about their own data warehousing and bi stuff. i wonder what that indicates about how they regard this new player.
     
    also interesting is the wikipedia article on Michael Stonebraker [wikipedia.org] if you aren't already familiar with him.
  • by georgewilliamherbert ( 211790 ) on Wednesday February 14, 2007 @05:37PM (#18016808)
    KX is primarily in-memory. The competing column-oriented product is primarily Sybase IQ, which has been on the market for a while now.

  • Re:Column oriented? (Score:5, Informative)

    by AKAImBatman ( 238306 ) * <akaimbatman@gmaYEATSil.com minus poet> on Wednesday February 14, 2007 @05:43PM (#18016880) Homepage Journal

    A column oriented relational database? I'd like some more details on how that works.

    http://en.wikipedia.org/wiki/Column-oriented_DBMS [wikipedia.org]

    It's basically an optimization of the current data access patterns. Databases have been row-oriented for decades, because they evolved from fixed width flat files. Once we eliminated COBOL-style accesses to databases, the full row data became less important. It became far more important to be able to scan a column as fast as possible. For example:

    select * from names where lastname LIKE '%son'

    The above query might have an index available to find what it needs. But it's just as likely that the database will need to do a table-scan. Since table-scans involve looking through every record in the database, you can imagine that it would be faster to just load the lastname column rather than loading every row in the database just to discard 90% of that data.
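    A minimal sketch of that difference, assuming a toy in-memory table (the data, the function names, and the "fields read" counter are made up for illustration; a real engine counts disk blocks, not fields):

```python
# Hypothetical toy table: (id, firstname, lastname, city).
rows = [
    (1, "Anna", "Johnson", "Boston"),
    (2, "Bob", "Smith", "Austin"),
    (3, "Cara", "Anderson", "Denver"),
    (4, "Dev", "Patel", "Boston"),
]

def scan_row_store(rows):
    """Table scan over whole rows: every field of every row is touched."""
    fields_read = 0
    matches = []
    for row in rows:
        fields_read += len(row)
        if row[2].endswith("son"):  # lastname LIKE '%son'
            matches.append(row)
    return matches, fields_read

# The same data laid out one list per column.
columns = {
    "id": [r[0] for r in rows],
    "firstname": [r[1] for r in rows],
    "lastname": [r[2] for r in rows],
    "city": [r[3] for r in rows],
}

def scan_column_store(columns):
    """Scan only the lastname column, then fetch the rest for matches."""
    fields_read = 0
    positions = []
    for pos, lastname in enumerate(columns["lastname"]):
        fields_read += 1
        if lastname.endswith("son"):
            positions.append(pos)
    matches = []
    for pos in positions:
        # SELECT * still needs the other columns, but only for matching rows.
        matches.append(tuple(columns[name][pos] for name in columns))
        fields_read += len(columns) - 1
    return matches, fields_read

row_matches, row_cost = scan_row_store(rows)
col_matches, col_cost = scan_column_store(columns)
```

    Both scans return the same rows; the column layout simply touches fewer fields, and the gap widens as the table gets wider or the predicate gets more selective.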
  • Re:Column oriented? (Score:1, Informative)

    by Anonymous Coward on Wednesday February 14, 2007 @05:53PM (#18016978)
    I don't suppose it's just a regular SQL db with Excel's Pivot Tables run on it...

    Essentially it is - take each column and put it in a file, sequentially by row number. Queries are really easy (read record n out of each column-file) but inserts are rather difficult. Searches are quite efficient (you can jam a lot of data in a data block without all those other columns in the way) but updates aren't so much. Data compresses better because a column tends to be consistent in format and repetitive, so you can pack even more information in each data block (and search even faster, but make updating even slower). It's cool, as long as you don't change much data.
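
    To make the compression trade-off concrete, here is a rough sketch using run-length encoding on a single column. RLE is just one plausible scheme, picked for illustration; nothing here is taken from any particular product, and the helper names are hypothetical.

```python
from itertools import groupby

def rle_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]

def rle_count(runs, target):
    """Count matching rows by looking at runs, without decoding them."""
    return sum(length for value, length in runs if value == target)

def rle_insert(runs, position, value):
    """Naive insert at a row position: decode, insert, re-encode.

    This is the expensive path that makes updates unattractive.
    """
    decoded = [v for v, n in runs for _ in range(n)]
    decoded.insert(position, value)
    return rle_encode(decoded)

# A repetitive column, as is typical of warehouse data.
state = ["CA"] * 4 + ["MA"] * 3 + ["NY"] * 5
runs = rle_encode(state)
ma_rows = rle_count(runs, "MA")
after_insert = rle_insert(runs, 5, "TX")
```

    Twelve values shrink to three runs and the count never touches individual rows, but a single insert in the middle splits a run and forces a rewrite.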

    I can't find anything to suggest it, but I suspect this group has some tricks to make updates less painful, or maybe they're just shooting for the warehouse market. It'll never take over the OLTP market but they may find a niche.

  • Speculation (Score:5, Informative)

    by cartman ( 18204 ) on Wednesday February 14, 2007 @05:55PM (#18016998)

    I noticed that Stonebraker is the company founder. Stonebraker has contributed extensively to database research over the years.

    He's known for advocating the "shared-nothing" approach to parallel databases. The shared-nothing approach means that nodes in the parallel database don't attempt memory or cache synchronization, and each node has its own commodity disk array. In a shared-nothing parallel database, the data is "partitioned" across servers. So, for example, rows with ids 1-10 would be on the first server, 11-20 on the second server, etc. Executing the SQL query "select * from table where id < 1000" would send requests to multiple commodity servers and then aggregate the results. The optimizer is modified to take into account network bandwidth and latency, etc.
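
    A small sketch of that partition-and-aggregate pattern, assuming range partitioning by id. The Node and Cluster classes are hypothetical stand-ins for real servers, and the "network" is just a method call.

```python
class Node:
    """One shared-nothing server holding a contiguous range of ids."""

    def __init__(self, low, high):
        self.low, self.high = low, high
        self.rows = {}  # private storage; no other node can see it

    def insert(self, row_id, payload):
        self.rows[row_id] = payload

    def select_id_less_than(self, limit):
        return [(i, p) for i, p in self.rows.items() if i < limit]


class Cluster:
    """Coordinator that routes inserts and fans queries out to nodes."""

    def __init__(self, ranges):
        self.nodes = [Node(low, high) for low, high in ranges]

    def node_for(self, row_id):
        for node in self.nodes:
            if node.low <= row_id <= node.high:
                return node
        raise KeyError(row_id)

    def insert(self, row_id, payload):
        self.node_for(row_id).insert(row_id, payload)

    def select_id_less_than(self, limit):
        # Skip nodes whose whole range lies at or above the limit.
        targets = [n for n in self.nodes if n.low < limit]
        results = []
        for node in targets:
            results.extend(node.select_id_less_than(limit))
        return sorted(results), len(targets)


cluster = Cluster([(1, 10), (11, 20), (21, 30)])
for i in range(1, 31):
    cluster.insert(i, f"row{i}")

found, nodes_contacted = cluster.select_id_less_than(15)
```

    The query for id < 15 only needs the first two nodes; the third is never asked.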

    My guess on what they're doing: they're working on a shared-nothing parallel RDBMS with an in-memory client similar to Oracle TimesTen.

    There are a few drawbacks to the shared-nothing approach: 1) the RDBMS software is more difficult to implement; 2) since the data is partitioned, any transaction that updates tuples on more than one database node requires a two-phase distributed commit, which is much more expensive; and 3) some queries are more expensive because they require transmitting large amounts of data over the network rather than a memory bus, and in rare cases that network overhead cannot be eliminated by the optimizer.
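
    For the second drawback, a bare-bones sketch of a two-phase commit shows where the extra cost comes from. It assumes reliable messaging and leaves out the logging, timeouts, and recovery that a real protocol needs; the class and function names are hypothetical.

```python
class Participant:
    """A database node taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote yes only if the local changes can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return (committed?, number of messages exchanged)."""
    messages = 0
    votes = []
    for p in participants:      # phase 1: collect votes
        votes.append(p.prepare())
        messages += 2           # prepare request + vote
    decision = all(votes)
    for p in participants:      # phase 2: broadcast the outcome
        if decision:
            p.commit()
        else:
            p.abort()
        messages += 2           # decision + acknowledgement
    return decision, messages


ok_nodes = [Participant("node1"), Participant("node2")]
ok_result = two_phase_commit(ok_nodes)

bad_nodes = [Participant("node1"), Participant("node2", can_commit=False)]
bad_result = two_phase_commit(bad_nodes)
```

    Even in this toy model a two-node transaction costs eight messages and two full round trips, versus none for a single-node commit.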

    The advantage, of course, is linear scalability by adding commodity hardware. No more need for $3M+ boxes.

  • Re:buzzword enabled (Score:5, Informative)

    by c0nst ( 655115 ) on Wednesday February 14, 2007 @05:59PM (#18017032)
    Here you go:
    Stonebraker, Mike; et al. (2005). C-Store: A Column-oriented DBMS [mit.edu] (PDF). Proceedings of the 31st VLDB Conference.
    From the paper:
    Among the many differences in its design are: storage of data by column rather than by row, careful coding and packing of objects into storage including main memory during query processing, storing an overlapping collection of column-oriented projections, rather than the current fare of tables and indexes, a non-traditional implementation of transactions which includes high availability and snapshot isolation for read-only transactions, and the extensive use of bitmap indexes to complement B-tree structures.
    :-)
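    For anyone unfamiliar with the bitmap indexes mentioned in that excerpt, here is a generic sketch of the idea. It is not C-Store's implementation, only the textbook technique, with Python integers standing in for bit vectors and made-up sample data.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set if row i has it."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def rows_from_bitmap(bitmap):
    """List the row numbers whose bits are set."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows

# Two hypothetical low-cardinality columns of the same table.
region = ["east", "west", "east", "north", "west", "east"]
status = ["open", "open", "closed", "open", "closed", "open"]

region_index = build_bitmap_index(region)
status_index = build_bitmap_index(status)

# WHERE region = 'east' AND status = 'open' becomes one bitwise AND.
east_and_open = region_index["east"] & status_index["open"]
matching_rows = rows_from_bitmap(east_and_open)
```

    Combining predicates is a bitwise operation over compact vectors, which is why bitmaps suit read-mostly columns with few distinct values.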
  • Re:Good..If it works (Score:1, Informative)

    by Anonymous Coward on Wednesday February 14, 2007 @06:15PM (#18017204)
    Personally, I think the breakthrough for managing data warehousing volumes of data with real-time response is going to come from NitroSecurity's NitroEDB [nitrosecurity.com]. I saw a demo they gave running on a single commodity laptop which delivered query responses thousands of times faster than Oracle, on a data set with billions of records. They're working with MySQL [mysql.com] creating an interface to use NitroEDB as a storage engine as well.
  • Given that... (Score:5, Informative)

    by CodeShark ( 17400 ) <ellsworthpc@NOspAm.yahoo.com> on Wednesday February 14, 2007 @06:26PM (#18017312) Homepage
    MonetDB [monetdb.cwi.nl] is similarly column-oriented AND open source, and appears to clean the clock of most of the major commercial and open source databases for huge data set queries (see the benchmarks at axyana.com [axyana.com] for an example), where is Vertica's market advantage supposed to be?


    By which I mean: while Vertica is obviously well-researched and well funded as a start-up, MonetDB is well-researched, already benchmarked, and available now. So why would I wait to invest my time, energy, and $$ in a proprietary future product rather than spend that time and energy developing market leadership in my chosen corporate area in the present?

  • Re:buzzword enabled (Score:5, Informative)

    by perfczar ( 1064296 ) on Wednesday February 14, 2007 @06:54PM (#18017616)
    Buzzwords, yes, but they have a little bit of meaning left. Grid-enabled means that it works in a "shared nothing" environment, that you can use a networked cluster of commodity computers if one isn't enough to hold the data, and so on. This is in contrast to using one big huge box (big computer, big storage array, or whatever). Of course many databases are similarly grid-enabled. Column-oriented means that data is stored on disk by column; this makes it fast to process a subset of columns that touch lots of rows, as is typical in data warehouse applications. This is a key architectural difference among databases; Oracle, DB2, etc., are "row stores", while Sybase IQ, Vertica, etc. are "column stores". Note: I work for Vertica Systems.
  • Re:buzzword enabled (Score:5, Informative)

    by ChrisA90278 ( 905188 ) on Wednesday February 14, 2007 @06:54PM (#18017618)
    Column oriented means it can read data in from one column from the disk without pulling in all the other bytes in the row. Possibly much reduced I/O bandwidth usage, depending on the query. (Kind of like if you turned the normal file structure sideways.)

    Grid enabled - This means the DBMS can make use of a large distributed group of computers and potentially have access to a huge amount of computing power. The typical DBMS runs on at best a multi-processor server. This is kind of like a DBMS server running on a "seti at home" type network.

    Going solely by the developer's reputation, this could be a big deal. He is not some random hacker. He is a well known university professor who has several times in the past led projects that have been revolutionary and turned the field around. His ideas are widely used. Still, "100X faster" is a big claim. Lots of smart people have been working on DBMSes for many years; a two-order-of-magnitude improvement is an "I will have to see it to believe it" type claim.

    I'm using PostgreSQL to handle some telemetry data right now. If my 45 minute run times can be reduced to seconds, I'll be happy.

  • Re:open source? (Score:3, Informative)

    by perfczar ( 1064296 ) on Wednesday February 14, 2007 @07:00PM (#18017680)
    Vertica is not open source. Not sure where the confusion came from.

    Note: I work for Vertica.
  • by russryan ( 981552 ) on Wednesday February 14, 2007 @07:09PM (#18017774)
    See http://en.wikipedia.org/wiki/Bigtable [wikipedia.org] for a description of Google's column oriented database.
  • by perfczar ( 1064296 ) on Wednesday February 14, 2007 @07:10PM (#18017784)
    The Vertica business model is to sell a database engine (software to store and query data). Clearly use of standard interfaces is important, otherwise nobody would be able to make use of the product (which really ends up being a component of a larger system or strategy) without going to a heap of trouble. So of course Vertica has:

    • A JDBC driver
    • An ODBC driver
    • An interactive SQL client
    • A growing list of tested integrations with other software

    Note: I work for Vertica
  • by ramakant ( 256472 ) on Wednesday February 14, 2007 @07:37PM (#18018052)
    This looks like it will be a commercial version of C-Store, the column-oriented DBMS developed by Michael Stonebraker and MIT:
    - Web site: http://db.lcs.mit.edu/projects/cstore/ [mit.edu]
    - Wikipedia Entry: http://en.wikipedia.org/wiki/C-Store [wikipedia.org]
    They distribute the source with a fairly liberal license, so this looks like something the open source community could pick up and run with.
  • Re:Given that... (Score:5, Informative)

    by perfczar ( 1064296 ) on Wednesday February 14, 2007 @07:46PM (#18018116)

    Here are a few of the technical reasons one might choose Vertica over Monet; I'll not get into business issues.


    Vertica is designed for large amounts of data, and is optimized for disk based systems. Monet does benchmarks against TPC-H Scale Factor 5 (30 million records, an amount which would fit in main memory) running on Postgres; Vertica does TPC-H Scale factor 1000 (6 billion records) against commercial row stores tuned by people who do such work to make a living.

    Vertica runs on multi-node clusters, allowing the cluster to grow as the amount of data grows, while Monet doesn't scale to multiple machines.

    There are numerous differences in the transaction systems, update architecture, tolerance of hardware failure, and so on, that make Vertica better suited to the enterprise DW market.


    Note: I work for Vertica
  • by ramakant ( 256472 ) on Wednesday February 14, 2007 @08:03PM (#18018290)
    Here's a good comparison of the two approaches:
    http://glinden.blogspot.com/2006/05/c-store-and-google-bigtable.html [blogspot.com]
    (per my post below, Vertica is a commercial version of MIT C-Store: http://db.lcs.mit.edu/projects/cstore/ [mit.edu] )
  • by cartman ( 18204 ) on Wednesday February 14, 2007 @08:26PM (#18018560)

    Gee, I don't know anyone who's been successfully doing this for years...

    I'm certainly not suggesting these guys are the first to implement a shared-nothing parallel RDBMS. IBM has offered DB2 parallel edition which is shared-nothing for some time now. However IBM wants a ton of money for parallel edition, and DB2 has some legacy stuff which might not be useful in a shared-nothing architecture. An open-source shared-nothing RDBMS might be compelling.

    I think the shared-nothing approach is the best one for an open-source RDBMS offering. Organizations which use open source will almost certainly want to use commodity, open hardware. Shared-nothing will allow them to do that.

  • by jfroelich ( 1022159 ) on Wednesday February 14, 2007 @09:15PM (#18018996)
    Is that you do not scale as well to a large number of columns. To access a set of X records with 100 columns, you have 100 asynchronous I/O calls to the separate column stores. I sell analytical software that does just this, and it is not a technical detail that should just be ignored. In some regards the single-file, row-oriented system has less I/O overhead. We have come up with some ways to reduce the file system overhead, but while it is small, it is noticeable, more so on systems not designed to have a large number of simultaneously open files. All that really happened is that it switched part of the bottleneck to rely less on the product architecture and more on the system architecture. Whether you think that is wise, well, that's up to you.
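
    A back-of-the-envelope model of that trade-off, assuming one file per column, 8 KB blocks, 100 columns of 8 bytes each, and records that sit next to each other on disk. All of these numbers are invented for illustration; real systems cache, batch, and compress.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def row_store_reads(num_records, records_per_block):
    """Block reads to fetch contiguous whole records from one file."""
    return ceil_div(num_records, records_per_block)

def column_store_reads(num_records, num_columns, values_per_block):
    """Block reads when each needed column lives in its own file."""
    return num_columns * ceil_div(num_records, values_per_block)

# 800-byte records: 10 per 8 KB block. 8-byte values: 1024 per block.
RECORDS_PER_BLOCK = 8192 // 800
VALUES_PER_BLOCK = 8192 // 8

# Case 1: fetch 50 full records (all 100 columns).
small_row = row_store_reads(50, RECORDS_PER_BLOCK)
small_col = column_store_reads(50, 100, VALUES_PER_BLOCK)

# Case 2: scan 1,000,000 records but only 3 of the 100 columns.
big_row = row_store_reads(1_000_000, RECORDS_PER_BLOCK)
big_col = column_store_reads(1_000_000, 3, VALUES_PER_BLOCK)
```

    Under these assumptions the row store wins the wide, small fetch (5 reads against 100) and the column store wins the narrow, large scan (2,931 reads against 100,000), which is the bottleneck shift described above.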

    BTW, first post, I am no longer an eavesdropper, yay

    Josh
  • by Virtual_Raider ( 52165 ) on Wednesday February 14, 2007 @10:29PM (#18019562)

    Still "100X faster" is a big claim. Lots of smart people have been working on DBMSes for many years; a two-order-of-magnitude improvement is an "I will have to see it to believe it" type claim.

    Oh ye of little faith, here i present thee with The Facts. Or a paper at the very least: One size fits all? a Benchmark [mit.edu]

  • by Jayson ( 2343 ) <jnordwick@gmailOPENBSD.com minus bsd> on Thursday February 15, 2007 @01:46AM (#18020624)
    One of the benefits of column oriented DBs is that tables have an ordering, and that ordering can be exploited in queries. SQL doesn't give a good way to exploit it. Column DBs do allow SQL, but they also have other native languages that people tend to use.
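
    As a rough illustration of exploiting that ordering, assume a table kept sorted by timestamp and stored as parallel columns. The column names and data are hypothetical, and this is plain Python rather than any column DB's native language.

```python
from bisect import bisect_left, bisect_right

# Parallel columns of one table, kept in timestamp order.
timestamps = [100, 105, 105, 110, 120, 130, 130, 145]
prices = [10.0, 10.5, 10.4, 10.8, 11.0, 10.9, 11.2, 11.5]

def range_positions(sorted_column, low, high):
    """Find the slice of rows with low <= value <= high by binary search."""
    start = bisect_left(sorted_column, low)
    stop = bisect_right(sorted_column, high)
    return start, stop

def last_price_as_of(ts):
    """Most recent price at or before ts: an ordered 'as of' lookup."""
    pos = bisect_right(timestamps, ts) - 1
    return prices[pos] if pos >= 0 else None

start, stop = range_positions(timestamps, 105, 130)
window = prices[start:stop]     # one contiguous slice, no scan
as_of_125 = last_price_as_of(125)
before_start = last_price_as_of(50)
```

    The range query becomes two binary searches plus a contiguous slice, and the "as of" lookup is one binary search; both are clumsy to express in plain SQL, which treats tables as unordered sets.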
