Slashdot Log In
Database Bigwigs Lead Stealthy Open Source Startup
Posted by
ScuttleMonkey
on Wed Feb 14, 2007 04:17 PM
from the hope-it-isn't-vaporcorp dept.
from the hope-it-isn't-vaporcorp dept.
BobB writes "Michael Stonebraker, who cooked up the Ingres and Postgres database management systems, is back with a stealthy startup called Vertica. And not just him, he has recruited former Oracle bigwigs Ray Lane and Jerry Held to give the company a boost before its software leaves beta testing. The promise — a Linux-based system that handles queries 100 times faster than traditional relational database management systems."
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Full
Abbreviated
Hidden
Loading... please wait.
Partners (Score:5, Informative)
also interesting is the wikipedia article on Michael Stonebraker [wikipedia.org] if you aren't already familiar with him.
Re:Partners (Score:5, Insightful)
It's called "hedging your bets". If the little company doesn't work out, no big deal. If it does, then HP is in a position to either benefit from contractual relations, acquire it, or squash it. Whichever happens to be their fancy.
Parent
Column oriented databases (Score:2, Interesting)
How does this differ than KX System's kdb (www.kx.com) which IIRC is similar in that way; and is alredy in use at many if not most major financial institutions (see their customer list)?
Re:Column oriented databases (Score:5, Informative)
Parent
When Will This Be Ported? (Score:4, Funny)
Re: (Score:3, Funny)
Where by mainstream, you mean useless?
Everyone, we are moving to ASP now (Score:4, Funny)
You're bound to get some strange looks... (Score:5, Funny)
Parent
buzzword enabled (Score:4, Insightful)
What does that mean?
If anything.
Re:buzzword enabled (Score:5, Informative)
Stonebraker, Mike; et al. (2005). C-Store: A Column-oriented DBMS [mit.edu] (PDF). Proceedings of the 31st VLDB Conference.
From the paper:
Among the many differences in its design are: storage of data by column rather than by row, careful coding and packing of objects into storage including main memory during query processing, storing an overlapping collection of columnoriented projections, rather than the current fare of tables and indexes, a non-traditional implementation of transactions which includes high availability and snapshot isolation for read-only transactions, and the extensive use of bitmap indexes to complement B-tree structures
Parent
Re:buzzword enabled (Score:5, Funny)
Uh, a spreadsheet?
Parent
Re:buzzword enabled (Score:5, Informative)
Parent
Re:buzzword enabled (Score:5, Informative)
Grid enabled - This means the DBMS can make use of a large distributed group of computers and potentially have access to a huge amount of computing power. The typical DBMS runs on at beat a multi-processor server. Thi sis kind of like a DBMS server running a a "seti at home" type network.
Going solely by the developer's reputation, this could be a big deal. He is not some random hacker. He is a well known university professor who has several times in the past lead projects that have been revolutionary and turned the field around. His ideas are widely used Still "100X faster" is a big claim. Lots of smart people have been working on DMBSes for many years, a two order of magnitude improvement is a "I will have to see it to believe it" type claim
I'm using PostgreSQL to handle some telemetry data right now. If my 45 minute run times can be reduced to seconds, I'll be happy.
Parent
Big claims are backed (Score:4, Informative)
Oh ye of little faith, here i present thee with The Facts. Or a paper at the very least: One size fits all? a Benchmark [mit.edu]
Parent
Re:buzzword enabled (Score:4, Insightful)
1. Make up lots of 100-column+ tables
2. Select one column from each table
3. If you're IO bound, you should now see about a 100:1 increase
However, most real data models don't work that way. Usually you put stuff that's useful at the same time in the same table, in which case it probably won't make much of a difference.
Parent
Column oriented? (Score:2)
Seriously, though, the target market for grid-based high volume data-warehousing type dbs are a lot smaller than the MySQL crowd. Not as big a deal as it seems, but it'd be nice to have if you needed it.
Re: (Score:3, Insightful)
Re: (Score:3, Interesting)
Re:Column oriented? (Score:5, Informative)
http://en.wikipedia.org/wiki/Column-oriented_DBMS [wikipedia.org]
It's basically an optimization of the current data access patterns. Databases have been row-oriented for decades, because they evolved from fixed width flat files. Once we eliminated COBOL-style accesses to databases, the full row data became less important. It became far more important to be able to scan a column as fast as possible. For example:
select * from names where lastname LIKE '%son'
The above query might have an index available to find what it needs. But it's just as likely that the database will need to do a table-scan. Since table-scans involve looking through every record in the database, you can imagine that it would be faster to just load the lastname column rather than loading every row in the database just to discard 90% of that data.
Parent
Re: (Score:3, Insightful)
Yes! No! Sort of!
Indexes only optimize some types of queries. To get the absolute maximum performance out of your database, you have to make sure that there is a specific index for each query you run, and that your indexes are properly rebuilt and optimized for least-time search. Suffice it to say, this rarely happens in the real world. So there's almost always some scanning, even after the indexes narrow things down a bit. By going with a column-oriented storage design, the scan
Re:Column oriented? (Score:5, Insightful)
Column oriented is easy. Imagine a database as a set of tables, each of which has rows of data records, in organized columns (column 1 = "User name", column 2 = "User ID", column 3 = "Favorite slashdot admin", etc).
Normal row-oriented databases store records which have a row of the data: "User name", "User ID", "Favorite slashdot admin" for user row #12345.
Column oriented databases store records which have a column of the data: "User name" for user rows 1-100,000; "User ID" for user rows 1-100,000; etc.
Updates are faster with row-oriented: you access the last record file and append something, or access an intermediate record file and update one "row" across.
Searches are faster with column-oriented: you access the record file for "Favorite slashdot admin" and look for entries which say "Phred", and then output the list of rows of data which match. Instead of going through the whole database top to bottom for the search, you just search on the one column. If you have 100 columns of data, then you look through 1/100th of the total data in the search. To pull data out, you then have to look at all the column files and index in the right number of records, but that goes relatively quickly.
Indexes are useful, but column-oriented is more efficient in some ways. You don't have to maintain the indexes, and can just automatically search any column without having indexed it, in a reasonably efficient manner.
Column-oriented also lets you compress the data on the fly efficiently: all the records are the same data type (string, integer, date, whatever) and lists of same data types compress well, and uncompress typically far faster than you can pull them off disk, so you can just automatically do it for all the data and save both speed and time...
Parent
Re: (Score:3, Insightful)
Re: (Score:3, Interesting)
Awesome (Score:2, Interesting)
With comodity hardware getting faster and cheaper by the minute, having a system that can handle a higher than average load with optimized software is, imho, a winner.
I'm sure everyone here can add some anecdotal evidence to how they had a heavy-hardware, database serving machine die on them because of some software bug.
This is one of the reasons I've been looking forward to ZFS. Hopefully the DB guru's will take the best of what's good about software, drop the legacy c
But does it save the children? (Score:2)
Perfect timing (Score:4, Interesting)
Loading a million random records out of a set of one hundred million records is an enormously difficult task for an RDBMS on commodity hardware (e.g. magnetic rotating disks). This is a more common task than you would think. ORM systems backed by an RDBMS, such as Ruby on Rails, Django, Hibernate, have exactly this requirement and will only demand more as these models become more mainstream. Think about what search engines have to do: find millions among billions, all to show a user a dozen.
These problems are solvable now, but there's a lot of duplication of effort going on that a smart database vendor could solve for us.
Doesn't "stealthy" require some stealth anymore? (Score:4, Insightful)
This is some new Network World definition of "Stealthy", apparently...
Re:Doesn't "stealthy" require some stealth anymore (Score:3, Funny)
Network World is a trade rag. To them, anything not advertised is stealthy. Especially since they want to motivate people to think "oh no, I don't want to be stealthy, that means unknown! quick buy some advertising!"
Best of luck (Score:5, Insightful)
What happened to Gallium Arsenide replacing silicon? What happened to solid state memory completely repalcing magnetic disks? Technology field is littered with such fiascos.
Re: (Score:3, Insightful)
Let me give you an example.
Suppose you have a table with, say, 100 billion rows. You want to create a report which provides aggregated data on a very large subset of a few columns of table. With a tradition RDBMS, you have to read through every single one of the 100 billion rows to aggregate the data (indeces don't help if you are going to be searching through a sizeable
Patent Problems (Score:3)
Why does a company promising Linux solutions... (Score:3, Interesting)
Re: (Score:3, Interesting)
Speculation (Score:5, Informative)
I noticed that Stonebraker is the company founder. Stonebraker has contributed extensively to database research over the years.
He's known for advocating the "shared-nothing" approach to parallel databases. The shared-nothing approach means that nodes in the parallel database don't attempt memory or cache synchronization, and each node has its own commodity disk array. In a shared-nothing parallel database, the data is "partitioned" across servers. So, for example, rows with id's 1-10 would be on the first server, 11-20 on the second server, etc. Executing the SQL query "select * from table where id < 1000" would send requests to multiple commodity servers and then aggregate the results. The optimizer is modified to take into account network bandwidth and latency, etc.
My guess on what they're doing: they're working on a shared-nothing parallel RDBMS with an in-memory client similar to Oracle TimesTen.
The are a few drawbacks to the shared-nothing approach: 1) the RDBMS software is more difficult to implement; 2) since the data is partitioned, any transaction that updates tuples on more than one database node requires a two-phase distributed commit, which is much more expensive; and 3) some queries are more expensive because they require transmitting large amounts of data over the network rather than a memory bus, and in rare cases that network overhead cannot be eliminated by the optimizer.
The advantage, of course, is linear scalability by adding commodity hardware. No more need for $3M+ boxes.
Re: (Score:3, Informative)
I'm certainly not suggesting these guys are the first to implement a shared-nothing parallel RDBMS. IBM has offered DB2 parallel edition which is shared-nothing for some time now. However IBM wants a ton of money for parallel edition, and DB2 has some legacy stuff which might not be useful in a shared-nothing architecture. An open-source shared-nothing RDBMS might be compelling.
I think the shared-nothing approach is the best one fo
I've been waiting for something like this ... (Score:3, Insightful)
Classic RDBMSes are crutches. A forced-upon neccesitiy we have to put up with for our app models to latch on to real world hardware and it's limitations. A historically grown mess with an overhead so huge it's insane. With a Database PL and 30+ dialects of it from back in the days when we flew to the moon using a slide-ruler as primary means of calculation.
If what they claim is true, these guys are probably finally ditching the omnipresent redundant n-fold layers user and connection management in favour of a lean system that at last does away with the distinction of filesystem and database and data access layer. Imagine a persistance layer with no SQL, no extra user management, no extra connection layer, no filesystem under it and native object suport for any PL you wish to compile in.
I tell you, finally ditching classic RDBMSes is *long* overdue, they're basically all the same ancient pile of rubble, from MySQL up to Oracle. If these guys are up to taking on this deed (or part of it) and they get finished when solid-state finally relieves our current super-slowpoking spinning metal disks on a broad scale we'll feel like being in heaven compared to the shit we still have to put up with today.
I wish these guys all the best. They appear to have the skills to do it and the authority to emphasise that todays RDBMSes and their underlying concepts are a relic of the past.
My 2 cents.
Given that... (Score:5, Informative)
By which I am asking that while Vertica is obviously well-researched and well funded as a start up, MonetDB is well-researched, already benchmarked and available now.. So why would I wait to invest my time, energy, and $$ in a proprietary future product rather than the time and energy, etc. to develop market leadership in my chosen corporate area in the present?
Re:Given that... (Score:5, Informative)
Here are a few of the technical reasons one might choose Vertica over Monet; I'll not get into business issues.
Vertica is designed for large amounts of data, and is optimized for disk based systems. Monet does benchmarks against TPC-H Scale Factor 5 (30 million records, an amount which would fit in main memory) running on Postgres; Vertica does TPC-H Scale factor 1000 (6 billion records) against commercial row stores tuned by people who do such work to make a living.
Vertica runs on multi-node clusters, allowing the cluster to grow as the amount of data grows, while Monet doesn't scale to multiple machines.
There are numerous differences in the transaction systems, update architecure, tolerance of hardware failure, and so on, that make Vertica better suited to the enterprise DW market.
Note: I work for Vertica
Parent
Google uses this approach (Score:3, Informative)
Re: (Score:3, Informative)
http://glinden.blogspot.com/2006/05/c-store-and-g
(per my post below, Vertica is a commercial version of MIT C-Store: http://db.lcs.mit.edu/projects/cstore/ [mit.edu] )
This is a commercial version of MIT C-Store (Score:4, Informative)
- Web site: http://db.lcs.mit.edu/projects/cstore/ [mit.edu]
- Wikipedia Entry: http://en.wikipedia.org/wiki/C-Store [wikipedia.org]
They distribute the source with a fairly liberal license, so this looks like something the open source community could pick up and run with.
Stupid question: Still SQL? (Score:3, Interesting)
SQL would inefficient (Score:3, Informative)
Re: (Score:3, Funny)
Re: (Score:3, Funny)
Re: (Score:3, Funny)
Microsoft is backing them?
Re: (Score:3, Funny)
Re: (Score:3, Insightful)
Seriously, you tested MySQL vs. other databases with "out of the box" setups? MySQL isn't a real database when running MyIsam engine, you simply cannot compare that with anything else. And on top of that, try do a proper insertion in MySQL, one single transaction with a few millions of rows and see how well that does. Oh and did you ever stop to think about _why_ MySQL does perform so much faster on that test? Try doing it on a InnoDB table with standard setup, even at 600k rows it slows to a cra
Re: (Score:3, Informative)
Note: I work for Vertica.
Re:Sounds great but.. (Score:4, Informative)
Note: I work for Vertica
Parent