Database Clusters for the Masses
Posted by michael on Wed Apr 30, 2003 10:15 AM
from the no-slashdot-isn't-using-it dept.
grugruto writes "Database clustering is no longer the privilege of a few high-end commercial databases; open-source solutions are striking back! ObjectWeb, an Apache-like group, has announced the availability of Clustered JDBC (or C-JDBC). C-JDBC is open-source software that implements a new concept called RAIDb (Redundant Array of Inexpensive Databases). It is simple: take a bunch of MySQL or PostgreSQL boxes, choose your RAIDb level (partitioning, replication, ...) and you obtain a scalable and fault-tolerant database cluster."
279 comments
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
WOOHOO! (Score:2, Funny)
(http://www.semifamous.com/)
Non-Java Implementations? (Score:5, Interesting)
(http://slashdot.org/)
So, the question is: is anyone working on anything like this for Perl, C, or generic implementations?
Re:Non-Java Implementations? (Score:4, Insightful)
(http://www-rohan.sdsu.edu/~cleaver/software/)
Exactly -- given that the RAIDb itself sits elsewhere, I can't imagine it would be that hard to take the source itself and make a Perl DBD::Module out of it.
If only I had the spare time...
Re:Non-Java Implementations? (Score:4, Insightful)
(http://www.thedruid.co.uk/ | Last Journal: Monday June 21 2004, @06:14AM)
Seriously though, this may reduce the costs for some users, but I don't think it will get wide uptake. Most people will not want to give up the deniability you can have with large corps like Oracle. Oracle is a 'safe' solution for the purchaser with their ass on the line, which is most corporate users these days.
And the more entrepreneurial users will not usually have the hardware to use this properly anyway.
Anyone who is financing this lot will want proven standards.
Just my flawed £0.02
Sigh - Looks like I have my work cut out for me... (Score:5, Funny)
(http://www.greatmindsworking.com/)
Re:Non-Java Implementations? (Score:5, Informative)
(http://slashdot.org/)
When I said "generic implementation" I meant "an implementation which doesn't require your programs be written in a particular language." Which is probably a bit of a pipe dream, you'd still need some sort of glue code (ODBC, JDBC, DBD, etc). But, as was alluded to above, I was trying to beat the Beowulf comment when I asked my question.
Re:Non-Java Implementations? (Score:4, Interesting)
Please don't take my previous post as a flame, I completely agree with your point. What I was whining about was the fact that Java doesn't play nice with system libs: it is 'easy' to import other libs, but exporting Java classes to other languages is ... :)
Let's say that few people feel like embedding a JVM in their C app.
hmmm (Score:4, Interesting)
(http://slashdot.org/)
If only replication was so trivial (Score:4, Insightful)
(http://www.mk.w.pl/)
Running many databases is easy. Organizing and serializing replication is hard, even if one has distributed transactions handy, which is not the case here. But let's read their code...
Performance? (Score:5, Interesting)
(http://www.zserf.com/)
I wonder how much slower my query will be when the data is spread across several machines. I'd imagine that a few complex queries that aren't correctly optimized would bring this system to its knees rather quickly.
Re:Performance? (Score:5, Informative)
(http://2soc.net/)
There are better ways to improve the performance of a database: horizontal partitioning, federated servers, etc.
This would be very cool if there was a generic implementation; we build many Microsoft SQL clusters and just the hardware requirements for an MSCS cluster easily exceed $50k, let alone the licensing... as an MCDBA I'd consider an open source solution if I could use it as a back-end to an ASP/VB.NET application, just to save the licensing $$ for consulting! ; )
This is a threat to the big vendors (Score:5, Insightful)
(http://www.livejournal.com/users/jackwilliambell/ | Last Journal: Wednesday November 12 2003, @12:20PM)
But Oracle shops are dealing with expensive boxes they would love to replace, not to mention expensive Oracle licenses. Often the only reason they use Oracle (other than Oracle salesmen licking their buttholes) is because only Oracle has the horsepower to meet their requirements. Give them a cheaper alternative with the same capabilities and they will bail out faster than you can say 'Geronimo'.
Expect Larry Ellison to start talking about the dangers of using Open Source software now...
Re:This is a threat to the big vendors (Score:4, Insightful)
(http://www.marotti.com/ | Last Journal: Thursday February 15 2007, @01:48PM)
What does proprietary software have that Open Source doesn't? Insurance.
The best way to knock over oracle is to start up a company that supports open source for a fee (which is cheaper than running oracle for a year).
Re:This is a threat to the big vendors (Score:4, Informative)
(Last Journal: Sunday April 29 2007, @08:26PM)
Josh, know what you're talking about before you post. MySQL [mysql.com] (the company which does the vast majority of development of MySQL) offers a variety of levels of support and consulting, regardless of the number of systems that you admin. For $48,000/year, you get:
Does Oracle match that for the price?
Re:This is a threat to the big vendors (Score:5, Insightful)
(http://www.towardsafreeworld.com/ | Last Journal: Thursday June 26 2003, @03:38AM)
Prior to Oracle taking off in a big way people used to say:
Then Larry E. shamelessly put together a cool SQL database which copied every major innovation IBM had made and added in a few more for good measure. He also cut the price by a third. IBM's database customers deserted in droves; after all, if this Oracle thing turned out to be shit, they could always get IBM to come clean up the mess. It turned out, though, that Oracle wasn't and isn't shit.
That does not mean that Oracle is immortal and will always be top of the pile. Postgres now replicates almost all of the major features and is proven in the reliability stakes, and tools like this are only going to make it more likely that corporate data departments will dip their toes into the Free software waters; after all, if it turns out to be shit, they could always get Oracle to come clean up the mess.
Re:This is a threat to the big vendors (Score:4, Informative)
(http://www.fitzg.com/ | Last Journal: Wednesday May 07 2003, @03:06AM)
If you want to cluster Oracle, use Oracle RAC (Real Application Clusters). It's based on Parallel Server so is mature enough to put forward for consideration... and even then it might be eschewed from above. Cheap databases are not going to ring the bells of the people with the say-so simply because Oracle (and DB2 etc) are proven over the years, and the cost of losing your data because you went for the cheap option is going to lose your company a lot of money, and you your job!
Technically better, cheaper and all those good things does not mean better for a business. Databases are predominantly used for *business*, and as such a *business* reason is used when choosing one over another, not technical reasons.
Re:This is a threat to the big vendors (Score:4, Insightful)
That's exactly the point. Who needs all the features of Oracle? Maybe the IRS or Mastercard, but the vast majority of Oracle users are getting just one feature: the Oracle reputation that their marketing has built.
And with all those features comes the big problem of managing them: no matter how small the application is, once you choose Oracle you need a team of experienced DBAs to correctly and reliably configure the system.
Quick thru the docs... (Score:5, Informative)
1. A Controller. It looks as tho a single controller is used by the clients to communicate to the various RAID'd dbs. I'm sure there can be multiple controllers since there would be little point to make some db's redundant, yet the access to them not. Still looking into this.
2. And also, it looks as tho the default port is 1099 - RMI. If you have, for a web app, your EJBs and web app local to that container, that might not be a problem. However, I happen to have my EJB server on its own box and this might very well cause probs. I think it said you could specify your own ports, but I haven't seen any examples in the docs yet of this being the case. Also, still looking.
A few other things exist as well which are in the docs as known limitations:
* XAConnections
* Blobs
* batch updates
* callable statements
These could be serious issues for some. My last project used CLOBs/BLOBs, batch updates and callable statements, so this would rule that out. Of course, all the db stuff was strictly tied to Oracle, so I think that would rule this out regardless.
All in all tho, this looks like a good start. As my current project progresses, clustered dbs will become more and more of an issue. I've looked into some other projects out there for Postgres, but nothing yet really satisfactory. I think this is a good step in the right direction - for Java developers. It'll be interesting to watch.
Of course, if mysql had replication worth a damn (Score:1, Flamebait)
(http://www.hyperlogos.org/ | Last Journal: Wednesday July 18, @08:19PM)
Where are the benchmarks that they speak of ? (Score:5, Insightful)
(http://www.red82.com/ | Last Journal: Monday April 19 2004, @11:00AM)
Supposedly, "This new version has been successfully tested with Tomcat, JOnAS, MySQL and PostgreSQL. Excellent results have been obtained with the TPC-W and RUBiS benchmarks."
Don't get me wrong, I like the idea, and I have been wanting something like this for years, but I sure would like to _see_ the test results, even if they are preliminary.
How about a meta-database adapter? (Score:5, Interesting)
(Last Journal: Saturday January 29 2005, @08:51PM)
Here's an example of an application: I have a database-driven Web application [slashdot.org] that allows my onsite clients to register network services for openings in the firewall. Another software component probes the registered hosts for daemon version information and records it in the database, so that we can send out alerts when security holes are discovered in particular versions. I use PostgreSQL on Debian and Solaris. Independently of my work, our networking office has a Microsoft SQL Server database of IP addresses, MAC addresses, and physical switch ports and jack numbers.
What I'd like to do is mount both my database and the networking office's database into some sort of "meta-database" -- analogous to mounting filesystems from two different hosts via NFS -- and run SQL queries that span both data sets. I wouldn't expect to be able to write to this conjoined database -- locking would be a nightmare -- but being able to SELECT across the two sets would be incredibly valuable.
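A minimal sketch of the read-only "mount two databases and SELECT across them" idea. It is only an analogy: it uses two SQLite files and SQLite's `ATTACH DATABASE` as stand-ins, not PostgreSQL or Microsoft SQL Server, and the table names, columns, and sample rows (`services`, `ports`, `jack`) are hypothetical.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def make_db(path, ddl, insert_sql, rows):
    """Create a small SQLite file standing in for one independent database."""
    with closing(sqlite3.connect(path)) as conn:
        conn.execute(ddl)
        conn.executemany(insert_sql, rows)
        conn.commit()


def select_across(services_path, network_path):
    """Attach the second database and run one SELECT spanning both data sets."""
    with closing(sqlite3.connect(services_path)) as conn:
        # "net" is an arbitrary schema alias for the attached file.
        conn.execute("ATTACH DATABASE ? AS net", (network_path,))
        return conn.execute(
            "SELECT s.ip, s.daemon, n.jack "
            "FROM services AS s JOIN net.ports AS n ON n.ip = s.ip "
            "ORDER BY s.ip"
        ).fetchall()


def demo():
    """Build two throwaway databases (hypothetical data) and join them."""
    with tempfile.TemporaryDirectory() as tmp:
        services = os.path.join(tmp, "services.db")
        network = os.path.join(tmp, "network.db")
        make_db(services, "CREATE TABLE services (ip TEXT, daemon TEXT)",
                "INSERT INTO services VALUES (?, ?)",
                [("10.0.0.5", "sshd"), ("10.0.0.9", "httpd")])
        make_db(network, "CREATE TABLE ports (ip TEXT, jack TEXT)",
                "INSERT INTO ports VALUES (?, ?)",
                [("10.0.0.5", "J-114"), ("10.0.0.7", "J-220")])
        return select_across(services, network)
```

Only the host present in both data sets comes back from `demo()`. Doing the same across two different server products would need a federation layer that speaks both protocols, which this sketch does not attempt.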
More info on transactions (Score:4, Interesting)
Why? (Score:2, Insightful)
supposed to be at RDBMS level (Score:5, Insightful)
(Last Journal: Thursday August 23 2001, @09:23PM)
I mean, this is neat and all, but I really don't want to have to use this interface just so that I can cluster my database. You're much better off placing clustering functions within the database itself. Then you can access the data by any method (ODBC, native libraries, hell even with the provided command line interface).
Take a look at how MS SQL Server performs clustering sometime. Everything (and I mean EVERYTHING) is performed via triggers and tsql. All the clustering setup does is set up a bunch of known working trigger scripts to propagate the data. You can even edit them to your liking afterwards if you wish. Now I'm not saying that MS's solution for clustering is the cat's ass. Personally, I think it is kind of hackish, but then again I believe that clustering should be something you simply turn on, and shouldn't be able to fuss with. Realistically, I can't think of any good reason to change the cookie cutter tsql scripts that perform the clustering, so I only see the ability to modify them as a potential way to fsck it up (that being an obviously bad thing).
Clustering really isn't that hard to implement. I'm pretty surprised that MySQL and Postgres don't have better support for it. Especially Postgres, since transaction support is really the one big key that makes clustering possible. Maybe no one has really had an itch to make it happen yet. Hopefully it will happen soon, since I'd love clustering to be another argument for why OSS databases can play with the big kids just as easily.
"Shared-Nothing Architecture" (Score:2, Insightful)
Know what? There are a ton of deep issues beyond just making the different partitions transparent to the application level. Think about joins across partitions for a sec...
Slightly Offtopic.... (Score:3, Insightful)
(http://aol.com/)
My view is that it may be difficult to migrate OSes or even hardware, but it's almost darn impossible to migrate existing databases.
A database is the most fundamental and most cared-about aspect of a major business. There is a lot of time, effort and MONEY spent to incorporate it into the company.
Lots and lots of critical business applications are written using the proprietary extensions of these vendors. Is it very easy to migrate this code?
This may be interesting for a future pilot project, but if you expect businesses to change their database vendors... that's not going to happen very soon.
How does clustering improve performance? (Score:2, Interesting)
(http://www.babe-test.com/ | Last Journal: Wednesday September 17 2003, @11:59AM)
How do you join one table to another when they are on two separate boxes?
Well, I know how to actually use SQL to join two tables from two separate databases. But what is actually happening inside the RDBMS at the low level? Does one just bring over the entire other table? How does it use indexes?
Seems to me this is at best a reference implementation that may actually degrade performance.
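One textbook answer to the "does it bring over the entire other table?" question is a semi-join: send only the join keys to the other box, ship back just the matching rows, and hash-join them locally. The sketch below illustrates that idea with in-memory lists; it is not a description of what C-JDBC or any particular RDBMS does, and `remote_fetch`, `distributed_join`, and the sample tables are hypothetical.

```python
def remote_fetch(remote_table, key, wanted):
    """Stand-in for a query sent to the other box: WHERE key IN (wanted).

    On a real server this filter could use an index on `key`, so only
    matching rows would cross the network.
    """
    return [row for row in remote_table if row[key] in wanted]


def distributed_join(local_table, remote_table, key):
    """Join two tables that live on different boxes (simulated as lists)."""
    # Step 1: collect the distinct join keys present locally.
    wanted = {row[key] for row in local_table}
    # Step 2: ship only the matching remote rows, not the whole table.
    shipped = remote_fetch(remote_table, key, wanted)
    # Step 3: build a hash index on the shipped rows and probe it.
    index = {}
    for row in shipped:
        index.setdefault(row[key], []).append(row)
    return [{**local_row, **remote_row}
            for local_row in local_table
            for remote_row in index.get(local_row[key], [])]
```

If both tables are large and most keys match, this saves little and the network transfer dominates, which is the performance worry raised above.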
their site is not slashdotted... (Score:2)
(http://neirol.wordpress.com/ | Last Journal: Tuesday November 26 2002, @02:42AM)
Merge it with J2EE spec (Score:1)
(http://www.trajano.net/ | Last Journal: Thursday April 15 2004, @02:17AM)
It may also have the advantage of using the transactional, load balancing and clustering facilities of the J2EE container as well.
DB Clusters of the world, unite! (Score:2, Funny)
Finally, my grandmother can have that database cluster she has been bugging me about.
Also new! (Score:5, Funny)
(Last Journal: Monday November 08 2004, @10:00AM)
RAID -- Redundant Array of Inexpensive Developers
RAID 0
Multiple developers work on the same project but none of them has any idea what the other is doing at the same time. One developer failing (caffeine dehydration, severe electrostatic shock, sex, etc) will cause the entire project to screw up and become a mess.
RAID 1
Extreme Programming.
RAID 2
Inefficient way to keep track of what developers are doing. For every 10 developers, 4 are needed to keep track of them and recover any error by the aforementioned 10 while they don't work together at all. Level of efficiency comparable to a modern government.
RAID 3
Equal to RAID 2, except all responsibility for checking the code is now granted to one person. The rest have been budget-cut away. A bit more effective but considering people still don't cooperate, not too good.
RAID 4
Equal to RAID 3, except people are finally working together now. Kinda efficient and fast, except it all still relies on that one person who checks the data.
RAID 5
Everyone knows what everyone else is doing, they all work perfectly together and they can easily miss one person because of that.
Limitations (Score:1, Insightful)
4.4. Current limitations
The C-JDBC driver currently does not support the following features:
* XAConnections,
* updatable ResultSets,
* callable statements (stored procedures),
* Blobs,
* batch updates,
* multiple controller failover is subject to controller support for distributed virtual databases,
* JDBC 3.0 features.
Fine-grained caching question (Score:2, Interesting)
Can someone who understands C-JDBC better than I do explain what this might mean? Sounds to me like they are replacing a feature of CMP by doing this, which is not necessarily something that would be "useful with EJB entity beans" if I understand it right (unless maybe they are referring to folks using EJB 1.0?). That is, the container already handles cache-invalidation at a fine-grained level. Perhaps there is a scenario I am not imagining where it would be useful to have this at the database level also... thoughts?
This is not that novel of an idea (Score:2)
Essentially, this seems to be that front-end piece which abstracts the calling app from which server it is connecting to, and can dynamically point that app at another server. I'm sure it will be a handy module for anyone who doesn't want to write their own logic for dynamically determining a connection to a database.
However, the cost of writing that bit of code is much lower than the overhead of maintaining all those database servers (heterogeneous replication? ugh). So sure, this is helpful, but anyone with enough wherewithal to set up and maintain a set of synchronized database servers probably has enough sense to be able to set up application logic to utilize those servers anyway.
good idea--just not new (Score:2)
It's good that these are becoming available in open source form, but the concept is not new at all. IBM and Oracle both have had commercial versions for a while (I suppose the "inexpensive" part is new).
Thorough rundown (Score:5, Informative)
After actually reading the documentation, here's my informed take on this:
1) In its current incarnation, it's only useful for very very simple database access. No transactions, no blobs, etc. Basically if you're just storing some simple weblication tables and doing single-statements against them for selects/updates (no big cross-table transactions), you can use it.
2) It's JDBC only. Perhaps someone could port the concept to ODBC though.
3) There's a new middle tier between the JDBC driver and the database itself, which is the bulk of their code. This tier actually re-implements some database constructs like recovery logging, query caching, etc. Of course this is necessary, as trying to do replication from the client-code side alone would be impossible (what do you do when one of 3 DB mirrors goes offline for an hour? have every JDBC client cache the requests and replay them later, hoping those clients are even still around later?)
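A toy sketch of why the recovery log in point 3 belongs in the middle tier: the controller records every write once, and a mirror that was offline is caught up by replaying the log from where it stopped. This is a simplified illustration, not C-JDBC's actual code; `Backend`, `Controller`, and their methods are hypothetical names, and writes are plain values rather than SQL statements.

```python
class Backend:
    """One database mirror, reduced to a list of applied writes."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.rows = []
        self.applied = 0  # index of the next log entry this mirror needs


class Controller:
    """Middle tier: logs each write, then applies it to every live mirror."""

    def __init__(self, backends):
        self.backends = backends
        self.log = []  # recovery log, kept by the controller, not by clients

    def write(self, row):
        self.log.append(row)
        for backend in self.backends:
            if backend.online:
                self._catch_up(backend)

    def bring_online(self, backend):
        """Re-enable a mirror and replay everything it missed."""
        backend.online = True
        self._catch_up(backend)

    def _catch_up(self, backend):
        while backend.applied < len(self.log):
            backend.rows.append(self.log[backend.applied])
            backend.applied += 1
```

Because the log lives in one place, it does not matter whether the clients that issued the writes are still around when the mirror returns.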
For some applications and some companies, in its current state this is a godsend - but it's not a general solution yet. Making it ODBC (or even better, having the front of it emulate a native postgresql or mysql listener) would broaden its applicability.
Supporting transactions would be a big win too, although I'm not sure how feasible this is - I think at that point they may as well just write their own new database engine which is parallel from the start, seeing as they'll be re-implementing in their cluster tier almost everything the database server does except for actual physical storage.
Still, it's nice to see that someone did this and made it work - and for a lot of simple databases behind java apps it's all you really need.
PostgreSQL has all the transaction support in place already, so of all the free DBs out there it would seem they have the best shot at doing their own native parallelism, if they would just get it done someday.
Tried this before... its a tough sell (Score:2)
(http://www.cafepress.com/chpwn)
1st... multiple points of failure. By increasing the number of databases you're increasing the potential points of failure. What features are there to automatically back up data? If the data is spread randomly across the dbs and one of the drives or servers dies, what failover is there? Will the other databases take over? In a cost/risk analysis, is this really the cheapest way?
2nd... Is any speed increase from multiple databases going to be more than the speed increase from just upgrading the database server? More/faster disks, more processors etc. Sticking to one machine allows you to use the fault tolerance built into the RAID controller or the server itself. You could argue that once you got to the fastest hardware you need to go with more machines, but at that point you might need to look at your application. A quad Xeon 2.2 GHz with GBs of memory and a NetApp disk array is going to be powerful enough for a lot of apps.
3rd... Is this really faster? With simple SQL queries it might be, but what about complex joins etc? Since this sits in front of the dbs, what about stored procedures etc?
The only real application that I could see this for is a small ecommerce website that needs to have millions of potential products to sell (electronics supply store etc). Something where the data needing replication is static and is imported.
And as far as eliminating the need for a high-priced Oracle DBA, someone able to support an array of 8-10 MySQL databases using this technology is going to be both high-priced and hard to find.
dream of a language-agnostic system like this (Score:1)
Our idea was to write it in C, and make it proxy connections to mysql, postgres etc. In other words it would speak and understand the wire protocols of each database it supported. It would apply replication (etc) logic as it passed messages through to the real databases.
We imagined a type of pipeline which you could configure, and messages would move through that pipeline being processed by different modules... ie you could enable replication, logging, and perhaps various other types of processing, as options for each user/db or something like that.
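A rough sketch of the configurable pipeline described above, with modules chained in front of the real database. It is written in Python for brevity rather than C, messages are plain SQL strings instead of wire-protocol packets, and every name here (`logging_stage`, `replication_stage`, `build_pipeline`) is hypothetical.

```python
WRITE_VERBS = ("INSERT", "UPDATE", "DELETE")


def logging_stage(log):
    """Module that records every message before passing it on."""
    def stage(message, forward):
        log.append(message)
        return forward(message)
    return stage


def replication_stage(replicas):
    """Module that copies write statements to each replica (lists here)."""
    def stage(message, forward):
        if message.lstrip().upper().startswith(WRITE_VERBS):
            for replica in replicas:
                replica.append(message)
        return forward(message)
    return stage


def _wrap(stage, next_handler):
    # Bind one stage to the handler that follows it in the chain.
    return lambda message: stage(message, next_handler)


def build_pipeline(stages, backend):
    """Chain the enabled stages in order, ending at the real backend."""
    handler = backend
    for stage in reversed(stages):
        handler = _wrap(stage, handler)
    return handler
```

Enabling or disabling a feature per user or per database then amounts to choosing which stages go into the list passed to `build_pipeline`.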
Such a system would be useful for any client without modification (such as PHP, perl, C programs and of course the relevant JDBC drivers).
Well we didn't go very far with the idea... Ok we didn't go anywhere with it... But I still felt like sharing.
Not there yet... (Score:2, Insightful)
(http://jfroebe.livejournal.com/)
The shared disk array (RAID, etc.) is just a part of implementing HA.
My recommendation is for the developers to take a look at how it is implemented in the enterprise DBMSs (Sybase, Oracle, MS SQL Server, DB2) first.
jason
This is a very very old idea.... (Score:2)
(http://www.cafepress.com/chpwn)
First, they should move more and more features of the DB to the controller layer. The goal should be that you can call plain SQL statements and complex joins directly. Later, you could even have stored procedures execute there and use the cluster as if it were one db.
Then, they should try and work it so that you make low level calls to the DB layer; this would save time in having the separate DBs compile the SQL statements.
Next, make some kernel mods ala Tux to make the DB calls faster to execute, ie make the DB machines pure DB handlers.
Once you do that, you might want to consider moving the separate dbs into one rack, maybe making them share power supplies and disk arrays to cut down the points of failure.
As well, have one handler computer handle all incoming connections which would appear to be a stand alone Database. Thus every database instance would appear to be a
It would be powerful to separate the hardware/database tie to allow the Admin to manage which servers would have which partitions, letting them span a partition across a new server if it got too big. And let the partitions automatically move away from bad servers using parity information stored on a separate server.
Once you finish developing all that... you should realize that's what Oracle already does. Oracle isn't some Microsoftish company that developed a product absent any competition so quality, reliability and performance weren't job #1. Oracle has long competed against IBM, Sybase, Microsoft etc and pretty much has the DB thing down.
The only use I could see for this tech is in a small ecommerce web site that needed to search millions of records (electronics supply store). This would be for when a MySQL table would start to bog down due to too many records. Even then, having multiple machines should be the very last resort.
As was said before that is really cool but! (Score:2)
(http://www.codepunk.com/)
to handle all the traffic.
Now that would be COOL!
overrated (Score:1)
Splitting up tables across db's seems a little rough, esp since you have to run the query on more than one db and then merge the results into a single result set. This means that you have to do your own sort. It gets even more fun if you use limit and offset in your query. It just gets weird after a while. I say wait for postgres-R, it'll be much more of a kick in the pants for Oracle.
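A small sketch of the merge step described above, assuming each partition can return its own rows sorted by the same key. The point it illustrates is that every partition has to supply its first `offset + limit` rows, because the merging layer cannot know in advance which partition the skipped rows come from. `merged_page` is a hypothetical helper, and partitions are simulated as in-memory lists.

```python
import heapq
from itertools import islice


def merged_page(partitions, key, limit, offset):
    """Emulate ORDER BY key LIMIT limit OFFSET offset across partitions."""
    # Each partition sorts locally and returns at most offset + limit rows;
    # fewer than that is not safe, since all skipped rows might live on one box.
    partials = [sorted(rows, key=key)[: offset + limit] for rows in partitions]
    # k-way merge of the already sorted partial results.
    merged = heapq.merge(*partials, key=key)
    # Apply the offset and limit only after merging.
    return list(islice(merged, offset, offset + limit))
```

For example, `merged_page([[5, 1, 9], [2, 8], [7, 3, 4]], key=lambda x: x, limit=3, offset=2)` returns `[3, 4, 5]`. Note that the deeper the offset, the more rows each partition must hand over, which is part of why paging gets awkward.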
ACID? (Score:1)
Sure, I love clustering boxes as much as the next guy, but the overhead is tremendous if the rdbms doesn't support it, let alone the data integrity questions it brings up.
I wouldn't get too excited (Score:3, Informative)
Furthermore, to scale up, systems generally take advantage of striping. At the IO level that means striping across multiple disks (modern convention is to stripe across all!). In a parallel database one usually stripes a single table across multiple nodes for parallel query processing. While it is possible with C-JDBC to put table X on node A and table Y on node B, I don't see any provision for striping the data. It will be very difficult to use your hardware efficiently in this scenario.
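For illustration, one simple way to stripe a single table across nodes is to hash each row's key and take it modulo the node count. This is a generic sketch, not something C-JDBC is documented here as providing; `node_for`, `stripe`, and the choice of CRC32 as the hash are assumptions.

```python
import zlib


def node_for(key, node_count):
    """Pick a node for a row key with a stable hash (CRC32 is an arbitrary choice)."""
    return zlib.crc32(str(key).encode("utf-8")) % node_count


def stripe(rows, key, node_count):
    """Distribute the rows of one table across node_count nodes."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[node_for(row[key], node_count)].append(row)
    return nodes


def parallel_count(nodes):
    """Each node counts its own stripe; the coordinator adds the partial counts."""
    return sum(len(stripe_rows) for stripe_rows in nodes)
```

A lookup by key goes to exactly one node, while a scan or aggregate such as `parallel_count` can run on all stripes at once, which is where the parallel query speedup comes from.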
If you are going to go through the trouble of implementing a complete query processor (that can handle jobs larger than ram), a full update/query scheduler (lock manager), and a journalling mechanism that can (somehow) even maintain atomic transactions (even in the face of multiple failures) then why not just build your own database. This system might be useful in certain rare cases but I wouldn't use it except possibly for replication.
JJ
Oh fuck it's great.. (Score:1)
Re:Nothing beats Oracle RAC (Score:1, Funny)
Re:That's nice... (Score:1)
A Beowulf will just allow you to disperse workload, either at the process or at the thread level with MOSIX.
In fact an HPC cluster like MOSIX can result in reduced uptimes. If a machine has a 1% failure rate then what is the failure rate of 100 machines in parallel?
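The question above can be worked out under two stated assumptions: the "1% failure rate" is the probability that one machine fails within some fixed period, and machines fail independently of each other. The function names are hypothetical.

```python
def p_any_failure(p_single, machines):
    """Probability that at least one of `machines` fails in the period.

    This is the relevant figure when the job needs every machine up.
    """
    return 1.0 - (1.0 - p_single) ** machines


def p_all_fail(p_single, machines):
    """Probability that every machine fails in the period.

    This is the relevant figure when any single surviving replica is enough.
    """
    return p_single ** machines
```

With `p_single = 0.01` and 100 machines, `p_any_failure` is about 0.634, so a cluster that needs all of its nodes is far less reliable than one box. With two full replicas, `p_all_fail(0.01, 2)` is 0.0001, which is the argument for replication rather than plain workload dispersal.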
The question all boils down to how granular your recovery process is. For a desktop you need very fine granularity and few current systems provide this. I think Tandem provides this by using special hardware and kernel patches for NT.
Re:I'm 100% Confident (Score:2, Informative)
(http://www.objectweb.org/)
When you look at Oracle pricing policy, you can have Oracle RAC for the price of just Oracle (+ a free RAIDb), which is already a 50% discount!
Re:Hehehe... (Score:1)
+ Clueless
--------------
Script Kiddy
How did this post ever even get a score of 2?
Re:Hehehe... (Score:2, Funny)
*laughter*
Re:Hehehe... (Score:1)
That was a great post. Good stuff.