News for nerds, stuff that matters

Zvents Releases Open Source Cluster Database Based on Google

Posted by ScuttleMonkey on Fri Feb 08, 2008 06:19 PM
from the surprised-it-took-this-long dept.
An anonymous reader writes "Local search engine company Zvents has released an open source distributed data storage system based on Google's released design specs. 'The new software, Hypertable, is designed to scale to 1000 nodes, all commodity PCs [...] The Google database design on which Hypertable is based, Bigtable, attracted a lot of developer buzz and a "Best Paper" award from the USENIX Association for "Bigtable: A Distributed Storage System for Structured Data" a 2006 publication from nine Google researchers including Fay Chang, Jeffrey Dean, and Sanjay Ghemawat. Google's Bigtable uses the company's in-house Google File System for storage.'"
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • I'll check back when they get out of alpha.
  • ..designed to scale to 1000 nodes, all commodity PCs...
    I'm just curious whether anyone has had any experience with these types of systems using commodity PCs. How is performance, and how well does it scale as you increase the number of nodes?
    • I don't really have any first-hand knowledge (outside of network rendering at a pretty small scale), but the concept is definitely sound; it's the same reason why software uses "threads" and why processors now have more than one core...

      As for scaling, it would scale at the same rate as non-commodity computers: if you have 999 computers of equal performance and then add another one, you could expect roughly a 0.1% change overall. However, it's largely based on what sort of controllers you use, the same as h
    • I was contracted to make the firebird db able to work with OpenSSI. Quite frankly, it worked beautifully, and it didn't require that much work. The issue I was faced with was that the storage had to be remote, which wasn't necessarily a problem per se, because nothing ever failed while I was around. Now if the power went out on the storage server and a few nodes at random, I really have no idea the havoc it would have caused... I was told my job was done and they didn't have a need for any sort of fault tol
    • Re: (Score:3, Informative)

      So, Hypertable runs on top of Hadoop. We don't use Hypertable (or HBase), so I can't comment on those. I can share some of our experiences with Hadoop, though. I think it is safe to say that it scales quite well for the vast majority of people who need it. Let's deep dive for a bit...

      Hadoop keeps all of its file system metadata in memory on a machine called the name node. This includes information about block placement and which files are allocated which blocks. Therefore, the big crunch we've seen is th
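To make the name node description above concrete, here is a toy sketch of a metadata server that holds everything in memory. It is not Hadoop's code or API: the class `ToyNameNode`, its methods, the round-robin placement, and the replication factor of 3 are all assumptions chosen for illustration.

```python
class ToyNameNode:
    """Hypothetical in-memory metadata server (not Hadoop's implementation)."""

    def __init__(self, data_nodes, replication=3):
        self.data_nodes = list(data_nodes)
        # Cannot place more replicas than there are data nodes.
        self.replication = min(replication, len(self.data_nodes))
        self.file_blocks = {}      # path -> list of block ids
        self.block_locations = {}  # block id -> list of data node names
        self._next_block_id = 0

    def add_file(self, path, num_blocks):
        """Allocate blocks for a file and record where each replica lives."""
        block_ids = []
        for _ in range(num_blocks):
            block_id = self._next_block_id
            self._next_block_id += 1
            # Assumed placement policy: simple round-robin over data nodes.
            start = block_id % len(self.data_nodes)
            replicas = [
                self.data_nodes[(start + i) % len(self.data_nodes)]
                for i in range(self.replication)
            ]
            self.block_locations[block_id] = replicas
            block_ids.append(block_id)
        self.file_blocks[path] = block_ids
        return block_ids

    def locate(self, path):
        """Return (block id, replica nodes) pairs for a file."""
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]

    def metadata_entries(self):
        """Count of in-memory entries: one per file plus one per block."""
        return len(self.file_blocks) + len(self.block_locations)
```

In this toy model, every file and every block adds an entry that must fit in one machine's memory, regardless of how many bytes the blocks themselves hold on the data nodes.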
      • Yes, as a matter of fact I did read it. But I'm kind of curious as to people's first hand knowledge...
        • IIRC, PayPal and Google both use commodity PCs in clusters like this. Google uses something similar to the above (duh), and PayPal uses a 3-tiered, multi-PC setup (database, caching layer, and application-side layers, respectively).
  • Really, this time, a full fucking beowulf cluster (that runs linux!) is available to /.ers. No. Fucking. Way.

    Alright, I know it's only storage and not processing power, but that was inevitable.
    • Re: (Score:3, Informative)

      Really, this time, a full fucking beowulf cluster (that runs linux!) is available to /.ers. No. Fucking. Way.

      What?

      There is no particular piece of software that defines a cluster as a Beowulf. Commonly used parallel processing libraries include MPI (Message Passing Interface) and PVM (Parallel Virtual Machine). Both of these permit the programmer to divide a task among a group of networked computers, and recollect the results of processing.

      "Beowulf (computing)." Wikipedia, The Free Encyclopedia. 28 Jan 2008, 12:25 UTC. Wikimedia Foundation, Inc. 9 Feb 2008 <http://en.wikipedia.org/w/index.php?title=Beowulf [wikipedia.org]

      • Wikipedia lists no less than eight Linux distributions designed specifically for building Beowulf clusters.
        Actually, I'm aware of that. You could say that I overreacted.
  • Project page: http://www.hypertable.org/ [hypertable.org]
    Zvents: http://www.zvents.com/ [zvents.com]
  • how useful is DHT? (Score:4, Insightful)

    by convolvatron (176505) on Friday February 08 2008, @06:42PM (#22355818)
    i've been interested in this question for the last few years. how much do people value the ability to use a relational language and transactional consistency, or for most of these uses are these things just historical artifacts?
    • by moderatorrater (1095745) on Friday February 08 2008, @06:51PM (#22355910)
      It's useful for ridiculously large data sets, like the entire internet. I know that medium-sized stores (Overstock, etc.) use a relational database, and anything with less data than that is probably going to use a relational database. However, for extremely large data sets and certain repetitive, non-dependent loops (such as, say, looping through every website for a search), this can be useful. At least for now, relational databases are more useful overall, but tools like this have their place, and as data sets grow faster than real computational power, they'll be used more and more.
    • by ShieldW0lf (601553) on Friday February 08 2008, @07:00PM (#22355962) Journal
      i've been interested in this question for the last few years. how much do people value the ability to use a relational language and transactional consistency, or for most of these uses are these things just historical artifacts?

      In the 7 years I've been working in the industry, I've never delivered a single project that I would trust to a non-ACID database. Ever. And I doubt I ever will. If you want something that will generate some marketing material at high speed, and if it fails, who cares, well, use MySQL. If you want to do something that can handle a million pithy comments and if some of them get lost in the shuffle, who cares, well, that's fine too. Use whatever serves fast. If you're running Google, and it doesn't matter if a node drops out because there is no "right" answer to get wrong in the first place as long as you spit out a bunch of links, well, these sorts of non-resilient systems are fine.

      Personally, I've never done projects like that. In my projects, if the data isn't perfect always and forever, it's worse than if it had never been written. Its very existence is a liability, because people will rely on it when they shouldn't, for things that can't get by with "close".

      So yes. Transactional consistency and a solid relational model are pretty much mandatory, and not going anywhere soon. The idea that they might be replaced by technology such as this is laughable.
      • Re: (Score:3, Informative)

        So yes. Transactional consistency and a solid relational model are pretty much mandatory, and not going anywhere soon. The idea that they might be replaced by technology such as this is laughable.

        Relational databases don't implement the relational model correctly anyway. As for transactional consistency, you can get that on top of many different kinds of stores (including file systems); relational databases have no monopoly on that.
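As one small illustration of the point about file systems, here is a sketch of an all-or-nothing file update built on the write-to-temp-then-rename pattern. It covers only atomicity of a single file replacement, not full transactions, and the helper name `write_atomic` is hypothetical. It assumes a file system where renaming within one directory is atomic, which is what Python's `os.replace` relies on.

```python
import os
import tempfile


def write_atomic(path, data):
    """Replace the file at `path` with `data` (bytes), all or nothing."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays
    # on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # readers see the old or the new file
    except BaseException:
        # On any failure, drop the partial temp file; the original is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A reader opening `path` at any moment gets either the complete old contents or the complete new contents, never a half-written mix.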
      • Re: (Score:1, Offtopic)

        corporations constantly put bullshit data into those acid-compliant databases and then believe them forever as if they were true.

        already, we have the Dick-Shrub using such databases to terrorize the populace with expansion planned.
      • In my thirty-plus years in the industry, I have never seen a disk drive which could support transactional storage. The notion that you're going to write data in a manner which is more reliable than the underlying store is laughable. Even if you check the integrity of the underlying record, how do you know that your integrity check actually tested against the data you'll return next time? You don't; all you know is that the odds that you get back something else are negligibly small -- not zero, but low enough t
    • for most of these uses are these things just historical artifacts?
      They are not. There are still some places where you can find them in use.
      • oh i agree completely. check out datalog
      • I have been using ZODB for a couple of years now, and one thing that bothers me with systems that store objects directly instead of "dehydrated" representations of them is that when the underlying code for the object changes significantly, all sorts of weird things occur.

        I kind of like dehydrating/serializing objects to a simpler representation when persisting them. This uncomfortable step is nice because it shoehorns the data into a brand new instance.

        But that may be just me.
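The "dehydrating" approach described above might look roughly like the following sketch. It does not use ZODB; the `Customer` class, its fields, and the version numbers are hypothetical, and JSON stands in for whatever simple representation is chosen.

```python
import json
from dataclasses import dataclass

SCHEMA_VERSION = 2  # hypothetical: version 2 added the `email` field


@dataclass
class Customer:
    name: str
    email: str = ""

    def dehydrate(self):
        """Reduce the object to a plain, versioned JSON record."""
        return json.dumps(
            {"version": SCHEMA_VERSION, "name": self.name, "email": self.email}
        )

    @classmethod
    def rehydrate(cls, payload):
        """Build a brand new instance from a stored record of any version."""
        record = json.loads(payload)
        if record.get("version", 1) < 2:
            record["email"] = ""  # upgrade old records explicitly
        return cls(name=record["name"], email=record["email"])
```

Because loading always goes through `rehydrate`, a record written by older code ends up in an instance built by the current class, with the upgrade step spelled out in one place.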
  • The article talks about adapting MySQL to be a front end. I wonder if someone is working on adapting PostgreSQL to be a front end too.
  • by inKubus (199753) on Friday February 08 2008, @06:48PM (#22355882) Homepage Journal
    This is a classic column-oriented DBMS, à la Sybase. You use these for data warehousing since they are optimized for read queries and not transactions. Stuff like Google search queries. It also allows you to quickly build cubes of data across a timeline, since you have data in columns instead of rows.

    IE:

    a,b,c,d,e; 1,2,3,4,5; a,b,c,d,e;

    instead of:

    a, 1, a;
    b, 2, b;
    c, 3, c;
    d, 4, d;
    e, 5, e;

    A cube using the time dimension would look like:

    01:01:01; a,b,c,d,e; 1,2,3,4,5; a,b,c,d,e;
    01:01:02; a,b,c,d,e; 1,2,6,4,5; a,b,c,d,e;

    It's pretty difficult to do the same thing with a row-based DBMS. However, you can see that doing an insert is going to be costly. This looks like a pretty good try; I know there were some other projects going to try to replicate what BigTable does. And after hearing that IBM story the other day about one computer running the entire internet, I started thinking about Google.

    More interesting is their distributed file system, which is what makes this really work well.
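The row versus column layouts in the comment above can be sketched with plain Python lists. This only illustrates the storage-layout idea; it is not how Hypertable, Bigtable, or Sybase store data, and all function names are made up for the example.

```python
# Row layout: one tuple per record, as in the "a, 1, a;" listing.
rows = [("a", 1, "a"), ("b", 2, "b"), ("c", 3, "c"), ("d", 4, "d"), ("e", 5, "e")]


def to_columns(rows):
    """Pivot row tuples into one list per column."""
    return [list(column) for column in zip(*rows)]


def sum_column_row_store(rows, index):
    """A read over one column still has to visit every whole record."""
    return sum(row[index] for row in rows)


def sum_column_column_store(columns, index):
    """The same read touches a single list in the column layout."""
    return sum(columns[index])


def insert_row_store(rows, record):
    """An insert is one append in the row layout."""
    rows.append(record)


def insert_column_store(columns, record):
    """An insert touches every column in the column layout."""
    for column, value in zip(columns, record):
        column.append(value)
```

Summing the numeric column gives the same answer either way, but the column layout reads one list while the row layout walks every record. The insert functions show the reverse trade-off: one append for a row store versus one append per column.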
     
  • Can we do a distributed search engine with it? Google@home would be sooo cool.
    • You want to donate your network to google?
      • Yeah, what a wonderful idea, I mean whatcouldpossiblygowrong if Google could access the hard drive of everyone who signed up to it?

        "Please wait while the Index is updated"

        "Please wait while we Upload new entries"

        "Please wait for the FBI to knock on your door"
  • Google 'Forms' (Score:3, Informative)

    by webword (82711) on Friday February 08 2008, @09:02PM (#22356874) Homepage
    I think Google Forms [blogspot.com] is more interesting. (Based on Google Spreadsheets.)
  • There's another open source BigTable clone called HBase. It's written in Java, and also runs on top of the Hadoop Distributed File System, like Hypertable. It has the advantage of being a subproject of Hadoop. For anyone interested in using this kind of database, give HBase a shot. We can definitely use the additional testing. (Full disclosure - I am an HBase developer.)
  • Wheel: reinvented (Score:3, Insightful)

    by stonecypher (118140) <[stonecypher] [at] [gmail.com]> on Friday February 08 2008, @09:06PM (#22356898) Homepage Journal
    Mnesia has been able to handle things far in excess of the numbers cited, and with far better control of placement, for more than a decade. So has KDB. Also Coral8. This wouldn't even be on the map if people didn't start drooling the second they heard "based on Google." When they find out it's unstable and in alpha?

    Yawn.
      • Re: (Score:3, Informative)

        Mnesia is mostly a DHT for key-value pair lookups while Hypertable/Bigtable support efficient primary key sorted range scans.

        Pretty much every database on earth has key sorted ranges. Please be less of a noob. Go look up index_match_object.

        For concurrent read/write/update, Mnesia requires explicit locking

        No, it doesn't. It offers explicit locking, because it's been proven for decades that without it, you cannot have hard realtime queries, something that mnesia wanted to offer. You don't have to use tha
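For readers following the argument above, the difference between a plain key-value lookup and a key-sorted range scan can be sketched as follows. This is a generic illustration using Python's `bisect` module, not the implementation of Mnesia, Hypertable, or Bigtable; the `SortedTable` class and its method names are hypothetical.

```python
import bisect


class SortedTable:
    """Toy table supporting point lookups and key-sorted range scans."""

    def __init__(self):
        self._keys = []    # keys kept in sorted order for range scans
        self._values = {}  # hash map for direct key-value lookups

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep the key list sorted
        self._values[key] = value

    def get(self, key):
        """Point lookup: a hash probe, no ordering needed."""
        return self._values.get(key)

    def scan(self, start, end):
        """Range scan over keys in [start, end), returned in sorted order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]
```

A store that only hashes keys can answer `get` quickly but has to examine every key to answer `scan`; keeping keys sorted lets the scan jump straight to the start of the range.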

        • If Google would just buy Bluetail already, things would start changing for the better, fast.

          I had thought Bluetail was bought many years ago and absorbed into Nortel ...
  • Over at the ASF, a bunch of smart people are building Hadoop and HBase. The latter is the open-source version of BigTable, similar to Hypertable, but written in Java (not C++) and being super actively developed in the open under the ASF umbrella.
      • You're saying that it works great on everything, but it's not fast. Is that what you are telling us?
        No, he's saying that it's a bitch to install, works mediocre on most hardware, and is not fast.