Cloud Databases

Yale Researchers Prove That ACID Is Scalable 272

Posted by CmdrTaco
from the i-could-prove-lunch dept.
An anonymous reader writes "There has been a lot of buzz in the industry lately about NoSQL databases helping Twitter, Amazon, and Digg scale their transactional workloads. But there has been some recent pushback from database luminaries such as Michael Stonebraker. Now, a couple of researchers at Yale University claim that NoSQL is no longer necessary now that they have scaled traditional ACID-compliant database systems."
This discussion has been archived. No new comments can be posted.

  • Pfah. (Score:5, Interesting)

    by stonecypher (118140) <stonecypher@gmail.STRAWcom minus berry> on Wednesday September 01, 2010 @11:53AM (#33437610) Homepage Journal

    NoSQL never was necessary. Traditional SQL databases - not just terascale, but even simple ones like MySQL - regularly deal with data volumes at Google and Walmart that make the sites that built these databases in desperation look positively tiny.

    Digg's engineers wear clown shoes to work.

    • Re:Pfah. (Score:4, Insightful)

      by TheSunborn (68004) <tiller&daimi,au,dk> on Wednesday September 01, 2010 @12:05PM (#33437818)

      It was never database size that was the problem, but the number of queries per second (aka performance) that could be executed.

      You can run a Google-size database from MySQL, but you can't use MySQL* to implement a search solution with performance like Google's without requiring much, much, much more hardware.

      *Or any other SQL database.

    • Re:Pfah. (Score:5, Insightful)

      by mini me (132455) on Wednesday September 01, 2010 @12:06PM (#33437830)

      NoSQL is not really about scalability, it is about modelling your data the same way your application does.

      There is a strong disconnect between the way SQL represents data and the way traditional programming languages do. While we've come up with some clever solutions like ORM to alleviate the problem, why not just store the data directly without any mapping?

      I am not suggesting that SQL is never the right tool for the job, but it most certainly is not the right tool for every job. It is good to have many different kinds of hammers, and perhaps even a screwdriver or two.

      • Re:Pfah. (Score:5, Insightful)

        by bluefoxlucid (723572) on Wednesday September 01, 2010 @12:28PM (#33438098) Journal

        There is a strong disconnect between the way SQL represents data and the way traditional programming languages do.

        Yes but there is a strong disconnect between computer RAM and information. Computer RAM contains DATA; information comes in associated tables. Relational databases represent data in tables with indexes, keys, etc. A Person is unique (has a unique ID), but they may share First Name, Last Name, and even Address (junior/senior in same household). There are many Races, and a Person will be of a given Race (or mix, but this is horribly difficult to index anyway). A Person will own a specific Car; that Car, in turn, will be a particular Make-Model-Year-Trim, which itself is a hierarchy of tables (Trim and Year are pretty separate, Model however will be of a particular Make, while a particular car available is going to be Model-Year-Trim).

        Indexing and relating data in this way turns it into information, which is what we want and need. Separating the data eliminates redundancies and lets us use fewer buffers along the way, crunching down smaller tables and making fast comparisons to small-size keys before we even reference big, complex tables. Meanwhile, we're still essentially asking questions like "Find me all people who own a 1996-2010 Year Toyota Prius." Someone might own 15 cars, so we're looking in the table of all individual Cars with MYT where table MYT.Model = (Toyota Prius) and .Year is between 1996 and 2010, and pulling all entries in table Persons for each unique Cars.Owner = Persons.ID (an inner join).

        Information theory versus programming. We're studying information here. We might have something more interesting to do than look in a giant array of Cars[VIN] = &Owners[Index]. For the actual data, the model we use makes sense; programmers get an API that says "Yeah, ask me a specific structured question and I'll give you a two-dimensional array to work with as an answer." That two-dimensional array is suitable for programming logic to manipulate specific structured data; extracting that data from the huge store of structured information is complex, but handled by a front-end that has its own language. You tell that front-end to find this data based on these parameters and string it together; it does tons of programming shit to search, sort, select, copy, and structure the data for you.
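The Prius question above can be sketched as a concrete query. This is a minimal illustration using Python's built-in `sqlite3` module; the table and column names (`persons`, `cars`, `myt`, `owner_id`, `myt_id`, `trim_level`) and the sample rows are assumptions made up for the example, not anything from the comment.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical Persons/Cars/MYT tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE myt (
            id INTEGER PRIMARY KEY,
            make TEXT, model TEXT, year INTEGER, trim_level TEXT
        );
        CREATE TABLE cars (
            vin TEXT PRIMARY KEY,
            owner_id INTEGER REFERENCES persons(id),
            myt_id INTEGER REFERENCES myt(id)
        );
        INSERT INTO persons VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
        INSERT INTO myt VALUES
            (1, 'Toyota', 'Prius', 2004, 'Base'),
            (2, 'Toyota', 'Prius', 2012, 'Five'),
            (3, 'Honda', 'Civic', 2001, 'EX');
        INSERT INTO cars VALUES
            ('VIN1', 1, 1), ('VIN2', 1, 3), ('VIN3', 2, 2),
            ('VIN4', 3, 1), ('VIN5', 3, 1);
        """
    )
    return conn


def prius_owners(conn):
    """Return the names of people who own a 1996-2010 Toyota Prius."""
    rows = conn.execute(
        """
        SELECT DISTINCT p.name
        FROM persons AS p
        JOIN cars AS c ON c.owner_id = p.id   -- inner join: owner of each car
        JOIN myt AS m ON m.id = c.myt_id      -- make/model/year/trim of the car
        WHERE m.make = 'Toyota' AND m.model = 'Prius'
          AND m.year BETWEEN 1996 AND 2010
        ORDER BY p.name
        """
    ).fetchall()
    return [name for (name,) in rows]


owners = prius_owners(build_demo_db())
```

The `DISTINCT` matters because one person may own several matching cars, which is the "someone might own 15 cars" case.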

        • Re:Pfah. (Score:5, Insightful)

          by h4nk (1236654) on Wednesday September 01, 2010 @01:55PM (#33439316)
          Well said. This "problem" has more to do with architects and developers understanding the concepts of layering and information hiding. When programmers are allowed to dictate architecture under the pretense that certain interfaces to a Service should determine the structure of the Information itself, there is a huge problem at the business level. How does this happen? Uninvolved or under-skilled DBAs and data architects. This is their job. My experience is that business managers and programmers have always seen the database as some sort of necessary evil without understanding its full purpose. Too many programmers with very little database experience are given direct access to databases themselves. The motivation of "Get it to work" takes precedence over well-researched and proven approaches, approaches that will only benefit in the long run. Companies that implement poor strategies for the sake of short-term gains usually have the idea that the best approach is somehow the one that takes the most time to implement. Short-sighted solutions are put into play and almost as soon as they are implemented, the scalability and data requirement issues begin to crop up. These poor strategies are often the result of inexperience and poor education on all levels. This is why it is so important to hire people that really know what they are doing from C-level management down to the programmers. I have seen bad thinking gut companies. A service built on sound architecture will have issues maturing, no doubt. How well it matures depends on the wisdom and skill of the company.
          • by SQLGuru (980662)

            This is why I call myself a database programmer. I'm not a DBA, never have been and don't want to be. I understand how to make the database do what it needs to do. At a high level, I understand how data is stored to disk, but I don't really care about that (that's a DBA's job). I also understand at a high level the questions that an application developer needs to ask (not a DBA's job at all). I bridge the gap and write code (sprocs, triggers, functions, etc.) to support the app. I tune queries and db co

        • "Yeah, ask me a specific structured question and I'll give you a two-dimensional array to work with as an answer."

          I thought it was more like an array of structs, where each array entry is a row and each struct member is a column. In non-C you might say each row is an object, each field-of-a-class is a column (where class : table) and each field-of-an-object is a single cell.

          Then the cartesian product operation on tables of types T1 and T2 (respectively) has a type which is the product of T1 and T2, and everything matches up neatly.

          • You mean a linked list. I'm not sure for your particular API.

            The issue here is that you get rows that are effectively struct { char[]; int; long int; int; double; char[]; char[5]; }; which you can do. What you can also do is void* result[][], where (*(result[row][column])) (note that the inner set of parentheses is optional in this case, but syntactically valid and more visually clear) points to the correct data.

            Working with arbitrary data gathered from an arbitrary information set is a pain. Consider

        • Re: (Score:3, Insightful)

          by lennier (44736)

          Yeah, ask me a specific structured question and I'll give you a two-dimensional array to work with as an answer.

          That's fine until someone asks you an unstructured question for which a two-dimensional array cannot contain the answer.

          Like, for example, 'Here's an ordered DOM tree of nodes each containing tags, subtrees and/or chunks of CDATA'.

          Or 'Here is a set of objects each of which contain their own custom properties not found in others.'

          Not every form of useful information in the real world is strictly typeful and represents a well-formed relation over finite domains.

      • by sarkeizen (106737)
        Isn't that a very specific context though? The underlying assumption seems to be that there is one dataset per application. Which may well be the general case - in other words what is the "same way your application does but without mapping" when your applications are written in different frameworks, languages or when the data is accessed via say a reporting environment?
      • by Tablizer (95088)

        For this reason I suggest that app language designers work on better fitting RDBMS and SQL rather than the other way around (at least for data-driven apps). OOP may be nice, but it inherently conflicts with relational concepts and patterns. Generally, one is based around attribute-handling idioms and the other behavior-handling idioms. OOP also tends to be nested, hierarchical, and/or graph-shaped; while relational is set-centric. Either you de-emphasize one or the other, or deal with complicated and expens

        • Re: (Score:3, Funny)

          by Atzanteol (99067)

          Right now it's like men wearing women's underwear and vice versa.

          You mean it makes me feel pretty?

      • Re: (Score:3, Insightful)

        by GWBasic (900357)

        NoSQL is not really about scalability, it is about modelling your data the same way your application does.

        I 100% agree. Earlier this year I moved a prototype application built around SQLite and flat files to MongoDB. MongoDB is SQL-like in its ability to have queries and indexes; but it stores its data in a way that doesn't require me to deconstruct all of my data structures into tables. This dramatically reduced complexity in code that used to deal with 5-6 SQLite tables. In the case of MongoDB, I was able to replace 5-6 tables with a single collection of structured documents. MongoDB lets me wr

        • Whose data is it? (Score:4, Insightful)

          by sbjornda (199447) <sbjornda@COBOLhotmail.com minus language> on Wednesday September 01, 2010 @02:17PM (#33439608)

          but it stores its data in a way that doesn't require me to deconstruct all of my data structures into tables.

          I take it this is not business-type data? Otherwise you're doing it backwards. Start with your Entity-Relationship diagrams, devolve into logical then physical data models, and THEN start programming.

          I forget who said it but it's true: The data belongs to the business, not to the application. The data should be structured and stored in a way that it will still be readable years after your program has become obsolete. (Unless it's data that has a short "best before" date.)

          --
          .nosig

      • Re: (Score:2, Insightful)

        by bsdaemonaut (1482047)

        NoSQL has a lot to do with scalability. Sure, there are other reasons, but not enough to recommend them over hash databases. Hash databases, which do what you propose and a lot more, have been around for decades; their main con is the lack of scalability -- hence NoSQL. BerkeleyDB is an example, but the list is too huge to continue.

      • Re:Pfah. (Score:4, Interesting)

        by RAMMS+EIN (578166) on Wednesday September 01, 2010 @03:15PM (#33440614) Homepage Journal

        ``There is a strong disconnect between the way SQL represents data and the way traditional programming languages do.''

        I agree, but ...

        ``While we've come up with some clever solutions like ORM to alleviate the problem,''

        I don't think ORM alleviates the problem so much as entrenches it. The classes-and-instances object model and the relational model are different, but can be expressed in one another. Object-relational mapping makes this easy by pretending the models are the same, and doing the mapping behind the scenes. This works for some cases, but if you want to get the best performance, you have to express things in a way that takes into account the efficiency considerations of the actual implementation. With ORM, you run into the situation where what is most succinct to express in code is not necessarily what is most efficient in terms of disk access and network resource usage. So, for efficiency reasons, you end up breaking the abstractions that your ORM provided ...

        ``why not just store the data directly without any mapping?''

        There isn't really such a thing as "without any mapping". However, you can ensure that the constructs your API provides are equivalent to what you can efficiently fetch or store in your data store. Since typical RDBMSs are usually optimized to execute typical SQL queries efficiently, SQL is actually a fairly good starting point. You can optimize this by creating indices to speed up common operations, and by tuning your RDBMS to speed up common operations. And, no doubt, you can do even better by creating custom shortcuts for specific needs of your application.

        This is sort of what so-called NoSQL databases do: they are optimized for specific scenarios, and thus may outperform stock RDBMSs that are optimized for "we don't know what you want to do, so we try to make everything reasonably fast". It's also worth noting that NoSQL systems often return stale data or even allow inconsistencies in order to improve performance. By contrast, the strength of a good relational database is preserving the integrity of your data no matter what happens. Different tools for different jobs - or at least, different optimizations for different scenarios.
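The gap between "most succinct in code" and "most efficient in disk and network usage" often shows up as the N+1 query pattern. The sketch below uses `sqlite3` with a hypothetical `orders`/`line_items` schema and a small `CountingConn` wrapper invented for the example; it does not model any particular ORM, it only counts round trips for two ways of loading the same data.

```python
import sqlite3


class CountingConn:
    """Thin wrapper (hypothetical helper) that counts queries sent to the DB."""

    def __init__(self, conn):
        self.conn = conn
        self.queries = 0

    def query(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def build_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY);
        CREATE TABLE line_items (order_id INTEGER REFERENCES orders(id), product TEXT);
        INSERT INTO orders VALUES (1), (2), (3);
        INSERT INTO line_items VALUES (1, 'bolt'), (1, 'nut'), (2, 'gear');
        """
    )
    return CountingConn(conn)


def load_n_plus_one(db):
    """One query for the orders, then one more query per order."""
    result = {}
    for (order_id,) in db.query("SELECT id FROM orders ORDER BY id"):
        items = db.query(
            "SELECT product FROM line_items WHERE order_id = ? ORDER BY product",
            (order_id,),
        )
        result[order_id] = [product for (product,) in items]
    return result


def load_with_join(db):
    """A single joined query, regrouped into per-order lists in the application."""
    result = {}
    rows = db.query(
        """
        SELECT o.id, li.product
        FROM orders AS o
        LEFT JOIN line_items AS li ON li.order_id = o.id
        ORDER BY o.id, li.product
        """
    )
    for order_id, product in rows:
        result.setdefault(order_id, [])
        if product is not None:  # orders without items yield one NULL row
            result[order_id].append(product)
    return result


db_a, db_b = build_demo_db(), build_demo_db()
naive, joined = load_n_plus_one(db_a), load_with_join(db_b)
```

Both functions return the same mapping, but the first issues one query per order on top of the initial one, while the second issues exactly one.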

      • Re:Pfah. (Score:5, Insightful)

        by hey! (33014) on Wednesday September 01, 2010 @05:51PM (#33442914) Homepage Journal

        NoSQL is not really about scalability, it is about modelling your data the same way your application does.

        I've actually been in the business long enough to remember when relational databases were the new thing. What people seem to forget is that modeling your data in a different way than your application does *was the whole point*. The idea was to make data a reusable resource *across applications*. Of course, that turned out to be a lot harder than we thought it would be. Philosophically, one might well ask whether it is possible to understand data at all apart from its intended applications. Of course, by the time we'd figured that out, a whole new generation was coming up trying to create a Semantic Web.

        I basically agree that SQL isn't always the right tool for the job. I happen to think certain aspects of the relational model are somewhat broken (e.g. composite keys), and SQL is a pretty crappy query language in any case. But I think because RDBMSs are a mature technology, recently trained programmers don't bother to understand them, and cover that lack of understanding by pooh-pooh-ing the stuff that's over their head. I went through a patch a few years ago where I was interviewing programming candidates who had XML coming out of their ears but hadn't the foggiest idea of what "NULL" means in the relational model. Naturally they had all kinds of problems on the relational end of things, and tended to view the RDBMS as a kind of pitfall in which bad things inexplicably happen. Consequently, they tended to think of the database as simply a backing store for the application *they* were working on. In some cases this is acceptable, but one often sees abominable schema that are the product of ignorance, pure and simple.

        Naturally, non-relational systems are most attractive where performance is at a higher premium than flexibility. This characterizes many web applications that do a small number of relatively simple things, but do them on a scale that takes special expertise to achieve using a relational model. That was very much the case at the beginning of the relational era, when applications tended to be narrower in scope and query optimization primitive. You thought of order line items as "part-of" an order, whereas in relational thinking they could just as easily be considered attributes of products. This made the programmer's job a lot easier, so long as the RDBMS could process invoices fast enough to make the users happy.
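For readers unsure what the remark about "NULL" in the relational model refers to: SQL comparisons involving NULL evaluate to "unknown" rather than true or false, which is a common source of the inexplicable-seeming results described above. A small sketch with `sqlite3` follows; the `people` table and its rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE people (name TEXT, middle_name TEXT);
    INSERT INTO people VALUES ('Ann', NULL), ('Ben', 'Lee');
    """
)


def names(where_clause):
    """Return names matching a hard-coded WHERE clause (demo only, not user input)."""
    sql = f"SELECT name FROM people WHERE {where_clause} ORDER BY name"
    return [name for (name,) in conn.execute(sql)]


# '= NULL' is never true, so it matches nothing, even for Ann.
equals_null = names("middle_name = NULL")
# 'IS NULL' is the test that actually finds missing values.
is_null = names("middle_name IS NULL")
# Ann is also absent here: NULL <> 'Lee' is unknown, not true.
not_lee = names("middle_name <> 'Lee'")
# Even NULL = NULL evaluates to NULL, which sqlite3 returns as None.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]
```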

    • Re:Pfah. (Score:5, Interesting)

      by bluefoxlucid (723572) on Wednesday September 01, 2010 @12:14PM (#33437914) Journal

      NoSQL never was necessary. Traditional SQL database - not just terascale, but even simple ones like MySQL - regularly deal with data volumes at Google

      Google uses BigTable, a NoSQL database.

    • by Shados (741919)

      Depends for what part, but Walmart's site runs at least partly on a "NoSQL" (I use the term loosely in this case) system.

    • Re:Pfah. (Score:5, Insightful)

      by DragonWriter (970822) on Wednesday September 01, 2010 @12:29PM (#33438114)

      NoSQL never was necessary. Traditional SQL database - not just terascale, but even simple ones like MySQL - regularly deal with data volumes at Google and Walmart that make the sites that built these databases in desperation look positively tiny.

      Database size was never the main driving force behind the new move toward NoSQL databases. Support for distributed architectures is. In part, this is about handling lots of queries rather than handling lots of data; it also -- particularly if you are Google -- deals with latency when the consumers of data are widely distributed geographically.

      And note that one of the companies that is heavily involved in building, using, and supplying non-SQL distributed databases is Google, who, as you so well point out, is very much aware of both the capabilities and limits of scaling with current relational DBs.

      This new research may offer new prospects for better databases in the future -- but TFA indicates that the new design has a limitation which seems common in distributed, strongly-consistent systems: "It turns out that the deterministic scheme performs horribly in disk-based environments".

      In fact, given that it proposes strong consistency and distribution, and relies on in-memory operation for performance, it sounds a lot like existing distributed, strongly-consistent systems based around the Paxos algorithm, like Scalaris. And it seems likely to face the same criticism from those who think that durability requires disk-based persistence and who object to replacing storage on disks (which, one should keep in mind, can also fail) with storage in memory simultaneously on a sufficient number of servers (which, yes, could all simultaneously fail, but durability is never absolute; it's at best a matter of the degree to which data is protected against probable simultaneous combinations of failures).

      So -- reading only the blog post that is TFA announcing the paper and not the paper itself yet -- I don't get the impression that this is necessarily a giant leap forward, though more work on distributed, strongly-consistent databases is certainly a good thing.

    • Re: (Score:2, Funny)

      by Tablizer (95088)

      After all, MySql is why slashdot is so relia~ `} v* m& + ' ,

    • NoSQL never was necessary. Traditional SQL database - not just terascale, but even simple ones like MySQL - regularly deal with data volumes at Google and Walmart that make the sites that built these databases in desperation look positively tiny.

      It isn't data volume that is the problem. It is often data organization. Traditional SQL databases are row stores. For some applications that is not a good way to store data. Column stores make more sense in data warehousing, for example. Michael Stonebraker has blogged about this a few times at the same blog site cited by the submitter.
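The row-store versus column-store distinction can be illustrated with plain Python containers. This is only a layout sketch under made-up data (`sales` and its fields are assumptions); it says nothing about how any real storage engine is implemented.

```python
# Row-oriented layout: each record is kept together, as in a classic row store.
sales = [
    {"id": 1, "region": "east", "amount": 10.0},
    {"id": 2, "region": "west", "amount": 25.5},
    {"id": 3, "region": "east", "amount": 4.5},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def total_from_rows(rows):
    """Aggregate by visiting every whole record, even though one field is needed."""
    return sum(row["amount"] for row in rows)


def total_from_columns(columns):
    """Aggregate by scanning only the 'amount' column."""
    return sum(columns["amount"])


sales_columns = to_columns(sales)
```

Warehouse-style aggregates tend to read a few columns across many rows, which is the access pattern the column layout favors; fetching or updating one complete record favors the row layout.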

  • by Dan667 (564390) on Wednesday September 01, 2010 @11:57AM (#33437670)
    digg has chased all their users away with the new version of their site so they could probably change over to MS Access and be ok.
  • Berkeley DB (Score:4, Funny)

    by nacturation (646836) * <nacturation AT gmail DOT com> on Wednesday September 01, 2010 @11:58AM (#33437690) Journal

    Didn't Berkeley prove back in the 60s and 70s that acid was scalable?

    • by Zak3056 (69287)

      Didn't Berkeley prove back in the 60s and 70s that acid was scalable?

      At the very least, they proved it was salable...

  • Interesting thesis (Score:5, Interesting)

    by Peeteriz (821290) on Wednesday September 01, 2010 @12:09PM (#33437876)

    In essence, TFA claims that if the traditional ACID guarantee "if three transactions (let's call them A, B and C) are active ... the resulting database state will be the same as if it had run them one-by-one. No promises are made, however, about which particular order execution it will be equivalent to: A-B-C, B-A-C, A-C-B" is not abandoned (as in NoSQL systems), but is even strengthened to a guarantee that the result will always be as if they arrived in A-B-C order, then it solves all kinds of possible replication problems, requires less networking between the many servers involved, and allows for high scaling while also keeping all the integrity constraints.
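A toy model may make the ordering argument easier to follow. It is not the researchers' system; the transactions, amounts, and function names are all assumptions. It shows that different serial orders of the same three transactions are each valid but can leave different final states, while replicas that apply one agreed order end up identical without comparing notes afterwards.

```python
from itertools import permutations


def deposit(amount):
    """Transaction that adds a fixed amount (negative for a withdrawal)."""
    return lambda balance: balance + amount


def add_interest(percent):
    """Transaction whose effect depends on the balance it sees."""
    return lambda balance: balance + balance * percent // 100


TXNS = {"A": deposit(100), "B": add_interest(10), "C": deposit(-50)}


def apply_in_order(initial, order):
    """Run the named transactions one by one in the given order."""
    state = initial
    for name in order:
        state = TXNS[name](state)
    return state


def outcomes_by_serial_order(initial):
    """Final state for every possible serial order of A, B and C."""
    return {
        "-".join(order): apply_in_order(initial, order)
        for order in permutations("ABC")
    }


def replicate(initial, agreed_order, replicas=3):
    """Each replica independently applies the same agreed order."""
    return [apply_in_order(initial, agreed_order) for _ in range(replicas)]


serial_outcomes = outcomes_by_serial_order(1000)
replica_states = replicate(1000, "ABC")
```

With an initial balance of 1000, the six serial orders give several different final balances, whereas every replica following A-B-C arrives at the same one.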

    • Determinism solves many things in DB design; that's why things like WITH SCHEMABINDING for views and user-defined functions in MS SQL make things run so much faster. With over 40 years of RDBMS design, it's odd that this path has never been gone down before. But the part about how "the deterministic scheme performs horribly in disk-based environments" makes perfect sense if this is something that scales very well in high memory environments that didn't exist until now.

      Now THIS is news for nerds, it's

  • by Tablizer (95088) on Wednesday September 01, 2010 @12:17PM (#33437952) Journal

    A bigger issue may be the cost of ACID even if it can in theory scale. Supporting ACID is not free. A free web service may be able to afford losing say 1 out of 10,000 web transactions. Banks cannot do it, but Google Experiments can. The extra expense of big-iron ACID may not make up for the relatively minor cost of losing an occasional transaction or customer. It's a business decision.

    • by sarkeizen (106737)
      Ok all puns on "acid" aside (especially when you add adjectives like "big-iron"). The point of the article seems to be about scaling out - specifically with cheaper hardware. I agree that one's choice of tools is a business decision (so is everything in business) but it's not like using MySQL or postgres is somehow cost prohibitive.
    • Re: (Score:3, Insightful)

      by Peeteriz (821290)

      Typically the NoSQL approach just shifts the problems from the database layer to the application programmer - if it's simply ignored, a typical app can't cope with unpredictable/corrupt data being returned from db, and results in weird bugreports that cost a lot of development time to find and fix; and with these fixes parts of the ACID compliance are simply re-implemented in the app layer.

      You gain some performance of the db, you lose some (hopefully less) performance in the app, and it costs you additional

  • by LightningBolt! (664763) <lightningboltlig ... nosPAM.yahoo.com> on Wednesday September 01, 2010 @12:28PM (#33438094) Homepage

    For instance, Neo4J is a scalable graph-based "nosql" DB with ACID.

  • NoSQL's two big features are scalability and the arbitrary schemas. While the paper covers the first (though I still think map/reduce has its place) NoSQL does do taxonomy-based (hierarchical) schema better. The only way to do that in SQL is to have a property table, where the parent object is an object RID, and a huge table of attached properties and values to that. You might be able to get your indexes to perform reasonably well, but only by duplicating some data. And on top of that, just try writing a

    • There is more than one way to do hierarchical queries; it just depends on the RDBMS. Oracle has had it for years and SQL Server implemented it in the 2005 edition. You don't need sub-selects.
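As one illustration of a hierarchical query without a sub-select per level, here is a sketch using a recursive common table expression in SQLite (available since SQLite 3.8.3) via Python's `sqlite3`. The `taxa` table and its rows are assumptions for the example, and the syntax shown is SQLite's, not Oracle's or SQL Server's.

```python
import sqlite3


def build_demo_db():
    """Create a hypothetical adjacency-list taxonomy: each row points at its parent."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE taxa (
            id INTEGER PRIMARY KEY,
            parent_id INTEGER REFERENCES taxa(id),
            name TEXT NOT NULL
        );
        INSERT INTO taxa VALUES
            (1, NULL, 'Animalia'),
            (2, 1, 'Chordata'),
            (3, 2, 'Mammalia'),
            (4, 3, 'Carnivora'),
            (5, 4, 'Felidae'),
            (6, 5, 'Felis'),
            (7, 6, 'Felis catus'),
            (8, 1, 'Arthropoda');
        """
    )
    return conn


def subtree(conn, root_name):
    """Return (name, depth) for the named node and everything beneath it."""
    return conn.execute(
        """
        WITH RECURSIVE sub(id, name, depth) AS (
            SELECT id, name, 0 FROM taxa WHERE name = ?   -- starting node
            UNION ALL
            SELECT t.id, t.name, sub.depth + 1            -- children of rows found so far
            FROM taxa AS t
            JOIN sub ON t.parent_id = sub.id
        )
        SELECT name, depth FROM sub ORDER BY depth, name
        """,
        (root_name,),
    ).fetchall()


mammal_subtree = subtree(build_demo_db(), "Mammalia")
```

The same query works whether the hierarchy is three levels deep or thirty, so the number of levels does not need to be known when the query is written.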

    • by akpoff (683177)

      And on top of that, just try writing a query for hierarchical data! You'll have sub-selects for each level of hierarchy. This means in order to do something relatively simple, like KPCOFGS of species classifications, you'll need a select and 6 sub-selects. At least that one is well defined. If it's not, you just don't know how many, and you have to write a recursive function to generate your select query, or process the results from it. Either way, you repeatedly consider 99% useless records at every lev

  • We knew ACID can scale already.

    With enough money poured into it, and new implementations, ACID can scale.

    They solved some problems with scaling out, not necessarily the problems with it scaling up. Scaling does not necessarily just mean replicas and quick failover -- it means good performance without millions spent on hardware too, in terms of overhead, storage requirements, storage performance, server performance.

    NoSQL scales in certain cases less expensively, with less work, and doesn't require compl

  • I don't think they've proven it yet, they simply offer some solutions to what they admit is a very difficult problem. In other words, we'll see how their ideas pan out.
  • by elwin_windleaf (643442) on Wednesday September 01, 2010 @01:01PM (#33438582) Homepage

    From the Wikipedia Article (http://en.wikipedia.org/wiki/ACID [wikipedia.org])

    "In computer science, ACID (atomicity, consistency, isolation, durability) is a set of properties that guarantee database transactions are processed reliably. In the context of databases, a single logical operation on the data is called a transaction."
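To make the "A" in that definition concrete, here is a minimal sketch of atomicity using Python's `sqlite3`. The `accounts` table, the `CHECK` constraint, and the `transfer` function are assumptions for the illustration. The transfer credits the destination first on purpose, so that a failing debit shows the earlier credit being rolled back too.

```python
import sqlite3


def build_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE accounts (
            name TEXT PRIMARY KEY,
            balance INTEGER NOT NULL CHECK (balance >= 0)
        );
        INSERT INTO accounts VALUES ('alice', 100), ('bob', 50);
        """
    )
    return conn


def transfer(conn, src, dst, amount):
    """Move money as one transaction; return True if it committed."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated CHECK (balance >= 0); the credit was undone as well.
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


conn = build_demo_db()
overdraft_ok = transfer(conn, "alice", "bob", 150)
after_failed = balances(conn)
small_ok = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
```

After the rejected transfer neither balance has changed, which is the all-or-nothing behavior the quoted definition describes.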

  • SQL syntax is dated and very obtuse. Just look at the different syntax between insert and an update. ...wouldn't you rather just have "save"?

    Object-relational mapping is cumbersome and mis-matched in SQL. 1:many either yields n+1 queries or a monster cartesian product set. And, what about inheritance? It just doesn't jibe.

    It isn't about losing ACID- although not every purpose needs ACID. Your average shared drive filesystem isn't ACID, for example.

    When you have anemic domains that aren't nailed down an

  • Not NoACID, NoSchema (Score:3, Interesting)

    by bokmann (323771) on Wednesday September 01, 2010 @01:03PM (#33438606) Homepage

    Interesting article (and yes, I read the article), but the point of the NoSQL movement isn't so much about SQL, or ACID, as much as it is about Schema.

    Most applications today are written in object-oriented languages like Java, C#, Ruby, etc... and most common frameworks in these languages use object-relational models to essentially 'unpack' the object into a relational model, and then reconstitute the objects on demand. This post [tedneward.com] explains the kinds of problems better than most.

    NoSchema is about storing data closer to the format we process it in today. Key-Value pairs. XML. Sets and Lists. Object-Oriented data structures. This is about abstractions that make developers more productive. It is a tool in a toolbox, and useful in some circumstance and not in others.

    SQL databases do not have to be the 'one persistence data mechanism to rule them all'. We don't need one; we need many that solve differing classes of problems well.

  • To achieve 'nonconcurrency' one needs to introduce a global ordering of transactions. Which WILL require a shared resource among ALL of the transactions. No way around it, sorry.

    And what's funny, this resource has some of the problems of ACID systems. However, there should be advantages (no need for rollbacks, etc.).

    Besides, all of this doesn't tackle another advantage of NoSQL systems: working with HUGE amounts of data. There'll still be problems in ACID systems if data access requires communication between se

  • This seems to be a reinvention of field calls, with a slightly different purpose.

  • by smcdow (114828) on Wednesday September 01, 2010 @02:15PM (#33439576) Homepage

    TFA hints at this but doesn't come out and say it: the larger you scale, the more you swamp yourself with atomicity protocol overhead. If your database is geographically distributed, then you have to decide if atomicity is more important than forgoing the very large bills for the associated network usage. I suspect that this may explain a lot about why Google, Amazon, etc., went with NoSQL solutions.

  • Yale Researchers Prove That ACID Is Scalable

    Finally. I've been telling Bob that for years, but nooo, he insists that we keep using blotter paper and sour patch kids.

  • Summary (Score:5, Informative)

    by azmodean+1 (1328653) on Wednesday September 01, 2010 @03:07PM (#33440488)

    Short Summary:
    We make some claims about scaling ACID databases, but then don't support them.

    Longer summary:
    We don't like NoSQL and enjoy making baseless cracks about it such as it being a "lazy" approach.
    In our paper we demonstrate that our unconventional version of an ACID database scales better than a traditional ACID database in a specific environment, while merely throwing away some robustness guarantees and changing how transaction ordering works.
    No direct comparison to any NoSQL implementation is made.

    So yea, I'm not holding my breath for companies to start migrating away from NoSQL.

  • by yaphadam097 (670358) on Wednesday September 01, 2010 @04:51PM (#33442070)

    The reason that NoSQL is necessary is that ACID is not the only thing that developers need to think about. RDBMS was an innovative solution to the limitations of mainframe hierarchical databases circa 1970. Since then it has been the only game in town (At least for most enterprise software. Some of us do other things occasionally.)

    It turns out that there are reasons to do things other ways, and having other options allows you to consider trade-offs. For many applications eventually consistent data scales just fine. For some applications, both big and small, an enterprise RDBMS is overkill. Why not just persist objects to a document store? Or even the file system?

    The research is interesting, although I agree that we already knew we could scale the ACID paradigm. The conclusion is ridiculous. NoSQL has nothing to do with ACID, and it brings a richness to the conversation that has been missing for far too long. Like the Perl folks say, TMTOWTDI.
