Forgot your password?
typodupeerror
Databases Programming Math

SQL and NoSQL are Two Sides of the Same Coin 259

Posted by Unknown Lamer
from the dualistic-monadic-monsters-from-the-sea dept.
An anonymous reader writes "NoSQL databases have become a hot topic with their promise to solve the problem of distilling valuable information and business insight from big data in a scalable and programmer-friendly way. Microsoft researchers Erik Meijer and Gavin Bierman ... present a mathematical model and standardized query language that could be used to unify SQL and NoSQL data models." Unify is not quite correct; the article shows that relational SQL and key-value NoSQL models are mathematically dual, and provides a monadic query language for what they have coined coSQL.
This discussion has been archived. No new comments can be posted.

SQL and NoSQL are Two Sides of the Same Coin

Comments Filter:
  • by Jonner (189691) on Thursday April 07, 2011 @07:06PM (#35751536)

    SQL definitely sucks as a language. However, the relational model it was intended to expose does not. We need languages that more fully and naturally expose the relational model.

  • by Jonner (189691) on Thursday April 07, 2011 @07:19PM (#35751648)

    Neither SQL nor its original incarnation SEQUEL was the first language based on the relational data model. There are also more recent relational languages, such as Tutorial D [wikipedia.org], though none has gained much popularity and few people know they exist, even in the database management world. We badly need a replacement for SQL that is more flexible and more fully implements the relational model.

  • Re:using noSQL (Score:5, Interesting)

    by Sarten-X (1102295) on Friday April 08, 2011 @02:50AM (#35754384) Homepage

    Yes, no, and yes, in that order. I'm basing my answers on HBase, with which I have the most experience. My answers are also practically guaranteed to be wrong in somebody's eyes, because HBase is so much more flexible than an RDBMS. If I describe one way of doing something, another layout may work just as well, and somebody's going to favor that way.

    How does indexing work in NoSQL? Are there EXPLAIN-type tools available?

    EXPLAIN tools aren't really necessary in HBase, because almost all nontrivial queries are a scan over a small chunk of the alphanumerically-sorted rows. It will take a while, but please allow me to explain. Each row is a multi-value key-value store, with each value having a column name. If you really want to stick to the RDBMS style, you could have your key be a numeric row ID, and scan everything for every query. It would suck, because you're not using any indexes.

    Indexes are more or less left up to the programmer. Creating an index is effectively just adding more rows to the table. For example, that RDBMS-style layout in the last paragraph could be a table of ID numbers, usernames, passwords, and permissions (for 50 billion people, I guess...). For whatever business reason, the main key will be the ID number. Those rows are easy. They have the expected value columns: username, password, permissions. To index by username, we add new rows, with just a column for the ID number. We could just duplicate the data, but let's not. Now, our table is going to be huge, but sparse. Half of the rows have three of four columns filled, and the other half has only one. Searching by name, it'll take two requests to get to the actual row we want, but that's okay. Doubling the amount of work lets us run faster.

    The reason for that is HBase's split design. HBase's table is split into column families and regions. Column families are a means to group columns, so that even on data with overlapping key space, separate data could remain separate. Column families are stored as separate files in Hadoop. In our example, the username "index" could be a separate column family. That could speed up scanning, because the rows keyed by numeric usernames won't be interspersed with the rows keyed by user id. More importantly, the table is split into regions, each containing a number of rows. Those regions are also stored as separate files, and distributed across the entire Hadoop cluster.

    The cluster is really where Hadoop gets its speed. If we were to run all of our processing from one central location, it would be horribly slow and require a ridiculous number of requests. Instead, we'll distribute everything, including the query, similar to how some RDBMS sharding schemes work. We send a request to all nodes, asking for "the row with the key that matches the value of the 'userid' column of the row with a given key". Each node will report back its results. Unlike RDBMS sharding, the partitioning is handled automatically by HBase into regions that are optimal. It's these regions that are scanned for every request.

    After all of that, it should be quite clear: With HBase, the programmer is expected to know the layout of the data, and write requests based on the key. There is no EXPLAIN tool, because everything is just a key-value lookup.

    Whew. Next question...

    Can you do just about any query you could with SQL?

    Yes, but it's different. Every lookup is handled by scanning a region (in parallel on nodes that have that region's data files), and checking each column of each row to see if:

    1. The row key matches what was requested, or falls within a given range.
    2. The row contains a column that was requested.
    3. A given filter approves each column.

    Note that last item. The filter is simply a program that tells Hadoop whether the row (or some part of it) should be included in the returned results. That program can include other HBase requests, using other filters. If you're really stuck on using RDBMS

(1) Never draw what you can copy. (2) Never copy what you can trace. (3) Never trace what you can cut out and paste down.

Working...