SQL and NoSQL are Two Sides of the Same Coin
An anonymous reader writes "NoSQL databases have become a hot topic with their promise to solve the problem of distilling valuable information and business insight from big data in a scalable and programmer-friendly way. Microsoft researchers Erik Meijer and Gavin Bierman ... present a mathematical model and standardized query language that could be used to unify SQL and NoSQL data models."
Unify is not quite correct; the article shows that the relational SQL and key-value NoSQL models are mathematically dual, and presents a monadic query language for the key-value side, which the authors have dubbed coSQL.
Re:The real reason people like noSQL... (Score:3, Informative)
And we DO still use SQL. And we do so because it works
I disagree.
There are some things that are fast for a computer to do but slow and awkward to express in SQL. You can see quite a few of them in SAS's data steps (e.g. decent row-by-agonizing-row, RBAR, support). Another: say I want to get the latest transaction for each customer. I have to do something like
select customer_id, max(txn_datetime) latest_txn_datetime
from fact_txn
group by customer_id;
-- I've now got the latest transaction time, but because system reversals and the
-- like are often set to happen at the same time as the transaction, I have to next...
select fact_txn.customer_id, max(txn_key) latest_txn_key
from fact_txn
join (select customer_id, max(txn_datetime) latest_txn_datetime
      from fact_txn
      group by customer_id) latest_txn
  on fact_txn.customer_id = latest_txn.customer_id
where fact_txn.txn_datetime = latest_txn.latest_txn_datetime
group by fact_txn.customer_id;
-- Now we've got the right transaction key, but we still have to join again to get the actual transactions
select fact_txn2.*
from fact_txn fact_txn2
where exists (
  select 1
  from (select fact_txn.customer_id, max(txn_key) latest_txn_key
        from fact_txn
        join (select customer_id, max(txn_datetime) latest_txn_datetime
              from fact_txn
              group by customer_id) latest_txn
          on fact_txn.customer_id = latest_txn.customer_id
        where fact_txn.txn_datetime = latest_txn.latest_txn_datetime
        group by fact_txn.customer_id) txn_keys_of_interest
  where fact_txn2.txn_key = txn_keys_of_interest.latest_txn_key
);
I can tidy this up a bit if needed - using CTEs or whatnot - but the simple fact remains that a lot of the time I want to write a one-liner like:
select * from fact_txn qualify txn_key = max(txn_key) over (partition by customer_id);
(QUALIFY exists in dialects such as Teradata and Snowflake, but not in standard SQL.)
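Standard SQL has no single-clause filter on window functions, but a sketch with ROW_NUMBER() in a derived table comes close (same hypothetical fact_txn table as above; ties on txn_datetime are broken by the higher txn_key, matching the reversal scenario):

```sql
-- Rank each customer's transactions newest-first and keep only the top row.
select t.*
from (select fact_txn.*,
             row_number() over (partition by customer_id
                                order by txn_datetime desc, txn_key desc) as rn
      from fact_txn) t
where t.rn = 1;
```

This collapses the three grouped queries above into one pass, and most planners evaluate the window in a single sort per partition key.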
Re:The real reason people like noSQL... (Score:4, Informative)
Like Angel'o'sphere said, if you can adapt your database, the problem becomes trivial. Make sure that, at least for a given customer, each subsequent transaction ID is greater than the prior transaction ID (if that is not already the case, add a new field populated by a sequence so that it is).
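A sketch of adding such a field, in PostgreSQL syntax (the transactions table and transactionID column are this comment's hypothetical names):

```sql
-- Add a monotonically increasing surrogate key backed by a sequence.
-- Existing rows are backfilled automatically; note the backfill order is
-- unspecified, so re-number in date order first if historical order matters.
alter table transactions
  add column transactionID bigint generated by default as identity;
```

New inserts then draw the next value from the sequence automatically, so later transactions always get larger IDs.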
Here's the solution with a sub-select (shown this way because it's easier to read; it can be converted to a join for efficiency):
SELECT transactions.fieldNames
FROM transactions
WHERE (transactions.customerID, transactions.transactionID) IN (
    SELECT customerID, MAX(transactionID)
    FROM transactions
    GROUP BY customerID
);
If, as you suggest, you need it for specific date ranges, then add those conditions to the sub-select. Like I said, on most RDBMSes this would be faster if converted to a join (and basically every sub-select can be converted to a join). Some RDBMSes would convert it to a join as part of execution planning anyway (I believe Postgres and Oracle do this).
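For reference, a sketch of the join form of the same query (same hypothetical transactions table and columns):

```sql
-- Join each row against the per-customer maximum transaction ID;
-- only the latest transaction per customer survives the join.
SELECT transactions.fieldNames
FROM transactions
JOIN (SELECT customerID, MAX(transactionID) AS latestTransactionID
      FROM transactions
      GROUP BY customerID) latest
  ON transactions.customerID = latest.customerID
 AND transactions.transactionID = latest.latestTransactionID;
```

The derived table is computed once, so the optimizer can hash-join it against the base table instead of probing a correlated sub-select per row.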
Arguments like these actually only serve to strengthen RDBMS's case over NoSQL. Database engineers have been solving these problems easily and efficiently for years, but a new generation likes to think in new patterns. Not that there's anything wrong with that - except there is a certain tendency to try to put a square peg in a round hole, a complaint when it doesn't fit right, and a sigh from the guys who've been carving pegs so they fit snugly all along.
Key/value storage does have advantages over traditional RDBMS designs (assuming the RDBMS is designed and utilized properly), but those advantages are things like linear scalability; there are very few cases where a task on the K/V side is substantially faster than a properly designed solution on the RDBMS side - at least not until you are talking tens or hundreds of billions of records on 100+ CPU clusters (that is the linear-scalability advantage).