MapReduce — a Major Step Backwards?
The Database Column has an interesting, if negative, look at MapReduce and what it means for the database community. MapReduce is a software framework developed by Google to handle parallel computations over large data sets on cheap or unreliable clusters of computers. "As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:
1. a giant step backward in the programming paradigm for large-scale data-intensive applications;
2. a sub-optimal implementation, in that it uses brute force instead of indexing;
3. not novel at all -- it represents a specific implementation of well-known techniques developed nearly 25 years ago;
4. missing most of the features that are routinely included in current DBMS;
5. incompatible with all of the tools DBMS users have come to depend on."
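To make the programming model concrete, here is a toy single-process sketch of the map/shuffle/reduce pipeline, using the canonical word-count example. The function names are illustrative only; this is not Google's actual API, and a real framework distributes each phase across machines.

```python
from collections import defaultdict

def map_phase(documents, map_fn):
    """Apply map_fn to every input record, collecting (key, value) pairs."""
    pairs = []
    for doc in documents:
        pairs.extend(map_fn(doc))
    return pairs

def shuffle(pairs):
    """Group intermediate values by key (the framework's shuffle step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply reduce_fn to each key's list of values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# The user supplies only these two functions:
def word_count_map(doc):
    return [(word, 1) for word in doc.split()]

def word_count_reduce(key, values):
    return sum(values)

docs = ["the quick brown fox", "the lazy dog"]
result = reduce_phase(shuffle(map_phase(docs, word_count_map)), word_count_reduce)
# result["the"] == 2
```

The point of the model is that the user writes only the two small functions at the bottom; partitioning, scheduling, and fault tolerance are the framework's job.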
Re:may be missing the (data)points (Score:1, Informative)
Relational databases were the perfect solution for the data processing environments of the '60s through the end of the century, but the computational landscape has changed significantly in three ways: scale, dirty data, and distribution.
To support ACID semantics, relational databases require convergence of control for transactional domains. While this control can be distributed, the sacrifices required to do so typically degrade relational performance below acceptable levels. To handle the scale of data that Amazon and Google process in real time every moment of every day, you need to work independently across dozens or hundreds of machines. None of them can afford to block its processing on a mother-may-I request to lock objects or commit a transaction. To process every comment on a particular item in Amazon's database in time to spit out a web page in 300ms, you have to leave transactions behind.
Especially for Google, the data does not lend itself well to indexing. They're sucking down their data from all corners of the Earth (Earth is processed in another system...) and trying to perform meaningful analysis on this grungy data. It's much easier to do local parsing and exception handling than to try to stuff everything into the same rigid schema. Most post-relational systems have soft notions of schema; it's more in the realm of metadata giving hints about how you might want to look at the data rather than a guarantee about what form the data will have, and the code adapts to dirty data as it comes through the pipe.
Related to the first point, connectivity is now so cheap that these systems are required to be distributed. You can't have all your data sitting in one data center; all it takes is one mouth-breather with a backhoe to turn off your company. So you replicate the data to multiple centers. Of course, you want to send updates to all those data centers, which brings us back to the first point: distributed transactions are a barrier to scalability. But most fundamental to the whole discussion is the CAP theorem [psu.edu]: of Consistency, Availability, and tolerance of network Partitions, you can have at most two. Post-relational systems choose availability and partition tolerance over guaranteed consistency of their data. This lets them scale tremendously and remain very, very resilient to interruptions in the underlying communications systems.
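One way such systems tune this trade-off is quorum replication: write to W of N replicas, read from R, and choose R + W > N when you want reads to overlap the latest write. A minimal in-memory sketch (the N/W/R terminology follows the Dynamo paper; the dict-based "replicas" are an assumption for illustration):

```python
import random

N, W, R = 3, 2, 2               # replicas; write quorum; read quorum (R + W > N)
replicas = [dict() for _ in range(N)]  # each replica maps key -> (version, value)

def put(key, value, version):
    """A write succeeds once any W replicas acknowledge; the rest may lag."""
    for replica in random.sample(replicas, W):
        replica[key] = (version, value)

def get(key):
    """Read any R replicas and return the highest-versioned value seen."""
    answers = [rep[key] for rep in random.sample(replicas, R) if key in rep]
    return max(answers)[1] if answers else None

put("cart", ["book"], version=1)
put("cart", ["book", "lamp"], version=2)
# Because R + W (4) > N (3), every read quorum intersects the last write quorum,
# so the read is guaranteed to see the latest version:
latest = get("cart")
```

Shrinking R or W below that threshold keeps the system available and fast under partition, at the cost of occasionally serving stale data, which is exactly the availability-over-consistency choice described above.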
For a very interesting read, I recommend Werner Vogels's excellent paper on the theory and practice of Amazon's Dynamo [amazonaws.com] back-end.