Why My Team Went With DynamoDB Over MongoDB
Nerval's Lobster writes "Software developer Jeff Cogswell, who matched up Java and C# and peeked under the hood of Facebook's Graph Search, is back with a new tale: why his team decided to go with Amazon's DynamoDB over MongoDB when it came to building a highly customized content system, even though his team specialized in MongoDB. While DynamoDB did offer certain advantages, it also came with some significant headaches, including issues with embedded data structures and Amazon's sometimes-confusing billing structure. He offers a walkthrough of his team's tips and tricks, with some helpful advice on avoiding pitfalls for anyone interested in considering DynamoDB. 'Although I'm not thrilled about the additional work we had to do (at times it felt like going back two decades in technology by writing indexes ourselves),' he writes, 'we did end up with some nice reusable code to help us with the serialization and indexes and such, which will make future projects easier.'"
That's different... (Score:2, Funny)
They must run their company pretty differently from where I work.
Where I work, the most senior and backstabby developer saddles the rest of the team with the worst tools he can find, and then blames them (behind their backs, of course) for the results of his poor decision making.
I don't understand (Score:3, Funny)
But MongoDB is web scale.
Re:I don't understand (Score:5, Funny)
MongoDB ... just a pawn in the game of life.
Re: (Score:1)
Re: (Score:2, Funny)
Re:I don't understand (Score:5, Funny)
Oblg. :-)
http://highscalability.com/blog/2010/9/5/hilarious-video-relational-database-vs-nosql-fanbois.html [highscalability.com]
No one cares (Score:5, Insightful)
No one cares. Stop click-baiting the buzzword Slashdot sub-sites. If we wanted to go to them we would do so voluntarily.
Re: (Score:1, Funny)
But I want Dice to tell me all the ways in which backend specialists are critical to online games!
devs and DB indexes (Score:2)
there are two kinds
the first creates a 10,000,000 row table with no indexes and no PK, and then complains that the DBAs are dumb because the app is slow or the server is broken
the second kind I've seen has a 100 row table with 10 columns and 15 indexes on it. sometimes half my day is spent deleting unused indexes created by our BI devs
Re: (Score:1)
Re: (Score:1)
Re: (Score:2)
Worried about hosting data alongside others... (Score:3, Insightful)
"Our client is paying less than $100 per month for the data. Yes, there are MongoDB hosting options for less than this; but as I mentioned earlier, those tend to be shared options where your data is hosted alongside other data."
I think someone failed to explain how "the cloud" actually works.
It's so ... wrong (Score:5, Insightful)
Having actually RTFA, it just reinforces how poorly most programmers understand relational databases and why they shouldn't be let near them. It's so consistently wrong it could be just straight trolling (which, given it's posted to post-Taco Slashdot, is likely).
"However, the articles also contained data less suited to a traditional database. For example, each article could have multiple authors, so there were actually more authors than there were articles."
This is completely wrong; that's a textbook case of something perfectly suited to a traditional (relational) database.
Re: (Score:1)
NoSQL is a buzzword meaning "too dumb to understand a RDB". That's why they poorly reinvent the wheel.
Re:It's so ... wrong (Score:5, Funny)
"Those who don't understand SQL are condemned to reinvent it, poorly." (with apologies to Henry Spencer).
Re:It's so ... wrong (Score:5, Insightful)
Re:It's so ... wrong (Score:5, Insightful)
Mod parent up.
After a few years in other fields, I'm doing some serious coding again. Postgres and Doctrine. I can do in a few lines of code and SQL what would take a small program or module to do without the power of SQL and an ORM.
Anyone who reinvents that wheel because he thinks he can do the 2% he recoded better is a moron.
Re: (Score:2, Insightful)
Re: (Score:3)
There's "wrong" and there's wrong.
I'm pretty sure that my coding does not satisfy some theoretical top-of-the-mountain coding structure fanatics. But that is "wrong" in the sense that it does not satisfy opinions. And when it comes to coding styles, in the end they're just opinions and ten years from now we'll laugh about most of today's patterns.
And then there is programmatically correct, not unnecessarily wasteful with resources and easy to understand. Those are not opinions - your code either gives the rig
Re: (Score:2)
I know you jest, but sometimes you DO want to re-write SQL, e.g. row store vs column store.
NewSQL vs. NoSQL for New OLTP
http://www.youtube.com/watch?v=uhDM4fcI2aI [youtube.com]
One Size Does Not Fit All in DB Systems
http://www.youtube.com/watch?v=QQdbTpvjITM [youtube.com]
Re:It's so ... wrong (Score:4, Insightful)
I jest slightly. Certainly there are applications where SQL and relational systems in general are overkill, or where they do not solve certain kinds of problems well. But I'll be frank, they're pretty rare. I will use binary search/sort mechanisms for simple hashes and other similar two-column key-value problems, mainly because there's absolutely no need to truck along gazillions of bytes worth of RDBMS where quicksort and a binary search are all that is needed. But if you get beyond that, you're almost inevitably going to start wishing you had JOIN. And then you end up having to implement such functionality.
Every tool for the job, to be sure, but I just happen to think there are far fewer problems that nosql style systems solve than some like to think.
Re: (Score:1)
Every tool for the job, to be sure, but I just happen to think there are far fewer problems that nosql style systems solve than some like to think.
I strongly agree with this, and because of that I've been severely chastised by quite a few kool-aid drinkers. On my current job we have a NoSQL database (a MongoDB one, actually) and we indeed have had to reinvent some SQL here and there, including a few manual joins. The job would just have been far smoother (and faster to develop), and surely more performant, if we used a well-established SQL database, but someone decided that it wasn't buzzwordy enough.
Re:It's so ... wrong (Score:5, Funny)
"However, the articles also contained data less suited to a traditional database. For example, each article could have multiple authors, so there were actually more authors than there were articles."
Good god, how would he model invoices with multiple line items? Where, you know, there were actually more line items than invoices?! Mind blown.
Or customers that might belong to zero or more demographics? There could be more customers than defined demographics to tag them with... or fewer... we don't even know and it could change as more of either are added!!
We need a whole new database paradigm!
Or the sample Northwind database that's been shipping with Access since the '90s.
Re: (Score:2)
We need a whole new database paradigm!
Wait, don't you just draw a different arrow on the end of the line joining the two tables and the rest happens automatically?
Re: (Score:3)
Make a table of authors, make a linking table that joins authors to the article table.
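For anyone who hasn't seen it spelled out, here is a minimal sketch of that linking-table layout using Python's built-in sqlite3 module. The table names, column names, and sample rows are made up for illustration; nothing here comes from TFA.

```python
import sqlite3

# In-memory database; all names below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors  (id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    -- Linking table: one row per (article, author) pair.
    CREATE TABLE article_authors (
        article_id INTEGER NOT NULL REFERENCES articles(id),
        author_id  INTEGER NOT NULL REFERENCES authors(id),
        PRIMARY KEY (article_id, author_id)
    );
""")
conn.executemany("INSERT INTO authors VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace"), (3, "Edgar")])
conn.execute("INSERT INTO articles VALUES (1, 'On Joins')")
conn.executemany("INSERT INTO article_authors VALUES (?, ?)",
                 [(1, 1), (1, 2), (1, 3)])

def authors_of(article_id):
    """Return the names of every author of one article, sorted by name."""
    rows = conn.execute(
        """SELECT a.name
             FROM authors a
             JOIN article_authors aa ON aa.author_id = a.id
            WHERE aa.article_id = ?
            ORDER BY a.name""",
        (article_id,),
    )
    return [name for (name,) in rows]

print(authors_of(1))
```

One article, three authors, no "maximum number of author fields" to anticipate. Having more authors than articles is just more rows.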
Re: (Score:3)
This is completely wrong
No, it's completely right: the traditional way to use a database is to blob everything together into one huge table, preferably with many NULLs, then limit your query to SELECT * FROM Table; and finally process the results directly in VB6, with bonus points for a buggy parser for unpicking comma-separated fields.
Note: he said "traditional" not "sane relational".
Sarcasm aside, his reason for not using a relational database is that he'd need to use more than one table and then he'd ha
Re: (Score:2)
No, no, no, you let your tedious "DBAs" think they're right and do all that "normalization" and "tuning" shit they keep yammering on about (whatevs), then get the new shiny [microsoft.com] so you can blob the whole fucker up and never have to worry about anything but said "SELECT * FROM FOO." It's great because our developers no longer have to talk to our DBAs about "optimizing" all that dynamic SQL our webforms were generating. The DBAs are now screaming about resource utilization, but, HELLO, they're the ones who insiste
Re: (Score:2)
Entity Framework is good when you use it with a properly designed database. It saves a lot of work which, correct me if I'm wrong, is the whole point of computers. There are so many times that people forget that very simple fact in their rush to wave their e-peen around.
Re: (Score:2)
Oh, that moron? 10 minutes wasted checking the comments to see if TFA is worth reading..
Re: (Score:2)
For normalized databases, this is often considered a best practice, although another option would be to store multiple author IDs in the article tables -- something that would require extra fields, since most articles had more than one author. That would also require that we anticipate the maximum of author fields needed, which could lead to problems down the road.
A single field with delimited index keys pointing to an author table. I learned that in 1996. Then compressing the field with a dictionary, increasing the number of keys that can fit and speeding up searches through it. Learned that in 1998.
Why does that not work in NoSQL? I don't understand.
Re: (Score:3)
Having actually RTFA, it just reinforces how poorly most programmers understand relational databases and why they shouldn't be let near them. It's so consistently wrong it could be just straight trolling (which, given it's posted to post-Taco Slashdot, is likely).
"However, the articles also contained data less suited to a traditional database. For example, each article could have multiple authors, so there were actually more authors than there were articles."
This is completely wrong; that's a textbook case of something perfectly suited to a traditional (relational) database.
Well, based on how many things are wrong in the Java vs C# comparison, too, one can only guess that the "software developer" is just some hack who is comped by Slashdot to drive clicks to their sub-sites.
Man, this place has really gone to shit in the last year -- just a waste of time to read. Sucks that it's hard to break 15 years of habit ...
So the gist of the article is..... (Score:4, Informative)
MongoDB would have been perfect based on the structure of the data, but the client didn't want to pay for setup and hosting costs. DynamoDB was the cheaper alternative, but more of a pain in the ass to implement. Makes me wonder if the hosting cost savings offset the additional development time.
Question from relational-land (Score:5, Informative)
As someone whose work and thinking are firmly planted in traditional RDBMS, a few of those decisions did not make sense.
I understand what he's saying about normalized tables for author, keywords, and categories. But then when he has to build and maintain index tables for author, keyword, and categories, doesn't that negate any advantage of not having those tables?
I understand he's designed things to ease retrieval of articles, but it seems the trade-offs on other functions are too great. It's nice an author's bio is right there in the article object, but when it's time to update the bio, doesn't that mean going through and touching every article by that author?
I've got a bunch of similar examples, and I would not be at all surprised if they all boiled down to 'I don't understand what this guy is doing.' But basically, isn't NoSQL's strength in dealing with dynamic content? And in this example, serving static articles, isn't the choice between NoSQL and a traditional RDBMS essentially up to personal preference?
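To make the "index tables" point concrete, here is a rough sketch of what maintaining an author index by hand looks like. It uses plain Python dicts as a stand-in for the key-value store; the function names and the data layout are my own assumptions, not the code from TFA.

```python
from collections import defaultdict

# Stand-ins for the store (assumed layout, for illustration only).
articles = {}                  # article_id -> record with the authors embedded
by_author = defaultdict(set)   # hand-built "index table": author -> article ids

def put_article(article_id, title, authors):
    """Write a record AND keep the author index in step with it."""
    old = articles.get(article_id)
    if old is not None:
        # Remove stale index entries left over from the previous version.
        for name in old["authors"]:
            by_author[name].discard(article_id)
    articles[article_id] = {"title": title, "authors": list(authors)}
    for name in authors:
        by_author[name].add(article_id)

def titles_by(author):
    """Look up titles through the index instead of scanning every record."""
    return sorted(articles[i]["title"] for i in by_author.get(author, ()))

put_article("a1", "On Joins", ["Ada", "Grace"])
put_article("a2", "On Indexes", ["Grace"])
print(titles_by("Grace"))
```

Every write path has to remember to call the index code, which is exactly the bookkeeping an RDBMS does for you once you declare the index.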
Re: (Score:1)
Maybe you should factor in the usage pattern and instance counts as well.
Someone's bio might appear in how many articles? A few hundred? And how often will the bio be updated? A couple of times a year? So, updating a bio comes down to touching a few hundred records a few times a year. Compare that with thousands of accesses per day and you've suddenly tipped the scale.
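A back-of-the-envelope version of that comparison, as a sketch. The three input numbers are assumptions picked to match "a few hundred", "a couple of times a year" and "thousands per day"; plug in real ones before drawing conclusions.

```python
def yearly_ops(articles_per_author, bio_updates_per_year, reads_per_day):
    """Return (records rewritten per year, article reads per year)."""
    writes = articles_per_author * bio_updates_per_year
    reads = reads_per_day * 365
    return writes, reads

# Assumed figures, not measurements.
writes, reads = yearly_ops(articles_per_author=300,
                           bio_updates_per_year=2,
                           reads_per_day=2000)
print(f"{writes} rewrites/year vs {reads} reads/year")
```

With those assumed numbers the reads outnumber the rewrites by roughly three orders of magnitude, which is the case for embedding the bio in the article record.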
Re: (Score:1)
So... what you're saying is that the application needs a materialized view after benchmarks show that joining against the authors table is a performance bottleneck?
Re:Question from relational-land (Score:5, Insightful)
Oh come on now. Play fair. If you start throwing around advanced database features like materialized views then you will immediately invalidate 90% of the use cases commonly used for choosing NoSQL over relational databases. That is just mean.
Re: (Score:2)
Oracle's "snapshots" were renamed to "materialized views" in 1999, MSSQL gained "indexed views" in 2005, MongoDB "began development" in 2007.
Doomed to reinvent it, indeed.
Re:Question from relational-land (Score:4, Informative)
Maybe you should factor in the usage pattern and instance counts as well.
Someone's bio might appear in how many articles? A few hundred? And how often will the bio be updated? A couple of times a year? So, updating a bio comes down to touching a few hundred records a few times a year. Compare that with thousands of accesses per day and you've suddenly tipped the scale.
That's exactly the sort of answer I was looking for. Thank you. (Actually, I'd expect most bios get updated only a handful of times over the life of the author. You start with first publications as a grad student, then you leave school, maybe change jobs a couple of times, maybe a few notable achievements, then the author dies.)
That is the sort of design consideration I'd like to read about. That would give a useful comparison between platforms. As it is, this article boils down to "I went NoSQL over RDBMS, because...well, just because. I went Amazon over something else because it's easier for my idiot client to administer."
Re: (Score:2)
You know, I could chop off the pinky toe of my left foot, I mean, I only use it a couple times a year!
Re: (Score:2)
Someone's bio might appear in how many articles? A few hundred? And how often will the bio be updated? A couple of times a year? So, updating a bio comes down to touching a few hundred records a few times a year. Compare that with thousands of accesses per day and you've suddenly tipped the scale.
That would make sense if you had to pull bios with an article, which should hardly be the case. At most, you'd have to pull in current authors' affiliations. A bio would ideally stay behind an author link, and be pulled in quite rarely. I for one would much rather have a list of authors immediately followed by the abstract than having to move through several pages of biographies for an article with 4-5 authors in order to find the abstract and the actual article. So for me the decision to put every bio in ev
Re:Question from relational-land (Score:5, Insightful)
Don't try to actually make sense of the decisions made in the article. I am glad that he summed up all of the reasons why he didn't go with a relational database early in the article, so I didn't have to bother reading the rest. I am an advocate of NoSQL, but this whole article is describing a project that is almost perfect for a relational database.
But considering this author's previous analysis of Java vs C#, I am not surprised that this article was hardly worth the time to read.
Re: (Score:2)
Heck yeah, it reminds me of a project I did in 2004 or 2005, which stored over a hundred thousand articles (some of them more than 64KB!) with multiple auth
Re: (Score:2)
In my opinion, you must have a VERY good reason before even considering giving up ACID transactions. If your RDBMS isn't fast enough, almost certainly it's because you're doing it wrong, not because there's anything fundamentally wrong with the tool.
Those who do RDBMS wrong usually do NoSQL wrong too. Shocker, I know.
Re: (Score:2)
It's nice an author's bio is right there in the article object, but when it's time to update the bio, doesn't that mean going through and touching every article by that author?
Actually, you don't update the biographical information for an article. The biographical information in the article is supposed to reflect the biographical information for the author at the time at which the article is published. When you update the biographical information, it goes into any articles published after the bio is updated.
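A small sketch of that snapshot-at-publication behaviour, using plain Python dicts. The record layout and the names are assumptions for illustration only.

```python
import copy

# Current author records (assumed layout).
authors = {"jdoe": {"name": "J. Doe", "bio": "Grad student"}}

def publish(title, author_ids):
    """Build an article record holding a copy of each bio as it stands today."""
    return {
        "title": title,
        # deepcopy, so later edits to `authors` cannot reach published articles
        "authors": [copy.deepcopy(authors[a]) for a in author_ids],
    }

first = publish("Early Work", ["jdoe"])
authors["jdoe"]["bio"] = "Professor"      # bio updated after first publication
second = publish("Later Work", ["jdoe"])

print(first["authors"][0]["bio"])
print(second["authors"][0]["bio"])
```

The first article keeps the old bio and only the later one picks up the change, so there is no mass update to do at all under this policy.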
Re: (Score:2)
Ars Technica follows the non-traditional way, and personally, only nostalgia would be a reason to retain the original bio.
Bad planning (Score:5, Interesting)
Throughout the article the client says they don't want full-text search. The author says he can "add it later," then compresses the body text field. Metadata like authorship information is also stored in a nasty JSON format -- so say goodbye to being able to search that later, too!
About that compression...
That compression proved to be important due to yet another shortcoming of DynamoDB, one that nearly made me pull my hair out and encourage the team to switch back to MongoDB. It turns out the maximum record size in DynamoDB is 64K. That's not much, and it takes me back to the days of 16-bit Windows where the text field GUI element could only hold a maximum of 64K. That was also, um, twenty years ago.
Which is a limit that, say, InnoDB in MySQL also has. So, let's tally it up:
So what the hell is this database for? It's unusable, unsearchable, and completely pointless. You have to know the title of the article you're interested in to query it! It sounds, honestly, like this is a case where the client didn't know what they needed. I really, really am hard-pressed to fathom a repository for scientific articles where they store the full text but only need to look up titles. With that kind of design, they could drop their internal DB and just use PubMed or Google Scholar... and get way better results!
I think the author and his team failed the customer in this case by providing them with an inflexible system. Either they forced the client into accepting these horrible limitations so they could play with new (and expensive!) toys, or the client just flat-out doesn't need this database for anything (in which case it's a waste of money.) This kind of data absolutely needs to be kept in a relational database to be useful.
Which, along with his horrible Java vs. C# comparison [slashdot.org], makes Jeff Cogswell officially the Slashdot contributor with the worst analytical skills.
Re:Bad planning (Score:4, Interesting)
Which, along with his horrible Java vs. C# comparison [slashdot.org], makes Jeff Cogswell officially the Slashdot contributor with the worst analytical skills.
OK, that's what I thought. Well, first, for anyone who hasn't read or doesn't remember that "Java vs. C#" thing, don't go back and read it now. Save your time, it's horrible.
Now, for the current article, isn't designing a database all about trade-offs? E.g. Indexes make it easier to find stuff, but then make extra work (updating indexes) when adding stuff. It's about balancing reading and writing, speed and maintenance, etc. And it seems like this guy has only thought about pulling out a single article to the exclusion of everything else.
Do we just not understand DynamoDB? How does this system pull all the articles by a certain author or with a certain keyword? What if they need to update an author's bio? With categories stored within the article object, how does he enforce integrity, so all "general relativity" articles end up with "general relativity" and not a mix of GR, Gen Rel, g relativity, etc?
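For comparison, here is roughly how the integrity question gets answered in relational-land: a minimal sqlite3 sketch where categories live in their own table and a foreign key rejects anything not on the list. Table names, column names, and sample data are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE categories (name TEXT PRIMARY KEY);
    CREATE TABLE articles   (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE article_categories (
        article_id INTEGER NOT NULL REFERENCES articles(id),
        category   TEXT    NOT NULL REFERENCES categories(name),
        PRIMARY KEY (article_id, category)
    );
    INSERT INTO categories VALUES ('general relativity');
    INSERT INTO articles   VALUES (1, 'Curved Spacetime');
""")

def tag(article_id, category):
    """Attach a category to an article; return False if the database rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO article_categories VALUES (?, ?)",
                         (article_id, category))
        return True
    except sqlite3.IntegrityError:
        return False

print(tag(1, "general relativity"))  # known category
print(tag(1, "Gen Rel"))             # unknown spelling, refused by the FK
```

With the categories embedded in each article object, the same check has to live in application code, and every writer has to remember to run it.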
What happens when they want to add full text search? Or pictures to articles? That 64k limit would seem like a deal breaker. 64k that includes EVERYTHING about an article--abstract, full text, authors and bios, etc.
My first thought was, this does not make much sense. Then I thought, well, I work with old skool RDBMS, and I just don't get NoSQL. But now I think, naw, this guy really doesn't know enough to merit the level of attention his blatherings get on /.
Re:Bad planning (Score:5, Interesting)
That compression proved to be important due to yet another shortcoming of DynamoDB, one that nearly made me pull my hair out and encourage the team to switch back to MongoDB. It turns out the maximum record size in DynamoDB is 64K. That's not much, and it takes me back to the days of 16-bit Windows where the text field GUI element could only hold a maximum of 64K. That was also, um, twenty years ago.
I didn't understand why he dismissed S3 to store his documents in the first place:
Amazon has their S3 storage, but that's more suited to blob data -- not ideal for documents
Why wouldn't an S3 blob be an ideal place to store a document of unknown size that you don't care about indexing? Later he says "In the DynamoDB record, simply store the identifier for the S3 object. That doesn't sound like much fun, but it would be doable" -- is storing an S3 pointer worse than deploying a solution that will fail on the first document that exceeds 64KB, at which point he'll need to come up with a scheme to split large docs across multiple records? Especially when DynamoDB storage costs 10 times more than S3 storage ($1/GB/month vs $0.095/GB/month).
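A sketch of the pointer approach, to show it is not much work. This does not call AWS at all: two dicts stand in for the DynamoDB table and the S3 bucket, and the key scheme, headroom figure, and function names are all assumptions of mine. The 64K limit and the two prices are the ones quoted above.

```python
import zlib

MAX_INLINE_BYTES = 60 * 1024   # 64K record limit minus assumed headroom for other attributes

records = {}      # hypothetical stand-in for the DynamoDB table
blob_store = {}   # hypothetical stand-in for an S3 bucket: key -> bytes

def save_article(article_id, body_text):
    """Compress the body; keep it inline if it fits, otherwise store a pointer."""
    packed = zlib.compress(body_text.encode("utf-8"))
    if len(packed) <= MAX_INLINE_BYTES:
        records[article_id] = {"body": packed}
    else:
        key = f"articles/{article_id}"          # made-up key scheme
        blob_store[key] = packed
        records[article_id] = {"body_ref": key}

def load_article(article_id):
    """Fetch the body from wherever save_article put it."""
    rec = records[article_id]
    packed = rec["body"] if "body" in rec else blob_store[rec["body_ref"]]
    return zlib.decompress(packed).decode("utf-8")

# Price ratio from the figures quoted above ($/GB/month).
price_ratio = 1.00 / 0.095

save_article("a1", "A short abstract and body.")
print(load_article("a1"), round(price_ratio, 1))
```

The caller never sees which path was taken, so an oversized document degrades to one extra fetch instead of a failed write.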
Re: (Score:1)
The AWS platform and the ease of scaling it offers. The application can actually scale itself with their API. I know you can scale *sql horizontally, but you can't argue that it's easier.
From TFA:
"Our client said they didn't need a full-text search on the text or abstract of the documents; they only cared about people searching keywords and categories. That's fine -- we could always add further search capabilities later on, using third-party indexing and searching tools such as Apache Lucene."
Re: (Score:2)
Re: (Score:2)
Interesting analysis.
I've been messing around writing my own Java NoSQL CMS called Magneato. It stores articles in XML because I use XForms for the front end (maybe a bad choice but there isn't a good forms solution yet, not even with HTML5) and I use Lucene/Bobo for the navigation and search side of things. It is focussed on faceted navigation, although you can have relations between articles (parent of, sibling, etc.) via Lucene.
It actually sounds like my efforts are better than what this team has produced.
Re: (Score:2)
This article is garbage (Score:1)
TL;DR: Jeff Cogswell doesn't understand how relational databases work. Or "the cloud", for that matter.
My migration path (Score:5, Funny)
We decided that MongoDB was adequate but didn't leverage the synergies we were trying to harvest from our development methodologies.
We looked at GumboDB and found it was lacking in visualization tools to create a warehouse for our data that would provide a real-time dashboard of the operational metrics we were seeking.
Next up was SuperDuperDB which was great from a client-server-man-in-the-middle perspective but required a complex LDAP authentication matrix that reticulated splines within our identity management roadmap.
After that I quit. I hear they are using Access 95 with VBA.
Re: (Score:3)
After that I quit. I hear they are using Access 95 with VBA.
I think you're trying to be funny (or at least sarcastic) but the last time I worked on a system that stored multiple values in a field as a delimited string--as this guy proposes storing multiple authors and keywords--was for a late 90s dotcom running a web site off of an Access 97 mdb.
Ironically, I came to the opposite conclusion (Score:2)
Re: (Score:1)
Where's the irony exactly?
Re: (Score:1)
Re: (Score:1)
Yes, there is nothing ironic about that at all. Even when applying the retarded definition of irony popularized by Alanis Morissette.
Re: (Score:1)
Where's the irony exactly?
Unless Travis Brown ejaculated while reaching his conclusion, there is none.
http://www.youtube.com/watch?v=WY_amJ0YZrM [youtube.com]
Re:Ironically, I came to the opposite conclusion (Score:4, Funny)
Re: (Score:2)
Did you submit it as an article here?
If not please do.
Re: (Score:1)
Re: (Score:2)
I gave it a bump in the firehose, but who knows if it will make it to the main page.
Re: (Score:1)
His solution becomes web scale.
Comment removed (Score:4, Insightful)
I just sneezed into my punch cards (Score:3)
FTFA:
Hello, I'm a time traveller from 1973 where I've been fondly imagining you folks in the future had written software to solve this kind of problem in a more generic fashion. Back in the past we have some visionary guy by the name of Codd, and in my wilder dreams I sometimes imagine by the year 2000 someone has created some kind of revolutionary database software which is based on his "SEQUEL" ideas and does fancy stuff like maintaining its own indexes.
Then I wake up and realise it was just a flight of fantasy.