The 1-Petabyte Barrier Is Crumbling
CurtMonash writes "I had been a database industry analyst for a decade before I found 1-gigabyte databases to write about. Now it is 15 years later, and the 1-petabyte barrier is crumbling. Specifically, we are about to see data warehouses — running on commercial database management systems — that contain over 1 petabyte of actual user data. For example, Greenplum is slated to have two of them within 60 days. Given how close it was a year ago, Teradata may have crossed the 1-petabyte mark by now too. And by the way, Yahoo already has a petabyte+ database running on a home-grown system. Meanwhile, the 100-terabyte mark is almost old hat. Besides the vendors already mentioned above, others with 100+ terabyte databases deployed include Netezza, DATAllegro, Dataupia, and even SAS."
Yawn... (Score:1, Insightful)
Too Bad Most of that is Due to Poor... (Score:2, Insightful)
... DB design and old data that should be purged. Color me unimpressed.
Re:Oh, come on. (Score:5, Insightful)
So the fact that movies have gone from 780 MB (DVD rips) to 4.8 GB (straight-up copies) to 25 GB (Blu-ray) doesn't bear any significance to you?
Or how about games, which have gone from 1 MB to installations upwards of 10 GB now (Warhammer, IIRC, is 9-something)?
Not to mention MS's fiasco with their Office XML format, where files take up a ridiculous amount of space compared to OpenOffice (10 MB .docx vs. 2.9 MB OpenOffice)... it's a person's level of tech knowledge that determines their space usage.
I wouldn't mind 3-4 TB. I'd split it into about 4 partitions or a RAID stripe and call it a day for a while.
However, consumer use is indicative of business use, so I would expect things to head towards exabytes eventually.
Re:Oh, come on. (Score:4, Insightful)
Agreed.
And I'd also be worried about losing a PB all at once. There are TB drives at my local Best Buy, but that's a lot to lose at once. I'd rather split my files and programs between two or more smaller drives (and have a RAID).
"Barrier"? (Score:1, Insightful)
Gigabyte barrier. Petabyte barrier.
In what sense are these barriers? Does the database resist putting more data in it the closer to a petabyte you get? Is it likely to explode once it reaches 1 petabyte?
I won't call you old fashioned... (Score:4, Insightful)
... but I do wonder if you've ever heard of Sarbanes-Oxley.
Effect of the scale (Score:2, Insightful)
Imagine having tens of millions, or even just millions, of users, all of them with their records, history, and targeted-ad data. Or some mail provider that stores attachments in a database. Or a file sharing service like those you and I know. That's plenty of information to manage. Add overhead, and it's easy to overfill even the biggest database.
Also, I agree with you that bad design might be a concern. Of course there's no big database that couldn't go on a "purge" diet.
Now it seems to me we might have a problem with querying such a big bucket of random data. Imagine a query taking months to complete. We're gonna be there in another ten years.
And then we lose the capacity to make electricity. And we can use our CDs and DVDs, not to mention magnetic media, to... well, dig trenches.
Those pesky petabytes of data are going to doom us.
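For a rough sense of the "query taking months" worry above, here is a back-of-envelope sketch of how long one full scan of a petabyte would take. The node counts and the 100 MB/s per-node read rate are made-up illustrative figures, not measurements of any system mentioned in this thread, and the function name is hypothetical.

```python
def full_scan_days(data_bytes: float, nodes: int, mb_per_sec_per_node: float) -> float:
    """Days needed to read every byte once, assuming perfectly parallel sequential I/O."""
    bytes_per_sec = nodes * mb_per_sec_per_node * 1e6
    return data_bytes / bytes_per_sec / 86_400  # 86,400 seconds per day

PETABYTE = 1e15  # decimal petabyte, in bytes

# Hypothetical hardware figures, chosen only to illustrate the scale.
single = full_scan_days(PETABYTE, nodes=1, mb_per_sec_per_node=100)
cluster = full_scan_days(PETABYTE, nodes=1000, mb_per_sec_per_node=100)

print(f"1 node:     {single:.1f} days")
print(f"1000 nodes: {cluster * 24:.1f} hours")
```

Under those assumptions a single reader needs close to four months, while a thousand readers working in parallel need under three hours, so whether a query takes "months" depends mostly on how well the scan can be spread out.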
s/barrier/arbitrary round number/g (Score:5, Insightful)
That is all.
Re:Yawn... (Score:1, Insightful)
A file system is a database...
Re:Oh, come on. (Score:5, Insightful)
This is kind of my point. Do companies keep libraries of pr0n, video, music? Sure, if you're a media company you will. But say you're a plumbing distributor. You'll have the usual accounting stuff, and media for marketing, and some BS overhead, but don't tell me it adds up to a TB much less a PB.
On the other hand, if you have the extra space, it invites the usual waste in the form of archive directories for closed-out years, development junk, etc. Spinning round and round, doing nothing.
Re:Yawn... (Score:3, Insightful)
IBM Boulder (Score:2, Insightful)
Is the location of IBM's Managed Storage Services (MSS) division, which deploys SAN for customers in Boulder (including IBM internal) and other locations (over high speed fibre links) on IBM "Shark" (ESS) and DS6000/DS8000 devices. When I worked at IBM their marketing materials stated they were managing over 4 petabytes of data for enterprise customers out of that location alone, and that was four years ago! That doesn't count other MSS locations either, nor all the other areas where IBM implements large amounts of storage for customers. Remember, many if not most of IBM's customers are governments and Fortune 100 companies, particularly high finance. I think they've got some data.
So you want to talk about high levels of storage - IBM has the game covered, considering they invented the [ibm.com] HDD [wikipedia.org].
Re:Oh, come on. (Score:4, Insightful)
Call me old fashioned, but I don't see why anyone but a search engine like google would need anything like a petabyte. You can have only so much useful information about anything. Sounds to me like, fill your garage with sh1t, build a bigger garage.
Unfortunately, you gather up a lot of digital stuff fast, and most of the time it's not useful. Take, for example, my business mail: it's full of old presentations and random versions of various documents and whatnot. Is it worth cleaning up? No. Is it worth keeping? Well, from time to time clients start asking about old things, and it's very useful to have it. I figure 90% of it could be deleted, keeping only final versions and important mails. Of what remains, 90% will never be asked for again, so I keep 100% for the sake of maybe 1%. Take a company with hundreds of thousands of people all like that and you get huge, huge amounts of data. Storing it is still cheaper than going through those huge, huge amounts of data. That goes double for many automated data collection processes: it's cheaper to keep everything until it's all guaranteed useless than to try to sort it out.
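The 90%/1% estimate above can be spelled out as a tiny calculation. This is only a sketch of one reading of the post: it assumes the second "90%" applies to the 10% left after the first cut, and the percentages are the poster's guesses, not data.

```python
total = 1.0                             # everything currently kept
worth_keeping = total * 0.10            # left after deleting the 90% that could go
ever_requested = worth_keeping * 0.10   # 90% of that is never asked for again

print(f"worth keeping:  {worth_keeping:.0%}")
print(f"ever requested: {ever_requested:.0%}")
```

On that reading, about one item in a hundred is ever needed again, which is where the "100% for maybe 1%" figure comes from.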
Re:Porn collection (Score:3, Insightful)
No porn collection jokes please.
+1 Futile
Re:Petabyte DBs are old news to... (Score:3, Insightful)
When my unemployment was running out years ago, I took a job at a call center to pay the bills. When I had to ask a co-worker a question, I would often hit Mute instead of Hold after asking the caller to hold. It was pretty entertaining!