I had been a Porn Collector for a decade before I found 1-gigabyte Porn Collections to write about. Now it is 15 years later, and the 1-petabyte barrier is crumbling.
This is intended as a joke, I assume, but it also brings up the fact that it is a different sort of data that is now collected.
When I look at CRM systems, they used to contain basically the address and perhaps logs from calls they made to the call center. Now whole phone conversations are logged as well as faxes and letters that are scanned, together with images and video that is available. Faxes and letters used to have only a reference number and you could look them up in a file cabinet.
So even though there is not that much more data collected, (things were already available) they are now all put in the database. Where it used to be an entry 'customer was extremely angry and cursed a lot' it now saves the mp3 for all eternity (where legal).
So yes, the HD space it takes is bigger and thus the amount is bigger, yet it does not automatically mean that the sort of data is bigger. E.g. do we suddenly have shoe size or other data available? Could be, but it also could be that we just have different file formats we now save in the database.
"they used to contain basically the address and perhaps logs from calls they made to the call center. Now whole phone conversations are logged as well as faxes and letters that are scanned, together with images and video that is available."
When my unemployment was running out years ago, I took a job at a call center to pay the bills. When I had to ask a co-worker a question, I often would hit Mute instead of Hold after asking the caller to hold. It was pretty entertaining!
by Anonymous Coward
on Monday August 25 2008, @07:44AM (#24735515)
They have many towns now with less than 50k people completely photographed, every street in high res. That has to be well over 1-petabyte, though I doubt it's all in one location, must be distributed?
Take a look at almost any large financial firm. The email retention system alone is much larger than a petabyte, and that's just dealing with the online media, not including what's spooled to tape. Due to deficiencies in RDBMS systems, each of the large firms usually develops its own systems for managing the archival system on top of the database.
Call me old fashioned, but I don't see why anyone but a search engine like google would need anything like a petabyte. You can have only so much useful information about anything. Sounds to me like, fill your garage with sh1t, build a bigger garage.
So the fact that movies have gone from 780mb (dvdrips) to 4.8gb (straight up copies) to 25gig (blu ray) doesn't bear any significance to you?
Or how about games which have gone from 1mb to installations that are upwards of 10gigs now (warhammer IIRC is 9 something).
Not to mention MS's fiasco of their Office XML format where things take up a ridiculous amount of space in comparison to open office (10mb docx vs 2.9mb open office)...it's all about the level of tech knowledge of someone that determines their space usage.
I wouldn't mind 3-4 TB, I'd split it off into about 4 partitions or raid stripe and call it a day for a while.
However consumer use is indicative of business use, so I would expect things to head towards exabyte eventually.
This is kind of my point. Do companies keep libraries of pr0n, video, music? Sure, if you're a media company you will. But say you're a plumbing distributor. You'll have the usual accounting stuff, and media for marketing, and some BS overhead, but don't tell me it adds up to a TB much less a PB.
On the other hand, if you have the extra space, it invites the usual waste in the form of archive directories for closed-out years, development junk, etc. Spinning round and round, doing nothing.
Yep. That's exactly it. $200 today buys a 1 TB drive. $200 a few years ago bought a 1 GB drive. As the price has fallen, the value of the HDD has risen relative to its cost. Those archive directories and development junk aren't being deleted because they have value. Sure, it's not enough value to justify keeping them around when a 1 GB drive costs $200, but they are worth keeping around when a 1 TB drive costs that much.
They aren't "doing nothing" - they just weren't doing enough to be worth keeping until the price dropped enough.
All of this is making the 1 TB drive considerably more valuable than the 1 GB drive, despite their original purchase price parity. This is long-tail economics at work [wired.com]. As the individual bits become worth less and less, the value of the bits in total continues to rise, resulting in a completely new set of capabilities.
My DVR is an excellent example of this - it's a thorough change in the way that I watch television. Suddenly, it's a family event that we can all share, because when I want to comment, I can just hit pause, and share my thought. Nothing's lost, if needed we can just hit rewind a bit, and suddenly, instead of being annoyed at my daughter for wanting to comment on a point during a televised debate, I'm excited and interested! No more SHUSHSTing at my family, it's now a much more shared experience.
The price of nonlinear access media has dropped so incredibly that marginal-value bits (like video) are suddenly cheap enough to make it all possible.
And I'd also be worried about losing a PB all at once. There are TB drives at my local Best Buy, but that's a lot to lose at once. I'd rather split my files and programs between two or more smaller drives (and have a RAID).
Petabytes are actually pretty common in the sciences. I visited NCAR (National Center for Atmospheric Research [ucar.edu]) in Boulder five years ago and their main database was in the 2PB region even then. I'm sure it's a lot larger today.
The LHC will generate several PB of data per year, as will the Large Synoptic Survey Telescope [lsst.org]. These projects aren't all that uncommon.
You can have only so much useful information about anything.
If you have the space available and the tools to utilize the stored data, why not? The more data you keep, the more information you will have available when techniques or routines become available to you to utilize this data.
Call me old fashioned, but I don't see why anyone but a search engine like google would need anything like a petabyte. You can have only so much useful information about anything. Sounds to me like, fill your garage with sh1t, build a bigger garage.
Unfortunately, you gather up a lot of digital stuff fast and most of the time it's not useful. Take my business mail, for example: it's full of old presentations and random versions of various documents and whatnot. Is it worth cleaning up? No. Is it worth keeping? Well, from time to time clients start asking about old things and it's very useful to have it. I figure 90% of it could be deleted, keeping only final versions and important mails. Of what's left, 90% will never be asked for again, so I keep 100% for the sake of maybe 1%. Make a company with hundreds of thousands of people all like that and you get huge, huge amounts of data. Keeping it all is still cheaper than going through those huge, huge amounts of data. That goes double for many automated data collection processes - it's cheaper to keep everything until it's all guaranteed useless than to try to sort it out.
Data mining is statistically based. The more information that's available to mine, the more accurate the results will be.
A minor quibble. I do data mining for a living. With most data sets, we end up sampling them down, because more data ramps up processing time faster than it improves accuracy. With most problems, more data doesn't improve accuracy measurably once you've reached a certain critical mass in the dataset. Simplistically, you don't need to flip the coin a billion times to figure out that it comes up heads 50% of the time.
It's a rare problem that we use more than 100,000 records for. They exist, but they're rare.
... DB design and old data that should be purged. Color me unimpressed.
I'm convinced now that regardless of attempted discrimination, HUMANS are pack-rats. THAT I can deal with, as people can be trained to actually throw shit away. The problem is when lawyers get involved in the matter. Yes, most of the shit we have today in the corporate world we are FORCED to keep due to some insane lawsuit and follow-up "fix-it-forever" law that calls for us to keep a copy of every damn thing that flows electronically for the next 7 - 70 years.
Imagine having tens of millions, or just millions, of users - all of them with their records, history, targeted ads data. Or some mail provider that stores attachments in a database. Or a file sharing service like those you and I know. That's plenty of information to manage. Add some overhead, and it's easy to overfill even the biggest database.
Also I agree with you that bad design might be a concern. Of course there's no big database that couldn't get on a "purge" diet.
I remember encountering a 1+ petabyte database 10 years ago: it was the database to record and analyze particle accelerator experiment data at CERN. And it was built using a commercial object database - not relational. Oh but wait - the relational vendors have told us that OO databases don't scale....
You are mistaken. While certainly almost everything (right or wrong) has been said at some time by someone, nobody respectable who knew what they were doing ever claimed that object-oriented databases would not scale.
In fact OO and similar (CODASYL, network-style, etc.) databases were used and continue to be used very heavily in applications where relational databases do not scale.
Only problem is, where do you find an OO database with a good index and search implementation that doesn't cost so much that, when you ask the company for a price, they don't even want to reply?
Point well taken. The problem now is that OO database products were decimated by their failure to explain their value to the market. However, there is a little bit of a resurgence. See http://www.service-architecture.com/products/object-oriented_databases.html [service-architecture.com]
Any organisation that wishes to be classed in any way professional knows that the value in its databases has to be protected. That requires them to have the means to recover the data if something bad happens. A hot-mirrored copy is simply not good enough (one corruption would get written to both copies).
As a consequence, the size of commercial databases is limited by the amount of time the organisation is willing to have it unavailable while it is restored, in the case of a disaster, or the time taken to create/update secure, offline copies.
Not by intrinsic properties of the database or host architecture.
I need measurements I can understand, like how many Keanu Reeves' brains is a petabyte? And could he hold it indefinitely, or would his head explode at some point? If the latter, can we get him started on it now?
We've had petabyte databases on mainframes for a good couple of years. DB2 v9 on zSeries has two new tablespace types that make managing these humungous databases much easier.
So it may be news for the PC world but it's bordering on ancient history on IBM mainframes.
Okay, I know that the article is referring to databases, but the comments seem to have gone the way of disc storage, so I will take the bait and go off topic.
Petabyte drives would not really be that impractical for people who like to archive stuff. I just filled up a 300 gig drive and a 750 gig drive with just stuff off of the DVR in under a year. While National Geographic HD may be compressed so badly that it barely looks better than SD, and a one hour show is under 2 gig, try archivin
Porn collection (Score:4, Funny)
No porn collection jokes please.
Re: (Score:3, Insightful)
No porn collection jokes please.
+1 Futile
Won't somebody think of the children.... (Score:2, Funny)
Oh wait, that was petabyte...
Fixed it for you... (Score:5, Funny)
Noob (Score:5, Funny)
Re:Noob (Score:5, Funny)
It has an event horizon and is actively acquiring porn on its own?
Re: (Score:3, Funny)
...event horizon...
Awesome! That's what I'm going to call it now! My "event horizon"!
"Here it comes baby, the point of no return!"
Petabyte DBs are old news to... (Score:3, Funny)
Petabyte DBs are old news to techie porn collectors. They always mix their two favorite subjects into one. Tech + Porn = Petabyte+ Porn Database
Re:Petabyte DBs are old news to... (Score:5, Interesting)
This is intended as a joke, I assume, but it also brings up the fact that it is a different sort of data that is now collected.
When I look at CRM systems, they used to contain basically the address and perhaps logs from calls they made to the call center. Now whole phone conversations are logged as well as faxes and letters that are scanned, together with images and video that is available.
Faxes and letters used to have only a reference number and you could look them up in a file cabinet.
So even though there is not that much more data collected, (things were already available) they are now all put in the database. Where it used to be an entry 'customer was extremely angry and cursed a lot' it now saves the mp3 for all eternity (where legal).
So yes, the HD space it takes is bigger and thus the amount is bigger, yet it does not automatically mean that the sort of data is bigger. E.g. do we suddenly have shoe size or other data available? Could be, but it also could be that we just have different file formats we now save in the database.
Re:Petabyte DBs are old news to... (Score:4, Interesting)
"they used to contain basically the address and perhaps logs from calls they made to the call center. Now whole phone conversations are logged as well as faxes and letters that are scanned, together with images and video that is available."
Reminds me of David Brin's The Transparent Society
http://www.davidbrin.com/tschp1.html [davidbrin.com]
http://www.amazon.com/Transparent-Society-Technology-Between-Privacy/dp/0738201448/ [amazon.com]
Re: (Score:3, Insightful)
When my unemployment was running out years ago, I took a job at a call center to pay the bills. When I had to ask a co-worker a question, I often would hit Mute instead of Hold after asking the caller to hold. It was pretty entertaining!
Oh s***! I'm calling my Congressman! (Score:5, Funny)
I have to find my kid. Last time I saw her, she was with her Uncle Micky while he was having his morning martini.
Re: (Score:2)
http://pw0nd.com/wp-content/uploads/2008/06/pdfvspedophile-500x400.jpg [pw0nd.com]
Re: (Score:2)
Hotlinking FAIL!
Google Street View must be most massive db ever? (Score:3, Interesting)
They have many towns now with less than 50k people completely photographed, every street in high res. That has to be well over 1-petabyte, though I doubt it's all in one location, must be distributed?
Re:Google Street View must be most massive db ever (Score:5, Informative)
I am confused !! (Score:5, Funny)
How many Libraries of Congress are necessary to break the 1-petabyte barrier ??
Re:I am confused !! (Score:4, Informative)
1 Petabyte = 1,000 Terabytes
1 LoC = 10 Terabytes
100 LoC = 1,000 Terabytes
======
100 LoC = 1 Petabyte
LHC data production (Score:4, Informative)
So when active, the Large Hadron Collider will generate the equivalent volume of data of 50 Libraries of Congress every second.
No big news here.... (Score:5, Interesting)
Take a look at almost any large financial firm. The email retention system alone is much larger than a petabyte, and that's just dealing with the online media, not including what's spooled to tape. Due to deficiencies in RDBMS systems, each of the large firms usually develops its own systems for managing the archival system on top of the database.
Oh, come on. (Score:5, Interesting)
Re:Oh, come on. (Score:5, Insightful)
So the fact that movies have gone from 780mb (dvdrips) to 4.8gb (straight up copies) to 25gig (blu ray) doesn't bear any significance to you?
Or how about games which have gone from 1mb to installations that are upwards of 10gigs now (warhammer IIRC is 9 something).
Not to mention MS's fiasco of their Office XML format where things take up a ridiculous amount of space in comparison to open office (10mb docx vs 2.9mb open office)...it's all about the level of tech knowledge of someone that determines their space usage.
I wouldn't mind 3-4 TB, I'd split it off into about 4 partitions or raid stripe and call it a day for a while.
However consumer use is indicative of business use, so I would expect things to head towards exabyte eventually.
Re:Oh, come on. (Score:5, Insightful)
This is kind of my point. Do companies keep libraries of pr0n, video, music? Sure, if you're a media company you will. But say you're a plumbing distributor. You'll have the usual accounting stuff, and media for marketing, and some BS overhead, but don't tell me it adds up to a TB much less a PB.
On the other hand, if you have the extra space, it invites the usual waste in the form of archive directories for closed-out years, development junk, etc. Spinning round and round, doing nothing.
More long-tail economics! (Score:5, Interesting)
On the other hand, if you have the extra space, it invites the usual waste in the form of archive directories for closed-out years, development junk, etc. Spinning round and round, doing nothing.
Yep. That's exactly it. $200 today buys a 1 TB drive. $200 a few years ago bought a 1 GB drive. As the price has fallen, the value of the HDD has risen relative to its cost. Those archive directories and development junk aren't being deleted because they have value. Sure, it's not enough value to justify keeping them around when a 1 GB drive costs $200, but they are worth keeping around when a 1 TB drive costs that much.
They aren't "doing nothing" - they just weren't doing enough to be worth keeping until the price dropped enough.
All of this is making the 1 TB drive considerably more valuable than the 1 GB drive, despite their original purchase price parity. This is long-tail economics at work [wired.com]. As the individual bits become worth less and less, the value of the bits in total continues to rise, resulting in a completely new set of capabilities.
My DVR is an excellent example of this - it's a thorough change in the way that I watch television. Suddenly, it's a family event that we can all share, because when I want to comment, I can just hit pause, and share my thought. Nothing's lost, if needed we can just hit rewind a bit, and suddenly, instead of being annoyed at my daughter for wanting to comment on a point during a televised debate, I'm excited and interested! No more SHUSHSTing at my family, it's now a much more shared experience.
The price of nonlinear access media has dropped so incredibly that marginal-value bits (like video) are suddenly cheap enough to make it all possible.
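To put rough numbers on that argument, here is a minimal Python sketch. The drive prices and capacities are the ones quoted in this comment; the "value per GB" figure is a purely hypothetical stand-in for how much some archived junk is worth to its owner, and the function names are made up for illustration.

```python
def cost_per_gb(drive_price_usd: float, capacity_gb: float) -> float:
    """Price of one gigabyte of disk, in dollars."""
    return drive_price_usd / capacity_gb


def worth_keeping(value_per_gb_usd: float, drive_price_usd: float,
                  capacity_gb: float) -> bool:
    """Data is worth keeping once its value per GB covers the disk it sits on."""
    return value_per_gb_usd >= cost_per_gb(drive_price_usd, capacity_gb)


# Figures from the comment: $200 for 1 GB then, $200 for 1 TB (1,000 GB) now.
old_cost = cost_per_gb(200, 1)       # dollars per GB a few years ago
new_cost = cost_per_gb(200, 1000)    # dollars per GB today

# Hypothetical: archive directories worth about $1 per GB to the business.
archive_value = 1.0
print(old_cost, worth_keeping(archive_value, 200, 1))
print(new_cost, worth_keeping(archive_value, 200, 1000))
```

Under these assumed numbers the same archive flips from "delete it" to "keep it" without its own value changing at all; only the price of the disk moved.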
Re:Oh, come on. (Score:4, Insightful)
Agreed.
And I'd also be worried about losing a PB all at once. There are TB drives at my local Best Buy, but that's a lot to lose at once. I'd rather split my files and programs between two or more smaller drives (and have a RAID).
I won't call you old fashioned... (Score:4, Insightful)
... but I do wonder if you've ever heard of Sarbanes-Oxley.
Science! (Score:5, Informative)
The LHC will generate several PB of data per year, as will the Large Synoptic Survey Telescope [lsst.org]. These projects aren't all that uncommon.
Re:Oh, come on. (Score:4, Funny)
Re: (Score:2)
You can have only so much useful information about anything.
If you have the space available and the tools to utilize the stored data, why not? The more data you keep, the more information you will have available when techniques or routines become available to you to utilize this data.
Re:Oh, come on. (Score:4, Insightful)
Call me old fashioned, but I don't see why anyone but a search engine like google would need anything like a petabyte. You can have only so much useful information about anything. Sounds to me like, fill your garage with sh1t, build a bigger garage.
Unfortunately, you gather up a lot of digital stuff fast and most of the time it's not useful. Take my business mail, for example: it's full of old presentations and random versions of various documents and whatnot. Is it worth cleaning up? No. Is it worth keeping? Well, from time to time clients start asking about old things and it's very useful to have it. I figure 90% of it could be deleted, keeping only final versions and important mails. Of what's left, 90% will never be asked for again, so I keep 100% for the sake of maybe 1%. Make a company with hundreds of thousands of people all like that and you get huge, huge amounts of data. Keeping it all is still cheaper than going through those huge, huge amounts of data. That goes double for many automated data collection processes - it's cheaper to keep everything until it's all guaranteed useless than to try to sort it out.
Re:Oh, come on. (Score:5, Interesting)
Data mining is statistically based. The more information that's available to mine, the more accurate the results will be.
A minor quibble. I do data mining for a living. With most data sets, we end up sampling them down, because more data ramps up processing time faster than it improves accuracy. With most problems, more data doesn't improve accuracy measurably once you've reached a certain critical mass in the dataset. Simplistically, you don't need to flip the coin a billion times to figure out that it comes up heads 50% of the time.
It's a rare problem that we use more than 100,000 records for. They exist, but they're rare.
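The coin-flip point can be made concrete with the standard error of a proportion, sqrt(p(1-p)/n), which shrinks only with the square root of the sample size. The Python sketch below is an illustration of that general statistical fact, not the poster's actual workflow; the function names and the sample sizes in the loop are chosen here for the example.

```python
import math
import random


def standard_error(p: float, n: int) -> float:
    """Standard error of an estimated proportion p from n independent samples."""
    return math.sqrt(p * (1 - p) / n)


def estimate_heads(n: int, seed: int = 0) -> float:
    """Simulate n fair coin flips and return the observed fraction of heads."""
    rng = random.Random(seed)
    return sum(rng.random() < 0.5 for _ in range(n)) / n


# Each 100x increase in flips buys only a 10x smaller error.
for n in (100, 10_000, 1_000_000):
    print(n, standard_error(0.5, n))

# 100,000 flips already pins the answer down to a fraction of a percent.
print(estimate_heads(100_000))
```

At 100,000 flips the standard error is already about 0.0016; going to a billion flips only brings it down to about 0.000016, at ten thousand times the processing cost.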
Too Bad Most of that is Due to Poor... (Score:2, Insightful)
... DB design and old data that should be purged. Color me unimpressed.
Re: (Score:2, Interesting)
... DB design and old data that should be purged. Color me unimpressed.
I'm convinced now that regardless of attempted discrimination, HUMANS are pack-rats. THAT I can deal with, as people can be trained to actually throw shit away. The problem is when lawyers get involved in the matter. Yes, most of the shit we have today in the corporate world we are FORCED to keep due to some insane lawsuit and follow-up "fix-it-forever" law that calls for us to keep a copy of every damn thing that flows electronically for the next 7 - 70 years.
Could you almost call it corruption? Yes, I
Effect of the scale (Score:2, Insightful)
Imagine having tens of millions, or just millions, of users - all of them with their records, history, targeted ads data. Or some mail provider that stores attachments in a database. Or a file sharing service like those you and I know. That's plenty of information to manage. Add some overhead, and it's easy to overfill even the biggest database.
Also I agree with you that bad design might be a concern. Of course there's no big database that couldn't get on a "purge" diet.
Now seems to me we might have a problem w
OO databases have done this ten years ago (Score:5, Interesting)
That was ten years ago.
Re: (Score:3, Interesting)
In fact OO and similar (CODASYL, network-style, etc.) databases were used and continue to be used very heavily in applications where relational databases do not scale.
Re: (Score:3, Interesting)
Only problem is, where do you find an OO database with a good index and search implementation that doesn't cost so much that, when you ask the company for a price, they don't even want to reply?
Re: (Score:3, Interesting)
Google Maps is way bigger... (Score:3, Informative)
Google Maps' database is far bigger...
A base of 8 tiles, with each becoming four more smaller tiles, in two modes (map/satellite), and 16 zoom levels.
Each tile is approx. 30kB.
(((0.03* (8 * (4^16)))/1024)/1024) == 983.04TB right there.
My calculator doesn't handle numbers big enough for streetview. O_O
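For anyone whose calculator gives up, the estimate above can be reproduced in a few lines of Python. This is only a sketch built on the figures in this comment (8 base tiles, 4x subdivision, 16 zoom steps, about 30 kB per tile, two modes); none of it is a confirmed description of how Google Maps stores tiles. Note that the expression in the comment counts only the deepest zoom level of a single mode; the second function, which assumes every level and both modes are stored, is an extension added here.

```python
TILE_MB = 0.03       # ~30 kB per tile, the commenter's estimate
BASE_TILES = 8       # tiles at the coarsest level
ZOOM_STEPS = 16      # each step splits every tile into four
MODES = 2            # map and satellite


def deepest_level_tb(tile_mb: float = TILE_MB, base: int = BASE_TILES,
                     steps: int = ZOOM_STEPS) -> float:
    """The comment's expression: one mode, tiles of the deepest level only."""
    return tile_mb * base * 4 ** steps / 1024 / 1024


def all_levels_tb(tile_mb: float = TILE_MB, base: int = BASE_TILES,
                  steps: int = ZOOM_STEPS, modes: int = MODES) -> float:
    """Assumed extension: sum every level from 0 to `steps`, for all modes."""
    tiles = sum(base * 4 ** z for z in range(steps + 1))
    return modes * tile_mb * tiles / 1024 / 1024


print(deepest_level_tb())   # matches the 983.04 TB figure above
print(all_levels_tb())      # every level, both modes
```

Because each level holds four times the tiles of the one before it, the coarser levels together add only about a third on top of the deepest one; the bigger jump comes from counting both modes.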
Re:Google Maps is way bigger... (Score:5, Funny)
Google Maps' database is far bigger...
A base of 8 tiles, with each becoming four more smaller tiles, in two modes (map/satellite), and 16 zoom levels.
We are sorry, but we don't
have maps at this zoom
level for this region.
Try zooming out for a
broader look.
When the petafile barrier crumbles ... (Score:5, Funny)
... we'll need an army of Chris Hansens and a mountain of beartraps. God help us.
the only *real* barrier is backup time (Score:5, Interesting)
Any organisation that wishes to be classed in any way professional knows that the value in its databases has to be protected. That requires them to have the means to recover the data if something bad happens. A hot-mirrored copy is simply not good enough (one corruption would get written to both copies).
As a consequence, the size of commercial databases is limited by the amount of time the organisation is willing to have it unavailable while it is restored, in the case of a disaster, or the time taken to create/update secure, offline copies.
Not by intrinsic properties of the database or host architecture.
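A back-of-the-envelope sketch of that limit, in Python. The throughput and outage-window numbers are hypothetical, picked only to show the shape of the calculation; real restore rates depend on the media, the network and the database, and the function names are invented for this example.

```python
def restore_hours(size_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to restore size_tb at a sustained throughput (decimal units)."""
    size_mb = size_tb * 1_000_000
    return size_mb / throughput_mb_s / 3600


def max_size_tb(window_hours: float, throughput_mb_s: float) -> float:
    """Largest database that can be restored within the allowed outage window."""
    return window_hours * 3600 * throughput_mb_s / 1_000_000


# Hypothetical: a 1 PB (1,000 TB) database restored at a sustained 1,000 MB/s.
print(restore_hours(1000, 1000))   # well over a week of downtime

# Hypothetical: the business tolerates an 8 hour outage at the same rate.
print(max_size_tb(8, 1000))        # only a few tens of TB fit in that window
```

With these assumed figures the tolerable outage, not the database engine, caps the practical size at a small fraction of a petabyte unless restore throughput is scaled up to match.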
s/barrier/arbitrary round number/g (Score:5, Insightful)
That is all.
The world will only ever need 5 large databases (Score:5, Funny)
The world will only need 5 large databases.
None of them will ever need more than 640KB^H^HMB^H^HGBMB^H^HTB of RAM and 32MB^H^HGB^H^HTB^H^HPB of storage.
WalMart has a 4 petabyte database already (Score:4, Informative)
Johnny Mnemonic (Score:5, Funny)
I need measurements I can understand, like how many Keanu Reeves' brains is a petabyte? And could he hold it indefinitely, or would his head explode at some point? If the latter, can we get him started on it now?
How is this news? (Score:5, Interesting)
So it may be news for the PC world but it's bordering on ancient history on IBM mainframes.
Re: (Score:2, Flamebait)
Database, not filesystem. Thanks for almost bothering to read the summary, though.
Re: (Score:3, Insightful)
I could see practical applications (Score:3, Informative)
Okay, I know that the article is referring to databases, but the comments seem to have gone the way of disc storage, so I will take the bait and go off topic.
Petabyte drives would not really be that impractical for people who like to archive stuff. I just filled up a 300 gig drive and a 750 gig drive with just stuff off of the DVR in under a year. While National Geographic HD may be compressed so badly that it barely looks better than SD, and a one hour show is under 2 gig, try archivin