Wal-Mart's Data Obsession

g8oz writes "The New York Times covers Wal-Mart's obsession with collecting sales data. Fun fact: 'Wal-Mart has 460 terabytes of data stored on Teradata mainframes, at its Bentonville headquarters. To put that in perspective, the Internet has less than half as much data, according to experts.' That much information results in some interesting data-mining. Did you know hurricanes increase strawberry Pop Tarts sales 7-fold?"
  • by MaxPower2263 ( 529424 ) on Sunday November 14, 2004 @05:03PM (#10814579)
    Even Walmart probably doesn't know what all that data means. Think of the processing power needed to make sense of it all. I'm sure there are countless interesting trends lost in that data ocean.
  • Seen it! (Score:5, Informative)

    by Number44 ( 41761 ) on Sunday November 14, 2004 @05:06PM (#10814610) Homepage
    As a guest of WalMart I was able to enter their data center and see this Terraplex first hand. It's massive. It's thousands upon thousands of disks in ~8' frames, rows upon rows of racks. I walked down it and across it and up it and was simply awestruck by the idea of that many disks in one spot.

    The gentleman who gave me the tour indicated they have something like 72 weeks (1 year plus 2 weeks) of purchase data on LIVE disk arrays, plus huge archives of the same data on tape. If you buy anything and use your credit, debit, or whatever card, they can figure out your sales history obscenely quickly. Be afraid. Be very afraid.

    I also got to see Walmart.com (Sun E15k) and Samsclub.com (A bunch of HP boxes in a smallish frame), they were creepy, in a sense... all those sales going on at once, converging on a spot not a few feet from me.
  • by kimanaw ( 795600 ) on Sunday November 14, 2004 @05:15PM (#10814668)
    When you have 460TB of data, how the hell do you even begin to search it?

    With SQL.

    Teradata was built to handle processing very large datasets from day 1. 460 Terabytes distributed across a large number of CPUs and disks working in parallel with a robust SQL implementation isn't really the challenge. The hard part is keeping all those disks spinning when you start pushing MTBF limits, handling the thousands of concurrent users all banging away at the data, and the constant streaming of new data into the system in order to support near real-time DSS.

    For those inclined to know more, check here. [ncr.com]
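    The shared-nothing idea described above can be sketched in a few lines: hash-distribute rows across parallel units on load, let each unit scan only its own slice, and have a coordinator merge the partial results. This is a toy illustration with made-up data, not Teradata's actual engine (Teradata calls its parallel units AMPs):

```python
# Toy sketch of a shared-nothing parallel aggregate, the idea behind
# parallel databases like Teradata (hypothetical data, not its real engine).
from concurrent.futures import ProcessPoolExecutor

NUM_AMPS = 4  # number of parallel units

def partition(rows, n):
    """Hash-distribute rows across n units, as a parallel DBMS does on load."""
    buckets = [[] for _ in range(n)]
    for key, value in rows:
        buckets[hash(key) % n].append((key, value))
    return buckets

def local_scan(bucket):
    """Each unit scans only its own disks; no unit ever sees the whole table."""
    return sum(value for _, value in bucket)

if __name__ == "__main__":
    sales = [(f"store-{i}", i % 7) for i in range(100_000)]  # fake sales rows
    buckets = partition(sales, NUM_AMPS)
    with ProcessPoolExecutor(NUM_AMPS) as pool:
        partials = pool.map(local_scan, buckets)  # units work in parallel
    total = sum(partials)  # coordinator merges the partial results
    assert total == sum(v for _, v in sales)
```

    The expensive part in practice is exactly what the parent names: keeping thousands of units healthy and fed with streaming data, not the scan itself.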

  • by sql*kitten ( 1359 ) * on Sunday November 14, 2004 @05:17PM (#10814684)
    Seems like they'd need to license map-reduce from google or something.

    As the article says, they're using Teradata [ncr.com]. This is not a product that I'd expect the average Slashbot, who thinks "IT" and "internet" are synonymous, to have heard of. Nevertheless, if you work with industrial amounts of data, you will know that Teradata databases can reasonably claim to be to Oracle as Oracle is to MySQL.
  • by Hobbex ( 41473 ) on Sunday November 14, 2004 @05:17PM (#10814688)

    People who call themselves "experts" but are really just talking out of their asses do. Consider that The Internet Archive [wikipedia.org] alone contains more than a petabyte (1,024 terabytes) of data, all of it accessible, and that they are adding on the order of 20 terabytes a day, and you start to realize how much bigger the Web is.
  • Re:Seen it! (Score:3, Informative)

    by nizo ( 81281 ) on Sunday November 14, 2004 @05:18PM (#10814696) Homepage Journal
    I wonder how many people they have running around replacing failed disks in the arrays. It would have to be at least several full-time jobs worth of people, not to mention they must have a gigantic pile of disks waiting on-site.
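    A back-of-envelope estimate supports that. Assuming 73 GB drives and a 500,000-hour MTBF (both made-up but plausible 2004 figures, ignoring RAID overhead), you get roughly a couple of failed drives every week:

```python
# Rough estimate of drive-replacement workload for a 460 TB array.
# All figures are assumptions for illustration, not Wal-Mart's actual hardware.
DATA_TB = 460
DRIVE_GB = 73          # assume 73 GB SCSI drives, common in 2004
MTBF_HOURS = 500_000   # a typical quoted enterprise-drive MTBF
HOURS_PER_YEAR = 24 * 365

drives = DATA_TB * 1000 / DRIVE_GB            # ~6,300 drives, no RAID overhead
failures_per_year = drives * HOURS_PER_YEAR / MTBF_HOURS
print(f"{drives:.0f} drives, ~{failures_per_year:.0f} failures/year "
      f"(~{failures_per_year / 52:.1f} per week)")
```

    Not several full-time jobs by itself, but enough that somebody is swapping disks every few days, and a healthy on-site spares pile is a given.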
  • The Problem? (Score:3, Informative)

    by squirel_dude ( 810037 ) <squirrel@iraqi-cabbages.tk> on Sunday November 14, 2004 @05:19PM (#10814709) Homepage
    I hate to sound like some pro-totalitarian next generation Big Brother, but it's not as if they are collecting personal information on customers without the customer's consent. Wal-Mart is just doing some major (I agree with obsessive, though) market research so they can optimise their stores to maximise profits, exactly like every other business in the world.
  • by Anonymous Coward on Sunday November 14, 2004 @05:21PM (#10814722)
    What constitutes the internet anyway? I know some dc hubs on the internet that have over 100TB. Sure, it's p2p, but what about archive.org? I know they have at least a few dozen TBs by themselves. That number in the article can't be right at all.
  • by The-Bus ( 138060 ) on Sunday November 14, 2004 @05:22PM (#10814733)
    To put that in perspective, the Internet has less than half as much data, according to experts.'


    How the hell can they estimate that? Assuming "less than half" means about 45%, that gives us about 207 TB. Let's just round that up to 240.148445 TB to make it a nice, even number.

    Google is searching 8,058,044,651 "webpages"* -- who knows what that means. Now, Google certainly isn't searching every single page on the internet. But it also can't be searching pages that don't exist. So the 8bn Google pages certainly aren't all of the internet, but Google isn't double- or triple-counting pages either. Still, at 240.148445 TB (my rough estimate), we come up with a page size of exactly 32KB per page.**

    Is this just counting the text? The code for this page right here (comments.pl) weighs in at about 14KB. Wal-Mart, in no way, has twice as much info as the internet. I would say the "internet" should be measured in at least petabytes. Archive.org itself already has 1PB, and I consider any of that content available to me "on the internet".

    * I'm not even counting the Google cache.
    ** Which means Mr. Gates over-estimated by a factor of 20 when considering how much memory we all needed!
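    The arithmetic above does check out: using binary terabytes, 240.148445 TB spread over Google's page count works out to exactly 32 KiB per page, which is evidently how that "nice, even number" was chosen:

```python
# Reproduce the bytes-per-page estimate: if the whole Internet were
# 240.148445 TB spread over Google's 8,058,044,651 indexed pages...
PAGES = 8_058_044_651
TIB = 2 ** 40                      # binary terabyte

internet_bytes = 240.148445 * TIB
per_page = internet_bytes / PAGES  # ~32,768 bytes, i.e. exactly 32 KiB
print(f"{per_page:.0f} bytes per page")
assert abs(per_page - 32 * 1024) < 1
```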
  • Nope, its location. (Score:5, Informative)

    by mekkab ( 133181 ) on Sunday November 14, 2004 @05:22PM (#10814736) Homepage Journal
    We learned a lot about Walmart and Data mining in my database 101 class. And the professor asks "Why do you think Walmart is so successful?"

    And everyone says something about leveraging technology and JIT delivery, etc.

    Professor Liu [jhu.edu] says "Nope. Location."
    Walmart chose most of their initial locations in cities/regions where there was no other competition. Places where there was no Kmart, no department stores, no malls. And they flourished.
  • by antifoidulus ( 807088 ) on Sunday November 14, 2004 @05:24PM (#10814746) Homepage Journal
    I know this is a joke, but as far as I know, Wal-Mart does not collect individual customer names for most purchases; there is no customer-card thing like there is at a lot of supermarkets. I suppose they could collect data via credit cards, but I doubt that is legal.....
  • It does have more (Score:5, Informative)

    by Jman314 ( 651648 ) on Sunday November 14, 2004 @05:25PM (#10814756)
    The Internet definitely has more data than Wal-Mart. Consider this old 2002 study [berkeley.edu]. The "deep web" alone, composed mostly of databases, comprises 91,850 TB of data. And that was a couple of years ago, and it doesn't include email or P2P either.
    The definition they used for "Internet" was probably "web pages indexed by a search engine," which is definitely not the entire Internet.
  • by MC Negro ( 780194 ) on Sunday November 14, 2004 @05:28PM (#10814784) Journal

    Seems like they'd need to license map-reduce from google or something. (That's a distributed data correlation engine. With extremely high fault tolerance, to boot.)
    I know a guy who worked for Wal-Mart for ~8 years as some sort of data analyst and architect at the main offices in Bentonville. While he didn't go into too much detail, he told me that a lot of the back-end querying is done, surprisingly, with Perl-DBI on Oracle databases. When I asked why his team didn't use something like flat C, C++, or Java, he cited portability as a principal motivation and said that, after a certain point, speed gains were only marginal. He also said that when he left ~1.5 years ago, a small cluster migration to DB2 was being discussed. I have no idea if they license search and query code, but I got the distinct impression that there was a team of software engineers who custom-crafted search algorithms for the data.
  • by Chess_the_cat ( 653159 ) on Sunday November 14, 2004 @05:30PM (#10814793) Homepage
    Uh, except that Google hasn't indexed all of the publicly available WWW. It's only indexed a small fraction of it. And the WWW isn't the Internet. They're different. Secondly, the Internet Archive alone has archived 1 petabyte of data [archive.org] so the figure of 230 terabytes of data on the Internet is obviously wrong.
  • by kimanaw ( 795600 ) on Sunday November 14, 2004 @05:38PM (#10814841)
    "go on vacation for a week or ten.."

    Actually, it's more like go for a long coffee break, then spend the next 10 weeks collecting and analyzing the returned result set. Teradata ain't MySQL, or Oracle. A file scan on the 460 Tbytes distributed across all the CPUs/disks wouldn't take that long. However, if you toss in about 10+ left joins on subqueries with range predicates, then you might be able to take a short vacation...

  • by Flamesplash ( 469287 ) on Sunday November 14, 2004 @05:43PM (#10814874) Homepage Journal
    To put that in perspective, the Internet has less than half as much data, according to experts.

    Someone realized that the DB servers are actually accessible from the internet, and then bam, instant 2x increase in the amount of data on the internet.
  • by mOoZik ( 698544 ) on Sunday November 14, 2004 @05:54PM (#10814934) Homepage
    Also, don't forget that the internet includes Usenet and other services running over the protocol, which hold TONS of additional data. Chances are, the internet is not 230 terabytes large, and the idiot who made that claim...is an idiot.

  • by ikea5 ( 608732 ) on Sunday November 14, 2004 @06:00PM (#10814998)
    'they are adding on the order of 20 terabytes a day'

    Your number is wrong, from their faq:

    The Internet Archive Wayback Machine contains approximately 1 petabyte of data and is currently growing at a rate of 20 terabytes per month.

    That's 20 terabytes per month, not per day.

  • by xant ( 99438 ) on Sunday November 14, 2004 @06:14PM (#10815108) Homepage
    First of all, most Walmarts don't primarily sell food; they primarily sell loads of other stuff. In fact, a lot of what they sell is stuff people might need to survive a hurricane, including various kinds of hardware, containers, lights, and reading material. So a hurricane would naturally drive lots of people into Walmart. Naturally those people will buy food products while they're in there, and the standard Walmart sells mostly junk food. So it's not as if people are seeking out Pop-Tarts in hurricane season, but the massive influx of people buying all kinds of things will also increase the number of people buying non-perishable junk food.

    Consider also that people will not be worrying about their diets when they're primarily worried about not being killed by their own rooftops...

    Combine a bunch of these factors together, and yes, I can easily believe 7x.
  • by mowler2 ( 301294 ) on Sunday November 14, 2004 @06:25PM (#10815193)
    The internet has substantially more data than that. Heck, my ultra-small hobby company alone has around 1 TB on the internet, and privately I share around 0.5 TB over the internet from home. Then add all the other small hobby companies, billions of webpages, colocation servers, communities, p2p seeders, etc., and it quickly passes 230 TB of data, many thousands of times over.
  • Sharing the wealth. (Score:3, Informative)

    by azimir ( 316998 ) on Sunday November 14, 2004 @06:27PM (#10815203) Homepage
    No problem, drop on in! [playboy.com]
  • by Anonymous Coward on Sunday November 14, 2004 @06:28PM (#10815206)
    Then you would guess wrong.

    They do collect names when they can get the info, and they then tie that transaction to your shopping history. It's good enough that if you tend to buy the same things every week, it can even mark some cash transactions as yours.

    This is the company that started buying birth certificate data so they could send out flyers a few years later, when the kids were supposed to start school.

  • by l810c ( 551591 ) * on Sunday November 14, 2004 @06:36PM (#10815259)
    I did a quick cut and paste from Suprnova PC Games into Excel and totalled the values.

    1,511,565 MB, ~1.5 terabytes, in PC games being shared.
    There were 44,977 seeds and 196,735 downloaders. After all the torrents listed are downloaded, there will be 241,712 people with all that data on their hard drives connected to the internet.

    I calculated that total and got 338,394,133 MB, ~338 terabytes.
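    The peer count and unit conversions in that comment can be sanity-checked (figures as posted, with decimal megabytes assumed):

```python
# Sanity-check the Suprnova figures quoted above (as posted).
GAMES_MB = 1_511_565
SEEDS, DOWNLOADERS = 44_977, 196_735
TOTAL_SHARED_MB = 338_394_133

peers = SEEDS + DOWNLOADERS
assert peers == 241_712                      # matches the quoted count

games_tb = GAMES_MB / 1_000_000              # decimal units: 1 TB = 10^6 MB
total_tb = TOTAL_SHARED_MB / 1_000_000
print(f"{games_tb:.1f} TB of games, {total_tb:.0f} TB across {peers:,} peers")
```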

  • Re:Seen it! (Score:1, Informative)

    by Anonymous Coward on Sunday November 14, 2004 @06:40PM (#10815284)
    For those who don't work in accounting firms - 72 weeks is a fiscal year. The additional 2 weeks (to a month) is a grace period because you can't have data for a year if you are still in said period.

    Unless you're Enron, then all bets are off.
  • by Anonymous Coward on Sunday November 14, 2004 @07:35PM (#10815729)
    Actually, they can use your credit card to track data. Fred Meyer does it. Every time I use my mom's debit card I get cat food coupons. Why would they print out cat food coupons when I buy a tomato and bread? Because they track the purchases on my mom's card... and she has 6 cats.
  • by Pleione ( 825378 ) on Monday November 15, 2004 @12:21AM (#10817324)
    The last time I pulled up KLite, I saw at least 37 petabytes being shared.

    Keep in mind, that's only a single p2p network.
  • by inKubus ( 199753 ) on Monday November 15, 2004 @03:23AM (#10818024) Homepage Journal
    Yeah, some of their practices make Microsoft look like Jesus.

    They really are the biggest non-government thing in the world, if not on paper then in terms of land and leases they own, inventory, clout in the marketplace. No one can touch them. And it's still "family" owned and all that cash is getting shipped right to the bible belt.

    The conspiracy people are now saying that the walmart store space will be used as internment camps when the "purges" come. Just do a search for "Walmart Camp" or "Walmart Prison". Good stuff ;)
  • by Breakfast Pants ( 323698 ) on Monday November 15, 2004 @03:30AM (#10818042) Journal
    At least read the drivel you link to. "Walmart singled handedly put Vlassic in bankruptcy by forcing them to sell a gallon of pickles for $2.97 dollars."

    According to the article: "Not long after that, in January 2001, Vlasic filed for bankruptcy--although the gallon jar of pickles, everyone agrees, wasn't a critical factor"(Emphasis added). Nice Troll.
  • by (54)T-Dub ( 642521 ) * <[tpaine] [at] [gmail.com]> on Monday November 15, 2004 @03:44AM (#10818081) Journal
    The largest landowner in the world is actually McDonald's. The corporation owns the land that every single restaurant is built on.
  • by zakezuke ( 229119 ) on Monday November 15, 2004 @04:43AM (#10818224)
    First of all, most Walmarts don't primarily sell food

    Super Wal-Marts sell groceries. You see those in places like Florida. I was in Orlando, and it was frustrating that there was nowhere else to buy groceries near where I was staying. OK, there was a Winn-Dixie just across the parking lot, but its prices were insane and the quality of the produce was not so good. There were other grocery stores and a Costco, but all were about 15 miles away. Trust me, I did my best to stock up on Costco goods, but for staples like milk, bread, and eggs, Wal-Mart was the only practical solution.

    I don't believe regular Wal-Marts sell groceries; I don't honestly know, because I don't shop there. Super Wal-Marts have a very respectable grocery section.

"Experience has proved that some people indeed know everything." -- Russell Baker

Working...