Wal-Mart's Data Obsession
g8oz writes "The New York Times covers Wal-Mart's obsession with collecting sales data.
Fun fact: 'Wal-Mart has 460 terabytes of data stored on Teradata mainframes, at its Bentonville headquarters.
To put that in perspective, the Internet has less than half as much data, according to experts.'
That much information results in some interesting data-mining. Did you know hurricanes increase strawberry Pop Tarts sales 7-fold?"
Re:"Nothing for you to see here. Please move along (Score:4, Informative)
Seen it! (Score:5, Informative)
The gentleman who gave me the tour indicated they have something like 72 weeks (1 year plus 2 weeks) of purchase data on LIVE disk arrays, plus huge archives of the same data on tape. If you buy anything and use your credit, debit, or whatever card, they can figure out your sales history obscenely quickly. Be afraid. Be very afraid.
I also got to see Walmart.com (Sun E15k) and Samsclub.com (A bunch of HP boxes in a smallish frame), they were creepy, in a sense... all those sales going on at once, converging on a spot not a few feet from me.
Re:economies of scale (Score:5, Informative)
With SQL.
Teradata was built to handle processing very large datasets from day 1. 460 Terabytes distributed across a large number of CPUs and disks working in parallel with a robust SQL implementation isn't really the challenge. The hard part is keeping all those disks spinning when you start pushing MTBF limits, handling the thousands of concurrent users all banging away at the data, and the constant streaming of new data into the system in order to support near real-time DSS.
For those inclined to know more, check here. [ncr.com]
Re:economies of scale (Score:4, Informative)
As the article says, they're using Teradata [ncr.com]. This is not a product that I'd expect the average Slashbot, who thinks "IT" and "internet" are synonymous, to have heard of. Nevertheless, if you work with industrial amounts of data, you will know that Teradata databases can reasonably claim to be to Oracle as Oracle is to MySQL.
Re:I would have thought that the Internet had more (Score:5, Informative)
People who call themselves "experts" but are really just talking out of their asses do. Consider that The Internet Archive [wikipedia.org] alone contains more than a petabyte (1024 terabytes) of data, all of it accessible, and that they are adding on the order of 20 terabytes a day, and you start realizing how much bigger the Web is.
Re:Seen it! (Score:3, Informative)
The Problem? (Score:3, Informative)
Re:I would have thought that the Internet had more (Score:1, Informative)
And I really hope it's not on SQL (Score:3, Informative)
How the hell can they estimate that? Assuming "less than half" means about 45%, that gives us about 207 TB. Let's just round that up to 240.148445 TB to make it a nice, even number.
Google is searching 8,058,044,651 "webpages"* -- who knows what that means. Now, Google isn't searching every single page on the internet, certainly. But they also can't be searching pages that don't exist. So the 8bn Google pages certainly aren't all of the internet, but Google isn't double or triple counting pages either. Still, at 240.148445 TB (my rough estimate), we come up with a page size of exactly 32KB per page.**
Is this just counting the text? The code for this page right here (comments.pl) weighs in at about 14KB. Wal-Mart, in no way, has twice as much info as the internet. I would say the "internet" should be measured in at least petabytes. Archive.org itself already has 1PB, and I consider any of that content available to me "on the internet".
* I'm not even counting the Google cache.
** Which means Mr. Gates over-estimated by a factor of 20 when considering how much memory we all needed!
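For anyone who wants to check the arithmetic in this post, here is a rough sketch in Python. It assumes binary units (1 TB = 2**40 bytes, 1 KB = 1024 bytes), which the post does not state, and it reads the Gates footnote as a reference to the "640K" line. Both are assumptions on my part, and the variable names are made up for illustration.

```python
# Sanity check of the page-size estimate in the post above.
# Assumption (not stated in the post): binary units, 1 TB = 2**40 bytes.
TB = 2**40
KB = 2**10

walmart_tb = 460               # figure quoted in the article
internet_tb = 240.148445       # the post's "rounded" estimate
google_pages = 8_058_044_651   # page count the post quotes from Google

# "Less than half" read as 45% of Wal-Mart's total.
half_guess_tb = 0.45 * walmart_tb

# Average bytes per indexed page if the whole estimate were web pages.
bytes_per_page = internet_tb * TB / google_pages
kb_per_page = bytes_per_page / KB

# Assumption: the footnote alludes to 640 KB of memory.
gates_factor = 640 / kb_per_page

print(f"45% of {walmart_tb} TB = {half_guess_tb:.0f} TB")
print(f"average page size = {kb_per_page:.2f} KB")
print(f"640 KB / page size = {gates_factor:.1f}")
```

Under those assumptions the per-page figure lands on 32 KB, which is why the oddly specific 240.148445 TB was chosen, and 640 KB is 20 times that.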
Nope, it's location. (Score:5, Informative)
And everyone says something about leveraging technology and JIT delivery, etc.
Professor Liu [jhu.edu] says "Nope. Location."
Walmart chose most of their initial locations in cities/regions where there was no other competition. Places where there was no Kmart, no department stores, no malls. And they flourished.
Re:economies of scale (Score:5, Informative)
It does have more (Score:5, Informative)
The definition they used for "Internet" was probably "web pages indexed with a search engine", which is definitely not the entire Internet.
Re:economies of scale (Score:5, Informative)
I know a guy who worked for Wal-Mart for ~8 years as some sort of data analyst and architect at the main offices in Bentonville. While he didn't go into too much detail, he told me that a lot of the back-end querying is done, surprisingly, with Perl-DBI on Oracle databases. When I asked why his team didn't use something like flat C, C++ or Java, portability was cited as a principal motivation and that, after a certain point, speed gains were only marginal. He also said when he left ~1.5 years ago, that a small cluster migration to DB2 was being talked about. I have no idea if they license search and query code, but I got the distinct impression that there was a team of software engineers who custom crafted search algorithms for the data.
Re:I would have thought that the Internet had more (Score:5, Informative)
Re:economies of scale (Score:2, Informative)
Actually, it's more like go for a long coffee break, then spend the next 10 weeks collecting and analyzing the returned result set. Teradata ain't MySQL, or Oracle. A file scan on the 460 Tbytes distributed across all the CPUs/disks wouldn't take that long. However, if you toss in about 10+ left joins on subqueries with range predicates, then you might be able to take a short vacation...
half as much data until... (Score:3, Informative)
someone realized that the DB servers are actually accessible from the internet and then bam, instant 2x increase in the amount of data on the internet.
Re:I would have thought that the Internet had more (Score:5, Informative)
Re:I would have thought that the Internet had more (Score:5, Informative)
Your number is wrong, from their faq:
The Internet Archive Wayback Machine contains approximately 1 petabyte of data and is currently growing at a rate of 20 terabytes per month.
That's 20 terabytes per month, not per day.
Not that high, consider other contributing factors (Score:4, Informative)
Consider also that people will not be worrying about their diets when they're primarily worried about not being killed by their own rooftops...
Combine a bunch of these factors together, and yes, I can easily believe 7x.
230 terabyte data on the internet? hah. (Score:2, Informative)
Sharing the wealth. (Score:3, Informative)
Re:economies of scale (Score:1, Informative)
They do collect names when they can get the info, and they then tie that transaction to your shopping history. It's good enough that if you tend to buy the same things every week, it can even mark some cash transactions as yours.
This is the company that started buying birth certificate data so they could send out flyers a few years later when the kids were supposed to start school.
Re:So, if Walmart put up a web interface... (Score:5, Informative)
1511565 MB, ~1.5 terabytes in PC games being shared.
There were 44977 Seeds and 196735 Downloaders. After all those torrents listed are downloaded, there will be 241712 peers with all that data on their hard drives connected to the internet.
I calculated that total and got 338394133 MB, ~338 terabytes.
Re:Seen it! (Score:1, Informative)
Unless you're Enron, then all bets are off.
Re:economies of scale (Score:2, Informative)
Re:So, if Walmart put up a web interface... (Score:2, Informative)
Keep in mind, that's only a single p2p network.
Re:Walmart does drop your income (Score:3, Informative)
They really are the biggest non-government thing in the world, if not on paper then in terms of land and leases they own, inventory, clout in the marketplace. No one can touch them. And it's still "family" owned and all that cash is getting shipped right to the bible belt.
The conspiracy people are now saying that the Walmart store space will be used as internment camps when the "purges" come. Just do a search for "Walmart Camp" or "Walmart Prison". Good stuff.
Re:Walmart does drop your income (Score:4, Informative)
According to the article: "Not long after that, in January 2001, Vlasic filed for bankruptcy--although the gallon jar of pickles, everyone agrees, wasn't a critical factor" (emphasis added). Nice Troll.
Re:Walmart does drop your income (Score:3, Informative)
Re:Not that high, consider other contributing fact (Score:3, Informative)
Super Wal-Marts sell groceries. You see those in places like Florida. I was in Orlando, and it was frustrating that there was nowhere else to buy groceries near where I was. OK, there was a Winn-Dixie just across the parking lot, but its prices were insane and the quality of the produce was not so good. There were other grocery stores and a Costco, but all were about 15 miles away. Trust me, I did my best to stock up with Costco goods, but for staples like milk, bread, and eggs, Wal-Mart was the only practical solution.
I don't believe regular Wal-Marts sell groceries. I don't honestly know because I don't shop there. Super Wal-Marts have a very respectable grocery.