Apache Hadoop Has Failed Us, Tech Experts Say (datanami.com)
It was the first widely-adopted open source distributed computing platform. But some geeks running it are telling Datanami that Hadoop "is great if you're a data scientist who knows how to code in MapReduce or Pig...but as you go higher up the stack, the abstraction layers have mostly failed to deliver on the promise of enabling business analysts to get at the data." Slashdot reader atcclears shares their report:
"I can't find a happy Hadoop customer. It's sort of as simple as that," says Bob Muglia, CEO of Snowflake Computing, which develops and runs a cloud-based relational data warehouse offering. "It's very clear to me, technologically, that it's not the technology base the world will be built on going forward"... [T]hanks to better mousetraps like S3 (for storage) and Spark (for processing), Hadoop will be relegated to niche and legacy statuses going forward, Muglia says. "The number of customers who have actually successfully tamed Hadoop is probably less than 20 and it might be less than 10..."
One of the companies that supposedly tamed Hadoop is Facebook...but according to Bobby Johnson, who helped run Facebook's Hadoop cluster before co-founding behavioral analytics company Interana, the fact that Hadoop is still around is a "historical glitch. That may be a little strong," Johnson says. "But there's a bunch of things that people have been trying to do with it for a long time that it's just not well suited for." Hadoop's strengths lie in serving as a cheap storage repository and for processing ETL batch workloads, Johnson says. But it's ill-suited for running interactive, user-facing applications... "After years of banging our heads against it at Facebook, it was never great at it," he says. "It's really hard to dig into and actually get real answers from... You really have to understand how this thing works to get what you want."
Johnson recommends Apache Kafka instead for big data applications, arguing "there's a pipe of data and anything that wants to do something useful with it can tap into that thing. That feels like a better unifying principle..." And the creator of Kafka -- who ran Hadoop clusters at LinkedIn -- calls Hadoop "just a very complicated stack to build on."
MapReduce is great (Score:4, Insightful)
If 1) you have a staff of elite programmers like Google or Facebook, who have CS degrees from top universities and are accustomed to picking up new programming languages and tools on a continuing basis; AND
2) your business has a pressing need to crunch terabytes of logs or document data with no fixed schema and continually changing business needs.
For the average Fortune 500 (or even IT) shop, not so much. A '90s style data warehouse accessible through SQL queries works much better.
Re: (Score:2)
I have to say that I am less impressed with the quality of the coders at Google the more I know about them. The really good ones are leaving, thinking about leaving, or have already left a while ago. What is left are the mediocre ones who somehow managed to get in.
Re:MapReduce is great (Score:4, Funny)
You've done an incredible amount of work to reach this conclusion. Congrats. Did you use map-reduce on your data set?
Re: (Score:1)
If you make people jump through hoops like circus animals to come work at your company you only get the desperate, or the ones who want the job as a status symbol.
Re: (Score:2)
If you make people jump through hoops like circus animals to come work at your company you only get the desperate, or the ones who want the job as a status symbol.
Or the ones who like being made to jump through hoops like a circus animal. I guess if you are into that it's okay; who am I to judge?
Re: MapReduce is great (Score:1)
I went through the process a few years ago for an SRE position. It was exactly the same process used at most other tech companies: a couple of screening interviews over the phone, and half to two-thirds of a day of on-site, one-on-one, specific tech interviews with people who *seriously* know their stuff.
The campus is overall a weird cult, and they don't have other offices in places I want to live (maybe Pittsburgh, someday), so I don't work there. But they haven't done the really weird interviews that they used to be
Re:MapReduce is great (Score:5, Interesting)
Indeed. I went through their "interview-process" a while back at the request of a friend that was there and desperately wanted me for his team. Interestingly, I failed to get hired, and I think it is because I knew a lot more about the questions they asked than the people that created (and asked) these questions. For example, on (non-cryptographic) hash-functions my answer was to not do them yourself, because they would always be pretty bad, and to instead use the ones by Bob Jenkins, or if things are slow because there is a disk-access in there to use a crypto hash. While that is what you do in reality if you have more than small tables, that was apparently very much not what they wanted to hear. They apparently wanted me to start to mess around with the usual things you find in algorithm books. Turns out, I did that way back, but when I put 100 million IP addresses into such a table, it performed abysmally. My take-away is that Google prefers to hire highly intelligent, but semi-smart people with semi-knowledge about things and little experience, and that experienced and smart people fail their interviews unless they prepare to give dumber answers than they can give. I will never do that.
On the plus side, my current job is way more interesting than anything Google would have offered me.
Re: (Score:2)
I have heard plenty of stories like this.
And I have to say, while the questions google is asking in an interview are relevant for their business, they are rather simple.
I guess I would fail an interview, too.
On the other hand, I work freelance, so big companies are rarely interesting.
Re: (Score:2)
The problem those companies face is that they grew so fast that they're struggling with past technical decisions that are difficult to revert (e.g. Twitter and their initial RoR architecture). The wheels keep turning so they end up having to build sophisticated layers on top of their legacy garbage.
We've all been there. Someone (maybe even you) builds a throwaway Excel macro or Wordpress-driven monstrosity just to address a temporary need that is not worth spending more than 2h on, and first thing you know
Re: (Score:2)
Just because your mediocre company forces its employees to do it, doesn't make it the correct decision
In my experience, the bigger the organization gets, the more important it is to think in terms of "right" practice, not "best" practice. The correct decision is the one that makes the business successful consistently; and unless you have the psychic ability to see the future, slowing down the business to do things by the book is typically a bad idea, especially if the company is experiencing huge growth.
Re: (Score:2)
Re:MapReduce is great (Score:5, Interesting)
For example, on (non-cryptographic) hash-functions my answer was to not do them yourself, because they would always be pretty bad, and to instead use the ones by Bob Jenkins, or if things are slow because there is a disk-access in there to use a crypto hash. While that is what you do in reality if you have more than small tables, that was apparently very much not what they wanted to hear. They apparently wanted me to start to mess around with the usual things you find in algorithm books.
No offense, but "I'd rather just use a library" seriously brings into question what you bring to the table and whether you'll just be searching experts-exchange for smart stuff other people have done. Like everybody knows you shouldn't use homegrown cryptographic algorithms, but if a cryptologist can't tell me what an S-box is and points me to using a library instead, it doesn't really tell me anything about his skill, except that he didn't want to answer the question. In fact, dodging the question like that would be a pretty big red flag.
Don't get me wrong, you can get there. But start off with roughly what you'd do if you had to implement it from scratch, what's difficult to get right, then suggest implementations you know or alternative ways to solve it. Because they're not that stupid that they think this is some novel issue nobody's ever looked at before or found decent answers to. They want to test if you have the intellect, knowledge and creativity to sketch a solution yourself. Once you've done that, then you can tell them why it's probably not a good idea to reinvent the wheel.
Re:MapReduce is great (Score:4, Interesting)
No offense, but you miss the point entirely. What I answered is very far from "use a library". First, it is an algorithm, not a library. That difference is very important. Second, it is a carefully selected algorithm that performs much better than what you commonly find in "libraries" in almost all situations. And third, the hash-functions by Bob Jenkins (and the newer ones by DJB, for example) are inspired by crypto, but much faster in exchange for reduced security assurances. In fact so fast that they can compete directly with the far worse things commonly in use. "Do not roll your own crypto" _does_ apply, though.
So while I think you meant to be patronizing, you just come across as incompetent. A bit like the folks at Google, come to think of it...
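A minimal sketch of the distinction being drawn here, using Python standard-library functions as stand-ins (zlib.crc32 for a fast non-cryptographic hash, hashlib.sha256 for the crypto option). These are not the Jenkins/DJB functions themselves, and the bucket count and key are illustrative:

    # Sketch only: bucketing keys into a fixed number of hash-table slots.
    # zlib.crc32 stands in for a fast non-cryptographic hash and
    # hashlib.sha256 for the slower, collision-resistant cryptographic one.
    import hashlib
    import zlib

    NUM_BUCKETS = 1024

    def fast_bucket(key: bytes) -> int:
        # Cheap hash: fine when a collision only costs you a longer probe chain.
        return zlib.crc32(key) % NUM_BUCKETS

    def crypto_bucket(key: bytes) -> int:
        # Cryptographic hash: worth the extra cost when each miss is expensive
        # (e.g. a disk access) or the keys may be adversarial.
        digest = hashlib.sha256(key).digest()
        return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

    if __name__ == "__main__":
        ip = b"203.0.113.42"
        print(fast_bucket(ip), crypto_bucket(ip))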
Re: (Score:1)
Or to put it another way: There are better ways to determine someone's understanding of wheels than asking them to make one.
If I were interviewing a candidate and wanted, for some reason, some sense of that person's understanding of hash functions, I'd hope for more than "just use a library", but I also wouldn't be looking for an Introduction to Algorithms exposition on them. gweihir's original post comes pretty close to the sweet spot: it shows an understanding of the problem domain, some sense of approache
Re: (Score:2)
No offense, but "I'd rather just use a library" seriously brings into question what you bring to the table.
Except that's the right answer. It's arrogant pricks who think that they're hot shit who reinvent the wheel, do it badly and then charge headlong into their next coding disaster, energy drink in hand and earbuds in ears. Meanwhile, a more responsible engineer has to come along afterwards and clean up the hot mess so that the users can actually have a working system that isn't chock full of silly bugs.
Oh yes. Of course the answer is not to use "any library", but to carefully select a good algorithm and then use a library for that. I cannot count the times some "Rockstar"-wannabe has reinvented the wheel and did it really, really badly because they were not even aware of the basics.
They want to test if you have the intellect, knowledge and creativity to sketch a solution yourself.
The best way to determine that is to ask an abstract hypothetical question, where there is no existing implementation and no risk of getting it wrong. Bringing in real world concerns that you want the candidate to ignore because "it's an interview question" is stupid because it clouds the issue and prevents the type of answer that you're looking for. Maybe the candidate is an honest guy and prefers to give you the "don't write your own encryption algorithms" answer because in reality that is the right answer. Then you pass up an otherwise excellent candidate because your interview question was poor. Is that really what you want?
While I know that this is not what Google wanted, it is what they did. And on the hash-question, I do know that I do not have what it takes to come up with a good solution (you need to be a cryptographer for that these days and
Re: (Score:2)
No it isn't. If I'm going to hire someone to link in a library, give me somebody who has some clue what the library is doing. The initial results will be better, and if there's something wrong with the chosen black box, we'll have a chance of figuring it out.
Re: (Score:2)
Interesting comments on this thread, thanks. I've learned a lot.
fwiw, I have a network engineering background and Hadoop always seemed like a clusterfsk to me...good to learn the actual story isn't far from my impressions.
Re: (Score:2)
If you make people jump through hoops like circus animals to come work at your company
They jump through hadoops, not hoops. That's how they show they're qualified to work with it.
Re: (Score:2)
Are hadoops hoops with ads in them?
Re: MapReduce is great (Score:5, Interesting)
That's because the mediocre programmers are the ones giving the interviews. A close friend interviewed last year only to sit in front of a bunch of know-it-all elitists. One douche rambled on about how he wishes there were monads in C++ and how great functional design is. Now my friend and his roommate are CS geeks and their spare time is spent doing shit like building a lisp interpreter in C++ just for fun. So he asked mr monad if the project used a functional approach, which was a solid no. Idiot just wanted to show off the fact he knew what functional programming is and wasted time. He passed on the Google job for a big local company doing back end dev work. The job pays as well as Google, without the pompous know-nothings, and with the ability to work remotely. Fuck working for Google.
Re: (Score:2)
Quite possibly these people are vastly overestimating their own skills because they "work at Google". Fortunately, I did not run into socially inept interviewers, but as to the questions asked, they did not have more than surface knowledge. That is not how you interview somebody with advanced skills and experience, because people on that level rarely run into things they have not seen before in some form and that they need to solve on an elementary level. I think this happened to me once in the last 5 years
Re: (Score:2)
Actually, A-players tend to hire A-players and B-players tend to hire C-players.
Re: (Score:3)
1) you have a staff of elite programmers like Google or Facebook, who have CS degrees from top universities and are accustomed to picking up new programming languages and tools on a continuing basis;
I disagree.
MapReduce is actually great for teaching people about parallel processing! I have been able to teach a distributed computing course to non-CS (primarily data science) MS students because it achieves parallelization without most of the complexities associated with distributed query processing. With Hadoop streaming, all you need is basic knowledge of python (or similar) to write your own custom jobs, even without Hive/Pig/etc.
That to me is one of the greatest accomplishments of MapReduce. Bringi
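As an illustration of the parent's point about Hadoop Streaming: a minimal word-count job in Python 3. The task and file names are illustrative, not from the comment above; the mapper and reducer are ordinary scripts that read stdin and write stdout.

mapper.py:

    #!/usr/bin/env python3
    # mapper.py -- emits "word<TAB>1" for every word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

reducer.py:

    #!/usr/bin/env python3
    # reducer.py -- sums the counts per word. Hadoop Streaming delivers the
    # mapper output sorted by key, so equal words arrive contiguously.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

The pair gets submitted with the hadoop-streaming jar against HDFS input and output directories (the exact jar path and options vary by distribution); locally you can dry-run the same logic with cat input.txt | ./mapper.py | sort | ./reducer.py.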
Re: (Score:2)
> MapReduce is actually great for teaching people about parallel processing! I
And about how _not_ to do it. The underlying expense and architecture mistake "scalability" for actual throughput in processing. It's proven extremely unstable in tasks larger than a small proof of concept, and in any task I've encountered in which the actual data has to be successfully processed and verified within a specified deadline.
Re: (Score:3)
The underlying expense and architecture mistakes "scalability" for actual throughput in processing. It's proven extremely unstable in tasks larger than a small proof of concept
Can you elaborate on some reasons?
I was part of a research paper some time ago, and MapReduce does have the advantage of being able to resume (rather than restart) queries on failure, and of handling ad-hoc queries better (compared to an RDBMS).
Re: (Score:2)
> Can you elaborate on some reasons?
It has suffered from a problem common to various object oriented projects: because it refuses to acknowledge the existence of lower level structures, such as the very real storage hardware and real network connections necessary to propagate the data among the nodes for effective access, it didn't scale. Backup of results from well-delineated processing steps, which is critical for debugging or re-running new versions of particular processing steps, wound up
Re: MapReduce is great (Score:2)
Re:MapReduce is great (Score:5, Insightful)
As a reminder, SQL is a query language and not a hardware technology. It doesn't dictate HOW to store data (assuming it meets certain minimum standards). You probably are referring to typical RDBMS.
Re: (Score:2)
Re: (Score:2)
I work with (multi-terabyte, not multi-petabyte) GIS databases. I am also a Haskell programmer (though not for my day job) so MapReduce doesn't scare me off at all. It's very hard to see how MapReduce specifically would help large-scale GIS.
The main benefit of MapReduce for most problems isn't the programming model, it's the principle "move your code to where the data is" in a way that's agnostic to precisely where the data is. When you have big data, you need to do that. Precisely what that code does is a
Re: (Score:2)
Removable media (e.g. tape and WORM optical disk) libraries were typical for petabyte+ storage arrays back in the late 90s. I remember the Subaru telescope facility in Hawaii had a petabyte storage facility which was primarily an automated tape library (plus a large section of wall occupied by a physically massive ~40 GB RAM array) when I very briefly interned* there in the late 90s.
That was large, but not uniquely or ridiculously large. My WAG is that, globally, there were probably on the order of 1k inst
Re: (Score:2)
The new Samsung 16TB SSDs will be substantial game changers in... oh, five years. They're shipping now, but if the price drops to a grand or two per SSD, it'll be really interesting for bulk storag
Something less dismissive? (Score:1)
Re: (Score:3)
The Hadoop defenders will no doubt counter with, "but Hadoop wasn't designed to be an RDBMS!", to which I say it doesn't matter. That's what people were trying to make Hadoop into, because that's what businesses thought they needed: a drop-in replacement for SQL and RDBMS that addressed their scalability problems. In the meantime SQL and RDBMS developers have answered the challenge and continued improving their tools, addressing many of the shortcomings that Hadoop was supposed to resolve while Hadoop was still over-promising and under-delivering. The old quip is still true: "SQL is dead. Long live SQL."
That's bullshit and obviously you're a DBA defending his turf. A Hadoop cluster will scale beyond anything a RDBMS can handle, and if the only tool in your toolbox is SQL you can use products like Hive or Hawq that will process your queries through a specialized JDBC driver and run them across as many nodes as your budget can afford.
For instance you could have petabytes of data in CSV format stored on your HDFS cluster, and you could create a relational model on top of them without rewriting a single byte,
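To make that concrete, here is a rough sketch of the idea in PySpark rather than Hive or Hawq (the comment above is about those; Spark SQL just happens to be compact to show, and the HDFS path and column names are invented): CSV files already sitting in HDFS get exposed as a SQL-queryable table without rewriting a byte.

    # Sketch: query CSV files in place on HDFS through SQL.
    # Path and column names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-as-table").getOrCreate()

    events = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("hdfs:///raw/events/*.csv")
    )
    events.createOrReplaceTempView("events")

    spark.sql("SELECT country, COUNT(*) AS n FROM events GROUP BY country").show()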
Re: (Score:2)
Nothing against Hadoop. Every problem has a proper solution provided by a proper tool.
But petabytes isn't exactly reaching the limits of Oracle or Postgresql. You start having to tune these guys & properly set up the hardware once they get near terabytes, but I think even a vanilla Postgresql will do 1-2 petabytes.
Now crossing 10 Petabytes... I think it makes more sense to use Teradata. Its decades old and I don't think anything really comes close to it in today's world. Even at 1+ Petabytes, I feel
1PB meh (Score:3)
I think even a vanilla Postgresql will do 1-2 Petabytes.
The maximum column size for Postgres is 1GB. The maximum table size is 32TB. So let's say you have a 1PB data set, that means you need to shard your data in at least 25 tables of 250 columns.
Let's say you want to run a query vertically; you'll need to join those 25 tables, start the query and go on vacation for a month. That's how 1PB works on Postgres.
And don't you even dare do some leaf-level manipulations on that volume of data, like a lateral join - unless you enjoy a faint smell of burnt plastic in you
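For what it's worth, the back-of-envelope math with the limit cited above (32TB per table) comes out to 32 shards at minimum rather than 25. A trivial Python check, purely illustrative:

    # Purely illustrative arithmetic using the per-table limit cited above.
    PB = 1000 ** 5                      # 1 petabyte, decimal, in bytes
    MAX_TABLE_BYTES = 32 * 1000 ** 4    # 32 TB per-table limit

    dataset_bytes = 1 * PB
    min_tables = -(-dataset_bytes // MAX_TABLE_BYTES)   # ceiling division
    print(min_tables)                                    # 32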
Re: (Score:2)
>in my opinion the vast majority of use cases warrant for a traditional RDMBS
For my store PoS I'm working on dropping the Postgres backend and holding all the tables in memory. RAM grew faster than my tables did.
Re: 1PB meh (Score:1)
Re: (Score:2)
Please tell me you are kidding, because if you are not you need to step away from the keyboard, and STAY away from the keyboard.
No, not kidding. Now please explain how you know enough about our system to even be able to know if it's a good idea or not?
The draw to tell people on the internet that they're doing it wrong seems to be very, very strong around here, even when armed with only a couple of paragraphs of information.
Re: (Score:2)
>So.. going to write your own reporting solution as well?
I already have. The reporting code doesn't change to accommodate this. The interface to the data model doesn't change. You have heard of data abstraction before, haven't you? If you mess with stored procedures, you're still tied to the nipple of a DB vendor's tit. If the business logic code accesses the DB through your own procedures written as in-application code, then it's easy to adjust to a different storage model.
>Or does management not need
Re: (Score:2)
Just for the sake of discussion, if I was to design a POS today I think I'd consider the new in-memory engine in MongoDB. It's pretty cool; it writes nothing to disk (ever) but it can be part of a cluster where some other members use the normal engine. Each cluster supports up to 50 members, and the client can specify a preferred read node. So I would leave the write master in the backend and all the POS would have the inventory pretty much in real time on their local read node.
Or since they bring up Kafka
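A rough pymongo sketch of that read-preference idea (host names, replica-set name, and the collection are made up; the in-memory storage engine itself is a server-side mongod setting, not something the client selects):

    # Sketch: each POS reads from its nearest replica-set member; writes still
    # go to the primary in the backend. All names below are placeholders.
    from pymongo import MongoClient, ReadPreference

    client = MongoClient(
        "mongodb://store-a:27017,store-b:27017,backend:27017/?replicaSet=pos"
    )

    inventory = client.shop.get_collection(
        "inventory",
        read_preference=ReadPreference.NEAREST,  # prefer the local read node
    )

    inventory.update_one({"sku": "ABC-123"}, {"$inc": {"qty": -1}})
    print(inventory.find_one({"sku": "ABC-123"}))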
Re: (Score:2)
> So I would leave the write master in the backend and all the POS would have the inventory pretty much in real time on their local read node.
That's pretty much it. So the front end is instant response for the user. Events are timestamped. The back end recombines the data in order when they are attached (normally they are attached all the time) and the inventories are kept in sync. The wrinkle is that the front end stores running state to local disk as pickled data so it can run solo (detached from the b
Re: (Score:2)
s/column size/field size
Not to sound pedantic on the terms you used, but want to make sure we don't confuse general readers.
Normalizing:
Initially, you should be Normalizing your data population. This is splitting it up into various tables. 25 isn't a lot of tables. I have seen DBs under 1TB with over 100 tables. How and what level you normalize is based on the type of data you have, their relationships, and most importantly, how you intend to utilize and extract the data to generate various kinds of inf
Re: (Score:2)
the true kings of the big data world are DB2 and TeraData.
You had me until you mentioned DB2. I've never heard of a PB-level DB2 instance, I don't even think it's possible. Last time I checked a table couldn't go over 2TB and even BLOBs can't be bigger than 2GB.
Re: 1PB meh (Score:2)
I remember back in 2002 reading about a 2PB DB2 at some research university. My google-fu isn't good enough to hunt it down.
But I hope the below provides some insight to where DB2 is at today. 500,000PB. I need to do more research because I am finding it hard to believe.
http://it.toolbox.com/blogs/db... [toolbox.com]
Anyway DB2 has always been more hardware limited than software. Every atom in DB2 can be plumped up in bits till it hits the hardware limits; multiplying its overall capacity. But too many bits and you are
Re: (Score:2)
Dang.
https://www.ibm.com/developerw... [ibm.com]
Of course as with many IBM products, the miraculous setups are always in IBM labs.
Re: (Score:2)
>For instance you could have petabytes of data in CSV format stored on your HDFS cluster
And somewhere in a tiny sub-corner of those petabytes, someone generated the CSV with Excel and the quoting is all messed up.
Re: (Score:2)
>For instance you could have petabytes of data in CSV format stored on your HDFS cluster
And somewhere in a tiny sub-corner of those petabytes, someone generated the CSV with Excel and the quoting is all messed up.
Almost all the tools default to tab-delimited (Pig, cut, etc) but yes there's usually an Excel saboteur or two in every organization.
Re: (Score:2)
If Hadoop is as amazing as you say it is then why aren't more companies enjoying success with it?
Can you provide numbers to back your statement that not many companies are "enjoying success" with it? Or are you content to repeat the same bullshit over and over?
A few interesting facts.
-Cloudera, Hortonworks, MapR, Pivotal are all in Gartner Magic Quadrant for Data Warehouse and Database Management Solutions for Analytics
-Most of the big BI products (MicroStrategy, etc) offer connectors to AWS EMR, HDInsight and various other Hadoop offerings. Do you know why? Because people use them.
-Hortonworks and Clo
Do not blame the tool(s), blame the workman... (Score:1)
"It's very clear to me, technologically, that it's not the technology base the world will be built on going forward"... [T]hanks to better mousetraps like S3 (for storage) and Spark (for processing), Hadoop will be relegated to niche and legacy statuses going forward, Muglia says.
My 4th grade English teacher used to say, "A bad workman blames his tools."
Sounds relevant to me here.
Re: Do not blame the tool(s), blame the workman... (Score:2)
Hadoop is not tools, it is one particular tool. Some tools are just bad -- I give you the magnetic stud finder as an example.
Re: (Score:2)
Tools is as tools does
Re:Do not blame the tool(s), blame the workman... (Score:5, Insightful)
My 4th grade English teacher used to say, "A bad workman blames his tools."
Sounds relevant to me here.
Apparently your 4th grade English teacher has never tried to use a hammer covered in spikes that arrived in a box labeled "Screwdriver".
Re: (Score:1)
Sounds like a Catholic school punishment tool.
Re: (Score:1)
Home Depot saw your order for a meat tenderizer [amazon.ca] and did their best to help [wmctv.com]...
Re: (Score:2)
My 4th grade English teacher used to say, "A bad workman blames his tools."
Sounds relevant to me here.
Apparently your 4th grade English teacher has never tried to use a hammer covered in spikes that arrived in a box labeled "Screwdriver".
You found Windows! Don't forget that the handle's splinters each carry a different painful virus
Re: (Score:1)
My 4th grade English teacher used to say, "A bad workman blames his tools."
Did your English teacher also explain the concept of the cliché?
This particularly tiresome one, of dubious provenance (wikiquote cites numerous variations from a host of sources), is surely mentioned at least a few times in the comments for any thread about deficiencies in a product. It seems terribly unlikely that anyone is reading it here for the first time.
It's a splendid example of sophomoric thinking. Yes, poor workers often blame tools. So do good ones, with reason. It's as uncompelling a maxim
Re: (Score:2)
It has not (Score:4, Insightful)
What has happened instead is that quite a few "tech experts" did not understand what it actually was and had completely unrealistic expectations. Map-reduce is nice when you a) have computing power coming out of your ears and b) have very specific computing tasks. That means that in almost all cases, this technology is a bad choice and that was rather obvious to any actual expert right from the start.
Re: (Score:2)
What has happened instead is that quite a few "tech experts" did not understand what it actually was and had completely unrealistic expectations. Map-reduce is nice when you a) have computing power coming out of your ears and b) have very specific computing tasks.
Spot on. Hadoop is meant to run on a shitload of commodity computers, which is something most organizations don't have - if you can afford a shitload of commodity computers your sysadmins will probably choose to buy high-end SAN and top notch blade servers, and virtualize everything.
You can see it immediately when you install a packaged version like Hortonworks; the wizard will put data on all your volumes because it assumes you're running on a bunch of low-end servers with shitty RAID or even JBOD - but if
Re: (Score:2)
Precisely. Hadoop was marketed as a big data panacea and everyone tried to apply it to everything only to discover that it really wasn't a panacea and really wasn't a good solution to the problems they were throwing at it. In addition, it's not particularly easy to use and you can spend a considerable amount of time just in configuring, tweaking, and maintaining the system.
Hadoop, like any other tool, has its uses. But like any other tool, if you try to apply it outside of what it was really intended to be
Re: (Score:2)
Would not surprise me at all.
Illiterate cackwads (Score:2)
They're choosing someone to lead the merger of some high schools?
Fucking hell, unless you chew your tongue when you talk they don't even sound the same.
Re: (Score:2)
they don't even sound the same.
In Americanish they do.
Re: (Score:1)
Re: (Score:2)
your rite, the author of this artical needs to be kicked in the testical
Re: (Score:2)
A little clueless.... (Score:5, Informative)
Did nobody explain to the original poster that Spark in serious deployments is built on top of Hadoop? Or that Kafka uses the Hadoop (YARN) scheduler and is generally used to sink data to HDFS files, also built on top of Hadoop? This is kind of like someone saying that TCP/IP is no longer relevant because we now have DNS....
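For the record, a typical "serious deployment" shape in PySpark, submitted to YARN and reading/writing HDFS. The paths, app name, and aggregation are placeholders, not anything from the post:

    # Sketch: a Spark job that lives on a Hadoop cluster -- scheduled by YARN
    # (usually set via spark-submit --master yarn) and backed by HDFS storage.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-roundtrip").getOrCreate()

    df = spark.read.parquet("hdfs:///data/events/")          # stored in HDFS
    daily = df.groupBy("event_date").count()
    daily.write.mode("overwrite").parquet("hdfs:///data/daily_counts/")

    spark.stop()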
Re: (Score:2)
Was about to say the same, hehe ...
Just say Pachyderm (Score:4, Informative)
Re: (Score:1)
Apache Spark (Score:2)
Re: (Score:2)
As far as I know, most people are using Apache Spark for new projects.
Spark is a framework that includes ETL, in-memory computing and a machine learning library - a typical case of wheel reinventing.
Those "most" people you mention probably only use the machine learning part, and on a fairly small data set. In theory, Spark RDDs can scale to "petabytes" (says them) but I've never seen it work on even TB-level volumes of data, while Hadoop scales to unlimited volumes (Yahoo used to run a 40,000-node cluster).
Spark is awesome but it's not a replacement for Hadoop for distributed
Re: (Score:2)
Re: (Score:2)
If you actually would work with Spark, you would know it is based on Hadoop, just saying.
Re: (Score:2)
If you actually would work with Spark, you would know it is based on Hadoop, just saying.
Even a retard with a low-speed internet access can look this up on Wikipedia and prove you wrong. Are you trolling or just stupid?
Re: (Score:2)
Are we talking about the same : http://spark.apache.org/ [apache.org] ??
Why so angry?
Re: (Score:2)
Yes. Spark can optionally run on Hadoop, which is not the same thing as being based on Hadoop. So before implying that other people would "know" something if they had worked with Spark, make sure that the thing in question is true.
Re: (Score:2)
It is the other way around.
Spark runs by default on Hadoop; it was designed on top of Hadoop.
Perhaps it can run on other things, too. I never saw one doing it, though.
What, for example, would be an example of such an "other file system"?
There there (Score:2)
You're a stupid motherfucker. You have nothing useful to say. You contribute nothing useful to this site or to society [...] (etc)
I was unable to read the rest of your comment because I have a policy of stopping when it becomes obvious that the other person is just throwing a tantrum.
If you disagree with the fact that Wikipedia clearly indicates that Spark is NOT based on Hadoop, support your claim with a link or citation. Otherwise there is no need to get your panties in a bunch, you clearly don't have enough trolling skills to make even a drunk Mike Tyson circa 1997 angry.
Over-integrated software sucks. (Score:2)
When your software integration prevents your software from being used in conjunction with a variety of other platforms, you drastically reduce the number of users and in turn the number of developers that will work on it. As you integrate software more and more, you exponentially decrease the number of developers interested in making tools to make operation of your software easier. I'm not saying that making a system that works with everything will attract more developers but I am saying that making an ov
Idiotic babble (Score:5, Insightful)
People who bash Hadoop without understanding at a very minimum the moving parts have obviously no experience with it.
Hadoop is not one thing. It's three:
1) a distributed filesystem (HDFS)
2) a job scheduler (Yarn)
3) a distributed computing algorithm (MapReduce)
Many tools like Hbase or Accumulo *need* HDFS. That's a core component and there's no equivalent in Spark. Anyone saying HDFS is obsolete is a clueless idiot.
Anyways the Spark vs Hadoop narrative is bullshit. A serious Spark setup usually runs on top of a Hadoop cluster, and often you can't get away entirely from MapReduce (or its actual successor, Tez) because Spark runs in-memory and doesn't scale as much; for some workloads you need the read-crunch-save aspect of MapReduce because there's just too much data, and MapReduce is also more resilient as you don't lose as much when a node crashes during a job. Spark is more advanced and has actual analytics capabilities thanks to a powerful ML library (while Hadoop is just distributed computing), but it's not a case of either/or.
For instance a common approach is to use Hadoop jobs to trim down your data (via Pig or other blunt tool) to a point where you can run machine learning algorithms on Spark.
As for Kafka, it's just a fucking message queue. It's fast and very powerful, but comparing it to Hadoop is like saying you should use Linux instead of MySQL.
Whoever considers buying services from those Snowflake morons, run away.
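To illustrate the Kafka point above (a pipe of data that anything can tap into), a small sketch with the kafka-python client, which is one client library among several; the broker address, topic, and payload are made up:

    # Sketch: one producer writes to the "pipe"; any number of independent
    # consumer groups can tap the same topic. All names are placeholders.
    from kafka import KafkaConsumer, KafkaProducer

    producer = KafkaProducer(bootstrap_servers="broker:9092")
    producer.send("events", b'{"user": 42, "action": "click"}')
    producer.flush()

    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="broker:9092",
        group_id="analytics",
        auto_offset_reset="earliest",
    )
    for message in consumer:
        print(message.value)
        break  # just show one record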
Re: (Score:2)
Please tell me they named the fantastic "Microsoft Bob" app after him.
Does this mean... (Score:2)
Technical experts or competitors (Score:2)
If you look at the list technical experts
1 Bob Muglia - Head of a startup competitor that is trying to market a data analytics product and steer some of that Hadoop investment into his fold. His sales model is "Look how easy we are." What you should be asking is how much it costs and how you get your data back.
2 Bob Johnson - Cofounder of an analytics company trying to steer some of that Hadoop investment into his pocket.
This is a beat up driven by people who wished that they had a slice of the Hadoop p
How do I get "Joe User" to to access the data???? (Score:1)
Re: (Score:1)
Hire someone competent with actual software development skills? Most data scientists I've met were glorified or relabeled data analysts. Some minor stats background and maybe they can hack together a script. That's fine and really valuable for analyzing large datasets and formatting the results into pretty figures for decision-makers to look at.
If your data is too complex for their basic ETL skills and it's taking a month to build interfaces, hire one competent and expensive developer to build those interfa
Right on time. (Score:1)
After only 5 minutes with Hadoop I could figure out it was nothing but a giant boondoggle. It only took to the end of that afternoon to be completely sure. Now, what... 3, 4 years later the rest of the industry is starting to figure it out, en-masse? Seems about right.
Re: (Score:2)
Unreasonable expectations (Score:2)
Perhaps the issue here is about unreasonable expectations.
No software, Hadoop or otherwise, will magically extract meaning from a huge dump of data. That takes work, whatever tool you use.
This rant reminds me of the people who purchased an enterprise service bus to interconnect IT applications, only to discover that instead of interconnecting applications, they now need to interconnect applications with the enterprise service bus. No problem gets solved for free.
Hadoop has failed says ex Microsoftie (Score:1)
Here's Bob Muglia while at Microsoft describing how to 'add additional semantics' to Outlook, that is perform a detailed analysis of Lotus Notes and then clone it into Outlook.
"Notes/Domino R5 is very scary. We all saw the demo. Exchange has worked with teams around the company to put together a very det
Hadoop is easily put to shame (Score:2)
To me Hadoop was the classic solution desperately in search of a problem. The worst part is how many people jumped onto Hadoop and thought they were ass kickers for doing so.
The simple reality is that for most corporate datasets the too
What about MongoDB? (Score:2)
Isn't MongoDB supposed to be similar to Hadoop? Do the same pitfalls for Hadoop apply to MongoDB?
Hadoop isn't just MapReduce and Pig (Score:5, Insightful)
If you don't have problems that relate to these paradigms... don't use it. Seriously. Just because it's new doesn't mean it fits every situation. It's not mysql/mariadb/postgresql... if you think it's even remotely close to that simple you should run for the hills. If you have a significantly large (not talking hundreds of megs or even a couple gigs... you need to be thinking in billions of rows here) configuration management problem then it's a great base to layer other projects on top of to solve your problem.
Also, I found a large number of problems to solve using timestamped individual data cells that CANNOT be done using traditional SQL methodologies. Lexicographic configuration models, analytics (obv), massive backup history, just to name a few. If the management and installation of the cluster are scary... well... not everything in CS is easy... especially when it gets to handling the world's largest datasets... so this probably isn't really your problem... call the sysadmins and ask them (politely) to help. Believe it or not, the main companies have wizards which can help get you going across clusters... and even manage them visually (not that I ever would... UIs are for people who can't type).
When people (or just this CEO) say it doesn't deliver on its promise, you are likely trying to solve a problem wholly inappropriately. I have personally used it to solve problems like making real-time recommendations in under 200ms across several gigs of personal data daily (totalling easily into terabytes). (No, you don't use MapReduce... think harder... but you DO use HDFS.)
So what promise were you told?
Other than real time (as illustrated above), you can do archiving, ETL of course, and things like enabling SQL lookups, or RRDs... using a number of toolkits or Spark. Seriously, this is one of the best things since sliced bread when it comes to processing and managing real big data problems. Check out the Lambda processing model when you get a chance... you might be impressed, or utterly confused. Lambda (not the programming construct, nor AWS Lambda) applies multiple Apache technologies to solve historical and real-time problems in a sane manner. Also, managing massively distributed backups is much simpler with HDFS
Honestly, outside of Teradata implementations, there is nowhere in the world you can get this kind of data resiliency, efficiency, or management. Granted, it doesn't have the 20+ years of chops in HUGE datasets that Teradata does, nor the support... but it's open source and won't cost you much to try.
Long long story short. What the hell! I feel like programmers today are constantly
Any commercial solution will cost you
If Hadoop seems large and frightening just wait until y
Unhappy customers: caveat emptor (Score:1)
I think many of the 'unhappy customers' the article refers to are companies where somebody who didn't quite understand the technology pushed Hadoop as a replacement for (expensive) proprietary software like Oracle, only to then be sorely disappointed, especially with interactive performance.
I've been working with Hadoop since 2007 and have successfully deployed it for multiple clients. First of all, you really want to see if the use case makes sense; sometimes you're just better off with an RDBMS like MySQL. Some comp
Big Data is a nice word. (Score:2)
Big Data is a nice word. The fact that the concept is useful for roughly 5 ginormous global internet companies and beyond pointless for everybody else is probably something that 99.9% of all people making the final decisions on which technology stack is used have zero clue about. They haven't got the faintest idea what big data actually means and what problems with it solutions like Hadoop actually address.
I'd bet money that 99 of 100 scenarios in which hadoop would even run better with some unspectacular
Re: Big Data is a nice wors. (Score:2)
Sorry for the typos - using a tablet just now. :-)
Bandwagons (Score:2)
In other news, bandwagon jumpers are shocked to discover that the cool new doohickey they read about in Tech Fashion Trends Magazine doesn't actually magically fix every problem you throw at it.
Computer technology has been around and commonplace for several decades now. It isn't news that this stuff is complicated, and it gets even more complicated with each passing year.
And yet while a client would never demand a builder use this specific kind of scaffolding and cement to build with because they read
The first? (Score:1)
Imagine a Beowulf cluster of these!
Are these people really that stupid? (Score:2)
Are these people for real?
The whole article screams, "I don't know what I'm doing but I love jumping on bandwagons."
Apache Hadoop and Kafka are two completely different tools, intended for two COMPLETELY different workloads.
So if you used Hadoop when you should have used Kafka, that doesn't mean Hadoop is bad. It means you haven't done your job and properly vetted the tools available for suitability.