There's a new data-processing technology called Hadoop that looks like it could topple traditional database players like Oracle, IBM, and Microsoft.
But one of the smartest guys in the field -- the one who created Hadoop -- says Oracle doesn't need to worry.
Hadoop allows companies to store and analyze mind-boggling amounts of information in all sorts of formats (documents, tweets, photos, etc.). A regular database can't deal with that variety, known as "unstructured data." And Hadoop does it with free open source software running on super cheap hardware, so it can be a lot less expensive than Oracle, too.
Because of that, big data is quickly becoming a new multi-billion-dollar market.
As companies adopt Hadoop, what happens to regular databases from companies like Oracle, IBM, and Microsoft -- known as "relational databases"? Do companies stop using them?
Nope. Hadoop is a whole new thing. It won't hurt them because it doesn't do what they do, said Hadoop's creator Doug Cutting.
Cutting is a database genius who invented this technology when he was at Yahoo. He's now working for a startup, Cloudera.
Hadoop is "augmenting and not replacing," regular databases, Cutting said. "There are a lot of things I don't think Hadoop is ever going to replace, things like doing payroll, the real nuts-and-bolts things that people have been using relational databases for forever. It's not really a sweet spot for Hadoop."
But he still couldn't resist a small swipe at Microsoft and Oracle, which are working on Hadoop offerings of their own. "Right now these things from Oracle and Microsoft are experiments," he said.
Oracle has partnered with Cutting's current company, Cloudera, for Hadoop. Microsoft has partnered with Cloudera's competitor, Hortonworks.
Where did Hadoop come from?
Mike Olson: The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textual and structural information they were collecting, and then present meaningful and actionable results to users. There was nothing on the market that would let them do that, so they built their own platform. Google's innovations were incorporated into Nutch, an open source project, and Hadoop was later spun off from that. Yahoo has played a key role in developing Hadoop for enterprise applications.
What problems can Hadoop solve?
Mike Olson: The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn't fit nicely into tables. It's for situations where you want to run analytics that are deep and computationally intensive, like clustering and targeting. That's exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms.
Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they're more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built. Those are just a few examples.
How is Hadoop architected?
Mike Olson: Hadoop is designed to run on a large number of machines that don't share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you want to load all of your organization's data into Hadoop, what the software does is bust that data into pieces that it then spreads across your different servers. There's no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because it stores multiple copies of everything, data on a server that goes offline or dies can be automatically replicated from a known good copy.
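To make the block-and-replica idea concrete, here is a toy sketch in plain Java. This is not actual HDFS code; the 64 MB block size matches the classic HDFS default, but the round-robin placement and all names here are illustrative assumptions.

import java.util.ArrayList;
import java.util.List;

// Toy illustration of HDFS-style placement: split a file into fixed-size
// blocks and assign each block to REPLICAS distinct nodes. Not real HDFS code.
public class BlockPlacementSketch {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // classic HDFS default block size
    static final int REPLICAS = 3;                    // default HDFS replication factor

    // Assign every block of a file to REPLICAS different nodes, round-robin.
    static List<List<Integer>> placeBlocks(long fileSize, int numNodes) {
        long numBlocks = (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
        List<List<Integer>> placement = new ArrayList<>();
        for (long b = 0; b < numBlocks; b++) {
            List<Integer> nodes = new ArrayList<>();
            for (int r = 0; r < REPLICAS; r++) {
                nodes.add((int) ((b + r) % numNodes)); // spread copies across nodes
            }
            placement.add(nodes);
        }
        return placement;
    }

    public static void main(String[] args) {
        // A 200 MB file on a 5-node cluster: 4 blocks, each held by 3 nodes,
        // so losing any single node still leaves 2 live copies of every block.
        for (List<Integer> nodes : placeBlocks(200L * 1024 * 1024, 5)) {
            System.out.println("block on nodes " + nodes);
        }
    }
}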
In a centralized database system, you've got one big disk connected to four or eight or 16 big processors. But that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. That's MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set.
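The canonical example of this pattern is counting word frequencies across a pile of documents. The sketch below uses Hadoop's standard org.apache.hadoop.mapreduce API; the input and output paths are placeholders passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: each server runs this over its own slice of the input,
    // emitting (word, 1) for every word it sees.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all the counts for a given word are routed to one place
    // and summed into a single result for that word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-sum on each mapper to cut network traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}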
Architecturally, the reason you're able to deal with lots of data is because Hadoop spreads it out. And the reason you're able to ask complicated computational questions is because you've got all of these processors, working in parallel, harnessed together.
At this point, do companies need to develop their own Hadoop applications?

Mike Olson: It's fair to say that a current Hadoop adopter must be more sophisticated than a relational database adopter. There are not that many "shrink wrapped" applications today that you can get right out of the box and run on your Hadoop cluster. It's similar to the early '80s when Ingres and IBM were selling their database engines and people often had to write applications locally to operate on the data.
That said, you can develop applications in a lot of different languages that run on the Hadoop framework. The developer tools and interfaces are pretty simple. Some of our partners — Informatica is a good example — have ported their tools so that they're able to talk to data stored in a Hadoop cluster using Hadoop APIs. There are specialist vendors that are up and coming, and there are also a couple of general-purpose query tools: Hive, a version of SQL that lets you interact with data stored on a Hadoop cluster, and Pig, a language developed by Yahoo that allows for data flow and data transformation operations on a Hadoop cluster.
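To give a flavor of the SQL route, here is a minimal Java sketch that queries cluster-resident data through Hive's JDBC driver (HiveServer2). The host, port, table, and column names are placeholder assumptions, and the Hive JDBC jar must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: run a Hive query against data stored in a Hadoop cluster.
// Hive compiles the SQL into MapReduce jobs behind the scenes.
public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://hive-host:10000/default", "user", "");
             Statement stmt = con.createStatement();
             // "weblogs" and its columns are hypothetical names for this sketch.
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}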
Hadoop's deployment is a bit tricky at this stage, but the vendors are moving quickly to create applications that solve these problems. I expect to see more of the shrink-wrapped apps appearing over the next couple of years.
Hadoop is surely going to create a mega tsunami on the application side of the business.
It's wise for consultants to keep a watchful eye on this technology; maybe two years from now, you will start to feel the heat of this new data tool.
Content collected from various web sources: radar.oreilly.com, Wikipedia, hadoop.apache.org
Thanks & Regards,
S.Grace Paul Regan