Apache Hadoop at 20
A week or two ago, Doug Cutting wrote up a ten-year retrospective on Apache Hadoop for the project’s birthday. I enjoyed it. As co-creator of the project, Doug’s had a privileged seat from which to watch the decade unfold. I really liked the fact that he called out the contributions of the global Apache developer community so strongly. I believe that’s the key reason that Hadoop has been so successful.
Plus, I had never seen Hadoop the elephant sitting on the Stone of Destiny before.
Doug’s post was, though, backward-looking. It talked a lot about Hadoop’s history, and not so much about its future. I’d like to point the lens in the other direction, and consider what’s likely to happen over the next ten years.
What the heck is Hadoop?
The mass of software that we mean when we talk about Hadoop today has little in common with the code that Yahoo! rolled into production in 2006. The original project was based on two components developed at Google and described in the research literature: the Google File System, a distributed storage layer, and MapReduce, a data processing framework.
A new data processing layer — distributed, easy to program compared with grid systems of the day — was built on top of a large-scale, inexpensive storage layer. That combination was new, and allowed web companies to do crazy things with way more data than they’d ever been able to handle before.
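To make "easy to program" concrete, here is a minimal single-process sketch of the MapReduce programming model in Python, using the classic word-count example. It is an illustration only, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this sketch, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Combine all the values seen for one key into a single result."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle and reduce in one process (hypothetical driver)."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

The programmer writes only the map and reduce functions. Grouping, distribution and recovery from failures are the framework's job, which is what made the model approachable compared with hand-built grid code.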
Those two components weren’t enough to solve the wide variety of problems that businesses have, though. New projects have emerged over the past ten years to broaden Hadoop’s utility. (In fact, those two components weren’t enough to solve the problems that Google had: it’s running a substantially evolved and much more diverse collection of systems today, too.)
A partial snapshot of our offering today looks like:
All that blue stuff is new. It doesn’t just add new capabilities to the original Hadoop components. It’s shifted the center of gravity in the community dramatically. The platform layer in particular is dominated now by real-time support and rich interactive processing models:
In fact, the story is even more powerful for new components than the picture suggests. I didn’t have room to put in pieces like YARN, the resource scheduling framework, or Apache Sentry for security — those cross-component pieces that touch all parts of the platform are just hard to fit into the space available. They, and others, are all rolled up in the “dot dot dots” at the right end of the picture.
The original components, HDFS and MapReduce, are down to just 10 percent of the code contributions in the total ecosystem, while the newer components are at 90 percent and growing fast. More importantly, when we look at the share of work being done by the various components, new workloads are rolling out mostly on top of the new pieces. Legacy MapReduce will always matter, but that’s just not where the action is anymore.
And yet when people say “Hadoop,” informally, they generally mean the two original Apache Hadoop components, plus all the new projects that have grown up around them. Hadoop today still shows the outlines of its original incarnation, but is a dramatically larger, more powerful and more interesting collection of technologies than it was.
Where the software is going
The last-ten-years trend is certain to continue. We’ll see the Hadoop ecosystem grow over the coming years as new projects are created to handle new data and to offer new analytic approaches.
Three years ago, Apache Spark was thought to be a deadly threat to Hadoop. Today it’s an essential part of the broader ecosystem. We’re madly in love with it, but we’re also scanning the horizon for its successor as the Cool New Project. I’ve no doubt that such a successor will come — again, and again, and again. That’s not to say that Spark is doomed, but that there’ll be new frameworks, especially in such a fast-moving open source ecosystem. Smart people are going to keep having new ideas. The flexibility that has allowed Hadoop to incorporate so many diverse projects over time bodes well for its continued growth.
There is, of course, evolutionary work to be done on parts of the platform that exist today. Spark needs to get better in a variety of ways for secure enterprise use. YARN has opened up the door to multi-tenant workloads, but it’s too prescriptive, and not responsive enough. We need that to change if we’re to deliver real multi-user, multi-workload support for big data. We’re really excited about Apache Kudu (incubating), but it’s young and has much to prove in production.
A few young open source projects already show real promise of joining the Hadoop party. Google developed its Dataflow software for internal use, but has collaborated with the Apache community to create a new project, Apache Beam (incubating), for managing data processing flows, including ingest and data integration across components. Developer tools like Apache HTrace (incubating) are aimed at testing and improving performance of distributed systems.
We’ve already seen cloud-native big data solutions built on Hadoop in the market. Amazon EMR, Microsoft’s Azure HDInsight and Google Cloud Dataproc are on-demand offerings that make it fast and easy to spin up a cluster in the public cloud. The coming years will, without question, see the platform embrace both datacenter and cloud deployments natively, including elasticity and consumption-based pricing. Users absolutely want that flexibility, and it’ll be baked into the platform over the near term.
There may well come a point when we all decide that the birds have left the dinosaurs behind. As the evolution continues, we may one day stop calling this platform Hadoop. Even if we do, I am confident that it will be defined by the key strengths that Doug and Mike Cafarella created when they started the project a decade ago:
- A vibrant open source developer community, collaborating around the world to innovate faster than any single company could alone.
- Extensibility as a fundamental design property. Hadoop’s ability to embrace SQL, real-time processing with Spark, and new storage substrates like Apache HBase and Kudu allows it to evolve.
- A deep systems focus. Knitting together large-scale distributed infrastructure — managing memory, disks and CPUs as a fabric — was fantastically hard when Hadoop was created. By doing that hard work for developers, the Hadoop community has allowed application programmers to concentrate on business services and interfaces, instead of on knotty infrastructure issues. Those problems get no easier in the future. Whatever Hadoop becomes, it will be driven by hard-core systems people.
The action is in hardware
Most of the action for the next ten years, though, will be in hardware.
Google — the inspiration for Hadoop — built its infrastructure to run on the cheap pizza-box systems available in the late 1990s. The fundamental design decisions in the current-generation software were made then. Everywhere you look, you see them: Disk is cheap; memory is expensive. A random disk bit is about a million times further away than a random RAM bit. You need lots of copies of data because so many things can go wrong. Processors inside the chassis are really close; processors in the same rack are pretty close; processors in other racks are far away; processors in other data centers do not exist. And so on.
Those were laws of physics in 1999. We violate them regularly today. Ten years hence, they’ll all be wrong.
Because of the relationship that we have with Intel, Cloudera gets to look at the future of hardware early. There are dramatic changes coming in pretty much every part of the hardware ecosystem. Those changes will mean much better, faster and more powerful systems. To deliver them, though, we’ll have to make fundamental changes to the Hadoop software platform.
One example, now publicly announced, is 3D XPoint™ (pronounced “crosspoint”) technology. Intel solid-state drives based on the technology will begin to ship in 2016. They’re non-volatile, so they survive power outages without losing data. 3D XPoint™ technology isn’t as fast as RAM, but it’s vastly faster than disk, and up to 1,000 times faster than NAND flash technology. More importantly, it offers up to 10x the storage density of traditional RAM. You can pack a lot of bits into very little space.
Google’s architecture was disk-heavy. It needed three times more disk than the data it stored, for redundancy (disks have lots of moving parts that fail) and for performance (latency is so bad you want to spread workloads out to reduce head contention). Disk is ravenous for power — it takes lots of electrons to spin those motors and move those arms.
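The disk overhead is simple arithmetic. Here is a rough sketch in Python, assuming every piece of data is stored as three full copies, as described above. The function names are hypothetical, and the extra headroom a real deployment needs for temporary data and growth is ignored.

```python
def raw_capacity_tb(logical_tb, replication=3):
    """Raw disk (in TB) needed to hold logical_tb of data at a replication factor."""
    if logical_tb < 0 or replication < 1:
        raise ValueError("logical_tb must be >= 0 and replication must be >= 1")
    return logical_tb * replication


def storage_efficiency(replication=3):
    """Fraction of raw disk that holds unique data rather than extra copies."""
    if replication < 1:
        raise ValueError("replication must be >= 1")
    return 1 / replication


if __name__ == "__main__":
    logical = 100  # TB of actual data, an illustrative figure
    print(f"{logical} TB of data needs {raw_capacity_tb(logical)} TB of raw disk")
    print(f"Share of raw disk holding unique data: {storage_efficiency():.0%}")
```

At three copies, two-thirds of every disk purchased, powered and cooled is holding duplicates. That is the cost a denser, more reliable storage medium could start to claw back.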
3D XPoint™ technology has exactly zero of those problems. Granted, it costs more than disk, but that curve is pointed in the right direction, and over ten years that story will get way better.
More interestingly: beginning in the 1960s, we came up with complicated ways to organize data because we had to move it from high-latency, cheap systems like tape and disk to fast, expensive systems like RAM. We created logs and B-trees and all kinds of other page-based systems to accommodate that separation.
If 3D XPoint™ technology is disky in persistence, density and price, and RAMmy in its latency and throughput, then maybe we can get rid of the split. Perhaps we can organize our data for the convenience of our algorithms and our users, instead of for our disk heads. There is enormous potential to wipe fifty years’ worth of complexity from the slate, and to build for a new generation of storage.
This is one example of many underway. The coming years will see further innovation in storage, dramatic improvements in networking and new ways to push special-purpose processing into the overall fabric of our computer systems. Ten-year-old Hadoop was built on the bedrock of Google’s pizza-box assumptions. As that foundation shifts, so must the software.
Data, data everywhere
The hardware and software are locked in a feedback loop. Each will change as a result of the other, and each will drive more changes from there.
One thing, though, won’t change. There’s going to be a ton of data.
That’ll be driven by the hardware and software, of course. We’ll have sensors and actuators everywhere, and we’ll run code that wants to talk about its status.
Most of all, though, the continued growth in data will be driven by the value of creating and collecting it. The more detailed and specific information we have about the world, the better we can understand it. The more powerful the software tools that we have for analyzing it — SQL and machine learning and all the new techniques the next decade will create — the more we’ll know, and the better we will be able to anticipate and act.