Broadening Support for Apache Spark

Six months ago, Mike Olson wrote a blog post in which he articulated our belief that Spark is becoming the successor to MapReduce for Hadoop data processing. Since then, we’ve made great strides advancing Spark in the Hadoop ecosystem. We shipped Spark as part of CDH 5.0 and integrated Spark with YARN, Kerberos security, and Cloudera Manager. Recently, we also introduced the industry’s first and only hands-on developer training for Spark (“Cloudera Developer Training for Apache Spark”).

When Mike wrote his blog post, we were confident this was the right technical direction for the Hadoop community, but we couldn’t have predicted how quickly industry consensus would form around this same point of view. Many of our competitors have since announced support for Spark, and many recent articles in the trade press have echoed Mike’s post. Today, we have the pleasure of announcing a new engineering collaboration with a number of leaders in the data management industry.

At Spark Summit, Cloudera, Databricks, IBM, Intel, and MapR announced a collaboration to broaden support for Apache Spark as the standard data processing engine for the Hadoop ecosystem. Primarily, this initiative involves porting a number of open source MapReduce tools to support Spark as the underlying execution engine. A number of tools already work on Spark, including MLlib, Mahout, Crunch, and Cascading (I exclude Spark Streaming, which is more inherent to Spark than it is a tool). However, users are also already familiar with tools like Hive, Pig, Sqoop, and Oozie that do not currently run on Spark. These are the principal focus of this joint development effort across Databricks, IBM, Intel, MapR, and ourselves.

For customers and users, the advantage of this initiative will be a Hadoop platform that is faster and more flexible than what was previously possible. This project has the added advantage of simplicity. A user should not have to deploy a variety of different low-level data processing engines depending on which client they favor. By ensuring Spark enjoys broad-based client support, we hope to make the platform more coherent and easier to use.

This initiative is also consistent with our strategy of driving open source industry standards. In fact, every component inside CDH is distributed and supported by at least two vendors – simplifying decisions for our partner ecosystem and protecting customers from long-term architectural lock-in.

In a recent blog post, Hortonworks attempted to sow confusion by suggesting that our efforts to integrate Hive in particular with Spark somehow signify a lack of commitment to Impala. Nothing could be further from the truth. In fact, just four weeks ago we published an updated set of benchmarks that validated the same technical direction for SQL on Hadoop that we laid out 18 months ago and reiterated six months ago. Specifically:

  • By implementing a native MPP query engine, Impala is able to run queries faster and more efficiently, with higher levels of concurrency and lower latencies, than Hive.

  • Running Hive on a DAG engine like Spark or Tez incrementally improves batch Hive, but it does not meaningfully diminish Impala’s advantages.

  • Today’s differences are not slight. Even running against the very latest versions of Stinger (Hive 0.13 on Tez) and Shark (Hive 0.11 on Spark), Impala is an order of magnitude faster under concurrent load. In the traditional DBMS industry a 10X advantage would be unheard of.

These technical advantages, which are inherent to Impala, benefit users and customers in real and important ways. Numerous customers are in production with Impala; a few examples are “MicroStrategy on Hadoop using Cloudera Impala Demo”, “Zoomdata Technology: Cloudera Impala, d3.js and Big Data Analytics”, and “Tableau + Cloudera Impala: Bring Your Hadoop Data To Life!” Impala provides a powerful user experience for customers and is a key part of CDH.

After 18 months of Stinger phases and breathless (but unsubstantiated) claims of 100X gains, the Hadoop community is roughly in the same position as it was when Impala went GA. Hive is a reliable SQL batch engine that we continue to enhance and maintain, in particular because it is extensively used for ETL. Impala is still the only open source framework that enables organizations to unlock data in Apache Hadoop for BI users. Users of CDH, MapR, and Amazon EMR can receive these benefits today because those platforms all ship Impala. Users of other distributions need to purchase additional licenses and replicate their data into proprietary databases so their SQL analysts can work with Hadoop data. Beyond being more costly and more complex, this approach is arguably the highest lock-in option available to a customer.

It is entirely consistent with the long-term direction of the Hadoop platform to develop and support both general-purpose data processing frameworks like Spark and MapReduce and special-purpose frameworks like Impala and Search. Special-purpose frameworks tend to be stronger than general-purpose ones when it comes to performance and functionality (e.g. Dremel vs. Tenzing, Impala vs. Hive, GraphLab vs. GraphX, etc.). General-purpose frameworks tend to be more flexible and catch all the other workloads that don’t cleanly fit a purpose-built framework. The capacity to ship and support both types of frameworks is exactly what YARN is supposed to make possible. Why would anyone go to the trouble of creating a plug-and-play architecture and then plug only one thing into it?

Our goal is to create the most compelling experience (through features and performance) for each user community (including analysts, ETL developers, data scientists, etc.) in a platform that is coherent, usable, open source, and multi-vendor supported. Our investments in Spark and Impala are two strong examples of that strategy.

In the coming months, we plan to keep users and customers updated on the group’s progress in expanding the range of tools and clients that integrate with Spark, as well as on some upcoming advances for Impala. As always, we appreciate your support and welcome your participation in these projects.

The post Broadening Support for Apache Spark appeared first on Cloudera VISION.
