Faster Batch Processing with Hive-on-Spark

Apache Spark has quickly emerged as a powerful data processing framework for Apache Hadoop, well poised to succeed MapReduce in the ecosystem. Cloudera’s One Platform Initiative is hastening this transition with focused development on the scale, security, management, and streaming capabilities Spark needs to support a wide range of enterprise applications. Spark’s power and popularity stem from its flexible and extensible APIs for a wide spectrum of workloads, its ease of development, and its better performance for batch processing. That last point, faster batch processing, is what makes Spark highly suitable as Hive’s underlying compute engine. Hive-on-Spark (HoS), available with C5.7, enables Hive to run on Spark as the underlying compute engine while maintaining full compatibility with current Hive-on-MapReduce (HoMR) workloads. It rounds out the SQL-on-Hadoop technologies available within the ecosystem (including SparkSQL and Apache Impala (incubating)), giving a broad base of users and use cases a best-of-breed tool for each job. For more information on how these technologies complement each other, check out SQL-on-Apache Hadoop – Choosing the right tool for the right job.

3x Performance Boost with Seamless Transition

With this latest release, Hive-on-Spark has emerged as a production-ready and fully supported component within Cloudera’s platform. Hive can now run on Spark either at the cluster level for all queries or for individual queries; we recommend starting with individual queries. With this release, Hive users can run their workloads an average of 3x faster than on Hive-on-MapReduce. From the beginning, a seamless transition to HoS was one of our primary design goals. HoS has delivered on it by attaining full parity with HoMR in functionality, compatibility, and integration with other components within Cloudera’s platform. The 3x performance gain is only the beginning, as more optimizations are on the roadmap.
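The per-query option can be illustrated with a minimal sketch. It assumes the engine is selected through Hive's session-level `hive.execution.engine` property (with `mr` and `spark` as the values relevant here); the helper name `with_engine` and the table names in the example query are hypothetical and only serve the illustration.

```python
# Minimal sketch: prefix a single HiveQL query with a session-level engine
# switch, so that only this query runs on Spark while the cluster default
# stays unchanged. `with_engine` is a hypothetical helper, not a Hive API.

VALID_ENGINES = {"mr", "spark"}  # the engines discussed in this post


def with_engine(query: str, engine: str = "spark") -> str:
    """Return a HiveQL script that sets the engine, then runs the query."""
    if engine not in VALID_ENGINES:
        raise ValueError(f"unsupported engine: {engine!r}")
    # Normalize the query so it ends with exactly one semicolon.
    body = query.strip().rstrip(";")
    return f"SET hive.execution.engine={engine};\n{body};"


if __name__ == "__main__":
    # Hypothetical multi-join query, the kind of workload to try first.
    script = with_engine(
        "SELECT c.region, SUM(o.total) "
        "FROM orders o JOIN customers c ON o.customer_id = c.id "
        "GROUP BY c.region"
    )
    print(script)
```

The resulting script can be submitted through whichever Hive client is already in use. Passing `engine="mr"` produces the equivalent HoMR script, which makes it easy to compare the same query on both engines.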

Minimal Configuration Changes Required

Spark exposes multiple optimization knobs, many of which affect the performance of HoS. Understanding and tuning all of them can be a daunting task even for seasoned Spark users. The good news is that in the 5.7 release, we have ensured that most of these are automatically tuned by Cloudera Manager. Consequently, using HoS requires minimal configuration changes. For users who need additional tuning for their cluster, we have put together a Configuration and Tuning Guide.

Maximizes Performance for Complex and Resource Intensive Workloads

HoS especially benefits multi-stage, complex, and resource-intensive Hive workloads, such as those with multiple joins and group-by operations. Some of our customers have reported an order-of-magnitude gain in performance when enabling HoS for such workloads. When running on HoS, intermediate data is kept in memory, which saves the costly steps of serializing and deserializing it and of writing it to and reading it from disk. This yields the most significant performance gain for complex workloads, with less pronounced results for queries that produce little intermediate data (e.g. ‘select *’). Therefore, to assess the performance benefits of HoS accurately, we recommend that users turn on HoS for their most complex workloads first and work back from there until it is eventually enabled for their entire cluster. Moreover, complex workloads typically cover most of the Hive features in use across customer workloads, so running them successfully first helps ensure that the remaining workloads transition to HoS without hitting known issues or bugs.

Continued Enthusiasm and Support from Community and Users

The supported release of HoS represents the culmination of the past 1.5 years of community effort led by Cloudera and supported by Intel, MapR, IBM, and others. Since the beginning, HoS has thrived on strong community enthusiasm and widespread support from both users and developers. With more optimizations in the pipeline, such as Dynamic Partition Pruning, Vectorization (for Parquet), and the Cost-Based Optimizer, the community believes that a substantial performance boost is attainable in upcoming releases.

Sparkling Future of Hive

With Spark swiftly becoming the successor to MapReduce as the compute engine of choice in the Hadoop ecosystem, HoS is all set to become mainstream. The ability to seamlessly transition existing workloads with an average 3x performance gain points to a bright future, one that will be bolstered by widespread adoption and continued community enthusiasm. Just as Hive has consistently done over many years, we believe that HoS will prove itself to be an even more potent and trusted tool for taming the world’s ever-growing data needs.

The post Faster Batch Processing with Hive-on-Spark appeared first on Cloudera VISION.
