Hadoop MapReduce or Spark: What if you don’t have to decide now?

This blog was penned by Tendü Yoğurtçu, General Manager, Big Data at Syncsort

2014 was a tipping point for Apache Hadoop: it graduated from being simply a distributed file system and the MapReduce engine for high-performance batch processing to becoming a multi-purpose platform capable of handling a wide variety of workloads including machine learning, social graph analysis, interactive queries, real-time data processing, and much more.

One of the primary reasons that Hadoop is such a disruptive technology is that it provides highly scalable storage and data processing capabilities at price points that are orders of magnitude lower than legacy systems. Consistent with Moore’s Law, performance and cost improvements have made mobile devices, connected consumer electronics, and the Internet pervasive in every aspect of our lives, dramatically increasing the amount of generated data that needs to be analyzed.

Hadoop has rapidly become the most affordable platform for Big Data processing, and in the early years it focused heavily on batch-oriented workloads. One of the limitations of MapReduce is its high latency when running a complex multi-stage data flow that requires multiple MapReduce jobs. Each job adds its own startup latency, and intermediate results are written to and read back from the Hadoop Distributed File System (HDFS) between jobs. The requirement to support compute paradigms other than MapReduce at low latency led to the development of Directed Acyclic Graph (DAG) based compute frameworks over Hadoop, such as Apache Spark.

Apache Spark uses Resilient Distributed Datasets (RDDs) for in-memory data sharing across multiple data flows, resulting in low latency. Spark, originally developed at the UC Berkeley AMPLab as an open source, in-memory distributed computing framework, has garnered strong interest, with over 400 contributors, making it one of the most active Apache Software Foundation (ASF) projects.
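To make the latency argument concrete, here is a toy sketch in plain Python. It is not Spark or MapReduce code: local JSON files stand in for HDFS, and the three stage functions and both runner names are invented for illustration. The point is only that the same multi-stage flow pays one write and one read per stage when stages hand off through storage, and none when intermediate results stay in memory.

```python
import json
import os
import tempfile


def tokenize(lines):
    """Stage 1: split each line into lowercase words."""
    return [word.lower() for line in lines for word in line.split()]


def count(words):
    """Stage 2: count occurrences of each word."""
    totals = {}
    for word in words:
        totals[word] = totals.get(word, 0) + 1
    return totals


def top(totals, n=2):
    """Stage 3: keep the n most frequent words (ties broken alphabetically)."""
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [[word, hits] for word, hits in ranked[:n]]


STAGES = [tokenize, count, top]


def run_with_disk_handoff(data, workdir):
    """MapReduce-style chaining: each stage writes its output to storage
    and the next stage reads it back. Returns (result, number_of_writes)."""
    writes = 0
    for index, stage in enumerate(STAGES):
        result = stage(data)
        path = os.path.join(workdir, f"stage_{index}.json")
        with open(path, "w") as handle:
            json.dump(result, handle)
        writes += 1
        with open(path) as handle:
            data = json.load(handle)
    return data, writes


def run_in_memory(data):
    """In-memory chaining: intermediate results are passed directly
    between stages. Returns (result, number_of_writes)."""
    for stage in STAGES:
        data = stage(data)
    return data, 0


if __name__ == "__main__":
    sample = ["spark hadoop spark", "hadoop spark yarn"]
    with tempfile.TemporaryDirectory() as workdir:
        print(run_with_disk_handoff(sample, workdir))
    print(run_in_memory(sample))
```

Both runners produce the same answer; only the number of storage round trips differs, and on a real cluster each of those round trips is a distributed write and read.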

Cloudera’s Accelerator Program for Spark is helping to drive both adoption and ecosystem compatibility for Apache Spark. As more organizations and vendors participate in the Accelerator Program, more turnkey solutions will be supported. For years, Syncsort has been a prolific contributor to the Apache Hadoop family of projects. One of the advantages of Syncsort’s flagship Hadoop product, DMX-h, has been that it implements a DAG corresponding to the user-defined data processing pipeline and, unlike legacy ETL products, can run natively within the MapReduce processing framework.

Syncsort’s newest Hadoop product release takes the next natural step of providing an “Intelligent Execution Layer” (IEL), which is designed to select a processing engine, such as MapReduce or Spark, while still running natively within the selected framework. Customers no longer need to be concerned with the underlying compute framework and instead can focus on the business logic and data flow, while still getting all of the performance advantages of native processing. This abstraction also “future proofs” a customer’s deployment environment, be it Hadoop or any other DAG-based compute framework, on-premises or in the cloud, leaving the decision of how the job will be run to the IEL at runtime. This decision making will automatically evolve along with the rapidly improving big data technology stack, without requiring any changes to the data flows.
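The general idea of separating a data flow from the engine that runs it can be sketched in a few lines of Python. This is not Syncsort's IEL and says nothing about how it decides: every name here (DataFlow, BatchEngine, PipelinedEngine, choose_engine) is hypothetical, and the selection rule, a simple record-count threshold, is invented purely to show a decision being made at runtime.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Protocol, Tuple

Step = Callable[[Iterable[dict]], Iterable[dict]]


@dataclass(frozen=True)
class DataFlow:
    """Engine-independent description of a pipeline: an ordered set of steps."""
    name: str
    steps: Tuple[Step, ...]


class Engine(Protocol):
    name: str

    def run(self, flow: DataFlow, records: list) -> list: ...


class BatchEngine:
    """Stand-in for a stage-at-a-time engine: materializes every step's output."""
    name = "batch"

    def run(self, flow, records):
        for step in flow.steps:
            records = list(step(records))  # full copy marks a stage boundary
        return records


class PipelinedEngine:
    """Stand-in for a pipelined engine: streams records through all steps."""
    name = "pipelined"

    def run(self, flow, records):
        stream = iter(records)
        for step in flow.steps:
            stream = step(stream)
        return list(stream)


def choose_engine(available: Dict[str, Engine], record_count: int,
                  memory_budget: int) -> Engine:
    """Hypothetical rule: pipeline when the input fits the memory budget."""
    if "pipelined" in available and record_count <= memory_budget:
        return available["pipelined"]
    return available["batch"]


def execute(flow, records, available, memory_budget=1000):
    """Pick an engine at runtime and run the unchanged flow on it."""
    engine = choose_engine(available, len(records), memory_budget)
    return engine.name, engine.run(flow, records)


# Business logic, written once and unaware of the engine.
def drop_refunds(rows):
    return (row for row in rows if row["amount"] > 0)


def to_cents(rows):
    return ({**row, "amount": row["amount"] * 100} for row in rows)


if __name__ == "__main__":
    flow = DataFlow("sales", (drop_refunds, to_cents))
    rows = [{"id": 1, "amount": 12}, {"id": 2, "amount": -5}, {"id": 3, "amount": 3}]
    engines = {"batch": BatchEngine(), "pipelined": PipelinedEngine()}
    print(execute(flow, rows, engines, memory_budget=10))
    print(execute(flow, rows, engines, memory_budget=1))
```

The flow definition never changes between the two calls; only the engine chosen for it does, which is the property the abstraction is meant to provide.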

This new release also ships with a Spark Mainframe Data Connector, bringing multiple mainframe data sets to the Spark engine. The majority of corporate data still resides or originates on the mainframe, and now customers can populate a Hadoop-based enterprise data hub with previously “locked away” mainframe data. In addition to supporting traditional databases, JSON, weblogs, NoSQL data stores, and other sources, Syncsort’s Hadoop product line has always provided the most comprehensive access to mainframe data sets. We’re now connecting Apache Spark with “Big Iron.”

The Syncsort mainframe connector for Apache Spark is similar to the Apache Sqoop mainframe connector that Syncsort released as open source last year. Customers simply specify the location of multiple data sets and the associated COBOL copybook metadata, and the Syncsort Spark mainframe connector then automatically transfers the data sets in parallel via a secure connection. All mainframe record formats, including fixed, variable, sequential, and VSAM files, are supported. The connector also handles compressed data transfer, minimizing network bandwidth and optimizing overall elapsed time.
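For readers unfamiliar with why copybook metadata matters, the sketch below shows what it takes to decode even a simple fixed-length mainframe record by hand. It is not the Syncsort connector and does not cover variable-length or VSAM data. The copybook, the field names, and the choice of EBCDIC code page 037 are all assumptions made for the example.

```python
from decimal import Decimal

# Hypothetical copybook this layout is derived from:
#   01 CUSTOMER-REC.
#      05 CUST-ID    PIC 9(5).
#      05 CUST-NAME  PIC X(10).
#      05 BALANCE    PIC S9(5)V99 COMP-3.
# Each entry: (field name, length in bytes, storage kind, implied decimals)
LAYOUT = [
    ("cust_id", 5, "zoned", 0),
    ("cust_name", 10, "text", 0),
    ("balance", 4, "packed", 2),
]
RECORD_LENGTH = sum(length for _, length, _, _ in LAYOUT)
CODE_PAGE = "cp037"  # assumed EBCDIC code page


def unpack_comp3(raw: bytes, scale: int) -> Decimal:
    """Decode packed decimal (COMP-3): two digits per byte, with the
    final nibble holding the sign (0xD means negative)."""
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign = nibbles.pop()
    value = int("".join(str(digit) for digit in nibbles))
    if sign == 0x0D:
        value = -value
    return Decimal(value).scaleb(-scale)


def decode_record(raw: bytes) -> dict:
    """Slice one fixed-length record into fields according to LAYOUT."""
    if len(raw) != RECORD_LENGTH:
        raise ValueError(f"expected {RECORD_LENGTH} bytes, got {len(raw)}")
    record, offset = {}, 0
    for name, length, kind, scale in LAYOUT:
        chunk = raw[offset:offset + length]
        if kind == "text":
            record[name] = chunk.decode(CODE_PAGE).rstrip()
        elif kind == "zoned":
            # Unsigned display digits decode directly as EBCDIC characters.
            record[name] = int(chunk.decode(CODE_PAGE))
        else:
            record[name] = unpack_comp3(chunk, scale)
        offset += length
    return record


def decode_file(blob: bytes) -> list:
    """Decode a buffer of back-to-back fixed-length records."""
    return [decode_record(blob[start:start + RECORD_LENGTH])
            for start in range(0, len(blob), RECORD_LENGTH)]


if __name__ == "__main__":
    sample = ("00042".encode(CODE_PAGE) + "ACME CORP ".encode(CODE_PAGE)
              + bytes([0x12, 0x34, 0x56, 0x7C]))
    print(decode_file(sample))
```

Without the copybook there is no way to tell from the bytes alone where one field ends, which bytes are text, and which are packed numbers, which is why the connector asks for that metadata alongside the data set location.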

We are pleased to release these innovations to enhance Apache Spark’s value proposition, and delighted that our Hadoop product suite is part of Cloudera’s Accelerator Program for Spark.

The post Hadoop MapReduce or Spark: What if you don’t have to decide now? appeared first on Cloudera VISION.
