Bringing Data Discovery and Analytics to Apache Hadoop: The Power of Impala

For those of you who attended Strata + Hadoop World in New York a few weeks ago, you heard Mike Olson talk about how much Apache Hadoop has changed over the years and evolved into a platform for next-generation analytics. One critical step in this evolution was opening up the data stored in Hadoop to business analysts, enabling interactive business intelligence (BI) and data discovery. This was achieved with the introduction of Impala, the massively parallel processing (MPP) database natively built into Hadoop. Impala was designed to address the needs of end-user BI and traditional SQL-based analytics through high performance and multi-user concurrency.

Since development began nearly five years ago, Impala has emerged as the leading analytic database for Hadoop. As an open source component, it has been downloaded millions of times. A majority of Cloudera’s customers use Impala today, and we see it as the most popular addition to Cloudera’s enterprise Hadoop platform. Below is a look at some common use cases we see across our customer base.

Companies such as Quaero, which depend on digital data to provide customer insights and drive revenue for their clients, use Impala to open up petabytes of data to their analysts for data exploration and data modeling. By offloading workloads from IBM Netezza to Cloudera Enterprise and Impala, Quaero can now run queries it previously couldn’t and build data models on larger data sets for better accuracy.

“Previously, we had to aggregate the data before we could run queries. With Impala, we can run queries against very low level data. A lot of analytic model building traditionally involves sampling. With the Cloudera platform, we can take the analytics to the data and run, build, and execute models on the full data set.” – Dan Smith, Executive Vice President Product Development, Quaero

This capability has resulted in real business impact: dramatic increases in registered users, content subscriptions, and campaign response rate, and even a doubling of the CPM (cost per thousand impressions) of its ad packages. As Naras Eechambadi, CEO at Quaero, stated, “By applying data and analytics to every interaction, you can maximize profitability and optimize the consumer experience.”

Epsilon was also looking to help its clients gain 360-degree views of customers so they could deliver the right message to the right audience, at the right time, and on the right device or channel. Its marketing solution would need to ingest, process, and act upon massive quantities of data, much of it unstructured text-based messaging. Not only did Epsilon embrace Hadoop and Cloudera Enterprise for this, but it was also quick to realize the importance of Impala as part of the solution.

“As much as we were impressed by Hadoop, we viewed Impala as the game changer for us. It runs natively in Hadoop and enables us to give our clients the ability to expediently segment their own campaign lists using all the available data. Being able to perform complex partitioning allows for far greater granularity and personalization, and Impala’s open source, interactive SQL capabilities enable clients to greatly enhance the effectiveness of each of their campaigns.” – Bob Zurek, Senior Vice President, Products, Epsilon

Learn more about Epsilon’s experience using Hadoop for data-driven marketing first-hand from the webinar, “Big Data in Marketing

In addition to these use cases, customers across many different industries are embracing Impala to deliver the fastest time to insight:

National Children’s Hospital: “Impala is incredibly fast,” said the organization’s BI manager. “When I do an advanced query, I don’t have to wait for 30 minutes to see if my numbers look right. When viewed through Hue, Impala is a fantastic tool.”


Zoosk: “Even as a programmer who can write Java MapReduce, or Python Hive streaming code, it’s usually magnitudes faster to use Impala to get my results. Furthermore, all existing data analysts who are SQL trained can now actively query the Hadoop cluster and gain insights from it, instead of having to learn Java or Python. And this speaks loudly on how quickly Hadoop is maturing as an analytical platform for the masses,” said Martin Lam, Senior Director Analytics and Data Science.

Even Cloudera’s internal support team uses Impala to provide world-class support to our customers and reduce the time-to-resolution by 35%. “Impala can rapidly query large data sets in specific time ranges, and do things like filtering and search matches over the data set. We have started serving Impala data through Tableau to build dashboards and answer questions from the business about customers and how they’re using Cloudera’s technology,” explained Krista Mizusaki, program manager for the customer operations tools team.

Part of Impala’s popularity and adoption is due to its focus on compatibility. Its ANSI SQL support makes it easy to run new and existing workloads using familiar SQL skills, and it integrates with all the leading BI tools that enterprises already rely on. Through the Impala Accelerator Partner Program, Cloudera works closely with the vendors of these third-party tools to develop and certify their integrations and applications on Impala, ensuring the latest innovations are available to customers. Below are just a few of the Impala Accelerator Partners:

Arcadia Data: “Impala represents the most logical and progressive step forward for better leveraging Hadoop. Cutting edge database techniques once hidden in the silos of academia or dark corners of proprietary code are finally out in the open thanks to Impala,” said Shant Hovsepian, CTO of Arcadia Data, provider of Unified BI and Analytics on Hadoop. “True disruption happens when you don’t have to choose between big and fast. The impact we’ve seen Impala have on transforming the payoff of Big Data is nothing short of that.”


AtScale: “Impala has proven to be an excellent choice for supporting analytical SQL queries on large scale data sets,” said Josh Klahr, VP of Product Management at AtScale. “We have a number of customers that have chosen Impala as their SQL-on-Hadoop engine along with AtScale’s Dynamic Cube solution. Combined with Impala, AtScale delivers interactive performance on a 50 billion row data set supporting multiple concurrent end-users.”


Pentaho: “Pentaho and Cloudera share a common history and approach to simplifying complex and powerful technologies to integrate and analyze big data. Our common open source heritage means that we can innovate at the speed of our customers’ businesses,” said Eddie White, EVP Business Development, Pentaho, a Hitachi Group Company. “We have been collaborating very closely to deliver impactful analytics capabilities with Impala and the Enterprise Data Hub. With Pentaho and Cloudera you can quickly analyze large volumes of disparate data in a governed fashion, delivering trusted analytics.”


Qlik: “For Qlik customers, Impala represents a great way to interface directly with Hadoop from Qlik applications while leveraging their existing SQL skills,” said Mike Foster, vice president of strategic partners at Qlik. “Qlik will continue to make investments in providing a better way to leverage Impala that allows our customers to continue developing best-in-industry Big Data solutions on top of Hadoop.”


SAS: “SAS and Cloudera jointly developed SAS’ connector to Impala (SAS/Access to Impala). The SAS/Access to Impala opens up data stored in Impala to all of the SAS client software including Enterprise Guide and Enterprise Miner. SAS customers regularly report seeing performance increases of 10-100x over other SQL engines on Hadoop. For SAS customers, Impala’s performance and SQL capabilities is a game changer,” said Mike Ames, Sr. Director, Data Science and Emerging Technologies, SAS Institute.


Tableau: “Tableau users want to see and understand all of their data and are excited about the enterprise-grade levels of performance and stability commonly associated with more traditional relational databases that Cloudera is now delivering to Hadoop. Impala is the engine our customers most often turn to when running live queries in Tableau against Hadoop and really opens up the world of ‘Big Data’ to our customers,” said Dan Kogan, Director of Product Marketing.


Zoomdata: “The combination of Impala and Zoomdata enables interactive data discovery for business users, but at big data scale and with sub-second response time,” said Ruhollah Farchtchi, VP of Zoomdata Labs. “With the data volumes that Impala can handle and fast visual analytic techniques like data sharpening from Zoomdata, even working with billions of rows of data is fast and interactive. At Zoomdata, we’re excited to expand our Impala support with Kudu integration, to enable both real-time monitoring and historical analytics on the same dataset.”

Impala has come a long way since its initial release, which unlocked SQL analysis and BI on Hadoop. Impala 2.0 in particular went beyond the core ANSI SQL-92 functionality to add analytic SQL capabilities (including ANSI SQL:2003 analytic window functions, correlated subqueries, and more). These capabilities further accelerated Impala’s adoption in enterprises by enabling even more use cases, more users, and larger data sets. Since Impala 2.0, the Impala team has continued to focus on even greater reliability and usability at even greater concurrency and scale. Impala now has many customers in the million-query club, clusters ranging from tens to hundreds of nodes, and customers that are pushing the concurrency envelope to 1,000+ users.
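To make the analytic SQL features mentioned above a little more concrete, here is a minimal sketch of a window function and a correlated subquery. It does not run against Impala: as an assumption made purely to keep the example self-contained, it uses Python's built-in sqlite3 module (SQLite 3.25 or later, which supports window functions) as a stand-in engine. The `sales` table, its columns, and the `run_examples` helper are hypothetical names invented for illustration.

```python
import sqlite3


def run_examples():
    """Run one window-function query and one correlated subquery.

    SQLite is only a stand-in here; the table and data are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER);
        INSERT INTO sales VALUES
            ('east', 'ann', 100),
            ('east', 'bob', 300),
            ('west', 'cat', 200),
            ('west', 'dan', 50);
        """
    )

    # Analytic window function: rank each rep within their own region
    # by amount, without collapsing rows the way GROUP BY would.
    ranked = conn.execute(
        """
        SELECT region, rep,
               RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
        FROM sales
        ORDER BY region, rnk
        """
    ).fetchall()

    # Correlated subquery: the inner query refers to the outer row's
    # region, so each rep is compared with their own region's average.
    above_avg = conn.execute(
        """
        SELECT rep
        FROM sales AS s
        WHERE amount > (SELECT AVG(amount)
                        FROM sales AS t
                        WHERE t.region = s.region)
        ORDER BY rep
        """
    ).fetchall()

    conn.close()
    return ranked, above_avg


if __name__ == "__main__":
    ranked_rows, above_avg_rows = run_examples()
    print(ranked_rows)
    print(above_avg_rows)
```

Both queries answer per-group questions (rank within a region, comparison with a region's average) in a single SQL statement, which is the kind of work BI tools routinely generate.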

Even with these achievements, there is still a lot of work to be done for Impala to reach its full potential. Two of the most notable enhancements, both of which will dramatically expand the use cases Impala can serve, are the integration with the new Kudu storage engine and the order-of-magnitude performance investments underway (more than 20x the performance of Impala today). Together with Kudu, the Impala community is working to enable new workloads where you can query fast data that is changing in real time, with support for direct inserts, updates, and deletes. As Impala’s user base and data volumes continue to grow rapidly, the Impala community is investing to achieve low-latency responses with even greater user and data volumes on the same hardware. More details on the roadmap can be found in this Cloudera Engineering post, “What’s Next for Impala.”

The post Bringing Data Discovery and Analytics to Apache Hadoop: The Power of Impala appeared first on Cloudera VISION.
