Big Data Live with Impala & Tableau

This post was penned by Jeff Feng (@jtfeng), Product Manager at Tableau Software

Tableau is honored to be a part of the Cloudera Accelerator Program and to deepen our partnership with Cloudera by further advancing the integrations between our technologies. Our partnership is founded on the mutual desire to put the power of data into the hands of every user in an organization so that they can reach actionable insight faster. Impala is critical to this endeavor: because it supports live, interactive queries, someone using Tableau can have a visual dialogue with the data stored in their Hadoop cluster.

Tableau+Impala – Better Together

Tableau is on a mission to help users see and understand their data.  Our core belief is that “the people who know the data should be empowered to ask questions of the data.”  Our visual analytics software allows users to analyze and discover insights about their data faster than they ever could before by making it easy to move between any task in the cycle of visual analysis.


In the cycle of visual analysis, users start off with a pre-determined task or goal. They then go through an iterative process: get the data, choose a visual structure (chart/graph), view the data, formulate insight, and then act. In reality this process is rarely as sequential as described, and users often jump between steps as new insights (and new questions) are revealed. Tableau supports this tightly woven loop of asking questions, finding answers, and asking follow-on questions by drawing on a person’s natural ability to identify patterns and by translating the user’s actions into optimized queries.
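To make the idea of translating actions into queries concrete, here is a minimal sketch in Python. It is not Tableau's query generator; the `VisualSpec` class, the `build_query` function, and the `orders` table with its `region`, `sales`, and `year` columns are all hypothetical names invented for illustration. It only shows how a description of what a user placed on a chart could map to a single aggregate SQL statement.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class VisualSpec:
    """Hypothetical description of what a user dragged onto a chart."""
    table: str
    dimensions: list[str]                 # fields to group by, e.g. ["region"]
    measures: dict[str, str]              # field -> aggregate, e.g. {"sales": "SUM"}
    filters: dict[str, str] = field(default_factory=dict)  # field -> literal value


def build_query(spec: VisualSpec) -> str:
    """Turn a chart description into one aggregate SQL statement."""
    select_parts = list(spec.dimensions)
    select_parts += [
        f"{agg}({col}) AS {agg.lower()}_{col}" for col, agg in spec.measures.items()
    ]
    sql = f"SELECT {', '.join(select_parts)} FROM {spec.table}"
    if spec.filters:
        # Values are inlined for readability only; real code should bind parameters.
        conditions = [f"{col} = '{val}'" for col, val in spec.filters.items()]
        sql += " WHERE " + " AND ".join(conditions)
    if spec.dimensions:
        sql += " GROUP BY " + ", ".join(spec.dimensions)
    return sql


# Assumed example: total sales per region, filtered to one year.
spec = VisualSpec(
    table="orders",
    dimensions=["region"],
    measures={"sales": "SUM"},
    filters={"year": "2014"},
)
print(build_query(spec))
```

Changing the chart (adding a dimension, swapping the aggregate, adjusting a filter) changes only the specification, and a new query is issued for each step in the cycle.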

Using Tableau and Impala together makes it possible to perform ad-hoc visual analytics on Hadoop at massive scale. Impala is an open source MPP analytic database for Apache Hadoop, and the standard that other data processing engines measure themselves against. According to recent benchmarks [1][2], it is also the fastest data processing engine currently available. Impala achieves these speeds by bypassing the overhead incurred in translating SQL queries into MapReduce jobs and by using the processing power of many CPUs in parallel, each with its own dedicated memory.
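The parallel pattern described above can be illustrated with a toy sketch in Python. This is not how Impala is implemented; the function names and sample data are assumptions made for illustration. Each worker aggregates its own partition of the data independently, and the partial results are then merged into a final answer. Python threads are used here only to show the structure; they do not provide the CPU-level parallelism of a real MPP engine.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partial_sum(rows):
    """Aggregate one partition locally: total amount per key."""
    totals = Counter()
    for key, amount in rows:
        totals[key] += amount
    return totals


def parallel_group_sum(partitions, workers=4):
    """Aggregate every partition concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, partitions))
    merged = Counter()
    for part in partials:
        merged.update(part)  # Counter.update adds values rather than replacing them
    return dict(merged)


# Assumed sample data: (region, sales) rows spread across three partitions.
partitions = [
    [("east", 10), ("west", 5)],
    [("east", 7), ("north", 3)],
    [("west", 1)],
]
result = parallel_group_sum(partitions)
print(result)
```

Because each partition is summarized where it lives, only the small partial results need to be combined at the end, which is the property that lets this pattern scale with the number of workers.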

We are in an age where people can analyze millions or even billions of rows of data at their fingertips, yet users expect near-instantaneous results (see study on the 2 second rule for information retrieval). When the response to a user’s interaction takes more than 2-3 seconds, the user becomes distracted and is no longer “in the flow of visual analysis.” Thus, Impala’s query speed is critical to the user experience as users seek to gain more and more insight from their Hadoop deployments.

Additionally, independent, self-service visual analytics is only possible at an organizational level when issues like authentication and data access are addressed. This is why Tableau and Cloudera’s joint efforts in supporting Kerberos and Apache Sentry are also crucially important. As of our recent 8.3 release, we offer single sign-on and delegated access support with Kerberos when connecting to Impala. This extends the previous support for native Active Directory, SAML, and Tableau’s built-in authentication system. For users, this means a more seamless experience, because users who are signed in to their local machine will not need to sign in again to either Tableau Server or any Impala live data sources. For IT administrators, Tableau’s compatibility with Sentry ensures that sensitive data is protected, as users will only see the data that they are authorized to see. By enabling user delegation for Impala, we ensure that users can connect to Impala as a live data source through stable, automated back-end authentication.

When to use Tableau with Impala for Big Data applications?

Using Tableau with Impala for performing visual analytics on Big Data makes sense for many applications:

  • Self-service data access & visualization: Tableau enables business users to access and visualize their data without writing SQL or complex MapReduce jobs.
  • Data blending across data sources: Tableau allows you to blend data from Impala with data from other sources. Combined with our hybrid data architecture, this allows organizations to keep their data assets where they reside.
  • Mixed database performance workloads: Tableau has a hybrid data architecture which allows users to connect either live directly to the database or via an extract into our in-memory data engine.  Connecting live works great when you are connecting to a data processing engine like Impala that can handle massive datasets. In-memory extracts provide an extra dimension of flexibility to accelerate smaller datasets or connections to databases that have slow query execution.

Looking Forward

Tableau expands the use of Hadoop to all business users. The combination of Tableau and Impala opens the door to interactive analytics, so that users can focus on asking questions of their data and discovering new insights. Personally, I am excited about the many customer use cases of Impala+Tableau as well as the potential to provide near-instantaneous analytics on billions of rows of data. If you would like to get “hands-on” and try Impala and Tableau together, I would strongly recommend trying out Cloudera Live. Within minutes, you’ll be able to spin up a CDH cluster along with an integrated trial of Tableau plus a pre-loaded tutorial.

Read more about the Cloudera and Tableau Solution.



The post Big Data Live with Impala & Tableau appeared first on Cloudera VISION.
