Data Governance in Hadoop – Part 1
Data governance isn’t a new discipline: it’s a critical part of any enterprise application that deals with sensitive or confidential data. If an application needs to keep track of who’s accessing something or what they’re doing with it, that application deals with data governance.
In this post, we’ll look at the core elements of enterprise data governance and why data governance is so challenging in Apache Hadoop. In a future post, we’ll look at how Cloudera Navigator, as part of Cloudera’s enterprise data hub, addresses these challenges to bring comprehensive governance to Hadoop.
Core Elements of Data Governance
Data governance is important to every organization that wants to protect its data assets, and it is especially important for regulated industries, such as financial services, healthcare, and pharmaceuticals, where federal laws require that certain types of data be governed and protected.
Common data governance features include:
- Auditing: keeping a detailed, tamper-proof record of every access attempt – by user, IP address, resource name, etc.
- Lineage: tracking a data set’s origins and how it’s used over time
- Metadata management: discovering and retrieving relevant data sets. Metadata can be technical (e.g., table name, file name, directory name, owner, permissions, date created, etc.) or business (e.g., keywords, tags, sensitivity level, retention policy, etc.)
- Data lifecycle management and policy enforcement: managing data from the point of ingest to retirement – including retention, archiving, replication, compression, and backup
- Data stewardship and curation: ensuring that data is properly catalogued, of high quality, and accessible to all of the appropriate users
Data Governance Challenges in Hadoop
Compared to traditional enterprise applications, data governance in Hadoop presents a number of significant challenges. Let’s take a look at why this is the case.
Hadoop stores a lot of data, and a lot of different types of data
Hadoop stores all types of data – from structured (database tables) to semi-structured (e.g. log files) and even unstructured (e.g. digitized customer calls or scanned files). Of course, sensitive data can be located in any of these files, yet it’s much harder to know whether sensitive data is located in a PDF or an MP3, let alone whether a malicious user’s MapReduce job secretly extracts credit card numbers from these files.
But data in Hadoop isn’t just diverse; it’s plentiful. With growing trends around the Internet of Things and sensor data, data is arriving in unprecedented volumes and at much faster rates. This makes tracking sensitive data far more challenging.
Lots of users access production Hadoop clusters, and each user can access Hadoop in a different way
In its early days, Hadoop access was limited to those few data scientists at an organization who understood MapReduce and knew how to write code in Java or Python. Any security breach would be limited to a very small set of potential users who knew how to access the cluster.
These days, hundreds of business analysts and data scientists at each organization perform interactive analyses on Hadoop using familiar desktop tools, such as Tableau, QlikView, and Microsoft Excel. In addition to all these interactive users, there are command-line users, Hue users, batch ETL jobs, and even general users who are only leveraging full-text search.
What if there’s a lone, malicious interactive query, batch job, command line access, or browser access that wrongfully extracts sensitive data?
Traditional tenets of governance remain the same, but must be addressed at Hadoop scale
Auditing. Auditing is tricky in the Hadoop ecosystem because each component – Apache Hive, Impala, HDFS, MapReduce, Apache Sqoop, Apache Spark, etc. – maintains its own audit log. If a user account were compromised, you would need to sift through all these disparate audit logs to piece together what took place. And since each audit log has its own retention policy, configuration, and format, it’s possible that the audit data won’t even be available when you need it most.
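To make the problem concrete, here is a minimal Python sketch of the manual work involved: normalizing records from two audit logs into one schema and building a single timeline for one user. The two log formats, the field names, and the AuditEvent type are all hypothetical and invented for illustration; real Hadoop components each write their own formats.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    """One normalized audit record (hypothetical common schema)."""
    timestamp: datetime
    service: str
    user: str
    operation: str
    resource: str


def parse_pipe_line(line: str) -> AuditEvent:
    # Hypothetical storage-layer format: "2015-03-01T10:15:00Z|alice|open|/data/cards.csv"
    ts, user, operation, resource = line.strip().split("|")
    when = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return AuditEvent(when, "storage", user, operation, resource)


def parse_json_line(line: str) -> AuditEvent:
    # Hypothetical SQL-engine format: one JSON object per line, epoch seconds in "time"
    rec = json.loads(line)
    when = datetime.fromtimestamp(rec["time"], tz=timezone.utc)
    return AuditEvent(when, "sql", rec["who"], rec["action"].lower(), rec["object"])


def timeline_for_user(user: str, events: list) -> list:
    """Return every event for one user, oldest first, across all services."""
    return sorted((e for e in events if e.user == user), key=lambda e: e.timestamp)


# Example input: three records from two differently formatted logs.
events = [
    parse_json_line('{"time": 1425204960, "who": "alice", "action": "QUERY", "object": "sales.cards"}'),
    parse_pipe_line("2015-03-01T10:15:00Z|alice|open|/data/cards.csv"),
    parse_pipe_line("2015-03-01T10:20:00Z|bob|open|/data/public.csv"),
]
alice_timeline = timeline_for_user("alice", events)
```

Every additional component means another parser like these, and another retention policy to track before the merged timeline can be trusted.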
Lineage. Especially in regulated sectors, it’s crucial to maintain a paper trail of the decision-making that took place around data in Hadoop. Whether you need to explain how clinical trials indicated that a new drug is effective, justify a number printed in your earnings report, or understand how sensitive data is used throughout the organization, you need lineage.
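One common way to think about lineage is as a graph in which each data set points to the data sets it was derived from. The sketch below assumes such a graph is already available as a plain dictionary (the data set names are made up) and walks it to find everything a given result depends on. It illustrates the idea only and does not describe how any particular tool records lineage.

```python
from collections import deque


def upstream(lineage: dict, dataset: str) -> set:
    """Return every data set that `dataset` was derived from, directly or indirectly.

    `lineage` maps a data set name to the names of its direct inputs.
    """
    seen = set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for parent in lineage.get(current, ()):
            if parent not in seen:  # guard against revisiting shared or cyclic inputs
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical lineage for a number printed in an earnings report.
lineage = {
    "earnings_report": ["quarterly_totals"],
    "quarterly_totals": ["raw_sales", "fx_rates"],
}
report_sources = upstream(lineage, "earnings_report")
```

The hard part in practice is not the traversal but capturing the edges in the first place, since they are created by many different jobs and tools.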
Metadata Management. As your data hub grows in size, you need to keep track of what data is stored in it. You must be able to answer questions such as:
- What files does a particular user own?
- Are any files in my enterprise data hub older than seven years?
- Do any files have incorrect permissions?
- Which tables contain sensitive information?
- Where can I find the Q4-2009 sales transactions?
- What files are associated with high-value customers?
Metadata – both business and technical – is the key to answering these questions.
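As a rough illustration, the sketch below shows how a few of these questions become one-line queries once technical and business metadata sit side by side in a single catalog. The CatalogEntry type, its fields, and the sample entries are hypothetical; no such unified catalog exists out of the box in the components discussed here.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CatalogEntry:
    path: str
    owner: str                               # technical metadata
    created: date                            # technical metadata
    tags: set = field(default_factory=set)   # business metadata


def years_before(day: date, years: int) -> date:
    """Shift a date back by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        return day.replace(year=day.year - years, day=28)


def owned_by(catalog, user):
    return [e.path for e in catalog if e.owner == user]


def older_than(catalog, years, today):
    cutoff = years_before(today, years)
    return [e.path for e in catalog if e.created < cutoff]


def tagged(catalog, tag):
    return [e.path for e in catalog if tag in e.tags]


# Hypothetical catalog contents.
catalog = [
    CatalogEntry("/sales/2009/q4.csv", "alice", date(2010, 1, 5), {"sales", "Q4-2009"}),
    CatalogEntry("/crm/customers.parquet", "bob", date(2014, 7, 1), {"sensitive", "high-value"}),
    CatalogEntry("/logs/2008/web.log", "alice", date(2008, 6, 1)),
]
```

With this in place, owned_by(catalog, "alice") answers the first question in the list and older_than(catalog, 7, date.today()) answers the second.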
Since Hadoop consists of many different components, there are also many technical metadata repositories: Hive metastore, HCatalog, HDFS, etc. There’s even technical metadata for workflows and job executions, such as from Apache Oozie, Sqoop, MapReduce, and Spark. However, each technical metadata repository has its own search interface, meaning you have to search each one individually in order to piece together the answers to the above questions.
Additionally, with all these disparate technical metadata repositories, there’s no clear place to store all the business metadata being generated in these repositories.
Data Lifecycle Management, Policy Enforcement, Stewardship, and Curation. The only way to effectively manage data governance is to rely on a rich, unified business and technical metadata foundation. After all, how can you ensure the encryption of all sensitive files if you cannot even tag those files as sensitive in the first place?
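The dependency on metadata is easy to see in a small sketch. The policy check below can only find unencrypted sensitive files because each record carries a business tag alongside its technical attributes. The FileRecord type, the "sensitive" tag name, and the sample paths are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    path: str
    encrypted: bool                 # technical metadata
    tags: frozenset = frozenset()   # business metadata


def unencrypted_sensitive(files, sensitive_tag="sensitive"):
    """Return paths that violate the policy 'all sensitive files are encrypted'."""
    return [f.path for f in files if sensitive_tag in f.tags and not f.encrypted]


# Hypothetical inventory; the last file is sensitive in reality but was never tagged.
files = [
    FileRecord("/crm/cards.csv", encrypted=False, tags=frozenset({"sensitive"})),
    FileRecord("/crm/cards_enc.csv", encrypted=True, tags=frozenset({"sensitive"})),
    FileRecord("/calls/recording.mp3", encrypted=False),
]
violations = unencrypted_sensitive(files)
```

Note that the untagged recording passes the check unnoticed, which is exactly the gap that missing business metadata leaves in policy enforcement.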
The ability for more users to access more data, with faster time-to-value, is one of the core benefits of Hadoop. However, the flip side of these benefits is that it’s now harder to keep track of who’s accessing data and what they’re doing with it. Cloudera recognized this early on and developed Cloudera Navigator to ensure users get the best of both worlds – the flexible storage and access of Hadoop with the governance necessary for even the most stringent compliance regulations.
In the next post, we’ll look at how Cloudera Navigator specifically addresses these governance challenges at Hadoop scale.