HDFS Data Encryption at Rest on Cloudera Data Platform

Introduction:

Encryption of data at rest is a highly desirable, and sometimes mandatory, requirement for data platforms in a range of industry verticals, including healthcare, financial services, and government. The capability increases security and protects sensitive data from attacks that originate inside or outside the platform.

Access to HDFS data can be managed by Apache Ranger HDFS policies, and audit trails help administrators monitor the activity. However, any user with HDFS admin or root access on cluster nodes would be able to impersonate the “hdfs” user and access sensitive data in clear text. HDFS Encryption prevents this clear text access. This bolsters data security and data privacy by protecting sensitive and personal data that, if exposed in an accidental or malicious breach, would harm both the individuals concerned (customers, employees, partners) and the organization as a whole.

HDFS Encryption delivers transparent end-to-end encryption of data at rest and is an integral part of HDFS. End to end encryption means that the data is only encrypted and decrypted by the client. In other words, data remains encrypted until it reaches the HDFS client.

Each HDFS file is encrypted using an encryption key. To prevent the management of these keys (which can run in the millions) from becoming a performance bottleneck, the encryption key itself is stored in the file metadata. To add another layer of security, the file encryption key is stored in encrypted form, using another “encryption zone key”.

Configuring this feature is relatively straightforward. It protects the data by controlling the decrypt access to HDFS data with key management policies handled by Ranger.

HDFS Native encryption works in combination with solutions such as Protegrity Tokenization where encrypted data in HDFS can be tokenized and detokenized based on the policies defined by the Protegrity ESA server. What’s more, Ranger offers dynamic column masking features that include redacting, hashing, and masking data that can be applied on top of data that is already encrypted at rest on HDFS for an additional layer of security.

HDFS encryption, combined with the column masking features of Ranger and/or Protegrity, forms a complete solution where data is fully protected at rest and over the network, and where clear text access is managed by authorization policies.

Encryption & Decryption Flow:

The way HDFS encrypts data is explained very well in Cloudera documentation and many articles. However, I am going to go over the basic flow here:

Encryption:

  • An HDFS encryption zone encryption key (EZK) needs to be created to encrypt files in HDFS
  • An HDFS encryption zone needs to be created; this is an empty HDFS folder, associated with an EZK
  • For every file created or copied into HDFS encryption zone, a data encryption key (DEK) is created 
  • Data in the file is encrypted with DEK
  • DEK is encrypted using EZK to give rise to encrypted data encryption key (EDEK)
  • Each file will have an EDEK which is stored in the file’s metadata

Decryption:

  • An attempt to access an encrypted file requires the user to have “DECRYPT” access on the corresponding EZK
  • “hdfs dfs -cat” on the file triggers a Hadoop KMS API call to validate the “DECRYPT” access
  • If the user has access on the EZK, the EDEK of the file is decrypted using the EZK
  • The DEK is then used to decrypt the contents of the file, which are displayed to the user
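
The per-file metadata described above can be inspected from the command line. The following is a minimal sketch, assuming a Hadoop 3 based cluster where the `hdfs crypto -getFileEncryptionInfo` subcommand is available; the helper name `show_encryption_info` is illustrative and not part of the original setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper (the name is illustrative): print the encryption
# metadata HDFS keeps for one file, such as the key name and the EDEK.
# Assumes the `hdfs` CLI is on PATH and the caller holds a valid Kerberos ticket.
show_encryption_info() {
    local path="$1"
    if [ -z "$path" ]; then
        echo "usage: show_encryption_info <hdfs-file-path>" >&2
        return 1
    fi
    hdfs crypto -getFileEncryptionInfo -path "$path"
}

# Example call (use a file inside one of your own encryption zones):
# show_encryption_info /user/anatva/protected/passwd
```

For a file outside any encryption zone, the command is expected to report that no encryption information is available, which makes it a quick way to check whether a given file is actually encrypted.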

The following diagram shows how the HDFS client invokes Key provider API to decrypt the EDEK and gain access to the file contents:

HDFS client invokes Key provider API

The following diagram shows how the EZK, DEK, EDEK are related to each other:

EZK, DEK, EDEK relationship

Installation Options:

HDFS Native Encryption relies completely on the Ranger KMS service, which is central to creating encryption zone keys and to creating the authorization policies that grant ENCRYPT and DECRYPT access on those keys.

Because of its crucial role, installing and configuring Ranger KMS is the first step to enable HDFS Native Encryption. 

Ranger KMS needs a backend infrastructure to store & retrieve encryption zone keys. Cloudera Manager offers two different options to install & operate Ranger KMS:

  1. Ranger KMS backed by an RDBMS
  2. Ranger KMS backed by Key Trustee Server

Cloudera Manager makes it straightforward to configure either option.

Enable HDFS Data at Rest Encryption:

On the Cloudera Manager (CM) UI, click on “Clusters” and click on the cluster name (in our case, the cluster name is “mycdp”).

Click on the “Actions” drop down, and click on “Set up HDFS Data at Rest Encryption” as shown below:

Set up HDFS Data at Rest Encryption

The operation “Setup HDFS Data At Rest Encryption” in the Cloudera Manager UI will prompt you to pick one of three choices:

(1) Ranger KMS with RDBMS 

(2) Ranger KMS with KTS 

(3) File based Keystore

The following screenshot shows all the three options mentioned above:

Assuming the option “Ranger Key Management Service backed by Key Trustee Server” is chosen in the above screenshot, Cloudera Manager lists a few prerequisites to be completed, along with choices on how the KTS infrastructure can be stood up.

The below screenshot indicates these strong recommendations to be implemented before enabling HDFS Data at Rest Encryption:

  • Enable Kerberos security
  • Enable TLS / SSL

The following are the two choices for how the KTS infrastructure can be set up:

  • Add a dedicated cluster for KTS (this helps manage the KTS infrastructure outside the cluster, and it is a best practice as well)
  • Install KTS using parcels (this requires the parcels to be downloaded from archive.cloudera.com and configured in CM)

Once KTS is in place with one of the above two choices:

  • Add KTS as a service by selecting the “Add service” option on the Cloudera Manager UI
  • Add the Ranger KMS with Key Trustee Server service by selecting “Add service” on the Cloudera Manager UI

In this document, the option of “Installing KTS as a service inside the cluster” is chosen, since additional nodes to create a dedicated cluster of KTS servers are not available in our demo system.

Parcels Configuration for KTS:

Download the parcels for KTS as they are not part of the CDP parcels.

$ wget https://username:password@archive.cloudera.com/p/keytrusteeserver7/7.1.1.0/parcels/KEYTRUSTEE_SERVER-7.1.1.0-1.keytrustee7.1.1.0.p0.3050880-el7.parcel

$ wget https://username:password@archive.cloudera.com/p/keytrusteeserver7/7.1.1.0/parcels/KEYTRUSTEE_SERVER-7.1.1.0-1.keytrustee7.1.1.0.p0.3050880-el7.parcel.sha

Copy the parcel files into /opt/cloudera/parcel-repo folder on the Cloudera Manager server.

Once the files are copied, change the ownership to the user “cloudera-scm”.
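
The copy and ownership steps can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name `stage_parcel` is hypothetical, the default repository path and owner are the ones named in this article, and the function is meant to be run as root on the Cloudera Manager server.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy a parcel and its .sha file into the parcel
# repository and hand ownership to the Cloudera Manager service user.
# Defaults follow the article; pass other values as arguments 2 and 3.
stage_parcel() {
    local parcel="$1"
    local repo="${2:-/opt/cloudera/parcel-repo}"
    local owner="${3:-cloudera-scm}"
    local f

    # The article downloads both the parcel and its .sha file, so expect both.
    for f in "$parcel" "$parcel.sha"; do
        [ -f "$f" ] || { echo "missing file: $f" >&2; return 1; }
    done

    cp "$parcel" "$parcel.sha" "$repo"/ || return 1
    chown "$owner" "$repo/$(basename "$parcel")" "$repo/$(basename "$parcel").sha"
}

# Example call with the parcel downloaded above:
# stage_parcel KEYTRUSTEE_SERVER-7.1.1.0-1.keytrustee7.1.1.0.p0.3050880-el7.parcel
```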

Now, in CM, click on Parcels, and click on “Check for New Parcels”. You should see a new parcel, “KEYTRUSTEE_SERVER”.

Check for New Parcels

Distribute and activate the KEYTRUSTEE_SERVER parcel. Once it is activated, you will see the status as “Distributed, Activated” on the parcels page:

Distribute and activate the KEYTRUSTEE_SERVER parcel

Installation & Configuration:

Now, select the “Add Service” option in Cloudera Manager, and select KeyTrustee Server.  Select hosts for Active and Passive KTS servers.

Check entropy using the command:

$ cat /proc/sys/kernel/random/entropy_avail

Entropy should be greater than 500; if not, we need to install other software packages to increase entropy levels.

Setup Entropy

If the available entropy is low, you must increase it. Otherwise, subsequent cryptographic operations can take a long time.

To determine the amount of available entropy on the target machines, run these commands:

ssh root@ccycloud-4.cdpvcb.root.hwx.site
cat /proc/sys/kernel/random/entropy_avail

If the result is below 500, you may want to consider this workaround by installing an entropy generator such as rng-tools. Consult the security policies, procedures, and practices in your organization before proceeding.

Install rng-tools:

yum install rng-tools # For Centos/RHEL 6, 7+ systems
apt-get install rng-tools # For Debian systems
zypper install rng-tools # For SLES systems

For Centos/RHEL 6, Debian, SLES systems:

echo 'EXTRAOPTIONS="-r /dev/urandom"' >> /etc/sysconfig/rngd
service rngd start
chkconfig rngd on
cat /proc/sys/kernel/random/entropy_avail

For Centos/RHEL 7+ systems:

cat /proc/sys/kernel/random/entropy_avail
cp /usr/lib/systemd/system/rngd.service /etc/systemd/system/
sed -i -e 's|ExecStart=/sbin/rngd -f|ExecStart=/sbin/rngd -f -r /dev/urandom|' /etc/systemd/system/rngd.service
systemctl daemon-reload
systemctl start rngd
systemctl status rngd
# if the status command returns the service is loaded and enabled, skip the following step
systemctl enable rngd
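
The entropy check can be turned into a small pass/fail helper, for example to run it on every KTS host before continuing. This is a sketch only: the function name `entropy_ok` is hypothetical, and the default threshold of 500 is the value quoted in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when the available entropy is at least
# the given minimum (500 by default, the value used in this article).
# The file argument exists so the check can be tried against a sample file.
entropy_ok() {
    local file="${1:-/proc/sys/kernel/random/entropy_avail}"
    local min="${2:-500}"
    local avail
    avail="$(cat "$file")" || return 2
    echo "entropy_avail=$avail (minimum $min)"
    [ "$avail" -ge "$min" ]
}

# Example usage on a KTS host:
# entropy_ok || echo "entropy is low; consider installing rng-tools"
```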

Generate private key on the Active KTS by running the below command:

[root@ccycloud-4 ~]# ktadmin init
INFO:keytrustee.server.util:Creating self-signed cert
INFO:keytrustee.util:`/usr/bin/openssl req -nodes -new -days 3650 -subj /C=US/ST=TX/L=Austin/CN=ccycloud-4.cdpvcb.root.hwx.site/E=keytrustee@ccycloud-4.cdpvcb.root.hwx.site -x509 -out /tmp/tmpeKpVfQ.csr -keyout /tmp/tmpdO8Bnp.key`
INFO:keytrustee.server.util:Generating PGP key, this may take a while
Initialized directory for 4096R/0A9F3FEBEE343FEA839DA417F1516D034C0E2E78

Install “rsync” on both active and passive KTS servers.

Run the command below to copy the private key of the active KTS to the passive KTS, so that both servers use the same key.

[root@ccycloud-4 ~]# rsync -zav --exclude .ssl /var/lib/keytrustee/.keytrustee ccycloud-3.cdpvcb.root.hwx.site:/var/lib/keytrustee/
root@ccycloud-3.cdpvcb.root.hwx.site's password:
sending incremental file list
.keytrustee/
.keytrustee/gpg.conf
.keytrustee/keytrustee.conf
.keytrustee/logging.conf
.keytrustee/pubring.gpg
.keytrustee/pubring.gpg~
.keytrustee/random_seed
.keytrustee/secring.gpg
.keytrustee/trustdb.gpg
sent 11,286 bytes  received 172 bytes  2,546.22 bytes/sec
total size is 12,317  speedup is 1.07

Initialize the Passive Key Trustee Server with the same private key. Ensure both ktadmin commands output the same initialized directory.

[root@ccycloud-4 ~]# ssh ccycloud-3.cdpvcb.root.hwx.site
root@ccycloud-3.cdpvcb.root.hwx.site's password:
Last login: Thu Feb 25 19:47:10 2021 from 172.27.172.135
[root@ccycloud-3 ~]# ktadmin init
INFO:keytrustee.server.util:Creating self-signed cert
INFO:keytrustee.util:`/usr/bin/openssl req -nodes -new -days 3650 -subj /C=US/ST=TX/L=Austin/CN=ccycloud-3.cdpvcb.root.hwx.site/E=keytrustee@ccycloud-3.cdpvcb.root.hwx.site -x509 -out /tmp/tmpVxzRKB.csr -keyout /tmp/tmpC6NvaB.key`
Initialized directory for 4096R/0A9F3FEBEE343FEA839DA417F1516D034C0E2E78
[root@ccycloud-3 ~]#

The Initialized directory values must be identical on both active and passive KTS. 
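
Comparing the two fingerprints by eye is error prone, so the check can be scripted. The sketch below assumes the output of each `ktadmin init` run was saved to a file; the function names and file names are hypothetical.

```shell
#!/usr/bin/env bash
# Extract the key fingerprint from saved `ktadmin init` output.
# Reads the text on stdin and prints the hex string that follows "4096R/".
kts_fingerprint() {
    sed -n 's|^Initialized directory for [^/]*/\([0-9A-Fa-f][0-9A-Fa-f]*\)$|\1|p'
}

# Succeed only when both saved outputs contain the same, non-empty fingerprint.
same_kts_key() {
    local a b
    a="$(kts_fingerprint < "$1")"
    b="$(kts_fingerprint < "$2")"
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example usage (file names are assumptions):
# same_kts_key active-init.log passive-init.log && echo "fingerprints match"
```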

Install Ranger KMS with KTS Backend:

On Cloudera Manager UI, click on Add Service, and choose “Ranger KMS with Key Trustee Server”. 

Ranger KMS with Key Trustee Server

Setup Authorization Secret

This step helps you create an organization and retrieve the “auth_secret” value for this Ranger KMS with Key Trustee Server to use. An organization is required to register with Key Trustee Server. 

The following screenshot indicates where to enter the “Org Name” and where the generated “auth_secret” is to be entered. 

Ranger KMS with Key Trustee Server

Org Name

Generate Instruction

Enter a name for “Org Name”, for example “qa-test” (you can choose any name here), and proceed.

Switch to the primary Key Trustee Server and run the following commands:

[root@ccycloud-4 ~]# keytrustee-orgtool add -n qa-test -c root@localhost
Dropped privileges to keytrustee
2021-02-26 12:32:53,561 - keytrustee.server.orgtool - INFO - Adding organization to database
2021-02-26 12:32:53,564 - keytrustee.server.orgtool - INFO - Initializing random secret
2021-02-26 12:32:53,584 - keytrustee.server.util - ERROR - An exception of type error occurred. Arguments:(111, 'Connection refused'). This probably happened because there is no Mail Transfer Agent setup. You will not receive any emails you were to receive from the Key Trustee Server.
[root@ccycloud-4 ~]# keytrustee-orgtool list
Dropped privileges to keytrustee
{
    "qa-test": {
        "auth_secret": "ZJ76qlaTev6ehyP/D9GJ/Q==",
        "contacts": [
            "root@localhost"
        ],
        "creation": "2021-02-26T12:32:53",
        "expiration": "9999-12-31T15:59:59",
        "key_info": null,
        "name": "qa-test",
        "state": 0,
        "uuid": "I8eCm6jxihRwJmFJJkthYk9CUgFf10o94dYsgTWPxHB"
    }
}

Copy the above “auth_secret” value and enter it on the Cloudera Manager screen where it asks for “auth_secret”, as shown in the above screenshot. CM then takes you to the next page, “Setup TLS for Ranger KMS for KeyTrustee Server”.
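
Instead of copying the value by hand, the “auth_secret” can be pulled out of the `keytrustee-orgtool list` output with a short filter. This is a text-based sketch that assumes the output layout shown above; the function name `org_auth_secret` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `keytrustee-orgtool list` output on stdin and
# print the auth_secret of the named organization. It works on the text
# layout shown in this article rather than parsing the JSON strictly.
org_auth_secret() {
    local org="$1"
    awk -v org="$org" '
        # Start of the block for the requested organization, e.g.  "qa-test": {
        index($0, "\"" org "\": {") { in_org = 1; next }
        # First auth_secret line inside that block
        in_org && /"auth_secret":/ {
            split($0, parts, "\"")
            print parts[4]
            found = 1
            exit
        }
        END { exit !found }
    '
}

# Example usage on the primary Key Trustee Server:
# keytrustee-orgtool list | org_auth_secret qa-test
```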

Here, the best practice is to enable TLS across all nodes of the CDP cluster with certificates signed by a well-known Certificate Authority. However, we can continue without enabling TLS for the purpose of this blog. For the same reason, I have also chosen to run both the “Active KeyTrustee server” and the “Ranger KMS with KTS” on the same host for the sake of simplicity.

Upon clicking NEXT, you are prompted to review your changes. Choose Kerberos as the authentication method, then click NEXT to complete the installation.

If Ranger KMS with KTS is not started automatically, start the service along with any other stale services.

Troubleshooting:

In case Ranger KMS does not start, please go through the following logs:

  • Cloudera Manager agent logs at: /var/log/cloudera-scm-agent/ on the host where Ranger KMS is installed
  • Ranger KMS server logs at: /var/log/ranger/kms/

Install & Configure Ranger KMS with RDBMS Backend:

Ranger KMS installation with RDBMS backend is a much simpler installation. The prerequisite to implement this option is to have an RDBMS installed and configured. In this article, we will provide instructions on how to install and configure a MySQL instance as a backend for Ranger KMS.

Install & Configure RDBMS:

In case option 1 above is chosen, the following instructions help to stand up an RDBMS instance that can act as a backend to Ranger KMS. Some customers might use the same RDBMS that forms the backend for the Hive or Ranger metastore. However, it is a best practice to have a dedicated RDBMS for Ranger KMS since it stores sensitive information like encryption zone keys and the master secret key.

Ranger KMS supports MySQL, PostgreSQL, and Oracle. In this article, we will install a dedicated MySQL instance as a backend for Ranger KMS.

Run the commands below to install MySQL 5.7 from the internet. If the cluster Linux host has no internet access, download the file elsewhere and SCP it onto the Linux host.

$ yum localinstall https://dev.mysql.com/get/mysql57-community-release-el7-11.noarch.rpm
$ yum install -y mysql-community-server
$ systemctl enable mysqld
$ systemctl start mysqld

The initial password is found in the log file; it can be found as follows:

[root@ccycloud-1 ~]# grep 'temporary password' /var/log/mysqld.log
2021-02-16T02:50:34.638064Z 1 [Note] A temporary password is generated for root@localhost: E;Pm;YgNp3zh
[root@ccycloud-1 ~]#

Run the following command to enter the default password and change it to a new password by following the prompts:

$ mysql_secure_installation

Once the new root password is set, log in to MySQL and create a database and user for Ranger KMS. In this example, the MySQL root password is set to “Hadoop_123”, and the same password is used for the “rangerkms” user.

CREATE DATABASE rangerkms DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
GRANT ALL ON rangerkms.* TO 'rangerkms'@'%' IDENTIFIED BY 'Hadoop_123';
GRANT ALL ON rangerkms.* TO 'rangerkms'@'localhost' IDENTIFIED BY 'Hadoop_123';

Download and install the MySQL Java connector jar:

$ wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.46.tar.gz
$ tar zxvf mysql-connector-java-5.1.46.tar.gz
$ sudo mkdir -p /usr/share/java/
$ cd mysql-connector-java-5.1.46
$ sudo cp mysql-connector-java-5.1.46-bin.jar /usr/share/java/mysql-connector-java.jar

Create /usr/share/java folder on all the hosts:

$ for i in $(cat hosts);do echo $i;ssh $i 'mkdir -p /usr/share/java';done

Copy the mysql connector jar to the /usr/share/java folder on all the hosts:

$ for i in $(cat hosts);do echo $i;scp /root/mysql-connector-java-5.1.46/mysql-connector-java-5.1.46-bin.jar $i:/usr/share/java/mysql-connector-java.jar;done

Install Ranger KMS:

Click on the “Add Service” option in the Cloudera Manager UI and choose “Ranger KMS” as shown below:

Configure the Ranger KMS backend database, using the MySQL instance along with the user, database and access to the user on the database. 

The below screenshot shows the configuration page along with a successful test connection. 

If the cluster is Kerberized, choose “kerberos” as the authentication method; otherwise, choose “simple”.

Also, enter the “Ranger KMS Master Key Password” and save this password. This field is not auto-populated, so the user has to enter the master secret password.

Follow the prompts on Cloudera Manager to complete the installation and start Ranger KMS. 

Troubleshooting:

If the Ranger KMS service does not start, please look into the Cloudera Manager agent logs on the host at /var/log/cloudera-scm-agent, or the Ranger KMS logs at /var/log/ranger/kms.

The usual cause is connectivity with the backend database (MySQL). Verify the host name and port number of the database server, and check whether the database admin user has sufficient privileges on the database created for storing the encryption keys.

Ranger User Roles:

Once Ranger KMS is started, go to the Ranger service and open the Ranger Web UI.

**** Please note that the Web UI URL is the same for both Ranger UI and Ranger KMS UI. Depending on the user’s role, the URL takes the user to either Ranger UI or Ranger KMS UI. This is to ensure separation of duties between Info security team that manages encryption keys, KMS policies, and Cluster Admin or Data Steward team that manages Ranger Hive, HDFS, HBase policies ****

Here are the roles users can have in Ranger / Ranger KMS :

  • User
  • Admin
  • KeyAdmin
  • KMSAuditor

Users with the KeyAdmin role can log in to the Ranger KMS UI, create encryption keys, and create KMS policies that define which users and groups can decrypt files in encryption zones.

When Ranger KMS is installed and configured, Cloudera Manager asks for the password of the “keyadmin” user. Please save the password so that you can login to Ranger KMS UI later.

Configure Ranger KMS service:

Login to Ranger KMS UI with keyadmin user credentials, and the URL would look something like this: http://ccycloud-4.cdpvcb.root.h:6080/login.jsp

Click on the edit button of the KMS policy service, and modify the KMS URL in the “Config Properties” section:

From 

kms://http@localhost:9292/kms

To 

kms://http@ccycloud-4.cdpvcb.root.hwx.site:9292/kms

Click on “save” to save the changes.
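
In a KMS provider URI of this form, the transport scheme sits between “kms://” and the “@”, so the value above corresponds to a plain HTTP endpoint on the same host and port. The following sketch converts one form to the other, which can be handy when testing the endpoint with a tool such as curl; the function name `kms_uri_to_url` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a KMS provider URI such as
#   kms://http@host:9292/kms
# into the HTTP(S) URL it points to:
#   http://host:9292/kms
kms_uri_to_url() {
    local uri="$1" rest scheme hostpath
    case "$uri" in
        kms://*@*) ;;
        *) echo "not a kms:// URI: $uri" >&2; return 1 ;;
    esac
    rest="${uri#kms://}"       # http@host:9292/kms
    scheme="${rest%%@*}"       # http
    hostpath="${rest#*@}"      # host:9292/kms
    echo "${scheme}://${hostpath}"
}

# Example usage on a Kerberized cluster (an assumption; requires a valid ticket):
# curl --negotiate -u : "$(kms_uri_to_url kms://http@ccycloud-4.cdpvcb.root.hwx.site:9292/kms)/v1/keys/names"
```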

Validate HDFS Data Encryption:

HDFS Data Encryption at Rest works with the construct of “Encryption Zones”. An encryption zone is an HDFS folder where all the files in that folder or its subfolders are encrypted using an encryption zone key.

An encryption zone is created by associating an empty HDFS folder with an encryption zone key.

Create Encryption key:

To create the encryption key, the administrator needs to log in to the Ranger KMS UI as the “keyadmin” user or any user with the “keyadmin” role.

  • Click on Encryption button at the top
  • Click on Key Manager
  • Select the service “cm_kms” from drop down menu
  • Click on “Add Key” button

Click on save, and you have successfully created an encryption key.

Create Encryption Zone:

Log in to one of the cluster nodes, and kinit as the “hdfs” user or any user that has privileges on the “Get Metadata” & “Generate EEK” operations.

In our case, the “hdfs” user has access to create keys as shown in the below screenshot, so we will kinit as the “hdfs” user and try to create an encryption zone:

[root@ccycloud-1 ~]# hdfs crypto -createZone -keyName myenckey -path /user/anatva/protected
Added encryption zone /user/anatva/protected
[root@ccycloud-1 ~]# hdfs crypto -listZones
/user/anatva/protected  myenckey
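
To check from a script which key protects a given zone, the `hdfs crypto -listZones` output can be filtered by path. This is a minimal sketch that assumes the two-column output shown above and a caller with HDFS superuser privileges; the function name `zone_key_for` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the encryption zone key associated with an
# encryption zone path, based on the output of `hdfs crypto -listZones`.
# Returns a non-zero status when the path is not an encryption zone.
zone_key_for() {
    local path="$1"
    hdfs crypto -listZones |
        awk -v p="$path" '$1 == p { print $2; found = 1 } END { exit !found }'
}

# Example usage:
# zone_key_for /user/anatva/protected
```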

Since the “hdfs” user does not have access to DECRYPT the key, the hdfs user cannot write or read files in the encryption zone:

[root@ccycloud-1 ~]# hdfs dfs -put /etc/passwd /user/anatva/protected/
put: User:hdfs not allowed to do 'DECRYPT_EEK' on 'myenckey'

However, since the user “anatva” has the “DECRYPT_EEK” privilege, anatva is able to read files within the encryption zone:

[anatva@ccycloud-1 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_1002
Default principal: anatva@EXAMPLE.COM
Valid starting       Expires              Service principal
02/23/2021 20:01:01  02/24/2021 20:01:01  krbtgt/EXAMPLE.COM@EXAMPLE.COM
[anatva@ccycloud-1 ~]$ hdfs dfs -cat /user/anatva/protected/passwd | head -3
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
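
To confirm that the bytes stored on disk really are encrypted, HDFS exposes a “/.reserved/raw” view of the namespace that returns file contents without decrypting them and is restricted to the HDFS superuser. The sketch below only builds the raw path for a file; the function name `raw_path` is hypothetical, and the commented command assumes a kinit as the “hdfs” user.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an absolute HDFS path to its /.reserved/raw
# counterpart, which exposes the stored (still encrypted) bytes.
raw_path() {
    case "$1" in
        /*) echo "/.reserved/raw$1" ;;
        *)  echo "expected an absolute HDFS path: $1" >&2; return 1 ;;
    esac
}

# Example usage as the HDFS superuser; the output should be unreadable ciphertext:
# hdfs dfs -cat "$(raw_path /user/anatva/protected/passwd)" | head -c 64 | od -c
```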

Replication of Encrypted Data:

With third-party encryption systems, simple read/write operations may not decrypt/encrypt the data automatically. The replication of encrypted data between two on-prem clusters, or between on-prem and cloud storage, usually fails with mismatched file checksums if the encryption keys differ on the source and destination clusters. To make distributed copy work, you must either use the “skipcrccheck” flag or maintain the same encryption key on source and destination, which is not recommended.

With HDFS native encryption, read/write operations on files within encryption zones automatically decrypt/encrypt the data, provided the user has “DECRYPT_EEK” access on the encryption zone key. The replication process (e.g., distributed copy) automatically decrypts data while reading from the source and encrypts it while writing to the target cluster. While the file checksums don’t match in this scenario either, it allows for different encryption keys on source and target.
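
Because the checksums of the encrypted blocks differ between source and target, a distributed copy between two encryption zones is typically run with the `-update` and `-skipcrccheck` flags. The wrapper below is a sketch under that assumption; the function name `copy_between_zones` and the NameNode addresses in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy a directory between two encryption zones with
# distcp, skipping the CRC comparison that would otherwise fail because
# the encrypted bytes differ on source and target.
# The caller needs DECRYPT_EEK access on the keys of both zones.
copy_between_zones() {
    local src="$1" dst="$2"
    if [ -z "$src" ] || [ -z "$dst" ]; then
        echo "usage: copy_between_zones <source-uri> <target-uri>" >&2
        return 1
    fi
    hadoop distcp -update -skipcrccheck "$src" "$dst"
}

# Example usage (host names are assumptions):
# copy_between_zones hdfs://source-nn:8020/user/anatva/protected hdfs://target-nn:8020/user/anatva/protected
```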

Conclusion:

Cloudera strongly recommends that customers enable encryption of data at rest, as it protects sensitive data within the enterprise against external as well as internal threats. Since Cloudera supports running the Key Trustee Server cluster outside the main cluster, it can be managed by Info Security teams in the enterprise on separate hardware, and on a separate network if required.

The post HDFS Data Encryption at Rest on Cloudera Data Platform appeared first on Cloudera Blog.
