Supporting Transformation with an Integrated Data Platform. Three Common Questions Answered.
In recent years there has been growing interest in how to safely and efficiently extend enterprise data platforms and workloads into the cloud. CDOs are under increasing pressure to reduce costs by moving data and workloads to the cloud, much as happened with business applications over the last decade.
Our upcoming webinar is centered on how an integrated data platform supports the data strategy and goals of becoming a data-driven company. Before that, companies should consider whether the right foundations for their data strategy are in place. In this blog post we consider three of the biggest challenges facing enterprise data platform owners, architects and engineers today. They are: how can an organisation
- Efficiently take advantage of cloud computing in an accelerated time frame?
- Minimise the integration effort across an enterprise data platform while avoiding vendor lock-in?
- Efficiently achieve consistently strong security, governance and lineage to meet regulatory requirements?
Data Platform Architecture
Let us start by considering how an organisation can efficiently take advantage of cloud computing in an accelerated time frame. The options available are:
- Migrate to a single cloud provider
- Migrate to multiple cloud providers
- Migrate to hybrid cloud
- Remain on-premises
The solution will be influenced by three factors:
- Functional requirements: What the platform and its component services must do. For example, the ability to perform in-stream analytical processing.
- Non-functional requirements: A measure of quality of the platform and its component services. For example, the ability to perform a benchmark workload in a given time.
- Constraints: Limits that the platform and its component services must adhere to. For example, sensitive data must be redacted before analysis to meet regulatory requirements.
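The redaction constraint above can be made concrete with a small sketch. This is a hypothetical illustration only, not part of any specific platform: the field names in `SENSITIVE_FIELDS` are assumptions, and a real deployment would enforce this through platform policy rather than ad hoc code.

```python
# Hypothetical sketch: mask sensitive fields in each record before it
# reaches an analytics sink, so regulated data never leaves in the clear.
SENSITIVE_FIELDS = {"email", "ssn"}  # assumed field names for this sketch

def redact(record: dict) -> dict:
    """Return a copy of the record with sensitive fields masked."""
    return {
        key: ("***REDACTED***" if key in SENSITIVE_FIELDS else value)
        for key, value in record.items()
    }

record = {"user_id": 42, "email": "jane@example.com", "score": 0.93}
print(redact(record))
# {'user_id': 42, 'email': '***REDACTED***', 'score': 0.93}
```

The key point is that redaction happens before analysis, which is what makes it a constraint on the platform rather than a feature of any one report.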
Organisations tell us these are their top constraints:
- Operational efficiency
- Accelerated time frames
- Regulatory compliance
- Use of multi-cloud
Operational efficiency across multiple public cloud providers isn’t possible without abstracting away the differences between each provider’s data services. The challenge is compounded by the fact that most organisations cannot, or will not, move all their on-premises data workloads to the cloud, whether because of regulatory constraints or the performance (non-functional) requirements of some workloads. This leads us towards solutions that are available both on premises and in the cloud, ideally supporting hybrid cloud.
Putting aside operational efficiency for a moment, let us consider the constraint of accelerated time frames. If data flows, ETL pipelines, BI reports and machine learning pipelines all need to be rewritten or heavily modified, this can significantly extend the time to value and increase the risk of moving to the cloud. Furthermore, inconsistencies between environments (on-premises versus each cloud) lead to further operational inefficiency.
“Is there a way to have a common platform that takes advantage of cloud native services while still providing a consistent and efficient way to manage hybrid-cloud deployments?”
Integrated Platform vs Point Solutions
A simplified enterprise data architecture looks something like the figure below.
It is unlikely that your organisation’s architecture is an exact match, but you can probably recognise many of the logical components. Even if each of these components adopts open standards and APIs, which historically has not always been the case, there is still considerable integration effort across a number of dimensions. One dimension is security, governance and lineage; another is proprietary storage formats, which lead to duplicated data and resources wasted moving and converting it.
If we focus on the data management component at the bottom of the figure, it needs to cover every logical component under management. In the figure I have shown this as a single logical entity. In reality, organisations often have separate management tools for each stage of the data life cycle.
“Is it possible to significantly reduce the integration effort across a typical enterprise data platform?”
Security, Governance and Lineage of an Organisation’s Data
As data flows through an organisation, from the point of creation to being transformed and potentially combined or enriched with other data sources, different users will access it at various times. Even if changes are permitted, we need to know how the data has been transformed over time, that is, its lineage. There need to be controls and mechanisms in place to log changes, or attempts to change data, so that we can reliably and consistently perform historical operations on data to validate previous insights.
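The idea of lineage can be sketched as an append-only log carried along with each derived data set. This is a minimal, hypothetical illustration; real platforms capture lineage automatically through their governance layer, and the names here (`Dataset`, `transform`) are invented for this sketch.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Dataset:
    """A data set plus the history of how it came to be (its lineage)."""
    name: str
    lineage: list = field(default_factory=list)

def transform(source: Dataset, new_name: str, operation: str, user: str) -> Dataset:
    """Derive a new data set, carrying forward and extending the lineage."""
    derived = Dataset(name=new_name, lineage=list(source.lineage))
    derived.lineage.append({
        "from": source.name,
        "operation": operation,
        "user": user,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return derived

raw = Dataset("raw_events")
enriched = transform(raw, "enriched_events", "join customer dimensions", "etl_job")
print([entry["operation"] for entry in enriched.lineage])
# ['join customer dimensions']
```

Because every transformation appends who, what and when, a historical operation can be replayed or audited later, which is exactly what validating previous insights requires.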
“Is there a way to provide an end-to-end security fabric that can simplify control across the entire data life-cycle?”
The Cloudera Data Platform (CDP)
The Cloudera Data Platform (CDP) provides a consistent management experience across each of these environments backed by a shared security and governance fabric.
CDP supports the entire data life cycle, from data collection and engineering through reporting and serving to prediction. Entire data flows, from the edge to AI, can be controlled within one platform. While each CDP data service can be used independently, most meaningful use cases require chaining several of them together. CDP simplifies this integration by using open standards, a unified data catalogue and a data lake with a common security and governance fabric.
The security and governance fabric in CDP is provided by a data service called the Shared Data Experience (SDX). SDX controls which data and workloads can be moved between environments while enforcing any restrictions on data movement. Data governance includes auditing and data lineage across the platform, with integration capabilities for third-party products and services.
SDX provides fine-grained control over resources based on users and roles, as well as inheritable attribute-based policies. Derived data sets inherit those attributes and the associated controls. This is important when we think of data as flowing and evolving over time.
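The inheritance behaviour described above can be sketched as follows. To be clear, this is not SDX’s actual API; it is a toy model of inheritable attribute-based access control, with attribute names (`PII`, `FINANCE`) and functions invented for illustration.

```python
# Toy model of inheritable attribute-based access control: a derived
# data set inherits its parent's attributes, and access is evaluated
# against the union of attributes.

def derive(parent_attrs: set, extra_attrs: set = frozenset()) -> set:
    """A derived data set carries forward all parent attributes."""
    return set(parent_attrs) | set(extra_attrs)

def allowed(user_clearances: set, dataset_attrs: set) -> bool:
    """Access requires clearance for every attribute on the data set."""
    return dataset_attrs <= user_clearances

source_attrs = {"PII"}
derived_attrs = derive(source_attrs, {"FINANCE"})

print(allowed({"PII"}, source_attrs))              # True
print(allowed({"PII"}, derived_attrs))             # False: lacks FINANCE
print(allowed({"PII", "FINANCE"}, derived_attrs))  # True
```

Note how the control travels with the data: tightening a policy on a tagged attribute immediately affects every data set derived from it, which is what makes attribute-based policies suit data that flows and evolves.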
Whether on premises or in the public cloud, CDP is based on the same cloud-native architecture using object storage and container services. Organisations no longer have to choose between on-premises and the cloud; they can operate in both environments with a consistent user experience. This, combined with the ability to replicate data, metadata and security policies between deployments, makes a hybrid-cloud enterprise data platform a reality.
Please join me as we discuss more about the considerations of deploying a data platform during the webinar “Supporting Transformation with an Integrated Data Platform”. Register here.