
The Problem

The number of enterprise data systems in your organization has grown out of control. Every day, your team is presented with new business scenarios that require aggregating or combining data from different enterprise data sources. And as more systems are adopted within your organization, it becomes increasingly hard to understand the different data sources that applications can use, as well as their underlying programming models and governance rules.

If this challenge sounds familiar, then you probably went through the exercise of building an operational data store (ODS) a few years ago. As time passed, your ODS probably started experiencing challenges incorporating new forms of unstructured and semi-structured data as well as providing a universal model to discover and access the different data sources in your enterprise in a consistent way.

The story above describes some of the current challenges of democratizing data access in today’s enterprise. Although many different needs can drive the simplification of data access in the enterprise, we think they can be summarized in the following four requirements:

  • Centralized catalog of enterprise data sources
  • Consistent API model to access data from enterprise systems
  • Ability to create aggregations between different data sources using known languages like SQL
  • Ability to manage, monitor and secure the different data sources in your enterprise

The Data Lake Solution

A data lake is one of the emerging architecture patterns that help with the democratization of data access in the enterprise. From a functional perspective, a data lake represents the gateway and aggregation layer to access enterprise data in a consistent, secure and compliant manner. A data lake should include the following elements:

[Figure: datalake]

Data Definition Interfaces

A data lake should give architects the ability to define new “data sources” based on underlying enterprise data. For instance, a data source can be a representation of data coming from an Oracle database or returned from a component interface in PeopleSoft. To accomplish this, a data lake solution should provide connectors to the common corporate systems in an enterprise.
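
To make the idea of a data source definition more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the DataSource class, its fields and the in_memory_connector helper are hypothetical names, and the connector simply replays fixed rows instead of talking to Oracle or PeopleSoft.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


@dataclass(frozen=True)
class DataSource:
    """A named view over data held in some underlying enterprise system."""
    name: str
    system: str  # descriptive label only, e.g. "oracle" or "peoplesoft"
    fetch: Callable[[], Iterable[Record]]  # connector callback returning records


def in_memory_connector(rows: List[Record]) -> Callable[[], Iterable[Record]]:
    """Hypothetical stand-in for a real connector; it replays fixed rows."""
    return lambda: list(rows)


# Define a data source on top of the (simulated) underlying system.
invoices = DataSource(
    name="invoices",
    system="oracle",
    fetch=in_memory_connector(
        [{"id": 1, "amount": 120.0}, {"id": 2, "amount": 80.0}]
    ),
)

invoice_rows = list(invoices.fetch())
```

The point of the design is that consumers only see the name and the fetch callback; the system behind it can change without affecting them.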

Data Aggregation Interfaces

To address the growing set of requirements for new data, a data lake solution should provide a SQL-like interface that users can leverage to define new data sources based on aggregations of existing data sources.
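
As a rough illustration of defining a new data source through SQL, the sketch below uses Python's built-in sqlite3 module as a stand-in for the data lake's query layer. The table names, columns and sample rows are assumptions invented for the example; a real data lake would run a comparable statement across its connectors rather than inside a single in-memory database.

```python
import sqlite3

# In-memory database standing in for two existing data sources.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO invoices VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 50.0);

    -- The new data source is defined purely as an aggregation of the two above.
    CREATE VIEW customer_totals AS
        SELECT c.name AS customer, SUM(i.amount) AS total
        FROM customers c JOIN invoices i ON i.customer_id = c.id
        GROUP BY c.name;
""")

# Consumers query the aggregated source like any other source.
totals = conn.execute(
    "SELECT customer, total FROM customer_totals ORDER BY customer"
).fetchall()
```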

Centralized Data Store

Similar to a traditional ODS, a data lake solution should provide a model to centralize the data from various enterprise data sources. Unlike a traditional ODS, a data lake should work with structured, semi-structured and unstructured data and should not require the definition of data schemas ahead of time.
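
One common way to avoid defining schemas ahead of time is to store raw payloads and apply a shape only when the data is read. The sketch below shows that idea under simple assumptions: raw_store, ingest and read are hypothetical names, and a Python list of JSON strings stands in for the central store.

```python
import json
from typing import Dict, Iterator, List

# Stand-in for the central store: raw payloads, with no schema declared up front.
raw_store: List[str] = []


def ingest(payload: Dict[str, object]) -> None:
    """Store the payload as-is, whatever fields it happens to carry."""
    raw_store.append(json.dumps(payload))


def read(fields: List[str]) -> Iterator[Dict[str, object]]:
    """Apply a shape at read time; fields missing from a record become None."""
    for line in raw_store:
        doc = json.loads(line)
        yield {name: doc.get(name) for name in fields}


ingest({"id": 1, "amount": 120.0})
ingest({"id": 2, "note": "free-form text", "tags": ["urgent"]})

projected = list(read(["id", "amount"]))
```

Both records are accepted even though they share only the id field; the caller decides which fields matter at query time.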

Data Catalog

A data lake solution should provide a centralized catalog of all the different data sources in your enterprise. This catalog should allow users to easily discover, test and validate the different data sources available in your enterprise.

Data Access APIs

A data lake solution should dynamically generate data access APIs from the different data sources defined in the catalog. For instance, if an architect has defined a data source for invoice information, a data lake should expose that data source using a dynamic API so that it can be consumed by different applications.
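
The sketch below illustrates one way the dynamic generation could work: a route table is derived from whatever the catalog contains, so adding a source adds an endpoint without new code. The /api/ path convention, the build_routes function and the sample sources are assumptions; handlers are plain functions returning JSON strings rather than endpoints in a real web framework.

```python
import json
from typing import Callable, Dict, List

Record = Dict[str, object]

# Stand-in for the catalog: source name -> connector callback.
sources: Dict[str, Callable[[], List[Record]]] = {
    "invoices": lambda: [{"id": 1, "amount": 120.0}],
    "customers": lambda: [{"id": 1, "name": "Acme"}],
}


def build_routes(
    catalog: Dict[str, Callable[[], List[Record]]]
) -> Dict[str, Callable[[], str]]:
    """Generate one read endpoint per catalog entry."""
    routes: Dict[str, Callable[[], str]] = {}
    for name, fetch in catalog.items():
        # Bind fetch as a default argument so each handler keeps its own source.
        def handler(fetch: Callable[[], List[Record]] = fetch) -> str:
            return json.dumps(fetch())

        routes[f"/api/{name}"] = handler
    return routes


routes = build_routes(sources)
response = routes["/api/invoices"]()
```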

Data Search Interface

In addition to accessing data using standard queries or APIs, a data lake solution should support the indexing and search of enterprise data sources. This feature is key to allowing end users to discover data records using standard search keywords.
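
Keyword search over records is typically backed by an inverted index, which maps each term to the records that contain it. Here is a minimal sketch of that technique; the record ids and text are invented sample data, and a production system would rely on a dedicated search engine rather than this hand-rolled index.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Invented sample records drawn from different data sources.
records = {
    "invoice-1": "Acme Corp annual license renewal",
    "invoice-2": "Globex hardware purchase",
    "ticket-7": "Acme support request about license keys",
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Build an inverted index: term -> ids of the records containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return ids of records that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


index = build_index(records)
```

With this index, search(index, "acme license") matches both the invoice and the support ticket, which is the cross-source discovery the search interface is meant to provide.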

Data Governance

A data lake solution should enable the governance, management and security of existing data sources. Using the appropriate governance models, IT organizations can set up the correct access control, SLAs, security and other policies that govern the access to data in the enterprise.
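
As one small example of what a governance policy might look like in code, the sketch below attaches a role-based access rule and a row limit to a data source and denies access by default. The Policy class, the role names and the max_rows limit are all assumptions chosen for illustration; real governance would also cover SLAs, auditing and more.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List

Record = Dict[str, object]


@dataclass(frozen=True)
class Policy:
    allowed_roles: FrozenSet[str]  # roles permitted to read the source
    max_rows: int                  # cap on rows returned per request


# Hypothetical policy table: source name -> policy.
policies: Dict[str, Policy] = {
    "invoices": Policy(allowed_roles=frozenset({"finance", "audit"}), max_rows=2),
}


def authorize(source: str, role: str) -> bool:
    """Deny by default: sources without a policy are not readable."""
    policy = policies.get(source)
    return policy is not None and role in policy.allowed_roles


def governed_read(source: str, role: str, rows: List[Record]) -> List[Record]:
    """Enforce the policy before handing rows to the caller."""
    if not authorize(source, role):
        raise PermissionError(f"role {role!r} may not read {source!r}")
    return rows[: policies[source].max_rows]


sample_rows: List[Record] = [{"id": 1}, {"id": 2}, {"id": 3}]
visible = governed_read("invoices", "finance", sample_rows)
```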

From a functional standpoint, a data lake solution should provide a universal data access gateway to your enterprise data. Unlike traditional solutions such as operational data stores, a data lake solution takes advantage of the latest generation of big data, search and API management stacks to provide a robust architecture model that enables the cataloging, discoverability, consumption and governance of data in the enterprise.

I hope you like the thoughts listed here. In the next post, we will discuss how to implement a data lake solution with today’s technologies.