A data lake architecture for modern analytics and BI
April 12, 2020
Big data and data lakes are meaningful to an organization only when they help solve business problems through data democratization, re-use, exploration, and analytics. At Accenture, we're building intelligent search and analytics applications to gain more value from enterprise data lakes, and we're helping organizations do amazing things as a result.
A data lake is a large storage repository that holds a vast amount of raw data in its native format until it is needed. An “enterprise data lake” (EDL) is simply a data lake for enterprise-wide information storage and sharing.
A data lake architecture incorporating enterprise search and analytics techniques can help companies unlock actionable insights from the vast structured and unstructured data stored in their lakes.
The main benefit of a data lake is the centralization of disparate content sources. Once gathered together (out of their "information silos"), these sources can be combined and processed using big data, search, and analytics techniques that would otherwise have been impossible. These disparate sources often contain proprietary and sensitive information, which requires appropriate security measures in the data lake.
Security measures in the data lake may be assigned in a way that grants users access to certain information even when they do not have access to the original content source. These users are entitled to the information, yet unable to reach it at its source for some reason.
Some users may not need to work with the data in the original content source, but only consume the data produced by processes built into those sources. Licensing limits on the original content source may prevent some users from getting their own credentials. In other cases, the original content source has been locked down, is obsolete, or will soon be decommissioned, yet its content is still valuable to users of the data lake.
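One common way to honor such entitlements, sketched below under assumed names rather than any particular product's API, is to attach an access-control list to each document at indexing time and filter every query by the requesting user's groups.

```python
# Sketch: document-level security trimming at query time.
# Document, visible_to, and secure_search are hypothetical stand-ins
# for whatever the search platform actually provides.

from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    body: str
    allowed_groups: set[str] = field(default_factory=set)


def visible_to(user_groups: set[str], doc: Document) -> bool:
    # A user sees a document if they share at least one group with it.
    return bool(user_groups & doc.allowed_groups)


def secure_search(index: list[Document], query: str,
                  user_groups: set[str]) -> list[Document]:
    # Real engines push this filter into the query itself (e.g. a terms
    # filter); here we filter in Python to keep the sketch self-contained.
    matches = [d for d in index if query.lower() in d.body.lower()]
    return [d for d in matches if visible_to(user_groups, d)]


if __name__ == "__main__":
    index = [
        Document("1", "quarterly manufacturing yield report", {"manufacturing"}),
        Document("2", "clinical trial protocol draft", {"clinical", "legal"}),
    ]
    # A manufacturing analyst sees only document 1.
    print([d.doc_id for d in secure_search(index, "report", {"manufacturing"})])
```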
Once the content is in the data lake, it can be normalized and enriched. This can include metadata extraction, format conversion, augmentation, entity extraction, cross-linking, aggregation, de-normalization, or indexing. Read more about data preparation best practices. Data is prepared “as needed,” reducing preparation costs over up-front processing (such as would be required by data warehouses). A big data compute fabric makes it possible to scale this processing to include the largest possible enterprise-wide data sets.
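As a rough illustration of this kind of "as needed" preparation, the sketch below chains a few enrichment steps into one function; every helper name here (extract_metadata, extract_entities, and so on) is a hypothetical placeholder rather than a specific tool's API.

```python
# Sketch of an "as needed" enrichment pipeline: raw content goes in,
# a normalized, enriched record comes out, ready for indexing.
# All helper names are illustrative placeholders.

import datetime
import re


def extract_metadata(raw: bytes, source: str) -> dict:
    # Placeholder: real pipelines often lean on tools like Apache Tika.
    return {"source": source, "bytes": len(raw),
            "ingested_at": datetime.datetime.utcnow().isoformat()}


def to_plain_text(raw: bytes) -> str:
    # Placeholder format conversion; assume UTF-8 text for the sketch.
    return raw.decode("utf-8", errors="replace")


def extract_entities(text: str) -> list[str]:
    # Toy entity extraction: capitalized tokens stand in for a real NER model.
    return sorted(set(re.findall(r"\b[A-Z][a-z]+\b", text)))


def enrich(raw: bytes, source: str) -> dict:
    text = to_plain_text(raw)
    record = extract_metadata(raw, source)
    record["text"] = text
    record["entities"] = extract_entities(text)
    return record  # hand this record to the indexer


if __name__ == "__main__":
    print(enrich(b"Batch 42 was reviewed by Alice in Dublin.", "lims"))
```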
Users from different departments, potentially scattered around the globe, can have flexible access to the data lake and its content from anywhere. This increases re-use of the content and helps the organization more easily collect the data required to drive business decisions.
Information is power, and a data lake puts enterprise-wide information into the hands of many more employees to make the organization as a whole smarter, more agile, and more innovative.
[Figure: Benefits of an enterprise data lake]
Data lakes will hold tens of thousands of tables and files and billions of records. Even worse, much of this data is unstructured and widely varying.
In this environment, search is a necessary tool: only search engines can perform real-time analytics at billion-record scale at reasonable cost, and that makes them the natural choice for managing and exploring the enterprise data lake.
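To make the analytics claim concrete, here is an illustrative aggregation query posted to an Elasticsearch _search endpoint; the index name, field names, and localhost URL are assumptions for the sketch, not details from the projects described below.

```python
# Sketch: a faceted analytics query against Elasticsearch's _search API.
# The index ("lake-docs"), fields, and URL are illustrative assumptions.

import requests

query = {
    "size": 0,  # we want aggregates, not individual hits
    "query": {"match": {"text": "stability study"}},
    "aggs": {
        "by_source": {"terms": {"field": "source", "size": 10}},
        "per_month": {"date_histogram": {"field": "ingested_at",
                                         "calendar_interval": "month"}},
    },
}

resp = requests.post("http://localhost:9200/lake-docs/_search",
                     json=query, timeout=30)
resp.raise_for_status()
aggs = resp.json()["aggregations"]
for bucket in aggs["by_source"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```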
Radiant Advisors and Unisphere Research released "The Definitive Guide to the Data Lake," a joint research project with the goal of clarifying the emerging data lake concept.
The research distills a number of high-level findings about how organizations are putting the data lake concept into practice.
More and more research on data lakes is becoming available as companies are taking the leap to incorporate data lakes into their overall data management strategy. It's expected that, within the next few years, data lakes will be common and continue to mature and evolve.
We've worked with two worldwide biotechnology and health research firms. These organizations have many different departments, and employees have access to many different content sources from business systems stored all over the world. The data includes both structured and unstructured content.
Our projects focused on making structured and unstructured data searchable from a central data lake. The goal was to provide data access to business users in near real time and improve visibility into the manufacturing and research processes. The enterprise data lake and big data architectures are built on Cloudera, which collects and processes all the raw data in one place, then indexes that data into Cloudera Search, Impala, and HBase for a unified search and analytics experience for end users.
Read about how we helped ingest over 1 Petabyte of unstructured data into a pharmaceutical data lake.
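As a hedged sketch of what the unified-search half of such an architecture looks like from a client application, the snippet below queries a Cloudera Search (Apache Solr) collection over its standard select endpoint; the host, collection name, and field names are assumptions for illustration.

```python
# Sketch: querying a Cloudera Search (Apache Solr) collection over HTTP.
# The host, collection name ("lake_docs"), and fields are illustrative.

import requests

params = {
    "q": 'text:"stability study"',  # Lucene query syntax
    "fl": "id,title,source",        # fields to return
    "rows": 10,
    "wt": "json",
}
resp = requests.get("http://solr-host:8983/solr/lake_docs/select",
                    params=params, timeout=30)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc.get("id"), doc.get("title"))
```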
Multiple user interfaces were created to meet the needs of the various user communities. Some were simple search UIs, while others were more sophisticated interfaces that supported advanced search. Some UIs were integrated with highly specialized data analytics tools (e.g., genomic and clinical analytics). Security requirements were respected across all UIs.
Being able to search and analyze their data more effectively could lead to improvements in areas such as visibility into the manufacturing and research processes.
All content will be ingested into the data lake or staging repository and then indexed and searched (using a search engine such as Cloudera Search or Elasticsearch). Where necessary, content will be analyzed, and results will be fed back to users via search through a multitude of UIs across various platforms.
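One way to serve many UIs from the same lake, sketched here under assumed names rather than taken from the projects above, is a thin search API that every interface shares, so that query handling and security trimming happen in one place.

```python
# Sketch: a thin search facade shared by all UIs, so query handling
# and security trimming live in one place. Names are illustrative.

from flask import Flask, jsonify, request

app = Flask(__name__)


def backend_search(query: str, groups: list[str]) -> list[dict]:
    # Placeholder for a real call to Cloudera Search or Elasticsearch,
    # with the user's groups applied as a security filter.
    return [{"id": "demo-1", "title": f"result for {query!r}",
             "visible_to": groups}]


@app.route("/api/search")
def search():
    query = request.args.get("q", "")
    # In production the user's groups would come from the auth layer,
    # not from a query parameter.
    groups = request.args.get("groups", "").split(",")
    return jsonify(backend_search(query, groups))


if __name__ == "__main__":
    app.run(port=8080)  # every UI, simple or specialized, hits this one API
```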
The diagram below shows an optimized data lake architecture that supports data lake analytics and search.
[Figure: Enterprise data lake reference architecture]
At this point, the enterprise data lake is a relatively immature collection of technologies, frameworks, and aspirational goals. Future development will be focused on detangling this jungle into something which can be smoothly integrated with the rest of the business.
The future characteristics of a successful enterprise data lake will include:
Common, well-understood methods and APIs for ingesting content (see the sketch after this list)
Corporate-wide schema management
Business user’s interface for content processing
Text mining
Integration with document management
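To make the first of these characteristics concrete, here is a minimal sketch of a common ingestion API, assuming an abstract connector contract; the class and method names are hypothetical, not drawn from any existing framework.

```python
# Sketch: one well-understood ingestion interface that every content
# source implements, so the lake ingests everything the same way.
# The interface and names are hypothetical, not a product's API.

from abc import ABC, abstractmethod
from typing import Iterator


class ContentConnector(ABC):
    """Uniform contract for pulling documents out of any source system."""

    @abstractmethod
    def fetch(self) -> Iterator[dict]:
        """Yield raw documents as {'id', 'body', 'metadata'} records."""


class FileShareConnector(ContentConnector):
    def __init__(self, root: str):
        self.root = root

    def fetch(self) -> Iterator[dict]:
        # Placeholder: a real connector would walk the share and read files.
        yield {"id": f"{self.root}/example.txt", "body": "hello",
               "metadata": {"source": "fileshare"}}


def ingest(connectors: list[ContentConnector]) -> int:
    count = 0
    for connector in connectors:
        for doc in connector.fetch():
            # Hand off to the enrichment/indexing pipeline here.
            count += 1
    return count


if __name__ == "__main__":
    print(ingest([FileShareConnector("/mnt/research")]))  # -> 1
```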
We really are at the start of a long and exciting journey! We envision a platform where teams of scientists and data miners can collaboratively work with the corporation’s data to analyze and improve the business. After all, “information is power” and corporations are just now looking seriously at using data lakes to combine and leverage all of their information sources to optimize their business operations and aggressively go after markets.