How to ingest unstructured data into a data lake at scale
September 16, 2019
Over the past few years, our team has worked with several clients to develop data lakes for storing enterprise-wide content. A data lake is a large storage repository that holds a vast amount of raw data in its native format until it is needed. Once the content is in the data lake, it can be searched, enriched, and used to generate insights to solve business problems and support diverse user needs.
In a recent project for a pharmaceutical client, we tackled a different problem: ingesting over one petabyte (PB) of unstructured data into their data lake. To put this into perspective, according to Computer Weekly,
"One petabyte is enough to store the DNA of the entire population of the US – and then clone them, twice."
Unstructured data refers to images, voice recordings, videos, and text documents written by humans for humans. Text can include PDFs, presentations, memos, emails, research and regulatory reports, and social media posts. Unstructured data generally lacks a predefined model to describe its content, and the absence of consistent descriptive metadata poses challenges for applications seeking to generate insights from this data.
Fortunately, there are available content processing, tagging, and connector technologies for acquiring and enriching unstructured text with metadata. However, technology alone is rarely enough. Unstructured data enrichment involves careful planning to extract the text from the content and generate metadata to help business systems (like search engines and analytics applications) make sense of the information.
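To make the extraction step concrete, here is a minimal Python sketch that pulls the full text and a few descriptive metadata fields out of a file. Apache Tika and the file name are illustrative assumptions on our part, not the specific tooling described in this project.

```python
# A minimal sketch of text and metadata extraction, assuming Apache Tika.
# (The project itself used Aspire's own processing stages; Tika and the
# file name below are purely illustrative.)
from tika import parser  # pip install tika; requires Java for the Tika server

def extract(path: str) -> dict:
    """Return a JSON-ready record with extracted text and basic metadata."""
    parsed = parser.from_file(path)        # returns {'status', 'metadata', 'content'}
    meta = parsed.get("metadata") or {}
    return {
        "source_path": path,
        "content_type": meta.get("Content-Type"),
        "author": meta.get("Author"),
        "text": (parsed.get("content") or "").strip(),
    }

if __name__ == "__main__":
    record = extract("example.pdf")        # hypothetical local file
    print(record["content_type"], len(record["text"]), "characters of text")
```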
Our client had over 1 PB of research data stored in various systems, ranging from Windows and Unix (NFS) file shares to Documentum and SharePoint. Numerous data types were found in these sources – PDFs, MS Office documents, email .msg files, images, and miscellaneous text formats. The data lake for this project is hosted in a Cloudera cluster. It’s worth noting that while we used Cloudera for this particular project, the tools and techniques can be applied to other data lake platforms.
Typically, pharmaceutical organizations seek to search and run analytics over unstructured data to derive insights from past research, respond to regulatory compliance requests, and fulfill other needs. So, our immediate business problem was:
Our client had over 1 PB of research data residing in various systems. How can all of this content be stored in the data lake in a way that usefully addresses different business needs?
In this project, we tackled two important problems found in many data lake implementations: acquiring unstructured content from diverse source systems at scale, and enriching it with metadata so that downstream applications can make sense of it.
The solution embedded Accenture’s Aspire Content Processing technology into the data lake as a Cloudera Service. It's also worth noting that, while this blog discusses a pharmaceutical use case, the technique can be extended to a wide range of domains and use cases.
Deployed as a Cloudera Parcel, Aspire resides and runs in the Cloudera Cluster as a Cloudera Service. This yields several benefits:
Aspire deployed as a Cloudera service
Aspire can be deployed as a Cloudera Service and communicates natively with the data lake for content storage and indexing.
First-class unstructured connectors can be deployed in a Hadoop cluster
The Aspire connector framework takes over responsibility for acquiring data from multiple sources.
New connectors can be added to the data lake as needed: for example, Documentum, file shares (CIFS and NFS), SharePoint, relational databases via JDBC drivers, Impala, Hive, and Kafka.
Scalability through parallel data acquisition
Aspire can be deployed as a service across many nodes in the cluster. This means that many copies of the connector can run in parallel, delivering high-throughput ingestion rates.
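To illustrate the connector idea at toy scale, the sketch below defines a common crawl interface over two source types and runs the connectors in parallel. The classes and source paths are hypothetical; they are not the Aspire connector API, which also handles incremental updates, security, and much more.

```python
# A minimal sketch of the connector framework idea: one interface over
# heterogeneous sources, with connectors crawling in parallel.
# These classes are hypothetical illustrations, not the Aspire connector API.
import os
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List

class Connector(ABC):
    @abstractmethod
    def crawl(self) -> Iterable[dict]:
        """Yield one JSON-ready record per source document."""

class FileShareConnector(Connector):
    def __init__(self, root: str):
        self.root = root
    def crawl(self) -> Iterable[dict]:
        for dirpath, _dirs, files in os.walk(self.root):
            for name in files:
                yield {"source": "fileshare", "path": os.path.join(dirpath, name)}

class SharePointConnector(Connector):
    def __init__(self, site_url: str):
        self.site_url = site_url
    def crawl(self) -> Iterable[dict]:
        # Placeholder: a real connector would page through the SharePoint API.
        yield {"source": "sharepoint", "path": self.site_url}

def run_all(connectors: List[Connector]) -> List[dict]:
    # Each connector gets its own worker thread -- the same idea, writ small,
    # as running many connector instances across cluster nodes.
    with ThreadPoolExecutor(max_workers=len(connectors)) as pool:
        batches = pool.map(lambda c: list(c.crawl()), connectors)
    return [record for batch in batches for record in batch]

if __name__ == "__main__":
    records = run_all([FileShareConnector("/mnt/research"),
                       SharePointConnector("https://sharepoint.example.com/research")])
    print(len(records), "documents discovered")
```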
The data flow for this system is depicted below.
Figure 1: Dataflow Overview
A staging repository is central to this data lake architecture.
For several years now, we have been using staging repositories to augment data ingestion. This is what we call a "crawl once and reprocess as needed" approach.
So, how can a data lake make use of this architecture?
Ingestion workflow and the staging repository
First, the ingest workflow acquires the content and performs light processing such as text extraction. Everything captured, including metadata, access control lists, and the extracted full text, is stored as JSON in the NoSQL staging repository. Binary files (such as PDFs) can be stored in the data lake as well to support future use cases, and connectors can use incremental update procedures to keep the data lake in sync as content changes in the enterprise source systems.
Figure 2: Ingest Dataflow
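As a rough sketch of that staging step, the snippet below writes one JSON record per document, including metadata, ACLs, and extracted text, into an HBase table. HBase via the happybase client, along with the host, table, and column names, are assumptions; the post only specifies a NoSQL staging repository inside the Cloudera cluster.

```python
# A minimal sketch of "store everything in the staging repository",
# assuming HBase as the NoSQL store. Host, table, and column names are
# hypothetical; a real design would split columns and families.
import json
import happybase  # pip install happybase (talks to the HBase Thrift gateway)

def stage_document(doc_id: str, metadata: dict, acls: list, text: str) -> None:
    record = {
        "metadata": metadata,   # descriptive fields captured by the connector
        "acls": acls,           # access control lists from the source system
        "text": text,           # extracted full text
    }
    connection = happybase.Connection("hbase-thrift.example.com")  # hypothetical host
    try:
        table = connection.table("staging_repository")
        # One row per document; the whole JSON record in a single column
        # keeps the sketch simple.
        table.put(doc_id.encode("utf-8"),
                  {b"doc:json": json.dumps(record).encode("utf-8")})
    finally:
        connection.close()

if __name__ == "__main__":
    stage_document("fileshare:/research/report-001.pdf",
                   {"title": "Stability study", "content_type": "application/pdf"},
                   ["GROUP_RESEARCH_READ"],
                   "Extracted full text goes here...")
```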
Significantly reducing processing time
A staging repository deployed in this fashion can save days or even weeks of processing time.
In some cases, it can take weeks to ingest data because of performance limitations or access restrictions in the native source systems. Storing all of the data in the staging repository allows us to crawl once and then repurpose the content as needed in the indexing workflow.
Indexing workflow
A search engine can support all stages of a data lake project, but it plays an especially critical role early on. Initially, our client indexed the data lake content with search engines to understand what content had been brought into the lake. In later stages, they created highly curated indexes to support specific business use cases.
A first-cut analysis supports a review of the content across multiple dimensions.
Figure 3: Indexing Dataflow
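As a rough sketch of the indexing workflow, the snippet below reads a staged record and pushes a trimmed search document into a Solr collection. Solr via pysolr is an assumption (Cloudera clusters of that era commonly bundled Cloudera Search/Solr), and the collection name and fields are hypothetical.

```python
# A minimal sketch of the indexing workflow, assuming Solr as the search
# engine. The Solr URL, collection name, and field names are hypothetical.
import pysolr  # pip install pysolr

def index_document(doc_id: str, staged: dict) -> None:
    solr = pysolr.Solr("http://solr.example.com:8983/solr/datalake_explore",
                       always_commit=True)
    search_doc = {
        "id": doc_id,
        "title": staged.get("metadata", {}).get("title"),
        "content_type": staged.get("metadata", {}).get("content_type"),
        "acl_read": staged.get("acls", []),
        "text": staged.get("text", ""),
    }
    solr.add([search_doc])

if __name__ == "__main__":
    index_document("fileshare:/research/report-001.pdf",
                   {"metadata": {"title": "Stability study",
                                 "content_type": "application/pdf"},
                    "acls": ["GROUP_RESEARCH_READ"],
                    "text": "Extracted full text goes here..."})
```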
Repurposing and enriching content
As depicted in Figure 3, a search engine embedded in a big data framework can generate numerous search indexes. Each index can be designed to meet the needs of a particular user community within the organization, allowing the client to view the content from different perspectives.
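One way to picture the "many indexes from one repository" idea is a simple routing function that decides which index (or indexes) each staged document belongs in. The rules and index names below are hypothetical illustrations, not our client's actual index design.

```python
# A minimal sketch of routing staged documents to purpose-built indexes.
# The rules and index names are hypothetical.
from typing import List

def route_to_indexes(staged: dict) -> List[str]:
    targets = ["datalake_explore"]                  # everything is searchable somewhere
    meta = staged.get("metadata", {})
    if meta.get("department") == "regulatory":
        targets.append("regulatory_compliance")     # curated index for compliance teams
    if meta.get("content_type", "").startswith("message/"):
        targets.append("email_discovery")           # curated index for email review
    return targets

if __name__ == "__main__":
    doc = {"metadata": {"department": "regulatory",
                        "content_type": "application/pdf"}}
    print(route_to_indexes(doc))   # ['datalake_explore', 'regulatory_compliance']
```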
Enrichment workflow
The value of the content in the staging repository increases as we know more about it. The content can be classified and tagged over time using ontologies created by the organization or from external domain-specific sources and APIs.
The data lake staging repository supports this use case nicely with the addition of an Aspire enrichment workflow. An enrichment workflow (Figure 4) passes content through REST APIs or other mechanisms to classify documents and perform named entity recognition using domain-specific ontologies or local terminologies.
The workflow can be triggered by one of two scenarios:
Enriching content in the staging repository saves time and ensures continuously updated content.
Figure 4: Aspire Enrichment Dataflow
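As a rough sketch of the enrichment call, the snippet below posts staged text to a tagging service over REST and merges the returned entities and classes back into the record. The endpoint URL and response shape are assumptions; the workflow only requires that enrichment happens through REST APIs or other mechanisms backed by domain ontologies.

```python
# A minimal sketch of one enrichment step: call a tagging/NER service over
# REST and merge its output into the staged record. The endpoint URL and
# response fields ("entities", "classes") are hypothetical.
import requests  # pip install requests

def enrich(staged: dict,
           endpoint: str = "https://tagger.example.com/api/annotate") -> dict:
    response = requests.post(endpoint,
                             json={"text": staged.get("text", "")},
                             timeout=30)
    response.raise_for_status()
    annotations = response.json()   # assumed shape: {"entities": [...], "classes": [...]}
    staged.setdefault("enrichment", {})
    staged["enrichment"]["entities"] = annotations.get("entities", [])
    staged["enrichment"]["classes"] = annotations.get("classes", [])
    return staged                   # write back to the staging repository afterwards

if __name__ == "__main__":
    doc = {"text": "Patients received 50 mg of compound ABC-123 twice daily."}
    print(enrich(doc)["enrichment"])
```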
Aspire Content Processing helps acquire and enrich unstructured data at scale, enabling organizations to understand and make better use of the data stored in their data lakes.
While this blog discusses a framework for a pharmaceutical client, the techniques can be applied in multiple domains across industries.
What's your approach to unlocking insights from your enterprise unstructured data? Connect with us to discuss your use cases.