Over the past few years, our team has worked with several clients to develop data lakes for storing enterprise-wide content. A data lake is a large storage repository that holds a vast amount of raw data in its native format until it is needed. Once the content is in the data lake, it can be searched, enriched, and used to generate insights to solve business problems and support diverse user needs.

In a recent project for a pharmaceutical client, we tackled a different kind of challenge: ingesting over one petabyte (PB) of unstructured data into their data lake. To put this into perspective, according to Computer Weekly,

"One petabyte is enough to store the DNA of the entire population of the US – and then clone them, twice."

 

What is unstructured data?

Unstructured data refers to images, voice recordings, videos, and text documents written by humans for humans. Text can include PDFs, presentations, memos, emails, research and regulatory reports, and social media posts. Unstructured data generally lacks a predefined model to describe its content. The absence of consistent descriptive metadata poses challenges for applications seeking to generate insights from this data.

Fortunately, content processing, tagging, and connector technologies are available for acquiring unstructured text and enriching it with metadata. However, technology alone is rarely enough. Unstructured data enrichment requires careful planning to extract the text from the content and generate metadata that helps business systems (such as search engines and analytics applications) make sense of the information.
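
As a simple illustration of text extraction and metadata capture, the sketch below uses the open-source Apache Tika parser via the tika Python package. It shows the general technique only; it is not the Aspire content-processing pipeline, and the output field names are our own.

```python
# Minimal sketch: extract full text and metadata from one unstructured document.
# Uses the open-source Apache Tika parser (pip install tika); illustrative only,
# not the Aspire content-processing pipeline.
from tika import parser

def extract_document(path: str) -> dict:
    """Return the extracted full text plus source metadata for one file."""
    parsed = parser.from_file(path)   # sends the file to a local Tika server
    return {
        "source_path": path,
        "full_text": (parsed.get("content") or "").strip(),
        "metadata": parsed.get("metadata", {}),  # author, dates, MIME type, ...
    }

if __name__ == "__main__":
    doc = extract_document("example_report.pdf")  # hypothetical file name
    print(doc["metadata"].get("Content-Type"), len(doc["full_text"]), "characters")
```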

 

The business problem

Our client had over 1 PB of research data stored in various systems, ranging from Windows and Unix (NFS) file shares to Documentum and SharePoint. Numerous data types were found in these sources – PDFs, MS Office documents, email .msg files, images, and miscellaneous text formats. The data lake for this project is hosted in a Cloudera cluster. It’s worth noting that while we used Cloudera for this particular project, the tools and techniques can be applied to other data lake platforms.

Typically, pharmaceutical organizations seek to search and run analytics over unstructured data to derive insights from past research, respond to regulatory compliance requests, and fulfill other needs. So, our immediate business problem was: 

The "Petabyte" challenge

Our client had over 1 PB of research data residing in various systems. How can all of this content be stored in the data lake in a way that is useful for addressing different business needs?

 

Make sense of unstructured data in a data lake

In this project, we tackled two important problems related to unstructured data found in many data lake implementations:

  • How to simplify data lake ingestion, especially for large volumes of unstructured data
  • How to ensure the content can be reused and repurposed within the data lake

The solution embedded Accenture’s Aspire Content Processing technology into the data lake as a Cloudera Service. It's also worth noting that, while this blog discusses a pharmaceutical use case, the technique can be extended to a wide range of domains and use cases.

 

Simplify data lake ingestion

Deployed as a Cloudera Parcel, Aspire resides and runs in the Cloudera Cluster as a Cloudera Service. This yields several benefits:

Aspire deployed as a Cloudera service

Aspire can be deployed as a Cloudera Service and communicates natively with the data lake for content storage and indexing.

  • Aspire instances can be deployed across many nodes in the cluster to improve performance.
  • Aspire leverages the same authentication protocols defined for the cluster, such as Kerberos and LDAP, simplifying integration and satisfying data lake security requirements.

First-class unstructured connectors can be deployed in a Hadoop cluster

The Aspire connector framework takes over responsibility for acquiring data from multiple sources.

  • Purpose-built connectors can acquire binaries, metadata, and access control lists related to content in enterprise data systems (PDFs, Office documents, lab notebook reports).
  • Unlike push-based ETL models, external systems are not responsible for sending content to the data lake; they do not need to be data-lake-aware.

New connectors can be added to the data lake as needed; examples include Documentum, file shares (CIFS and NFS), SharePoint, relational databases (via JDBC drivers), Impala, Hive, and Kafka.
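
To make the connector model concrete, here is a hypothetical sketch of the record a pull-style connector might yield for each crawled item: the binary, its source metadata, and its access control list. The class and function names are illustrative, not the Aspire connector API.

```python
# Hypothetical sketch of what a pull-style connector yields per crawled item:
# the binary content, the source metadata, and the access control list (ACL).
# Names are illustrative only; this is not the Aspire connector API.
import os
from dataclasses import dataclass, field

@dataclass
class CrawledItem:
    source_system: str          # e.g. "nfs", "sharepoint", "documentum"
    item_id: str                # stable path or id within the source
    binary: bytes               # raw file content (PDF, .docx, .msg, ...)
    metadata: dict = field(default_factory=dict)  # title, author, dates, ...
    acl: list = field(default_factory=list)       # users/groups allowed to read

def crawl_file_share(root: str):
    """Toy connector for a mounted file share: the connector pulls content,
    so the source system never needs to push or be data-lake-aware."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                data = handle.read()
            stat = os.stat(path)
            yield CrawledItem(
                source_system="nfs",
                item_id=path,
                binary=data,
                metadata={"size_bytes": stat.st_size, "modified": stat.st_mtime},
            )
```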

Scalability through parallel data acquisition

Aspire can be deployed as a service across many nodes in the cluster. This means that many copies of the connector can run in parallel, delivering high throughput rates.
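
A minimal sketch of the idea, assuming the crawl space can be partitioned (here, by top-level folder) and each partition handed to its own worker. In the actual deployment the workers are Aspire connector instances running on cluster nodes; the local thread pool below only illustrates the pattern.

```python
# Minimal sketch of parallel acquisition: partition the crawl space by
# top-level folder and crawl the partitions concurrently. Illustrative only;
# in the real deployment the workers are Aspire instances on cluster nodes.
import os
from concurrent.futures import ThreadPoolExecutor

def crawl_partition(folder: str) -> int:
    """Crawl one partition and return the number of items found (stub)."""
    count = 0
    for _dirpath, _dirnames, filenames in os.walk(folder):
        count += len(filenames)  # a real worker would extract and publish each file
    return count

def parallel_crawl(root: str, workers: int = 8) -> int:
    partitions = [entry.path for entry in os.scandir(root) if entry.is_dir()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(crawl_partition, partitions))
```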

The data flow for this system is depicted below.

Figure 1: Dataflow Overview

 

Efficient content re-use with a staging repository

A staging repository is central to this data lake architecture.

  • The ingestion stage uses connectors to acquire data and publishes it to the staging repository.
  • The indexing stage picks up the data from the repository and indexes it or publishes it to other systems.

For several years now, we have been using staging repositories to augment data ingestion. This is what we call a "crawl once and reprocess as needed" approach.

So, how can a data lake make use of this architecture?

Ingestion workflow and the staging repository

First, the ingest workflow acquires the content and performs light processing such as text extraction. Everything captured, including metadata, access control lists, and the extracted full text of the content, is stored as JSON in the NoSQL staging repository. Binary files (such as PDFs) can be stored in the data lake as well to support future use cases. Connectors can use incremental update procedures to keep the content in the data lake in sync as it changes in the enterprise data sources.
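
As a sketch of what a staged record might look like, the snippet below stores one document's metadata, ACL, and extracted text as JSON in an HBase table via the happybase client. The host, table name, column family, and field names are assumptions for illustration, not the project's actual schema.

```python
# Sketch of the final ingest step: persist everything captured for one document
# as JSON in a NoSQL staging repository (here, HBase via happybase).
# Host, table, column family, and field names are illustrative assumptions.
import json
import happybase

def stage_document(doc_id: str, full_text: str, metadata: dict, acl: list) -> None:
    record = {
        "id": doc_id,
        "metadata": metadata,    # source metadata captured by the connector
        "acl": acl,              # access control list from the source system
        "full_text": full_text,  # text extracted during light processing
    }
    connection = happybase.Connection("staging-hbase-host")  # hypothetical host
    try:
        table = connection.table("staging_repository")       # hypothetical table
        table.put(doc_id.encode("utf-8"),
                  {b"doc:json": json.dumps(record).encode("utf-8")})
    finally:
        connection.close()
```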

Figure 2: Ingest Dataflow

Saving significant processing time

A staging repository deployed in this fashion can save days or even weeks of processing time. In some cases, it can take weeks to ingest data because of performance limitations or access restrictions in the native sources. Storing all of the data in the repository allows us to crawl once and then repurpose the content as needed in the indexing workflow.

Indexing workflow

A search engine can support every stage of a data lake project, but it plays an especially critical role in the early stages. Initially, our client indexed the data lake content with search engines to understand what content had been brought into the lake. In later stages, they created highly curated indexes to support specific business use cases.

A first-cut analysis may support a review of the content along multiple dimensions (a minimal indexing sketch follows the list below):

  • By type: In many cases, a single unstructured content source may include different types of content (reports, memos, or lab experiments), as well as some low-value information like logs or working files.
  • By use case: Different subsets of the content may support different business use cases.
  • By provenance/authorship: Knowing the department, lab, or researcher who generated content is essential.
  • By publishing/creation date: Analyzing data lake content by date helps understand trends and developments over time.
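
The sketch below illustrates the indexing stage under simple assumptions: staged JSON records are read from the repository and published to a Solr collection with facetable fields for type, provenance, and date. Cloudera Search is Solr-based, so pysolr is used as an example client; the collection URL, field names, and the load_staged_records() helper are illustrative.

```python
# Sketch of the indexing stage: read staged JSON records and publish them to a
# search index with facetable fields for type, provenance, and date.
# Collection URL, field names, and load_staged_records() are illustrative.
import pysolr

def index_staged_records(records, solr_url="http://solr-host:8983/solr/research_docs"):
    solr = pysolr.Solr(solr_url, timeout=30)  # hypothetical Solr collection
    solr.add([
        {
            "id": record["id"],
            "content_type": record["metadata"].get("content_type"),  # facet: by type
            "department": record["metadata"].get("department"),      # facet: by provenance
            "created_date": record["metadata"].get("created"),       # facet: by date
            "full_text": record["full_text"],
        }
        for record in records
    ], commit=True)

# Usage (load_staged_records is a hypothetical reader over the staging repository):
# index_staged_records(load_staged_records())
```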

Figure 3: Indexing Dataflow

Repurposing and enriching content

As depicted in Figure 3, a search engine embedded in a big data framework can generate numerous search indexes. Each index can be tailored to the needs of a specific user community within the organization, allowing the client to view the content from various perspectives.

Enrichment workflow

The value of the content in the staging repository increases as we know more about it. The content can be classified and tagged over time using ontologies created by the organization or from external domain-specific sources and APIs.

The data lake staging repository supports this use case nicely with the addition of an Aspire enrichment workflow. An enrichment workflow (Figure 4) passes content through REST APIs or other mechanisms to classify documents and perform named entity recognition using domain-specific ontologies or local terminologies.

The workflow can be triggered by one of two scenarios:

  • A new document arrives in the repository and needs enrichment.
  • An ontology is updated, making the previous tagging “out of date”; tagging must be repeated so that older content can benefit from the new terms.

Enriching content in the staging repository saves time and keeps the content continuously up to date.
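
A minimal sketch of such an enrichment step, assuming a hypothetical REST tagging endpoint: the document's text is posted to the service, and the returned entities are written back onto the staged record together with the ontology version, so that re-tagging can be triggered when the ontology changes. The endpoint URL and request/response shapes are illustrative only.

```python
# Sketch of an enrichment step: send a staged document's text to a tagging /
# named-entity-recognition service over REST and record the resulting tags.
# The endpoint URL and the request/response shapes are hypothetical.
import requests

def enrich_document(record: dict, ontology_version: str,
                    endpoint: str = "https://tagger.example.com/api/tag") -> dict:
    response = requests.post(
        endpoint,
        json={"text": record["full_text"], "ontology": ontology_version},
        timeout=60,
    )
    response.raise_for_status()
    # Keep the tags and the ontology version with the record so stale tagging
    # can be detected and repeated when the ontology is updated.
    record["tags"] = response.json().get("entities", [])
    record["ontology_version"] = ontology_version
    return record
```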

Figure 4: Aspire Enrichment Dataflow

Aspire Content Processing helps acquire and enrich unstructured data at scale, enabling organizations to understand and make better use of the data stored in their data lakes.

While this blog discusses a framework for a pharmaceutical client, the techniques can be applied in multiple domains across industries. 

What's your approach to unlocking insights from your enterprise unstructured data? Connect with us to discuss your use cases.

Derek Rodriguez

Technology Architecture Sr. Manager – Accenture Applied Intelligence
