What is intelligent document analysis?

Intelligent Document Analysis (IDA) is the use of Natural Language Processing (NLP) and Machine Learning to derive insights from unstructured data – text documents, social media posts, mail, images, etc. Because an estimated 80% of all enterprise data is unstructured, IDA can deliver tangible benefits across industries and business functions, such as improving compliance and risk management, increasing internal operational efficiency, and enhancing business processes.

In this blog, I will describe the main NLP techniques used in IDA and provide examples of various business use cases. I will also discuss some key considerations for starting your first IDA project.

Intelligent document analysis techniques

Below are seven common IDA techniques, each illustrated with example business use cases.

1.  Named Entity Recognition

Named Entity Recognition identifies named entity mentions within the text and classifies them into predefined categories, such as person names, organisations, locations, time expressions, monetary values, etc. There is a range of approaches for performing Named Entity Recognition:

  • Out-of-the-box entity recognition – Most NLP packages or services include pre-trained machine learning models for identifying entities. This makes it very easy to identify key entity types such as person names, organisations, and locations with just a simple API call and without the need to train a machine learning model.
  • Machine-learned entity recognition – Out-of-the-box entities are convenient but typically generic and, in many cases, it will be necessary to identify additional entity types. For example, when processing documents in a recruitment context, we would want to identify job titles and skills. In a retail context, we would want to identify product names.
  • Deterministic entity recognition – If the entities that you want to identify are finite and pre-defined, then a deterministic approach will be easier and more accurate than training a machine learning model. In this approach, a dictionary of the entities is provided, and the entity recogniser identifies any instance of a dictionary entry in the text. For example, the dictionary could contain a list of all of a company's products. It is also possible to combine the dictionary approach with machine learning: the dictionary is used to annotate training data for a machine learning model, which then learns to identify instances of the entities that were not in the dictionary. Deterministic entity recognition is not commonly supported in out-of-the-box NLP packages or services. Some NLP packages that do support it use an ontology rather than a dictionary; the ontology defines relationships and related terms for the entities, which enables the entity recogniser to disambiguate between ambiguous entities using the context of the document.
  • Pattern-based entity recognition – If an entity type follows a predictable textual pattern, then its instances can be identified using regular expression matching. For example, product codes or citation references could be identified this way. A simplified regex for a UK National Insurance Number is [A-Z]{2}[0-9]{6}[A-Z] (2 uppercase letters, followed by 6 digits, followed by 1 uppercase letter); a minimal sketch follows this list.
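
As a rough illustration of the deterministic and pattern-based approaches, here is a minimal sketch in Python. The National Insurance Number regex is the simplified pattern from above (the real format has additional rules), and the product dictionary is purely hypothetical.

    import re

    # Simplified UK National Insurance Number pattern from the text above
    # (illustrative only; the real format has additional validation rules).
    NI_PATTERN = re.compile(r"\b[A-Z]{2}[0-9]{6}[A-Z]\b")

    # A toy dictionary for deterministic entity recognition; in practice
    # this might list all of a company's product names.
    PRODUCT_DICTIONARY = {"WidgetPro", "WidgetLite"}

    def extract_entities(text):
        """Return (entity_text, entity_type) pairs found in the text."""
        entities = [(m.group(), "NI_NUMBER") for m in NI_PATTERN.finditer(text)]
        for product in PRODUCT_DICTIONARY:
            if product in text:
                entities.append((product, "PRODUCT"))
        return entities

    print(extract_entities("Ref AB123456C: the customer ordered a WidgetPro."))
    # [('AB123456C', 'NI_NUMBER'), ('WidgetPro', 'PRODUCT')]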

Named Entity Recognition is a key pre-processing technique for many of the other IDA techniques discussed in this blog. Other example Named Entity Recognition use cases include:

  • Identifying company and fund names in a financial prospectus. In this example, the company names could be identified using an out-of-the-box model, whereas the fund names would be identified using a machine learning model, a deterministic approach, or a combination of the two.
  • Identifying references between documents in a corpus. In this example, the references can be identified using a regular expression – a pattern-based entity recognition approach.

2.  Sentiment Analysis

Sentiment Analysis identifies and categorises the opinion expressed within the text, such as news reports, social media content, reviews, etc. In its simplest form, it may categorise the sentiment as positive or negative; but it could also quantify the sentiment (e.g. -1 to +1) or categorise it at a more granular level (e.g. very negative, negative, neutral, positive, very positive).

Sentiment Analysis, like many NLP techniques, needs to be able to cope with the complexities of language. For example:

  • Negation – Words like “not” and “never” will change the sentiment of the words used. For example, “This film does not have a gripping plot or likeable characters.”
  • Level – Sentiment can be expressed in varying degrees. For example, there is increasing positivity in “I liked it,” “I loved it,” and “I absolutely loved it”, but where would “I really enjoyed it” fit in this progression?
  • Conflicting – The text may include both positive and negative sentiment. For example, should “Their first album was great, but their second album was rubbish” be considered as positive, negative, or neutral?
  • Implied – In the sentence “I’ll be angry if the delivery is late,” the negative sentiment is conditional on something that has not happened and may never happen. In the sentence “They used to be good,” a positive sentiment is expressed about the past, but a negative sentiment is perhaps implied about the present.
  • Slang – Slang can often have the opposite meaning to its conventional meaning. For example, the word “sick” would have a very different meaning depending on the context in which it is used (“The food at this restaurant made me sick” vs. “That new video game release is sick!”) or on the demographic of the author.
  • Entity level – Entity-level sentiment analysis provides a more granular understanding of the sentiment by considering the sentiment at an entity level rather than document or sentence level. This will resolve the ambiguity seen in the example in the “Conflicting” scenario (“Their first album was great, but their second album was rubbish.”). It does so by assigning a positive sentiment to the first album (the first entity) but a negative sentiment to the second album (the second entity).
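
As a minimal sketch, the snippet below uses NLTK's VADER analyser (one option among many) to produce the kind of quantified score described above. Lexicon-based analysers like this handle some of these complexities, such as simple negation, better than others, such as implied sentiment.

    # Requires: pip install nltk, plus a one-off lexicon download.
    import nltk
    nltk.download("vader_lexicon", quiet=True)
    from nltk.sentiment import SentimentIntensityAnalyzer

    sia = SentimentIntensityAnalyzer()
    for text in [
        "I absolutely loved it!",
        "This film does not have a gripping plot or likeable characters.",
    ]:
        # 'compound' is a normalised score from -1 (negative) to +1 (positive).
        print(sia.polarity_scores(text)["compound"], text)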

Sentiment Analysis is often used to analyse social media posts relating to a company or its competitors. It can be a powerful tool to:

  • Track sentiment trends over time
  • Analyse the impact of an event (e.g. a product launch or redesign)
  • Identify key influencers
  • Provide an early warning of a crisis

3.  Text Similarity

Text Similarity calculates the similarity between sentences, paragraphs, and documents.
To calculate the similarity between two items, the text must first be converted into an n-dimensional vector that represents it. This vector might contain the keywords and entities in the document or a representation of the topics expressed in the content. The similarity between the vectors, and therefore between the documents, can then be measured using techniques such as cosine similarity.
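
As a minimal sketch of this vector-and-cosine approach, the snippet below uses scikit-learn's TF-IDF vectoriser as the text representation; semantic embeddings would be a natural upgrade for the semantic-similarity use cases discussed below.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "The candidate has experience in machine learning and NLP.",
        "We are hiring a data scientist with NLP and machine learning skills.",
        "Quarterly revenue grew by five percent.",
    ]

    vectors = TfidfVectorizer().fit_transform(docs)  # one row per document
    similarities = cosine_similarity(vectors)        # pairwise similarity matrix
    print(similarities.round(2))
    # Documents 0 and 1 score much higher against each other than against 2.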

Text Similarity can be used to detect duplicates and near-duplicates in documents or parts of a document. Here are two examples:

  • Checking academic essays for plagiarism by comparing the similarity in the content of essays.
  • Matching candidates to jobs and vice versa. In this case, the concern is similarity between key characteristics (job titles, skills, etc.) rather than strict near-duplicate detection. For this type of use case, semantic similarity is useful because two skills (e.g. artificial intelligence and machine learning) or two job titles (e.g. Data Scientist and Data Architect) may be closely related even though they are not identical.

4.  Text Classification

Text Classification is used to assign an item of text to one or more categories based on its content. It has two dimensions:

  • Number of classes – The simplest form of classification is binary classification where there are only two possible classes into which an item can be classified. An example of this is spam filtering where emails are categorised as either spam or not spam. Multi-class or multinomial classification has more than two classes into which an item can be classified.
  • Number of labels – Single-label classification categorises an item into precisely one class, whereas multi-label classification can categorise an item into multiple classes. Classifying news articles into multiple subject areas is an example of multi-label classification.

In general, the lower the number of classes and labels, the higher the expected accuracy.
Text Classification uses the words, entities, and phrases in the document to predict the classes. It can also consider additional features, such as headings, metadata, or images contained in the document. A minimal sketch of a binary classifier follows.
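
As a minimal sketch, here is a binary classifier built with scikit-learn; the tiny training set is purely illustrative, and a real spam filter would be trained on thousands of labelled examples.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = [
        "Win a free prize now", "Limited offer, click here",
        "Meeting moved to 3pm", "Please review the attached report",
    ]
    labels = ["spam", "spam", "not spam", "not spam"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(texts, labels)
    print(model.predict(["Click here to claim your free offer"]))  # expected: ['spam']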

An example use case for Text Classification is the automated routing of documents, such as mail or email. Text Classification determines the queue to which a document should be sent so that it can be processed by the appropriate team of specialists (e.g. legal, marketing, finance), saving time and resources.

Text Classification can also be applied to sections of a document (e.g. sentences or paragraphs), for example, to identify the parts of a letter where complaints are being made and what type of complaint they are.

5.  Information Extraction

Information Extraction extracts structured information from unstructured text.

An example use case is identifying the sender of a letter. The primary means of identification is the sender’s reference, identification, or membership number. If this is not found, then the fall-back could be the sender’s name, postal code, and date of birth. Each of these pieces of information could be identified by Named Entity Recognition, but that alone would be insufficient because multiple instances may be found. Information Extraction builds on Named Entity Recognition: it is the understanding of the context of the entities that determines which instance is the correct answer. For example, the letter may contain multiple dates and postal codes, so it would be necessary to determine which, if any, is the sender’s date of birth and which is the sender’s postal code. A sketch of this kind of contextual disambiguation follows.
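
As an illustrative sketch of this contextual disambiguation (the date pattern and trigger phrases are simplified assumptions):

    import re

    DATE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")  # simplified date pattern

    def find_date_of_birth(text):
        """Return the first date that appears near a date-of-birth trigger phrase."""
        for match in DATE.finditer(text):
            # Use the preceding context to decide whether this date is a DOB,
            # rather than accepting any date the entity recogniser finds.
            context = text[max(0, match.start() - 30):match.start()].lower()
            if "date of birth" in context or "dob" in context:
                return match.group()
        return None

    letter = "Sent on 01/02/2024. My date of birth is 14/07/1985."
    print(find_date_of_birth(letter))  # 14/07/1985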

6. Relationship Extraction

Relationship Extraction extracts semantic relationships between two or more entities. Similar to Information Extraction, Relationship Extraction relies on Named Entity Recognition, but the difference is that it is specifically concerned with the type of relationship between the entities. Relationship Extraction can be used to perform Information Extraction.

Some NLP packages and services provide out-of-the-box models for extracting relationships, such as “employee of,” “married to,” and “location born at.” As with Named Entity Recognition, custom relationship types can be extracted by training specific machine learning models.

Relationship Extraction can be used to process unstructured documents to identify specific relationships, which can then be used to populate a Knowledge Graph.

For example, this technique can extract the relationships between diseases, symptoms, drugs, etc. by processing unstructured medical documents.
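
As a minimal sketch of this idea, the snippet below loads extracted (subject, relation, object) triples into a small knowledge graph using the networkx library; the triples are illustrative, standing in for the output of a relationship-extraction model.

    import networkx as nx

    # Illustrative triples, standing in for model output.
    triples = [
        ("aspirin", "treats", "headache"),
        ("aspirin", "may_cause", "stomach irritation"),
        ("headache", "symptom_of", "migraine"),
    ]

    graph = nx.DiGraph()
    for subject, relation, obj in triples:
        graph.add_edge(subject, obj, relation=relation)

    # Query the graph: what do we know about aspirin?
    for _, obj, data in graph.edges("aspirin", data=True):
        print(f"aspirin --{data['relation']}--> {obj}")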

7.  Summarisation

Summarisation shortens text to create a coherent summary of the main points. There are two different approaches to Text Summarisation:

  • Extraction-based summarisation extracts sentences or phrases without modifying the original text. This approach generates a summary composed of the top N most important sentences from the document (a minimal sketch follows this list).
  • Abstraction-based summarisation uses Natural Language Generation to paraphrase and condense the document. This is much more complex and experimental than the extraction-based approach.
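
As a rough sketch of the extraction-based approach, the toy function below scores each sentence by the frequency of the words it contains and keeps the top N. Production summarisers use far more sophisticated scoring; at a minimum they would remove stopwords and normalise for sentence length.

    import re
    from collections import Counter

    def summarise(text, n=2):
        """Return the n highest-scoring sentences, in their original order."""
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        freq = Counter(re.findall(r"\w+", text.lower()))
        # Score each sentence by the total document frequency of its words.
        scored = sorted(
            sentences,
            key=lambda s: sum(freq[w] for w in re.findall(r"\w+", s.lower())),
            reverse=True,
        )
        top = set(scored[:n])
        return " ".join(s for s in sentences if s in top)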

Text Summarisation can be used to enable humans to quickly digest the content of large volumes of documents without needing to read them in full. Examples include news feeds and scientific publications, where a large volume of documents is constantly being generated.


Complexities of intelligent document analysis tasks

Machine learning tends to be much more complex on unstructured text than on structured data, so it is much harder to achieve or surpass human-level performance when analysing text documents.

1.  Language complexity

It takes humans years to understand language because of the variation, ambiguity, context, and relationships it contains. There are many ways to express the same idea: we use different styles depending on the author and audience, and we choose synonyms to add interest and avoid repetition. IDA techniques must make sense of these different styles, ambiguities, and word relationships to derive accurate insights.

IDA requires the understanding of both general language and domain-specific terminology. One approach for handling domain-specific terminology is to use custom dictionaries or build custom machine learning models for entity extraction, relationship extraction, etc.

An alternative approach to combining general language with domain-specific terminology is Transfer Learning. This takes an existing Neural Network which has been trained on huge volumes of general text, adds extra layers, and trains the combined model on a smaller amount of content specific to the problem. The existing Neural Network is analogous to the years of understanding that a human develops at school; the extra layers are analogous to the domain- or task-specific learning that happens when the person leaves school and starts working. A minimal sketch of this setup follows.
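
As a hedged sketch of this setup using the Hugging Face transformers library (the model name "bert-base-uncased" is one common choice, not a recommendation):

    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # The pre-trained encoder is the "years of schooling"; num_labels adds a
    # new, randomly initialised classification layer on top for our task.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased",
        num_labels=2,
    )

    # The combined model would then be fine-tuned on a much smaller set of
    # domain-specific labelled examples, e.g. with the transformers Trainer API.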

2.  Accuracy

The accuracy for IDA techniques depends on the variation, style, and complexity of the language used. It can also depend on:

  • Training data – The quality of a machine learning model depends on the volume and quality of the training data.
  • Number of classes – The accuracy of techniques such as Text Classification, Sentiment Analysis, Entity Extraction, and Relationship Extraction will vary depending on the number of classes or entity/relationship types and the overlap between them.
  • Document size – For some techniques, such as Text Classification and Similarity, large documents are helpful because they provide more context. Other techniques including Sentiment Analysis and Summarisation are harder on large documents.

NLP-progress is a website that tracks the accuracy of state-of-the-art models on the most common NLP tasks, and it provides a useful guide to the level of accuracy that is currently achievable. The best guide, though, to whether IDA will generate accurate results is to ask yourself: “How easy would it be for a human to do this?” If a human can learn to do the task accurately without years of training, then IDA has the potential to deliver benefits by speeding up the process, maintaining consistency, and reducing manual labour.

How do I tackle an intelligent document analysis project?

An IDA project can be integrated into a business in one of two ways:

  • Automation – IDA is used to automate an existing or new process without any human intervention.
  • Human-in-the-loop – IDA is used to support a human in making a decision, but the human has the final responsibility.

The approach used should depend on the accuracy achieved by IDA and the cost of making incorrect decisions. If the cost of incorrect decisions is high, then consider starting with human-in-the-loop until the accuracy is high enough.

IDA projects are best tackled iteratively – start with a proof of concept to determine if the approach is feasible and, if so, whether the achieved accuracy indicates the use of automation or human-in-the-loop. Then iteratively add complexity until the estimated effort does not justify the expected gains.

For your first IDA project, consider these steps:

  • pick a use case which either has a low cost of incorrect decisions or where a human makes the final decision;
  • start with a proof of concept to determine if the approach is feasible; and
  • iteratively add complexity to increase the accuracy of the application.


This process will allow you to become familiar with the techniques, and your business sponsors to gain confidence in them, before you tackle the more complex use cases with higher benefits.

With thorough planning and a sound implementation strategy, your organisation can leverage the NLP and machine learning techniques discussed above to build IDA applications that improve business outcomes.


Mark Stanger

Functional & Industry Analytics Sr. Manager – Accenture Applied Intelligence
