Intelligent document analysis with natural language processing
June 6, 2019
Intelligent Document Analysis (IDA) is the use of Natural Language Processing (NLP) and Machine Learning to derive insights from unstructured data – text documents, social media posts, mail, images, etc. As 80% of all enterprise data is unstructured, IDA can deliver tangible benefits across industries and business functions, such as improving compliance and risk management, increasing internal operational efficiency, and enhancing business processes.
In this blog, I will describe the main NLP techniques used in IDA and provide examples of various business use cases. I will also discuss some key considerations for starting your first IDA project.
Below are seven common IDA techniques. Example use cases will be provided to explain each technique.
1. Named Entity Recognition
Named Entity Recognition identifies named entity mentions within the text and classifies them into predefined categories, such as person names, organisations, locations, time expressions, monetary values, etc. There is a range of approaches for performing Named Entity Recognition, from dictionary- and rule-based methods through to statistical machine learning and deep learning models.
Named Entity Recognition is a key pre-processing technique for many of the other IDA techniques discussed in this blog. It is also valuable on its own, for example to detect and anonymise personal information within documents.
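As a minimal sketch, here is how Named Entity Recognition might look using the open-source spaCy library; the model name and example text are illustrative, not taken from a real project:

```python
# A minimal Named Entity Recognition sketch using the open-source spaCy library.
# Assumes the small English model is installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sarah Jones joined Acme Corp in London on 4 March 2019 for £50,000.")

# Each detected entity span carries a predicted category,
# e.g. PERSON, ORG, GPE (location), DATE, MONEY.
for ent in doc.ents:
    print(ent.text, ent.label_)
```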
2. Sentiment Analysis
Sentiment Analysis identifies and categorises the opinion expressed within the text, such as news reports, social media content, reviews, etc. In its simplest form, it may categorise the sentiment as positive or negative; but it could also quantify the sentiment (e.g. -1 to +1) or categorise it at a more granular level (e.g. very negative, negative, neutral, positive, very positive).
Sentiment Analysis, like many NLP techniques, needs to be able to cope with the complexities of language. For example, negation (“the service was not good”) and sarcasm can completely invert the sentiment of the individual words used.
Sentiment Analysis is often used to analyse social media posts relating to a company or its competitors. It can be a powerful tool for monitoring brand perception and understanding how customers are reacting to products, services, and campaigns.
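As a sketch of the quantified form described above, the VADER model bundled with NLTK scores text on a compound scale from -1 to +1; the example posts are invented:

```python
# A minimal Sentiment Analysis sketch using NLTK's VADER model, which scores
# text on a compound scale from -1 (very negative) to +1 (very positive).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-off download of the sentiment lexicon
sia = SentimentIntensityAnalyzer()

for post in ["The new app is brilliant!", "Worst customer service I have ever had."]:
    print(post, "->", sia.polarity_scores(post)["compound"])
```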
3. Text Similarity
Text Similarity calculates the similarity between sentences, paragraphs, and documents.
To calculate the similarity between two items, the text must first be converted into an n-dimensional vector which represents the text. This vector might contain the keywords and entities in the document or a representation of the topics expressed in the content. The similarity between the vectors and therefore the documents can then be measured by techniques such as cosine similarity.
Text Similarity can be used to detect duplicates and near-duplicates in documents or parts of a document, for example to flag plagiarism or to recognise that an incoming letter is a near-copy of one that has already been processed.
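A minimal sketch of the vector-plus-cosine approach described above, using TF-IDF vectors from scikit-learn; the documents are invented:

```python
# A minimal Text Similarity sketch: represent each document as a TF-IDF vector,
# then compare the vectors with cosine similarity (values near 1.0 indicate
# near-duplicates).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The claimant reported water damage to the kitchen ceiling.",
    "Water damage to the kitchen ceiling was reported by the claimant.",
    "Quarterly sales figures exceeded the forecast.",
]

vectors = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(vectors))  # 3x3 similarity matrix
```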
4. Text Classification
Text Classification is used to assign an item of text to one or more categories based on its content. It has two dimensions:
- the number of classes: binary classification chooses between two classes, while multi-class classification chooses between more than two; and
- the number of labels: single-label classification assigns exactly one class to each item, while multi-label classification can assign several.
In general, the lower the number of classes and labels, the higher the expected accuracy.
Text Classification will use the words, entities, and phrases in the document to predict the classes. It could also consider additional features, such as any headings, metadata, or images contained in the document.
An example use case for Text Classification is the automated routing of documents, such as mail or email. Text Classification determines the queue to which a document should be sent so that it can be processed by the appropriate team of specialists (e.g. legal, marketing, finance, etc.), thus saving time and resources.
Text Classification can also be applied to sections of a document (e.g. sentences or paragraphs), for example, to identify the parts of a letter where complaints are being made and what type of complaint they are.
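As a sketch of the document-routing use case, a simple single-label classifier can be trained on historical routing decisions; the tiny training set below is invented and far too small for real use:

```python
# A minimal single-label Text Classification sketch for routing documents to
# specialist teams. The training set is purely illustrative; a real model
# would need many more labelled examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Please review the attached contract amendments.",
    "Our latest campaign doubled website traffic.",
    "The invoice for the second quarter remains unpaid.",
    "We may face litigation over the licensing terms.",
    "Budget approval is needed for the new advertising push.",
    "Payment terms on the purchase order were changed.",
]
teams = ["legal", "marketing", "finance", "legal", "marketing", "finance"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, teams)

print(model.predict(["There is a dispute over clause 4 of the agreement."]))
```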
5. Information Extraction
Information Extraction extracts structured information from unstructured text.
An example use case is to identify the sender of a letter. The primary means of identification is the sender’s reference, identification, or membership number. If this is not found, the fall-back could be the sender’s name, postal code, and date of birth. Each of these pieces of information could be identified by Named Entity Recognition, but that on its own would be insufficient because multiple instances may be found. Information Extraction builds on Named Entity Recognition: it is the understanding of the context of the entities that helps to determine which is the correct answer. For example, the letter may contain multiple dates and postal codes, so it is necessary to determine which, if any, is the sender’s date of birth and which is the sender’s postal code.
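A deliberately naive sketch of this idea: Named Entity Recognition finds every DATE entity, and a simple context rule then picks out the sender’s date of birth (the letter text and the five-token context window are illustrative):

```python
# A minimal Information Extraction sketch: spaCy's Named Entity Recognition
# finds all DATE entities, then a naive context rule decides which one is the
# sender's date of birth. Real systems use far more robust contextual models.
import spacy

nlp = spacy.load("en_core_web_sm")
letter = ("I am writing on 6 June 2019 about my pension. "
          "My date of birth is 12 May 1956 and my policy started on 1 April 2001.")

doc = nlp(letter)
for ent in doc.ents:
    if ent.label_ == "DATE":
        # Inspect the few tokens before the entity to decide which date this is.
        context = doc[max(ent.start - 5, 0):ent.start].text.lower()
        if "birth" in context:
            print("Date of birth:", ent.text)
```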
6. Relationship Extraction
Relationship Extraction extracts semantic relationships between two or more entities. Similar to Information Extraction, Relationship Extraction relies on Named Entity Recognition, but the difference is that it is specifically concerned with the type of relationship between the entities. Relationship Extraction can be used to perform Information Extraction.
Some NLP packages and services provide out-of-the-box models for extracting relationships, such as “employee of,” “married to,” and “born in.” As with Named Entity Recognition, custom relationship types can be extracted by training specific machine learning models.
Relationship Extraction can be used to process unstructured documents to identify specific relationships which can then be used to populate a Knowledge Graph.
For example, this technique can extract the relationships between diseases, symptoms, drugs, etc. by processing unstructured medical documents.
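A highly simplified sketch of the idea, using spaCy’s dependency parse to pull out (subject, verb, object) triples that could seed a Knowledge Graph; production systems would use trained relation-classification models rather than this single rule:

```python
# A highly simplified Relationship Extraction sketch: walk spaCy's dependency
# parse and emit (subject, verb, object) triples as candidate relationships.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Aspirin treats headaches. Ibuprofen reduces inflammation.")

for token in doc:
    if token.pos_ == "VERB":
        subjects = [t for t in token.lefts if t.dep_ in ("nsubj", "nsubjpass")]
        objects = [t for t in token.rights if t.dep_ == "dobj"]
        for s in subjects:
            for o in objects:
                # e.g. ('Aspirin', 'treats', 'headaches')
                print((s.text, token.text, o.text))
```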
7. Summarisation
Summarisation shortens text to create a coherent summary of the main points. There are two different approaches to Text Summarisation:
- extractive summarisation, which selects the key sentences from the original text and combines them to form the summary; and
- abstractive summarisation, which generates new sentences that convey the main points, much as a human would.
Text Summarisation can be used to enable humans to quickly digest the content of large volumes of documents without the need to read them fully. Examples include news feeds and scientific publications, where a large volume of documents is constantly being generated.
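A minimal extractive sketch: score each sentence by the frequency of its non-stopword words and keep the top-scoring sentences (the scoring rule is deliberately simple and purely illustrative):

```python
# A minimal extractive Summarisation sketch: score sentences by the frequency
# of their non-stopword words and keep the top n. Abstractive summarisation
# would instead require a trained generative model.
from collections import Counter
import nltk

nltk.download("punkt")      # sentence and word tokenisers
nltk.download("stopwords")  # common words to ignore when scoring
from nltk.corpus import stopwords

def summarise(text, n_sentences=2):
    stop = set(stopwords.words("english"))
    words = [w.lower() for w in nltk.word_tokenize(text)
             if w.isalpha() and w.lower() not in stop]
    freq = Counter(words)
    sentences = nltk.sent_tokenize(text)
    ranked = sorted(sentences, reverse=True,
                    key=lambda s: sum(freq[w.lower()] for w in nltk.word_tokenize(s)))
    top = set(ranked[:n_sentences])
    # Keep the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in top)
```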
Key considerations for your first IDA project
Machine learning on unstructured text tends to be much more complex than on structured data, so it is much harder to achieve or surpass human-level performance when analysing text documents.
1. Language complexity
It takes humans years to understand language because of the variation, ambiguity, context, and relationships that it contains. There are many ways to express the same idea: styles vary from author to author and audience to audience, and we choose synonyms to add interest and avoid repetition. IDA techniques must be able to make sense of these different styles, ambiguities, and word relationships to derive accurate insights.
IDA requires the understanding of both general language and domain-specific terminology. One approach for handling domain-specific terminology is to use custom dictionaries or build custom machine learning models for entity extraction, relationship extraction, etc.
An alternative approach to tackling the problem of combining general language and domain-specific terminology is Transfer Learning. This takes an existing Neural Network which has been trained on huge volumes of general text and then adds extra layers and trains the combined model using a smaller amount of content which is specific to the problem. The existing Neural Network is analogous to the years of understanding that a human develops whilst at school. The extra layers are analogous to the domain or task-specific learning which happens when the person leaves school and starts working.
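A sketch of this idea using the Hugging Face transformers library: the pretrained BERT body supplies the general language understanding, while the newly initialised classification head is the extra, task-specific layer that is then fine-tuned on the smaller domain dataset (the model name and label count are illustrative, and the fine-tuning loop itself is omitted):

```python
# A minimal Transfer Learning sketch using the Hugging Face transformers
# library. The pretrained BERT body provides general language understanding;
# the new, randomly initialised classification head is then fine-tuned on a
# smaller, domain-specific dataset (fine-tuning loop omitted here).
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3,  # illustrative: e.g. three domain-specific document classes
)
# `model` is now ready to be fine-tuned on the domain-specific corpus.
```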
2. Accuracy
The accuracy of IDA techniques depends on the variation, style, and complexity of the language used. It can also depend on factors such as the quality of the source documents (e.g. OCR errors in scanned text) and the volume of representative training data available.
NLP-progress is a website which tracks the accuracy of state-of-the-art models on the most common NLP tasks. This provides a useful guide to the level of accuracy which is possible. The best guide, though, for whether IDA will generate accurate results is to ask yourself “How easy would it be for a human to do this?” If a human can learn to do the task accurately without years of training then IDA has the potential to deliver benefits by speeding up the process, maintaining consistency, or reducing manual labour.
An IDA project can be integrated into a business in one of two ways:
- automation, where the IDA application makes the decision and triggers the resulting action itself; or
- human-in-the-loop, where the IDA application recommends a decision but a human makes the final call.
The approach used should depend on the accuracy achieved by IDA and the cost of making incorrect decisions. If the cost of incorrect decisions is high, then consider starting with human-in-the-loop until the accuracy is high enough.
IDA projects are best tackled iteratively – start with a proof of concept to determine if the approach is feasible and, if so, whether the achieved accuracy indicates the use of automation or human-in-the-loop. Then iteratively add complexity until the estimated effort does not justify the expected gains.
For your first IDA project, consider these steps:
- pick a use case which either has a low cost of incorrect decisions or where a human makes the final decision;
- start with a proof of concept to determine if the approach is feasible; and
- iteratively add complexity to increase the accuracy of the application.
This process will allow you to become familiar with the techniques and give your business sponsors confidence in them before you tackle the more complex use cases with higher benefits.
With a thorough planning and implementation strategy, your organisation can leverage the NLP and machine learning techniques discussed above to build IDA applications that improve business outcomes.