Part 2 of the “We’ve Got Chemistry!” blog series

 

In Part 1 of this series, we covered various data preparation needs within a chemical search application. Consider revisiting that post if you need a refresher. In this second part, we’ll cover how to implement a chemical search application.

Reference architecture for a chemical search application

The following diagram shows a simplified reference architecture for a chemical search application.

<<< Start >>>

chemical search application architecture

High-level architecture for a chemical search application - an integration of different tools

<<< End >>>

  • The left side of this diagram (Documents -> Data Loader -> Search Engine) illustrates the data preparation components that I discussed in part 1 of this series. One item that's not shown in the diagram, for simplicity, is a data repository for all content acquired for the system. You may call it a data lake. You may think of it as a staging repository. Even though it's omitted in this specific diagram, we recommend such a repository for storing raw data (‘as-is’ from the sources) along with the processed, enriched data to be published to the search engine for indexing.
  • The right side of the diagram (Search Engine -> Enterprise Search UI -> Web Browser) illustrates the specific components on the search side of the application, which I’ll be discussing in detail in this post.

Chemical search approaches

First, we need to elaborate on the search scenarios to be served by the application. Given a molecule (or compound) as input, chemists could look for matching molecules (or compounds) in at least three possible ways:

  1. Similarity search. Match chemical structures in the index that are similar to the structure provided as input.
  2. Substructure search. When the input chemical structure is contained within indexed chemical structures, return those larger structures as matches.
  3. Exact match. The most restrictive of these approaches. A match is made when the indexed chemical structure is equal to that provided as input in the query.

Conceptually, the descriptions of the search types clearly differentiate the goal for each. Except for the exact match case, there are nuances in the other two that can have a big impact on the quality of the search results or the time it takes to get them back.

To understand the complexity of chemical searches, we must consider the multiple chemical properties that may be involved. For those of us that are not chemists, Wikipedia defines a chemical property as “any of a material's properties that becomes evident during, or after, a chemical reaction.” Thus, chemical properties should be included in matching a query chemical structure against those in the index.

Generating and indexing molecular fingerprints

Because there isn’t a universal standard to represent chemical structures, chemical search is usually implemented based on the statistical concept of similarity measure applied to molecular fingerprints. A molecular fingerprint is yet another representation of a molecule, in addition to those described in the Normalization section in part 1 of this blog series.

A chemical structure is mapped to a molecular fingerprint using an array of bits, where each bit is on or off (true or false, 1 or 0) depending on the descriptors of that molecule (or compound). Each molecule is thus represented by a somewhat unique fingerprint. Why somewhat unique and not just unique? It depends partly on the fingerprint algorithm of choice and its length, and partly on the chemical representation used to generate the fingerprint.
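To make the bit-array idea concrete, here is a toy sketch that folds text fragments of a SMILES string into a short bit array. It is an illustration under loud assumptions: `toy_fingerprint`, the 64-bit length, and the use of raw substrings are all made up for this example. Real chemical toolkits derive fingerprints from the parsed molecular structure, not from the text.

```python
import hashlib

FP_BITS = 64  # deliberately short (an assumption) so collisions are plausible


def toy_fingerprint(smiles: str, n_bits: int = FP_BITS, max_len: int = 3) -> int:
    """Fold every substring of up to max_len characters into n_bits bits.

    Toy only: a real toolkit works on the molecular graph, not raw text.
    The fingerprint is returned as an int whose binary digits are the bits.
    """
    bits = 0
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            digest = hashlib.sha256(fragment.encode("utf-8")).digest()
            position = int.from_bytes(digest[:4], "big") % n_bits
            bits |= 1 << position  # turn the bit for this fragment on
    return bits


# "CCO" and "OCC" are two ways of writing ethanol in SMILES.
ethanol = toy_fingerprint("CCO")
same_molecule_other_notation = toy_fingerprint("OCC")

print(format(ethanol, f"0{FP_BITS}b"))
print(ethanol == same_molecule_other_notation)
```

Two things to notice. Different fragments can land on the same bit position, so two different molecules may share a fingerprint. And because this toy reads the text as given, two notations for the same molecule may produce different fingerprints, which is one reason to normalize representations before generating fingerprints.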

<<< Start >>>

generate chemical structure fingerprints to enable search

Generating a molecular fingerprint to enable search

<<< End >>>

Each molecule has two fingerprints, a similarity fingerprint and a substructure fingerprint, to support the two search approaches mentioned above (similarity search and substructure search). Each is calculated differently, hence the need for two fingerprints for every molecule or compound. Although fingerprints may still be used, substructure matches may also be determined by considering one or more chemical properties (e.g. stereochemistry, tautomers, isotopes, sizes, etc.) as additional criteria. It's important to clearly determine the types of substructure search that your chemical search application should offer so that you can build it accordingly.

The exact match is a special case where the entire molecule or compound in the query is identical to the one matched in the retrieved document. As a result, no specific representation is necessary for an exact match, but some validation is required.

Searching for chemical structures

Likewise, a query with a chemical structure requires the calculation of a query fingerprint. Either the substructure or the similarity fingerprint will be calculated, depending on which search is intended. This query chemical structure fingerprint is used by the search engine to match against the corresponding indexed fingerprints. Let’s call the matched search results “the candidate fingerprints.”
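For the substructure case, one common way to select candidates is a bit screen: if the query is contained in an indexed structure, every bit set in the query fingerprint should also be set in the indexed one. The sketch below assumes fingerprints stored as Python integers and uses hand-made bit patterns. The names `may_contain` and `screen` are hypothetical, and a real search engine would run this inside its own index rather than in a loop.

```python
def may_contain(query_fp: int, candidate_fp: int) -> bool:
    """True when every bit set in the query is also set in the candidate."""
    return (query_fp & candidate_fp) == query_fp


def screen(query_fp: int, index: dict) -> list:
    """Return the ids of indexed fingerprints that pass the bit screen."""
    return [doc_id for doc_id, fp in index.items() if may_contain(query_fp, fp)]


# Hand-made fingerprints, for illustration only.
index = {
    "mol-a": 0b1110,  # has both query bits plus one more
    "mol-b": 0b0101,  # is missing one query bit
    "mol-c": 0b0110,  # identical to the query
}
query = 0b0110

candidates = screen(query, index)
print(candidates)  # ['mol-a', 'mol-c']
```

Passing the screen only makes a structure a candidate. Because unrelated fragments can share a bit, some candidates will be false positives that still have to be checked afterwards.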

<<< Start >>>

search for chemical structures

Searching for chemical structures

<<< End >>>

Next comes the need for a similarity measure, which I mentioned earlier. This is a post-processing step to reduce false positives in the search engine results. Essentially, each candidate fingerprint in the search results is compared against the query fingerprint through the calculation of a similarity coefficient between the two of them.

If you are familiar with spell-checking or did-you-mean search features, you'll quickly recognize that the same technique is applied there. For example, with a search for “asprin,” the search engine would evaluate multiple candidates to suggest, possibly selecting “aspirin” as the suggestion to present. But if you had typed in “aspring” instead, the suggestion may be “aspiring” because the similarity coefficient changed as the query changed (although both queries were likely misspelled). Unlike spell-checking, a chemical search application would likely return multiple results and not only the one with the highest coefficient score.
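As a minimal sketch of that post-processing step, the code below scores each candidate fingerprint against the query fingerprint and keeps those above a cut-off. I'm assuming the Tanimoto coefficient here (shared bits divided by total distinct bits), which is widely used for bit fingerprints, but your toolkit may offer other coefficients. The `rerank` helper, the 0.7 threshold, and the sample fingerprints are made up for the example.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two bit fingerprints stored as ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # two empty fingerprints: treat as not similar
    return bin(a & b).count("1") / union


def rerank(query_fp: int, candidates: dict, threshold: float = 0.7) -> list:
    """Score every candidate, drop those below threshold, sort best first."""
    scored = [(doc_id, tanimoto(query_fp, fp)) for doc_id, fp in candidates.items()]
    kept = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hand-made candidate fingerprints, for illustration only.
candidates = {
    "doc-1": 0b11100000,  # shares 3 of 4 bits with the query
    "doc-2": 0b11110000,  # identical to the query
    "doc-3": 0b00001111,  # shares no bits with the query
}
query = 0b11110000

results = rerank(query, candidates)
print(results)  # [('doc-2', 1.0), ('doc-1', 0.75)]
```

Note that more than one result survives, each with its own score, which is the difference from a spell-checker that presents a single best suggestion.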

Mixed searches

Now that we know how the chemical structures are handled on both the index and query sides, let’s review the last part of the architecture. The diagram below shows a search process that expects queries combining text keywords with any of the multiple representations of a chemical structure. Imagine that the query is for “caffeine effects.”

<<< Start >>>

chemical search queries containing keyword and SMILES

Handling chemical search queries containing keywords and SMILES notations

<<< End >>>

  1. Extract SMILES would separate "effects" as a keyword and "caffeine" as a compound (the term is not in SMILES notation, but it can be related to a SMILES string)
  2. The Chemical Toolkit would generate the appropriate fingerprint to search for
  3. The Construct Search Query would take the keyword(s) and fingerprint(s) to formulate the request against the appropriate search fields (along with other desired query conditions)
  4. The Search Engine would return the candidate results (those with the candidate fingerprints mentioned before)
  5. The Fine-grain similarity scoring and re-sort is the final part of the process, described later
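The first three steps above can be sketched roughly as follows. Everything named here is an assumption for illustration: the `KNOWN_COMPOUNDS` lookup stands in for whatever dictionary or entity extractor relates a compound name to a SMILES string, `placeholder_fingerprint` stands in for the chemical toolkit, and the field names `text_field` and `fingerprint_field` are hypothetical.

```python
# Hypothetical name-to-SMILES lookup; a real system would use a much
# larger dictionary or an entity-extraction component.
KNOWN_COMPOUNDS = {
    "caffeine": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
}


def split_query(text: str) -> tuple:
    """Step 1: separate plain keywords from recognized compounds."""
    keywords, structures = [], []
    for token in text.lower().split():
        if token in KNOWN_COMPOUNDS:
            structures.append(KNOWN_COMPOUNDS[token])
        else:
            keywords.append(token)
    return keywords, structures


def placeholder_fingerprint(smiles: str) -> str:
    """Step 2 stand-in: a real toolkit would compute the fingerprint here."""
    return f"fingerprint-of({smiles})"


def build_request(text: str, fingerprint_fn=placeholder_fingerprint) -> dict:
    """Step 3: direct each part of the query to its own search field."""
    keywords, structures = split_query(text)
    return {
        "text_field": " ".join(keywords),
        "fingerprint_field": [fingerprint_fn(s) for s in structures],
    }


request = build_request("caffeine effects")
print(request)
```

The request would then go to the search engine (step 4), and the candidates it returns would be scored and re-sorted (step 5).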

Matching without false positives

Fine-grain similarity scoring and re-sorting is a mechanism to improve accuracy in the search results. This post-processing would likely be done through features of the toolkit(s) integrated for data preparation. Various toolkits include validation features that allow for comparison of molecules or compounds to ensure that:

  • The two are similar enough;
  • One is a substructure of the other; or
  • The two are exactly the same

Additionally, the toolkit may be used to validate that the input is a chemically correct molecule or compound. This is particularly important to ensure that the query input is appropriate to search with. Also, it can be used to ensure that the data to be indexed is correct because a conversion may not have gone right or the original data may have a typo in it.

These validations may be computationally expensive, especially on the query side, where there may be many results to compare against the query chemical structure. A compromise may need to be made on how much validation is done, or expectations must be set with users that searches may take seconds or minutes to resolve depending on how large your index is, how many "candidate results" your query returns, or other variables.

Improving chemical search application performance

We know in advance that chemical search may require a significant amount of processing time to match and post-process search results. Response times will likely be affected as content or search features are added over time. Tuning a search application is a constant process that is specific to each individual application. Here are a few considerations for reducing response times:

  • Pre-calculate at index time as much as possible. Identify opportunities to index chemical properties or other data points that the search engine may use in filtering false positives out of search results.
  • Parallelize the post-processing into your search nodes. Rather than aggregating all results from all search nodes and then doing the post-processing, push post-processing into each search node. This way, the filtering of false positives would happen at each individual node at about the same time rather than after all search results are collected. An additional benefit would be a smaller search response traveling through the network.
  • Deploy your search nodes with more processing cores than what would be used for regular text search. The additional cores would allow for more parallelization in post-processing, minimizing the effect on other work occurring in the search node at the same time.
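The second consideration, pushing post-processing into each search node, can be sketched as below. This is a simulation under assumptions: each "node" is just a dictionary of candidate fingerprints, the threads stand in for separate machines (they would not give real CPU parallelism in a single Python process), and the Tanimoto coefficient and 0.5 threshold are choices made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two bit fingerprints stored as ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def filter_on_node(query_fp: int, node_candidates: dict, threshold: float) -> list:
    """Runs on one node: score local candidates and drop false positives."""
    hits = []
    for doc_id, fp in node_candidates.items():
        score = tanimoto(query_fp, fp)
        if score >= threshold:
            hits.append((doc_id, score))
    return hits


def search(query_fp: int, nodes: list, threshold: float = 0.5) -> list:
    """Coordinator: every node filters at the same time, then merge and sort."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(
            pool.map(lambda node: filter_on_node(query_fp, node, threshold), nodes)
        )
    merged = [hit for partial in partials for hit in partial]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)


# Two simulated nodes with hand-made fingerprints.
nodes = [
    {"a": 0b1111, "b": 0b0001},
    {"c": 0b0111},
]

hits = search(0b1111, nodes)
print(hits)  # [('a', 1.0), ('c', 0.75)]
```

Only the hits that survive the filter travel back to the coordinator, which is where the smaller network response comes from.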

Combining chemical search with natural language search

You may be at a point in your journey with chemical search where your search scenarios are limited to matching data based on chemical structure representations only. This may be because it's already hard enough to implement a chemical search application, so limiting the scope of your solution seems reasonable. Yet what you likely need is a chemical search application capable of addressing questions that combine chemical structures with natural language (beyond just keywords!).

Think about the use cases you have around chemical structures. In part 1 of this blog series, I listed a few business cases, such as reducing cost and accelerating research, innovation, and discovery. We also covered the processing of your data because it combines chemical and non-chemical content. Your queries will also combine chemical and non-chemical parts. Using similar, known techniques to parse a query would allow chemical structures to be recognized as part of that query. Chemical structures in the query would become another entity to use in formulating a better request to the search engine, since each part of the query can be directed to the corresponding part identified within your indexed data. This would increase findability and accuracy in the search results.

There are multiple resources that can help you design and implement a solution that handles natural language on the query side. An example framework, an R&D initiative, is Accenture’s Saga Natural Language Understanding. By combining natural language processing (NLP) with the chemical awareness described in this blog, you can build a modern, powerful, and scalable chemical search application.

Ways forward for chemical search: incorporating quantum computing

There's a chance that the complexity of some of your use cases, your volume of data, the expected accuracy of the results, or a combination of factors may require a more customized solution. It may require much more speed than is possible with general-purpose search engines and the typical machines on which they are usually deployed.

For those cases, it's already possible to implement those solutions on much more powerful machines using quantum computing and specialized matching software. As with any new technology, quantum computing is still in the experimental phase for most enterprises, but Accenture has started to leverage it to help some clients solve chemical search challenges quickly. Many of the aspects mentioned in this blog are still applicable in such a solution, mostly in acquiring and preparing the right data to search against.

Gone are the days, a few years back, when my wife and I could answer pretty much any of our boys’ questions. That didn’t last long at all! Now we’re helping them learn how to ask the right questions and find the answers themselves. At Accenture, we work to help clients build or improve their chemical search platforms and tools that allow their scientists to do the same. Happy finding!


Carlos Maroto

Functional & Industry Analytics Manager – Accenture Applied Intelligence
