Data preparation for chemical search
June 5, 2019
I live with the curiosity of my two sons’ fresh, young minds, constantly wondering how things work, where they come from, etc. They are now old enough to find some answers through their own thinking, exploration, or experimentation. Other times though, they look to mom or dad for answers.
At ages 7 and 9 though, my wife and I keep running out of answers much more often than we used to. Naturally, I turn to a search engine when I need help. Still, those answers sometimes don’t come as easily or as quickly as their next question does!
As search and analytics professionals, my colleagues and I have collaborated with many similarly inquisitive minds, though not as young, in our daily work. While some of them look for answers to existing questions, others look for tools that help them formulate better questions, ultimately enabling them to efficiently find the answers themselves!
The inquisitive minds I’d like to focus on in this blog series are those of scientists. More specifically, these are chemical scientists who leverage chemical search to find molecules or compounds similar to those they are working with. And they need to find more, in less time (just like the rest of us!).
This blog series is a small contribution to the cheminformatics space based on my experience and that of other Accenture colleagues in building chemical search applications. Spoiler alert: no simple answer. Yet, we believe that the stars are now aligned for much better chemical search (or should I say that the atoms are aligned to be bonded?).
For those of you non-chemists out there, like me, you might decline a hot cup of this compound... until you find out that it is one of the chemical names given to caffeine.
If you are a chemist, would you have recognized caffeine when shown this image?
Or perhaps you would have identified caffeine by seeing:
Chemical search is the process of quickly and efficiently sifting through multiple data sources to find all relevant data related to a molecule or compound, such as the examples above:
Below are just a few chemical search use cases that chemical scientists and their business partners may have within their cheminformatics practice.
Today, answering these questions involves searching against multiple disconnected repositories. It takes a long time and quite often yields many incorrect hits to further sift through. Some hurdles are well-known from other search applications, such as dealing with data silos. Other hurdles, like reducing search time, may be solved with more computing power. Finally, chemical-specific hurdles continue to be tackled by the many people invested in further developing cheminformatics.
Having a well-thought-out data preparation strategy can help alleviate some of these hurdles.
The diagram below illustrates a high-level data preparation process for a chemical search application.
[Figure: Data preparation steps for a chemical search application]
1. Identifying data sources
Once the purposes or goals of a chemical search application are established, the next step is to identify all the sources of chemical data to include in your system:
Dream on and list all the sources you can think of or find. The more you identify, the better. This will help you build a platform in phases, growing over time, accommodating more data sources as time and money allow you to.
A few words about chemical databases out there:
There are multiple public and commercial databases with chemical data. Content coverage varies across them, and so do their search strategies and search features. Regardless of their content coverage or search capabilities, they may not meet your needs because they cannot access your organization’s content. You may still consider those databases as data sources to complement or extend your own search solution.
2. Identifying content purpose
At this point, you should have a lengthy list of potential sources, likely more than you had thought about initially. Does that list include data sources for vocabularies, dictionaries, taxonomies, or other resources that could be used for cleansing, normalizing, or otherwise enriching your data? This is particularly important for chemical data as there are various notations used to represent molecules or compounds, as well as commercial or popular names for many of them. You’d also likely have to deal with codes, abbreviations, or other shorthand that you may find in the data to be indexed or in the users’ queries.
You may need to revisit your list of data sources from the previous step and annotate those sources that can provide both the data to index as well as the resources to use in the preparation of that data before indexing. Add to the list any missing sources that would only provide you with resources data.
3. Getting disparate data into the same place
There are multiple methods for acquiring and storing data from different sources, one of which is using data connectors. I’m just going to highlight a few aspects of the data sources you may want to search against:
In short, the ideal goal is to provide a single place to search it all. In practice, you may not be able to index all the desired external sources.
4. Identifying and extracting entities
There are some entity extraction methods that we can explore in order to improve the system. Some call this tagging or, if you are familiar with Natural Language Processing (NLP), you may think of named-entity recognition (NER).
For simplicity, assume that there is a vocabulary for each entity type and that, during text processing, each term (or sequence of terms) is looked up in each vocabulary, marking up the text when a match is made. We’ll go into SMILES and chemical names processing in more detail in the rest of the blog.
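The vocabulary lookup described above can be sketched in a few lines of Python. This is only a minimal illustration, not a production tagger: the vocabularies, the entity type names, and the `tag_entities` helper are all hypothetical, and the tokenizer is deliberately naive (real chemical names with commas and hyphens need more care).

```python
import re

# Hypothetical vocabularies: entity type -> known terms.
VOCABULARIES = {
    "chemical_name": {"caffeine", "guaranine", "acetylsalicylic acid"},
    "unit": {"mg", "ml"},
}


def build_lookup(vocabularies):
    """Map each term, as a tuple of lowercase tokens, to its entity type."""
    lookup = {}
    for entity_type, terms in vocabularies.items():
        for term in terms:
            lookup[tuple(term.lower().split())] = entity_type
    return lookup


def tag_entities(text, vocabularies):
    """Return (matched text, entity type, token position) for each match.

    At every position the longest sequence of tokens found in a
    vocabulary wins, so "acetylsalicylic acid" is tagged as one entity.
    """
    lookup = build_lookup(vocabularies)
    longest = max((len(key) for key in lookup), default=0)
    tokens = re.findall(r"\w+", text)  # naive word tokenizer (assumption)
    lowered = [token.lower() for token in tokens]
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shrink the window.
        for n in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in lookup:
                matches.append((" ".join(tokens[i:i + n]), lookup[key], i))
                i += n
                break
        else:
            i += 1  # no vocabulary term starts at this token
    return matches


sample = "Each tablet has 100 mg of Acetylsalicylic acid and caffeine."
for match in tag_entities(sample, VOCABULARIES):
    print(match)
```

Matching the longest sequence first is a design choice that keeps multi-word terms from being split into partial hits. With large vocabularies you would likely replace the dictionary scan with a trie or a dedicated tagging component.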
5. Identifying and extracting molecules and compounds
All search implementations have specific needs for preparing the data for indexing, to increase findability. Using representative samples from your content sources, you’ll likely find that molecules or compounds appear in different parts of each document to index. The short sample list below is ordered from simplest to hardest for identification and extraction:
Ultimately, the goal is to identify molecule or compound entities anywhere in the document so that they can be used to improve findability or relevance, or to allow for filtering, analytics, etc. What does your own data look like? Are molecules or compounds available in structured or unstructured data? I’d imagine you will have it in all flavors, depending on where the data is coming from.
6. Normalizing names: Time for another cup of Guaranine?
Normalizing Guaranine or its many other variants to caffeine is a common data processing feature of search implementations. Any useful chemical search application should allow users to search for “caffeine”, for example, and find all documents containing either that word specifically or any of its possible variants. This problem is not specific to chemical search, of course, and there are well-known techniques for solving it.
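One of those well-known techniques is a synonym table that maps every variant to a single canonical name, applied the same way to indexed text and to user queries. Here is a minimal sketch, assuming a tiny hand-written table; the `SYNONYMS` data and the helper names are hypothetical, and a real table would be loaded from a curated vocabulary.

```python
# Hypothetical synonym table: canonical name -> known variants.
SYNONYMS = {
    "caffeine": ["Guaranine", "Theine", "1,3,7-Trimethylxanthine"],
}


def build_variant_index(synonyms):
    """Map every case-folded variant (and the canonical name) to the canonical name."""
    index = {}
    for canonical, variants in synonyms.items():
        for name in [canonical, *variants]:
            index[name.casefold()] = canonical
    return index


def normalize_name(name, index):
    """Return the canonical name for a known variant, else the input unchanged."""
    return index.get(name.strip().casefold(), name)


variant_index = build_variant_index(SYNONYMS)
print(normalize_name("Guaranine", variant_index))
print(normalize_name("aspirin", variant_index))
```

Building the reverse index once, rather than scanning the table per lookup, keeps each normalization a single dictionary access.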
The problem of normalizing molecule or compound names is twofold: complexity and volume. Complexity, because the variants go beyond the acronyms, words, or phrases that are more common in synonym handling for other types of data. Volume, because there could be too many variants for each molecule or compound name, which can significantly affect performance at both indexing and search time.
Our caffeine example from PubChem illustrates this as it contains:
7. Normalizing chemical structure representations: SMILES, InChI, or Others?
Remember the other few variants mentioned above for caffeine, such as its canonical SMILES notation CN1C=NC2=C1C(=O)N(C(=O)N2C)C? Although SMILES is a specification intended to standardize chemical structure descriptions, it is not the only one. Therefore, the data or searches that the system will work with require conversions between different descriptors of the chemical structure: InChI, SMILES, IUPAC name, etc.
It is likely that you’d need to implement conversions between multiple notations to increase findability. There are various cheminformatics toolkits available for implementing such conversions. These toolkits may also offer features for drawing molecules or compounds from notations or generating notations from a drawing, to mention a couple of examples.
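Before handing a string to a toolkit for conversion, the system has to guess which notation it is looking at. The sketch below is one hedged way to do that routing step in plain Python. It is my own illustration, not a toolkit API: `detect_notation` is a hypothetical helper, and the SMILES pattern only checks that the string is made of SMILES-like tokens, not that it is a valid structure.

```python
import re

# Rough token-level pattern for SMILES (an approximation, not a validator).
SMILES_RE = re.compile(
    r"(?:\[[^\[\]]+\]"      # bracket atom, e.g. [NH4+]
    r"|Br|Cl|[BCNOPSFI]"    # organic-subset atoms
    r"|[bcnops]"            # aromatic atoms
    r"|[-=#$:/\\.]"         # bond symbols and the disconnection dot
    r"|[()]"                # branch open/close
    r"|%\d{2}|\d"           # ring-closure labels
    r"|\*)+"                # wildcard atom
)


def detect_notation(value):
    """Guess whether a string is an InChI, a SMILES, or a name."""
    text = value.strip()
    if text.startswith("InChI="):
        return "inchi"
    if SMILES_RE.fullmatch(text):
        return "smiles"
    return "name"  # fall back to name lookup (IUPAC, common, or trade name)


for candidate in [
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3",
    "Guaranine",
]:
    print(candidate, "->", detect_notation(candidate))
```

Short strings are genuinely ambiguous under this heuristic (for example, "CO" is a SMILES-shaped string that a user may have typed as a formula), so treat the result as a routing hint and let the chosen toolkit do the real parsing and validation.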
As if this cross-representation normalization wasn’t hard enough, you’d need to decide on a toolkit to use to do the work. The Wikipedia page for cheminformatics toolkits has over 20 in its list, more than 10 of which are open source. You’d also need to answer multiple questions while deciding on a toolkit to integrate into your system. Here are just a few, though I imagine your organization’s chemists would help define more and deeper questions for your solution:
From our experience helping clients with multiple data normalization projects, this process is complex and often requires specialized content processing technology (an example is Accenture's Aspire Content Processing framework) and implementation. Keep the complexity of this aspect in mind as you prioritize work to incrementally build data normalization into your solution. Ultimately, a search solution is only as good as the indexed data enabling it! Also, don’t forget to tune normalization performance to minimize the impact at indexing or search time.
Now that I’ve discussed the need for good data preparation, I’ll dive into how to build a chemical search application and enable search results on the user interface in the second blog of this series. Connect with us to learn more about this use case.