Part 1 of the “We’ve Got Chemistry!” blog series
I live with the curiosity of my two sons’ fresh, young minds, constantly wondering how things work, where they come from, etc. They are now old enough to find some answers through their own thinking, exploration, or experimentation. Other times though, they look to mom or dad for answers.
Now that they are 7 and 9, my wife and I run out of answers much more often than we used to. Naturally, I turn to a search engine when I need help. Still, those answers sometimes don’t come as easily or as quickly as the next question does!
As search and analytics professionals, my colleagues and I have collaborated with many similarly inquisitive minds, though not as young, in our daily work. While some of them look for answers to existing questions, others look for tools that help them formulate better questions, ultimately enabling them to efficiently find the answers themselves!
The inquisitive minds I’ll focus on in this blog series are those of scientists. More specifically, they are chemical scientists who use chemical search to find molecules or compounds similar to the ones they are working with. And they need to find more, in less time (just like the rest of us!).
This blog series is a small contribution to the cheminformatics space based on my experience and that of other Accenture colleagues in building chemical search applications. Spoiler alert: there is no simple answer. Yet, we believe that the stars are now aligned for much better chemical search (or should I say that the atoms are aligned to be bonded?).
- In this first part of the series, I’ll discuss the business cases and data preparation best practices in a chemical search implementation.
- In the second part, I’ll dive into the approaches and reference architecture for building the application.
Care for some 1,3,7-trimethylpurine-2,6-dione before we move on?
If you are a non-chemist like me, you might decline a hot cup of this compound... until you find out that it is one of the chemical names given to caffeine.
If you are a chemist, would you have recognized caffeine when shown this image?
[Image: drawn structure of the caffeine molecule]
Or perhaps you would have identified caffeine by seeing:
- the chemical formula C8H10N4O2, or
- its canonical SMILES (Simplified Molecular-Input Line-Entry System) notation, CN1C=NC2=C1C(=O)N(C(=O)N2C)C, or
- its InChI (International Chemical Identifier), InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3?
Chemical search is the process of quickly and efficiently sifting through multiple data sources to find all relevant data related to a molecule or compound, starting from any of its representations, such as the examples above:
- Drawn molecule image
- Chemical name
- Popular name
- Chemical formula
- SMILES or InChI representation
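To make the idea concrete, here is a minimal Python sketch of one compound carrying several representations, plus a lookup that resolves any of them to the same record. The class and function names (`CompoundRecord`, `build_lookup`, `find_compound`) are my own, for illustration only. It is a toy in-memory lookup, not how a real search engine stores this data.

```python
from dataclasses import dataclass, field


@dataclass
class CompoundRecord:
    """One compound and the different ways it may be written."""
    popular_name: str
    chemical_names: list[str] = field(default_factory=list)
    formula: str = ""
    smiles: str = ""
    inchi: str = ""


def build_lookup(records: list[CompoundRecord]) -> dict[str, CompoundRecord]:
    """Map every name and structure notation to the record it identifies."""
    lookup: dict[str, CompoundRecord] = {}
    for record in records:
        # Names are matched case-insensitively.
        for name in [record.popular_name, *record.chemical_names]:
            lookup[name.lower()] = record
        # SMILES and InChI are case-sensitive, so they are stored as written.
        for notation in (record.smiles, record.inchi):
            if notation:
                lookup[notation] = record
        # The formula is left out on purpose: isomers share a formula,
        # so it cannot identify a single compound.
    return lookup


def find_compound(lookup: dict[str, CompoundRecord], query: str):
    """Try the query as written first, then as a lowercased name."""
    query = query.strip()
    return lookup.get(query) or lookup.get(query.lower())


caffeine = CompoundRecord(
    popular_name="caffeine",
    chemical_names=["1,3,7-trimethylpurine-2,6-dione"],
    formula="C8H10N4O2",
    smiles="CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    inchi="InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3",
)
lookup = build_lookup([caffeine])
print(find_compound(lookup, "Caffeine").formula)
```

Whether a user types the popular name, the chemical name, or the SMILES string, the lookup lands on the same record.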
Business cases for chemical search
Below are just a few chemical search use cases that chemical scientists and their business partners may have within their cheminformatics practice.
- Are we going down a path already explored with this molecule or compound? Reduce costs or accelerate research: pick up where a previous experiment left off; stop or modify research since a prior experiment had failed under similar conditions; etc.
- Could we expand our offering by entering other markets or applications with this molecule or compound? Innovate: identify opportunities for developing products in new markets or for new purposes based on current experience with some molecules or compounds; enter a partnership with another organization to use strengths from both on developing new products; etc.
- What else is known by us or others about this molecule or compound? Discover: find information related to the molecule or compound in question from third-party sources or other areas of the organization.
Today, answering these questions involves searching against multiple disconnected repositories. It takes a long time and quite often yields many incorrect hits to further sift through. Some hurdles are well-known from other search applications, such as dealing with data silos. Other hurdles, like reducing search time, may be solved with more computing power. Finally, chemical-specific hurdles continue to be tackled by the many people invested in further developing cheminformatics.
Having a well-thought-out data preparation strategy can help alleviate some of these hurdles.
Acquiring and preparing chemical data
The diagram below illustrates a high-level data preparation process for a chemical search application.
[Figure: Data preparation steps for a chemical search application]
1. Identifying data sources
Once the purposes or goals of a chemical search application are established, the next step is to identify all the sources of chemical data to include in your system:
- Internal. You may want to include your own research material; your own patent information; data generated or stored by experimentation instruments or the systems integrated to them (like Electronic Lab Notebook systems); scientists’ blogs, annotations or similar; etc.
- External. You may use public data or private data (subscription-based or bought) from third parties: published patents from other organizations; databases, research papers or other content from universities, partners, industry analysts, government, or other publishers; etc.
Dream big and list all the sources you can think of or find. The more you identify, the better. This will help you build a platform in phases, growing over time and accommodating more data sources as time and money allow.
A few words about chemical databases:
There are multiple public and commercial databases with chemical data. Content coverage varies across them, as do their search strategies and search features. Regardless of their content coverage or search capabilities, they may not meet your needs because they cannot access your organization’s content. You may still consider those databases as data sources to complement or extend your own search solution.
2. Identifying content purpose
At this point, you should have a lengthy list of potential sources, likely more than you had thought about initially. Does that list include data sources for vocabularies, dictionaries, taxonomies, or other resources that could be used for cleansing, normalizing, or otherwise enriching your data? This is particularly important for chemical data, as there are various notations used to represent molecules or compounds, as well as commercial or popular names for many of them. You’d also likely have to deal with codes, abbreviations, or other shorthand that appears in the data to be indexed or in the users’ queries.
You may need to revisit your list of data sources from the previous step and annotate those sources that can provide both the data to index and the resources to use in preparing that data before indexing. Add to the list any missing sources that would only provide you with resource data.
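One way to keep that annotated list is as a small inventory in code. The sketch below is only an illustration: the source names, the `phase` field, and the `sources_for` helper are assumptions of mine, not part of any product.

```python
from dataclasses import dataclass
from enum import Enum


class Purpose(Enum):
    INDEX = "data to index"
    RESOURCE = "resource for cleansing, normalizing, or enriching"


@dataclass(frozen=True)
class DataSource:
    name: str
    internal: bool
    purposes: frozenset
    phase: int  # hypothetical rollout phase, for building the platform in stages


# Illustrative entries only; replace with your own list of sources.
SOURCES = [
    DataSource("Electronic Lab Notebook", True, frozenset({Purpose.INDEX}), 1),
    DataSource("Published patents", False, frozenset({Purpose.INDEX}), 2),
    DataSource("Public chemical database", False,
               frozenset({Purpose.INDEX, Purpose.RESOURCE}), 1),
    DataSource("Abbreviation dictionary", True, frozenset({Purpose.RESOURCE}), 1),
]


def sources_for(purpose: Purpose, phase: int) -> list[str]:
    """Names of the sources that serve a purpose by a given phase."""
    return [s.name for s in SOURCES if purpose in s.purposes and s.phase <= phase]


print(sources_for(Purpose.RESOURCE, phase=1))
```

A source such as a public chemical database can carry both purposes, which is exactly the annotation this step asks for.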
3. Getting disparate data into the same place
There are multiple methods for acquiring and storing data from different sources, one of which is using data connectors. I’m just going to highlight a few aspects of the data sources you may want to search against:
- Content usage restrictions. There may be permissions governing the data accessible to a scientist. Implementing secure search would satisfy this requirement. You’ll find a lot of details about securing search results in this blog.
- External content limitations. There may be licensing conditions that your organization cannot fully control, which may affect your indexing of their data:
  - Is the data already duplicated in your environment or hosted by the provider of the content?
  - Whether hosted by your organization or by the provider, are you allowed to index it in your own search engine?
  - Are there any special conditions for using the external data or portions of it in your own applications?
In short, the ideal goal is to provide a single place to search it all. In practice, you may not be able to index all the desired external sources.
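As a rough picture of the content usage restrictions mentioned above, the sketch below trims a result list to what a scientist is allowed to see. It assumes each indexed document carries the set of groups allowed to read it. The `allowed_groups` field and the group names are made up for illustration, and real search engines typically apply this kind of filter inside the query rather than afterwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDocument:
    doc_id: str
    title: str
    allowed_groups: frozenset  # hypothetical field: groups permitted to read the document


def trim_results(results: list[IndexedDocument], user_groups: set) -> list[IndexedDocument]:
    """Keep only the results that share at least one group with the user."""
    return [doc for doc in results if doc.allowed_groups & user_groups]


results = [
    IndexedDocument("d1", "Caffeine stability study", frozenset({"chemistry-research"})),
    IndexedDocument("d2", "Patent filing draft", frozenset({"legal"})),
    IndexedDocument("d3", "Public database record", frozenset({"everyone"})),
]
visible = trim_results(results, {"chemistry-research", "everyone"})
print([doc.doc_id for doc in visible])
```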
4. Identifying and extracting entities
There are some entity extraction methods that we can explore in order to improve the system. Some call this tagging or, if you are familiar with Natural Language Processing (NLP), you may think of named-entity recognition (NER).
For simplicity, assume that there is a vocabulary for each entity type. During text processing, each term (or sequence of terms) is looked up in each vocabulary, and the text is marked up when a match is made. We’ll go into SMILES and chemical name processing in more detail later in this blog.
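The vocabulary lookup described above can be sketched in a few lines of Python. The entity types and vocabulary entries below are invented for the example. The tokenizer is deliberately naive: splitting on word characters would break a chemical name such as 1,3,7-trimethylpurine-2,6-dione into pieces, which is one reason chemical names need dedicated handling.

```python
import re

# Hypothetical vocabularies, one per entity type (all entries lowercase).
VOCABULARIES = {
    "COMPOUND": {"caffeine", "guaranine"},
    "SYSTEM": {"electronic lab notebook"},
    "DATABASE": {"pubchem"},
}

TOKEN = re.compile(r"\w+")
MAX_TERMS = 3  # longest sequence of terms to try against the vocabularies


def tag_entities(text: str, vocabularies: dict) -> list[tuple[str, str]]:
    """Return (entity type, matched text) pairs, preferring the longest match."""
    tokens = [(m.group().lower(), m.start(), m.end()) for m in TOKEN.finditer(text)]
    entities = []
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest sequence of terms first, then shorter ones.
        for n in range(min(MAX_TERMS, len(tokens) - i), 0, -1):
            phrase = " ".join(tok[0] for tok in tokens[i:i + n])
            label = next(
                (name for name, terms in vocabularies.items() if phrase in terms),
                None,
            )
            if label:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                entities.append((label, text[start:end]))
                matched = n
                break
        i += matched or 1  # skip past a match, otherwise move one term forward
    return entities


sample = ("Guaranine results were logged in the Electronic Lab Notebook "
          "and checked against PubChem.")
print(tag_entities(sample, VOCABULARIES))
```

Trying the longest sequence first keeps "Electronic Lab Notebook" as one entity instead of three unrelated terms.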
5. Identifying and extracting molecules and compounds
All search implementations have specific needs for preparing the data for indexing, to increase findability. Using representative samples from your content sources, you’ll likely find that molecules or compounds appear in different parts of each document to index. The short sample list below is ordered from simplest to hardest for identification and extraction:
- Clearly marked metadata field(s) containing chemical data. The PubChem page for caffeine has multiple fields that illustrate this structured data scenario: IUPAC Name, InChI, Molecular Formula, etc.
- Specific sections of the document that contain molecules or compounds along with other data. Perhaps you are just interested in the top of the Wikipedia page for caffeine along with the Use, Adverse Effects, Pharmacology, and Chemistry sections but not the rest of that article.
- Anywhere in the free text of the document.
Ultimately, the goal is to identify molecule or compound entities anywhere in the document so that they can be used to improve findability or relevance, or to allow for filtering, analytics, etc. What does your own data look like? Are molecules or compounds available in structured or unstructured data? I’d imagine you will have it in all flavors, depending on where the data is coming from.
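Here is a small Python sketch covering the two ends of that list: reading clearly marked metadata fields, and scanning free text for tokens that look like molecular formulas. The field names, the document layout, and the short element list are assumptions for illustration. A pattern like this will miss many cases and flag some false positives, so treat it as a starting point only.

```python
import re

# Small illustrative subset of element symbols, not the full periodic table.
ELEMENTS = {"H", "C", "N", "O", "S", "P", "F", "Cl", "Br", "I", "Na", "K"}

# Hypothetical names of metadata fields known to hold chemical data.
CHEMICAL_FIELDS = ("iupac_name", "inchi", "molecular_formula")

FORMULA_CANDIDATE = re.compile(r"\b(?:[A-Z][a-z]?\d*)+\b")


def looks_like_formula(token: str) -> bool:
    """Accept tokens made of known element symbols with at least one count."""
    parts = re.findall(r"([A-Z][a-z]?)(\d*)", token)
    has_count = any(count for _, count in parts)
    return len(parts) >= 2 and has_count and all(sym in ELEMENTS for sym, _ in parts)


def extract_mentions(document: dict) -> list[tuple[str, str]]:
    """Collect (location, value) pairs, structured fields first."""
    mentions = []
    # Simplest case: clearly marked metadata fields.
    for name in CHEMICAL_FIELDS:
        value = document.get("metadata", {}).get(name)
        if value:
            mentions.append((f"metadata:{name}", value))
    # Hardest case: anywhere in the free text.
    for token in FORMULA_CANDIDATE.findall(document.get("body", "")):
        if looks_like_formula(token):
            mentions.append(("body", token))
    return mentions


doc = {
    "metadata": {"molecular_formula": "C8H10N4O2"},
    "body": "IUPAC lists caffeine as C8H10N4O2, unlike water (H2O).",
}
print(extract_mentions(doc))
```

Recording where each mention was found lets you weigh a structured field differently from a free-text hit later on.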
6. Normalizing names: Time for another cup of Guaranine?
Normalizing Guaranine or its many other variants to caffeine is a common data processing feature of search implementations. Any useful chemical search application should allow users to search for “caffeine”, for example, and find all documents containing either that word specifically or any of its possible variants. This problem is not specific to chemical search, of course, and there are well-known techniques for implementing it.
The problem of normalizing molecule or compound names is twofold: complexity and volume. Complexity, because the variants go beyond the acronyms, words, or phrases that are more common in synonym handling for other types of data. Volume, because there can be a very large number of variants for each molecule or compound name, which can significantly affect performance at both indexing and search time.
Our caffeine example from PubChem illustrates this as it contains:
- 4 “Computed Descriptors”
- 11 different “Other Identifiers”
- Over 400 “Synonyms”, in two groups:
  - 14 Medical Subject Headings (MeSH) alternative names for caffeine alone
  - Over 410 synonyms compiled by PubChem from its various sources of caffeine data. The word “synonym” is used loosely here, as many are codes or other ways used to tag caffeine as an entity in different data sources. Still, all of these may be helpful for your application to do its matching job.
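A minimal sketch of one common approach follows, assuming the compound mentions in each document have already been extracted. Every variant is mapped to one canonical name, and the same mapping is applied at indexing time and at search time. The synonym table holds only the two caffeine variants used in this blog, and the function names are mine.

```python
# Canonical name -> known variants (a tiny sample; PubChem lists hundreds).
SYNONYMS = {
    "caffeine": ["guaranine", "1,3,7-trimethylpurine-2,6-dione"],
}


def build_variant_map(synonyms: dict[str, list[str]]) -> dict[str, str]:
    """Map each lowercased variant, and the canonical name itself, to the canonical name."""
    variant_map = {}
    for canonical, variants in synonyms.items():
        variant_map[canonical.lower()] = canonical
        for variant in variants:
            variant_map[variant.lower()] = canonical
    return variant_map


def normalize(term: str, variant_map: dict[str, str]) -> str:
    """Return the canonical name, or the lowercased term when it is unknown."""
    key = term.strip().lower()
    return variant_map.get(key, key)


def index_documents(docs: dict[str, list[str]], variant_map: dict[str, str]) -> dict[str, set]:
    """Build a toy inverted index keyed by canonical compound name."""
    index: dict[str, set] = {}
    for doc_id, mentions in docs.items():
        for mention in mentions:
            index.setdefault(normalize(mention, variant_map), set()).add(doc_id)
    return index


def search(query: str, index: dict[str, set], variant_map: dict[str, str]) -> set:
    """Normalize the query the same way the documents were normalized."""
    return index.get(normalize(query, variant_map), set())


variant_map = build_variant_map(SYNONYMS)
docs = {
    "doc1": ["Guaranine"],
    "doc2": ["1,3,7-Trimethylpurine-2,6-dione"],
    "doc3": ["aspirin"],
}
index = index_documents(docs, variant_map)
print(sorted(search("caffeine", index, variant_map)))
```

Collapsing variants to a single key on both sides means a query never has to expand into hundreds of synonyms, which is one way to contain the volume problem. The cost moves to maintaining the variant map.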
7. Normalizing chemical structure representations: SMILES, InChI, or Others?
Remember the other few variants mentioned above for caffeine, such as its canonical SMILES notation CN1C=NC2=C1C(=O)N(C(=O)N2C)C? Although SMILES is a specification intended to standardize chemical structure descriptions, it is not the only one. Therefore, the data and searches that the system will work with require conversions between different descriptors of the chemical structure: InChI, SMILES, IUPAC name, etc.
It is likely that you’d need to implement conversions between multiple notations to increase findability. There are various cheminformatics toolkits available for implementing such conversions. These toolkits may also offer features for drawing molecules or compounds from notations or for generating notations from a drawing, to mention a couple of examples.
As if this cross-representation normalization weren’t hard enough, you’d need to decide on a toolkit to do the work. The Wikipedia page for cheminformatics toolkits has over 20 in its list, more than 10 of which are open source. You’d also need to answer multiple questions while deciding on a toolkit to integrate into your system. Here are just a few, though I imagine your organization’s chemists would help define more and deeper questions for your solution:
- Which toolkit is more widely used in the data sources?
- Are there proprietary toolkits used by the majority of sources, or by the most important data sources to index?
- Are there incompatibilities or inaccurate conversions when your solution applies a different toolkit from the one used in the data source?
- Should two or more toolkits be used? What’s the impact on performance or accuracy of using more than one toolkit? How do you decide when to favor one over the other, if two or more are applied?
- How can the toolkit integration be made to allow your system to swap a toolkit or use more than one toolkit in future versions of the software with minimal impact?
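On that last question, one option is to hide the toolkit behind a small interface so the rest of the pipeline never calls it directly. The sketch below is an assumption-laden illustration: `StructureToolkit`, `NormalizationPipeline`, and `PassthroughToolkit` are names I made up, and the RDKit adapter uses RDKit’s `Chem.MolFromSmiles` and `Chem.MolToSmiles` only as one example of a toolkit you might plug in.

```python
from typing import Protocol


class StructureToolkit(Protocol):
    """What the pipeline needs from any cheminformatics toolkit."""
    name: str

    def canonical_smiles(self, smiles: str) -> str: ...


class RDKitToolkit:
    """Adapter for RDKit, one possible open source toolkit."""
    name = "rdkit"

    def canonical_smiles(self, smiles: str) -> str:
        # Imported lazily so the other adapters work without RDKit installed.
        from rdkit import Chem
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError(f"Unparseable SMILES: {smiles!r}")
        return Chem.MolToSmiles(mol)


class PassthroughToolkit:
    """Stand-in for tests: trims whitespace and does no chemistry."""
    name = "passthrough"

    def canonical_smiles(self, smiles: str) -> str:
        return smiles.strip()


class NormalizationPipeline:
    """Depends only on the interface, so the toolkit can be swapped."""

    def __init__(self, toolkit: StructureToolkit):
        self.toolkit = toolkit

    def normalize(self, smiles: str) -> dict:
        # Recording which toolkit produced the value helps trace
        # differences if a second toolkit is introduced later.
        return {
            "toolkit": self.toolkit.name,
            "smiles": self.toolkit.canonical_smiles(smiles),
        }


pipeline = NormalizationPipeline(PassthroughToolkit())
print(pipeline.normalize("  CN1C=NC2=C1C(=O)N(C(=O)N2C)C "))
```

Swapping in `RDKitToolkit()` or an adapter for a proprietary toolkit then touches one line, and two toolkits can run side by side for comparison.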
From our experience helping clients with multiple data normalization projects, this process is complex and often requires specialized content processing technology (an example is Accenture's Aspire Content Processing framework) and implementation. Keep the complexity of this aspect in mind as you prioritize work to incrementally build data normalization into your solution. Ultimately, a search solution is only as good as the indexed data enabling it! Also, don’t forget to tune normalization performance to minimize its impact at indexing and search time.
Now that I’ve discussed the need for good data preparation, I’ll dive into how to build a chemical search application and enable search results on the user interface in the second blog of this series. Connect with us to learn more about this use case.