
What kind of similarity search should be implemented in the database cartridge?

Asked by [ Editor ] , Edited by Mikhail Rybalkin [ Editor ]

Hello all,

I'd like to understand what people expect when they run a similarity search using different chemistry database cartridges. Unlike substructure search, which is strict and well defined, similarity search is loosely defined: you cannot predict the exact similarity measure between molecules before you run the search. So the search proceeds by trial and error until you find the similar molecules you need. But what do people really need when they do a similarity search? Can you give some real use cases?

Usually molecular similarity is measured by fingerprint similarity, but there are different types of fingerprints. This leads to another question: is a similarity measure based on fingerprints what people actually need? When you run a similarity search, I expect you don't really need "molecules with fingerprint Tanimoto distance less than 0.312"; you need molecules that are similar in some other sense. For example, there are well-defined graph distances, and maybe it is better to use them? But they are computationally harder. Do you know of any toolkits or applications that measure molecular similarity without using fingerprints?
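To make the "Tanimoto distance" mentioned above concrete, here is a minimal sketch in Python. It assumes a fingerprint is represented as a set of on-bit indices; the function names and the example bit sets are made up for illustration and are not taken from Bingo or any other cartridge.

```python
def tanimoto_similarity(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    common = len(fp_a & fp_b)
    return common / (len(fp_a) + len(fp_b) - common)


def tanimoto_distance(fp_a: set, fp_b: set) -> float:
    """Distance form: 0 means identical bit sets, 1 means no bits in common."""
    return 1.0 - tanimoto_similarity(fp_a, fp_b)


# Hypothetical fingerprints, not derived from real molecules.
fp_x = {1, 4, 7, 9, 15}
fp_y = {1, 4, 7, 12}
print(tanimoto_similarity(fp_x, fp_y))  # 3 shared bits out of 6 distinct bits
print(tanimoto_distance(fp_x, fp_y))
```

The score depends only on which bits are set, so everything about what "similar" means is decided by how the fingerprint was built.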

If fingerprints are really what people need, then what kind of fingerprint should be used? Should molecule size or feature counts also be encoded in the fingerprint?

My main goal is to understand in what direction we can extend or improve similarity search in our database cartridge, Bingo. Currently we use fingerprints that we developed ourselves; briefly, they can be described as follows: "Bingo fingerprints, as compared to Daylight fingerprints, are built not from bond paths, but from trees and rings". Sorry for the long question.


1 answer


dalke [ Editor ]

The main use case is the idea that two structures which look similar - according to the user's considered intuition - should function similarly. For example, given an unknown structure X, find the 3 nearest structures with similarity scores above 0.83 and use their measured values (plus some scaling function based on score distance) to predict the corresponding value for X. Similarly, someone thinks that structure Y might be interesting and wants to know what's known about Y or closely similar structures.
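The nearest-neighbour use case can be sketched roughly as follows. This is only an illustration under stated assumptions: fingerprints are sets of on-bit indices, the "scaling function" is taken to be a plain similarity-weighted average (the answer does not say which function to use), and `predict_from_neighbors` and the example data are hypothetical.

```python
from typing import List, Optional, Tuple


def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    common = len(fp_a & fp_b)
    return common / (len(fp_a) + len(fp_b) - common)


def predict_from_neighbors(
    query_fp: set,
    library: List[Tuple[set, float]],
    k: int = 3,
    threshold: float = 0.83,
) -> Optional[float]:
    """Predict a value for the query from its k most similar library entries.

    Only entries scoring above `threshold` are used. The weighting by
    similarity is an assumption of this sketch, not a prescribed method.
    """
    scored = [(tanimoto(query_fp, fp), value) for fp, value in library]
    neighbors = sorted(
        (sv for sv in scored if sv[0] > threshold),
        key=lambda sv: sv[0],
        reverse=True,
    )[:k]
    if not neighbors:
        return None  # nothing similar enough to base a prediction on
    total_weight = sum(score for score, _ in neighbors)
    return sum(score * value for score, value in neighbors) / total_weight


# Hypothetical library of (fingerprint, measured value) pairs.
library = [
    (set(range(10)), 2.0),   # similarity 1.0 to the query
    (set(range(9)), 4.0),    # similarity 0.9
    (set(range(5)), 100.0),  # similarity 0.5, below the threshold
]
print(predict_from_neighbors(set(range(10)), library))
```

Returning `None` when no neighbour passes the threshold keeps the caller from trusting a prediction that rests on dissimilar structures.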

People also use similarity as a measure of diversity.

A good similarity score system depends on the functionality that one wants to compare (i.e., something tuned for pharma might be poor for laser dyes or for similarity of synthesis). Scores also depend on the similarity measure itself (a 0.9 with Daylight might correspond to 0.7 with a hypothetical "Nightlight"), which means people need some tuning and experience to get a good idea of what those scores mean.

A good system should also have the property that two structures which are functionally dissimilar get a low similarity score. Hashing sometimes violates this rule, yielding "similar" structures which are not similar. LINGO violates the other rule, since structures which are structurally very similar may end up with bad LINGO scores. (Read the LINGO paper, by Haigh et al. I think, to see how it compares to Daylight fingerprints for a given use case.)
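The LINGO point can be illustrated with a deliberately simplified sketch: treat a SMILES string as text, collect its overlapping 4-character substrings, and take the Tanimoto coefficient of the two substring sets. The published method differs in its details, which this sketch does not attempt to reproduce; the function names are hypothetical.

```python
def lingos(smiles: str, q: int = 4) -> set:
    """Set of all overlapping q-character substrings of a SMILES string."""
    return {smiles[i:i + q] for i in range(len(smiles) - q + 1)}


def lingo_like_similarity(smiles_a: str, smiles_b: str, q: int = 4) -> float:
    """Tanimoto coefficient over substring sets (simplified, set-based variant)."""
    set_a, set_b = lingos(smiles_a, q), lingos(smiles_b, q)
    union = set_a | set_b
    if not union:
        return 0.0  # both strings shorter than q characters
    return len(set_a & set_b) / len(union)


# Two valid SMILES strings for the same molecule (phenol), written in a different order.
print(lingo_like_similarity("c1ccccc1O", "Oc1ccccc1"))
```

Because the comparison works on the text rather than on the graph, the same molecule written two ways scores below 1 here, which is the kind of failure described above. Canonicalizing the SMILES before comparison reduces the problem for identical molecules but not for close analogues whose canonical strings happen to differ a lot.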

I’ve heard people talk about graph edit distances as a measure of molecular similarity but not heard of anything which has come of it. There’s of course 3D/shape similarity, and pharmacophore similarity based on range overlaps.

About 12 years ago I worked at a company where we used MCS size as a way to judge similarity. In that case we were looking to see if the MCS from a set of actives could help identify a common substructure, indicating that the given fragment might play a role in the activity.

Andreas Bender did his PhD thesis on “STUDIES ON MOLECULAR SIMILARITY” and has worked on the problem since then. Those should be useful links for you.
