Similarity calculations between circular fingerprints

Asked by [ Editor ]

Hi, given a circular fingerprint such as ECFP, what is the accepted way to evaluate the similarity between two such fingerprints? My first thought is that the features (usually unsigned ints derived from atom environments) get mapped into a fixed-length bit string, after which one proceeds as usual.


But given that one of the claims to fame of circular fingerprints is their better performance in terms of resolution and the very large feature space of such fingerprints, a mapping to a fixed-length bit string would lead to increased collisions and hence loss of resolution.

So it seems that mapping to a bit string is a bit of a waste, but I can’t see how else one would evaluate a Tanimoto score between two such fingerprints.
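To make the collision concern concrete, here is a minimal sketch. It assumes features are folded into the bit string by a simple modulo, which is only one possible scheme (real toolkits may hash differently); the `fold` helper and the feature values are made up for illustration.

```python
def fold(features, n_bits):
    """Fold integer feature identifiers into bit positions 0..n_bits-1.

    Assumption: plain modulo folding; actual toolkits may use other hashes.
    """
    return {f % n_bits for f in features}


# Three distinct (hypothetical) feature identifiers.
features = {1, 1025, 5000}

# With 1024 bits, 1 and 1025 land on the same bit, so one feature is lost.
bits = fold(features, 1024)
print(len(features), "features ->", len(bits), "bits set")
```

Three distinct features end up setting only two bits, which is the loss of resolution described above.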

Any pointers would be appreciated.

3 answers

2

jeliazkova.nina [ Editor ]

One can calculate Tanimoto scores without explicit mapping to a bit string, as the Tanimoto formula asks for the number of features and the number of common features, not necessarily bits.
For example, the usual formula can be used: Tanimoto = common(NA,NB) / (NA + NB - common(NA,NB)), where NA is the number of fragments (circular fingerprint features) in molecule A, NB is the number of fragments in molecule B, and common(NA,NB) is the number of common fragments between the two molecules.

This means that the length of the (implicit) bit string will be different for different pairs of molecules.
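A minimal sketch of this set-based calculation, assuming each fingerprint is available as a collection of integer feature identifiers. The function name, the example values, and the convention of returning 0.0 for two empty fingerprints are illustrative assumptions, not part of any particular toolkit.

```python
def tanimoto(features_a, features_b):
    """Tanimoto = common / (NA + NB - common), computed on feature sets."""
    a, b = set(features_a), set(features_b)
    common = len(a & b)                 # number of shared features
    denom = len(a) + len(b) - common    # NA + NB - common
    # Assumption: two empty fingerprints are given a similarity of 0.0.
    return common / denom if denom else 0.0


# Hypothetical feature identifiers; no folding into a bit string is needed.
fp_a = {101, 2048, 4000000001, 77}
fp_b = {101, 77, 999}

print(tanimoto(fp_a, fp_b))  # 2 / (4 + 3 - 2)
```

No bit string is ever built here: only the features actually present in the two molecules are compared.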



rajarshi guha
-

Thanks. But doesn’t this approach actually mean that the length of the ‘implicit’ bit string is constant and very large (essentially, one bit for each possible feature that can be generated)?

baoilleach
-

The bitstring representation makes one think that the 0s have some meaning. The Tanimoto coefficient is for comparing sets; we are comparing one set of fragments with another set of fragments.

jeliazkova.nina
-
Yes indeed. But 0s might be interpreted as “missing fragments”, if the bit string corresponds to the presence of particular structural features.

jonalv
-
.nina This is only for bit fingerprints, correct? If so, can you extend your answer with a formula for count fingerprints?

jeliazkova.nina
-
The formula should be applicable to counts as well, if we recall that Tanimoto is about comparing sets (including multisets, i.e. sets where members can be repeated):

Tanimoto = numberofmembers(intersectionofsets(fragments-in-A, fragments-in-B)) / (NA + NB - numberofmembers(intersectionofsets(fragments-in-A, fragments-in-B)))

e.g. if fragments-in-A = {c,c,c,c,n,o} and fragments-in-B = {c,c,c,o,o}, then intersectionofsets(fragments-in-A, fragments-in-B) = {c,c,c,o} and Tanimoto = 4 / (6 + 5 - 4) = 4/7 ≈ 0.57
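The worked example can be reproduced with a short sketch, assuming the fragments are given as plain lists and using Python's `collections.Counter` as the multiset. The function name `tanimoto_counts` is hypothetical.

```python
from collections import Counter


def tanimoto_counts(fragments_a, fragments_b):
    """Multiset Tanimoto: repeated fragments count as separate members."""
    a, b = Counter(fragments_a), Counter(fragments_b)
    # Counter & Counter keeps the minimum count of each fragment,
    # i.e. the multiset intersection.
    common = sum((a & b).values())
    denom = sum(a.values()) + sum(b.values()) - common  # NA + NB - common
    # Assumption: two empty fingerprints are given a similarity of 0.0.
    return common / denom if denom else 0.0


frags_a = ["c", "c", "c", "c", "n", "o"]
frags_b = ["c", "c", "c", "o", "o"]

print(round(tanimoto_counts(frags_a, frags_b), 2))  # 4 / (6 + 5 - 4)
```

The intersection keeps three c and one o, giving 4 / 7 as in the example above.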

1

baoilleach [ Admin ]

I guess you just need to assume that if two molecules share several bits in common, then it is not by chance (i.e. a collision of two distinct fragments) but rather is due to a real similarity in structure. It makes one think about the e-values that Baldi worked on for the significance of matches depending on database size (à la BLAST).
