type to search

Efficient way of calculating similarity between two large datasets?

Asked by , Edited by Egon Willighagen [ Admin ]

Hi All,

I was wondering if anybody can suggest the best (i.e. fastest) way of identifying similar compounds from two large datasets?

For example, one question I would like to answer is to identify how many of the Chembl compounds (~500,000) are commercially available by searching against the Zinc database (~8 million) or have very similar analogs available.

I used the Overlap Analysis in the Insta-Jchem option to identify identical compounds and that was very fast, however when I tried to do a similarity search it was much much slower.

I am wondering if there is anything similar for compounds to the CD-HIT programme in sequence analysis which can really rapidly identify similar sequences (on the order of millions) from one database in another.

As always, thanks a million for any pointers.

or Cancel

3 answers


rajarshi guha [ Editor ]

In general, to get a complete answer for identical (or similar) molecules common to the 2 datasets, you’d need to do a full pairwise analysis. This applies either to isomorphism methods (to check for identical mols) or fingerprint methods (similarity calculation).

An alternative approach is to approximation methods, assuming you’re OK with a certain number of false positives. One possibility is to use something like geometric hashing to identify nearest neighbors – which should correspond to similar molecules. But even then you’d need to loop over each entry in the query database to find similar molecules in the target database.

One other approach could be to ‘map’ the target db onto the query db – using something like FastMAP or even PCA, and then identify the nearest points in the projection.

Finally, if you’re target db supports parallel queries (say via sharding etc), then you could just chunk up the query set and run it in parallel. Yeah, it’s brute force :)
NN comments

To find all identical molecules it is too expensive to make full pairwise matching. You can compute some canonical representation of a molecule and then compare these canonical representations as strings. For example, you can use canonical SMILES or InChI keys. But this is not the answer to the main question.

rajarshi guha
Yes, you’re correct. My primary point was that a complete comparison is going to be pairwise – you can speed up the computation for each pair (such as doing string matching as you pointed out) – but you don’t get away from the O(n^2) time

or Cancel

matteo floris [ Editor ]

First, you need a computer cluster.

Second, you could follow the Dalke’s suggestion on how improving the performance of Tanimoto calculation.

I would also suggest to have a look at Baldi’s work about faster calculation of similarity.
NN comments

For reference: I have implemented that technique from Baldi in my chemfp package. With a 0.8 Tanimoto similarity I can search 100,000 structures from PubChem in about 7 minutes. There’s a factor of 2-3 performance I can get from the algorithm, but we’re still talking two days (if there’s enough memory) on a single threaded machine.

or Cancel

Your answer

You need to join Blue Obelisk eXchange to complete this action, click here to do so.