Search for similar molecules

Hi,

given a certain set of molecules (around 100), I have downloaded a subset of the ZINC database, around 10 GB of data. Which are the best and fastest open source tools for searching the ZINC subset for molecules similar to the ones in my set?

Thanks

Imported from: http://blueobelisk.stackexchange.com/questions/312

baoilleach
-

Please accept one of the answers if you are happy with it.

chem-bla-ics
-

Not sure about best and fast… perhaps we can start providing some scripts? But for that, you need to add some more details… input formats (all MDL SD files)… would setting up a DB work for you, using MySQL or PostgreSQL? etc…


4 answers

3

matteo floris [ Editor ]

Hi, you can use the Open Babel "fastsearch" index (see tutorial),

but you can also write a simple Python script: Pybel for fingerprint (FP) generation and Python sets for the Tanimoto calculation (see also my tutorial):

# Example fingerprints as bit strings; real ones would come from a toolkit such as Pybel.
fp_A = "110011"
fp_B = "101011"

# Collect the positions of the bits that are set to "1" in each fingerprint.
set_a = {i for i, bit in enumerate(fp_A) if bit == "1"}
set_b = {i for i, bit in enumerate(fp_B) if bit == "1"}

# Tanimoto coefficient: shared "on" bits divided by all "on" bits.
# (Assumes at least one bit is set; otherwise the union is empty.)
tanimoto = len(set_a & set_b) / len(set_a | set_b)
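
For a data set as large as the one in the question, building a Python set per molecule may become slow. As a rough sketch only, the same Tanimoto calculation can be done on fingerprints packed into plain Python integers. The helper names `to_bits`, `popcount` and `tanimoto_bits` are made up for this illustration and are not part of Pybel or Open Babel.

```python
def to_bits(bitstring):
    """Pack a bit string such as "110011" into a single Python integer."""
    return int(bitstring, 2)


def popcount(value):
    """Count the bits set to 1 in a non-negative integer."""
    return bin(value).count("1")


def tanimoto_bits(a, b):
    """Tanimoto coefficient of two fingerprints stored as integers."""
    union = popcount(a | b)
    # Two all-zero fingerprints have an empty union; return 0.0 instead of dividing by zero.
    return popcount(a & b) / union if union else 0.0


fp_a = to_bits("110011")
fp_b = to_bits("101011")
print(tanimoto_bits(fp_a, fp_b))  # 3 shared bits / 5 bits in the union = 0.6
```

The bitwise `&` and `|` play the role of set intersection and union, so the result matches the set-based version for the same bit strings.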
baoilleach
-

Didn’t realise you had a blog. Do you want to add it to the BlueObelisk blogs?

2

dmitry pavlov [ Editor ]

Hello,

There are two options with the Indigo toolkit.

  1. You could use Oracle (which is free for non-commercial use, albeit not open source) and install the Bingo cartridge on it. You could then import your SMILES or SDF data set into the database, index the table, and perform similarity searches with Bingo. The indexing will take some time, but the search results will come back very fast.
  2. You could write a command-line utility based on Indigo's code base (for example, starting from the Bingo source code). From the 10 GB data set size, I assume you have an SDF file, right? Then this example application should be almost exactly what you need. And here is another example showing different similarity metrics between two SMILES structures. The disadvantage is that the program will recalculate the fingerprints of the molecules in the data set 100 times (since you want to run 100 similarity searches). This can be avoided by pre-calculating the fingerprints and storing them somewhere, or by using Bingo, which does this job on Oracle.
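
The pre-calculation idea in option 2 can be sketched in plain Python, independent of Indigo: compute each library fingerprint once, keep it in memory, and reuse it for every query. This is only an illustration under assumptions. The function names, the toy bit strings and the 0.5 threshold are invented here, and real fingerprints would come from a toolkit rather than from hand-written strings.

```python
def on_bits(bitstring):
    """Turn a bit string into the set of positions that are set to 1."""
    return frozenset(i for i, bit in enumerate(bitstring) if bit == "1")


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints stored as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def search_all(queries, library, threshold=0.5):
    """For each query, list library hits at or above the threshold, best first."""
    results = {}
    for q_name, q_fp in queries.items():
        scored = [(tanimoto(q_fp, fp), name) for name, fp in library.items()]
        results[q_name] = sorted((hit for hit in scored if hit[0] >= threshold), reverse=True)
    return results


# Toy data, made up for illustration. The library fingerprints are built once here
# and then reused for every query, instead of being recalculated per search.
library = {
    name: on_bits(bits)
    for name, bits in [("lib1", "110011"), ("lib2", "000111"), ("lib3", "101011")]
}
queries = {"q1": on_bits("110011")}

print(search_all(queries, library))
```

For a 10 GB input the pre-calculated fingerprints would more likely be written to disk or a database than held in a dictionary, but the structure of the loop stays the same.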

With best regards,

Dmitry
