Hi All,
Vladimir Chupakhin
Just out of press: SketchSort – Google Code and MolInf article
In general, to get a complete answer for identical (or similar) molecules common to the two datasets, you’d need to do a full pairwise analysis. This applies to both isomorphism methods (to check for identical molecules) and fingerprint methods (similarity calculation).
Full pairwise matching is too expensive for finding all identical molecules. Instead, you can compute a canonical representation of each molecule and then compare these canonical representations as strings. For example, you can use canonical SMILES or InChIKeys. But this is not the answer to the main question.
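A minimal sketch of the string-comparison idea, assuming the canonical strings have already been computed by some toolkit and are stored per molecule id. The function name `shared_molecules` and the input layout are hypothetical, and the SMILES strings in the usage note are only illustrative. Indexing one dataset by key means each molecule in the other dataset needs a single hash lookup.

```python
from collections import defaultdict


def shared_molecules(keys_a, keys_b):
    """Return {canonical_key: (ids_in_a, ids_in_b)} for keys found in both datasets.

    keys_a, keys_b: dicts mapping a molecule id to its canonical string
    (canonical SMILES, InChIKey, ...). The strings are assumed to be
    precomputed with the same toolkit and settings for both datasets.
    """
    # Index dataset A by canonical string; duplicates within A share a list.
    index_a = defaultdict(list)
    for mol_id, key in keys_a.items():
        index_a[key].append(mol_id)

    matches = {}
    for mol_id, key in keys_b.items():
        if key in index_a:  # membership test does not create new entries
            ids_a, ids_b = matches.setdefault(key, (index_a[key], []))
            ids_b.append(mol_id)
    return matches
```

For example, `shared_molecules({"a1": "CCO", "a2": "c1ccccc1"}, {"b1": "CC(=O)O", "b2": "CCO"})` reports only the `"CCO"` entry, pairing `a1` with `b2`. This handles exact identity only; it says nothing about similar molecules.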
Yes, you’re correct. My primary point was that a complete comparison is going to be pairwise. You can speed up the computation for each pair (such as by doing string matching, as you pointed out), but you don’t get away from the O(n^2) time.
First, you need a computer cluster.
For reference: I have implemented that technique from Baldi in my chemfp package. With a 0.8 Tanimoto similarity threshold, I can search 100,000 structures from PubChem in about 7 minutes. There’s a factor of 2-3 in performance I can still get from the algorithm, but we’re still talking two days (if there’s enough memory) on a single-threaded machine.
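The post does not say which technique from Baldi is meant. As a rough sketch, assuming it is the bit-count bound for threshold searches (Tanimoto(A, B) can never exceed min(|A|, |B|) / max(|A|, |B|), where |A| is the number of set bits), a pure-Python version might look like the following. Fingerprints are represented as Python integers, all function names are hypothetical, and this is not chemfp's actual code or API.

```python
def popcount(fp):
    """Number of set bits in a fingerprint stored as a non-negative int."""
    return bin(fp).count("1")


def tanimoto(a, b):
    """Tanimoto similarity of two bit fingerprints (0.0 if both are empty)."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def threshold_search(query, targets, threshold=0.8):
    """Return (index, score) for targets with Tanimoto >= threshold.

    Targets whose bit counts make the threshold unreachable are skipped
    before the full similarity is computed.
    """
    q_bits = popcount(query)
    hits = []
    for idx, fp in enumerate(targets):
        t_bits = popcount(fp)
        lo, hi = min(q_bits, t_bits), max(q_bits, t_bits)
        # Upper bound: Tanimoto <= lo / hi, so prune when lo / hi < threshold.
        if hi == 0 or lo < threshold * hi:
            continue
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((idx, score))
    return hits
```

The bound only removes candidates cheaply. In the worst case, when all fingerprints have similar bit counts, every pair is still evaluated.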