Source code for Tanimoto similarity based on flat file input?

Asked by [ Editor ], Edited by Joerg Kurt Wegner [ Editor ]

Which source code exists for calculating a Tanimoto similarity for text input flat files of the form

CompoundID Bit1 Bit2 Bit3 Bit4 Bit5 ... BitN
cmpd1 0 1 1 0 1 ... 0
cmpd2 0 1 0 1 0 ... 0

or the sparse version using only 'on bits':

CompoundID OnBit
cmpd1 2 3 5 ...
cmpd2 2 4 ...
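
To be explicit about the calculation I mean: the Tanimoto coefficient over the on-bit sets A and B is |A ∩ B| / |A ∪ B|. A naive reference sketch in Python (only to illustrate the semantics; the file name and header handling are my assumptions, and this is exactly the slow version I want to avoid):

# naive Tanimoto over the sparse 'on bits' flat file described above
def read_sparse(path):
    """Return {compound_id: set of on-bit indices}."""
    fps = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if not fields or fields[0] == "CompoundID":  # skip the header line
                continue
            fps[fields[0]] = set(int(bit) for bit in fields[1:])
    return fps

def tanimoto(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 0.0

fps = read_sparse("compounds_sparse.txt")   # hypothetical input file
ids = sorted(fps)
for i, x in enumerate(ids):                 # all-pairwise, quadratic and slow
    for y in ids[i + 1:]:
        print(x, y, tanimoto(fps[x], fps[y]))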

I am most interested in speed and in source code doing exactly that, and nothing else.
Ideally, the code would use some recent optimizations like the Baldi bounds, GPUs, ...
Ideally, the code would not only take one flat file and calculate all-pairwise or thresholded similarities, but rather take two flat files and report, for each entry in the first file, the similar entries in the second.
When using a sparse encoding the number of bits might be very large, and the conversion into a dense binary vector (the hashing step) has its own challenges; see the 'More reading' links below.

Here is some source code I stumbled upon, but I do not like the GUI; I would just like to have a batch executable.
Or SUBSET, though it looks like 'old' C code focused on the clustering, so how would one disentangle it?

More reading

2 answers

rajarshi guha [ Editor ] from United States of America

You can do this in R via the fingerprint package, but it is not going to be fast (it certainly doesn't make use of the GPU)

Comments
joergkurtwegner
-

How fast/slow is it? How many vectors can be compared in what time, and how long are they?

Besides, I just added two links to source code I have stumbled upon. Unfortunately, neither of the two solutions is 'clean' enough to use.

rajarshi guha
-

As an example, evaluating the similarity between a query fp and 5000 target fps (1024 bits, with ~512 bits set to 1) takes 1.35 sec (R 2.14, MacBook Pro). Not the fastest code (the underlying C code is not optimized). Here’s code to try it locally:

library(fingerprint)

fps <- sapply(1:5000, function(i) random.fingerprint(nbit=1024, on=512))

fpq <- random.fingerprint(nbit=1024, on=512)

system.time(sims <- lapply(fps, distance, fp2=fpq))

Of course this can be parallelized trivially (especially in R 2.14)

chem-bla-ics
-

By just replacing lapply with the parallel version, mclapply I think? Or does 2.14 have something even fancier?

joergkurtwegner
-

Well, I was thinking more of a million-compound range and much longer sparse encodings, which have not yet been hashed down to a short 1024-bit full binary vector.

So, let us assume we already have a hashed short binary vector; for one million compounds the query then runs 270 seconds.

I added more text and references to my original query and consider it an open challenge by now ;-)

dalke
-

I’ll gladly take the challenge. Can you provide a data set?

joergkurtwegner
-

I have to double-check, but I think we can release a benchmarking data set with around one million molecules. It has a size of around 10 GB in the sparse encoding. I am concerned about space and would never store the data as dense binary vectors; there are too many zeros. Let me come back on this ... in two weeks ... I need some time.

dalke
-

Since you store the bits as ASCII numbers (~5 bytes per bit), your sparse format takes more space than a hex encoding of a dense format when your bit density is higher than 0.5%. I’ve also found that gzip compresses chemfp’s hex-encoded FPS format better than keeping the raw bytes around. It looks like you have around 2000 bits set per fingerprint. Is that over a range of 2**32 possible bits?

BTW, Rajarshi’s fingerprint package also supports the FPS format.


dalke [ Editor ]

Try chemfp. It doesn’t support your format, but it’s easy to convert from your format to the hex-encoded one it does use.
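
As a rough idea of that conversion (an untested sketch, not part of chemfp; the fixed width, the modulo folding, and the bit-within-byte order are my assumptions, so check them against chemfp's own decoders before trusting the output):

import binascii

NUM_BITS = 1024                          # assumed folded width

def sparse_to_fps(infile, outfile, num_bits=NUM_BITS):
    num_bytes = (num_bits + 7) // 8
    with open(infile) as src, open(outfile, "w") as dst:
        dst.write("#FPS1\n")             # FPS header
        dst.write("#num_bits=%d\n" % num_bits)
        for line in src:
            fields = line.split()
            if not fields or fields[0] == "CompoundID":
                continue
            fp = bytearray(num_bytes)
            for bit in map(int, fields[1:]):
                bit %= num_bits          # fold large sparse indices down
                fp[bit // 8] |= 1 << (bit % 8)
            hexfp = binascii.hexlify(bytes(fp)).decode("ascii")
            dst.write("%s\t%s\n" % (hexfp, fields[0]))   # hex fingerprint, tab, id

sparse_to_fps("compounds_sparse.txt", "compounds.fps")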


It has Baldi optimization (well, one of them). The popcount code is some of the fastest around; I’m working on a version right now which uses a hardware popcount if your processor supports it, but even the non-chip-specific code is quite fast. (It uses the CPU, not the GPU.)
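
For those who haven't seen it, the Baldi-style pruning rests on a popcount bound (usually credited to Swamidass and Baldi): with a = popcount(query), b = popcount(target) and c = popcount(AND), the Tanimoto score c/(a+b-c) can never exceed min(a,b)/max(a,b), so for a threshold t only targets with popcounts between t*a and a/t need to be scored. A minimal sketch of that rule (my paraphrase of the idea, not chemfp's actual code):

import math

def candidate_popcounts(query_popcount, threshold, num_bits):
    # only target popcounts in [t*a, a/t] can possibly reach the threshold
    a, t = query_popcount, threshold
    lo = int(math.ceil(a * t))
    hi = min(num_bits, int(math.floor(a / t)))
    return range(lo, hi + 1)

rng = candidate_popcounts(512, 0.9, 2048)
print(rng.start, rng.stop - 1)   # 461 568: targets with any other popcount can be skipped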

Do you really want “all-pairwise similarities”, or all similarities over a threshold? The first doesn’t take advantage of the Baldi optimization, and will generate a lot of data. (How do you want that data?)

The timings are highly dependent on what you want to do. Try it out and see. My test case simply counts the number of matches which are at least 0.9 similar to the query. Both the query and target sets contain 110891 RDKit fingerprints of 2048 bits; that takes 7 seconds (single threaded) on my desktop, which has POPCNT hardware, and 13 seconds otherwise.

Also, I’m looking for project funding. Some of the future directions are parallelization, OpenMP, faster support for when the targets and queries are identical, new methods for large sparse fingerprints, and more. Let me know if you’re interested in contributing funding to the effort.
Comments
joergkurtwegner
-

Agreed, taking advantage of hashing or other pre-computed optimizations requires a threshold, which might not have been applied in this flat file input. So, I assume that sparse fingerprints can be very large, and encoding them as dense binary vectors might create very long vectors. The assumption in chemfp is that all (very long) sparse or full binary vectors have already been transformed into an optimized and much shorter hashed fingerprint format (max 1024 bits), right? If not, some extra step will be required, see e.g. http://chembioinfo.com/2011/11/04/cdk-hashed-fingerprinter/
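
To make that concern concrete, here is a toy folding experiment (all widths and indices are made up; it only shows how collisions in the hashing step distort the scores):

import random

def fold(on_bits, width):
    # modulo folding of large sparse indices into a short dense vector
    return {bit % width for bit in on_bits}

def tanimoto(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 0.0

random.seed(1)
a = set(random.sample(range(2**32), 2000))                       # ~2000 on bits over a 2**32 space
b = set(random.sample(range(2**32), 1500)) | set(list(a)[:500])  # forced overlap with a

print("unfolded:", round(tanimoto(a, b), 3))
for width in (1024, 4096, 16384):
    print("folded to %5d bits:" % width, round(tanimoto(fold(a, width), fold(b, width)), 3))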

With respect to all-pairwise similarity calculations, I just added a reference section to my initial query.

joergkurtwegner
-

And I remember an older e-mail discussion about...


from chemfp import decoders

hex = decoders.frombinarymsb("1001001110").encode("hex")

or

hex = binascii.hexlify(decoders.frombinarymsb("1001001110"))

… which was never really straightforward for me ;-)

I am less interested in a 'standard fingerprint' than in using any sparse binary vector in any combination whenever I feel like it.

dalke
-

If you’re using the first of those two input formats for large, sparse fingerprints then I don’t think you’re all that concerned about space. :) How big is your "N"? chemfp supports any size N (not "max 1024"), but it’s best to be dense. Few methods produce more than 4K useful bits, so folding to 16 Kbit should give you almost full precision.

Regarding "encode()" vs. "binascii.hexlify": the latter is a micro-optimization I use to squeeze out a few percent faster encoding times. You should use encode() and not worry about it.

Thanks for the reference to SketchSort. I knew about MinHash but hadn't seen anyone apply it to this problem. At 0.95 similarity and PubChem 881 bits, it's faster than my code starting around 100,000 compounds (26s vs. 41s), and it has roughly linear scaling vs. my quadratic (445s for 1 million compounds, vs. 5745s; note 802762 unique fingerprints and 128 which are all zeros). I don't understand why SketchSort has 0 scores in the output, like "908 909 0", but I'll assume it's a meaningful hit. Its hit count is ~1-2% less than what I find.
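
(For anyone who hasn't seen the trick: MinHash works because, for a random hash function, the probability that two sets share the same minimum hash value equals their Jaccard/Tanimoto similarity. A toy estimator of that idea, emphatically not SketchSort's code:)

import random

def minhash_signature(on_bits, seeds):
    # one simulated random permutation per seed
    return [min(hash((seed, bit)) for bit in on_bits) for seed in seeds]

def estimated_jaccard(sig_a, sig_b):
    # fraction of positions where the minima agree ~= true Jaccard similarity
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

random.seed(0)
seeds = [random.getrandbits(32) for _ in range(256)]
a = set(range(0, 600))
b = set(range(200, 800))      # true Jaccard = 400 / 800 = 0.5
print(estimated_jaccard(minhash_signature(a, seeds), minhash_signature(b, seeds)))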

What similarity values do you usually use? MinHash is fastest for high similarity, so if you do around 0.8 similar then you’ll have more false negatives or slower times. Testing now, SketchSort on 100,000 fingerprints takes 387s vs. 164s for my code, and it finds 5835738 hits vs. my 5835944. (I haven't verified if they return the same hits.)

kojitsuda
-

Hi, I am one of the authors of SketchSort. Actually, a threshold of 0.8 is pretty hard for SketchSort, as you mentioned. The search problem gets rapidly harder as you lower the threshold, so this level of similarity would be difficult to tackle for any kind of algorithm. By default we use a very tight threshold on the false negative rate; you can set a more relaxed FN rate to improve speed.

joergkurtwegner
-

Koji, is your implementation available as source code and easy to use? Just checking … ;-)

joergkurtwegner
-

Koji, excellent! I will let you know how it goes, this might be just what I was looking for. Stay tuned…
