Which source code exists for calculating a tanimoto similarity for text input flatfiles of the form
rajarshi guha
You can do this in R via the fingerprint package, but it is not going to be fast (it certainly doesn’t make use of the GPU).
How fast/slow is it? How many vectors can be compared in which time and how long are they?
Besides, I just added two links to source code I have stumbled upon. Unfortunately, neither of the two solutions is 'clean' enough to use.
As an example, evaluating the similarity between a query fp and 5000 target fps (1024 bit, with ~ 512 bits set to 1) takes 1.35 sec (R 2.14, Macbook Pro). Not the fastest code (the underlying C code is not optimized). Here’s code to try it locally:
library(fingerprint)
fps <- sapply(1:5000, function(i) random.fingerprint(nbit=1024, on=512) )
fpq <- random.fingerprint(nbit=1024, on=512)
system.time(sims <- lapply(fps, distance, fp2=fpq))
Of course this can be parallelized trivially (especially in R 2.14)
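For readers without R at hand, the quantity being timed above can be sketched in plain Python. This is a minimal illustration, not the fingerprint package's implementation: fingerprints are stored as Python integers used as bitsets, and `random_fingerprint`, `popcount` and `tanimoto` are hypothetical helper names chosen only to mirror the R snippet. Returning 0.0 for two empty fingerprints is an assumed convention.

```python
import random


def random_fingerprint(nbit=1024, on=512, rng=random):
    """Return an int whose binary form has exactly `on` of `nbit` bits set.

    Hypothetical helper that mimics the idea of the R call above.
    """
    fp = 0
    for pos in rng.sample(range(nbit), on):
        fp |= 1 << pos
    return fp


def popcount(x):
    """Count the bits set to 1 in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Tanimoto similarity: |A and B| / |A or B| on two bitsets."""
    union = popcount(a | b)
    if union == 0:
        return 0.0  # assumed convention when both fingerprints are empty
    return popcount(a & b) / union


if __name__ == "__main__":
    # Small demo: one query against a handful of random targets.
    query = random_fingerprint()
    targets = [random_fingerprint() for _ in range(10)]
    sims = [tanimoto(query, t) for t in targets]
    print(min(sims), max(sims))
```

The whole comparison reduces to two bitwise operations and two population counts per pair, which is why optimized C or GPU implementations can be much faster than a generic per-element loop.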
By just replacing lapply with the parallel version, mclapply I think? Or does 2.14 have something even fancier?
Well, I was thinking more of the million range and of much longer sparse encodings, which have not yet been hashed to a short, dense 1024-bit binary vector.
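To make the distinction between the two representations concrete, here is a minimal Python sketch. It assumes a sparse fingerprint is simply the set of its on-bit indices, and it uses modulo folding as one common way to hash such a set down to a short dense vector; whether the data set discussed here is hashed that way is not stated in the thread. The names `tanimoto_sparse` and `fold` are hypothetical.

```python
def tanimoto_sparse(a, b):
    """Tanimoto similarity of two sparse fingerprints.

    `a` and `b` are sets of on-bit indices; no dense vector is built.
    """
    common = len(a & b)
    union = len(a) + len(b) - common
    if union == 0:
        return 0.0  # assumed convention when both fingerprints are empty
    return common / union


def fold(on_bits, nbit=1024):
    """Fold sparse indices into an `nbit`-wide bitset (stored as an int).

    Modulo folding is an assumption for illustration; distinct indices
    can collide on the same position, so information may be lost.
    """
    fp = 0
    for idx in on_bits:
        fp |= 1 << (idx % nbit)
    return fp


if __name__ == "__main__":
    a = {1, 5, 9}
    b = {5, 9, 20}
    print(tanimoto_sparse(a, b))
    print(bin(fold({0, 1025}, nbit=1024)))
```

With sets, the cost of one comparison scales with the number of bits that are set rather than with the nominal fingerprint length, which is what makes the sparse form attractive for very long encodings.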
So, let us assume we already have a hashed short binary vector: extrapolating from 1.35 sec per 5000 comparisons, a query against one million compounds would run for about 270 seconds (200 × 1.35 sec).
I added more text and references to my original query and consider it an open challenge now ;-)
I have to double-check, but I think we can release a benchmarking data set with around one million molecules. It has a size of around 10GB in the sparse encoding. I am concerned about space and would never store the data as dense binary vectors, because there are too many zeros. Let me come back on this … in two weeks … I need some time.
Since you store in ASCII counts (~5 bytes per bit), your sparse format takes more space than a hex encoding of a dense format when your bit density is higher than 0.5 %. I’ve also found that gzip compresses chemfp’s hex-encoded FPS format better than keeping the raw bytes around. It looks like you have around 2000 bits set per fingerprint. Is that over a range of 2**32 possible bits?
BTW, Rajarshi’s fingerprint package also supports the FPS format.
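As a rough illustration of what working with hex-encoded fingerprints looks like, the following Python sketch reads simplified FPS-style lines and compares two records. It assumes header lines start with `#` and each data line is a hex-encoded fingerprint, a tab, and an identifier; the bit order used when converting to an integer is also an assumption, and it does not affect the Tanimoto value. This is not chemfp's own reader, and `parse_fps` is a hypothetical name.

```python
def parse_fps(lines):
    """Yield (identifier, fingerprint_as_int) from simplified FPS-style lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        hex_fp, _, rest = line.partition("\t")
        identifier = rest.split("\t")[0]
        # Assumed bit order: bit 0 is the low bit of the first byte.
        yield identifier, int.from_bytes(bytes.fromhex(hex_fp), "little")


def tanimoto(a, b):
    """Tanimoto similarity of two bitsets stored as ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # assumed convention when both fingerprints are empty
    return bin(a & b).count("1") / union


if __name__ == "__main__":
    example = ["#FPS1", "#num_bits=16", "0f00\tA", "ff00\tB"]
    records = dict(parse_fps(example))
    print(tanimoto(records["A"], records["B"]))
```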