(This is a concrete question on datasets and scoring calculations, not a request for understanding how similarity methods can be measured.)
For chemfp I’ve implemented interfaces to 8 or so different fingerprint generation methods across several different toolkits, and I’ve also come up with my own cross-platform variation of the PubChem/CACTVS fingerprints.
Now I want to estimate how effective each one is at finding compounds which a chemist would agree are “similar.” For example, how effective are the RDKit circular fingerprints compared to OpenBabel’s FP2 fingerprint for task XYZ?
There are a huge number of published ways of doing this, and I don’t know the literature that well. Ideally I would like to implement a few of the most common comparisons … if only I knew what they were.
Could you tell me how to evaluate fingerprint similarity effectiveness? Preferably in the form “download dataset T, generate fingerprints Tfp, for each query Q and fingerprint Qfp find … and score the results as …”
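To make the shape of the answer I’m after more concrete, here is a toy sketch in plain Python. Everything in it is a placeholder assumption on my part: the fingerprints are made-up bit patterns stored as integers, the "actives" are invented, and the score (fraction of known actives among the top k hits by Tanimoto similarity) is just one guess at what a scoring step might look like, not something I know to be a standard method.

```python
# Toy sketch only: fingerprints are Python ints used as bitsets.
# All identifiers and data below are hypothetical placeholders.

def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity between two bitset fingerprints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return bin(a & b).count("1") / union

def rank_targets(query_fp: int, target_fps: dict) -> list:
    """Return target ids sorted by decreasing similarity to the query."""
    return sorted(
        target_fps,
        key=lambda tid: tanimoto(query_fp, target_fps[tid]),
        reverse=True,
    )

def fraction_of_actives_in_top_k(ranked_ids: list, active_ids: set, k: int) -> float:
    """Placeholder score: share of the top-k hits that are known actives."""
    top = ranked_ids[:k]
    return sum(1 for tid in top if tid in active_ids) / k

# "Dataset T" and its fingerprints "Tfp" (invented values)
target_fps = {
    "t1": 0b11110000,
    "t2": 0b11000001,
    "t3": 0b00001111,
    "t4": 0b00011000,
}
active_ids = {"t1", "t2"}  # invented ground truth for this query

# "Query Q" and its fingerprint "Qfp"
query_fp = 0b11110001

ranked = rank_targets(query_fp, target_fps)
score = fraction_of_actives_in_top_k(ranked, active_ids, k=2)
print(ranked, score)
```

What I don’t know is which real dataset, which ground truth, and which scoring function should replace these placeholders.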