
Structure queries for substructure and similarity search benchmarking

Asked by [ Editor ]

The most common structure-based chemical database search UI is to sketch a structure and do either a similarity search or a substructure search. (Way down the list is to type SMARTS queries in by hand.)

If you write a database search program then you really want to tune it and measure its performance against real-world queries. That’s better than making synthetic queries based on fragments of what’s in the database, or (worse) using whole records from the data set as queries.

I don’t know of any public collection of real-world structure queries. I’ve asked PubChem, but their query data is confidential by law. I asked ChemSpider a couple of years ago, but they didn’t have enough queries at that point.

Do you know where these can be found, or can you donate that data? Preferably it would contain at least a few thousand queries (10s or 100s of thousands would also be fine), along with target information. For example, “14,000 structure queries done against ZINC” would be perfect.

I know of two places which would use it immediately. 1) My chemfp project is at the point where I want to report performance numbers. 2) Greg Landrum started working last week to optimize the substructure fingerprint screening in the RDKit PostgreSQL cartridge, and wants a way to test his work.
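To make "report performance numbers" concrete, here is a minimal sketch of how a query set could be timed once it exists. Everything in it is an assumption for illustration: `time_queries`, `summarize`, `toy_search`, `TARGETS`, and `QUERIES` are hypothetical names, and `toy_search` does plain substring matching on SMILES text as a stand-in. It is not a real substructure search and is not how chemfp or the RDKit cartridge work; a real search function would be swapped in.

```python
import statistics
import time
from typing import Callable, Iterable, List, Sequence


def time_queries(search: Callable[[str], Sequence[str]],
                 queries: Iterable[str],
                 repeats: int = 3) -> List[float]:
    """Return the best-of-`repeats` wall-clock time (seconds) for each query."""
    timings = []
    for query in queries:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            search(query)
            best = min(best, time.perf_counter() - start)
        timings.append(best)
    return timings


def summarize(timings: Sequence[float]) -> dict:
    """Summarize per-query timings with count, median, 95th percentile and max."""
    ordered = sorted(timings)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "count": len(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }


# Placeholder targets and queries, for illustration only.
TARGETS = ["CCO", "CCN", "c1ccccc1O", "CC(=O)O"]
QUERIES = ["CC", "c1ccccc1", "N"]


def toy_search(query: str) -> List[str]:
    # Plain substring matching on SMILES text; NOT a real substructure search.
    return [smiles for smiles in TARGETS if query in smiles]


summary = summarize(time_queries(toy_search, QUERIES))
print(summary)
```

With human-generated queries in place of `QUERIES`, the per-query distribution (rather than a single total) would show whether a few slow queries dominate the run time.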

(Earlier this year, Dimitris asked for a large collection of SMARTS. That is a related task, but for now I’m only looking for example structures, which are used as database search queries.)
matteo floris

Hi Andrew,

Just yesterday I finished a substructure mapping of the whole MMsINC 1.0 dataset: 1) first, I wrote a CDK fragmenter based on a set of known rules (RECAP and others); 2) then, I fragmented the 4M entries (about 40M fragment/compound pairs in total); 3) finally, I performed a substructure search in order to “validate” the fragments. I do not know if this is exactly what you need or meant…


Ideally I would like human-generated structures, and not machine generated ones.


1 answer


iain.m.wallace from United Kingdom

Would the ChEMBL database be useful? Or the ChEBI database?

These would all be compounds of biological interest, and then you could fragment them in different ways.

Not sure if that is what you meant or not.

Fragmentation is pretty simple. RDKit, for example, implements the RECAP algorithm, and Greg already tests his RDKit/PostgreSQL cartridge with fragments from PubChem and from ZINC. I want to optimize the search performance for the types of structures that humans search for, which means human-generated structures, not algorithmically generated ones.

