The most common structure-based chemical database search UI is to sketch a structure and do either a similarity search or a substructure search. (Way down the list is to type SMARTS queries in by hand.)
If you write a database search program, you really want to tune it and measure its performance on real-world queries. That’s better than making synthetic queries from fragments of what’s in the database or (worse) using records from the data set itself as queries.
I don’t know of any publicly available collection of real-world structure queries. I’ve asked PubChem, but that data is confidential by law. I asked ChemSpider a couple of years ago, but they didn’t have enough queries at that point.
Do you know where such query sets can be found, or can you donate that data? Preferably it would contain at least a few thousand queries (tens or hundreds of thousands would also be fine), along with information about the target that was searched. For example, “14,000 structure queries done against ZINC” would be perfect.
I know of two places which would use it immediately. 1) My chemfp project is at the point where I want to report performance numbers. 2) Greg Landrum started working last week to optimize the substructure fingerprint screening in the RDKit PostgreSQL cartridge, and wants a way to test his work.
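To make the intended use concrete, here is a minimal sketch of how a set of query structures could drive a timing run for a similarity search. It is plain Python and does not use the chemfp or RDKit APIs. The fingerprints are made-up integers standing in for real bit fingerprints, and the names `tanimoto`, `threshold_search`, and `benchmark` are hypothetical.

```python
import time


def popcount(x: int) -> int:
    """Count the number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity between two fingerprints stored as integers."""
    union = popcount(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return popcount(a & b) / union


def threshold_search(query_fp, targets, threshold):
    """Return (target_id, score) pairs scoring at or above the threshold."""
    hits = []
    for target_id, target_fp in targets.items():
        score = tanimoto(query_fp, target_fp)
        if score >= threshold:
            hits.append((target_id, score))
    return hits


def benchmark(queries, targets, threshold=0.7):
    """Time each query; return (query_id, hit_count, seconds) tuples."""
    timings = []
    for query_id, query_fp in queries.items():
        start = time.perf_counter()
        hits = threshold_search(query_fp, targets, threshold)
        elapsed = time.perf_counter() - start
        timings.append((query_id, len(hits), elapsed))
    return timings


# Toy data (assumption): real fingerprints would come from a toolkit.
targets = {"t1": 0b11110000, "t2": 0b11111100, "t3": 0b00001111}
queries = {"q1": 0b11110000, "q2": 0b00000011}

for query_id, hit_count, seconds in benchmark(queries, targets, threshold=0.6):
    print(f"{query_id}: {hit_count} hits in {seconds:.6f} s")
```

With human-generated queries in place of the toy `queries` dictionary, the per-query timings would reflect the mix of structures people actually search for, which is the point of asking for real data.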
(Earlier this year, Dimitris asked for a large collection of SMARTS. That is a related task, but for now I’m only looking for example structures, which are used as database search queries.)
Hi Andrew,
just yesterday I finished a substructure mapping on the whole dataset of MMsINC 1.0 entries: 1) first, I wrote a CDK fragmenter based on a set of known rules (RECAP and others); 2) then, I fragmented the 4M entries (about 40M fragment/compound pairs in total); 3) finally, I performed a substructure search in order to “validate” the fragments. I do not know if it is exactly what you need/meant…
Ideally I would like human-generated structures, not machine-generated ones.
ok, clear.