Public sources of chemical structures: which is the best compromise between quality and quantity?

Asked by [ Editor ]

I hope that the question is self-explanatory.

Anyway, I would like to know your opinion about public (not commercial) repositories/databases/interfaces of chemical structures for virtual screening. In this category I would include all the sources from which users can download the full data, or at least query the data and download the resulting subset.
Which is the best compromise between quality and quantity (and let me also include diversity)?

4 answers

1

chem-bla-ics [ Admin ]

You can also look at this from another perspective. Go for the quantity, and do the validation yourself. Actually, you would be wise to do this for the smaller, curated databases too. Your particular use case is rarely an exact match with the database; therefore, you are likely in the situation where you have to preprocess your input anyway. Why not just include the validation in that process?

This validation is pretty easy to do with one of the many Open Source utilities around. For example, the CDK can validate many aspects and do other kinds of filtering. Diversity analysis will not be a problem with the free tools around either (e.g. What tools are available for clustering small molecules?).
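
To illustrate the idea of folding validation into preprocessing, here is a minimal sketch in plain Python. It assumes the input is an SD file with V2000 mol blocks and only does a cheap structural sanity check (does the counts line parse, and are the atom and bond blocks long enough?). The function names are hypothetical, and a real toolkit such as the CDK would check far more (valences, atom types, and so on).

```python
def split_sd_records(text):
    """Split SD file text into records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    return records


def is_plausible_v2000(record):
    """Cheap sanity check on one record (a list of lines).

    Assumes a V2000 mol block: 3 header lines, then a counts line whose
    first two 3-character fields are the atom and bond counts.
    """
    if len(record) < 4:
        return False
    counts = record[3]
    try:
        n_atoms = int(counts[0:3])
        n_bonds = int(counts[3:6])
    except ValueError:
        return False
    if n_atoms == 0:
        return False
    # Header + counts line + atom block + bond block must all be present.
    return len(record) >= 4 + n_atoms + n_bonds


def preprocess(text):
    """Keep only the records that pass the sanity check."""
    return [r for r in split_sd_records(text) if is_plausible_v2000(r)]
```

Records that fail the check are simply dropped here; in practice you would probably want to log them so you can see what the database got wrong.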

0

gmueller [ Editor ]

As so often, the “best compromise” depends on the time and effort one can spend on a project (as you most probably know :-) ).
However, some of my starting points come from ZINC:
NCI Diversity Set II (1364 compounds, also available here)
NCI Plated 2007 with cluster representatives at 60% (6400 compounds)
PubChem catalog with cluster representatives at 60% (~ 20k compounds)
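
For anyone who wants to build a similar reduced subset themselves, the following is a rough sketch of picking representatives with a similarity cutoff. It assumes the 60% refers to a Tanimoto-style similarity threshold and that fingerprints are already available as sets of on-bit indices; how ZINC actually builds its subsets is not stated here, so treat this greedy leader selection as one possible approach, with hypothetical function names.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def pick_representatives(fingerprints, threshold=0.6):
    """Greedy leader selection over (name, fingerprint) pairs.

    A compound becomes a new representative only if its similarity to
    every representative chosen so far is below `threshold`.
    """
    representatives = []
    for name, fp in fingerprints:
        if all(tanimoto(fp, rep_fp) < threshold for _, rep_fp in representatives):
            representatives.append((name, fp))
    return [name for name, _ in representatives]
```

The result depends on the input order, and each candidate is compared against all current representatives, so this is only practical for modest set sizes.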

Commercial, but IMHO still interesting, is the eMolecules database.

Regarding quality, there are many 3D structures out there,
but I usually do these calculations on my own to be sure
that needle and haystack structures are treated the same way…

matteo floris
-

What about isomers in ZINC? Are you sure that they are “the true” isomers?

gmueller
-
Maybe there are errors, but hey, it’s free and there are useful lists. Isomers, tautomers, protonation states, etc. are in my eyes like 3D structures: if you calculate them yourself, you get the best impression of what limitations apply.

baoilleach
-

The interesting thing about the eMolecules database is that they don’t include any chemicals which are not immediately available for purchase. Vendors that are slow to ship, or don’t have the chemicals in stock, are removed after a warning.

imants
-
floris, I think that when a supplier submits structures with unspecified stereochemistry, they generate isomers for each molecule. So if you see that company XYZ is selling an R-isomer of some compound, what is actually available is often the racemate. This is a common approach as far as I know: you need 3D structures to dock, so 2D structures in supplier catalogs have to be processed into 3D.
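
To make the enumeration concrete, here is a toy sketch that expands unspecified stereocenters into every R/S combination. It works on labels only (no real structures), assumes each center is independent, and ignores symmetry and meso forms; the function name is hypothetical, and this is not a description of what any particular database does internally.

```python
from itertools import product


def enumerate_stereo_labels(centers):
    """Expand unspecified stereocenters into all R/S assignments.

    `centers` is a list with one entry per stereocenter: 'R', 'S',
    or None when the supplier left it unspecified.
    """
    choices = [(c,) if c is not None else ("R", "S") for c in centers]
    return ["".join(combo) for combo in product(*choices)]
```

With n unspecified centers this yields 2**n entries, which is why a single catalog compound can show up as several “isomers” even though only the mixture is on sale.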

0

imants

ZINC is a good choice. If you are interested in unprocessed or lightly processed structures, then I think eMolecules offers a download of all of their data on commercially available compounds. My company MolPort also works with computational chemists: see our press release with InhibOx. We provide our entire compound database for free in SD file format. Once you finish the virtual screening and are ready to test a set of compounds, we can source them for you at supplier-direct prices.

0

tony27587 [ Editor ]

I guess the question is: what is the measure of “quality”? Is it the accuracy of the data associated with the chemical structures, the links to compounds that are still commercially available, or the structure depictions? There are many interpretations of quality in this space. In terms of validated data associated with chemical compounds, I believe that ChemSpider (www.chemspider.com) is the largest online source of data with validated content. It is not all validated yet, as it covers over 24.7 million compounds. Most people use ZINC, which I think is likely the most useful for your work, as it includes a large number of commercial collections. eMolecules does contain a lot of commercially available compounds, but it also has the NIST WebBook collection, which contains a lot of obscure chemistry such as the sodium chloride dimer (see here: http://www.chemspider.com/blog/aggregated-chemistry-and-quality-is-chemspider-a-good-representative.html)

imants
-

Are ChemSpider data available for download?

tony27587
-
No, we don’t provide access to the entire database for download. But in terms of just chemical structures that might be used for virtual screening, for example, we’ve never refused someone who wanted a particular slice of data to do the screening with. In that case we provide only the structures and the associated ChemSpider IDs.
