Are there any head-to-head comparisons on prediction performance of CDK algorithms?

Asked by [ Editor ]

Are there any publications comparing the performance of the CDK with commercial algorithms?

I am aware of the poster: Using Open Source Descriptors and Algorithms for Modeling ADME Properties http://tinyurl.com/5wftad8 and the related publication: http://tinyurl.com/6xggpd4.

I am looking for head-to-head comparisons of logP and PSA prediction.

NN comments
chem-bla-ics
-

This is the goal and scope of CDK News. A few CDK News papers have been published where such results are detailed, but I would love to see this widened. The incentive is typically missing, and there has been no scientific reward for publishing such results. This is what makes the new Open Research Computation so interesting!

1 answer

2

rajarshi guha [ Editor ]

PSA compares quite well: see http://twitpic.com/3u2lm7 for a comparison on ~57K molecules using the latest CDK master and v12 of ACD Physprop. The CDK still had some atom type issues on a few molecules, so the PSA values for those cases might be wrong.

CDK's logP performance is much poorer: see http://twitpic.com/3u2os5. The AlogP implementation really needs updating. The XLogP implementation is slightly better (R2 = 0.43). But I will note that even ChemAxon's logP exhibits an R2 of 0.56 with ACD.
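For readers who want to reproduce this kind of head-to-head number, here is a minimal sketch of computing R2 between two sets of predictions. It assumes R2 means the squared Pearson correlation (the thread does not say how the values were computed), and the numbers in the example are made up for illustration, not taken from the plots above.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("R2 is undefined when one series is constant")
    return (sxy * sxy) / (sxx * syy)


# Made-up illustrative values (assumption), one entry per molecule.
cdk_logp = [1.2, 0.4, 3.1, 2.2, -0.5]
acd_logp = [1.0, 0.9, 2.7, 2.9, -0.1]
print(round(r_squared(cdk_logp, acd_logp), 3))
```

Note that a high R2 between two predictors only says they agree with each other, not that either one is accurate.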

Update

Following Antony's point that a better comparison is against experimental data, I used some measured logP data (~10K compounds, but I cannot release the values or structures). ACD does significantly better than CDK's XLogP: http://twitpic.com/3uxs1c

If I remember correctly, the ACD model is a CNN (computational neural network). Out of curiosity I ran a quick random forest model, using CDK topological and constitutional descriptors and only minimal feature selection (removing descriptors with undefined values): http://twitpic.com/3uxufn. It does much better than XLogP, and that is not bad at all for minimal effort. With proper feature selection and a move to a CNN, SVM, etc., I expect one might get close to the ACD performance.
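The "minimal feature selection" step is simple enough to sketch. The snippet below is an illustration only, not the code used for the plot: it assumes the descriptors have already been calculated into a table (one row per molecule), and the descriptor names and values are hypothetical placeholders. It drops every descriptor column that is undefined (None, NaN or infinite) for at least one molecule.

```python
import math


def drop_undefined_descriptors(names, rows):
    """Keep only descriptor columns that are finite for every molecule.

    names: list of descriptor names (hypothetical labels)
    rows:  list of per-molecule descriptor lists; None/NaN/inf count as undefined
    """
    def is_defined(value):
        return value is not None and math.isfinite(value)

    # Column indices where every molecule has a defined value.
    keep = [
        j for j in range(len(names))
        if all(is_defined(row[j]) for row in rows)
    ]
    kept_names = [names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_names, kept_rows


# Made-up descriptor table (assumption); names are placeholders, not CDK output.
names = ["desc_a", "desc_b", "desc_c"]
rows = [
    [1.0, float("nan"), 3.0],
    [2.0, 0.5, 4.0],
    [3.0, 0.7, None],
]
kept_names, kept_rows = drop_undefined_descriptors(names, rows)
print(kept_names)
```

The filtered table can then be passed to whatever random forest implementation is at hand (scikit-learn's RandomForestRegressor would be one option).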

NN comments
tony27587
-

Thanks, Rajarshi. I would have thought a better comparison would be predicted versus experimental values rather than predicted versus predicted? Comparing the CDK against ACD sends the signal that ACD is the standard, which I am sure is appreciated! I guess the predicted-versus-experimental comparison is not feasible because the CDK team does not have access to an appropriate experimental logP data collection?

rajarshi guha
-

Hi Tony, you’re right, the comparison should be wrt exptl logP values. But, this was a quickie at 12:30am and the ACD data is what I had lying around :)

It’d be nice to have access to an exptl logP dataset. There are a number of papers (http://onlinelibrary.wiley.com/doi/10.1002/jps.2600810317/pdf) but of course they’re not machine processable. If you have some exptl data, I’d be happy to run it through.

tony27587
-
@rajarshi guha I know that Egon was trying to collect some logP data a few months ago. I recall sending him some links. I do want to gather some exptl logP data for another project and had started looking for papers. I’ll contact a friend to see if he has any to share…

tony27587
-
@chem-bla-ics Point me to the Semantic Wiki… I will help gather some data for you… maybe some this week while I am traveling.

chem-bla-ics
-
I decided not to collect logP data, but pKa data instead. I started a semantic wiki with Samuel, and have some 80-90 experimental values from primary literature right now (CC0). What I would love to do is reach about 200 values and build a new pKa prediction model, but I am not sure when I’ll have time for this. At this moment, I have too much to wrap up to make a general collaboration call.
