What tools are available for clustering small molecules? (I have a set of, say, 1000 small molecules and I want to select a non-redundant, diverse subset.)
Noel O'Boyle
[ Admin ]
To my mind, diverse set selection is best done directly using the distance matrix. The Kennard-Stone algorithm, for example, is easy to understand and easy to implement: you start with the most distant pair of molecules, then keep selecting the molecule that is most distant from those already selected. Stop when you have enough diverse molecules. I don’t know of any available tool for this, though.
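For anyone who wants to try it, here is a minimal sketch of the selection loop described above, in plain Python. It is an illustration under assumptions, not a reference implementation. The function name kennard_stone and its arguments are made up for this post. "Most distant to the selected ones" is interpreted as the usual max-min rule (pick the candidate whose nearest selected neighbour is furthest away). The input is assumed to be a full symmetric distance matrix. The toy data uses points on a line; for molecules you would fill the matrix with something like 1 minus the Tanimoto similarity of fingerprints.

```python
def kennard_stone(dist, n_select):
    """Pick n_select indices from a symmetric distance matrix (list of lists)."""
    n = len(dist)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of items")

    # Seed the selection with the most distant pair.
    i, j = max(
        ((a, b) for a in range(n) for b in range(a + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [i, j]

    # min_dist[k] is the distance from item k to its nearest selected item.
    min_dist = [min(dist[k][i], dist[k][j]) for k in range(n)]

    while len(selected) < n_select:
        remaining = [k for k in range(n) if k not in selected]
        # Assumed max-min rule: take the item furthest from its nearest pick.
        nxt = max(remaining, key=lambda k: min_dist[k])
        selected.append(nxt)
        for k in range(n):
            min_dist[k] = min(min_dist[k], dist[k][nxt])

    return selected


# Toy example: six points on a line stand in for molecules.
points = [0, 1, 2, 10, 11, 20]
dist = [[abs(p - q) for q in points] for p in points]
picked = kennard_stone(dist, 3)
print([points[k] for k in picked])  # prints [0, 20, 10]
```

The seed search looks at every pair, so it is quadratic in the number of molecules, which is fine for around 1000 of them. Each later pick only updates the running nearest-selected distances.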
By the way, if you are selecting a diverse set for use as a training set, you may want to reconsider – the results on the test set will be overly optimistic if you use anything but a random set.
Do you have a literature reference for the K-S algorithm?
-bla-ics: No, but you just start with the most distant pair of molecules, and keep selecting additional molecules that are most distant from the selected ones. Stop when you have enough diverse molecules.
The original ref is R.W. Kennard, L.A. Stone, Computer aided design of experiments, Technometrics 11 (1969) 137-148.
You can download the K-S algorithm as a MATLAB script (M-file) from http://chemometria.us.edu.pl/index.php?goto=downloads