What is the best aromaticity perception model, and why do different toolkits have different aromaticity models?
I’m working on extending the (de)aromatization algorithm in Indigo to support exocyclic double bonds, i.e. double bonds that point out of a ring, as in NC1=NC(=O)NC=C1.
As part of this, I decided to compare the aromatization methods of several libraries and applications: OpenBabel, RDKit, CDK, ChemAxon Marvin (with two methods: basic and general), and CACTVS (with three methods: cactvs, tripos, and daylight).
For the evaluation set I selected all the unique, potentially aromatic structure fragments from the PubChem database, which gave more than 3 million fragments. After processing 10% of them I created a correlation matrix that shows, for each pair of methods, the fraction of cases where the two methods give the same result. It is attached to the question. I have also prepared all the data: the test set with 370807 structures in SMILES format, an SDF file with the molecule fragments and the number of aromatic bonds found by each algorithm (excluding CACTVS "tripos" because of its large number of differences), and files with the structures that Indigo and RDKit could not load due to valence errors. This archive is available here
The correlation matrix:
From the correlation matrix I see two groups of methods that behave very similarly: (Indigo, CDK, Marvin Basic, CACTVS cactvs) and (OpenBabel, RDKit, Marvin General, CACTVS daylight). In the attached image, below the correlation matrix, there is a table with three columns for each method: (1) the number of structures that have at least one aromatic bond, (2) the number of cases where no other method reported the same number of aromatic bonds, and (3) the number of errors.
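To make the comparison concrete, here is a minimal sketch of how such a matrix and the "no other method agrees" column could be computed. It assumes that "same result" means the same number of aromatic bonds for a structure, and that a structure a toolkit failed to load is recorded as `None`. The function names and the toy data are hypothetical and are not taken from the attached archive.

```python
from itertools import combinations


def agreement_matrix(counts):
    """Pairwise fraction of structures on which two methods report the same count.

    counts maps a method name to a list of aromatic-bond counts, one entry per
    structure, with None where the method failed to load the structure.
    """
    methods = sorted(counts)
    n = len(counts[methods[0]])
    # The diagonal is set to 1.0 by definition.
    matrix = {a: {b: 1.0 for b in methods} for a in methods}
    for a, b in combinations(methods, 2):
        same = sum(
            1
            for x, y in zip(counts[a], counts[b])
            if x is not None and x == y
        )
        matrix[a][b] = matrix[b][a] = same / n
    return matrix


def unique_results(counts):
    """Per method, count structures where no other method reports the same count."""
    methods = sorted(counts)
    n = len(counts[methods[0]])
    unique = {m: 0 for m in methods}
    for i in range(n):
        for m in methods:
            value = counts[m][i]
            if value is None:
                continue  # load failures are errors, not outliers
            if all(counts[o][i] != value for o in methods if o != m):
                unique[m] += 1
    return unique


# Hypothetical toy data: three methods, four structures.
toy = {
    "A": [6, 5, 0, 6],
    "B": [6, 5, 0, 0],
    "C": [6, 0, 0, 6],
}

if __name__ == "__main__":
    print(agreement_matrix(toy))
    print(unique_results(toy))
```

Comparing bond counts rather than the exact sets of aromatic bonds is a simplification: two methods could mark different bonds and still report the same count, so this measure slightly overestimates agreement.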
So my questions are:
- How should the ideal aromatization method be implemented?
- Do people really care about the differences in the aromaticity perception models?
- Can you suggest a representative set of aromatic and non-aromatic structures?
UPDATE: The test set was increased to 370807 structures. OpenBabel reports that 99.3% of them are aromatic. All Indigo-specific issues with structure loading have been fixed. Indigo now fails to load 2 structures and RDKit fails to load 117 structures, while all other libraries/applications work without issues.
UPDATE 2: Three aromatization methods were used for CACTVS: cactvs, tripos, and daylight. The "cactvs" mode appears to belong to the first group (CDK, Marvin Basic, Indigo), and the "daylight" mode to the second group (Marvin General, RDKit, OpenBabel). The "tripos" mode produces a lot of outliers; for example, it does not recognize CC1=CNC=C1 as an aromatic molecule.