Aromaticity perception differences

Asked by [ Admin ] , Edited by Egon Willighagen [ Admin ]
What is the best aromaticity perception model, and why do different toolkits have different aromaticity models?

I’m working on extending the (de)aromatization algorithm in Indigo to support double bonds that point out of the ring (exocyclic double bonds), as in NC1=NC(=O)NC=C1.

I decided to compare the aromatization methods in different libraries and applications: OpenBabel, RDKit, CDK, ChemAxon Marvin (with two methods: basic and general), and CACTVS (with three methods: cactvs, tripos, and daylight).

For the evaluation set I selected all the unique potentially aromatic structure fragments from the PubChem database and got more than 3 million fragments. After processing 10% of them I created a correlation matrix that shows, for each pair of methods, the fraction of cases where the two methods give the same result. It is attached to the question. I have also prepared all the data: the test set with 370807 structures in SMILES format, an SDF file with the molecule fragments and the number of aromatic bonds found by each algorithm (excluding CACTVS "tripos" because of its large number of differences), and files with the structures that Indigo and RDKit could not load due to valence errors. This archive is available here.
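A matrix of this kind can be sketched in a few lines of Python. This is only an illustration, not the script used for the attached results: the function name `agreement_matrix`, the toy numbers, and the choice to count a failed load (`None`) as a disagreement are all assumptions.

```python
from itertools import combinations


def agreement_matrix(counts):
    """Pairwise fraction of structures on which two methods agree.

    counts maps a method name to a list of aromatic-bond counts,
    one entry per structure, in the same structure order for every method.
    None marks a structure the method could not load (assumption:
    a failed load never counts as agreement).
    """
    methods = sorted(counts)
    n = len(next(iter(counts.values())))
    matrix = {m: {m: 1.0} for m in methods}
    for a, b in combinations(methods, 2):
        same = sum(
            1
            for x, y in zip(counts[a], counts[b])
            if x is not None and x == y
        )
        matrix[a][b] = matrix[b][a] = same / n
    return matrix


# Made-up toy data: aromatic-bond counts for four structures.
toy_counts = {
    "method_a": [6, 0, 5, 6],
    "method_b": [6, 0, 5, 0],
    "method_c": [6, 6, 5, 6],
}
toy_matrix = agreement_matrix(toy_counts)
```

Comparing the number of aromatic bonds is a coarse test: two methods could mark different bonds and still report the same count, so a stricter comparison would compare the sets of aromatic bonds.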

The correlation matrix:


From the correlation matrix I see two groups of methods that behave very similarly: (Indigo, CDK, Marvin Basic, CACTVS cactvs) and (OpenBabel, RDKit, Marvin General, CACTVS daylight). In the attached image, below the correlation matrix, there is a table with three columns for each method: (1) the number of structures that have at least one aromatic bond, (2) the number of cases where no other method produced the same number of aromatic bonds, and (3) the number of errors.
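The three per-method columns can be computed along these lines. This is a sketch under assumptions: the function name `summarize`, the dictionary keys, and the toy data are hypothetical, and an "error" is taken to mean a structure the method failed to load, represented as `None`.

```python
from collections import Counter


def summarize(counts):
    """Per-method summary: aromatic structures, unique results, errors.

    counts maps a method name to a list of aromatic-bond counts,
    one entry per structure; None marks a load failure (assumption).
    """
    methods = sorted(counts)
    n = len(next(iter(counts.values())))
    summary = {m: {"aromatic": 0, "unique": 0, "errors": 0} for m in methods}
    for i in range(n):
        row = {m: counts[m][i] for m in methods}
        # How many methods reported each bond count for this structure.
        freq = Counter(v for v in row.values() if v is not None)
        for m, v in row.items():
            if v is None:
                summary[m]["errors"] += 1
                continue
            if v > 0:
                summary[m]["aromatic"] += 1
            if freq[v] == 1:
                # No other method produced this number of aromatic bonds.
                summary[m]["unique"] += 1
    return summary


# Made-up toy data for three structures and four hypothetical methods.
toy = {
    "m1": [6, 0, 5],
    "m2": [6, 0, 5],
    "m3": [6, 6, None],
    "m4": [0, 0, 0],
}
toy_summary = summarize(toy)
```

A method with a high "unique" count relative to the others is an outlier in the sense used above.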

So my questions are:
  • How should the ideal aromatization method be implemented?
  • Do people really care about the differences in the aromaticity perception models?
  • Can you suggest a representative set of aromatic and non-aromatic structures?
UPDATE: The test set was increased to 370807 structures. OpenBabel reports that 99.3% of them are aromatic. All Indigo-specific issues with structure loading have been fixed. Indigo now fails to load 2 structures and RDKit fails to load 117 structures, while all other libraries/applications work without issues.

UPDATE 2: Three aromatization methods were used for CACTVS: cactvs, tripos, and daylight. The "cactvs" mode falls into the first group (CDK, Marvin Basic, Indigo) and the "daylight" mode into the second group (Marvin General, RDKit, OpenBabel). The "tripos" mode produces a lot of outliers; for example, it does not recognize CC1=CNC=C1 as an aromatic molecule.

Comments
matteo floris
-

Can you add the CACTVS results to the comparison?

mikhail-rybalkin
-
Yes, sure. I have updated my question and included CACTVS results too.

I found that I can aromatize a file with SMILES using the following code:
# Read each structure from the SMILES file and print it as aromatic SMILES
molfile loop "molecules.smi" eh {
    puts [ens get $eh E_SMILES {} {usearo 1}]
}

This is my first code in CACTVS. I’m not sure whether it can be done in a better way, or whether I’m missing some options.
wdiwdi
-

Cactvs actually supports a switchable aromaticity model: its own, Daylight, and Tripos (the latter two were implemented by comparing against test data sets; unfortunately, these are customer-provided sets and not public). Use

set ::cactvs(aromaticity_model) daylight

(or cactvs, the default, or tripos) to switch. Note that this affects only new computations; existing aromaticity data on structures remains valid.

mikhail-rybalkin
-

I added all three methods. It is strange that the tripos method differs so much from all the other results. You can find all the results in the table attached to my question. I excluded the tripos method from the SDF file with all the differences because it differs too much from the others.

chem-bla-ics
-

How did you create your ‘gold’ list? That is, is that list based on experimental data?

wdiwdi
-

The weird Tripos results are actually not an error. This model essentially recognizes only aromatic rings in which every atom contributes one electron to the 4n+2 pi electron pattern, so pyridine is considered aromatic, but pyrrole is not. I am not certain how useful this is as a general approach.
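The difference can be illustrated with a toy electron count. This is a deliberately simplified sketch, not the Cactvs or Tripos implementation: the per-atom pi electron contributions are assigned by hand, and both function names are hypothetical.

```python
def satisfies_huckel(pi_electrons):
    """True if the ring's total pi electron count equals 4n+2 for some n >= 0."""
    total = sum(pi_electrons)
    return total >= 2 and (total - 2) % 4 == 0


def one_electron_per_atom_rule(pi_electrons):
    """Stricter rule described above: every ring atom must contribute
    exactly one pi electron, and the total must still be 4n+2."""
    return all(e == 1 for e in pi_electrons) and satisfies_huckel(pi_electrons)


# Hand-assigned contributions per ring atom (assumption, for illustration only).
pyridine = [1, 1, 1, 1, 1, 1]     # N and five C, one electron each
pyrrole = [2, 1, 1, 1, 1]         # N-H lone pair gives two, four C give one each
cyclobutadiene = [1, 1, 1, 1]     # four electrons, not 4n+2

results = {
    name: (satisfies_huckel(ring), one_electron_per_atom_rule(ring))
    for name, ring in [
        ("pyridine", pyridine),
        ("pyrrole", pyrrole),
        ("cyclobutadiene", cyclobutadiene),
    ]
}
```

Under these assumptions pyridine passes both checks, pyrrole passes only the plain 4n+2 count, and cyclobutadiene passes neither, which matches the behaviour described for the tripos mode.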


3 answers

2

rich apodaca [ Editor ] from México

The utility and importance of aromaticity perception depend greatly on the use cases you have in mind.

These are some situations when having a good handle on 'aromaticity' matters:
  1. Encoding SMILES using lower case atom symbol notation.
  2. Decoding SMILES using lower case atom symbol notation.
  3. Interpreting structure-based queries submitted by users of a database or mining tool.
  4. Performing calculations using fragment-based methods (e.g., Topological Polar Surface Area).
To the extent that (1) is a major consideration, you might want to reconsider.

The SMILES 'aromatic' notation was introduced mainly to facilitate structure canonicalization. But it's pretty clear that InChI is well on its way to eclipsing SMILES' role here.

Aromaticity in SMILES is one of those concepts in cheminformatics that's just plain broken. See this article and this rebuttal.

Taking (1) off the table leaves (2), which unfortunately is still necessary. In most cases, however, what you're trying to do is convert the SMILES representation into another more or less equivalent representation. The example SMILES you give could be considered a tautomer perception problem, which complicates matters still further. There will no doubt be a lot of exceptional cases to check for. Regardless, I don't see any fundamental issues in arriving at either a pyridone or hydroxypyridine representation from your SMILES, each with 'localized' double bonds.

(3) is quite a different question because you're really asking "what do chemists expect?" This will to a great extent depend on the tools they've used before and their knowledge of organic chemistry. For example, I'd suspect that most chemists would be surprised to get a cyclobutadiene if they specify 'aromatic' carbon on a four-membered ring. On the other hand, a chemist wouldn't be surprised to find a pyridone in the results set for a six-membered ring query with 'aromatic' nitrogen.

(4) is where Daylight's incomplete aromaticity spec spills over to other areas. A toolkit may use a single aromaticity model for both SMILES/SMARTS notations and descriptor calculations. To the extent that this model differs, even slightly, from that used by the descriptor's author, unexpected answers will result and chemists may draw invalid conclusions.

You might consider:
  1. Supply a configurable aromaticity model with reasonable default values, a collection of commonly-used models (assuming you can find complete documentation), or some combination of the two.
  2. Excessively document how any aromaticity model(s) work.
  3. Constrain the supported use cases in which aromaticity perception plays an important role (e.g., remove 'aromatic' as a flag that can be placed on a query structure).
Edit: I cleaned up this answer based on feedback in the comments section.
PS: This Shapado editor really messes up lists. Oh well.
Comments
baoilleach
-

This question isn’t about SMILES…

mikhail-rybalkin
-

I see that aromaticity is important at least in three cases:

1. Structure canonicalization.
2. Substructure search.
3. Structure dearomatization.

Rich Apodaca, I agree with you that for canonicalization we can use InChI. But if all we need is a unique identifier, then a standard aromaticity model is not important, because the identifier is generated with one specific library. For example, I can use InChI for all the structures in the database, or Indigo canonical SMILES, or some other canonical SMILES. What matters is using the same tool for all the structures.

But for the other two points, 2 and 3, the model matters more. We had decided to improve the aromaticity model once we received enough requests to support aromaticity perception for cycles with external double bonds. But the latest request was about dearomatization: the system receives an aromatic structure that the library should dearomatize, and currently we do not support such structures.

As for point 2, substructure search: let's assume that you are doing a substructure search and have specified an aromatic bond in the query. What kind of structures will be found, and what do you expect? Is it really important to the user that different search systems give different results?

2

wdiwdi [ Editor ] from Frankfurt am Main, Deutschland

Here are my comments:

How should the ideal aromatization method be implemented?

It should be avoided where possible. There is a good reason why the default SMILES encoder mode in Cactvs does not use aromatic atoms, but rather explicit single/double bonds. This form is much more reliable to decode: I have seen braindamage-inducing complications when trying to decode aromatic SMILES from tools with notably different perceptions of aromaticity.

Do people really care about the differences in the aromaticity perception models?

The most critical points are probably structure queries and related technologies (such as SMIRKS transforms). There it does matter. So the answer is yes.

Can you suggest a representative set of aromatic and non-aromatic structures?

I do not think there is a public standard set. But if you would just assemble a random but sanity-filtered collection, and publish it under an unrestrictive license, that would be very helpful.

1

baoilleach [ Admin ] from City of Westminster, United Kingdom

  • How should the ideal aromatization method be implemented?

Well, there are different ideal models for different file formats. The key point is that any method should be completely described in the literature (or a webpage) and it should be implementable by others. For our part, OB would implement a described aromatization scheme for SMILES if Indigo came up with one.

  • Do people really care about the differences in the aromaticity perception models?

I do. Using a standard would enhance interoperability and improve code, and hence data, quality.

  • Can you suggest a representative set of aromatic and non-aromatic structures?

Aha. Test cases. We could start with those agreed upon by OB, RDKit and Indigo. Or take all those with lowercase SMILES in ChEMBL.
