What's the best way to match up InChIs from ChemSpider API and NCI CIR

Or basically, "What type of InChI should I store and use to enforce uniqueness in my database?"

I have a collection of records (UV-Vis absorption spectra) with chemical names and I’m trying to use a mixture of the NCI/CADD Chemical Identifier Resolver and the ChemSpider web services API to resolve the structures (as InChIs).


The problem I’m running into is matching up the InChI results from each API. The Chemical Identifier Resolver can return two different InChIs: ‘stdinchi’ returns a standard InChI, and ‘inchi’ seems to return an InChI that also contains the fixed hydrogen layer (if relevant).

The ChemSpider API returns the InChI type that they use for enforcing uniqueness, which I gather is using the ‘RecMet’ option, which includes the reconnection layer (if relevant).

The main question is what conditions I should use to enforce uniqueness in my database, and therefore how I should store the InChIs. I think the ChemSpider choice is a good one as I don’t want to throw away the reconnection information by reducing the ChemSpider results down to standard InChIs. To best match up the ChemSpider InChIs with the CIR results, I would either have to strip out the fixed hydrogen layer from the CIR ‘inchi’, or possibly just remove the ‘S’ from the CIR ‘stdinchi’.
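As a rough illustration of the string-level stripping mentioned above, here is a minimal Python sketch. It assumes that the fixed hydrogen layer starts at a `/f` layer and runs until a reconnected layer (`/r`) or the end of the string, and that the remaining layers are still meaningful once it is dropped. The function names are made up for this example, and regenerating the InChI from the structure with the InChI software itself is likely the safer route.

```python
def split_inchi_layers(inchi):
    """Split an InChI into its version token and the list of '/'-separated layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string: %r" % inchi)
    head, *layers = inchi[len("InChI="):].split("/")
    return head, layers


def strip_fixed_h(inchi):
    """Drop the fixed hydrogen part (the '/f' layer and the layers that follow it).

    Assumption: the fixed-H part ends where a reconnected layer ('/r')
    begins, or at the end of the string.
    """
    head, layers = split_inchi_layers(inchi)
    kept = layers[:1]  # the formula layer is always kept
    in_fixed_h = False
    for layer in layers[1:]:
        if layer.startswith("f"):
            in_fixed_h = True
        elif layer.startswith("r"):
            in_fixed_h = False
        if not in_fixed_h:
            kept.append(layer)
    return "InChI=" + "/".join([head] + kept)


def strip_reconnected(inchi):
    """Drop the reconnected part (everything from the '/r' layer onwards)."""
    head, layers = split_inchi_layers(inchi)
    kept = layers[:1]
    for layer in layers[1:]:
        if layer.startswith("r"):
            break
        kept.append(layer)
    return "InChI=" + "/".join([head] + kept)
```

Note that this only manipulates strings; whether the result is identical to what the InChI software would produce with the corresponding option switched off is an assumption that should be checked against real data.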

Another possibility is storing the full InChIs as returned, with all possible information, and using that as a unique constraint. Then calculate the standard InChI from each InChI (haven’t tried this yet but I assume it’s possible) and store it alongside the full InChI.
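To make this storage idea concrete, here is a small sketch using Python's built-in `sqlite3` module. The table and column names (`structure`, `full_inchi`, `std_inchi`) and the helper functions are hypothetical, and the standard InChI is assumed to have been computed elsewhere before insertion.

```python
import sqlite3


def create_schema(conn):
    """Create a table where the full InChI is unique and the standard InChI is indexed."""
    conn.execute(
        """
        CREATE TABLE structure (
            id         INTEGER PRIMARY KEY,
            full_inchi TEXT NOT NULL UNIQUE,  -- InChI exactly as returned by the source
            std_inchi  TEXT NOT NULL          -- standard InChI, computed separately
        )
        """
    )
    conn.execute("CREATE INDEX idx_structure_std ON structure (std_inchi)")


def add_structure(conn, full_inchi, std_inchi):
    """Insert the structure if its full InChI is new; return its row id either way."""
    conn.execute(
        "INSERT OR IGNORE INTO structure (full_inchi, std_inchi) VALUES (?, ?)",
        (full_inchi, std_inchi),
    )
    row = conn.execute(
        "SELECT id FROM structure WHERE full_inchi = ?", (full_inchi,)
    ).fetchone()
    return row[0]


def related_ids(conn, std_inchi):
    """Return the ids of all stored structures that share a standard InChI."""
    rows = conn.execute(
        "SELECT id FROM structure WHERE std_inchi = ? ORDER BY id", (std_inchi,)
    )
    return [r[0] for r in rows]
```

With this layout the unique constraint rejects exact duplicates of the full InChI, while `related_ids` finds records that collapse to the same standard InChI.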

Some of my ‘full’ InChIs will essentially just be a standard InChI without the ‘S’ flag. I wonder if I should consider these InChIs as ‘parents’ of other InChIs that have the same standard InChI but also contain reconnection or fixed hydrogen layers? If a spectrum is assigned to one of these ‘parent’ InChIs, should I also consider assigning it to the children? Or if a spectrum is assigned to a child, should it also be assigned to the parent or its siblings?

Sorry there are quite a lot of vague questions here, any general advice on how to handle this would be great.

1 answer


wdiwdi [ Editor ] from Frankfurt am Main, Deutschland

The non-std InChIs from the NCI resolver should have been computed with the flags
{DONOTADDH W0 FIXEDH RECMET NEWPS SPXYZ SAsXYZ Fb Fnud} set, the same as PubChem (provided Markus has not reconfigured anything).

If you want to store a structure, and not just an identifier, in your database, almost any notation is better suited than InChIs – they are not 100% reversible, and if you do not use the Std version, different flag settings may have been used to generate them. I suggest you use something both established and fully reversible, like SMILES or SLN, or even an SDF record instead.

As for filtering by uniqueness, I suggest you use Cactvs hashcodes (as PubChem and the NCI resolver do internally), not InChIs. InChIs have known problems, such as not being able to merge ionic vs. pentavalent charge forms outside a few standard types, and they generally produce different strings if you place mobile charges at different locations in a pi system.

Designing your database so that there is some notion of hierarchy and crosslinking (i.e. standardized forms, main components, tautomers, etc.) is a good idea and will help users to find spectra from structures where there is some ambiguity about what was actually measured.



Comments
mcs07
-

Thanks for the advice. So a possible strategy could be:

Use the NCI Resolver to resolve name to SD file. Get Cactvs HASHISY/FICTS/FICuS/uuuuu (can I generate these myself?). Generate (or also retrieve) SMILES and (Std)InChI.

Where the NCI Resolver fails to resolve a name, use ChemSpider API instead. There are two options at this stage: 

a) Get the SD file from ChemSpider 

b) Get some other representation (SMILES, InChI, IUPAC name) from ChemSpider and put these into NCI Resolver until it manages to resolve.

Option a) means no Cactvs hashes unless I can generate them myself from the SD file. However option b) has potential for a “loss in quality” when converting through different representations, and also means I can’t have any structures that aren’t present in the NCI database.

At this stage I can then assign spectra to individual structure records. Then when returning the spectra for a specific structure, I can also optionally return spectra for all structures that have the same FICTS/FICuS/uuuuu parent as that structure.
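The "try the NCI Resolver first, then fall back to ChemSpider" part of this strategy could be sketched roughly as below. The resolver callables are placeholders that would have to wrap the real web service calls; their names and the return convention (a structure string, or `None` on a miss) are assumptions made for this example.

```python
from typing import Callable, Optional, Sequence, Tuple

# A resolver takes a chemical name and returns some structure
# representation (e.g. an SD file as text), or None if it cannot resolve it.
Resolver = Callable[[str], Optional[str]]


def resolve_with_fallback(
    name: str, resolvers: Sequence[Tuple[str, Resolver]]
) -> Optional[Tuple[str, str]]:
    """Try each (source, resolver) pair in order; return the first hit.

    The result is (source, structure), so the record can remember where
    the structure came from. Returns None if every resolver fails.
    """
    for source, resolver in resolvers:
        try:
            result = resolver(name)
        except Exception:
            # Sketch only: any lookup error is treated as a miss.
            result = None
        if result:
            return source, result
    return None
```

Keeping the source label alongside the structure makes it possible to tell later which records have no Cactvs hashes because they only came from ChemSpider.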

markus.sitzmann
-

Hi Matt,

  1. I use the options WDI mentions
  2. If you want to look up names via ChemSpider, you can do this from CIR like this:
    http://cactus.nci.nih.gov/chemical/structure/{name}/{representation}?resolver=chemspider_name
    If the name is found in ChemSpider, this loads the structure from one of ChemSpider’s APIs and then calculates the CIR representation you request. However, this is not done by default; you have to switch it on with the URL parameter "resolver".
  3. CIR also allows you to calculate CACTVS hashcodes: http://cactus.nci.nih.gov/chemical/structure/{identifier}/hashisy
    or our FICTS/FICuS/uuuuu identifiers, which are based on CACTVS hashcodes:
    http://cactus.nci.nih.gov/chemical/structure/{identifier}/ficts
    http://cactus.nci.nih.gov/chemical/structure/{identifier}/ficus
    http://cactus.nci.nih.gov/chemical/structure/{identifier}/uuuuu
    They are used as the primary keys in CIR’s database. I might publish them one day in some form, but currently I have various reasons not to do so :–)
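The URL patterns in the list above can be assembled with a small helper like the following. This is only a sketch: the function name `cir_url` is made up, and percent-encoding the identifier is an assumption made here so that names with spaces or SMILES characters such as `#` survive in the path. It only builds the URL and does not perform the request.

```python
from urllib.parse import quote, urlencode

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def cir_url(identifier, representation, resolver=None):
    """Build a CIR URL of the form {base}/{identifier}/{representation}.

    identifier     -- name, SMILES, InChI, etc. (percent-encoded here)
    representation -- e.g. 'stdinchi', 'hashisy', 'ficts', 'ficus', 'uuuuu'
    resolver       -- optional value for the 'resolver' URL parameter,
                      e.g. 'chemspider_name'
    """
    url = "%s/%s/%s" % (CIR_BASE, quote(identifier, safe=""), representation)
    if resolver is not None:
        url += "?" + urlencode({"resolver": resolver})
    return url
```

For example, `cir_url("aspirin", "ficus")` gives the FICuS URL for that name, and passing `resolver="chemspider_name"` appends the parameter described in point 2.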
mcs07
-

Hi Markus,

Thanks, that’s very useful to know. One quick question –

If I request an SDF from CIR, what exactly is returned? Based on the presence of the hydrogens I assume it’s not an ‘original record’ and there has been some degree of normalisation performed. Does it correspond to the FICTS parent structure?

Edit: Also, I'm curious how exactly the ChemSpider name resolving works through CIR. For example, if you search the ChemSpider API for the name "223-893-0", the record with ChemSpider ID 21169872 is returned. However this doesn't work using chemspider_name through CIR, presumably because you don't have a record with the same InChI in your database:

InChI=1/C6H14N.Li/c1-5(2)7-6(3)4;/h5-6H,1-4H3;/q-1;+1/rC6H14LiN/c1-5(2)8(7)6(3)4/h5-6H,1-4H3

For things to be more flexible on my end, I feel like I would be better off just getting an SDF from CIR, ChemSpider (or any other data source I use in future) and then generating my own InChIs with various options to deal with different types of de-duplication. I like the idea of the CACTVS hashcodes, I just feel a bit too dependent on your web services when using them.

Anyway I'm just learning as I go along, so I'll probably try as many approaches as I can and see what works best.

markus.sitzmann
-

Answer to the question before your edit:

The FICTS structure is indeed returned for things that require a database lookup (e.g. names or Standard InChIKeys, because they no longer contain any original structure information). However, if you send, for example, a SMILES string, the original structure is preserved for the output generation, e.g. you can generate images for arbitrary SMILES; there is no structure normalization or database lookup involved (you can switch on normalization if you want to). If an SD file is requested, it is always “freshly” generated from the CIR-internal CACTVS representation (which has been generated either from the FICTS structure loaded from the database or from the original input identifier). So you are right in your observation that you never see any original SD file records (this might come in future), but the output structure/representation hasn’t necessarily been normalized.

Answer to the rest:

I cannot answer the ChemSpider question because it seems to be down right now (maybe later). How it works is: it sends your original structure identifier to http://www.chemspider.com/inchi-resolver (a pretty unknown API) and loads the SD file from there. These are the original database SD files, and some of them are really dirty (some are missing hydrogens, some are not, others have radicals specified for all missing hydrogens, etc.). Hence, I do some normalization for these structures and calculate the output as explained above. This also does not require that we have the ChemSpider record or its structure in our database.

With regard to the CACTVS hashcodes: download the CACTVS toolkit from http://www.xemistry.com (free for academia). It allows you to calculate the hashcodes. They are actually very sensitive and have a much higher resolution than InChI (if a structure lacks a hydrogen, for instance, it has a different hashcode than the “correct” structure with the hydrogen). So if you want to use them as structure identifiers, you must create your own normalization first (which is what I have done with the FICTS/FICuS/uuuuu identifiers). You can do all this with CACTVS; it also offers different hashcodes which do parts of our FICTS/FICuS/uuuuu out of the box. But for using CACTVS, be prepared for the scripting language Tcl (our FICTS/FICuS/uuuuu are just a long Tcl script which I implemented a few years ago … I am not very happy about it anymore, but they work for me :–) ). If you have more questions about CACTVS, contact Wolf-Dietrich Ihlenfeldt (user wdiwdi here).

mcs07
-

Thanks Markus, that’s all really helpful. Also thanks for providing such a great resource.

markus.sitzmann
-

I also fixed the CAS number problem you described (CAS numbers were recognized as such and were never passed to the chemspider_name resolver; I have changed this now).
