Tool to perform graph mining?

Asked by , Edited by Miquel Duran-Frigola

Hi there,

I have two sets, A and B, of molecules (SMILES format). I’d like to find the largest 2D subgraph that is over-represented (enriched) in A, without using a predefined set of substructural patterns. That is, I want to find substructures that are significantly (e.g. by Fisher's exact test) more frequent in A than in B.

As an example, say that in A, 3 out of the 4 molecules have a pyridine ring, and in B, 2 out of 20 have a pyridine. I'd like to retrieve this pyridine (right-tailed Fisher's exact test p-value = 0.018, in this case) without any predefinition of structural patterns.
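For reference, the p-value quoted in the example can be reproduced with a few lines of standard-library Python. This is only a minimal sketch of a right-tailed Fisher's exact test; the function name `fisher_right_tail` and its argument names are my own, not part of any toolkit.

```python
from math import comb

def fisher_right_tail(a_hits, a_total, b_hits, b_total):
    """Right-tailed Fisher's exact test.

    Probability of seeing at least `a_hits` substructure-containing
    molecules in A if the hits were spread at random over A and B.
    """
    n = a_total + b_total   # all molecules in both sets
    k = a_hits + b_hits     # molecules that contain the substructure
    upper = min(k, a_total) # largest possible number of hits in A
    tail = sum(comb(k, x) * comb(n - k, a_total - x)
               for x in range(a_hits, upper + 1))
    return tail / comb(n, a_total)

# Pyridine example: 3 of 4 molecules in A, 2 of 20 in B.
p = fisher_right_tail(3, 4, 2, 20)
print(round(p, 3))  # 0.018
```

The same function could be applied to every candidate substructure once its counts in A and B are known.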

How would you guys approach this? Is there any tool for such a purpose?


Comments

Can you explain your problem in more detail? At least, why do you need set B? It is not mentioned in the problem statement. And what does it mean to be over-represented or enriched?

2 answers


mikhail-rybalkin [ Editor ] from Portland, United States of America

As you want to find a substructure that is frequent in A and not frequent in B, it must at least be frequent in A. I suggest looking at the MoSS (Molecular Substructure Miner) implementation by Christian Borgelt:

Using MoSS, you can find all subgraphs that reach a predefined frequency level. After that, you can filter these substructures by the number of matches in set B.

On that page you can download the source code or a program executable. You can do the substructure filtering in various ways, including by writing a script with our cheminformatics toolkit, Indigo.
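The two-step idea above (keep what is frequent in A, then count matches in B) can be sketched roughly as follows. Everything here is an assumption for illustration: `enriched_candidates`, `contains` and `min_support_a` are hypothetical names, the candidate list stands in for whatever a miner such as MoSS would output, and the toy "molecules" are just sets of labels. In a real script, `contains` would be a substructure match from a cheminformatics toolkit.

```python
def enriched_candidates(set_a, set_b, candidates, contains, min_support_a=0.5):
    """Keep candidates that are frequent in A, then count their matches in B.

    `contains(mol, sub)` is a caller-supplied predicate (hypothetical here);
    with a real toolkit it would be a substructure search.
    """
    results = []
    for sub in candidates:
        hits_a = sum(1 for mol in set_a if contains(mol, sub))
        if hits_a / len(set_a) < min_support_a:
            continue  # below the support threshold in A, so drop it
        hits_b = sum(1 for mol in set_b if contains(mol, sub))
        results.append((sub, hits_a, hits_b))
    return results

# Toy data: each "molecule" is a set of substructure labels, not a structure.
toy_a = [{"pyridine", "amide"}, {"pyridine"}, {"pyridine", "ester"}, {"amide"}]
toy_b = [{"amide"}, {"ester", "amide"}, {"pyridine"}, {"ester"}]

def toy_contains(mol, sub):
    return sub in mol

table = enriched_candidates(toy_a, toy_b,
                            ["pyridine", "amide", "ester"], toy_contains)
for sub, hits_a, hits_b in table:
    print(sub, hits_a, hits_b)
```

The resulting counts per substructure are exactly what a significance test on A versus B needs as input.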

This can also be done in KNIME (an open-source visual workflow engine), where you can just install the Chemistry extension and use the MoSS node interactively without writing any code. They have an example, 005001_MoSS, that reads a set of structures and finds frequent substructures. I have never worked with MoSS and do not know how to set its parameters, but I know that it may be useful for your case. I can explain how to run the MoSS example in KNIME, if necessary. In KNIME you can also do substructure filtering and a lot of other things.

If you want to use the MoSS source code, I think that CDK (Chemistry Development Kit) has also added that algorithm, and you could use it from there, but I’m not sure.

Do not hesitate to ask about details, depending on how you are going to solve this problem: via coding or via KNIME.

rajarshi guha [ Editor ] from North Bethesda, United States of America

One approach is to exhaustively fragment all molecules (across both sets) and then run the test on each of the fragments. This way there is no need to predefine substructures (though you will likely want to filter some common cases, benzene rings for example).

Comments

Hey Rajarshi,

How would you “exhaustively fragment molecules”? Is it possible, for instance, to get an exhaustive list of SMILES or SMARTS fragments from a molecule's SMILES (i.e. with the largest possible fragment being the molecule itself and the smallest possible an atom)?

The problem of “common cases” should be controlled with the significance test.
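To make the "exhaustive fragmentation" question concrete, here is a minimal sketch that enumerates every connected set of atoms in a molecular graph, from single atoms up to the whole molecule. It assumes the molecule has already been parsed into an adjacency dictionary (atom index to neighbour indices); the function name `connected_fragments` is my own, and turning each atom set back into SMILES or SMARTS would still need a cheminformatics toolkit.

```python
def connected_fragments(adjacency):
    """Return every connected set of atoms as a frozenset of atom indices."""
    found = set()
    frontier = [frozenset([atom]) for atom in adjacency]
    while frontier:
        frag = frontier.pop()
        if frag in found:
            continue  # already reached by another growth path
        found.add(frag)
        # Grow the fragment by one neighbouring atom at a time.
        for atom in frag:
            for nbr in adjacency[atom]:
                if nbr not in frag:
                    frontier.append(frag | {nbr})
    return found

# Pyridine as a six-membered ring: atom 0 is N, atoms 1-5 are C
# (hydrogens implicit, bond orders ignored in this sketch).
pyridine = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
frags = connected_fragments(pyridine)
print(len(frags))  # 31
```

The number of fragments grows very quickly with molecule size, so in practice a size limit is usually needed. Fragments from different molecules would also have to be written in a canonical form before they can be counted across sets A and B.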

