Feedback

type to search

How to extract selected molecules from big sdf file?

Asked by

Hi! I have a very big sdf file, and I need to extract from it only a group of selected molecules (that I have in a list, name sorted). Any ideas to solve this in python?

I try something like this, but it doesn’t work.

file = open(‘sdfbigfile’, ‘r’)
for line in file:
…if line == ‘molname’:
……… while line != ‘\$\$\$\$’:
…………..print file.readline()

Thanks for your help
or Cancel

4 answers

1

qnan

Hi Roberta,

The line as returned by the iterator contains a caret return character at the end ('\n'), which has to be removed before the line is matched against a molecule id.
Also, it's preferable not to mix iterators with file.readline(), so I'd do something like this:

f=open('file.sdf','r')
selected=False
for line in f:
  line = line.strip() # remove caret return and/or any trailing whitespace
  if line == 'molname':
    selected = True
  elif line == '$$$$':
    selected = False
  if selected:
    print line

P.S. I think such questions are better suited for http://stackoverflow.com.
or Cancel
1

wdiwdi [ Editor ] from Frankfurt am Main, Deutschland

Both the approaches by Roberta and Mikhail are dangerous, because ‘\$\$\$\$’ lines may be present in SDF record positions where they do not mean ‘end of record’. A simple line match devoid of syntactic context is not a robust and reliable algorithm.

And why does it need to be raw Python?

I suggest the use of a more chemistry-aware toolkit. There are several popular options.



NN comments
qnan
-
Agreed. That’s certainly not an approach for a serious application, but for a quick mockup I think this could do.

Btw, do you have an example SDF with a ‘\$\$\$\$’ line not marking the end of a record?.. I would be interested to see one.

wdiwdi
-

My standard anecdotal example is a file I encountered many years ago – a starting material database, where somebody helpfully had encoded (in an SD data field, in the line after the ‘>’ field definition) the general cost per mol in the range of one to five \$, just like you see for hotel booking sites… that file really left a bloody trail of killed programs…

qnan
-

I guess it would. Still, I think a v2000 MDL MOLFile cannot plausibly contain such a line, for there’re specifications as to how exactly each line starts and none of them start with four consecutive dollar signs (AFAIK). MOLFile v3000 is a different matter, since it allows line wrapping in certain cases and might, theoretically, produce something of the sort.

dalke
-

There are several places where the spec does not prohibit a ‘\$\$\$\$’ from being in the start of a line. My favorite is inside of an “S SKP” section, which is something that few readers implements. (I’ve also seen that different Accelrys interpret that section differently.) My version of the spec, from 2002, says that ‘\$\$\$\$’ and various other names cannot be used in line 1 (“Molecule Name”) but put no prohibition on line 3 (“Comment”). However, my view is that the spec is incomplete, that ‘\$\$\$\$’ should not be allowed on any line other than the end of record, and any software which writes a non-record-separated line containing only “\$\$\$\$” is just itching for trouble.

qnan
-
blueobelisk did something strange to the formatting of your message above

or Cancel
0

matteo floris [ Editor ] from Cagliari, Italia

I normally use this approach:

1. first, I spend some time for indexing my big sdf file
2. then, I extract the selected mols positions from the index file
3. finally, I extract these positions from the sdf file
If you want to do it with Python, have a look at the seek() method.


or Cancel
0

dalke [ Editor ]

Assuming you have a sufficiently well-formed SD file (‘\$\$\$\$’ only used for the end of record, don’t do strange things with “S SKP” properties, etc) then this should do what you want, and be much faster than the line-by-line reader you had:

import sys
import gzip

BLOCK_SIZE = 1024*1024
def sdf_reader(infile):
    recno = 1
    block = infile.read(BLOCK_SIZE)
    if not block:
        # Empty file
        return
    while 1:
        start = 0
        while 1:
            # assume that '$$$$' only exists as a record delimiter
            end = block.find("\n$$$$\n", start)
            if end == -1:
                # Not found
                break
            # Skip to the end of the delimiter
            end += 6
            yield block[start:end]
            recno += 1
            start = end
        next_block = infile.read(1024*1024)
        if not next_block:
            # No more new input. Was there remaining old input?
            if len(block) != start:
                raise Exception("Cannot find end of record for record %d" % (recno,))
            return
        block = block[start:] + next_block

def main():
    if not sys.argv[1:]:
        raise SystemExit("Missing SD filename")
    filename = sys.argv[1]
    # User can supply a list of identifiers to find
    ids = set(sys.argv[2:])
    if not ids:
        sys.stderr.write("That was easy\n")
        return

    # Support gzip'ed files
    if filename.lower().endswith(".gz"):
        infile = gzip.open(filename)
    else:
        infile = open(filename)

    for rec in sdf_reader(infile):
        id = rec[:rec.index("\n")]
        if id in ids:
            sys.stdout.write(rec)

if __name__ == "__main__":
    main()
or Cancel

Your answer

You need to join Blue Obelisk eXchange to complete this action, click here to do so.