Which large-scale open source database cartridges are available?

Asked by [ Editor ] , Edited by Joerg Kurt Wegner [ Editor ]

The aim is to be able to handle double-digit millions of molecules via similarity and substructure searches.

Comments
joergkurtwegner
-

The question is really about database cartridges being able to deal with double-digit millions of molecules.

baoilleach
-

You should correct the question (rather than leaving a comment)…


4 answers

3

dmitry pavlov [ Editor ]

Bingo for Oracle and SQL Server can certainly handle large databases (it was tested on PubChem, and a demo site is available). pgchem for PostgreSQL is known to be able to handle PubChem, too.

Comments
joergkurtwegner
-

Oh, I thought Bingo ran only on SQL Server; nice to see that Oracle is supported, too! Do you have any performance estimates?

joergkurtwegner
-

Do you have any experience with the RDKit cartridge (PostgreSQL)? And I would love to see a large-scale performance comparison between all those tools ;-)

dmitry pavlov
-
We are now preparing a port of Bingo to PostgreSQL, too. And yes, we want to prepare the performance comparison someday. I wish we had more time to get everything done :) Until then, you can play with the demo site.

tony27587
-
I should comment that we were very impressed with the Bingo group regarding their willingness to support SQL Server, as ChemSpider remains based on SQL Server.

2

chem-bla-ics [ Admin ]

There is also the open source OrChem by Mark Rijnbeek at the EBI. It can handle datasets of at least up to the size of PubChem, and a detailed performance analysis is given in this paper.

Comments
joergkurtwegner
-

I talked with Christoph about it; it would be really good to see a comparison of Bingo and OrChem.

dalke
-

The target data set had 3.5 million compounds. If one similarity search takes 1.4 s on average at a cutoff of 0.95, then on the full PubChem it will take about 15 seconds. That doesn't seem like a speedy search to me.
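The extrapolation above is a simple linear scaling from the benchmarked 3.5 million compounds to the full PubChem; the PubChem size used here (~37.5 million compounds, roughly what makes the stated numbers consistent) is an assumption, not a figure from the paper:

```python
# Back-of-the-envelope linear scaling of similarity-search time.
# pubchem_size is an assumed value chosen to match the ~15 s claim.
target_size = 3.5e6       # compounds in the benchmarked data set
time_per_search = 1.4     # average seconds per search at 0.95 cutoff
pubchem_size = 37.5e6     # assumed size of full PubChem

estimated = time_per_search * (pubchem_size / target_size)
print(f"Estimated full-PubChem search time: {estimated:.0f} s")  # ~15 s
```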

0

rajarshi guha [ Editor ]

I'd argue that beyond a certain performance point, the database cartridge itself doesn't matter a whole lot. Instead, for large-scale situations such as you note, database design, indexing methods, and pre-processing decisions will affect the scalability and usability of such a system.


Sure, a cartridge that can do fast SMARTS matches is preferable to one that does it slower; but doing the actual SMARTS match in such a large database should not be the bottleneck, since, hopefully, you're not doing it as a table scan. Instead, the bottleneck is in deciding which rows to perform the actual isomorphism test against.
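The screen-then-verify pattern described above can be sketched in a few lines. This is a toy illustration, not any particular cartridge's implementation: the integer bitsets stand in for real substructure fingerprints, and the `verify` callback stands in for the expensive subgraph-isomorphism test:

```python
# Screen-then-verify: a cheap fingerprint filter decides which rows
# even get the expensive isomorphism test. Fingerprints here are
# plain integer bitsets (hypothetical values, for illustration only).

def screen(query_fp: int, mol_fp: int) -> bool:
    # A molecule can only contain the query substructure if its
    # fingerprint has every bit the query fingerprint sets.
    return (query_fp & mol_fp) == query_fp

def substructure_search(query_fp, database, verify):
    # database: iterable of (mol_id, mol_fp) pairs.
    # verify: the slow exact test, run only on screened candidates.
    candidates = [(mid, fp) for mid, fp in database if screen(query_fp, fp)]
    return [mid for mid, _ in candidates if verify(mid)]

# Toy usage: mol2 is screened out without ever reaching verify().
db = [("mol1", 0b1011), ("mol2", 0b0110), ("mol3", 0b1111)]
hits = substructure_search(0b0011, db, verify=lambda mid: True)
print(hits)  # ['mol1', 'mol3']
```

In a real cartridge the screening phase is what the domain index accelerates, so the fraction of rows reaching the isomorphism test, not the speed of the test itself, dominates the search time.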
Comments
dmitry pavlov
-

Actually, a cartridge's functionality does include the pre-processing (in RDBMS terms, creating a domain index) and analyzing the table as a whole (with a screening phase) when doing a substructure match, etc.
