logo       
Google Custom Search
    AddThis Social Bookmark Button
-->

PostgreSQL & lookup & insert in bioperl-db: msg#00342

Subject: PostgreSQL & lookup & insert in bioperl-db
This is a problem I stated a couple of months ago and that has basically remained unsolved. I try here to summarize it again, and I'll point out 3 alternatives I see for solving the problem. Each of those alternatives has different advantages and disadvantages. I'd like to get a community vote on what the best and/or most preferred solution would be. If you're using bioperl on biosql or plan doing so, please read and vote if you have an opinion (some of the alternatives would affect all RDBMSs, not just Pg). Otherwise ignore ...

Here's the background.

The current code in the bioperl-db adaptors to biosql run the connection with AutoCommit off, which means the client determines the transaction. Load_seqdatabase.pl (the main script for loading databases into biosql) treats one sequence entry as one transaction. If any sql statement required for loading the sequence fails unexpectedly, the whole transaction is rolled back, otherwise it is committed once the entry and all its annotation went in successfully. The emphasis rests on 'unexpectedly': INSERTs may fail due to unique key violations, which is caught and triggers a look-up of the affected entry by unique key. E.g. a dbxref may already exist; if so, it needs to be looked up in order to establish the association with the bioentry.

Here's the problem.

If within a transaction a particular statement fails, in Oracle and MySQL/InnoDB only that particular statement fails, not the previous ones in the same transaction, nor the subsequent ones. If you commit the transaction, all the succeeded statements' results get committed. In PostgreSQL, however, the entire transaction fails and is invalidated, making all previous and all subsequent statements fail until you call rollback (you cannot call commit on a failed transaction).

For bioperl-db and the granularity used by load_seqdatabase.pl, as a simple example this means on Postgres that if insertion of a dbxref fails, at that point the entire transaction is aborted, i.e., the previously inserted bioentry with sequence, features, comments, etc all go away. In practice this doesn't happen with bioperl-db because dbxrefs are cached and looked up first; however what does occur is the multiple associations of the same bioentry to the same dbxref, to give an example, which is prohibited by a unique key constraint. If this happens, the respective sequence entry is lost.

Here's the possible solutions.

1) Turn off transactions for PostgreSQL, i.e., run in auto-commit mode. Given the fact that Pg is a natively transactional database I find this option possible but sad. In addition, there would have to be code that specifically for PostgreSQL deletes the sequence entry if there was a failure during a particular sequence entry. (Otherwise rolling back the transaction takes care of cleaning up possibly incomplete rows.) On the plus side, this is probably the simplest one to implement.

2) Look up every record before insertion. On the plus side of things, this would maintain transactions. On the downside, entities with a finite number of records are already cached and looked up before insert (examples being dbxref, reference, ontology_term). Those that are not looked up are potentially infinite in number, meaning that 95% of the time the record to be inserted will not already exist. A prior lookup will be a waste of time in 95% of cases. I.e., this option will slow down performance.

3) Separate the look-up into PL/PgSQL code in the database. Instead of issuing a direct INSERT you call a stored procedure that looks up the entry and inserts it if not yet present. The advantage of this option is that it keeps the bioperl-db layout largely unaffected and hence treats all RDBMSs relatively equal (i.e., cares least about RDBMS idiosyncrasies). The downside is that this means that the Pg version of Biosql can't be fully auto-generated anymore from the MySQL version (I don't want to auto-generate the PL/PgSQL procedures from a MySQL schema definition). Also, people say PL/PgSQL is not the fastest thing out there.

Let me know your thoughts and votes. There may easily be more options, feel free to share.

        -hilmar
--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------


<Prev in Thread] Current Thread [Next in Thread>