SQLObject Tweaks
SQLObject is a very sweet object-oriented relational-database wrapper for Python, primarily written and documented by the quite-generous Ian Bicking. It's great for writing quick and simple database-enabled programs. It does, however, have some issues when you try to use it for more complex things.
This is a collection of enhancements that I've compiled which could remedy some of these problems.
SQL Query Optimization
Currently, when you do a query and iterate through the results, if the objects are not in the cache, SQLObject executes a SELECT for every single object you're accessing (see this mailing list post). That's slow. There's a lot of overhead for every select statement.
Whenever you do a query, SQLObject first executes:
SELECT id FROM table WHERE [query args]
This gets all the object ids so it can look up the objects in the cache to see if they're there. If the object isn't in the cache, it executes a SELECT on the entire object. This is very slow when there ar ea lot of objects that aren't in the cache.
A faster way is to do just two queries -- one query where you retrieve all the ids, then a query where you retrieve all the rows for uncached objects as a singule result-set. The second query would look like this:
SELECT * FROM table WHERE id IN (1,2,3,5,7,11,13,17,...)
Selecting all of the ids at once should be much more efficient (mileage would vary depending on the database backend being used).
Also, when caching is disabled, it would be much better if SQLObject didn't do the whole id querying step and just select all the rows and all their columns (using LIMIT if the result set was sliced).
Caching
The caching mechanism has a problem where it can't stay in synch with the database when multi-threaded or multi-process programs are using SQLObject (see this blog entry for details). Obviously, this isn't an easy problem to solve, since there needs to be some kind of IPC mechanism.
One possible solution is to turn off the caching, but for high-load situations this is not an option. SQLObject should have the option of using a separate process as an object-cache (for example, MemCached, which is used by quite a few large web things; livejournal in particular).
Standardized Exceptions
Currently, if there are any SQL errors, SQLObject lets them fly into the calling program and expects it to handle them. This ruins the transparency that the DBConnection class is supposed to provide, and also gives confusing errors to the programmer (who shouldn't have to hunt down OperationalError 1294 in the DB's API documentation to find out that they accidentally passed a string to an IntCol()).
Each SQLObject database wrapper should handle all the common exceptions that its database layer throws, and convert them to standardized SQLObject exceptions. If SQLObject had this, errors would be much easier to interpret, and it would be trivial to switch underlying database.
(There would be a fallback exception for all unhandled DB exceptions which would just wrap up the DB's exception. Also, all handled SQLObject exceptions would contain the original DB exception inside them, for advanced debugging purposes.)
Connection Pooling
Related to Standardized Exceptions is a problem with the connection class. When connections timeout or die unexpectedly (and probably others since no exceptions are caught), it kills your program instead of gracefully trying to reconnect to the database. This could be alleviated by having a wrapper around the DBAPI connection object which is able to reconnect itself if something breaks.
Column type-checking and normalization
I've run across a problem where certain columns (IntCol(), for example) behave differently depending on which database-backend I'm using. With the MySQLdb backend, IntCol() returns a string, while psycopg always returns an int. Also, the MySQLdb backend will let you set the IntCol() to a string (as long as it's a string of numbers), while the psycopg backend doesn't!
This behaviour is bad because it doesn't let you transparently switch underlying databases.
A simple solution is to have the Col() classes normalize the data that they input and output. An IntCol() should let a user give it a string as long as that string can be cast as an integer (which it would always do before storing the value in the object). That way, when the database layer tries to access this attribute's value, it could rely on the fact that it would ALWAYS receieve an integer.
Other column issues:
ForeignKey('tablename') columns should warn you if you're trying to assign them to an SQLObject from the wrong table.
EnumCol() types should be more efficient for databases that don't support them.
Placeholder SQLObjects
One thing I'd like to do with SQLObject is create an empty SQLObject, set its attributes one at a time, and then flush it to the database when I'm finished. There isn't currently a way to do this, so I propose creating a new kind of "placeholder" SQLObject. It would be similar to using transactions and commits in SQL, but you don't have to send a query to the database for every attribute you change, and the database doesn't have to deal with transactionas at all, which are both big speed increases.
The current way of making a new object is:
newobject = SomeTable.new(attr1=value, attr2=value, attr3=value)
This can be awkward when you don't have all of the attributes yet, so you've got to make a new datatype (a dict for example) to temporarily hold them. SQLObjects are neat because they themselves could act as good temporary storage bins!
Also, if there's an error casting one of the attributes, you don't know which one it was since the exception it raises is a little cryptic:
TypeError: new() argument 1 must be string or read-only buffer, not int
In the above case, the bug was that I passed an int to a StringCol() as the third argument, which is confusing.
Obviously SQLObject should throw a better exception, but having to set all the arguments at once can get hairy when you have to use other SQLObjects as parameters (because of ForeignKeys, for example).
Imagine how annoying this would be if you didn't have all the attributes ready, or if one attribute somewhere threw an exception:
newObject = new(attr1=blah, attr2=foo, foreignattr=ForeignObject.new(attr1=bah, attr2=dahh, attr3=grr))
Concurrency issues also arise with the current implementation since SQLObject executes an UPDATE query for every attribute that is changed. If your attributes depend upon eachother (eg. ), and another thread decides to interrupt you after you've updated one attribute, you could run into troubles. On top of that, it's slow to update each attribute independantly! The placeholder object would let you do one big fat DB commit after you've finished all your little atomic changes to the placeholder.
Having each attribute validated independantly also lets you do better error-handling, and it could let you accumulate data through stages instead of requiring it all to be available at once. A good scenario where this could be used is in a web-application where multiple forms need to be filled out, and the data has to be validated collected over time. The Placeholder SQLObject could sit in the session and get updated as the user interacts with the forms.
Yet another feature of these placeholder objects is edit-locking. You could retrieve a placeholder object from the database for editing purposes, and lock the current row so that no other users could edit it. Then, once you're finished changing the data in the Placeholder object, you could commit all the changes back at once and release the lock.
Finally, the object would be pickleable so that it could be stored in a webware session.
Table Versioning
It would be great if we could work out a built-in table-versioning mechanism for SQLObject schemas! It would let you manage changes to schemas without breaking everything, and let SQLObject automatically upgrade your tables. Here's an example:
class SomeTable(SQLObject):
__revision__ = 15
oldColumn = StringCol() # from revision 14
newColumn = StringCol() # from revision 15The __revision__ attribute tells you the version of this schema. Every time you make a change, you'd just add 1 to the revision. The database backend would store the revision number of the current schema, and if they are different, SQLObject could upgrade the database's schema automatically!
But, how could it upgrade the schema automatically? Easy! You see, every time you modified the schema and incremented the schema revision, you added a function to the SQLObject called upgradeFrom##to##() which would initialize the new columns, delete old columns, transform your data if you changed a column's type, etc. It could execute whatever ad-hoc hack would be necessary to transform the old version of the database into the new one.
So, if your databse is currently on schema revision 13, and you decide to upgrade the database using an SQLObject at revision 15, it would execute upgradeFrom13to14() followed by upgradeFrom14to15().
SQLObject should also have an automatic database-backup function which would be called before every upgradeFrom##to##() call, so that you could roll-back the changes if the upgrade function had a bug in it (I've got some handy code that does this already).
(Is there a way to do diff's of a database's contents so that you wouldn't have to dump the entire thing every time you backed it up? Would it be efficient to dump it to a textfile, then commit it to an SVN repository and let SVN handle the diffs automatically?)
Transparent Polymorphic Classes
One thing that's great about OOP is subclassing -- having a simple base class with a standard API, and being able to modify its behaviour slightly for different situations.
For example: I need an SQLObject for customers whose payment info could be either a credit card, or a direct deposit account. I want to have a generic "payment object" class which gets returned from the customer account, and which would have a simple interface (.charge_money(amount)), and would throw exceptions if anything went wrong. The objects would obviously have to live in different tables...
One way of doing it would be to have the tables named: superclass.subclass1.subclass2.subclass3, and you would be allowed to assign an object of type subclass to any slot that takes an object of type superclass.
Comments?
Insert comments here...
