Everybody’s Got An ORM

If you look at the python.org Web site, there are quite a few different object relational mapping (ORM) libraries for Python. The only one I’m really familiar with it SQLObject, because that’s what’s included with TurboGears.

Ian Bicking, the principal author of SQLObject, has done a really great job creating a library that supports Python back to 2.2, lots of different databases, and probably more features than I understand. However, SQLObject differs significantly from other ORMs I’ve used, like Hibernate and CoreData.

My goals for an alternative ORM are the following:

  • No Magic — The dynamic nature of Python lends itself to magic solutions, metaclasses and twiddling under the covers.
  • Conservative Network Traffic — In order to scale smoothly, Web applications (and any other application) must be extremely careful to optimise communication with the DB server.
  • Rich Support for Idiomatic Programming — Reducing the number of compromises necessary just because the data will live in a database will make developing applications quicker, easier and ultimately more robust.

    Some Quick Examples

I just want to throw out a few quick examples before diving in. The examples all use the following model:

Model

The Python code for this model looks like this:

import orm

class Genre(orm.Entity):
    name= orm.StringCol( length=200, alternateId=True )
    books= orm.ToMany( dest="Book", inverse="genres" )

class Book(orm.Entity):
    title= orm.StringCol( length=200 )
    genres= orm.ToMany( dest="Genre", inverse="books" )
    author= orm.ToOne( dest="Author", inverse="books" )

class Author(orm.Entity):
    name= orm.StringCol( length=200, alternateId=True )
    books= orm.ToMany( dest="Book", inverse="author" )

So far this doesn’t look significantly different from entities defined when using SQLObject.

No magic

Python has some really cool features like metaclasses and first-class class types. However cool these features may be, overuse tends to leave developers feeling like magic is happening while they aren’t looking.

In the example model above, all the attributes defined on each model class are descriptor objects. Nothing fancy happens under the covers. (Yes, there’s a metaclass, and I’ll explain what it’s for later.)

For example, what’s the length specified for the title column of the Book entity?

    >>> Book.title.length
    200

There are two bits of minor magic happening in the orm library.

The first bit of magic is where the column values get stored: because the attributes are descriptors, I can’t store the value from the database in an attribute with the same name. So they are squirrelled away in a private dictionary of column values. This has a few advantages which I describe in a moment.

The second bit of magic is a metaclass (EntityMeta) that picks up the column descriptors and creates a private dictionary of columns and relations. The metaclass will also create a primary key if you didn’t specify one, although you are permitted and even encouraged to define your own primary keys. The metaclass also adds each entity class to the ClassRegistry.

Conservative network traffic

One of the things I like most about Hibernate (and it certainly isn’t the XML push-ups necessary to get it configured) is the notion that nothing gets sent to the database until I commit the Session. This is huge. If I create a bunch of objects and after considerable work discover everything is all horked up, I call rollback and I’ve only wasted my own time. Nothing ever touched the database server.

Furthermore, if I have a Genre with 10,000 books in it, when I load that Genre into the Session, those 10,000 books aren’t materialised. They are loaded lazily as I request them (or not, it’s up to me).

I’ve tried very hard to mimic this behaviour with the orm library. Nothing is transferred from the database unless you expect to use it. There are two objects that make this possible: ObjectFault and ColumnFault. Basically a fault is merely some data that I haven’t retrieved yet. So an ObjectFault is an entire object that hasn’t been retrieved (kind of the inverse of a WeakReference) while a ColumnFault is a single column, like a BLOB, that will be loaded when needed.

Because the column values are stored in a private dictionary and accessed via the column descriptors, I can convert a Fault into the real thing when first accessed. Furthermore, I can track which columns have changed and only send the modified columns when updating the database.

In the future, I’d like to load dependant objects (not discussed yet) and possibly the destination of to-one relations using joins during initial selects. I’d also like to have component objects, which are defined by columns in the same table as the component object’s container. Consider an Author’s address. You could have a separate table for addresses (usually what I do) or you could put the street, city, state, and zip code columns in the Author table. By defining a component object as defined by those columns, you get the advantage of object-oriented design and tabular simplicity.

Rich support for idiomatic programming

Let’s create a sample book:

    >>> b= Book()
    >>> b.title= "Thud"
    >>> b.title
    'Thud'

And now let’s create an author for this book:

    >>> a=Author()
    >>> a.name="Terry Pratchett"
    >>> a.books.add(b)

The books attribute supports the same interface as a set object. But unlike a regular set, this set knows that it is part of an two-way relation. When we add a book to the collection, the relation is automatically updated. In this case, the author is set for the book.

    >>> b.author.name
    'Terry Pratchett'

This knowledge of the relationship between columns extends to many-to-many relations as well as one-to-many relations.

    >>> g1= Genre()
    >>> g1.name= "Fantasy"
    >>> g2= Genre()
    >>> g2.name= "Humour"

    >>> b.genres.add( g1 )
    >>> g2.books.add( b ) 

    >>> list(b.genres)
    [<__main__.Genre object at 0x2c2cb0>,
     <__main__.Genre object at 0x2c2db0>]

    >>> list(g1.books)
    [<__main__.Book object at 0x2c2cd0>]

Although I haven’t actually written them yet, I fully expect to have both list relations and dict relations. Currently the ToMany relation is mapped as a set, which means the objects are returned in no particular order. However, a list relation would include a third column to use for ordering. In the case of a one-to-many relation (like author-books), this ordering column might be the title of the book or the publishing date (not shown in the model).

Dictionary relations would require an intermediate table (unless the dictionary key is a value in the to-many object?) which would contain three columns: id1, id2, and dict-key.

Plugging it together

Obviously there’s a bit of code under the covers. The diagram below attempts to put some of the pieces in place.

Everything

The parts in baby-blue are long-lived objects which (will) be thread-safe. And the parts in pinky-red are ephemeral objects you use and discard.

Probably another example would help. I’ll use a TurboGears-y example:

import orm

storage= Storage( "sqlite:///users/jeff/Projects/Web/orm/test.db" )

class Root( controllers.RootController ):

    @expose( html="templates.index" )
    def index( self ):
        context= storage.context()
        book= context.load( Book, 1 )
        return dict( book=book )

Loading an object isn’t quite as convenient as the equivalent code in SQLObject, but this allows entity objects to be stored in several different databases. And it’s possible to do something clever like looking for a thread-local Context and loading the book via that.

Fly in the ointment

Of course, all the code barely registers as Alpha quality code. But there’s one nasty bug that I know about: using autoincrement key generators can mean some to-one relations don’t get set correctly. There’s an easy solution to this problem: I just need to rearrange the order in which objects are saved based on the to-one relations.

This bug only rears its head when you have a to-one relation: either to-one, one-to-one or one-to-many. It never happens when you have a to-many or many-to-many relation (which have intermediate tables).

In that case one or both entities have a foreign key for the other (a ToOne column). If you’re using autoincrement keys (the only option at the moment) obviously one of these columns has to allow a null value, because one of the objects isn’t going to have a primary key when the other is inserted.

I’m fairly certain libraries like SQLObject also have this problem, but I’ve not bothered to write a test case for it.

So the trick will be to determine which entity allows a null value in the ToOne column and save that one first. Then save the other entity and update the first. Not perfect, but Relations (from which ToOne is derived) already have a notion about which side is the primary relation. So I can probably hook into that.

Take a look

You’re welcome to take a look at the source code. It’s available from my subversion server (I love TextDrive):

http://newburyportion.com/svn/orm/

I try to keep things reasonably up to date.

Comments

Robert Brewer November 18th, 2005 @ 2:43 am

Change some names and add thread safety to your context object, and you pretty much have Dejavu, which hasn’t been alpha for some time now. ;) Feel free to steal ideas and code from it. Or vice-versa.

Jeff Watkins November 18th, 2005 @ 3:26 pm

Robert, I’ll definitely check out Dejavu. It looks interesting from what little I’ve seen of it by browsing the source in SVN. Is there any more information available?

I’m not committed to writing my own ORM — some have suggested I should be committed for even thinking about writing my own ORM.

Robert Brewer November 18th, 2005 @ 7:31 pm

You can browse the Dejavu docs online at: http://projects.amor.org/docs/dejavu/

I don’t think you should be “committed”—I found writing my own to be a lot of fun. But having worked on the CherryPy project now for 6 months, I find open source development is a lot more fun with more than one person on the team.

Jeff Watkins November 19th, 2005 @ 8:30 am

The DejaVu documentation mentions that it only runs under CPython. I’m new enough to Python (1.5+ months now) that I’ve no real idea what limiting access to CPython implies. I know there’s Jython, but is that a real platform with a large user base?

I’m also not certain I’m excited about using Python as my query language. In the long run, it seems reasonable to be able to pull queries out into resource files so that they can be modified without having to dive into the source code. Of course, from a simple coolness factor, byte-code inspection is right up there.

cp November 21st, 2005 @ 6:23 pm

CPython is normal Python. It’s just referring to the fact that the standard Python implementation is in C. The other 2 are Jython (way behind) and Iron Python (in C# for .NET/Mono, probably not 100% there yet).

As for SQLObject, I’ve had some problems with it. One is that it seemed to leak memory — IIRC, keeping the cache on meant using LIMIT did no good as everything was retained anyway. Also with cache on, if you modify the DB anywhere else but in the program running SQLObject, it won’t pick it up. If you turn cache off, your DB gets totally hammered as every read and write of a variable generates an SQL request. Even if you lazy update, you still get SQL on any variable read (if a==1 generates a SELECT a from TABLE WHERE id=...). tail -f your database log and see the unholy spam that is SQLObject in action. You also have a coherency problem, because instead of loading a record once, at the top of the function, it loads each field on each read, so e.g.

if a > 1:
    ...

if b CPython is normal Python.  It's just referring to the fact that the standard Python implementation is in C.  The other 2 are Jython (way behind) and Iron Python (in C# for .NET/Mono, probably not 100% there yet).

As for SQLObject, I’ve had some problems with it. One is that it seemed to leak memory — IIRC, keeping the cache on meant using LIMIT did no good as everything was retained anyway. Also with cache on, if you modify the DB anywhere else but in the program running SQLObject, it won’t pick it up. If you turn cache off, your DB gets totally hammered as every read and write of a variable generates an SQL request. Even if you lazy update, you still get SQL on any variable read (if a==1 generates a SELECT a from TABLE WHERE id=...). tail -f your database log and see the unholy spam that is SQLObject in action. You also have a coherency problem, because instead of loading a record once, at the top of the function, it loads each field on each read, so e.g.

if a > 1:
    ...

if b < 2:
    ...

The record can be updated before each of those statements by other threads/processes, so atomicity goes right out the window as well. If your logic says that if a is this, then b is always that, you can’t enforce that. Unless you cache. And then you leak memory.

There are some easy alternatives. One is to make your DB fields match your class member names (e.g. self.id, self.first_name, etc.), then grab the row with your normal cursor.execute() as a dict, and just do self.__dict__ = rowdict. To save it, just put together a query with for k in self.__dict__... UPDATE set first_name (i.e., k) = %s, etc., params=(self.__dict__[k]). Updating the whole thing at once is atomic. Essentially it’s a lazy update with sane caching.

Another option if you are just using the DB for storage and don’t search it is to pickle (serialize) your Python objects to a string and save them as TEXT fields. You can select and unpickle (de-serialize) a couple thousand per second at least.

One more thing: An ORM shouldn’t have you specify string lengths. That is very un-Python. Just use TEXT for everything. If you set everything up with a default value in self.__init__, the ORM should be able to infer types based on your default values.

As for Python, it’s great many other places besides web stuff. GUIs are so much better. OpenOffice-Python bridge is much nicer than the messy C++ one. Game programming too, with PyOpenGL, it’s amazing to have a 3D game work on NVidia Linux and ATI WinXP with no recompiling.

P.S. Your comment posting has some problems (weird markup too, like self dot first underscore name makes name italic). It showed me about 10 partial copies of this post, or perhaps the last however many previews, prepended, instead of replaced. I.e. the first half of the message numerous times.

Adam Vandenberg November 21st, 2005 @ 6:30 pm

What are you using to draw your diagrams?

Jeff Watkins November 21st, 2005 @ 6:38 pm

Adam, I’m using OmniGraffle. It’s really a great program. No surprise, it puts Visio to shame.

CP, I can’t believe that SQLObject thrashes the database as badly as you suggest. I’ll check the code, but I believe it fetches the entire object via a single select. Yes, updating a property can thrash the DB if you haven’t enabled Lazy Updates. But I’m pretty certain Ian would have fixed the sort of problem you describe.

Oh, and you’ll note at the bottom of the comment form, I indicate that comments use Markdown formatting. I’m not terribly happy about how that interacts with Python code, but for the moment that’s the way things are. You can prevent the weirdness by wrapping code in back-ticks.

Shalabh November 21st, 2005 @ 10:23 pm

You bring up some good points above. Another ORM is QLime. It’s different from SQLObject in that you do not define the schema in Python but just connect your Python classes to database tables and can search/edit etc. Also, it includes some features you mention above such as loading dependent objects in the same query, lazy attributes etc. Check it out - I’d be happy to see your comments!

cp November 22nd, 2005 @ 1:42 am

This is easily checked — just watch your DB log and see what queries are made as you run anything that uses SQLObject, with cache off. Yeah, you’d think various shortcomings would be fixed, but I’m guessing most people are using SQLObject in the context of web apps and only loading up a record or two at a time. What I was doing was loading up 300,000 records 100 at a time and compiling stats, and the old objects were not getting freed so it kept eating more and more memory. Oh yeah, and you can’t reload() code that defines objects derived from SQLObject. Also the resulting objects are not lightweight due to all the magic glue. I ended up making a blank class and copying over just the data in the column dict.

I just checked my notes and I wrote a big warning that with lazyUpdate on and cacheValues off, setting a property does not take effect even in the local scope until you call sync().

E.g.:

>>> report.num
2
>>> report.num = 5
>>> report.num
2
>>> report.sync()
>>> report.num
5

This is horribly counter-intuitive and can trip you up in all sorts of situations. I seem to recall that it internally stores the new value but does a DB select when checking the “current” value, but it’s possible it stores the new value out of the way and doesn’t do the DB read. I don’t feel like firing it up right now to check, because I determined after quite a bit of usage that SQLObject has too many problems to be worth using. Even if it doesn’t generate spurious SELECTs on every read, it creates too much DB spam regardless — just check your logs. (To add to the list: in multiple threads, fetching the same ID returns the same instance. This means they can step on each other’s updates. It’s yet another unexpected behavior, because when I fetch something from the database, I expect to have a local temporary copy that nobody else can mess with. Now I have to use application-level locking just to get() a record, copy its bits, and unlock it, but not if I ask for it in a way that causes a SELECT. Or if you use transactions — in which case you are supposed to turn cacheValues off, and now your DB traffic goes way up.)

SQLObject was “wow! neato!” when I first saw it, but I ditched it on the second project. My personal guess is that the “Rails for Python” crowd is busy doing clever things, and they haven’t checked how all the pieces work in production for real sites. SQLObject has a lot of caveats and side-effects that either lead to race conditions or poor performance, along with unexpected behavior. And its SQL building syntax is ugly.

=)

Ben Bangert November 23rd, 2005 @ 3:03 pm

If you’re looking for something more like Hibernate, you should check out SQLAlchemy. It’s docs are mostly complete and it relies heavily on Enterprise data mapping paradigms, Unit of Work pattern, etc. Before you start hacking out your own, I’d highly suggest looking it over.

http://www.sqlalchemy.org/

mike bayer November 23rd, 2005 @ 3:11 pm

hey there -

check out http://www.sqlalchemy.org , a new toolkit I am working on. Not released yet but is mostly functional, pending some Postgres typing issues and such. I think it addresses a lot of your points here.

Jeff Watkins November 23rd, 2005 @ 3:15 pm

Ben & Mike, Wow! That’s really cool. I’ll be downloading that in the next few days. Have you considered adding a metaclass (minimal magic) to blend the mapping with the class definition?

mike bayer November 23rd, 2005 @ 3:26 pm

well I avoided using a metaclass, and instead opted to just add property decorators to the mapped class, as well as an override for its init method to log it as “new”. that was my own version of “minimal magic”. that way, a mapped class needs to know nothing at all about how its being mapped, not even a metaclass keyword. I was trying to stay as far away from “active-record” types of requirements as possible.

I would imagine theres some advantages to using metaclasses as opposed to just tacking on “foo=property(lala)” as needed, but at the moment I cant think of any.

if the toolkit interests you, feel free to contribute suggestions, test cases, or patches. I imagine its got a long way to go before it handles everyone’s needs (not to mention MySQL support…)

Jayson Vantuyl November 25th, 2005 @ 2:33 am

Have you taken a look at Axiom (from Divmod and some Twisted people)?

While I’m not sure it makes the grade on “no magic”, it does support transactions (turn off autocommit and use the checkpoint/revert methods) and it also has the extremely powerful “powerup” feature (imagine being able to extend classes at a per-row level).

That said, I doubt you’ll be extremely satisfied with it. It’s tied to SQLite, and doesn’t allow inheritance. These are probably both showstoppers.

It basically makes a directory structure containing databases (stores) as you need them in a potentially hierarchical fashion. It creates one table per class per schema version.

Weirdness aside, it makes a lot of hard things extremely easy. It supports schema upgrades, is almost completely self contained, integrates with Twisted in delightful ways, and its query syntax is devilish (in a good way). I’d recommend taking a look at it (and also Twisted in general).

Essentially, SQLObject focuses on making objects translate well into the relational world (to the detriment of advanced query functionality and transactioning). Axiom focuses on self-containment, simplicity, and being generally useful.

nerd — Not Invented Here December 4th, 2005 @ 3:33 pm

[…] After my attempt to build another ORM for Python, I had lots of helpful and some not so helpful feedback. Several folks pointed out existing ORMs that I hadn’t run across in my exploration of the available packages. One notable ORM is SQLAchemy. With a little work, I think SQLAlchemy can address all of my requirements. Plus it’s got some really cool features and a much more sane query mechanism than packages like SQLObject. […]