In Memory Test Database
22 November 2005
An in-memory database is a database that runs entirely in main memory, without touching a disk. Often they run as embedded databases: created when a process starts, running embedded within that process, and destroyed when the process finishes.
While most people think of databases as large disk-centered creatures, there's a small but busy world of in-memory databases out there. There are applications which need fast access to some sort of managed data that doesn't need to be persisted, either because it doesn't change or because it can be reconstructed (imagine a routing table in a router, or an EventPoster).
Yet even developers of traditional database systems can find an in-memory database useful, particularly for testing. When you're developing an enterprise application, tests that hit the database can be a huge drain on the time it takes to run your test suites. Switching to an in-memory database can speed things up by an order of magnitude, which can dramatically reduce build times. Since most ThoughtWorkers get the shakes if they haven't had a green bar recently, this makes a big difference to us.
There are two routes people seem to take to an in-memory database for testing. The first is to use a SQL in-memory database library. In Java-land the popular one seems to be HSQLDB. Elsewhere SQLite and Firebird come up. The nice thing about these tools is that they allow you to use regular SQL to query them. One issue is that they may not support quite the same SQL dialect or have all the features of the target database. You can do something similar by running a file-based database on a RAM disk, which allows you to keep the test and production deployments closer to each other.
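As a minimal sketch, here's roughly what this route looks like in Java with HSQLDB (the table and data are made up for illustration; the "mem:" prefix in the JDBC URL is what keeps everything in main memory):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class InMemoryDatabaseTest {
    public static void main(String[] args) throws Exception {
        // Load the HSQLDB JDBC driver explicitly (needed before
        // JDBC 4.0 added automatic driver loading).
        Class.forName("org.hsqldb.jdbcDriver");

        // The "mem:" prefix keeps the whole database in main memory;
        // it disappears when the last connection closes.
        Connection conn =
            DriverManager.getConnection("jdbc:hsqldb:mem:testdb", "SA", "");
        Statement stmt = conn.createStatement();

        // Build the schema fresh for each test run -- no cleanup scripts.
        stmt.execute("CREATE TABLE customers (id INT PRIMARY KEY, name VARCHAR(50))");
        stmt.execute("INSERT INTO customers VALUES (1, 'Alice')");

        // Ordinary SQL works against the in-memory store.
        ResultSet rs = stmt.executeQuery("SELECT name FROM customers WHERE id = 1");
        rs.next();
        System.out.println(rs.getString("name")); // prints Alice

        conn.close();
    }
}
```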
Another route is to abstract all the database access behind a Repository. Then you can swap out the database for regular in-memory data structures; often just a bunch of hash tables for the entry points to the object graph is enough. One of the strengths of the repository approach is that it gives you a consistent way to access (and stub out) non-SQL data sources too. This means that your object-relational mapping system is also hidden inside the repository.
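To make that concrete, here's a minimal sketch (the domain class and method names are purely illustrative): the domain code sees only the repository interface, the production implementation would wrap the O/R mapper, and the test configuration plugs in the hash-table version instead.

```java
import java.util.HashMap;
import java.util.Map;

// The domain code talks only to this interface, never to SQL directly.
interface CustomerRepository {
    Customer findById(long id);
    void save(Customer customer);
}

// For tests, a hash table is all the "database" we need. The production
// implementation (wrapping JDBC or an O/R mapper) is omitted here.
class InMemoryCustomerRepository implements CustomerRepository {
    private final Map<Long, Customer> customers =
        new HashMap<Long, Customer>();

    public Customer findById(long id) {
        return customers.get(id);
    }

    public void save(Customer customer) {
        customers.put(customer.getId(), customer);
    }
}

class Customer {
    private final long id;
    private final String name;

    Customer(long id, String name) {
        this.id = id;
        this.name = name;
    }

    long getId() { return id; }
    String getName() { return name; }
}
```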
Indeed a few people actively dislike using SQL in-memory databases, believing that they encourage spreading either SQL or object-relational mapper code around the domain model. Running SQL in-memory may remove much of the pain of slow access, but it acts as a deodorant to cover the smell of a missing repository.
Testing is the main driver thus far, but I think there's more to come from in-memory databases. Memory sizes are now large enough that many application databases can be loaded entirely into memory. If you use an approach that keeps an event log of all changes to your application state, you can treat the in-memory database as a cache of the result of applying the log, rebuilding it and snapshotting it as you need. Such styles can be very scalable and perform well in cases where you have lots of readers and few writers. I've run into a few cases where people have used in-memory databases for very high performance applications. A difference here is that these experiences tend to be with niche commercial databases, while for testing people seem to prefer open source.
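Here's a rough sketch of that style, with a made-up event type and store (a real system would persist the log durably and snapshot it periodically): the in-memory map is nothing more than a cache of the result of replaying the log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A hypothetical event recording one change to application state.
class AccountCredited {
    final String accountId;
    final long amountInCents;

    AccountCredited(String accountId, long amountInCents) {
        this.accountId = accountId;
        this.amountInCents = amountInCents;
    }
}

class AccountStore {
    // The log is the system of record; in a real system it would be
    // written durably. The map is just a cache of applying it.
    private final List<AccountCredited> log =
        new ArrayList<AccountCredited>();
    private final Map<String, Long> balances =
        new HashMap<String, Long>();

    void record(AccountCredited event) {
        log.add(event);   // append to the log first...
        apply(event);     // ...then update the in-memory view
    }

    long balanceOf(String accountId) {
        Long balance = balances.get(accountId);
        return balance == null ? 0L : balance;
    }

    // After a restart (or starting from a snapshot), replay the log
    // to rebuild the in-memory database.
    void rebuildFromLog(List<AccountCredited> persistedLog) {
        balances.clear();
        for (AccountCredited event : persistedLog) {
            apply(event);
        }
    }

    private void apply(AccountCredited event) {
        long current = balanceOf(event.accountId);
        balances.put(event.accountId, current + event.amountInCents);
    }
}
```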
Prevayler got a lot of attention for taking this kind of approach. People I know who tried it found that its tight coupling to the in-memory objects and its lack of migration tools caused serious problems. But I think the approach of persistent change logs as systems of record is fertile ground to explore in the future.
Follow Ups
I got some interesting mail after writing this page, so I thought I'd share some of the points.
One correspondent said he liked using in-memory databases for tasks that SQL does well but objects don't do as well. There are certainly cases where SQL can solve a problem rather more elegantly than objects or procedural code, although usually I find it's only a minority of developers who like thinking in SQL.
My colleague Steve Sparks told me about a recent project where, for testing, they would pull data from the live database on the first call and then save that data in a file to initialize the in-memory repository, so that subsequent queries wouldn't hit the database. I first saw that done on the C3 project, which kept its data in a hash table keyed by the SQL query string. If there was no value for a query, it went to DB2 and cached the result.
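That C3-style cache is easy to picture; here's a rough sketch (the class and interface names are made up): results are keyed by the SQL string, and only a cache miss goes to the real database.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for whatever actually talks to the live database.
interface QueryRunner {
    List<String[]> run(String sql);
}

// Results are keyed by the SQL query string; only a miss
// goes through to the real database.
class CachingQueryRunner implements QueryRunner {
    private final Map<String, List<String[]>> cache =
        new HashMap<String, List<String[]>>();
    private final QueryRunner realDatabase;

    CachingQueryRunner(QueryRunner realDatabase) {
        this.realDatabase = realDatabase;
    }

    public List<String[]> run(String sql) {
        List<String[]> rows = cache.get(sql);
        if (rows == null) {
            rows = realDatabase.run(sql);  // hit the live database once...
            cache.put(sql, rows);          // ...and remember the answer
        }
        return rows;
    }
}
```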
Steven Graves pointed out that my original entry didn't really talk enough about the general uses of in-memory databases, as a result of which I did some rewriting and retitled the item.
(Thanks to Peter Becker, Zane Rockenbaugh, and Steve Sparks for their comments. I should also thank various unspecified ThoughtWorkers for helpful comments on our internal mailing list.)