As you might can tell from Friday’s post, my recurring obsession with persistence has flared up again. In all my years of development, I have still yet to find a solution that feels "perfect". Perhaps the holy grail of storage is impossible to obtain. Perhaps there is no perfect solution. Maybe it’s some Matrix-style philosophical issue, and the idea of a perfect persistence solution is a dream that us primitive developers refuse to accept. In any case, I thought it was time to (once again) survey the landscape of solutions to see what’s changed since I went on this persistence trip a year ago. Seriously, there must be something about January, because I think I go on this tear every year about this time. Anyway…
Storage for Developers 101
In the beginning, there were files. And files were good, we still use the crap out of them today. But they were lacking something (several somethings, in fact). How do you read/write to files quickly? How do you maintain some degree of integrity and consistency in the data? How do you handle concurrency? It turns out that these cross-cutting concerns were not application specific, they were universal requirements that needed to be addressed, and most developers just weren’t very good at it. Enter the database.
Early databases were not much more than flat files with some API abstractions built on top of them. There was no SQL, no joins, nothing of that sort. It wasn’t until the 1970s that the method that would dominate our industry for the next 30 years (and counting) was introduced: the relational database. By the 1980s, these things had really taken off, and we had commercial systems being rolled out by a variety of vendors. The Structured Query Language (better known as SQL or "Sequel") became a standard of sorts and was supported by most vendors, and heavyweights IBM, Microsoft, and Oracle rose to power to dominate the scene.
While relational databases were on the rise, alternative systems were also being proposed. Object-oriented databases began to gain some traction (though mostly in academia) due in large part to the paradigm shift towards object-oriented programming. There was also this little thing called XML that showed up on the scene, and with it XML databases, that provided the flexible schema and hierarchical data support that relational databases are not very good at. Recently, there has also been a shift towards document-oriented databases, which have an even less strict structure than XML databases.
Despite all these advances, most of us in the .NET world still immediately fall back to relational databases as soon as we need to store something. For many (myself included, for the most part), the alternatives are too untested, immature, and unproven to be considered. Many "architects" even start building new applications from the database up, beginning with an ER diagram and only creating the object-oriented view once all the entities are well-defined.
Why the focus on storage?
If it often seems like a disproportionate amount of development time is spent on database access code, that’s because it is. Studies have shown that as much as 40% of total effort in a software project is spent on writing and maintaining data access code. That’s 40% of your effort that’s not meeting requirements, that’s not adding useful features, that’s not adding value to the application. "But Matt, the database drives everything, without it we have no system, so our data access code is actually adding more value than anything else!" WRONG. Do you really think that the customer cares whether you use NHibernate or ActiveRecord or SubSonic or MyHomeGrownDAL or whatever? No, they don’t, what they care about is whether the application meets their needs. Sure, there are exceptions, and sometimes the underlying database itself actually fulfills business requirements, but often times it doesn’t. So why spend 40% of your time on something that isn’t actually improving your application? If you are using a relational database, the answer is because you have to.
The Impedance Mismatch
One of the reasons that data access code takes up so much of the development effort is because there is a gigantic mismatch between the view of relational databases and the object-oriented view that our application code takes. This square-peg-in-a-round-hole scenario is referred to as the impedance mismatch, and it is the bane of modern development. Translating between the relational world and the object world is expensive, painful, and time consuming. The sheer number of tools that exist to help overcome this barrier is a testament to the scope of this problem.
So if relational databases are such a pain, what’s the alternative? I’m still not convinced that there is one, but there are a couple of areas that I’m looking in to further. The first is object-oriented database systems, such as db4Objects. With an object-oriented database, the impedance mismatch disappears completely, you just send your object to the database as-is, and the database stores it. There are downside to object-oriented databases though: aggregate queries, such as totaling sales by employee in a retail store database, are much, much more expensive than their relational counterparts. They also don’t have a widely adopted and understood query language like relational databases have with SQL.
Another alternative is an XML database, like eXist. XML databases can’t natively store objects, but they can store hierarchical data easily, and many object-oriented languages (.NET included) have solid support for converting object graphs to and from XML representations. An upside to XML databases is that there are actually somewhat widely-adopted languages for querying them, such as XQuery and XPath.
Yet another alternative is a document-oriented database, such as the popular newcomer CouchDB. Document-oriented storage is very (very) flexible, but it is also very immature. I do not know of any enterprise-critical applications that are built on top of it (that doesn’t mean they aren’t out there, it just means I don’t know about them). There are no industry standards on how to query these things yet.
In future entries to this series, I’m going to look at as many of the alternatives to relational databases as I can. If you have suggestions, please send them to me. I will be using Castle ActiveRecord as the measuring-stick for the relational database in terms of ease-of-use (I have yet to see anything that matches both its flexibility and ease-of-use at the same time), but I am willing to look at alternative approaches if/when appropriate. I’m going to focus on three main factors: feasibility (which includes performance, tool support, reliability, etc), ease-of-use (how quickly can I go from nothing to fully-working storage layer), and flexibility in terms of meeting other business requirements (such as reporting and business intelligence). If anyone has suggestions on other criteria that should be considered, let me know.