Today has not been a fun day. I have spent most of today and a large part of yesterday trying to fix a problem in our system. The problem seemed very simple at first, and indeed we came up with a dozen or so ideas for solving it. In the end, though, none of the ideas could be implemented given our time constraints. Why? How could something that seems so simple be so difficult to fix? The answer lies in a decision that we made four years ago, a decision that seemed like a great idea at the time: we chose the wrong architecture.
Four years ago, we were designing the "final" version of a massive text and data mining platform after going through three previous major prototype efforts (each consisting of about 4-8 months of development and culminating in a somewhat-working product). Based on our successes and failures in the previous prototypes, we decided we wanted to try something new. Instead of writing a bunch of architectural infrastructure, we would jump on the web services bandwagon, make everything a service, and stick it all in IIS. It seemed so simple at the time… we would go with the pipe & filter pattern, using asynchronous web services as the transportation mechanism. The feeling was that we didn’t really need anything overly reliable or performant. Our system was expected to take weeks to run, processing hundreds of thousands of documents. We did the math and thought "yeah, this architecture should handle that fine." Plus, we really thought that using IIS and web services would reduce the amount of architectural and infrastructure plumbing and management we would have to do. Everything would be loosely coupled (asynchronous and all), and indeed, it was and is to this very day. It would be robust in the sense that it could recover from errors, and to a degree, this is true. If something goes wrong while processing a document, the system will eventually try again. And since we were using IIS as the host, we wouldn’t have to write our own hosting services, and again, we sort-of hit the mark. But there were problems. Oh man, there are still problems.
Fast forward four years, and I am convinced that our architecture has been more of a hindrance than a help. Everything is asynchronous and decoupled, but it was a lot of work to get it there. Did you know you can’t send SOAP messages that are 50 MB long by default (at least you couldn’t with .NET 1.1)? We found that out the hard way. Did you know that the XML serializer, which .NET uses for web services, fails to escape a whole slew of characters? Again, we found that out the hard way after a lot of painful debugging. Do you know what happens when you fire a document off to an asynchronous pipeline? Neither does the process that sent it! Is it in there? Did it come out the other side? Should I resend it? The only way we could address that was by "guessing" how long it would take the document to make it through, then essentially looking for it on the other end. Did it come out? No? Then resend it!
And that, my friends, is what I have spent the last two days working on. Let’s think about that strategy for a second. We send a document to a pipeline, wait for some amount of time, then look to see if it has come out the other end. If not, surely that means something went wrong, and the document died somewhere in the pipeline in a burst of exceptiony goodness. Right? WRONG. The document may very well still be in there. Someone may have fed in a patent document that contains a massive DNA sequence. One of the processes in the pipeline may be faithfully chugging away, trying to figure out what the various letters in the sequence mean. But we don’t know that. All we know is that the document never made it out the other end, so we have to assume the worst and send another copy in. Great. Now we have a second thread faithfully chugging away on the same DNA sequence. Again, we wait, then look to see if it has come out the other end. No? Send it again! We now have three copies of the document eating up three threads on a four-core machine. One more pass like that, and we have effectively clogged the pipeline. Throw in 8 more copies for good measure, and you can rest assured that the pipeline is now permanently blocked until IIS is reset. This is the bug I’ve been trying like crazy to fix for two days: how does our controlling process (which we weren’t even supposed to have to create according to our original architectural grand vision) know what’s going on? There’s no good answer. I thought of a few hacks, but most wouldn’t work. The hack I went with was basically to try to detect documents that *might* contain genetic sequences and ignore them. In a system that will see hundreds of thousands of documents in a week, I’m pretty confident that things will be filtered that shouldn’t have been.
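To make the failure mode concrete, here is a toy simulation of the "wait, look, resend" loop. It is not our actual code (the real system is .NET web services hosted in IIS); it is a minimal Python sketch, and every name and number in it (`simulate_blind_resend`, the tick counts, the worker count) is made up for illustration. The only thing it borrows from the real system is the key limitation: the controller can see the output of the pipeline but nothing inside it.

```python
def simulate_blind_resend(workers, processing_ticks, timeout_ticks, max_resends):
    """Simulate a controller that resends a document whenever it has not
    come out of the pipeline by the end of a timeout window.

    All parameters are hypothetical units ("ticks"), not measurements.
    Returns (copies_sent, copies_waiting, busy_history), where
    busy_history[i] is the number of busy workers right after send i.
    """
    finish_times = []   # tick at which each running copy would finish
    copies_waiting = 0  # copies sent while every worker was already busy
    busy_history = []
    copies_sent = 0
    now = 0

    for _ in range(max_resends + 1):
        # The controller fires a copy into the pipeline.
        copies_sent += 1
        if len(finish_times) < workers:
            finish_times.append(now + processing_ticks)
        else:
            copies_waiting += 1
        busy_history.append(len(finish_times))

        # It waits out the timeout, then looks at the output side only.
        now += timeout_ticks
        if any(t <= now for t in finish_times):
            break  # the document came out; stop resending

    return copies_sent, copies_waiting, busy_history


if __name__ == "__main__":
    # A slow document (think: giant DNA sequence) on a four-worker pipeline.
    sent, waiting, busy = simulate_blind_resend(
        workers=4, processing_ticks=1000, timeout_ticks=10, max_resends=11
    )
    print(sent, waiting, busy)
```

With a document that takes far longer than the timeout, the fourth copy occupies the last worker and every copy after that just piles up behind it, which is the clog described above. With a document that finishes inside the timeout window, only one copy is ever sent, which is why the strategy looks fine until a pathological document shows up.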
Anyway, the moral of this rambling post is simply this: architecture, especially in an enterprise application, is critical. You do not want to get this piece wrong, or whoever takes over for you when you finally go insane from all the hacks you’ve had to implement to work around the danged architecture will pay for it. Think through everything: how it will work under normal conditions, how it will work under load, how it will work when under attack, how it will respond to every conceivable error, how flexible it needs to be, how difficult it will be to maintain… do not skimp on this step, or you will be sorry.