Good thing is that the new, rewritten solution works fine. I think that the last modifications were made early March and since then nothing has been heard of the client. Nothing, total silence. Now, for those not in the field, silence is about as close to success as you can get in the world of business software. Unless your users have abandoned your software, that is.
But before this bliss there were issues... so what went wrong? Well, the first thing thing that went wrong was that I was unprepared. I should have at least skimmed a book like Enterprise Integration Patterns before starting the project. Second, I should have better understood why our client needed the integration to make better decisions during initial development, especially during source-to-target mapping. Basically this was an application integration (EAI) project that later turned into an operative data integration (ETL) project. There are key differences between the two, and the reason of turning to data-level integration was that one endpoint, a 90's heritage windows client app, was totally out of my control.
Wrong assumption: Simplicity
The original integration plan was built on a premise of simplicity. There would be two applications with two tables on each side (a person and a company tables), and they would stay synchronized via triggers on both ends. When a user adds a person in one application UI, a transaction would be triggered that would write the same data to the other application's person table. Or abort the whole thing, so that there would no insert in either.
Unfortunately, things were not that simple. In a normalized database schema, adding a new person through a UI will probably read and write multiple tables. Phone numbers and email addresses would go to some contact table, profile data like marketing restrictions to a profile table, and there would be a references from persons to a company table for all persons whose employers were known.
Solving all this complexity took time but eventually it worked. During testing all CRUD operations worked fine on both ends. So the increased complexity caused by dependencies did not break the system. But it made it fragile and too weak to withstand the final blow.
After testing, the integration was rolled out into production. Soon afterwards reports started to come in, and finally the client's representative called and said that they couldn't do any work. I took down the integration temporarily, waited until late evening, and started to debug in the production environment.
What I found was a deadlock in the legacy client-server application. There were no reported problems on the new web application side. But the catch was that this old client-server app had actually two fat client UIs, one of which was not available during testing. And this second one wrote to tables in slightly different order than the first, tested client, and would therefore fire triggers in different order than planned and cause a deadlock.
Now, at this point it became clear that the legacy application should not play any role in the integration. Not even the role of firing a trigger when a "Save" button was pressed. The solution needed to be uncoupled from the client app.
Wrong technology: Triggers
Some of the readers might have noticed the words trigger integration at the beginning. Triggers are not in fashion these days. The problem with triggers is that anything unexpected breaks them, and it is painful to debug them. This pretty much goes to declaratice, set-based SQL in general. Most programmers, myself included, are not very productive writing professional quality software with T-SQL. For example, an integration solution should have error logging facilities and behave reasonably well when networking is down. But I wouldn't expect .NET developers to write as high quality exception handling in T-SQL as they do in .NET languages.
Right technology: Asynchronous broker service
The new solution was written in C# as a Windows service. It collects change data in both systems, resolves any conflicts between them and calculates a change set to be written to each database. This would be scheduled to run regularly like once in an hour. There are two major design considerations in building this kind of a synchronization:
- how to collect change data
- how to resolve conflicts
Avoiding wrong choices here was actually one factor in taking the synchronous path first. In a T-SQL trigger, change data would be stored in inserted and deleted tables. Also, there would be no conflicts because transactions would take care of concurrent updates.
Without going to details here, a user-table level change data capture can use timestamps columns in the tracked table or maintain a separate change table. The problem with timestamps is that you have to trust them. In two servers, clocks will go out-of sync unless a time server is used. Second, delete operations should be lazy, otherwise they cannot be detected.
In our case, time service was provided by Active Directory domain controller and there was no hard deletes. Conflict resolution could be handled by selecting the row with the most recent timestamp. So far, this simple solution has worked well enough.
-mika-

0 comments:
Post a Comment