Sunday, April 26, 2009

A Postmortem of a Failed Trigger Integration

Last summer I wrote an admittedly noninspiring post on trigger integration between two database instances. Guess what? The solution failed and I had to start over. Actually, all this happened before Christmas, but until now there has not been enough time to write a postmortem.

Good thing is that the new, rewritten solution works fine. I think that the last modifications were made early March and since then nothing has been heard of the client. Nothing, total silence. Now, for those not in the field, silence is about as close to success as you can get in the world of business software. Unless your users have abandoned your software, that is.

But before this bliss there were issues... so what went wrong? Well, the first thing thing that went wrong was that I was unprepared. I should have at least skimmed a book like Enterprise Integration Patterns before starting the project. Second, I should have better understood why our client needed the integration to make better decisions during initial development, especially during source-to-target mapping. Basically this was an application integration (EAI) project that later turned into an operative data integration (ETL) project. There are key differences between the two, and the reason of turning to data-level integration was that one endpoint, a 90's heritage windows client app, was totally out of my control. 

Wrong assumption: Simplicity
The original integration plan was built on a premise of simplicity. There would be two applications with two tables on each side (a person and a company tables), and they would stay synchronized via triggers on both ends. When a user adds a person in one application UI, a transaction would be triggered that would write the same data to the other application's person table. Or abort the whole thing, so that there would no insert in either.

Unfortunately, things were not that simple. In a normalized database schema, adding a new person through a UI will probably read and write multiple tables. Phone numbers and email addresses would go to some contact table, profile data like marketing restrictions to a profile table, and there would be a references from persons to a company table for all persons whose employers were known.

Solving all this complexity took time but eventually it worked. During testing all CRUD operations worked fine on both ends. So the increased complexity caused by dependencies did not break the system. But it made it fragile and too weak to withstand the final blow.

After testing, the integration was rolled out into production. Soon afterwards reports started to come in, and finally the client's representative called and said that they couldn't do any work. I took down the integration temporarily, waited until late evening, and started to debug in the production environment.

What I found was a deadlock in the legacy client-server application. There were no reported problems on the new web application side. But the catch was that this old client-server app had actually two fat client UIs, one of which was not available during testing. And this second one wrote to tables in slightly different order than the first, tested client, and would therefore fire triggers in different order than planned and cause a deadlock.

Now, at this point it became clear that the legacy application should not play any role in the integration. Not even the role of firing a trigger when a "Save" button was pressed. The solution needed to be uncoupled from the client app.

Wrong technology: Triggers
Some of the readers might have noticed the words trigger integration at the beginning. Triggers are not in fashion these days. The problem with triggers is that anything unexpected breaks them, and it is painful to debug them. This pretty much goes to declaratice, set-based SQL in general. Most programmers, myself included, are not very productive writing professional quality software with T-SQL. For example, an integration solution should have error logging facilities and behave reasonably well when networking is down. But I wouldn't expect .NET developers to write as high quality exception handling in T-SQL as they do in .NET languages.

Right technology: Asynchronous broker service
The new solution was written in C# as a Windows service. It collects change data in both systems, resolves any conflicts between them and calculates a change set to be written to each database. This would be scheduled to run regularly like once in an hour. There are two major design considerations in building this kind of a synchronization:
  • how to collect change data
  • how to resolve conflicts
Avoiding wrong choices here was actually one factor in taking the synchronous path first. In a T-SQL trigger, change data would be stored in inserted and deleted tables. Also, there would be no conflicts because transactions would take care of concurrent updates.

Without going to details here, a user-table level change data capture can use timestamps columns in the tracked table or maintain a separate change table. The problem with timestamps is that you have to trust them. In two servers, clocks will go out-of sync unless a time server is used. Second, delete operations should be lazy, otherwise they cannot be detected.

In our case, time service was provided by Active Directory domain controller and there was no hard deletes. Conflict resolution could be handled by selecting the row with the most recent timestamp. So far, this simple solution has worked well enough.


-mika-

Monday, April 20, 2009

Real-Time Data Warehouse, An Introduction

This is the first part of an ongoing series on Real-Time Data Warehousing. The series consists of draft chapters of my master's thesis, with added links to relevant sources. When possible, topics discussed are demonstrated with Microsft SQL Server 2008. Hope you enjoy!

-mika-



INTRODUCTION

Real-time data warehouse (RTDW) is an analytical database that supports soft real-time reporting applications. A basic requirement for these applications is continuous availability of fresh data. Soft real-time applications are usually performing operative rather than strategic or analytical tasks. However, real-time reporting with read-only semantics is not an OLTP application. RTDW has to serve concurrently multiple read transactions and one write transaction. The problem is to optimize for read transactions without failing to load data in time. A traditional DW loading methodology is to take the warehouse off-line for maintenance at most once a day. Online maintenance with frequently executed update transactions is much less well known and is considered risky by many experts. It also introduces query contention bottlenecks, isolation anomalies and view maintenance phenomena not seen in a DW where reader and maintenance transactions are never executed concurrently [LYS07, MTK06].

This report focuses on online DW maintenance with soft deadlines. When applicable, Microsoft SQL Server database system is used to present examples of RTDW techniques. No actual experiments or performance benchmarks are planned. Recently published methods like the MeshJoin algorithm for storing streaming updates to database tables [PSV08] are not easily demonstrated with present-day database systems and ETL tools.

There are three sequential phases in a maintenance cycle, each having an effect on the propagation time of changes from the source systems to the DW: 
  1. change data capture (CDC) in the source systems [Oracle, MSSQL], 
  2. extract, transform and load process (ETL) and 
  3. incremental maintenance of materialized views (in the DW or OLAP DBMS).

The purpose of CDC is to avoid full load of the source system to the DW by providing change data for the ETL process. Changes in data can be detected from the transaction log, with the help of triggers or by setting up a transactional replica that can pull changes from its master as needed. Trigger-based CDC has two drawbacks: it requires changes in the user databases and there is an immediate performance penalty each time a trigger is fired. For these reasons, trigger-based CDC has gone out of favor recently. Log-based CDC is less intrusive to the user tables, as the preferred method is to use system tables to store change data acquired from the transaction log. Log-based CDC can be implemented using log APIs published by DBMS vendors. In fact, Microsoft SQL Server 2008 CDC tools are just a set of public stored procedures and functions written on top of the log API. Third alternative is to set up a transactional replica to collect data, which also avoids the drawbacks associated with triggers. What CDC method is best depends mostly on the available tools. As a database research topic, CDC barely does exist. It can be considered more as a part of the skillset of a working database professional. This does not diminish its importance in RTDW: without CDC, ETL loads from the source systems would take a lot more time.

Designing and developing an ETL system takes the most man hours in virtually every DW project. A brief description of the work involved is given in a promptly named article "38 subsystems of ETL" [Kim04]. Also, running an ETL process is usually the most time consuming phase of the maintenance cycle [MTK06]. Consequently, any treatment on DW systems must focus on ETL. In RTDW the natural conceptual model for ETL process is a queuing network of streaming inputs, processing nodes and endpoints (i.e., tables in the DW) [KVP05]. Current ETL tools in the market do not fully reflect this model. To be fair, ETL tools like Microsoft SQL Server Integration Services (SSIS) are modeled as a pipeline of data and control flows, but they do not support continuous input streams. Instead, they are designed to handle large bulk loads efficiently. With the current ETL tools one can simulate a continuous ETL process by increasing the frequency of ETL batch loads. Queuing network models with streaming inputs serve primarily as a model for estimating the highest possible ETL batch frequency. It is likely that running and ETL process once a day presents different challenges than running the same process once a minute, 1440 times a day. Robustness of the ETL process usually becomes more important. In many real-time reporting applications it is better to have approximate real-time data instead of delayed but complete data. For example, if a delay is caused by a queues, system quality could be improved by shedding, i.e. dropping a fraction of the incoming data stream, and loading it later completely [PSV08].

Materialized view is a cached view representing the result of a query. Materialized views can speed up query processing greatly, but to realize the benefits two subproblems must be solved: view matching and incremental view maintenance [LaZ07]. High-availability DW systems must also solve a third subproblem: how to maintain views without taking the system offline? View matching determines if a query can be computed from a view. View maintenance is the process of updating a view when base tables are updated. Incremental view maintenance means updating the view without rebuilding it from scratch. Maintenance of materialized views is a mature area with over 25 years of publications [ShI84]. View matching and efficient incremental view maintenance algorithms for SPJG views are well understood [LaZ07]. SPJG views are composed of selection, projection and inner joins with an optional group-by operation. Latest development is to extend view support to SPOJG view, that is, allow views to also contain outer joins [LaZ07]. Online view maintenance algorithms do not block reader transactions during maintenance [Gar06, KiM07]. This is achieved using two versions of the view, i.e. caching. During maintenance, readers are delivered a cached version of the view. Materialized views can be stored in a relational DW or in a separate OLAP DBMS like Microsoft SQL Server Analysis Services (SSAS). In RTDW view maintenance is triggered after every ETL load, or alternatively, after each change to the base tables. In SSAS/SSIS terminology, the first alternative means that the cube processing is scheduled as a last step in an SSIS package, and the second corresponds to using proactive caching in SSAS. Proactive caching is a configurable, online view maintenance technology that detects changes in the base tables and maintains the dependent views automatically.

Generally, a soft relative deadline on a maintenance cycle time affects the overall read performance of the DW. It is necessary to trade-off read performance in order to achieve low maintenance cycle times. The exact trade-off is determined by reporting application requirements. It is common that soft real-time reporting is a feature that is wanted more than needed, in a sense that ultimately no one is willing to pay the price of developing and operating a RTDW. It should be emphasized that a RTDW system should be considered only after finding direct reporting from OLTP database(s) insufficient. Generally, this is the case when analytical query loads are heavy or distributed queries from heterogeneous source systems are unfeasible. Security issues may also necessitate a centralized DW when source systems are administered by different organizations. 

Even when a RTDW is necessary, only a small physical partition of the DW database should be updated continuously. This partition should contain intra-day updates, and it would be regularly merged with the main partition during a daily online maintenance window. Most of the index rebuilding and view maintenance would occur in this maintenance window and the real-time partition would be lightweight in both indexes and materialized views. If analytical queries are executed on an OLAP cube instead of a relational DW, the real-time partition should benefit from a ROLAP storage mode while the vast majority of the cube should use a MOLAP storage [MTK06]. Segregating the real-time (intra-day) data from the rest of the DW has emerged as a best practice to deal with the issues that real-time data warehousing presents [SaB08, MTK06]. 


REFERENCES

[Gar06] Garca, C. (2006), Real time self-maintenable data warehouse, in 'ACM Southeast Regional Conference', pp. 518-524.

[KVP05] Karakasidis, A.; Vassiliadis, P. & Pitoura, E. (2005), ETL Queues for Active Data Warehousing, in 'Proceedings of the 2nd International Workshop on Information Quality in Information Systems', pp. 28--39.

[Kim04] Kimball, R. The 38 Subsystems of ETL. Intelligent Enterprise

[KiM07] Kim, N. & Moon, S. (2007), 'Concurrent View Maintenance Scheme for Soft Real-Time Data Warehouse Systems', Journal of Information Science and Engineering 23(3), 723--740.

[LaZ07] Larson, P-Å. & Zhou, J. (2007), Efficient Maintenance of Materialized Outer-Join Views., in 'ICDE', IEEE, , pp. 56-65.

[LYS07] Lin, Z.; Yang, D.; Song, G. & Wang, T. (2007), Dealing with Query Contention Issue in Real-Time Data Warehouses by Dynamic Multi-level Caches, in 'CIT', pp. 122-127.

[MTK06] Mundy, J.; Thornthwaite, W. & Kimball, R. (2006), The Microsoft Data Warehouse Toolkit: With SQL Server 2005 and the Microsoft Business Intelligence Toolset, Wiley.
(Especially Chapter 17: Real-Time Business Intelligence )

[PSV08] Polytzotis, N.; Skiadopoulos, S.; Vassiliadis, P.; Simitsis, A. & Frantzell, N.-E. (2008), 'Meshing Streaming Updates with Persistent Data in an Active Data Warehouse', IEEE Transactions On Knowledge And Data Engineering 20(7), 976--991.

[SaB08] Santos, R. J. & Bernardino, J. (2008), Real-time data warehouse loading methodology, in 'IDEAS', pp. 49-58.

[ShI84] Shmueli, O. & Itai, A. (1984), Maintenance of Views., in Beatrice Yormark (Ed.): SIGMOD'84, Proceedings of Annual Meeting, Boston, Massachusetts, June 18-21, 1984. ACM Press, pp. 240-255