Monday, April 20, 2009

Real-Time Data Warehouse, An Introduction

This is the first part of an ongoing series on Real-Time Data Warehousing. The series consists of draft chapters of my master's thesis, with added links to relevant sources. When possible, topics discussed are demonstrated with Microsft SQL Server 2008. Hope you enjoy!

-mika-



INTRODUCTION

Real-time data warehouse (RTDW) is an analytical database that supports soft real-time reporting applications. A basic requirement for these applications is continuous availability of fresh data. Soft real-time applications are usually performing operative rather than strategic or analytical tasks. However, real-time reporting with read-only semantics is not an OLTP application. RTDW has to serve concurrently multiple read transactions and one write transaction. The problem is to optimize for read transactions without failing to load data in time. A traditional DW loading methodology is to take the warehouse off-line for maintenance at most once a day. Online maintenance with frequently executed update transactions is much less well known and is considered risky by many experts. It also introduces query contention bottlenecks, isolation anomalies and view maintenance phenomena not seen in a DW where reader and maintenance transactions are never executed concurrently [LYS07, MTK06].

This report focuses on online DW maintenance with soft deadlines. When applicable, Microsoft SQL Server database system is used to present examples of RTDW techniques. No actual experiments or performance benchmarks are planned. Recently published methods like the MeshJoin algorithm for storing streaming updates to database tables [PSV08] are not easily demonstrated with present-day database systems and ETL tools.

There are three sequential phases in a maintenance cycle, each having an effect on the propagation time of changes from the source systems to the DW: 
  1. change data capture (CDC) in the source systems [Oracle, MSSQL], 
  2. extract, transform and load process (ETL) and 
  3. incremental maintenance of materialized views (in the DW or OLAP DBMS).

The purpose of CDC is to avoid full load of the source system to the DW by providing change data for the ETL process. Changes in data can be detected from the transaction log, with the help of triggers or by setting up a transactional replica that can pull changes from its master as needed. Trigger-based CDC has two drawbacks: it requires changes in the user databases and there is an immediate performance penalty each time a trigger is fired. For these reasons, trigger-based CDC has gone out of favor recently. Log-based CDC is less intrusive to the user tables, as the preferred method is to use system tables to store change data acquired from the transaction log. Log-based CDC can be implemented using log APIs published by DBMS vendors. In fact, Microsoft SQL Server 2008 CDC tools are just a set of public stored procedures and functions written on top of the log API. Third alternative is to set up a transactional replica to collect data, which also avoids the drawbacks associated with triggers. What CDC method is best depends mostly on the available tools. As a database research topic, CDC barely does exist. It can be considered more as a part of the skillset of a working database professional. This does not diminish its importance in RTDW: without CDC, ETL loads from the source systems would take a lot more time.

Designing and developing an ETL system takes the most man hours in virtually every DW project. A brief description of the work involved is given in a promptly named article "38 subsystems of ETL" [Kim04]. Also, running an ETL process is usually the most time consuming phase of the maintenance cycle [MTK06]. Consequently, any treatment on DW systems must focus on ETL. In RTDW the natural conceptual model for ETL process is a queuing network of streaming inputs, processing nodes and endpoints (i.e., tables in the DW) [KVP05]. Current ETL tools in the market do not fully reflect this model. To be fair, ETL tools like Microsoft SQL Server Integration Services (SSIS) are modeled as a pipeline of data and control flows, but they do not support continuous input streams. Instead, they are designed to handle large bulk loads efficiently. With the current ETL tools one can simulate a continuous ETL process by increasing the frequency of ETL batch loads. Queuing network models with streaming inputs serve primarily as a model for estimating the highest possible ETL batch frequency. It is likely that running and ETL process once a day presents different challenges than running the same process once a minute, 1440 times a day. Robustness of the ETL process usually becomes more important. In many real-time reporting applications it is better to have approximate real-time data instead of delayed but complete data. For example, if a delay is caused by a queues, system quality could be improved by shedding, i.e. dropping a fraction of the incoming data stream, and loading it later completely [PSV08].

Materialized view is a cached view representing the result of a query. Materialized views can speed up query processing greatly, but to realize the benefits two subproblems must be solved: view matching and incremental view maintenance [LaZ07]. High-availability DW systems must also solve a third subproblem: how to maintain views without taking the system offline? View matching determines if a query can be computed from a view. View maintenance is the process of updating a view when base tables are updated. Incremental view maintenance means updating the view without rebuilding it from scratch. Maintenance of materialized views is a mature area with over 25 years of publications [ShI84]. View matching and efficient incremental view maintenance algorithms for SPJG views are well understood [LaZ07]. SPJG views are composed of selection, projection and inner joins with an optional group-by operation. Latest development is to extend view support to SPOJG view, that is, allow views to also contain outer joins [LaZ07]. Online view maintenance algorithms do not block reader transactions during maintenance [Gar06, KiM07]. This is achieved using two versions of the view, i.e. caching. During maintenance, readers are delivered a cached version of the view. Materialized views can be stored in a relational DW or in a separate OLAP DBMS like Microsoft SQL Server Analysis Services (SSAS). In RTDW view maintenance is triggered after every ETL load, or alternatively, after each change to the base tables. In SSAS/SSIS terminology, the first alternative means that the cube processing is scheduled as a last step in an SSIS package, and the second corresponds to using proactive caching in SSAS. Proactive caching is a configurable, online view maintenance technology that detects changes in the base tables and maintains the dependent views automatically.

Generally, a soft relative deadline on a maintenance cycle time affects the overall read performance of the DW. It is necessary to trade-off read performance in order to achieve low maintenance cycle times. The exact trade-off is determined by reporting application requirements. It is common that soft real-time reporting is a feature that is wanted more than needed, in a sense that ultimately no one is willing to pay the price of developing and operating a RTDW. It should be emphasized that a RTDW system should be considered only after finding direct reporting from OLTP database(s) insufficient. Generally, this is the case when analytical query loads are heavy or distributed queries from heterogeneous source systems are unfeasible. Security issues may also necessitate a centralized DW when source systems are administered by different organizations. 

Even when a RTDW is necessary, only a small physical partition of the DW database should be updated continuously. This partition should contain intra-day updates, and it would be regularly merged with the main partition during a daily online maintenance window. Most of the index rebuilding and view maintenance would occur in this maintenance window and the real-time partition would be lightweight in both indexes and materialized views. If analytical queries are executed on an OLAP cube instead of a relational DW, the real-time partition should benefit from a ROLAP storage mode while the vast majority of the cube should use a MOLAP storage [MTK06]. Segregating the real-time (intra-day) data from the rest of the DW has emerged as a best practice to deal with the issues that real-time data warehousing presents [SaB08, MTK06]. 


REFERENCES

[Gar06] Garca, C. (2006), Real time self-maintenable data warehouse, in 'ACM Southeast Regional Conference', pp. 518-524.

[KVP05] Karakasidis, A.; Vassiliadis, P. & Pitoura, E. (2005), ETL Queues for Active Data Warehousing, in 'Proceedings of the 2nd International Workshop on Information Quality in Information Systems', pp. 28--39.

[Kim04] Kimball, R. The 38 Subsystems of ETL. Intelligent Enterprise

[KiM07] Kim, N. & Moon, S. (2007), 'Concurrent View Maintenance Scheme for Soft Real-Time Data Warehouse Systems', Journal of Information Science and Engineering 23(3), 723--740.

[LaZ07] Larson, P-Å. & Zhou, J. (2007), Efficient Maintenance of Materialized Outer-Join Views., in 'ICDE', IEEE, , pp. 56-65.

[LYS07] Lin, Z.; Yang, D.; Song, G. & Wang, T. (2007), Dealing with Query Contention Issue in Real-Time Data Warehouses by Dynamic Multi-level Caches, in 'CIT', pp. 122-127.

[MTK06] Mundy, J.; Thornthwaite, W. & Kimball, R. (2006), The Microsoft Data Warehouse Toolkit: With SQL Server 2005 and the Microsoft Business Intelligence Toolset, Wiley.
(Especially Chapter 17: Real-Time Business Intelligence )

[PSV08] Polytzotis, N.; Skiadopoulos, S.; Vassiliadis, P.; Simitsis, A. & Frantzell, N.-E. (2008), 'Meshing Streaming Updates with Persistent Data in an Active Data Warehouse', IEEE Transactions On Knowledge And Data Engineering 20(7), 976--991.

[SaB08] Santos, R. J. & Bernardino, J. (2008), Real-time data warehouse loading methodology, in 'IDEAS', pp. 49-58.

[ShI84] Shmueli, O. & Itai, A. (1984), Maintenance of Views., in Beatrice Yormark (Ed.): SIGMOD'84, Proceedings of Annual Meeting, Boston, Massachusetts, June 18-21, 1984. ACM Press, pp. 240-255

0 comments: