The Data Warehouse: An Overview

by James R. Baldwin

CS 615 - Advanced Database

Spring 1997

Abstract.

Data warehousing is a commercially significant database technique that draws data from widely disparate sources to support business analytical needs. It requires architectural approaches, such as multidimensional data and summary views, that are totally unlike traditional transaction processing systems and it presents its own set of problems with accuracy and consistency, but it offers new analytical approaches based on multidimensionality and aggregation.

What is a Data Warehouse?

The variety and commercial success of data warehousing products and services has grown significantly over the past three years. It has been estimated that the market for such wilt be $8 billion by 1998. [CHAU97] Economically at least, this is an area worthy of study. So, what is a data warehouse?

Mattison points out that there is no authoritative definition of data warehouse. There is no governing standards organization and although Inmon and Hackathorn are widely recognized for their work, they did not originate the concept. [MATT96]

Members of the Stanford Data Warehouse group define it as "a repository of integrated information from distributed, autonomous, and possibly heterogeneous sources." [WIEN96] [WID095] [YANG97]

Inmon defines a data warehouse as a "collection of data in support of management's decision making process" with four basic characteristics: subject-orientation, integration, time variancy and nonvolatility. [INMO95]

Data in the data warehouse is drawn from the operational environment, but unlike the operational environment where the orientation is operations or processes, the data warehouse is subject oriented. Data from various operational systems must be integrated prior to inclusion in the warehouse: naming conventions, physical attributes, encoding and units of measurement must be consistent. Another significant difference between operational data and the data warehouse is the time horizon. Operational data is typically current without the overhead of extensive historical data. The data warehouse is time variant, it contains numerous "snapshots" of data over a period of years. Because of the historical nature of the warehouse data, it is nonvolatile - not subject to record 1evel update. [INMO95]

Why Data Warehouse?

Inmon declares the data warehouse to be the "architectural foundation of decision support systems." While admittedly, applications such as data mining can be accomplished without a data warehouse, their chances of success are greatly improved by the presence of a data warehouse. [INMO96]

Queries and analysis are greatly facilitated because relevant information has been extracted, filtered and integrated prior to the query's submission, or even formulation, Data for the query or analysis is directly available and the differences that existed in the various data sources have already been resolved. The efficiency of source applications is not impaired by the execution of frequent and possibly complex queries and analyses. And availability of the data does not depend on the availability of the source. [HAMM95]

Operational systems frequently do not retain historical data. However, historical trends are an important business analysis supported by the data warehouse by collecting periodic snapshots of the source data. [YANG97]

Problems.

Jones prepared order of magnitude estimates of the data that might reside in a large international corporation. He found there is a shortage of data on error rates and the severity of errors, but that even with a low rate of errors the absolute volume can be significant (e.g., 500,000-600,000 errors in the exemplary estimate that he developed). Areas in need of further research include a need for effective metrics for data quality. [JONE95]

Hammer, et. al., recognized several potential drawbacks to data warehousing. There is additional storage required to store data duplicated from operational systems in the data warehouse. This is offset by the rapidly declining price of disk storage and the filtering and summarization that is applied to the data prior to loading it into the warehouse. Warehouse data can also become inconsistent with the source data - it can become out of date.

A more serious problem is the requirement that data to be loaded into the warehouse must be specified in advance as to source, format, filtering and summarization. This can be inappropriate for some requirements - where absolute current data is needed or users needs are unpredictable. [HAMM95]

Architectural considerations.

Architecturally, the data warehouse can be described in several ways, either by the types and structure of its data or by the processes taking place within it.

Inmon defines five components: metadata, current detail data, older detail data, lightly summarized data, and highly summarized data. [INMO95]

Wiener, et al., perceive two major components: an integration component, organizing data as it is loaded into the warehouse, and a query and analysis component for extracting data. [WIEN96]

The Stanford Data Warehouse Project has developed an experimental warehouse to test various concepts, focused primarily on the integration component. Architectural goals of this project were:

Plug and play modularity - it should be possible to use various databases for the warehouse proper and to draw data from all types of sources.
Scalability - both in terms of growing data loads and increasing numbers of views.
24X7 operation - with large organizations spreading over many time zones the traditional process at night or on weekends cycles become less feasible; data must be available at any time, but updates must still be processed.
Data consistency - the system should be able to support consistency with source data at desired levels.
Support for different source types - the system should be able to use the resources of cooperative sources to detect changes and to compare snap-shots of uncooperative sources to detect changes. [WIEN96]

Both data and process based definitions are more similar than they appear - each differing more in emphasis than substance, The graphical illustrations of each model make the similarities explicit. Most authors include at a minimum: source data, some type of integration mechanism, the warehouse database proper, access mechanisms and a metadata repository.

Another issue relates to the number of data warehouses present in an organization. The original data warehouse concept defined a corporate data warehouse - a single database serving the entire enterprise. Subsequently a more specialized form of the data warehouse called the departmental data mart was developed. [ZORN96] Inmon asserts that an architecture consisting of a central corporate data warehouse and multiple departmental data marts works well. In fact, the data mart should draw its data from the corporate data warehouse rather than directly from operational sources. This approach reduces redundancy and interface complexity while providing data integration. On the other hand, multiple corporate data warehouses are only recommended for very large, diverse and loosely integrated organizations. [INM097]

A key architectural concept which greatly facilitates the analytical use of data warehouses is the multidimensional model. Multidimensional data is organized around key numeric measures such as sales, costs, budget, etc. Associated with each amount are temporal, geographic and business dimensions. For instance, sales might be described by date, location and product line. Each dimension has attributes and is subject to aggregation. For instance, date might be viewed as daily, monthly or yearly or location might be store, city or state. [CHAU97]

The multidimensional model can be implemented directly in a multidimensional database using special data structures, such as arrays, Another approach is to use a relational database with a special database design. One such design is the star schema. This design consists of a central fact table consisting of the numeric data and foreign keys for each dimension and tables for each dimension.

Support for attribute aggregation is provided by normalizing the dimension tables of the star schema forming a snowflake schema. Of course, where there are stars, there are constellations. Fact constellations are star clusters in which different fact tables share dimension tables. [CHAU97]

Metadata.

Given the nature of warehouse data, drawn from numerous and disparate sources, and its use as an analytic too1, subject to ad hoc query by analysts and managers who need information about the data and its sources, metadata is a much more important topic than it is for operational data where interactions with the data are primarily controlled by applications. Metadata consists of data mappings, extract history, alias names, summarization algorithms, relation artifacts, ownership, access patterns, reference tables and data model/design references. [APT96]

Delivery.

To be useful, warehouse data must be delivered into the hands of the intended users. A principle approach to this is Online Analytical Processing (OLAP) and a newer and potentially complementary technique of using the internet/intranet.

Multidimensional data supports an analytical approach that exploits its dimensional hierarchy and the varying levels of aggregation supported by that hierarchy. OLAP operations view the data as cubes within cubes. [FINK94] Codd calls this "a puzzle of information." [CODD93] Typical procedures supported by this method include "rollup" and "drill-down," whereby the analyst is able to reduce or extend the granularity of a particular dimension; "slicing and dicing" whereby the analyst is able to select from a wide variety of viewpoints. [FINK]

Codd, et. at., have established 12 rules for evaluating OLAP servers:

Multidimensional conceptual view
Transparency
Accessibility
Consistent reporting performance
Client/Server architecture
Generic dimensionality
Dynamic sparse matrix handling
Multiuser support
Unrestricted cross-dimensional operations
Intuitive data manipulation
Flexible reporting
Unlimited dimensions and aggregation levels.

[CODD93]

0LAP tools are powerful, complex and relatively expensive. Another currently popular technology is the World Wide Web. Web browsers are simple to use and inexpensive. Web browsers have their own power: platform independence. The potential return for tying data warehouse technology to web technology would seem to be great. [RADE96]

While internet security is an issue that is not fully resolved, internet access is not necessary to enjoy the advantages of Web technology. Corporate Intranets are springing up within the corporate firewalls and simplifying access to all sorts of data. [RADE96]

Carickhoff sees three approaches to delivering OLAP functionality over the Web: an offline approach that creates static HTML reports, features such as drill-down can be emulated using this approach; an on the fly approach requires the OLAP server to populate HTML pages as required; the Java and ActiveX approach uses these tools to format data received from the data warehouse. But a11 is not perfect, the browsers are relatively simple in their functionality, information is passed as text and users tend to view the world one page at a time. [CARI97]

Conclusions.

Data warehousing provides unique analytical benefits to the enterprise. However, there appears to be more practical experience than theoretical understanding of the process and there is much disagreement. I even found two adjacent articles (not cited), one declaring the age of the giant data warehouse is over and the other proclaiming, "They're back!"

OLAP and the multidimensional model have intuitive appeal as business analysis tools, but do not appear to have the mathematical grounding of relational algebra. Nevertheless, even Codd sees the analytical limits of the relational model. [CODD93]

Bibliography
[APT96]	APT Data Group. "Briefing Paper: What is Metadata." [Retrieved from http://www.computerwire.com/bulletinsuk/212e_la6.htm]
[CARI97]	Rich Carickhoff. "A New Face for OLAP," Internet Systems, January 1997. [Retrieved from http://www.dbmsmag.com/9701iO8.html]
[CHAU97]	Surajit Chaudhuri, Umeshwar Dayal. "An Overview of Data Warehousing and OLAP Technology," SIGMOD Record, 26:1, March 1997. [Retrieved from http://bunny.cs.uiuc.edu/sigmod/sigmod_record/9703/]
[CODD93]	Edgar F. Codd, Sharon B. Codd and Clynch T. Salley. "Beyond Decision Support," July 26, 1983.
[FINK94]	Richard Finkelstein. "When OLAP does not relate," Computerworld, 28(50), December 12, 1994.
[FINK]	Richard Flnkelstein. "Understanding the need for On-Line Analytical Servers," White Paper, Performance Computing, Inc.
[HAMM95]	J. Hammer, H. Garcia-Retina, J. Widom, W. Labia, Y. Zhuge. "The Stanford Data Warehousing Project," IEEE Data Engineering Bulletin, June 1995. [Retrieved from http://www-db.stanford.edu/warehousing/publications.html]
[INMO95]	W. H, Inmon, "Tech Topic: What is a Data Warehouse?" Prism Solutions, Inc. 1995. [Retrieved from http://www.cait,wustl,edu/cait/papers/prism/vol1_no1/]
[INMO96]	W. H. Inmon. "The Data Warehouse and Data Mining," Communications of the ACM, 39(11):49-50, November 1996.
[INMO97]	W, H. Inmon. "Are multiple data warehouses too much of a good thing?" Datamation, April 1997. [Retrieved from http://www.Datamation.com/PlugIn/dwhouse/stories/good.htm]
[JONE95]	Capers Jones. "What goes into an information warehouse?" Computer, August 1995.
[MATT96]	Rob Mattison. Data Warehousing: Strategies, Technologies, and Techniques. McGraw-Hilt, New York, 1996.
[RADE96]	Neil Raden. "Warehouses and the WEB," Infoweek, May 13, 1996. [Retrieved from http://techweb.cmp.com/iw/579/Z9olweb.htm]
[WIDO95]	Jennifer Widom. "Research Problems in Data Warehousing," Proceedings of 4th International Conference on Information and Knowledge Management (CIKM), November 1995. [Retrieved from http://www-db.stanford.edu/warehousing]
[WIEN96]	Janet L. Wiener, Himanshu Gupta, Wilburt J. Labio, Yue Zhuge, Hector Garcia-Molina, Jennifer Widom. "A System Prototype for Warehouse View Maintenance," Proceedings of the ACM Workshop on Materialized Views: Techniques and Applications, Montreal, Canada, June 7, 1996, pp26-33. [Retrieved from http://www-db.stanford.edu/warehousing/pubtications.html]
[YANG97]	Jun Yang and Jennifer Widom. "Maintaining Temporal Views Over Non-Historical Information Sources for Data Warehousing," Technical note, 1997. [Retrieved from http://www-db.stanford.edu/warehousing/pubtications.html]
[ZORN96]	Aaron Zornes. "A Taxonomy of Corporate Data Warehouses". [Retrieved from http://www.dciexpo.com/speakers/archive/zornes.html]