Data warehousing from end to end
The most visible and perhaps the most interesting part of a data warehouse is the "front end," where the data meets the user at the interface of an online analytical processing (OLAP) or data mining tool. That’s where business users hope to find meaningful patterns that nudge them toward good decisions. But a lot goes on behind the scenes to get clean, reliable data to the user. Unless that part of the process is done correctly, valid business decisions cannot be made.
The development of a data warehouse begins with two initiatives: analyzing business needs and evaluating data sources. With any luck, the two will match up-an enterprise will have the data it needs to support its business decisions. Chances are, however, that the data itself will need some significant cleanup. Data being loaded into a warehouse generally comes from a variety of sources that evolved independently, and needs to be made consistent before it can be effectively used.
The process of taking data from source databases, cleaning it up and putting it into the target databases is called extraction, transformation and load (ETL). A number of companies, including Ardent (www.ardentsoftware.com) and its recent acquisition, Prism Solutions (www.prismsolutions.com), have products that specialize in doing just that. Each ETL product has its strengths. Acta Technology (www.acta.com), for example, markets ActaLink, which is tightly integrated with SAP’s (www.sap.com) application layer. D2K’s (www.d2k.com) Tapestry automates ETL for SAP and PeopleSoft (www.peoplesoft.com) databases.
"A typical problem in developing data warehouses is the presence of inconsistent formats for customer information across the enterprise," said Pam Whitmore, director of marketing for the Informix Data Warehouse Division (www. informix.com). In some cases, customers might have a middle initial entered or a job title, and in other cases, they won’t. When the data is integrated into a single repository, each instance will emerge as a separate record unless the data is cleaned up first.
"A more complex issue," Whitmore added, "is how revenue is calculated." If sales tracks it one way and accounting another, the data will not be consistent. Informix, which recently acquired Red Brick, has end-to-end solutions that go from the extraction process through the creation of data warehouses or data marts and include sophisticated analysis and reporting components.
The cleanup process can be automated, but only after rules for doing so are developed. The rules, which need to be carefully thought out, are then incorporated into the transformation process.
"Initially there may be a large table of exceptions," said Prism Solutions’ product manager Hackey Weinberger, "but as time goes on and rules are written to resolve them, the number of exceptions should drop." A side benefit is that corporations can catch errors in operational systems by setting up data warehouses that spot inconsistencies.
Metadata is captured when the data is loaded into the target data base. It is critical to ensuring meaningful analyses. The user should be able to see, for example, where the data came from, how often it is refreshed or what the underlying calculations are for a field. Being able to view the metadata helps the user make decisions about doing analyses and how valid the analyses are.
"If a monthly update will be available in two days," said Weinberger, "the user should be able to find this out." Thus the data warehouse is supported by a secondary database of information about its own data.
During the process of loading the data warehouse, the data must be aggregated or grouped in order to make it comprehensible to the analytical tools. Usually the data is aggregated using SQL statements. As the number of statements increases, though, the loading efficiency decreases exponentially. Constellar (www.constellar.com) offers a product called WarehouseBuilder that works with Constellar Hub, an application integration product, to dramatically improve loading efficiency by aggregating the data in a single-pass process.
Databases come into play from two directions-as the source of the data and as the target for the data warehouse. Sources can include anything from MS Access to DB2 databases; typical targets are relational databases such as Oracle (www.oracle.com) or Sybase (www.sybase.com). The data may end up in either a data warehouse or a data mart. Data marts are often portrayed as small data warehouses, but in fact they can be very large. The distinction is more one of focus-the data mart is designed for departmental use, and eliminates fields that are irrelevant so that users are not swamped by options they do not need.
A growing number of front-end tools can be used to make the most of all that data. The most common are OLAP tools that allow both report writing and exploration. DSS Agent, part of a suite of DSS products from MicroStrategy (www. strategy.com), is a widely used OLAP tool. Brio Technology (www. brio.com), Business Objects (www.businessobjects .com) and Cognos (www.cognos.com) also produce OLAP tools.
Data mining tools provide another way of analyzing information in data warehouses. The term "data mining," often misused to describe the general process of extracting meaningful business information from a database, has a more specific meaning. Typically, data mining tools are employing higher level analytical processes, such as statistical analysis and pattern analysis, to reveal unknown but important relationships among the data.
Dr. Kamran Parsaye, president of Information Discovery (www.datamining.com), puts it this way: "Our philosophy is that rather than picking the grapes and crushing them himself, the user should be able to reach for a bottle of wine." Information Discovery produces the Pattern Warehouse, which works directly on large SQL repositories without requiring sampling or extract files, and distills all patterns by itself.
With all the different components-ETL tools, databases, and analysis tools-integration can become a significant issue. One route is to work with a vendor such as Informix that provides a fully integrated solution. Companies are also acquiring complementary products, such as the purchase of Platinum Technology (www.platinum.com) by Computer Associates (www.cai.com) to match up Platinum’s data warehousing tools with CA’s object-oriented database. MicroStrategy, which produces a suite of decision support software tools, has developed a partnership program with leading vendors of ETL software to integrate metadata with its DSS Agent interface. Expect more action in that area in coming months.
Data warehouses and data marts are valuable tools for many types of businesses and are particularly well suited to those where customer analysis is critical.