Developing a Data Warehouse Architecture

 


By Laura Hadley

A data warehouse architecture is a description of the elements and services of the warehouse, with details showing how the components will fit together and how the system will grow over time. There is always an architecture, either ad hoc or planned, but experience shows that planned architectures have a better chance of succeeding.

“Every data warehouse has an architecture,” says Warren Thornthwaite, a partner with Menlo Park, CA-based InfoDynamics LLC. “It's either ad hoc or planned; implied or documented. Unfortunately, many warehouses are developed without an explicit architectural plan, which severely limits flexibility.” Without architecture, subject areas don't fit together, connections lead to nowhere, and the whole warehouse is difficult to manage and change. In addition, although it might not seem important, the architecture of a data warehouse becomes the framework for product selection.

Thornthwaite compares the development of a data warehouse to building a real house. “But how do you build a $3 million mansion, let alone a $100,000 house?” You do it with blueprints, he says—the drawings, specifications, and standards showing how the house will be constructed, at multiple levels of detail. Of course, there are different versions of the blueprint for various subsystems of the house, such as plumbing, electrical, HVAC, communications, and vacuum. There's also standards that all homes follow, including plugs, lights, plumbing fixtures, door sizes, etc.

For data warehousing, the architecture is a description of the elements and services of the warehouse, with details showing how the components will fit together and how the system will grow over time. Like the house analogy, the warehouse architecture is a set of documents, plans, models, drawings, and specifications, with separate sections for each key component area and enough detail to allow their implementation by skilled professionals.

“This is not a requirements document,” Thornthwaite points out. “The requirements document says what the architecture needs to do. The architecture also isn't a project plan or task list; it's the what, not the how or why.”

It's also not easy, he says, because we've only been developing data warehouse systems for 15 years, versus 5,000 years for building homes. Therefore we have fewer standards, the tools and techniques are rapidly evolving, there is little documentation of what systems we already have, and data warehouse terminology is extremely loose.

So while developing an architecture is difficult, it is possible—and it's critical. First and foremost, he says, the architecture has to be driven by the business. If your requirement is to have nightly updates, this has implications for the architecture, and you must understand the technical requirements to achieve what you want to have. Thornthwaite gives a few business requirement examples, and the general technical considerations for each:

  • Nightly updates - adequate staging horsepower.
  • Worldwide availability - parallel or distributed servers.
  • Customer-level analysis - [large] server size.
  • New data sources - flexible tools with support for meta data.
  • Reliability - job control features.

Key Component Areas

A complete data warehouse architecture includes data and technical elements. Thornthwaite breaks down the architecture into three broad areas. The first, data architecture, is centered on business processes. The next area, infrastructure, includes hardware, networking, operating systems, and desktop machines. Finally, the technical area encompasses the decision-making technologies that will be needed by the users, as well as their supporting structures. These areas are detailed in the sub-sections below.

Data Architecture

As stated above, the data architecture portion of the overall data warehouse architecture is driven by business processes. For example, in a manufacturing environment the data model might include orders, shipping, and billing. Each area draws on a different set of dimensions. But where dimensions intersect in the data model the definitions have to be the same—the same customer who buys is the same that builds. So data items should have a common structure and content, and involve a single process to create and maintain.

Thornthwaite says that organizations often ask how data should be represented in the warehouse—entity/relationship or dimensional? “If you have a star schema1 then use dimensional. Is your detail normalized2 or dimensional? Will users be querying detail? Then use dimensional.” He adds that most data warehousing experts are in substantial agreement; the [data] sources are typically entity/relationship models and the front end is a dimensional model. The only issue is where you draw the line between the warehouse itself and the data staging area.

As you work through the architecture and present data to your users, tool choices will be made, but many choices will disappear as the requirements are set. For example, he explains that product capabilities are beginning to merge, like MOLAP and ROLAP. “MOLAP is okay if you stay within the cube you've built. It's fast and allows for flexible querying—within the confines of the cube.” Its weaknesses are size (overall and within a dimension), design constraints (limited by the cube structure), and the need for a proprietary data base.

Infrastructure Architecture

With the required hardware platform and boxes, sometimes the data warehouse becomes its own IS shop. Indeed, there are lots of “boxes” in data warehousing, mostly used for data bases and application servers.

The issues with hardware and DBMS choices are size, scalability, and flexibility. In about 80 percent of data warehousing projects this isn't difficult; most businesses can get enough power to handle their needs.

In terms of the network, check the data sources, the warehouse staging area, and everything in between to ensure there's enough bandwidth to move data around. On the desktop, run the tools and actually get some data through them to determine if there's enough power for retrieval. Sometimes the problem is simply with the machine, and the desktops must be powerful enough to run current-generation access tools. Also, don't forget to implement a software distribution mechanism.

Technical Architecture

The technical architecture is driven by the meta data catalog. “Everything should be meta data-driven,” says Thornthwaite. “The services should draw the needed parameters from tables, rather than hard-coding them.” An important component of technical architecture is the data staging process, which covers five major areas:

  • Extract - data comes from multiple sources and is of multiple types. Data compression and encryption handling must be considered at this area, if it applies.
  • Transform - data transformation includes surrogate key management, integration, de-normalization, cleansing, conversion, aggregation, and auditing.
  • Load - loading is often done to multiple targets, with load optimization and support for the entire load cycle.
  • Security - administrator access and data encryption policies.
  • Job control - this includes job definition, job scheduling (time and event), monitoring, logging, exception handling, error handling, and notification.

The staging box needs to be able to extract data from multiple sources, like MVS, Oracle, VM, and others, so be specific when you choose your products. It must handle data compression and encryption, transformation, loading (possibly to multiple targets), and security (at the front end this is challenging, Thornthwaite says). In addition, the staging activities need to be automated. Many vendors' offerings do different things, so he advises that most organizations will need to use multiple products.

A system for monitoring data warehouse use is valuable for capturing queries and tracking usage, and performance tuning is also helpful. Performance optimization includes cost estimation through a “governor” tool, and should include ad hoc query scheduling. Middleware can provide query management services. Tools for all of these and other related tasks are available for the front end, for server-based query management, and for data from multiple sources. Tools are also available for reporting, connectivity, and infrastructure management. Finally, the data access piece should include reporting services (such as publish and subscribe), a report library, a scheduler, and a distribution manager.

A Word About Meta Data

The creation and management of data has the following “steps” in the data warehouse process:

  1. warehouse model
  2. source definitions
  3. table definitions
  4. source-to-target maps
  5. map and transformation information
  6. physical information (table spaces, etc.)
  7. extracted data
  8. transformed data
  9. load statistics
  10. business descriptions
  11. query requests
  12. the data itself
  13. query statistics

To show how important meta data is, of the steps listed above only three involve “real” data—7, 8, and 12. “Everything else is meta data,” says Thornthwaite, “and the whole data warehouse process relies on it.” The major technical elements of a meta data catalog include:

  • Business rules - includes definitions, derivations, related items, validation, and hierarchy information (versions, dates, etc.).
  • Movement/transformation information - source/destination information, as well as DDL (data types, names, etc.).
  • Operations information - data load job schedules, dependencies, notification, and reliability information (such as host redirects and load balancing).
  • Tool-specific information - graphic display information and special function support.
  • Security rules - authentication and authorization.

Developing an Architecture

When you develop the technical architecture model, draft the architecture requirements document first. Next to each business requirement write down its architecture implications. Group these implications according to architecture areas (remote access, staging, data access tools, etc.) Understand how it fits in with the other areas. Capture the definition of the area and its contents. Then refine and document the model.

Thornthwaite recognizes that developing a data warehouse architecture is difficult, and thus warns against using a “just do it” approach, which he also calls “architecture lite.” But the Zachman framework is more than what most organizations need for data warehousing, so he recommends a reasonable compromise consisting of a four-layer process: business requirements, technical architecture, standards, and products.

Business requirements essentially drive the architecture, so talk to business managers, analysts, and power users. From your interviews look for major business issues, as well as indicators of business strategy, direction, frustrations, business processes, timing, availability, and performance expectations. Document everything well.

From an IT perspective, talk to existing data warehouse/DSS support staff, OLTP application groups, and DBAs; as well as networking, OS, and desktop support staff. Also speak with architecture and planning professionals. Here you want to get their opinions on data warehousing considerations from the IT viewpoint. Learn if there are existing architecture documents, IT principles, standards statements, organizational power centers, etc.

Not many standards exist for data warehousing, but there are standards for a lot of the components. The following are some to keep in mind:

  • Middleware - ODBC, OLE, OLE DB, DCE, ORBs, and JDBC.
  • Data base connectivity - ODBC, JDBC, OLE DB, and others.
  • Data management - ANSI SQL and FTP.
  • Network access - DCE, DNS, and LDAP.

Regardless of what standards they support, major data warehousing tools are meta data-driven. However, they don't often share meta data with each other and vary in terms of openness. “So research and shop for tools carefully,” advises Thornthwaite. “The architecture is your guide. And use IT advisory firms, like GartnerGroup, META Group, Giga, etc.”

How detailed does a data warehouse architecture need to be? The question to ask is this: Is this enough information to allow a competent team to build a warehouse that meets the needs of the business? As for how long it will take, the architecture effort will grow exponentially as more people are added for its development (i.e., it becomes “techno-politically complex”), and more complex the resulting system needs to be (i.e., “functionally complex”).

Like almost everything in data warehousing, an iterative process is best. You can't do it all at once because it's too big—and the business won't wait. Also, Thornthwaite says, the data warehouse market isn't complete yet. So begin with high leverage, high-value parts of the process. Then, use your success to make a case for additional phases.

Conclusions

To sum up, the benefits of having a data warehouse architecture are as follows:

  • Provides an organizing framework - the architecture draws the lines on the map in terms of what the individual components are, how they fit together, who owns what parts, and priorities.
  • Improved flexibility and maintenance - allows you to quickly add new data sources, interface standards allow plug and play, and the model and meta data allow impact analysis and single-point changes.
  • Faster development and reuse - warehouse developers are better able to understand the data warehouse process, data base contents, and business rules more quickly.
  • Management and communications tool - define and communicate direction and scope to set expectations, identify roles and responsibilities, and communicate requirements to vendors.
  • Coordinate parallel efforts - multiple, relatively independent efforts have a chance to converge successfully. Also, data marts without architecture become the stovepipes of tomorrow.

Thornthwaite recommends that companies align with business requirements but to be practical. He also emphasizes the importance of keeping up with advances in the data warehouse industry. Finally, remember that there is always an architecture: implicit or explicit, “almost in time” or planned. Experience shows that the planned and explicit ones have a better chance of succeeding.

Source

“From Bauhaus to warehouse: Understanding data warehouse architecture requirements,” presented by Warren Thornthwaite at DCI's Data Warehouse Summit, held December 8-10, 1998 in Phoenix, AZ.

References

1. Star schema designs produce a large fact table and many, smaller dimension tables which extend the different aspects of the facts. By processing the dimension tables first, fewer of the detailed records in the fact table need to be scanned to complete the query. (From Interactive Data Warehousing, by Harry Singh. Prentice-Hall, 1999, p.163.) Back

2. Normalization is the process of removing all model structures that provide multiple ways to know the same fact; a method of controlling and eliminating redundancy in data storage. (from Interactive Data Warehousing, by Harry Singh. Prentice-Hall, 1999, p.444.) Back


   Go to:

      Data Warehouse Quality Management | Go to the top of this page

This site was designed and authored by Laura
Site last updated 2002