There isn’t a lot of information on the inter webs about designing the historical load of a Data Warehouse. Consequently I see a lot of confusion on the subject. Before we go farther down the Urge to Merge Series, I thought it would be beneficial to define the concept with some visual examples.
The Initial/Historical Load
First, let me clarify what I mean. When I mention historical loading, what often comes to mind is processing Slowly Changing Dimensions going forward. While that’s related to this discussion, they aren’t really synonymous. What I’m talking about is loading the dimension records, with full slowly changing dimension support if possible, going backwards in time. I’m referring to the Initial load, rather than subsequent incremental loads. A great deal of development time can be dedicated to this one-time load because getting the history is so important. After all, who wants to invest so many resources into a Data Warehouse that won’t yield much value for months or years until “new” history can be accumulated? So, where can we find this history?
History or Audit Tables
Well, the first and most obvious place to look is in Historical or Audit tables. Often someone has already created a method to track history on important source tables, perhaps with SQL triggers, because this need has already been identified. If we’re fortunate, they are tracking changes on every column designated as Type 2 and we can simply query those tables. I’ll go into more depth in the next article on how to query the source table(s) using T-SQL.
Derived from Source Transaction Tables
Often, history has to be derived from related transaction tables. For example, let’s assume that our source system keeps no customer history at all, but you do want to be able to track sales (and other Facts) as accurately as possible. In our simple example, our first customer table may simply look like this:
Notice that all we have is the last version of the Customer Name and Planet. Well, if we want to be able to historically track the address (Planet in our example), where might we find this? Well, obvious places to look would be in invoice or shipping tables. We could simply join these two tables, possibly with a left join, to get a data set similar to our history table.
Remember, in the grand scheme it’s important to create this history in your dimensions and not rely on your FactShipments table for the information. For example, let’s say that this Jedi Demo Data Warehouse is for a retail business. You may want to track your vendor purchasing details (for resale), Profit Amounts (from Invoicing), etc., per customer Planet. If you’ve loaded DimCustomer historically, then this history can be applied to other areas of the enterprise.
This method should be considered a last resort. If your client really needs certain history, and there just isn’t any way to get it from the source system, you may be able to restore previous versions of the database(s) and perform an ETL load after each. I worked on one system where we loaded an end of month database backup for the previous two years. These 24 restores were done from tape, and were extremely tedious. However, we got the job done and the client was satisfied.
Considerations and Complications
What happens if only certain columns designated as Type 2 can be obtained historically? Suppose you simply can’t get historical information for Planet, but want to track it going forward? The implications of this must be explained in detail to the client team and ultimately they make the decision. Over time, the issue becomes less and less important because you are accurately keeping history going forward and those “faulty” records may be queried less and less over time.
Also, many Data Warehouse implementations are rushed due to business demands and such. Invariably this urgency causes corners to be cut and it’s possible that some dimensional model changes will require a reload of certain sections of your warehouse. Or sometime after the Data Warehouse is in production, the client decides that additional dimension attributes need to be tracked historically. If the Data Warehouse kept certain information historically as of the “go live” date, and you have to completely reload certain tables, this creates a difficult situation. For more information, I’ll refer you to Warren Thornthwaite’s excellent article Creating Historical Dimension Rows.
In the next two articles I’ll cover the following:
- How to write queries to selectively extract records on the Initial/Historical Load.
- How to use T-SQL Merge to process these loads.