As I was writing the articles on handling Slowly Changing Dimensions with T-SQL Merge, I realized that some background on Dimensions may be necessary first. Astute readers may remember that I briefly covered Dimension table basics a few years ago, but we need to go into a bit more detail on how to build them and how their structure supports Slowly Changing Dimensions.
What is a Dimension?
In general, Dimensions are tables in your data warehouse, composed of wide and descriptive columns, which give your data meaning. Some in the industry refer to the columns of a Dimension Table as “Dimensions” as well. The entities described by Dimension Tables are often physical in nature such as Customers, Patients, Products, Stores, etc. Identifying Dimensions and their attributes is easy when you know the trick. When you were in school, did you learn to do Word or Story Problems in Math class? I was taught that an important step in doing so was to translate the question into math operations by recognizing and translating key words.
Luckily spotting Dimensions is even simpler. Consider this story problem that your boss may give you:
“I want to know net sales in dollars by Product, Territory, and Month.”
When you hear this, you should immediately look for words like “by” or “per.” In this case, Product, Territory, and Month are Dimensions or attributes of them. Easy huh?
Remember that the primary driving force behind Data Warehouse initiatives is to simplify data analysis. In general, the fewer tables we have in our Data Warehouse, the easier analysis will be. Therefore, we combine similar attributes into Dimensional tables which tend to be denormalized and flattened. In the Story Problem above, you might have the following Dimension Tables.
- DimProduct, which may be composed of fields like Product Name, Product Category, etc.
- DimSalesTerritory, which may have fields like Territory Name, Territory Code, etc.
- DimCalendar, the most common Dimension, which of course would have fields like Month Number, Month Name, Quarter, Year, Day Of Year, Day Of Week Name, etc.
Dimension Table Structure
Almost without exception, every Dimension should have a primary key which has no meaning, essentially a numbering column. The users users of a Data Warehouse won’t care about these Primary Keys, and in fact may never even see them. However, something like a Customer Number attribute has meaning, and points back to a specific customer. These meaningless keys, called Surrogate Keys, insulate the data warehouse from any prior business systems, they are typically an integer data type, and typically assigned by SQL Server through the use of Identity property. Basically as each new record is added to a Dimension, the next number is assigned to it. Dimensions join to related Fact tables, where the dollar amounts and other measures are kept, via the Surrogate Keys. These table groups form what is called a Star Schema which you can see below.
So, in more detail Dimensions need to use Surrogate Keys for the following reasons.
- The source database(s) may recycle its keys. If they do so, the uniqueness of your dimension primary key is violated.
- Multiple sources for the same dimensional entity likely have different keys. If you extract Customer information from a Customer Relationship Management system, Shipping Software, and your Billing System, an additional key field is necessary to unify it.
- If you acquire another company, which has it’s own set of keys, how could you incorporate them into your data warehouse without Surrogate Keys?
- Suppose one of your attributes is a ship date and that isn’t always known the first time a record is loaded? Surrogate keys allow a special value to be substituted instead like “Not Available” rather than a null.
- Surrogate Keys are compact which saves space, especially with indexes. To that end, you should be using the smallest integer datatype appropriate for these keys.
- Surrogate Keys facilitate easier tracking of historical changes. More on this in a bit.
In addition to the Surrogate Key, a Dimension will have one or more Business Keys, sometimes also referred to as Alternate Keys. These are fields, usually numeric or alpha-numeric (codes), from the source system(s) which identify each entity. Some Dimensional Modelers indicate the type of key in the field name. So, in a Customer Dimension you may see fields named Customer_SK (Surrogate Key), Customer_BK (Business Key, or Customer_AK (Alternate Key). Personally, I’m a big proponent of using naming conventions religiously. Unless I’m forced to deviate, my standard method is to name the Dimension DimCustomer, the surrogate key becomes CustomerKey, and the business keys are named exactly as the user understands them. It could be CustomerId, CustomerNumber, etc. One reason I do this is I am unlikely to run into a source which uses the exact naming convention which I use for my surrogates, and I dislike having to answer the same questions over and over again. “Umm… what does Customer_AK mean again?” I would urge you to follow a standard, but as long as you’re consistent feel free to create your own conventions.
Slowly Changing Dimension Maintenance Columns
Most Dimensions will have column(s) which handle SCD Changes. The columns you’ll typically find are some variation of RowisCurrent, RowStartDate, and RowEndDate. Not everyone uses the same fields, and some only use a subset of them. Jamie Thomson (Blog/Twitter) makes a very well reasoned argument for only using RowStartDate in his “Debunking Kimball Effective Dates” Part One and Part Two. I’ll weigh in on this in a future article, but for the sake of this discussion, I will be using all three.
- RowIsCurrent – In my implementation this a Char(1) field which contains ‘Y’ or ‘N’. It’s also common for modelers to use a bit (1 or 0) for this same purpose. This simply indicates if a particular record is the Current or Active record for a particular entity.
- RowStartDate – This is some form of a Date/DateTime column which indicates when the record first became current.
- RowEndDate – This is some form of a Date/DateTime column which indicates when the record was no longer current or active. The row was expired at this point. For the active records, some modelers leave RowEndDate NULL. I choose to assign it to the maximum date for the data type. So, if I use datetime, I may assign this field as ’12/31/9999′.
Further, it should be noted that some prefer when expiring a record to set the RowEndDate to a tiny increment less than RowStartDate for the next record for this entity. They do this because they would prefer to write “Between” statements in their T-SQL. However, I have found that maintaining such a structure is problematic. Therefore, I almost always set the expired record’s RowEndDate and new record’s RowStartDate to be equal.
General Indexing Strategies
There is a lot of conflicting information on the topic of how to structure and index Dimension Tables. The most basic disagreement is whether the Primary Key of a Dimension should be clustered or non-clustered. In most examples, you’ll find the Primary Key being clustered as in Microsoft’s own AdventureWorksDW. Further, in their definitive work,
The Microsoft Data Warehouse Toolkit: With SQL Server 2008 R2, the Kimball group espouses the same practice.
Also, Microsoft’s own Customer Advisory Team (SQL CAT) indicates that Dimensions should have clustered Primary Keys.
I tend to deviate and create a clustered index on the business key(s), which means the Primary Key must be non-clustered. The following articles available on the web explain why this is a good idea with SQL Server, and my own personal testing supports this practice.
Regardless of which method you choose, be sure to index on the Business Key(s) and Slowly Changing Dimension columns.
Why do Dimensions Slowly Change?
One of the primary problems with source systems is that many of them don’t track history. For example, if a customer moves, many of these systems simply over-write the old address information in place. If you were reporting from this database, then any sales for this customer would also “move” to the new location, which may not be desirable. Slowly Changing Dimensions allow us to track history in the data warehouse. Since Dimension Tables join Fact Tables via the Surrogate key, we can have many records per business entity (customer in our case) with its history of Dimension Attributes.
It’s important to note that Dimension Tables themselves are rarely a specific type, rather we track the SCD type per Dimensional Attribute (column). Ralph Kimball, who many consider to be the father of Data Warehousing, creatively named the types of SCDs as Type 1, Type 2, etc.
- Type 0 – Fixed attributes that shouldn’t change like a Customer’s birth date.
- Type 1 – These columns don’t track history and simply over-write values in place when change is encountered.
- Type 2 – A change in these columns causes expiration of the current row and a new row to be added with the newly changed value.
Fear not dear reader, if you are new to Dimensional Modeling the handling of these types will make sense with the screen shots provided below. There are additional types of Slowly Changing Dimensions, but they are beyond the scope of this article. The following articles by The Kimball Group explain Slowly Changing Dimensions in more depth:
- Slowly Changing Dimensions – Part 1
- Slowly Changing Dimensions – Part 2
- Slowly Changing Dimensions are not Always as Simple as 1, 2, 3
So as you can see, our simplified Customer Dimension has only two attributes. CustomerName will be handled as Type 1 and Planet as Type 2. While our example is extremely simple, it is also “realistic.” In many real world systems, attributes like Customer Name are often Type 1; and Planet is analgous to a location or address, which are often processed as Type 2. Let’s see an example of SCD Changes in action. Suppose you had the following data in DimCustomer:
Be sure to note the following:
- If there had been several records for Obi-Wan, then every one of them would be updated to “Ben Kenobi”.
- Notice that the SCD fields, RowIsCurrent, RowStartDate, and RowEndDate are un-affected.
Simple huh? Well, let’s move to Type 2. In our example, we only have one Type 2 column which is planet. For our Type 2 change, let’s say that Yoda escapes to Degobah after failing to destroy Emperor Palpatine. I have no idea why Obi-Wan and Yoda didn’t attempt to destroy him together later since Vader was beaten and incapacitated, which left the entire galaxy to live in tyranny for almost 2 decades. And if Obi-Wan loved Anakin, why on earth did he allow him to burn to death (as far as he knew) rather than mercifully ending him? Maddening. Umm, sorry I digress. So, how do we process a Type 2 change like this? The basic steps are as follows:
- Find the current record for each entity (per business key) in the Dimension. Most often this is done by finding the record where RowisCurrent = ‘Y’ or some variation. However, In certain situations, you may need to find the current record by comparing the change date to RowStartDate and RowEndDate.
- If there are changes to any of the Type 2 columns for that current row, expire it. This is typically done by setting the RowIsCurrent value to ‘N’ as well as assigning the RowEndDate equal to the change date. Again remember what I said previously that there are differing opinions on this.
- Insert a record with a new Surrogate Key for that Dimension record and insert all of the new values into it. The RowEndDate will be ’12/31/9999′, NULL, or some other value you choose.
So, what does this look like? Well, we know that the current row for Yoda will have a Type 2 change since he’s moving planets. So, we need to expire that row:
Followed by the insertion of the new record for Yoda:
As you can see, Yoda still has the same CustomerNum (Business Key), but now he has 2 surrogate key values. To illustrate how important this is, consider our Dimension along-side a sample Fact table.
Yoda’s sales are the last three in that Fact table. Those with the Customer Key of 2, will be properly attributed to Coruscant, while the final will be attributed to Degobah. That way, even if the source system updated the Planet in place, you still retain history in your data warehouse.
In the next article, I’ll show you how to properly process both Type 1 and Type 2 Dimension changes using T-SQL Merge.