You unlock this door with the key of imagination. Beyond it is another dimension – a dimension of sound, a dimension of sight, a dimension of mind. You’re moving into a land of both shadow and substance, of things and ideas. You’ve just crossed over into the Twilight Zone.
Before we begin with Dimension and Fact tables, I would like to recommend that you use the Kimball Toolkit Data Modeling Spreadsheet when designing your dimensional model. In that download, you’ll receive a blank copy, as well as a copy of one they created for Adventure Works. The Spreadsheet actually generates the T-SQL script to create your data warehouse database using your specifications.
What is a Dimension?
A dimension is an entity involved in your business process which gives your data meaning. Remember the old days when you were confounded by those story problems your sadistic Math teacher foisted on you? Don’t worry, I’m not going to ask you about a train moving west at 45 mph… Our story problem (data request) is simpler. “I want to know what the order amount was by fiscal year, customer, and product category.”
The important words are obviously what and by, which is why I bolded them. What generally tips you off to your Facts (measurements) and by indicates your dimensions. The dimensions are the items in your group by statement. The statement above could be written in T-SQL as:
Select sum(invsales) where datefield.... group by Year, Customer, Product Category
Dimension tables are typically wide, verbose, and filled with descriptive detail fields called attributes. They should include as many details as is practical because you never know what a user will want to analyze by. Try to anticipate all required attributes up front, but add more as you go if need be. A Customer dimension for a business that sells to individual consumers. That dimension would typically include fields such as location, age, sex etc.
Dimension tables are denormalized which is difficult for some DBAs to accept. In an OLTP system the Customer entity attribute values may be divided among many tables such as address, credit information/rating, and demographics. During dimensional modeling we flatten those tables into as few tables as possible. Why do we do this? It’s easier for the user to understand and fewer joins (plus indexing) means much faster query response.
It is also important to be verbose in your descriptions. For example, in my current project there is an attribute called Product Class which is a two character code. Users have to memorize these numbers and what they represent because the source system separates the descriptions in another table. The descriptions should be used instead of the codes. Think of it from a new person’s point of view. Do you want to memorize that “01″ means “Ceiling Fan” or would you rather just see “Ceiling Fan?”
Long time users of the system may actually resist this change, so in my current project I provide both values in my dimension table which provides the ultimate in flexibility.
A dimension is conformed when it means the same thing for every fact table. Early on in the project the business must come to agreement of the definition of each dimension and what each attribute means. This can cause friction as well. For example, the Manufacturing Department of a company may have a different idea of Product Categories than the Sales Department.
Do NOT cave in. If you allow dimensions to have different meanings and attributes for different fact tables, you will lose the ability to drill across. This causes your different modules or data marts to be stove piped. Data from one may not be compared to another. This defeats the entire point of the data warehouse project.
The Primary Key of a dimension table should be a Surrogate Key. Surrogate keys are unique meaningless integers generated by the database server. Typically, the user never sees these keys, they are simply used by the data warehouse to link tables together. You should not use your OLTP system keys instead even if they are integers. People often protest and want to use their OLTP values for these keys, but you shouldn’t because surrogates insulate your data warehouse from outside forces. For example:
- The OLTP source may recycle its keys. If they do so, the uniqueness of your dimension primary key is violated.
- Multiple sources for the same dimensional entity likely have different keys.
- If you aquire another company, which has it’s own set of keys, how could you incorporate them into your data warehouse without surrogate keys?
- Suppose one of your attributes is a ship date and that isn’t always known the first time a record is loaded? Surrogate keys allow a special value to be substituted instead like “Not Available” rather than a null.
- Surrogate Keys are compact which saves space, especially with indexes. To that end, you should be using the smallest integer datatype appropriate for these keys. Also, integer keys offer faster joins than string types.
- Surrogate Keys facilitates easier tracking of historical changes. More on this in a bit.
Unusual Dimension Types
Date Dimensions are unique in that they typically use meaningful keys instead of surrogates. I’ll cover Date Dimensions/Tables in depth in an article in the near future.
Degenerate dimensions are actually a part of the fact table and are typically significant keys from your OLTP system like Sales Order Number or Customer Purchase Order Number.
A Junk Dimension is a catch all dimension for miscellaneous data or record indicators. For example, attributes like Ship Early, Ship Method, Order Status, Pay Type or other flags can be put into a Junk Dimension. This keeps your design from being cluttered with too many dimensions.
Role Playing Dimensions
Role Playing Dimensions are specialized copies (typically views) of a Conformed Dimension. The most common use of these is with Date Dimensions. If you need to track Order Date, Acknowledgement Date, Ship Date, and Invoice Date you could do so using views which show different versions of the Date Dimension.
Tracking History With Slowly Changing Dimensions
There are three types of Slowly Changing Dimensions (SCD). Ralph Kimball, being the genius that he is, creatively named them type 1, 2, and 3.
- Type 1 overwrites the attribute and ignores history.
- Type 2 creates a new dimension record when attribute changes and marks the previous row as inactive.
- Type 3 tracks the current attribute and the last previous attribute in separate columns.
It’s important to note that the attributes themselves are of each type, not the whole table. Combinations of SCD Types can be implemented as well. The easiest way to understand this is with an example. Let’s say that we have a very simple customer dimension like this:
Incidentally, the CustomerAK represents the Alternate Key or Business Key, in this case it’s the customer number from the OLTP source system. Let’s say that the Customer attribute Sex is a type one because if it is changes it is usually due to an order entry error. So, in the next extract, the value has been corrected for Jane Doe. The table would look like this:
Notice that Jane Doe did not get a new row, that record was updated in place. Now, let’s assume that City is a type 2 dimension and that I moved to Jacksonville. Why would I move to Jacksonville? Well, a certain well known BI consulting company is headquartered near there… Hint Hint. Anyway, after processing the the extract, the table would look like this:
So, I’ve been given a new surrogate key and my old record has been expired. Any new fact records after that time use the record currently active. Notice that we are ignoring time of day, and as long as the effective and end dates don’t overlap, I can use a simple T-SQL Between statement to easily query the data.
In a future article, I’ll cover in detail how to accomplish the slowly changing dimension processing using SQL Server Integration Services.
The next natural question is “who decides which columns are which of which SCD type?” Well, if you read yesterday’s article you can already guess the answer. The business folks decide which attributes are SCD 2. However, you must protect them from themselves. Typically if you ask them what attributes they’d like to keep history on, they’ll invariably answer “Everything.” This may not be feasible because if you add a new record for every single dimension change your data warehouse size may grow at a ridiculous rate. Inform them how much all of that history will cost them in hardware. It’s important to note that certain regulatory rules may force all changes to be type 2 regardless of cost.
Tomorrow, we’ll discuss the Facts. “Just the facts, ma’am.” Yeah, I know the quote was lame, but I couldn’t help myself.