Come See Me Present at SQL Saturday Chicago and Dallas

I’ll be presenting at two SQL Saturdays back to back. You are taking advantage of SQL Saturday events, right?

SQL Saturday Chicago is this weekend (3/26), and I’ll be presenting:

Data Warehousing – How to Convince “The Bobs”

I got a meeting with the Bobs in a couple of minutes...

Building your first Data Warehouse is a long, and often difficult process. How can you get your boss to approve a Data Warehouse project? What’s the best way to explain dimensional modeling and the benefits of a Data Warehouse to a business person? What are the best/most cost effective ways of learning it? What kind of materials, hardware, software, etc do you need? What’s the best way to build a proof of concept that will impress your boss, as rapidly as possible?

Come to this presentation, and I’ll answer all of these questions plus the most important question of all.

“Is this good for the company?” Absolutely.

The following Saturday (4/2) my home group, the North Texas SQL Server User Group, will be hosting our 3rd SQL Saturday. I’ll be presenting the following:

T-SQL Techniques to Survive Zombie Databases

Do you have a Zombie Database? Does your database suffer from roaming hordes of problems like the lack of primary and foreign keys, poor normalization, in place updates and deletes, and an insatiable drive to eat your braaaaaaaaaaaaaains (CPU cycles).

Fear not fellow survivor for in this session, I’ll show you how to fortify your code against zombie infestation. How to deal with the unknown (nulls), how to Restore Order by Joining together in Groups. Flak Jackets are Optional, but encouraged.

So, participate in SQL Saturdays. Present, volunteer, or just attend. They’re excellent, free sources of SQL Education.

Isn't it Ironic? Don't You Think?

Anyone who reads this blog should know that I’m a fan of Todd McDermid and his free Kimball Slowly Changing Dimension Transform. For those of you who’ve not been following his blog, he was contacted by the Kimball Group and informed that he needed to stop using Kimball’s name for his free transform, the Kimball Method Slowly Changing Dimension Transform, even though it has existed with that name for years.

Todd was classy about it, had a contest to rename it, and came up with Dimension Merge SCD. Since then, another pillar of the SQL Community, Brian Knight (Blog/Twitter), has picked up the ball and is integrating the transform into Pragmatic Works Task Factory.

Anyway, last week I received the new The Microsoft Data Warehouse Toolkit: With SQL Server 2008 R2by the Kimball Group and ripped through the book at a blistering pace. I hit page 222 by the next day and this is what I found:

The two most popular third-party tools for replacing the Microsoft SCD transform are:

Kimball Method SCD Transform, available for free download from and developed by Todd McDermid. This transform offers more functionality than the Microsoft SCD transform, and performs much better. Edits are not destructive to downstream elements. This transform is not associated with the Kimball Group, though the developer followed Kimball published best practices in his design.

Um… what?

I have great respect for everyone involved including Ralph Kimball, that’s why I buy, read, and recommend his books. However, the fact that the request was made at nearly the same time this new book was published was too much to pass up.

Drawing a Blank. Dealing With Blank Values in SQL Server using NullIf and Coalesce

“Melodical Murray is a human musical instrument; he makes music by blowing his…blank” – Match Game

In the previous article I discussed the issues when working with Null-able fields in Joins and Where clauses. This article will deal with the problems of blank values.

What About Blanks?

Many systems forego the use of Null-able fields, but instead use blank values. In fact, blank values are a bigger problem in M2M than Nulls. A common problem is with M2M Product Class values. For those that don’t use M2M, Product Class is similar to a Product Category that happens to be tied to general ledger accounts. Unfortunately, not every record in sales and other tables has a Product Class value and the value is blank. If you weren’t aware of this, run the following query against your M2M Database.

SELECT DISTINCT ARI.fprodcl ProductClass
FROM   dbo.aritem ARI

The M2M Educational Database returns the following:

In order to demonstrate this for those who don’t use M2M, I’ll run the following against the Kung Fu Database.

UPDATE dbo.OrderDetails
SET    ItemDescription = ''
WHERE  orderid = 4
FROM dbo.OrderDetails OD

So, how can we deal with the blank value (and the Null value as well?) Well, we could use a Case statement.

         WHEN OD.ItemDescription IS NULL THEN '..Null None'
         WHEN OD.ItemDescription = '' THEN '..Blank None'
         ELSE OD.ItemDescription
         AS ItemDescription
  FROM dbo.OrderDetails OD

Noticed that I assigned the value to Null and blank separately so you could see that it worked. Why did I use two periods before the text? I’ll explain that in a minute. The Case When works, but it’s wordy. There’s a much easier way to deal with both Blanks and Nulls.

NullIf and Coalesce

Let’s start with the NULLIF function. NULLIF accepts two arguments and if the arguments are the same, then Null is returned, otherwise the first value is returned. Therefore, we can detect blanks with the following:

 -- Returns 'A'
-- Returns Null
Declare @Test Char(5)
SELECT NULLIF (@Test, '') 
-- Returns Null
-- I had to use a variable for the last test because the first argument must 
-- have a known data type or an error will result.

The second argument in this case is always ” (blank). Why? If the field value is blank, the end result will be Null. Also, if the first argument is Null, that will return as Null as well. Make sense?

A Coalesce statement can then be used to replace the resulting Null value like so:

SELECT COALESCE(NULLIF ('A', '') , '..None') 
-- Returns 'A'
SELECT COALESCE(NULLIF ('', '') , '..None') 
--Returns '..None' 
Declare @Test Char(5)
SELECT  COALESCE(NULLIF (@Test, '') , '..None') 
-- Returns '..None'

Let’s return to our original statement and substitute this method.

       coalesce (nullif (OD.ItemDescription, ''), '..None')
         AS ItemDescription
  FROM dbo.OrderDetails OD

I have this process programmed as a Toad Toad for SQL Server snippet. When I have time I’ll record a video showing how to use them. In this case, all I have to do is highlight the database field name, click the snippet, and Toad does the rest. So, why do I substitute text with leading periods? I do that primarily for generating available values in SSRS parameters.

When I go back and edit the first script (for M2M Product Classes) with this method, I get the following:

       ARI.fprodcl ProductClass
      ,coalesce (nullif (ARI.fprodcl, ''), '..None') FixedProductClass
  FROM dbo.aritem ARI
ORDER BY coalesce (nullif (ARI.fprodcl, ''), '..None')

Which would your users rather see in a SSRS parameter drop down?

It’s important to note that blank (”) is an actual value, which is in fact equal to another blank, and can be used to join as well. However, in the case of Made2Manage, you shouldn’t expect to join your tables to the Product Class Table (INProd) with an Inner Join because you’ll “lose” records. That’s why I always link to INProd with a Left Outer Join. Would you like to see the invoice records that you’d lose in an Inner Join? Run the following against your database to find out.

SELECT ARM.fbcompany AS BillToCompany
      ,ARI.fcinvoice AS InvoiceNo
      ,ARI.fitem AS ItemNo
      ,ARI.fprodcl AS ProductClass
      ,rtrim (ARI.fpartno) AS PartNo
      ,ARI.frev AS PartRev
      ,ARI.ftotprice AS NetExtInvoiced
  FROM   dbo.armast ARM
         dbo.aritem ARI
 WHERE ARM.fcstatus <> 'V' AND
       coalesce (nullif (ARI.fprodcl, ''), '..None') = '..None' -- Returns Blanks and Nulls

In the M2M Educational Database, 19 records are returned.

So, in these two articles, I’ve demonstrated the problems of dealing with Nulls and Blanks in your database and how to deal with them. In the future, I’ll follow up with articles on creating and using Date tables and other more advanced T-SQL techniques.

Stay tuned!

I Don’t Know! Deal With It! (Nulls)

This topic has been done to death in many places, but I still get frequent questions about Null field issues in T-SQL. My purpose in this article is not to debate the definition of Null. Smarter men than I have already done so.

SQL Server Books Online indicates “A value of NULL indicates that the value is unknown.” However, some disagree, most recently Joe Webb – Null is not Unknown. Some argue that there should be several different types of Null.

Why Ask Why?

As I said before, I don’t really care about the various interpretations of the Null value. I just know that they’re a pain to deal with. I have never received questions like this:

  • What exactly is a Null?
  • What’s the definition of Null?

The questions I get follow along these lines.

  • Why am I losing records in this join?
  • Why is my total (or average) off in this query?
  • Why does my SSRS parameter list look funny?
  • Why won’t my parameters default like they should?

So dear reader, as part of the T-SQL Basics Series, I am going to share what you need to know about dealing with Nulls rather than defining them. If you’d like to follow along, please download the Kung Fu Database Script.

Where Clause

To start with, while people disagree about the definition of Null, everyone agrees what Null is not. It is not any of the following:

  • 0
  • blank
  • ” (Empty String)
  • 1/1/1900

When most of us look at a Where clause, we only consider two options – True and False. However, SQL Server uses three value logic, True, False, and Unknown. Since Null represents an Unknown value any comparison to it results in Unknown, not True or False. Consider the following pseudo-code:

Select * from Employees where Salary > 500

If the Salary field doesn’t allow nulls then it’s very easy to return the list of Employees which earn more than $500 per some time period. However, if Salary is a Nullable field, then any Employee without a Salary specification will not be returned. How could it be? You simply do not know whether they make more than $500 or not.

I find the easiest way for me to learn these concepts is to look at examples. Let’s start with the following base query from the Kung Fu Database.

SELECT OD.OrderID, OD.[LineNo], OD.ProductID, OD.ItemDescription, OD.OrderQty, OD.Discount, OD.ListPrice
FROM   dbo.OrderDetails OD

Null values are hightlighted.

Returns: 6 Records. This is the total data set.

Now consider the following:

SELECT OD.OrderID, OD.[LineNo], OD.ProductID, OD.ItemDescription, OD.OrderQty, OD.Discount, OD.ListPrice
FROM   dbo.OrderDetails OD
WHERE  OD.OrderQty >= 1

Returns: 4 Records. Those with Null Values for OrderQty are ignored.

SELECT OD.OrderID, OD.[LineNo], OD.ProductID, OD.ItemDescription, OD.OrderQty, OD.Discount, OD.ListPrice
FROM   dbo.OrderDetails OD
WHERE  OD.OrderQty < 1

Returns: 0 Records. Comparing any value to Null results in Unknown and those those records are not returned.

SELECT OD.OrderID, OD.[LineNo], OD.ProductID, OD.ItemDescription, OD.OrderQty, OD.Discount, OD.ListPrice
FROM   dbo.OrderDetails OD

Well, logically this should return 2 records, right? After all, you are asking for those records with Null values for OrderQty. However, the correct answer is 0 records. The reason is simple. Comparing any value to Null (even another Null) returns Unknown. This may seem confusing, but logically it makes sense. Consider my Salary example above. If Bob and Dave both have Null Salary values, do they earn the same amount? Obviously, there’s no way to know. Therefore, the query evaluates to Unknown for those records and they aren’t returned. So, how can we return the Null records?

SELECT OD.OrderID, OD.[LineNo], OD.ProductID, OD.ItemDescription, OD.OrderQty, OD.Discount, OD.ListPrice
FROM   dbo.OrderDetails OD

Returns: 2 Records. For the same reason, we could have used “IS NOT NULL” and 4 records would return.

So, what if we want to return any records with a quantity less than 1 and include Null Values? Well, you could write that query like this:

SELECT OD.OrderID, OD.[LineNo], OD.ProductID, OD.ItemDescription, OD.OrderQty, OD.Discount, OD.ListPrice
FROM   dbo.OrderDetails OD
WHERE  OD.OrderQty < 1 

Returns: 2 Records.

Replacing Null Values with ISNULL or COALESCE

So that works, but it’s wordy. How can we more easily deal with nulls? Well, we can replace them with an actual value by using COALESCE or ISNULL. Each function returns the first Non-Null value found in the list of arguments. For example, SELECT ISNULL(NULL, 25) returns 25. However, SELECT ISNULL(NULL, NULL) returns Null.

Some people prefer IsNULL because the name of the function is more understable to them and it has been shown to have a very slight performance advantage. However, I use Coalesce because IsNull is limited to 2 arguments, whereas COALESE doesn’t have any realistic limit. SELECT COALESCE(NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, 7) still returns 7. Remember that it returns the first non-null value. Therefore, SELECT COALESCE(NULL, 3, NULL, NULL, NULL, 7) only returns the first value which is 3.

So, you may use Coalesce to simply these comparisons as well as for aesthetic reasons. Consider the previous Kung Fu Database query re-written with Coalesce. I’ve formatted it vertically to emphasize the difference.

  dbo.OrderDetails OD
  COALESCE(OD.OrderQty,0) < 1

Who determines whether or not you should replace the field like this? For the most part, the business user. So, if the business user provides the proper substitution values, the query may be written like this:

 ,COALESCE(OD.ProductID, 0) ProductID
 ,COALESCE(OD.ItemDescription, 'None') ItemDescription
 ,COALESCE(OD.OrderQty, 0) OrderQty
 ,COALESCE(OD.Discount, 0) Discount
 ,COALESCE(OD.ListPrice, 0) ListPrice
FROM dbo.OrderDetails OD

Joining on NULL-able Columns

So, I’ve demonstrated how Null values can cause problems in the Where clause. For the same reasons, joining on Null-able columns is problematic as well. Consider my example Order Details table which allows a Null value for ProductID. Consider the following script:

    dbo.OrderDetails OD
    dbo.Products P
  ON P.ProductID = OD.ProductID

If that were written as your standard inner join, the recordset would lose the indicated record.

Grouping, Sorting, and Aggregates

In regards to Group By and Order By, Null values are considered equal. Meaning that all of the Null values are treated as one entity and are grouped and ordered as such.

Consider the following queries and result sets:

FROM   dbo.OrderDetails OD

SELECT   OD.OrderQty
FROM     dbo.OrderDetails OD

Aggregate Functions
Aggregate functions such as COUNT, SUM, and AVG ignore Null values. However, one can use COUNT(*) which will count all records. Consider the following query:

  SUM(OD.OrderQty) QtySum
  ,Avg(cast(OD.OrderQty as decimal (4,2))) NullAvg
  ,Avg(cast(COALESCE(OD.OrderQty,0) as decimal (4,2))) ZeroAvg 
 ,COUNT(OD.OrderQty) QtyCount
 ,COUNT(*) TotalCount
  dbo.OrderDetails OD

I had to Cast the quantities as decimals, because otherwise the values would have been rounded to 1. If you do the math, you’ll see that aggregate functions only considered the four records with non-null values. This is particularly noticeable when you compare the NullAvg and the ZeroAvg. The ZeroAvg actually divides the total by 7 instead of 4.

In this article, I’ve attempted through the use of many examples to describe the issues when working with Null values and how to deal with them. In the next article, I’ll discuss how to deal with blank values in Joins, Where clauses, and when creating parameter lists for SQL Server Reporting Services.

M-Data Analytics Questions

I’m sorry that my posts haven’t been as regular as I’d like, but the good news is that we’re making great progress with the M-Data Analytics open source project. In order to provide the best benefit to my fellow Made2Manage users, please answer the following questions.

What Version(s) of Made2Manage are you currently using?

View Results

Loading ... Loading ...

What version(s) of SQL Server are you currently using?

View Results

Loading ... Loading ...

How many live M2M Companies do you have?

View Results

Loading ... Loading ...

If you have more than 1 live company, which setup would you prefer?

View Results

Loading ... Loading ...

Are you running more than 1 version of M2M in production?

View Results

Loading ... Loading ...

When are you likely to purchase a new SQL Server?

View Results

Loading ... Loading ...

M-Data Analytics has the ability to notify separate Data Stewards regarding data problems for each business topic. For example, a designated sales person would be notified if sales records were not up to par giving them the ability to address those issues. Perhaps a new sales address contains an unrecognized value for State Name, etc. Would you actually utilize this feature?

Also, are any of you using naming conventions for your M2M databases other than the standard M2MDataXX where XX represents a two digit number? If so, how are you naming your M2M databases?

I really appreciate your input.

SQL University - Resistance is Futile

Note: This is the last in a series of five “lessons” which I created for SQL University on the topics of Dimensional Modeling and Data Warehousing. Click here to read the rest of the series.

I can haz assimilation?

We are Borg – Bringing it all together.

Now that we understand the basics of Dimension and Fact Tables, let’s learn how they work together. Once again, don’t forget that the Kimball Toolkit Data Modeling Spreadsheet is very helpful when creating these models.

The Star Schema

Let’s go back to the dimensional model Fact Table mentioned in the last article.

Remember that Fact Tables consist primarily of foreign keys and measures.

These foreign keys link to the surrogate primary keys of the Dimension Tables. When you link them all together, they might look like this.

So, what does that look like? The tables are generally laid out in the formation of a Star. That’s all a Star Schema really is.

Snowflake Schema

Let’s say that Product Category is related to the Product Dimension in a one to many relationship (or mostly so), and that it has many verbose attributes. The architect may decide to split it off from the Product Dimension. The resulting schema would look like this.

This is now a Snowflake Schema. By the way, check out the header of the third section of the linked wiki. Is the wording intentionally hilarious? Another common use for this kind of schema would be for a Customer Dimension and associated addresses. The location becomes a Dimension Table linked to DimCustomer. A general best practice is to use Star Schemas unless there is a very important reason to move to Snowflake as they are simpler to report from and load. If you use a Snowflake Schema, you may want to cover the tables with a view to simplify querying. In our case, we’d have a view for DimProduct and that would include the DimProductCategory and appear as one entity.

The Simple Life of Borgs

Here’s the payoff. Let’s assume that your boss wants to know the sales order amount by year and product. You can ignore all of the other dimensions.

The query is simple and fast because it involves 3 fields and 2 joins. Those joins are primary and foreign keys which means most query products recognize them automatically. Life is now simple.

Prepare to be Assimilated – Convincing Your Boss

So, I can hear you saying, “Alright Dave, you’ve convinced me, how can I get my boss to let me work on something like this?”

If you’ve been reading this series of articles, you know that creating a data warehouse is not quick or easy. It involves a great deal of time to learn to design and load a data warehouse. So, how can you convince your boss to let you pursue this?

Well, to start with business people are just like everyone else. They’re short sighted, demanding nearly instant return on investment (ROI), and are averse to risk.

What are you preparing?! You’re always preparing! Just go! – Dark Helmet. Spaceballs

The first thing to remember is that this project can’t be about you. In most cases, your boss doesn’t care about what you want to do, but rather what you can do to add value to the business. Also, in today’s economy your boss may be reluctant to send you to dimensional modeling classes which cost thousands of dollars. So, how can you learn these principals on the cheap? I’ll cover that in a moment.

Pitching the idea to your boss can be difficult because dimensional modeling concepts are intangible. It’s hard for a business person to see the value in it. You have to relate the project to revenue in some way.

So, I suggest that you create a proof of concept project, perhaps in your free time. After reading this series of lessons, pick a business topic and create a test project. Some important things to keep in mind in choosing this topic are:

  1. Choose a business topic that is of value to the business. When your boss sees value, even as just a test case, he/she will be more likely to green light it.
  2. To maximize your chances of success, choose a business topic with a single clean data source if possible. Don’t bite off some project that involves flat files that are FTP’ed to you and often contain dirty or missing data. Pick a relatively clean OLTP system, preferably in SQL Server.
  3. The proof of concept need not be perfect or complete. Create a single Fact Table, and a few Dimensions to go with it.
  4. Use your existing skill set to create and load the data warehouse. While SQL Server Integration Services is the tool of choice for the extract, transform, and load process (ETL); if you don’t know how to use it, load it with T-SQL instead. You can learn SSIS later.

When pitching the idea to your boss, show the model and demonstrate how he/she can perform their own analysis on it. Stress how much more convenient this is for the users, not how much time it will save you. Also, explain that this process can answer long standing business questions the boss may have. In my situation, Made2Manage cannot track sales order history. Through the ETL process of the data warehouse project, I can now track value changes in sales orders in near real time. This functionality saves my company money every day in that users no longer have to track those changes manually.

Learning on the Cheap

Books are one of the most cost effect ways to learn this. Below are a few books I recommend and why.

There are also free webinars from time to time on the subject from the Kimball Group, as well as Pragmatic Works. Pragmatic offers excellent classes in the various Microsoft technologies involved, and I’ve taken nearly all of them. To my knowledge they don’t offer a class in dimensional modeling itself. Also, you can follow blogs like this one. In future articles I intend to document all phases of my open source project from modeling, to ETL, and then reporting on the data warehouse. It should serve as a good case study for others to learn from.

One last thing that I want to stress about learning these techniques. It is absolutely imperative that you learn by doing it. Create a Dimensional Model and attempt to load it. If you don’t have a data source, look for one. Volunteer with a charity, many of them would love to leverage your slave labor to better their “business.”

Get involved with your local SQL Server User Group and make connections. Your progress is only limited by your level of commitment.

So, will you be assimilated?

SQL University - Facts, Show Me The Money

Note: This is the fourth in a series of five “lessons” which I created for SQL University on the topics of Dimensional Modeling and Data Warehousing. Click here to read the rest of the series.

So, we’ve covered dimension tables and how they give your data meaning. That begs the question, how do we actually track the measures (value) of those business process events. In other words, where are we going to track the money, and some other measurements as well.

What are Fact Tables?

Fact Tables contain the measurements of your business process events usually in the form of money, quantities, and sometimes dates. The typical fact table contains nothing but numbers. Suppose we are designing a data warehouse for Acme Inc. maker of a variety of fine products, usually involving rockets, for sale to discerning coyotes everywhere. They have multiple stores throughout Arizona. Below is a sample fact table.


Whereas Dimension Tables tend to be denormalized or flattened, Fact Tables are highly normalized. For the most part, they consist of measurements and Keys that link to the surrogate keys of the Dimension Tables. A sample data set for the table above would look like this:

Fact Tables tend to be as “narrow” as possible because they are so much larger than the dimensions. The only keys which aren’t surrogates are the date table smart keys. Also, the measurements included in the Fact Table should be additive. Notice that I am not storing Unit Price. The user can calculate it from order quantity and extended price if need be. If you store a non-additive fact like unit price, users will try to analyze by it and get erroneous results.

Fact Table Grain

The most important consideration when designing Fact Tables is to set the grain. As Kimball states,

What, exactly, is the grain? The grain of a fact table is the business definition of the measurement event that creates a fact record. The grain is exclusively determined by the physical realities of the source of the data.

The grain should be at the most atomic level you can measure at because you never know how your users will want to slice and dice the data.

In my example, the grain is at the level of Sales Order Line Items. If we were a point of sale establishment like a grocery store, how would we define the grain? Paraphrasing Kimball, when the cashier scans your Cheetos and Bacon and you hear that “BEEP”, the “BEEP” is your grain of your fact table.

Do not violate the grain of your Fact Table, though you may be enticed to do so. For example, what if Acme wanted to add shipping costs to our Order Fact Table. Well, shipping costs are typically recorded per Sales Order, not per line item or individual product. We currently have this problem with M-Data Analytics.

You should not simply add a shipping cost value to your fact table. If a particular sales order cost $10 to ship and had 3 line items, how would you record that? You should not simply place $10 in each of the three fact table rows because it’s not additive. You shouldn’t include $10 for one of the lines and $0 for the others either. The only way to solve this is to estimate the cost per line item (perhaps by using weights in your product table) and use those values. That process could be prohibitively difficult however.

Fact Table Types

There are three types of Fact Tables.

Transactional A Transactional Fact Table is the fundamental type and the example above is Transactional. It should be at the most atomic grain possible, one row per transaction or business event. One row per “BEEP” as Kimball puts it. Since this table is at the most atomic level, they tend to have the greatest number of Dimension tables surrounding them.

Periodic Snapshots
A Periodic Snapshot represents a point in time representation of measurements. One common application is with measurements that cannot be historically tracked like current inventory or order backlog. Another is to summarize over periods of time, perhaps at end of month to track balances and such.

Accumulating Snapshot
An Accumulating Snapshot Fact Table is typically a series of snapshots in a single row where each row represents a business process with a definite beginning and end. For example, let’s say that Acme Inc.’s order process typically includes these steps.

  1. Product is Quoted to Customer
  2. Quote is accepted or denied
  3. Sales Order is generated if Quote is accepted
  4. Sales Order is Shipped
  5. Customer is Invoiced
  6. Payment is Processed

If management wants to measure throughput of orders, how fast they move through the system, an Accumulating Snapshot can be used. The grain would represent a particular sales order (or sales order line item if they are ever handled separately). Included columns would be the important dates, foreign keys to the relevant dimensions, and the quantity and dollar amount measurements.

So, now that we’ve discussed the basics of Dimension and Fact Tables, I’ll bring it all together in tomorrow’s lesson. We’ll discuss Star and Snowflake Schemas, and how this all works. I’ll also cover the best ways to learn this process as well as how to convince your company to allow you to undertake building a data warehouse.

Stay tuned!

SQL University - Dimensions, It's All Meaningless Without Them

Note: This is the third in a series of five “lessons” which I created for SQL University on the topics of Dimensional Modeling and Data Warehousing. Click here to read the rest of the series.

You unlock this door with the key of imagination. Beyond it is another dimension – a dimension of sound, a dimension of sight, a dimension of mind. You’re moving into a land of both shadow and substance, of things and ideas. You’ve just crossed over into the Twilight Zone.

Before we begin with Dimension and Fact tables, I would like to recommend that you use the Kimball Toolkit Data Modeling Spreadsheet when designing your dimensional model. In that download, you’ll receive a blank copy, as well as a copy of one they created for Adventure Works. The Spreadsheet actually generates the T-SQL script to create your data warehouse database using your specifications.

What is a Dimension?

A dimension is an entity involved in your business process which gives your data meaning. Remember the old days when you were confounded by those story problems your sadistic Math teacher foisted on you? Don’t worry, I’m not going to ask you about a train moving west at 45 mph… Our story problem (data request) is simpler. “I want to know what the order amount was by fiscal year, customer, and product category.”

The important words are obviously what and by, which is why I bolded them. What generally tips you off to your Facts (measurements) and by indicates your dimensions. The dimensions are the items in your group by statement. The statement above could be written in T-SQL as:

Select sum(invsales)
where datefield.... 
group by Year, Customer, Product Category

Dimension tables are typically wide, verbose, and filled with descriptive detail fields called attributes. They should include as many details as is practical because you never know what a user will want to analyze by. Try to anticipate all required attributes up front, but add more as you go if need be. A Customer dimension for a business that sells to individual consumers. That dimension would typically include fields such as location, age, sex etc.


You must unlearn what you have learned....

Dimension tables are denormalized which is difficult for some DBAs to accept. In an OLTP system the Customer entity attribute values may be divided among many tables such as address, credit information/rating, and demographics. During dimensional modeling we flatten those tables into as few tables as possible. Why do we do this? It’s easier for the user to understand and fewer joins (plus indexing) means much faster query response.

It is also important to be verbose in your descriptions. For example, in my current project there is an attribute called Product Class which is a two character code. Users have to memorize these numbers and what they represent because the source system separates the descriptions in another table. The descriptions should be used instead of the codes. Think of it from a new person’s point of view. Do you want to memorize that “01” means “Ceiling Fan” or would you rather just see “Ceiling Fan?”

Long time users of the system may actually resist this change, so in my current project I provide both values in my dimension table which provides the ultimate in flexibility.

Conformed Dimensions

Conformity Rules!

A dimension is conformed when it means the same thing for every fact table. Early on in the project the business must come to agreement of the definition of each dimension and what each attribute means. This can cause friction as well. For example, the Manufacturing Department of a company may have a different idea of Product Categories than the Sales Department.

Do NOT cave in. If you allow dimensions to have different meanings and attributes for different fact tables, you will lose the ability to drill across. This causes your different modules or data marts to be stove piped. Data from one may not be compared to another. This defeats the entire point of the data warehouse project.

Surrogate Keys

The Primary Key of a dimension table should be a Surrogate Key. Surrogate keys are unique meaningless integers generated by the database server. Typically, the user never sees these keys, they are simply used by the data warehouse to link tables together. You should not use your OLTP system keys instead even if they are integers. People often protest and want to use their OLTP values for these keys, but you shouldn’t because surrogates insulate your data warehouse from outside forces. For example:

  1. The OLTP source may recycle its keys. If they do so, the uniqueness of your dimension primary key is violated.
  2. Multiple sources for the same dimensional entity likely have different keys.
  3. If you aquire another company, which has it’s own set of keys, how could you incorporate them into your data warehouse without surrogate keys?
  4. Suppose one of your attributes is a ship date and that isn’t always known the first time a record is loaded? Surrogate keys allow a special value to be substituted instead like “Not Available” rather than a null.
  5. Surrogate Keys are compact which saves space, especially with indexes. To that end, you should be using the smallest integer datatype appropriate for these keys. Also, integer keys offer faster joins than string types.
  6. Surrogate Keys facilitates easier tracking of historical changes. More on this in a bit.

Unusual Dimension Types

Date Dimensions
Date Dimensions are unique in that they typically use meaningful keys instead of surrogates. I’ll cover Date Dimensions/Tables in depth in an article in the near future.

Degenerate Dimensions
Degenerate dimensions are actually a part of the fact table and are typically significant keys from your OLTP system like Sales Order Number or Customer Purchase Order Number.

Junk Dimensions
A Junk Dimension is a catch all dimension for miscellaneous data or record indicators. For example, attributes like Ship Early, Ship Method, Order Status, Pay Type or other flags can be put into a Junk Dimension. This keeps your design from being cluttered with too many dimensions.

Role Playing Dimensions
Role Playing Dimensions are specialized copies (typically views) of a Conformed Dimension. The most common use of these is with Date Dimensions. If you need to track Order Date, Acknowledgement Date, Ship Date, and Invoice Date you could do so using views which show different versions of the Date Dimension.

Tracking History With Slowly Changing Dimensions

Slowly Changing

There are three types of Slowly Changing Dimensions (SCD). Ralph Kimball, being the genius that he is, creatively named them type 1, 2, and 3.

  • Type 1 overwrites the attribute and ignores history.
  • Type 2 creates a new dimension record when attribute changes and marks the previous row as inactive.
  • Type 3 tracks the current attribute and the last previous attribute in separate columns.

It’s important to note that the attributes themselves are of each type, not the whole table. Combinations of SCD Types can be implemented as well. The easiest way to understand this is with an example. Let’s say that we have a very simple customer dimension like this:

Incidentally, the CustomerAK represents the Alternate Key or Business Key, in this case it’s the customer number from the OLTP source system. Let’s say that the Customer attribute Sex is a type one because if it is changes it is usually due to an order entry error. So, in the next extract, the value has been corrected for Jane Doe. The table would look like this:

Notice that Jane Doe did not get a new row, that record was updated in place. Now, let’s assume that City is a type 2 dimension and that I moved to Jacksonville. Why would I move to Jacksonville? Well, a certain well known BI consulting company is headquartered near there… Hint Hint. Anyway, after processing the the extract, the table would look like this:

So, I’ve been given a new surrogate key and my old record has been expired. Any new fact records after that time use the record currently active. Notice that we are ignoring time of day, and as long as the effective and end dates don’t overlap, I can use a simple T-SQL Between statement to easily query the data.

In a future article, I’ll cover in detail how to accomplish the slowly changing dimension processing using SQL Server Integration Services.

SCD Decisions

The next natural question is “who decides which columns are which of which SCD type?” Well, if you read yesterday’s article you can already guess the answer. The business folks decide which attributes are SCD 2. However, you must protect them from themselves. Typically if you ask them what attributes they’d like to keep history on, they’ll invariably answer “Everything.” This may not be feasible because if you add a new record for every single dimension change your data warehouse size may grow at a ridiculous rate. Inform them how much all of that history will cost them in hardware. It’s important to note that certain regulatory rules may force all changes to be type 2 regardless of cost.

Tomorrow, we’ll discuss the Facts. “Just the facts, ma’am.” Yeah, I know the quote was lame, but I couldn’t help myself.

If We Build It, Will They Come?

Note: This is the second in a series of five “lessons” which I created for SQL University on the topics of Dimensional Modeling and Data Warehousing. Click here to read the rest of the series.

Creating a data warehouse is a daunting process. It’s not something you just whip out in a weekend. So, let’s get a few things straight to start with.

Who Designs the Data Warehouse (Dimensional Model)?

If you build it, will they come? It seems obvious that you design the data warehouse, that’s why you’re taking your time read this blog, or maybe you’re just fascinated with my movie quotes. But seriously, who designs this thing? The business users.

The business users have a large part in designing a data warehouse. This isn’t an IT process, its a business process. The subjects of the data warehouse are decided by the business, as well as what information needs to be kept on each. In reality the data warehouse exists to assist users in making better decisions. The data warehouse architect is there to assist and guide in this process. One sure way to fail is to have IT department centered data warehouse because the users won’t see value in it.

Notice that I didn’t suggest that you should ask the business users what they want. If you do so, you are handing over the design role to the business user. They don’t know what they want because they don’t know what you provide. They will tell you something, but because they don’t understand what you can do, it will be wrong. It’s up to you to bridge the gap and learn the business to so you can design a successful dimensional model.

So, the business users don’t understand tables, keys, referential integrity, etc. Part of the data architect’s job is to educate the users about what can be done so they can provide better input into the process. So, how can you teach them the basics of fact and dimension tables? The same way I’m going to teach you.

Let’s start out with pivot tables, which most users understand. Ask the user for a typical reporting requirement that they might have. Suppose they say, “I want to know incoming order amounts by fiscal year, period, and product category.” Well, the most users understand how to use Excel pivot tables so that’s the method we’ll use.

Open Excel, query the data, and create a pivot table. The user may already know that they need to drag the amount to the Values box, and already know how to pull the values for year, month, and product category to the columns and rows as shown. The values, or the column you want to measure, are your facts. The things you measure facts by (by fiscal year, period, and product category) are your dimensions. Dimensions and facts go in separate tables, but the users may never see them. Now that they understand that, what’s next?

How do we model the model?


Do you want to know what the Matrix is Neo? – Morpheus The Matrix

Start with the Kimball Bus Matrix, which is kind of like a high level road map of your project. Along the left hand side are the business processes which bring value to your organization. The columns are the descriptive objects (dimensions) for your organization. If a dimension is involved with a specific business process, place an X in the intersecting cell. Notice that I have added a column called “Business Priority” which I use to indicate the order in which I will handle each process.

Typically the Bus Matrix is much larger because enterprises typically have more business processes. The example presented is a simplified version demo purposes.

How do we determine which order to tackle the business processes?

Well, the first business process takes the longest. You’re going to find that dealing with the dimensions typically takes a lot longer than fact tables. When you work on your first business process, you will need to create all of the dimensions associated with it. In my example above, if we start with Incoming Orders, every dimension but Vendors has to be created. However, each additional business process becomes easier because we re-use the already existing dimensions.
So, which business process do you start with? Start with the low hanging fruit. You want to pick a process which is relatively easy and has high business value. Perhaps that value represents money savings, increasing sales opportunities, or something that is a mandated compliance issue. Once again, all of this is considered with a focus on the business. We want a quick return on investment so we continue to have the support of management.

So, you’ve identified the business process and the facts and dimensions associated with it. Note that we haven’t started fleshing out the dimension fields with the descriptive data. That’s what we’ll handle next when I show you the basics of modeling dimensions.

SQL University - Dimensional Modeling and Why It Doesn't Suck

This is the first in a series of five “lessons” which I created for SQL University on the topics of Dimensional Modeling and Data Warehousing. The rest of the lessons are as follows:

SQL University

Welcome to those of you from SQL University attending this series of “classes” on Dimensional Modeling Fundamentals. If you are new to my blog, you may want to take a moment to read about me and why I blog.

This series of articles will serve as an introduction to Dimensional Modeling and Data Warehouses. While I am a SQL Server Business Intelligence DBA, the fundamentals of dimensional modeling are universal and platform agnostic. I’m a big proponent of Kimball Methodology, and most of what I will share with you is based on it.

What is Dimensional Modeling and what is a Data Warehouse?

Dimensional Modeling is a term for a set of techniques and concepts used to design data warehouses. Different people have their own definitions of the term data warehouse, but the one we’ll use is simple. A Data Warehouse is a relational database that is optimized for reporting. Typically they contain information from many different data sources and are loaded from those sources on a periodic basis.

Why do I need a Data Warehouse?

What would you do with a brain if you had one? – Dorothy. The Wizard of Oz

There are many reasons why you need a data warehouse. Let’s cover a few.

Your Database Design Sucks

Well, it sucks for reporting anyway. How can I make that blanket statement? Well, when most of us think of a database, we think of an On-Line Transaction Processing (OLTP) database whether we know it or not. This type of database is optimized for data entry: inserts, updates, and deletes. Part of this optimization is due to normalization. I won’t go into the gory details, but normalized databases are generally composed of narrow tables of closely related data, which link to other related tables to minimize redundancy. This results in a logical layout for the data, reduces the possibility of multiple values for the same attribute, and reduces database size.

For example, let’s assume you were designing the sales order tables of a database for manufacturing companies (like Made2Manage). You could not have all of your sales information in a single table. Logically, at the very minimum, you would require the following tables. I’ve listed some common fields at each level.

  1. Customer Table (SLCDPM), which would contain customer name, discount, credit rating, billing address.
  2. Sales Order Header (SOMAST)- Purchase Order Number, Order Date, Shipping Method.
  3. Sales Order Detail (SOITEM)- Quantity Ordered, Unit Price, Part Number.

Of course, I’m simplifying things. Even in a small database like Made2Manage the sales order module encompases more like 10 tables as you can see from the simplified model below.

One of the tables is in red because its a Visual FoxPro table. I have seen production systems which have as many as 15 – 20 tables involved in the sales process. While this may make sense in a transactional system, reporting from it can be nightmarish.

A dimensional model for a data warehouse is designed differently. We transform the standard OLTP structure into a simplified structure with two basic types of tables. Fact tables generally contain measurements such as dollar amounts, and dimensions which contain descriptive facts about them. I’ll go into this in more detail later this week, but in the dimensional model tables are combined and “flattened” out which makes reporting easier for the users and faster for the optimizer.

After being dimensionalized, the tables look similar to the following:

  • Lacking Primary and Foreign Key Relationships – That’s the reason the first illustration lacks the arrows. Made2Manage lacks primary and foreign keys. Therefore, most query programs will not pick up the relationships for the users automatically.
  • Incomplete Normalization – Your database may repeat the customer name in both the customer master and sales order header tables. This leads to situations where the customer has changed names, but the sales orders reflect the previous name.
  • “Bad” Field Names – While developers may understand that Fsono stands for sales order number, the users should not have to see that. Some databases have field names that are in foreign languages, or use field names like F111, F143, etc. Notice that in the second illustration, the field names are typically in real English.
  • Indexing issues – As mentioned before, OLTP databases are optimized for data manipulation and indexes tend to slow those processes. However, when retrieving millions of records from 15 joined tables and constraining by multiple values, indexing is critical for performance.
  • Hard Updates and Deletes / Auditing – Some OLTP databases update records in place which makes it difficult to track changes. They may also “hard delete” as well, meaning that they delete the record from the table rather than marking it with some flag indicating that it has been deleted. While the data might be accurate as of now, it becomes difficult to track history which causes auditors to have fits. Often auditing functionality can be built into the extract, transform, and load (ETL) process for loading a data warehouse.
  • Improper Data Types – Databases may save data in the wrong data type. For example, they use the Varchar or Char data types for dates, numbers, etc. While the data is saved, this can make data manipulation and reporting difficult.

Even if you are fortunate enough to have a well designed OLTP system, it is most likely not a good reporting platform.

Your Data Sucks

The consitency and completeness of data required in a data warehouse is much greater than typical OLTP databases. These issues are typically corrected during the extract, transform, and load (ETL) process. For example, in a typical OLTP source database, you may run into any of the following:

  • Blank or Null Values. If an address is lacking a country designation, the package will still likely be delivered. However, when reporting blank values make analysis difficult.
  • Inconsistent or mis-spelled values. Made2Manage doesn’t have any restrictions on its State fields. This allows the users to enter things like “IT” for Italy. Also, my database contains multiple versions of the same state name. The great state of California is represented as CA, Cali, California, Califrnia, etc. This incredibly frustrating for reporting.
  • Duplicate Entities. Many database systems allow users to accidentally enter duplicate entities like customers, contacts, etc. Reports that group by these entities are often inaccurate because of this.

User Extracts Suck

Rest assured that if you aren’t extracting your data for analysis, your users are. They’re exporting reports from the source systems, using Excel to pull the data, etc. They’re most likely doing so with inefficient queries, and at inopportune times. Their interpretations of that data may be inaccurate and they often litter your servers with multiple copies that are at varying stages of obsolescence.

Multiple Data Sources Suck

Most larger companies have numerous legacy systems, which don’t “talk” to each other. Different systems for Customer Relationship Management (CRM), Shipping, Manufacturing, Accounting, Payroll, etc. can all be merged into one well designed data warehouse for analysis.

Another issue arises when you need to change your source OLTP system. Suppose you’re switching your ERP system product. Well, you’ve already spent forever creating dashboards and reports on that system and now you’ll have to do it all over again. Source systems may change, but a properly designed data warehouse doesn’t necessarily have to. If the business itself doesn’t fundamentally change, then the data warehouse structure remains effective. You can simply change your ETL process to load your current structure or add additional functionality to your current data warehouse.

Date Grouping and Manipulation Sucks

Without the use of a date table/dimension date grouping is a royal pain, especially if you use non-standard fiscal periods. My current employer uses 4-4-5 fiscal periods. Therefore, when grouping by fiscal period, I simply have to use a date table. Data warehouses almost always have a date dimension.

The Report Process Sucks

First, let me tell you what the majority of your business users think about reports. When they want a report, they send a request to IT. It sits in the queue for weeks or perhaps longer. Eventually the DBA asks them a ton of questions which the business user thinks should be a matter of common sense. After several drafts, the approval is finally made and the user forgets why they requested the report in the first place. Reports evoke visions of green bar paper and dust. Wouldn’t it be nice for them to open Excel and simply drag fields into a pivot table or fire PowerPivot up and create an interactive application? They want analytics, not reports.

And what about you, the DBA or Report Writer? How many times can you write the same report with slightly different groupings and parameters? It’s a never ending treadmill and that’s one of the reasons I pursued this path in my free time.

Data Warehousing Doesn’t Suck

A few years ago I attended a SQL Server Analysis Services class with the great Mike Hotek, and he started out with dimensional modeling. He spent less than 20 minutes discussing it, briefly describing dimension and fact tables. However, when he did the clouds parted and a beam of light shown through, and I heard angels singing “aaaah”.

It’s a UNIX system! I know this! – Lex. Jurassic Park

Dimensional modeling just clicked with me and I instantly “got it.” I proceeded to purchase nearly Kimball Group book on the subject and have been studying it ever since. Data warehousing is fascinating to me as it gets me out of the daily grind, and it’s good for the business as well. Its like getting paid to work on puzzles all day long. In short, its full of win.

Tune in every day this week, and I’ll show you why.