Every Kimball Group book I’ve read, as well as every Data Warehouse class I’ve attended, has indicated that a Date Dimension Primary Key should be a smart integer key in the format YYYYMMDD (20110518) so I’ve always built my Date tables that way. However, Barnaby (Blog/Twitter) pointed out that I should be using a Date Datatype key in SQL 2008 instead. His reasoning is sound in that the new Date datatype requires 3 bytes while an Int requires 4. This saves space and memory during processing and still facilitates table partitioning by year. The most obvious downside is that an unknown member value is required for Fact Table processing. However, I always use 19000101 as my unknown member, which can be entered as a date value anyway.
When I inquired on the internets about this prospect, I found conflicting opinions. One cited downside of using Date datatype was that the SQL Query Optimizer was more efficient when joining on Integers rather than dates or character types. However, I couldn’t find any test results involving the Date datatype to back that up so I thought I’d test it myself.
Let me preface the rest of this article with a disclaimer. I am not a query tuning expert, nor do I play one on TV. However, I ran the basics of these tests against the great Grant Fritchey (Blog/Twitter) and am grateful to him for the guidance.
I used my Date Table Script to generate two date tables, one with a smart integer key called DateTableInteger and another called DateTableDate with a Date datatype primary key. The primary key field for each is called DateID. Each table included an index on the [Date] column, which is a DateTime type in DateTableInteger and a Date datatype in DateTableDate. Each table is identical in every other way.
I created two source tables utilizing code written by Michael Valentine Jones from SQLTeam using his method of generating random numbers which can be used to create a column of random dates. For this test I created two tables with a single field of two million records each with dates ranging from 1/1/2000 to 12/31/2020. The table with the Date datatype is called DateSource and the other with smart integer keys is called IntSource. Each source table has exactly the same list of dates as well for test conformity.
Tested Queries and Methodology
I used three simple queries for my tests. Each represents a common scenario where the user needs to select records from a specific year.
set statistics io on set statistics time on Select DT.Date From IntSource ISO -- Integer Type Source join DateTableInteger DT on DT.DateId = ISO.IntColumn Where DT.Date >= '1/1/2010' and DT.Date < '1/1/2011' Select DT.Date From dbo.DateSource DS -- Date Type Source join DateTableDate DT on DT.DateId = DS.DateColumn Where DT.Date >= '1/1/2010' and DT.Date < '1/1/2011' Select DS.DateColumn From dbo.DateSource DS -- Date Type bypassing join. Where DS.DateColumn >= '1/1/2010' and DS.DateColumn < '1/1/2011'
The first query is an integer join, the second utilizes a Date datatype join, and the third is Date datatype but doesn’t require a join. I realize that the two field values in the second query, DateID and Date, are identical but I used it for consistency and it represents a scenario where the join is used to pull additional information such as Calendar Month. In my experience many queries against a Data Warehouse don’t require additional information about a date anyway, and I suspected that the greatest performance benefit would be found when the extraneous join was removed. In my testing, the Integer Join represents my control, the base to which I compare other values. Each query returns the same 100,086 records.
I performed two rounds of tests. The first used non-indexed source values and in the second I added a clustered index to each of the source tables, as Fact Tables often use a date field for their clustered indexes.
Prior to testing each query set, I called dbcc freeproccache and dbcc dropcleanbuffers. I then ran the query set once to eliminate compile time which I was not interested in. Each query was executed 10 times and I averaged the CPU and elapsed time to eliminate other factors. Note that this is a production level server, Dell R510 with Raid 10 arrays of 15K rpm drives, that has not yet been put in production so nothing should be running while I tested. The server has a setting of MAXDOP 1.
Obviously using a Date datatype instead of the smart integer will save space on disk and in memory, however the savings are not dramatic. For example, the space difference is less than 2MB with two million records.
Also, in my opinion, the performance differences between the Integer and Date datatype joins isn’t statistically relevant. I consider performance deltas of less than 5% to be a wash because my testing methodology isn’t exhaustive. The clustered index date join outperformed the integer join by 24% but trailed it by 5% when not indexed. However, I find it interesting that in both sets of tests the number of logical reads is less for the Date datatype join than the integer join.
As I expected, the Date datatype key method easily outperforms the other two when a join is not necessary. The 52% performance boost without indexes and 48% for clustered indexes is considerable and were reflected in a second test run I performed to verify.
Based on these results, I am going to switch to using the Date datatype in my data warehouse projects using SQL Server 2008. Next week, I’ll post an updated script to reflect this.