Archives

Date vs Integer Datatypes as Primary Key for Date Dimensions

Every Kimball Group book I’ve read, as well as every Data Warehouse class I’ve attended, has indicated that a Date Dimension Primary Key should be a smart integer key in the format YYYYMMDD (20110518) so I’ve always built my Date tables that way. However, Barnaby (Blog/Twitter) pointed out that I should be using a Date Datatype key in SQL 2008 instead. His reasoning is sound in that the new Date datatype requires 3 bytes while an Int requires 4. This saves space and memory during processing and still facilitates table partitioning by year. The most obvious downside is that an unknown member value is required for Fact Table processing. However, I always use 19000101 as my unknown member, which can be entered as a date value anyway.

When I inquired on the internets about this prospect, I found conflicting opinions. One cited downside of using Date datatype was that the SQL Query Optimizer was more efficient when joining on Integers rather than dates or character types. However, I couldn’t find any test results involving the Date datatype to back that up so I thought I’d test it myself.

Let me preface the rest of this article with a disclaimer. I am not a query tuning expert, nor do I play one on TV. However, I ran the basics of these tests against the great Grant Fritchey (Blog/Twitter) and am grateful to him for the guidance.

Setup

I used my Date Table Script to generate two date tables, one with a smart integer key called DateTableInteger and another called DateTableDate with a Date datatype primary key. The primary key field for each is called DateID. Each table included an index on the [Date] column, which is a DateTime type in DateTableInteger and a Date datatype in DateTableDate. Each table is identical in every other way.

I created two source tables utilizing code written by Michael Valentine Jones from SQLTeam using his method of generating random numbers which can be used to create a column of random dates. For this test I created two tables with a single field of two million records each with dates ranging from 1/1/2000 to 12/31/2020. The table with the Date datatype is called DateSource and the other with smart integer keys is called IntSource. Each source table has exactly the same list of dates as well for test conformity.

Tested Queries and Methodology

I used three simple queries for my tests. Each represents a common scenario where the user needs to select records from a specific year.

set statistics io on
set statistics time on
 
Select DT.Date From IntSource ISO -- Integer Type Source
join DateTableInteger DT on DT.DateId = ISO.IntColumn
Where DT.Date >= '1/1/2010' and DT.Date < '1/1/2011'
 
Select DT.Date From dbo.DateSource DS -- Date Type Source
join DateTableDate DT on DT.DateId = DS.DateColumn
Where DT.Date >= '1/1/2010' and DT.Date < '1/1/2011'
 
Select DS.DateColumn From dbo.DateSource DS -- Date Type bypassing join. 
Where DS.DateColumn >= '1/1/2010' and DS.DateColumn < '1/1/2011'

The first query is an integer join, the second utilizes a Date datatype join, and the third is Date datatype but doesn’t require a join. I realize that the two field values in the second query, DateID and Date, are identical but I used it for consistency and it represents a scenario where the join is used to pull additional information such as Calendar Month. In my experience many queries against a Data Warehouse don’t require additional information about a date anyway, and I suspected that the greatest performance benefit would be found when the extraneous join was removed. In my testing, the Integer Join represents my control, the base to which I compare other values. Each query returns the same 100,086 records.

I performed two rounds of tests. The first used non-indexed source values and in the second I added a clustered index to each of the source tables, as Fact Tables often use a date field for their clustered indexes.

Prior to testing each query set, I called dbcc freeproccache and dbcc dropcleanbuffers. I then ran the query set once to eliminate compile time which I was not interested in. Each query was executed 10 times and I averaged the CPU and elapsed time to eliminate other factors. Note that this is a production level server, Dell R510 with Raid 10 arrays of 15K rpm drives, that has not yet been put in production so nothing should be running while I tested. The server has a setting of MAXDOP 1.

Results

All recorded times are in milliseconds (ms).


Conclusions

Obviously using a Date datatype instead of the smart integer will save space on disk and in memory, however the savings are not dramatic. For example, the space difference is less than 2MB with two million records.

Also, in my opinion, the performance differences between the Integer and Date datatype joins isn’t statistically relevant. I consider performance deltas of less than 5% to be a wash because my testing methodology isn’t exhaustive. The clustered index date join outperformed the integer join by 24% but trailed it by 5% when not indexed. However, I find it interesting that in both sets of tests the number of logical reads is less for the Date datatype join than the integer join.

As I expected, the Date datatype key method easily outperforms the other two when a join is not necessary. The 52% performance boost without indexes and 48% for clustered indexes is considerable and were reflected in a second test run I performed to verify.

Based on these results, I am going to switch to using the Date datatype in my data warehouse projects using SQL Server 2008. Next week, I’ll post an updated script to reflect this.

Related posts:

9 comments to Date vs Integer Datatypes as Primary Key for Date Dimensions

  • Hi David,

    Good to see the results of your tests. On the space issue, a fact table can have multiple date columns and at 1 MB per million keys, a large fact table of a hundred million rows and four date keys could save 400MB of space, which I think is pretty cool :-)

    I think the most impressive thing about using the date datatype is as you point out, that you can perform date calculations without having to join to the date dimension.

    Cheers,

    Barny

  • Hi David,

    an Intriguing post – but i think you have missed an important consideration.

    even if you used integer keys you can still bypass the join to the date dimension. The bypass query could be applied to the other source too… can you update post after testing how well that performs please?

    Select IS.DateColumn From dbo.IntSource IS
    Where IS.IntColumn >= ’20010101′ and IS.IntColumn < '20110101'

  • Bhavik,

    between is one thing, how about datepart, datediff etc? You cannot use them on ints can you.

    Barny

  • Bhavik, I actually tested that but didn’t include it in the post. The performance was slightly less than the Date Datatype (minus) join query above. The actual averages for clustered index was 26ms CPU Time and 672ms elapsed of which I feel that CPU time was the more important stastic.

    I didn’t include that option because it violates the basic objective of the Kimball method. In my opinion, the whole point of dimensional modeling is to make reporting easier with performance being a secondary consideration.

    Isn’t it inconvenient to write all of your queries with as >= 20100101 and < 20110101 rather than referring to them as dates?

    Also, if you cast the date table integer key value as a Date (or datetime), using a view or simply in your T-SQL queries, you negate the use of your indexes.

    For those reasons, I didn’t post the Integer (without join) data type results.

  • Bhavik

    Agree with both of you wholeheartedly! I was merely interested in the performance scores

    We use the standard kimball approach as a teamwide standard but after reading this post we are probably going to move to date keys instead of int.

    Solutions I’ve implemented over the years have always employed sass cubes, so I only foresee a small benefit of date keys during partition processing. However, having a date in a fact instead of int offers more ease during testing and reconciliation of the dimensional model since we wont have to join to the date dim as often

  • Bhavik

    Ah, of course rolap and holap scenarios could benefit too!

  • Excellent points Dave. I can see you put a lot of thought and work into this one. I’ll give it a go on my next data warehouse.

    Tim

  • Boris

    Thank you for the article! Good one!!
    Just came across with one misprint at the beginning – Date datatype requires 3 bitsbytes.

  • Thanks Boris. I just fixed it and am surprised that I missed that.

Leave a Reply

 

 

 

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>