Data Warehouse Initial (Historical) Dimension Loading With T-SQL Merge

If T-SQL Merge can only handle one changed record at a time, how can we use it for our initial/historical load? Further, how can we use Merge in those situations where the incremental load may produce more than 1 changed record per entity? Well, I’m glad you asked.

Welcome to the next article in the Urge to Merge Series. In this series I’ve tried to clear up many misunderstandings about how to use T-SQL Merge effectively, with a focus on Data Warehousing.

The Setup

We will be using the same code we used in Extracting Historical Dimension Records using T-SQL, which is available here: Jedi Merge Historical Load Setup.

For the sake of simplicity, we will use the first query provided in that article which extracts the entire historical load. The query is as follows:

INSERT INTO Staging.DimCustomer (CustomerNum
   SELECT CustomerNum
         ,Row_Number ()
             OVER (PARTITION BY CustomerNum ORDER BY ChangeDate DESC)
     FROM Source.CustomerHistory
   --Where CustomerNum = 102
   ORDER BY CustomerHistoryId

The most important part of this query is the RankNo column. Notice that I am assigning a Row_Number per entity (Business Key) by descending ChangeDate. This means the oldest records per CustomerNum get the highest RankNo value. My looping Merge is set to loop from highest rank down to 1, which may seem counter-intuitive to some. The reason I do this will become clear in a minute. Anyway, the above query will create a data set like the following in Staging.DimCustomer:

Notice that I've filtered for Yoda only for simplicity.

Notice that I’ve filtered for Yoda only for simplicity.

If you’ve been following along in this series, I’ve repeatedly mentioned a limitation of T-SQL Merge. On each execution of the Merge statement, there will only be 1 record per entity to Merge. Well, we’re about to deal with that issue now. Just because we can only Merge one change record per entity at a time, doesn’t mean we can’t loop through Merge statements to accomplish an Initial (Historical) Dimension Load. The Merge code should look familiar as it builds on my previous articles.

Looping Merge Code

IF OBJECT_ID('tempdb..#DimCustomer') IS NOT NULL
    DROP TABLE #DimCustomer;
CREATE TABLE #DimCustomer(
  CustomerNum int NOT NULL, 
  Planet VARCHAR(25) NOT NULL, 
  RowIsCurrent Char(1) Not NULL Default 'Y',
  RowStartDate datetime Not Null Default getdate(), 
  RowEndDate datetime Not NUll Default '12/31/9999');   
DECLARE @Counter SmallInt
SET @Counter = (SELECT MAX(RankNo) FROM Staging.DimCustomer);
WHILE (@Counter >=  1)
-- Type 2 Loop
INSERT INTO #DimCustomer (CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate)
Select CustomerNum, CustomerName, Planet,'Y', RowStartDate, '12/31/9999' From 
(Merge dbo.DimCustomer AS Target
 USING ( Select CustomerNum, CustomerName, Planet, ChangeDate, RankNo  From    Staging.DimCustomer
 Where RankNo = @Counter) as Source 
ON  Target.CustomerNum = Source.CustomerNum
 AND Target.RowIsCurrent = 'Y' 
 and EXISTS (SELECT Source.CustomerNum, Source.Planet
 EXCEPT SELECT Target.CustomerNum, Target.Planet) 
 THEN UPDATE SET RowIsCurrent = 'N',Target.RowEndDate = Source.ChangeDate
 WHEN NOT MATCHED BY TARGET THEN INSERT (CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate) 
Values (Source.CustomerNum, Source.CustomerName, Source.Planet,'Y',Source.ChangeDate ,'12/31/9999')
Output $ACTION ActionOut ,Source.CustomerNum, Source.CustomerName, Source.Planet, Source.ChangeDate RowStartDate
              ) AS MergeOut
    WHERE MergeOut.ActionOut = 'UPDATE';
Insert  dbo.DimCustomer
Select CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate  from #DimCustomer;
	SET @Counter = @Counter - 1;
-- Type 1
Select @Counter Counter
 MERGE  dbo.DimCustomer AS Target
 USING ( Select CustomerNum, CustomerName, Planet, ChangeDate, RankNo  From    Staging.DimCustomer
 Where RankNo = 1) AS Source
    ON Target.CustomerNum = Source.CustomerNum
                    (SELECT Source.CustomerNum, Source.CustomerName
                     SELECT Target.CustomerNum, Target.CustomerName)
   UPDATE SET Target.CustomerName = Source.CustomerName;

There are just a few things to note:

  1. As I mentioned in a previous article, the Type 1 Merge statement is performed last. Obviously, it has to be.
  2. Notice that I’m using a simple While statement to loop.
  3. The RankNo column was calculated by descending ChangeDate because this insures that the last record for each entity to be processed has a RankNo equal to 1. This makes the Type 1 Merge statement easier to write.

What about Transactions and Error Handling?

I’ve intentionally left those aspects out of this code for a couple of reasons. First, it’s easier to understand what’s happening without the extra code in the way. Second, you may not actually want explicit transactions in your Initial Data Warehouse load. It doesn’t take a Rocket Surgeon to figure out that if you load a huge Dimension historically, and you only use one transaction, your transaction log may grow out of control. This is particularly true if you are loading multiple Dimensions simultaneously. This is definitely an “It Depends” situation.

As I mentioned in the previous article, Extracting Historical Dimension Records using T-SQL, the Merge statement above should work in either Method 1 or 2. Also, because of the RankNo column, incremental Dimension loads which may contain more than 1 changed record per entity (CustomerNum) will work correctly as well.

Any questions?

3 comments to Data Warehouse Initial (Historical) Dimension Loading With T-SQL Merge

  • Suhail Ali

    This might a bit off topic but I was wondering how you would handle inferred dimensions in both a historical and normal dimension load? Typically this is done in fact load when the dimension look-up fails and the dimension is created based on the data in the fact table but here you are advocating use merge for both dimension and fact load. If this is something you’ll cover in future blog posting for fact loads I’ll patiently wait. Love the merge series.

  • Elena Z.

    David, I’m glad that I found this fantastic article. That was exactly my question: how to handle multiple records and keep a history. And I found a very detailed answer here. Thank you very much!!!

  • Duane Wilson

    I have a situation where I may have to load more than one change per day (since last dimension load), but most of the time just one. On some smaller dimensions, I have been truncating and reloading every day because our stage tables have the entire history appended day by day. It was a trivial matter to truncate the dimension table day by day and reload it because it was small as if they were initial loads (which they are).

    Now, I am faced with building large ones where that would not be a viable option, but I still have the code similar to above, as a template. But most days, I will just be selecting the most recent rows from our “stage” tables. These stage tables all have all the dates. And there could be, as I said, more than one change since the last dimension load.

    I guess I should sum up by saying, should I handle all loads as some kind of modified initial load? Can I use the same code somehow?

    Thanks for any help. There are some great, informative articles here.

Leave a Reply




You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>