Archives

How to Properly Load Slowly Changing Dimensions using T-SQL Merge

One of the most compelling reasons to learn T-SQL Merge is that it performs Slowly Changing Dimension handling so well. This post is the fourth in a series called Have You Got the Urge to Merge?. This post builds on information from the other three, so I suggest you stop and read those before continuing, particularly the last one What exactly are Dimensions and Why do They Slowly Change? Go ahead… I’ll wait.

For most SSIS developers, their first experience in loading them is to use the SSIS Slowly Changing Dimension Wizard. It’s fairly easy and straight forward to use, but offers painfully poor performance. There are also 3rd Party SQL Server Integration Services (SSIS) SCD options available, but I’ve tested all of them and never found one that I was happy with. As you can probably guess, I prefer to use T-SQL Merge statements in an SSIS Execute SQL Task. There are several reasons for this.

What are the Pros and Cons of using Merge to load Slowly Changing Dimensions?

Pros

  1. T-SQL Merge is among the best performing options for loading Slowly Changing Dimensions.
  2. Merge statements are very versatile and allow for high degree of customization. This is important when you run into a scenario which deviates from standard data warehousing practices.
  3. The Merge source can be a table, view, common table expression, etc. In fact, the Merge destination (or target) could be a view as well.

Cons

  1. The Source and Destination should be on the same SQL instance. Yes, you can perform Merge with linked servers, but I wouldn’t do so.
  2. Merge statements can be wordy and tedious to write, as you’ll see in a minute. Merging a couple of tables with 20+ columns can be described as tedious at best.
  3. Merge performance relies upon proper indexing and setup. Small mistakes can result in drastically reduced performance.
  4. The initial load of a table can be painful since all of the information is typically written twice, once in staging and then again in the destination table.

Merge statements can be confusing at first so there’s a bit of a learning curve. However, once you get the hang of it, I think you’ll agree that Merge is a great way to go.

When discussing Merge today we are going to stipulate the following:

  1. Each entity in our Dimension will have 1 active record which will be designated by a RowIsCurrent (or another variation) field and/or RowStartDate and RowEndDate.
  2. Entities in our Dimension will be continuous. So, for our DimCustomer, we’ll assume that we don’t have customers cancelling their accounts and then re-starting them. There will be no gaps in time.
  3. On each load, there will only be 1 record per entity to Merge. In other words, for a particular customer, there will only be 1 changed record per load. You’ll notice that my code includes a field called “RankNo” which handles situations where this is not the case such as initial loads. I’ll dive deeper into that process in a future article.

Let’s Get to the Code!

First download the following file which will create a database called JediMergeDemo and the necessary schema and tables. For those who read my last article, these tables and data will look very familiar. As in the previous article, CustomerName will be a Type 1 change, while Planet will be Type 2. I’m including the Fact Table even though we aren’t going to load it. I’m including it because Foreign Key relationships affect how you write your Merge statements.

simple_jedi_star

First, let’s pre-load our Dimension table and also our Staging table. We’ll start by loading a record that requires a Type 1 change in staging. Obi-Wan Kenobi changes his name to Ben Kenobi before he goes into hiding.

INSERT INTO dbo.DimCustomer
(CustomerNum, CustomerName, Planet,  RowStartDate) 
VALUES (101,'Anakin Skywalker', 'Tatooine',   getdate() - 101),
       (102,'Yoda', 'Coruscant',  getdate() - 100),
       (103,'Obi-Wan Kenobi', 'Coruscant',  getdate() - 100)
 
INSERT INTO Staging.DimCustomer(CustomerNum, CustomerName, Planet, ChangeDate)
VALUES
(103,'Ben Kenobi', 'Coruscant',  getdate() - 99)

base_dim_customer_jedi

Type 1 Merge

Unfortunately, using T-SQL Merge to process Slowly Changing Dimensions typically requires two separate merge statements. Also, it’s important to note that I’m covering the Type 1 Merge process first because it is the simplest to understand. In a real world scenario, I typically handle the Type 1 Merge step AFTER I process Type 2. I’ll explain why in a future article. So, did you read the previous articles on Merge? If you did, the Type 1 Merge code will look eerily similar. The Type 1 Merge statement is simple because all it does it check for changes to the Type 1 columns of the dimension. If any of those change, an update is run against the entire column with that new value. As I mentioned earlier, we only have a single Type 1 column, CustomerName. The Type 1 Merge statement is as follows:

MERGE  dbo.DimCustomer AS Target
 USING  Staging.DimCustomer as Source
    ON Target.CustomerNum = Source.CustomerNum
WHEN MATCHED AND EXISTS
                    (SELECT Source.CustomerNum, Source.CustomerName
                     EXCEPT
                     SELECT Target.CustomerNum, Target.CustomerName)
THEN
   UPDATE SET Target.CustomerName = Source.CustomerName;

In case you haven’t read the previous articles, I’m using Select/Except to find changes while correctly handling NULLs. Also, I could have written this query as:

 USING (Select  CustomerNum, CustomerName From  Staging.DimCustomer) as Source

However after extensive testing, even with a dimension where only 1 of many attributes is handled as Type 1, there was no difference in performance or query plan.So, Type 1 Merge is simple, right? Well, Type 2 Merge is a lot more difficult.

Type 2 Merge

Note: Properly handling Type 2 changes with T-SQL Merge is complex, and most people don’t perform it correctly. In order to convey the basics, as well as the complexities, I’m going to present this in layers. If you want to simply want to see the end result, feel free to scroll down to the final statement.
So, before we tackle the Type 2 Merge, let’s review what our code needs to accomplish. To process a Type 2 change, our code must:

  1. Find the current record for each entity (per business key) in the Dimension. Most often this is done by finding the record where RowisCurrent = ‘Y’ or some variation. However, In certain situations, you may need to find the current record by comparing the change date to RowStartDate and RowEndDate.
  2. If there are changes to any of the Type 2 columns for that current row, expire it. This is typically done by setting the RowIsCurrent value to ‘N’ as well as assigning the RowEndDate equal to the change date. In the last article, I explained that some people handle these dates differently.
  3. Insert a record with a new Surrogate Key for that Dimension record and insert all of the new values into it. The RowEndDate will be ’12/31/9999′, NULL, or some other value you choose.

The Merge Into portion of the Type 2 is just like the Type 1 Merge.

          Merge  dbo.DimCustomer AS Target
          USING  Staging.DimCustomer as Source
          ON     Target.CustomerNum = Source.CustomerNum

Followed by a WHEN NOT MATCHED BY TARGET statement which inserts any new records. If my Staging table was durable, meaning it kept every single record that had been loaded into the Dimension, I may have wanted to add a WHEN NOT MATCHED BY SOURCE section as well.

          WHEN NOT MATCHED BY TARGET
          THEN
            INSERT (CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate)
            Values (Source.CustomerNum, Source.CustomerName, Source.Planet, 'Y', Source.ChangeDate, '12/31/9999')

The previous snippet inserts brand new records, but what about handing the Type 2 expiration process? Well, this next snippet of code handles just the expiration, not the insertion of the newly created records.

          WHEN MATCHED AND Target.RowIsCurrent = 'Y' 
          and EXISTS
                 (SELECT Source.CustomerNum, Source.Planet
                  EXCEPT
                  SELECT Target.CustomerNum, Target.Planet)
          THEN
            UPDATE SET RowIsCurrent = 'N', Target.RowEndDate = Source.ChangeDate

As I mentioned previously, I could have found the current record via Begin and End Dates instead, like this:

          WHEN MATCHED AND Target.RowStartDate <= Source.ChangeDate and Source.ChangeDate < Target.RowEndDate 
          and EXISTS
                 (SELECT Source.CustomerNum, Source.Planet
                  EXCEPT
                  SELECT Target.CustomerNum, Target.Planet)
          THEN
            UPDATE SET RowIsCurrent = 'N', Target.RowEndDate = Source.ChangeDate

Now, this is one of the places Type 2 Merge statements get tricky. You cannot insert those expired records and insert them from within the Merge statement. You must capture them with an Output clause, which filters for only the Updated records. Enclose the entire statement (so far) with parenthesis and alias the sub-query.

         (Merge  dbo.DimCustomer AS Target
          USING  Staging.DimCustomer as Source
          ON     Target.CustomerNum = Source.CustomerNum
          WHEN NOT MATCHED BY TARGET
          THEN
            INSERT (CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate)
            Values (Source.CustomerNum, Source.CustomerName, Source.Planet, 'Y', Source.ChangeDate, '12/31/9999')
          WHEN MATCHED AND Target.RowIsCurrent = 'Y' 
          and EXISTS
                 (SELECT Source.CustomerNum, Source.Planet
                  EXCEPT
                  SELECT Target.CustomerNum, Target.Planet)
          THEN
            UPDATE SET RowIsCurrent = 'N', Target.RowEndDate = Source.ChangeDate
           Output $ACTION ActionOut, Source.CustomerNum, Source.CustomerName, Source.Planet, Source.ChangeDate RowStartDate) AS MergeOut
  WHERE  MergeOut.ActionOut = 'UPDATE'

So, we’re filtering for only the expired records (Update) and outputting all of the columns, not just the Type 2 columns; because we’re going to use the results from this sub-query as the source of an insert statement. Before I share that portion of the code, I want to point out that this is another common error that people make when they write these statements. Suppose you appended the following to the beginning of Merge statement and ran it. You’d receive an error like the following:

Insert dbo.DimCustomer
  Select CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate from (Merge Stuff)

The target table ‘dbo.DimCustomer’
of the INSERT statement cannot be on either side
of a (primary key, foreign key) relationship when the FROM clause contains a nested INSERT, UPDATE, DELETE, or MERGE statement.
Found reference constraint ‘fk_FactSalesDetails….’

So, how exactly do we get around this? As my good friend Peter Schott points out, a good method is to first insert these records in a Temp Table, and then insert them into the Dimension from there. Let’s look at the code:

IF OBJECT_ID('tempdb..#DimCustomer') IS NOT NULL
    DROP TABLE #DimCustomer;
 
CREATE TABLE #DimCustomer(
  CustomerNum int Not Null PRIMARY KEY CLUSTERED, 
  CustomerName VARCHAR(25) Not Null,
  Planet VARCHAR(25) NOT NULL, 
  RowIsCurrent Char(1) Not NULL Default 'Y',
  RowStartDate datetime Not Null Default getdate(), 
  RowEndDate datetime Not NUll Default '12/31/9999');   
 
INSERT INTO #DimCustomer(CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate)
  Select CustomerNum, CustomerName, Planet, 'Y', RowStartDate, '12/31/9999'
  From   (Merge  dbo.DimCustomer AS Target
          USING  Staging.DimCustomer as Source
          ON     Target.CustomerNum = Source.CustomerNum
          WHEN NOT MATCHED BY TARGET
          THEN
            INSERT (CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate)
            Values (Source.CustomerNum, Source.CustomerName, Source.Planet, 'Y', Source.ChangeDate, '12/31/9999')
          WHEN MATCHED AND Target.RowIsCurrent = 'Y' 
          and EXISTS
                 (SELECT Source.CustomerNum, Source.Planet
                  EXCEPT
                  SELECT Target.CustomerNum, Target.Planet)
          THEN
            UPDATE SET RowIsCurrent = 'N', Target.RowEndDate = Source.ChangeDate
           Output $ACTION ActionOut, Source.CustomerNum, Source.CustomerName, Source.Planet, Source.ChangeDate RowStartDate) AS MergeOut
  WHERE  MergeOut.ActionOut = 'UPDATE';
 
Insert dbo.DimCustomer
  Select CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate from #DimCustomer;

As far as I’m concerned it is not absolutely necessary to define the Temp Table the way I have. I’ve seen it successfully done by using a Select * into #DimCustomer from….. Since my Merge statements are mostly generated automatically, I spell it out. Also, it should be noted that you could simply have another version of the destination table that acts as a temporary storage for these records. That way, you wouldn’t have to regenerate a Temp Table each time, though you would need to remember to Truncate the additional holding table. Anyway, Let’s test it with the following Type 2 Change. You’ll notice that I’m truncating the Staging table before a new Merge. Yoda moves to Degobah.

Truncate Table Staging.DimCustomer;
INSERT INTO Staging.DimCustomer(CustomerNum, CustomerName, Planet, ChangeDate)
VALUES (102,'Yoda', 'Degobah',  getdate()-97);

type_2_final

And we’ll test an insert as well. Suppose Darth Maul places an order from Coruscant.

Truncate Table Staging.DimCustomer;
INSERT INTO Staging.DimCustomer(CustomerNum, CustomerName, Planet, ChangeDate)
VALUES (104,'Darth Maul', 'Coruscant',  getdate()-97);

Run the Type 2 Merge code and you should see the following:
after_insert_final

So, I bet you’re thinking, “Well, that was a bit more complex than I expected, but at least that’s it. That is it, right? Right?” Well, not completely.

Transactions and Error Handling

The code above will work under the conditions I stipulated when we started. However, what happens if there’s an error? Well, when these statements are executed, I either want the entire Merge set (both statements) to succeed or fail together. I also tend to include code which records any errors which occur in a table for analysis. For the sake of this demo, I’ll just display them to the user. So, if we add that code to our Merge statements we have:

BEGIN TRY
Begin Transaction
 
IF OBJECT_ID('tempdb..#DimCustomer') IS NOT NULL
    DROP TABLE #DimCustomer;
 
CREATE TABLE #DimCustomer(
  CustomerNum int Not Null PRIMARY KEY CLUSTERED, 
  CustomerName VARCHAR(25) Not Null,
  Planet VARCHAR(25) NOT NULL, 
  RowIsCurrent Char(1) Not NULL Default 'Y',
  RowStartDate datetime Not Null Default getdate(), 
  RowEndDate datetime Not NUll Default '12/31/9999');   
 
/* Type 2 Merge */
INSERT INTO #DimCustomer(CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate)
  Select CustomerNum, CustomerName, Planet, 'Y', RowStartDate, '12/31/9999'
  From   (Merge  dbo.DimCustomer AS Target
          USING  Staging.DimCustomer as Source
          ON     Target.CustomerNum = Source.CustomerNum
          WHEN NOT MATCHED BY TARGET
          THEN
            INSERT (CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate)
            Values (Source.CustomerNum, Source.CustomerName, Source.Planet, 'Y', Source.ChangeDate, '12/31/9999')
          WHEN MATCHED AND Target.RowIsCurrent = 'Y' 
          and EXISTS
                 (SELECT Source.CustomerNum, Source.Planet
                  EXCEPT
                  SELECT Target.CustomerNum, Target.Planet)
          THEN
            UPDATE SET RowIsCurrent = 'N', Target.RowEndDate = Source.ChangeDate
           Output $ACTION ActionOut, Source.CustomerNum, Source.CustomerName, Source.Planet, Source.ChangeDate RowStartDate) AS MergeOut
  WHERE  MergeOut.ActionOut = 'UPDATE';
 
Insert dbo.DimCustomer
  Select CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate from #DimCustomer;   
 
 /* Type 1 Merge */
MERGE  dbo.DimCustomer AS Target
 USING Staging.DimCustomer AS Source
    ON Target.CustomerNum = Source.CustomerNum
WHEN MATCHED AND EXISTS
                    (SELECT Source.CustomerNum, Source.CustomerName
                     EXCEPT
                     SELECT Target.CustomerNum, Target.CustomerName)
THEN
   UPDATE SET Target.CustomerName = Source.CustomerName;
Commit
END TRY
 
BEGIN CATCH
 
  IF @@TRANCOUNT > 0
     ROLLBACK
 
    SELECT
        ERROR_NUMBER() AS ErrorNumber
        ,ERROR_SEVERITY() AS ErrorSeverity
        ,ERROR_STATE() AS ErrorState
        ,ERROR_PROCEDURE() AS ErrorProcedure
        ,ERROR_LINE() AS ErrorLine
        ,ERROR_MESSAGE() AS ErrorMessage;
       RETURN;
END CATCH

To test this, let’s deliberately cause an error. We’ll add two records for Yoda. If you’ll remember, one of the stipulations above was that we would have only 1 changed record per entity per merge. Run the following code and re-run our final merge statement.

Truncate Table Staging.DimCustomer;
INSERT INTO Staging.DimCustomer(CustomerNum, CustomerName, Planet, ChangeDate)
Values
(102,'Yoda', 'Kashyyyk',  getdate()-94),
(102,'Yoda', 'Bespin',  getdate()-93);

This should return an error like the following:

The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.

Feel free to take the code samples and test out merging records with both Type 1 and Type 2 changes. Also, I’m constantly refining my processes, including Merge so if anyone has suggestions for improvement, I’d love to hear them.

In upcoming articles, we’ll cover:

  • A detailed explanation of how historical Data Warehouse loads (should) work.
  • How to perform those kinds of loads with T-SQL Merge. This will also handle incremental load situations where more than one Type 2 change may occur between extracts.
  • Merge statement performance tuning.
  • How to optimally Load Fact Tables using Merge.
  • Options for T-SQL Merge statement “automatic” generation. This will include 3rd party products, free open source projects, and my own generator code. I have T-SQL Code which generates Merge statements, but I’m also generating them using BIML Script.

Stay Tuned!

Related posts:

6 comments to How to Properly Load Slowly Changing Dimensions using T-SQL Merge

  • Patrick Ge

    Thanks. Very nice article. Looking forward to your upcoming posts.

  • Suhail Ali

    How would you know which record in the merge statement caused the error? What if I wanted to write out all the bad records into an error table for analysis?

  • […] How to Properly Load Slowly Changing Dimensions using T-SQL Merge […]

  • Lovemore Nyere

    Thank you. this is a very nice article.

    I’m busy working on SSAS cube solution- multidimensional model. i need to model a many to many relationship where both dimensions are slowly changing. The Fact table has monthend data (snapshot measures).

    I used the following process:
    1. use your approach (with slight modification to handle reloads) to model the Dim1 and Dim2 dimensions. Dim1 and Dim2 are slowly changing and between them, there is a many to many relationship.
    2. Create at bridge table between Dim1 and Dim2 with fields: Dim1Key, Dim2Key, MonthEndDateId. The bridge table is handling the relationship between the 2 tables;
    3. Join Dim1 to the Fact table: Dim1 join results in distinct records in the Fact table;
    4. In SSDT Dimension Usage, i included a relationship on the Date dimension joining to the Bridge table on MonthEndDateId. this gives me the ability to see changes in Dim1 and Dim2 over time.

    Its working fine, but i wanted to know of a better implementation of this solution. If i denormalise the dimension, each record in the Fact table gets multiplies about 20 times.

    Any advice will be greatly appreciated.

  • Ron Romero

    Great article, you’ve convinced me that sp is the way to go; not only for scd, but for all ETL tasks. I’m sure Microsoft will get rid of SSIS in about 5 years and everything will have to be redone anyway.

    Yoda? Skywalker? You NERD…

Leave a Reply

 

 

 

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>