Archives

Extracting Historical Dimension Records using T-SQL

Just because you can extract every single historical record for a dimension doesn’t mean you should. In this article, we’ll cover three basic extraction “methods” and when each of them may be preferred.

As with many of my Urge to Merge Series articles, this one will be presented in layers. In this case, each layer is actually a fully functional method and will work with the T-SQL Merge code I’ll publish in the next article. However, each step reduces the number of records which must be processed.

Let’s Get to the Code!

First download the following file which will create a Jedi Merge Database and the necessary schema, tables, and seed data. Also, the Excel Spreadsheet I used to create the screenshots is included as well. The script actually builds on the same basic code I’ve been using for the entire series. In this case, I’ve added a source history table, placed in a schema called source. While I realize that your source will most likely not be inside your Data Warehouse, I’ve done this for simplicity and so I can use T-SQL to load data into the staging table without having to actually use SSIS. In a real world scenario, the Select statement against the source would be placed into a OLE DB Source of a SSIS Dataflow instead.

The seed data contains 27 records, covering 4 different customers. The most important one is Yoda, Customer Number 102. I tried to strike a balance between “realism” and simplicity by mixing the other customer records with Yoda’s, but if you want to simpify this as much as possible, you can either delete the records with a Customer Number other than 102 or filter them out.

Source Data Set

Source Data Set

Considerations

As I mentioned in a previous article, in order to use the code I propose, the following stipulations must be met.

  1. Each entity in our Dimension will have 1 active record which will be designated by a RowIsCurrent (or another variation) field and/or RowStartDate and RowEndDate.
  2. Entities in our Dimension will be continuous. So, for our DimCustomer, we’ll assume that we don’t have customers cancelling their accounts and then re-starting them. There will be no gaps in time.
  3. On each execution of the Merge statement, there will only be 1 record per entity to Merge. When we historically load Dimensions, this is clearly not the case. This is why I’ve included a column called RankNo which has been unused up to this point. In the next article, we will loop our Merge executions based on this column.

Method 1: Extract Entire History

We could simply extract all of the data as in the following query. Notice that I’ve used a the Row_Number function to obtain my Rank. I’m ranking them in reverse order of ChangeDate. The oldest records per entity (Customer) have the highest rank. This may seem counter-intuitive but it will make more sense when you see the looping Merge statements in the next article.

   SELECT CustomerNum
         ,CustomerName
         ,Planet
         ,ChangeDate
         ,Row_Number ()
             OVER (PARTITION BY CustomerNum ORDER BY ChangeDate DESC)
             RankNo
     FROM Source.CustomerHistory
   --Where CustomerNum = 102
Please excuse the date inconsistencies.

Please excuse the date inconsistencies (Yoda only).

For the sake of this demo I’ll simply append an Insert into to the above statement to load the staging table.

INSERT INTO Staging.DimCustomer (CustomerNum
                                ,CustomerName
                                ,Planet
                                ,ChangeDate
                                ,RankNo)
   SELECT CustomerNum
         ,CustomerName
         ,Planet
         ,ChangeDate
         ,Row_Number ()
             OVER (PARTITION BY CustomerNum ORDER BY ChangeDate DESC)
             RankNo
     FROM Source.CustomerHistory
   --Where CustomerNum = 102
   ORDER BY CustomerHistoryId

This will work with the Merge code provided in the next article, but even in this small sample, the Merge will loop 20 times, which is excessive. Keep in mind that this may NOT actually matter to us since we should only be performing the initial/historical load one time anyway. If the dataset is relatively small, or we can afford to be less efficient, we may not want to pursue the other methods defined below.

Method 2: Extract Only History Which Matters

Unless the mechanism which records history in the source only adds new records for the exact same set of attributes designated as Type 2 for the Data Warehouse, it would be helpful to limit our extract. So, how do we determine which records actually matter? We need to extract any record which meets the following criteria:

  • The first instance of each entity (Customer) encountered so we can set the original RowStartDate (or whatever you choose to call it).
  • Each record that contains one or more Type 2 changes.
  • We need to be able to capture the final value(s) for every Type 1 column. One way is to extract the last record for each entity regardless of whether it included a Type 2 change.

What’s an effective way of detecting Type 2 changes? I prefer to use SQL Server 2012 Window Functions. You can learn more about them at the link I provided but basically I’m using LAG to compare the current value of our only Type 2 column (Planet) to the previous record for that entity. I’m also LEAD to detect the final record for each entity. I’m having a little contest to find the best way to accomplish methods 2 and 3 for SQL 2005 through 2008R2. See below for details.

SELECT CustomerHistoryId
      ,CustomerNum
      ,CustomerName
      ,Planet
      ,LAG(Planet) over(Partition by CustomerNum Order by ChangeDate ) PrevPlanet
      ,ChangeDate
      ,CASE WHEN LAG (Planet) over(Partition by CustomerNum Order by ChangeDate ) is NULL-- First Record
or LAG(Planet) over(Partition by CustomerNum Order by ChangeDate ) <> Planet        -- Type 2 Field Value Changed
or LEAD(Planet) over(Partition by CustomerNum Order by ChangeDate ) is NULL         -- Last Value per Entity
then 'Y'
else 'N' end as Matters
From     Source.CustomerHistory
Order by CustomerHistoryId;

Notice that our Case When statement checks for a NULL value (first record for an entity), any record for that entity where the Planet of the previous record does not equal the current value, and using LEAD to find the last record for that entity.
jedi_merge_historical_load_matter
Using the Excel Spreadsheet I included in the download link above, you can easily auto-filter for only the records which matter.

Filtered for records which Matter.

Filtered for records which Matter.

How could we filter out the rows which don’t matter? Well, one way is to use the query as our data flow task OLE DB Source. We can then use a Conditional Split and only stage the records where Matters = ‘Y’. After that, we could run an update statement to set the RankNo of the remaining records in staging. Finally, we’d run the looping Merge in the next article and receive the exact same results as the first method, but with a lot fewer loops.

Method 3: Skip Staging and Merge Entirely

Wouldn’t it be nice if we could perform the Initial/Historical load without actually using Merge? That way we wouldn’t have to write every single relevant record to staging and re-write it (with some logical manipulation) again in the Merge statements. I recently worked on an Data Warehouse implementation where the data volume involved necessitated an alternative solution. In the following example, I’ve chosen to use a Temp Table, but could have used a Common Table Expression or a Table Variable instead. The thought process goes like this. If we can filter for only the records which matter, while at the same time retrieving the last value per entity (customer) for each Type 1 column, then we have no need to stage and Merge. We begin by Selecting the minimum number of fields necessary into the Temp Table which includes the ID field to uniquely identify each source row, any Type 1 columns, and our “Matters” calculated column. Notice that I’m using the SQL Server 2012 LAST_VALUE function to derive the last name used by each Customer.

IF OBJECT_ID('tempdb..#JediTemp') IS NOT NULL
    DROP TABLE #JediTemp
    Go
Select CustomerHistoryId TempCustomerHistoryId
       ,CASE WHEN LAG (Planet) over(Partition by CustomerNum Order by ChangeDate ) is NULL -- First Record
       or LAG(Planet) over(Partition by CustomerNum Order by ChangeDate ) <> Planet  -- Type 2 Field Value Changed
       then 'Y' else 'N' end as Matters
       ,LAST_VALUE(CustomerName) OVER (Partition by CustomerNum Order by ChangeDate
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) LastCustomerName
into #JediTemp
From     Source.CustomerHistory;

jedi_merge_historical_temp_table

So, now it’s easy to visualize that if we filter out the records which don’t matter, and then join to the original data set to get the Type 2 column(s), we can actually create our Initial load without resorting to staging and T-SQL Merge. In this case, I’ve appended an Insert Into the Dimension directly from the source. However, in a real world scenario, you’d use the Select query as the source of your Data Flow Task and designate the Dimension as the destination.

INSERT INTO dbo.DimCustomer
(CustomerNum, CustomerName, Planet, RowStartDate, RowEndDate, RowIsCurrent) 
SELECT CustomerNum
      ,LastCustomerName CustomerName
      ,Planet
      ,ChangeDate AS RowStartDate
      ,coalesce(LEAD(ChangeDate) over(Partition by CustomerNum Order by ChangeDate ), '12/31/9999') RowEndDate
      ,Case when LEAD(ChangeDate) over(Partition by CustomerNum Order by ChangeDate ) is NULL then 'Y'
       else 'N' End as RowIsCurrent
From     Source.CustomerHistory
join #JediTemp on CustomerHistoryId = TempCustomerHistoryId
and Matters = 'Y'
Order by CustomerNum;

jedi_dimcustomer_historic_load_final
In the next article, I’ll share T-SQL Merge code which will process the Initial/Historical Load of a Data Warehouse. The code will also handle an incremental load situation where more than one record per entity may be extracted per load.

A Contest with a Prize

As I mentioned above, I wrote this code using SQL 2012 Window Functions. I’d like to challenge my readers to submit code that will work in SQL Server 2005 through 2008R2. The winning submission will be chosen based on the following criteria:

  1. The code must be reasonably performant in most circumstances. I realize my example contains only 27 records but think about working with millions.
  2. Stage as little of the data as possible. Ideally, nothing would be written to disk.
  3. Needs to be a logical and repeatable pattern as I will be converting it to Biml Script.

Please submit your code to Made2MentorContests@gmail.com no later than Friday, August 30th and I will pick a winner over the weekend. The winner will receive a brand new copy of:

SQL Server 2012 Integration Services Design Patterns
I was one of the technical reviewers on that fine book so I recommend it highly. If the winner is a US Resident, I will mail them a physical copy of the book. For those outside the US I will provide a copy of the eBook. I’m sorry if that is inconvenient, but otherwise the international shipping costs could get out of hand.

Are there any questions?

1 comment to Extracting Historical Dimension Records using T-SQL

  • Geir F

    Hello!

    I read Your articles With great interests. I have a case here similar to Your examples explained in this articles. Also in my case I could have many items of the same entity in the same run like running Your initial data over and over again into dim.Customer. When running Your code, the Merge loop does not take any action if the row With identical Fields exists, it just inserts a New row With New customerkey. Is there a fast change in this code, to fix this?

    I hope I explained me clearly

    Regards Geir

Leave a Reply

 

 

 

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>