
Upcoming Presentations - NTSSUG and SQL Saturday OKC

Tonight at the North Texas SQL Server User Group

This is just a quick reminder that I’m presenting another hour in my continuing series Data Warehousing for the North Texas SQL Server User Group. We’ve already covered Dimension Tables in depth, so tonight I’ll be presenting, “Fact Tables – Show Me The Money!” I’ll explain the fundamentals of Fact Table design and demonstrate how to load them with SSIS.

The PowerPoint Slides, SSIS Project, and T-SQL Setup code can be downloaded here.

Also presenting tonight will be Devin Knight (Blog/Twitter). Devin will be presenting, “Getting to Know Power Query.” Devin’s presentations are always great so I urge everyone in the area to attend.

SQL Saturday 223 – Oklahoma City

I’ll be giving the following presentations in OKC on Saturday August 24th. I’ve given both of these before, but I am always improving them.

Have You Got the Urge to (T-SQL) Merge?

You've got data to load into an existing table, perhaps a slowly changing dimension, and you need a method that is simple, supportable, and performs well. The T-SQL Merge statement, introduced in SQL 2008, can do all of this and more. Many DBAs investigate it, but find it unintuitive or time consuming, or they can't get it to perform well. This doesn't need to be difficult; after attending this session you'll thoroughly understand the fundamentals of merge statements and how to write them. You will also leave with scripts which automatically write well formed merge statements for you in less than 1 minute, guaranteed. Come see how.

Data Warehouse Mistakes You Can’t Afford to Make

Many data professionals understand the basics of Data Warehouse design, including Dimension and Fact Tables, slowly changing Dimensions, and the use of meaningless surrogate keys. However, it isn’t until you’ve created a dimensional model and put it into production, that you realize just how much of an impact seemingly trivial mistakes can make. They can hobble performance, allow inaccuracy, and perhaps worst of all, inhibit adoption and usage of the new system. Learn how to avoid many common mistakes, from someone who’s made them and then found ways to correct them.

I’ve received high marks on both of these presentations for good content and funny delivery. Nobody talks about Data Warehousing the way I do, and I’m sure a good time will be had by all.

If you haven’t signed up for SQL Saturday OKC you can find more information about it on their website.

Hope to see some of you there!

Stop Clowning Around and use Biml

Sorry for my terrible photography.

Clowning Around

Friday, I took my girlfriend to see The Greatest Show On Earth. We love the animal acts, stunts, and the over-the-top atmosphere. However, I am always torn when watching these kinds of performers. On one hand, as anyone who's seen me present can attest, I love performing; there's something magical about the energy you get from an audience. On the other hand, even though circus performers are obviously talented, by necessity they have to perform the same short routines over and over again. Take the clowns, for instance. While I know that making children laugh is fun and would be an amazing vocation, personally I would be bored performing the same old jokes and visual gags repeatedly. How many times can one "accidentally" trip over a partner's broom or get squirted by the lapel flower?

I feel the same way about the acrobats, trapeze artists, and even the Lion Tamer. When I perform repetitive tasks, I lose focus and my mind wanders. Also, while many of us really like our jobs, let's face it: is writing the same basic SSIS package over and over again nearly as fun as flying on a trapeze? If you've ever loaded a Data Warehouse with SSIS, you know that loading Fact Tables is a fairly simple and repetitive process. Writing the ETL packages to load 20 Fact Tables can easily consume a week or more of development time, and it is very tedious. Does the manual process of writing these packages really add value?

Run The Circus

Be The Ringmaster

Biml can make you the Ringmaster

I'd rather be in control of the entire show. Using Biml Script, you can create a durable framework which facilitates code re-use. This allows you to automate all the repetitive, mundane tasks and speed up development. You can spend your time gaining a better understanding of your client's needs and designing effective solutions for them, rather than manually writing and re-writing an endless parade of SQL Server objects (stored procedures, views, etc.) and SSIS packages.

However, this process is not all laughs and cotton candy. Keep in mind that in order to excel as a Circus Ringmaster, you must know how to be a clown, acrobat, animal trainer, etc., or at least understand what it takes to do those jobs. There just aren't a lot of ways to shortcut around that. Similarly, in order to create these ETL frameworks and fully implement Biml Script, you need a comprehensive understanding of all the individual pieces involved. Biml is NOT a magic bullet and isn't a "Data Warehouse in a Box."

The journey to Biml Script mastery is not easy, but is achievable, fascinating, and completely worth pursuing. Stay tuned for more articles on why I chose to implement Biml Script as well as code samples you can use to further your efforts.

Why I Choose BIML Scripting

Welcome to a new series of articles, about Biml and Biml Scripting. Most of these will be fairly short personal essays about why I am investing my time in Biml Scripting. I’m hoping these will resonate with you and encourage you to start a similar journey. As I publish related articles, I will add them below.

What is Biml Scripting?

I’m sure some of you have never heard of Biml. Rather than re-write what Andy Leonard (Blog/Twitter) already has, I’ll just link you to the 2nd Step in the Stairway to Biml series for details. In a nutshell Biml, which is short for Business Intelligence Markup Language, is a published dialect of XML which can be used to generate SQL Server objects such as tables and views, SSIS Packages, and SSAS Cubes.

Biml is fairly easy to read and write using the intellisense built into SQL Server Data Tools (SSDT) and BIDS Helper, which is available as a free download. You can then use BIDS Helper to generate a subset of these objects for free. Using Biml alone can greatly improve productivity, but the real magic happens when you embed code snippets into Biml and automate repetitive tasks. This is called Biml Scripting. You can implement Biml Script using VB.Net or C#, though most of the examples tend to be written in C#. I've been working on my C# skills for some time, so all of my code samples are likely to follow suit.

Varigence, the company which created Biml, also offers a commercial product to assist in writing Biml Script. Mist is a Biml IDE which makes writing Biml Script much more efficient and easier to understand, especially for complex projects. Unless otherwise specified, all of my examples will work just fine using BIDS Helper and cost nothing to implement but your time.

Series – Why I Choose Biml Scripting and You Should Too

How to Properly Load Slowly Changing Dimensions using T-SQL Merge

One of the most compelling reasons to learn T-SQL Merge is that it performs Slowly Changing Dimension handling so well. This post is the fourth in a series called Have You Got the Urge to Merge?. This post builds on information from the other three, so I suggest you stop and read those before continuing, particularly the last one What exactly are Dimensions and Why do They Slowly Change? Go ahead… I’ll wait.

For most SSIS developers, their first experience loading Slowly Changing Dimensions is the SSIS Slowly Changing Dimension Wizard. It's fairly easy and straightforward to use, but offers painfully poor performance. There are also 3rd party SQL Server Integration Services (SSIS) SCD options available, but I've tested all of them and never found one that I was happy with. As you can probably guess, I prefer to use T-SQL Merge statements in an SSIS Execute SQL Task. There are several reasons for this.

What are the Pros and Cons of using Merge to load Slowly Changing Dimensions?

Pros

  1. T-SQL Merge is among the best performing options for loading Slowly Changing Dimensions.
  2. Merge statements are very versatile and allow for a high degree of customization. This is important when you run into a scenario which deviates from standard data warehousing practices.
  3. The Merge source can be a table, view, common table expression, etc. In fact, the Merge destination (or target) could be a view as well.

Cons

  1. The Source and Destination should be on the same SQL instance. Yes, you can perform Merge with linked servers, but I wouldn’t do so.
  2. Merge statements can be wordy and tedious to write, as you’ll see in a minute. Merging a couple of tables with 20+ columns can be described as tedious at best.
  3. Merge performance relies upon proper indexing and setup. Small mistakes can result in drastically reduced performance.
  4. The initial load of a table can be painful since all of the information is typically written twice, once in staging and then again in the destination table.

Merge statements can be confusing at first so there’s a bit of a learning curve. However, once you get the hang of it, I think you’ll agree that Merge is a great way to go.

When discussing Merge today we are going to stipulate the following:

  1. Each entity in our Dimension will have 1 active record which will be designated by a RowIsCurrent (or another variation) field and/or RowStartDate and RowEndDate.
  2. Entities in our Dimension will be continuous. So, for our DimCustomer, we’ll assume that we don’t have customers cancelling their accounts and then re-starting them. There will be no gaps in time.
  3. On each load, there will only be 1 record per entity to Merge. In other words, for a particular customer, there will only be 1 changed record per load. You’ll notice that my code includes a field called “RankNo” which handles situations where this is not the case such as initial loads. I’ll dive deeper into that process in a future article.

Let’s Get to the Code!

First, download the following file, which will create a database called JediMergeDemo and the necessary schema and tables. For those who read my last article, these tables and data will look very familiar. As in the previous article, CustomerName will be a Type 1 change, while Planet will be Type 2. I'm including the Fact Table, even though we aren't going to load it, because Foreign Key relationships affect how you write your Merge statements.

simple_jedi_star
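In case you'd rather follow along without downloading the file, here is a rough sketch of the setup, inferred from the inserts and Merge statements later in this post. The column types, the defaults, and especially the Fact Table definition (including its name and the foreign key name) are my assumptions; the actual script may differ.

CREATE DATABASE JediMergeDemo;
GO
USE JediMergeDemo;
GO
CREATE SCHEMA Staging;
GO
-- Dimension table: CustomerKey is the surrogate key, CustomerNum the business key.
CREATE TABLE dbo.DimCustomer (
  CustomerKey   int IDENTITY(1,1) NOT NULL CONSTRAINT PK_DimCustomer PRIMARY KEY,
  CustomerNum   int          NOT NULL,
  CustomerName  varchar(25)  NOT NULL,
  Planet        varchar(25)  NOT NULL,
  RowIsCurrent  char(1)      NOT NULL DEFAULT 'Y',
  RowStartDate  datetime     NOT NULL DEFAULT getdate(),
  RowEndDate    datetime     NOT NULL DEFAULT '12/31/9999');
 
-- Staging table: one changed record per customer per load.
CREATE TABLE Staging.DimCustomer (
  CustomerNum   int          NOT NULL,
  CustomerName  varchar(25)  NOT NULL,
  Planet        varchar(25)  NOT NULL,
  ChangeDate    datetime     NOT NULL);
 
-- Hypothetical Fact Table, included only because its foreign key to the
-- Dimension affects the Merge error discussed further down.
CREATE TABLE dbo.FactSalesDetail (
  SalesDetailKey int IDENTITY(1,1) NOT NULL CONSTRAINT PK_FactSalesDetail PRIMARY KEY,
  CustomerKey    int   NOT NULL CONSTRAINT fk_FactSalesDetail_DimCustomer
                       REFERENCES dbo.DimCustomer (CustomerKey),
  SalesAmount    money NOT NULL);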

First, let’s pre-load our Dimension table and also our Staging table. We’ll start by loading a record that requires a Type 1 change in staging. Obi-Wan Kenobi changes his name to Ben Kenobi before he goes into hiding.

INSERT INTO dbo.DimCustomer
(CustomerNum, CustomerName, Planet,  RowStartDate) 
VALUES (101,'Anakin Skywalker', 'Tatooine',   getdate() - 101),
       (102,'Yoda', 'Coruscant',  getdate() - 100),
       (103,'Obi-Wan Kenobi', 'Coruscant',  getdate() - 100)
 
INSERT INTO Staging.DimCustomer(CustomerNum, CustomerName, Planet, ChangeDate)
VALUES
(103,'Ben Kenobi', 'Coruscant',  getdate() - 99)

base_dim_customer_jedi

Type 1 Merge

Unfortunately, using T-SQL Merge to process Slowly Changing Dimensions typically requires two separate merge statements. Also, it's important to note that I'm covering the Type 1 Merge process first because it is the simplest to understand. In a real world scenario, I typically handle the Type 1 Merge step AFTER I process Type 2. I'll explain why in a future article. So, did you read the previous articles on Merge? If you did, the Type 1 Merge code will look eerily similar. The Type 1 Merge statement is simple because all it does is check for changes to the Type 1 columns of the dimension. If any of those change, an update is run with the new value against every row for that entity. As I mentioned earlier, we only have a single Type 1 column, CustomerName. The Type 1 Merge statement is as follows:

MERGE  dbo.DimCustomer AS Target
 USING  Staging.DimCustomer as Source
    ON Target.CustomerNum = Source.CustomerNum
WHEN MATCHED AND EXISTS
                    (SELECT Source.CustomerNum, Source.CustomerName
                     EXCEPT
                     SELECT Target.CustomerNum, Target.CustomerName)
THEN
   UPDATE SET Target.CustomerName = Source.CustomerName;

In case you haven’t read the previous articles, I’m using Select/Except to find changes while correctly handling NULLs. Also, I could have written this query as:

 USING (Select  CustomerNum, CustomerName From  Staging.DimCustomer) as Source

However, after extensive testing, even with a dimension where only 1 of many attributes is handled as Type 1, there was no difference in performance or query plan. So, Type 1 Merge is simple, right? Well, Type 2 Merge is a lot more difficult.

Type 2 Merge

Note: Properly handling Type 2 changes with T-SQL Merge is complex, and most people don't perform it correctly. In order to convey the basics, as well as the complexities, I'm going to present this in layers. If you simply want to see the end result, feel free to scroll down to the final statement.
So, before we tackle the Type 2 Merge, let’s review what our code needs to accomplish. To process a Type 2 change, our code must:

  1. Find the current record for each entity (per business key) in the Dimension. Most often this is done by finding the record where RowIsCurrent = 'Y' or some variation. However, in certain situations, you may need to find the current record by comparing the change date to RowStartDate and RowEndDate.
  2. If there are changes to any of the Type 2 columns for that current row, expire it. This is typically done by setting the RowIsCurrent value to ‘N’ as well as assigning the RowEndDate equal to the change date. In the last article, I explained that some people handle these dates differently.
  3. Insert a record with a new Surrogate Key for that Dimension record and insert all of the new values into it. The RowEndDate will be ’12/31/9999′, NULL, or some other value you choose.

The Merge Into portion of the Type 2 is just like the Type 1 Merge.

          Merge  dbo.DimCustomer AS Target
          USING  Staging.DimCustomer as Source
          ON     Target.CustomerNum = Source.CustomerNum

This is followed by a WHEN NOT MATCHED BY TARGET clause which inserts any new records. If my Staging table were durable, meaning it kept every record that had ever been loaded into the Dimension, I might have wanted to add a WHEN NOT MATCHED BY SOURCE section as well.

          WHEN NOT MATCHED BY TARGET
          THEN
            INSERT (CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate)
            Values (Source.CustomerNum, Source.CustomerName, Source.Planet, 'Y', Source.ChangeDate, '12/31/9999')

The previous snippet inserts brand new records, but what about handling the Type 2 expiration process? Well, this next snippet of code handles just the expiration, not the insertion of the newly created records.

          WHEN MATCHED AND Target.RowIsCurrent = 'Y' 
          and EXISTS
                 (SELECT Source.CustomerNum, Source.Planet
                  EXCEPT
                  SELECT Target.CustomerNum, Target.Planet)
          THEN
            UPDATE SET RowIsCurrent = 'N', Target.RowEndDate = Source.ChangeDate

As I mentioned previously, I could have found the current record via Begin and End Dates instead, like this:

          WHEN MATCHED AND Target.RowStartDate <= Source.ChangeDate and Source.ChangeDate < Target.RowEndDate 
          and EXISTS
                 (SELECT Source.CustomerNum, Source.Planet
                  EXCEPT
                  SELECT Target.CustomerNum, Target.Planet)
          THEN
            UPDATE SET RowIsCurrent = 'N', Target.RowEndDate = Source.ChangeDate

Now, this is one of the places Type 2 Merge statements get tricky. You cannot capture those expired records and re-insert their new versions from within the Merge statement itself. You must capture them with an Output clause and then filter for only the updated records. Enclose the entire statement (so far) in parentheses and alias the sub-query.

         (Merge  dbo.DimCustomer AS Target
          USING  Staging.DimCustomer as Source
          ON     Target.CustomerNum = Source.CustomerNum
          WHEN NOT MATCHED BY TARGET
          THEN
            INSERT (CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate)
            Values (Source.CustomerNum, Source.CustomerName, Source.Planet, 'Y', Source.ChangeDate, '12/31/9999')
          WHEN MATCHED AND Target.RowIsCurrent = 'Y' 
          and EXISTS
                 (SELECT Source.CustomerNum, Source.Planet
                  EXCEPT
                  SELECT Target.CustomerNum, Target.Planet)
          THEN
            UPDATE SET RowIsCurrent = 'N', Target.RowEndDate = Source.ChangeDate
           Output $ACTION ActionOut, Source.CustomerNum, Source.CustomerName, Source.Planet, Source.ChangeDate RowStartDate) AS MergeOut
  WHERE  MergeOut.ActionOut = 'UPDATE'

So, we're filtering for only the expired records (Update) and outputting all of the columns, not just the Type 2 columns, because we're going to use the results from this sub-query as the source of an insert statement. Before I share that portion of the code, I want to point out another common error that people make when they write these statements. Suppose you appended the following to the beginning of the Merge statement and ran it; you'd receive an error like the one shown below it:

Insert dbo.DimCustomer
  Select CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate from (Merge Stuff)

The target table 'dbo.DimCustomer' of the INSERT statement cannot be on either side of a (primary key, foreign key) relationship when the FROM clause contains a nested INSERT, UPDATE, DELETE, or MERGE statement. Found reference constraint 'fk_FactSalesDetails….'

So, how exactly do we get around this? As my good friend Peter Schott points out, a good method is to first insert these records into a Temp Table, and then insert them into the Dimension from there. Let's look at the code:

IF OBJECT_ID('tempdb..#DimCustomer') IS NOT NULL
    DROP TABLE #DimCustomer;
 
CREATE TABLE #DimCustomer(
  CustomerNum int Not Null PRIMARY KEY CLUSTERED, 
  CustomerName VARCHAR(25) Not Null,
  Planet VARCHAR(25) NOT NULL, 
  RowIsCurrent Char(1) Not NULL Default 'Y',
  RowStartDate datetime Not Null Default getdate(), 
  RowEndDate datetime Not NUll Default '12/31/9999');   
 
INSERT INTO #DimCustomer(CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate)
  Select CustomerNum, CustomerName, Planet, 'Y', RowStartDate, '12/31/9999'
  From   (Merge  dbo.DimCustomer AS Target
          USING  Staging.DimCustomer as Source
          ON     Target.CustomerNum = Source.CustomerNum
          WHEN NOT MATCHED BY TARGET
          THEN
            INSERT (CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate)
            Values (Source.CustomerNum, Source.CustomerName, Source.Planet, 'Y', Source.ChangeDate, '12/31/9999')
          WHEN MATCHED AND Target.RowIsCurrent = 'Y' 
          and EXISTS
                 (SELECT Source.CustomerNum, Source.Planet
                  EXCEPT
                  SELECT Target.CustomerNum, Target.Planet)
          THEN
            UPDATE SET RowIsCurrent = 'N', Target.RowEndDate = Source.ChangeDate
           Output $ACTION ActionOut, Source.CustomerNum, Source.CustomerName, Source.Planet, Source.ChangeDate RowStartDate) AS MergeOut
  WHERE  MergeOut.ActionOut = 'UPDATE';
 
Insert dbo.DimCustomer
  Select CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate from #DimCustomer;

As far as I'm concerned, it is not absolutely necessary to define the Temp Table the way I have; I've seen it successfully done using a Select * Into #DimCustomer From….. Since my Merge statements are mostly generated automatically, I spell it out. Also, it should be noted that you could simply have another version of the destination table that acts as temporary storage for these records. That way, you wouldn't have to regenerate a Temp Table each time, though you would need to remember to Truncate the additional holding table. Anyway, let's test it with the following Type 2 change. You'll notice that I'm truncating the Staging table before each new Merge. Yoda moves to Degobah.

Truncate Table Staging.DimCustomer;
INSERT INTO Staging.DimCustomer(CustomerNum, CustomerName, Planet, ChangeDate)
VALUES (102,'Yoda', 'Degobah',  getdate()-97);

type_2_final

And we’ll test an insert as well. Suppose Darth Maul places an order from Coruscant.

Truncate Table Staging.DimCustomer;
INSERT INTO Staging.DimCustomer(CustomerNum, CustomerName, Planet, ChangeDate)
VALUES (104,'Darth Maul', 'Coruscant',  getdate()-97);

Run the Type 2 Merge code and you should see the following:
after_insert_final

So, I bet you’re thinking, “Well, that was a bit more complex than I expected, but at least that’s it. That is it, right? Right?” Well, not completely.

Transactions and Error Handling

The code above will work under the conditions I stipulated when we started. However, what happens if there's an error? When these statements are executed, I want the entire Merge set (both statements) to either succeed or fail together. I also tend to include code which logs any errors to a table for later analysis. For the sake of this demo, I'll just display them to the user. So, if we add that code to our Merge statements, we have:

BEGIN TRY
Begin Transaction
 
IF OBJECT_ID('tempdb..#DimCustomer') IS NOT NULL
    DROP TABLE #DimCustomer;
 
CREATE TABLE #DimCustomer(
  CustomerNum int Not Null PRIMARY KEY CLUSTERED, 
  CustomerName VARCHAR(25) Not Null,
  Planet VARCHAR(25) NOT NULL, 
  RowIsCurrent Char(1) Not NULL Default 'Y',
  RowStartDate datetime Not Null Default getdate(), 
  RowEndDate datetime Not NUll Default '12/31/9999');   
 
/* Type 2 Merge */
INSERT INTO #DimCustomer(CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate)
  Select CustomerNum, CustomerName, Planet, 'Y', RowStartDate, '12/31/9999'
  From   (Merge  dbo.DimCustomer AS Target
          USING  Staging.DimCustomer as Source
          ON     Target.CustomerNum = Source.CustomerNum
          WHEN NOT MATCHED BY TARGET
          THEN
            INSERT (CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate)
            Values (Source.CustomerNum, Source.CustomerName, Source.Planet, 'Y', Source.ChangeDate, '12/31/9999')
          WHEN MATCHED AND Target.RowIsCurrent = 'Y' 
          and EXISTS
                 (SELECT Source.CustomerNum, Source.Planet
                  EXCEPT
                  SELECT Target.CustomerNum, Target.Planet)
          THEN
            UPDATE SET RowIsCurrent = 'N', Target.RowEndDate = Source.ChangeDate
           Output $ACTION ActionOut, Source.CustomerNum, Source.CustomerName, Source.Planet, Source.ChangeDate RowStartDate) AS MergeOut
  WHERE  MergeOut.ActionOut = 'UPDATE';
 
Insert dbo.DimCustomer
  Select CustomerNum, CustomerName, Planet, RowIsCurrent, RowStartDate, RowEndDate from #DimCustomer;   
 
 /* Type 1 Merge */
MERGE  dbo.DimCustomer AS Target
 USING Staging.DimCustomer AS Source
    ON Target.CustomerNum = Source.CustomerNum
WHEN MATCHED AND EXISTS
                    (SELECT Source.CustomerNum, Source.CustomerName
                     EXCEPT
                     SELECT Target.CustomerNum, Target.CustomerName)
THEN
   UPDATE SET Target.CustomerName = Source.CustomerName;
Commit
END TRY
 
BEGIN CATCH
 
  IF @@TRANCOUNT > 0
     ROLLBACK
 
    SELECT
        ERROR_NUMBER() AS ErrorNumber
        ,ERROR_SEVERITY() AS ErrorSeverity
        ,ERROR_STATE() AS ErrorState
        ,ERROR_PROCEDURE() AS ErrorProcedure
        ,ERROR_LINE() AS ErrorLine
        ,ERROR_MESSAGE() AS ErrorMessage;
       RETURN;
END CATCH

To test this, let’s deliberately cause an error. We’ll add two records for Yoda. If you’ll remember, one of the stipulations above was that we would have only 1 changed record per entity per merge. Run the following code and re-run our final merge statement.

Truncate Table Staging.DimCustomer;
INSERT INTO Staging.DimCustomer(CustomerNum, CustomerName, Planet, ChangeDate)
Values
(102,'Yoda', 'Kashyyyk',  getdate()-94),
(102,'Yoda', 'Bespin',  getdate()-93);

This should return an error like the following:

The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.
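One way to guarantee a single row per entity per load, roughly in the spirit of the RankNo field mentioned in the stipulations above (this is my own sketch, not the RankNo logic promised for a future article), is to rank the staged rows per customer and feed only the most recent change into the Merge. Keep in mind this simply discards the intermediate changes rather than preserving their history.

-- Keep only the most recent staged change per customer and use it as the Merge source.
WITH RankedStaging AS (
  SELECT CustomerNum, CustomerName, Planet, ChangeDate,
         ROW_NUMBER() OVER (PARTITION BY CustomerNum
                            ORDER BY ChangeDate DESC) AS RankNo
  FROM   Staging.DimCustomer)
SELECT CustomerNum, CustomerName, Planet, ChangeDate
FROM   RankedStaging
WHERE  RankNo = 1;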

Feel free to take the code samples and test out merging records with both Type 1 and Type 2 changes. Also, I'm constantly refining my processes, including Merge, so if anyone has suggestions for improvement, I'd love to hear them.

In upcoming articles, we’ll cover:

  • A detailed explanation of how historical Data Warehouse loads (should) work.
  • How to perform those kinds of loads with T-SQL Merge. This will also handle incremental load situations where more than one Type 2 change may occur between extracts.
  • Merge statement performance tuning.
  • How to optimally Load Fact Tables using Merge.
  • Options for T-SQL Merge statement “automatic” generation. This will include 3rd party products, free open source projects, and my own generator code. I have T-SQL Code which generates Merge statements, but I’m also generating them using BIML Script.

Stay Tuned!

EMAG TRESNI and Data Warehouse Implementation

Those who follow this blog know that I’m a huge video game geek. I have a life-sized Mario statue and a fairly large video game collection which includes about 700 original complete in box NES Games.

Games

Background

I also own almost 40 books about video game history. One of the best books I’ve read is The Ultimate History of Video Games: From Pong to Pokemon–The Story Behind the Craze That Touched Our Lives and Changed the World by Steven Kent. Even if you only casually played video games back in the day, it’s a great read.

In the book, Kent discusses the design of the Atari Lynx handheld, which attempted to compete with Nintendo's Game Boy. I'm sure most of my readers have never heard of the Atari Lynx since it was crushed by the Game Boy. The Lynx received rave reviews, had a much better full-color screen, and could link units together for multi-player. Even though the Game Boy was antiquated and completely out-classed, the Lynx lost the battle because, by this time, Nintendo was much better at marketing than Atari. There's an important lesson in there, but we'll leave that for another day.

Atari Lynx


On page 418, Kent relates a story about the inherent problems when designing complex systems like the Lynx. The engineering teams, despite being meticulous in their planning and documentation, made one minuscule error in the binary coding. When the Lynx was powered up, if there wasn’t a cartridge in the slot, the message “Insert Game” should have been displayed. Because of this one mistake, the screen displayed “EMAG TRESNI” or “Insert Game” backwards. The first version of the Lynx systems actually shipped this way. “EMAG TRESNI” became an inside joke for the team, because I guess it’s cooler than referring to Murphy’s Law.

Data Warehouse Implications

What does this have to do with Data Warehouse Implementation? At the time I read this, I was in the middle of a particularly hairy Data Warehouse implementation. The business leaders demanded a data warehouse as fast as possible and were unwilling to dedicate the necessary time to the investigative phase of the implementation. I'm referring to steps like:

  • Gathering Business Requirements
  • Determining Technical Requirements and Designing Architecture
  • Data Source Quality Testing

You know… Minor things like that. These steps can take a very long time, but they are absolutely necessary and rushing through them almost certainly guarantees project failure. Further, implementing an appropriate ETL Framework is essential. For example, if you don’t plan ahead for SQL Server Integration Services (SSIS) package failure, due to data errors and whatnot, you’ve set yourself up for a rude awakening some night at 3 am scrambling to correct a botched load. While I’ve never designed a piece of hardware like the Lynx, I have worked on some incredibly complex Data Warehouse implementations, and I can confidently say that EMAG TRESNI runs rampant in our efforts as well.

Until fairly recently, making changes to the dimensional model could be disastrous because making the necessary changes to the ETL was so time consuming and painful. How many times have you been in the middle of an implementation, had to make a change (for business or technical reasons), and then had to alter database objects such as stored procedures, views, etc.? Even worse, the process of opening all the relevant SSIS packages, often just to open specific tasks and refresh their metadata, kills productivity. Let's not forget that all that code needs to be checked out of and back into source control, quality assurance testing is necessary, etc.

Did you notice how I prefaced the previous paragraph with “Until fairly recently?” Well, I’ve been pursuing a mastery of BIML Scripting. I’m using it, along with my existing design patterns, to automate my Data Warehouse Development. Yes, I said automate. Stay tuned, as I’m about to begin sharing with you how BIML Scripting will change everything.

What exactly are Dimensions and Why do They Slowly Change?

As I was writing the articles on handling Slowly Changing Dimensions with T-SQL Merge, I realized that some background on Dimensions may be necessary first. Astute readers may remember that I briefly covered Dimension table basics a few years ago, but we need to go into a bit more detail on how to build them and how their structure supports Slowly Changing Dimensions.

What is a Dimension?

In general, Dimensions are tables in your data warehouse, composed of wide and descriptive columns, which give your data meaning. Some in the industry refer to the columns of a Dimension Table as “Dimensions” as well. The entities described by Dimension Tables are often physical in nature such as Customers, Patients, Products, Stores, etc. Identifying Dimensions and their attributes is easy when you know the trick. When you were in school, did you learn to do Word or Story Problems in Math class? I was taught that an important step in doing so was to translate the question into math operations by recognizing and translating key words.

Luckily spotting Dimensions is even simpler. Consider this story problem that your boss may give you:
“I want to know net sales in dollars by Product, Territory, and Month.”

When you hear this, you should immediately look for words like “by” or “per.” In this case, Product, Territory, and Month are Dimensions or attributes of them. Easy huh?

Remember that the primary driving force behind Data Warehouse initiatives is to simplify data analysis. In general, the fewer tables we have in our Data Warehouse, the easier analysis will be. Therefore, we combine similar attributes into Dimensional tables which tend to be denormalized and flattened. In the Story Problem above, you might have the following Dimension Tables.

  • DimProduct, which may be composed of fields like Product Name, Product Category, etc.
  • DimSalesTerritory, which may have fields like Territory Name, Territory Code, etc.
  • DimCalendar, the most common Dimension, which of course would have fields like Month Number, Month Name, Quarter, Year, Day Of Year, Day Of Week Name, etc.
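To make that translation concrete, the story problem above eventually becomes a query which joins a Fact Table to those Dimensions and groups by their attributes. The table and column names below are hypothetical; they're only meant to illustrate the shape of the result.

-- "Net sales in dollars by Product, Territory, and Month"
SELECT  p.ProductName,
        t.TerritoryName,
        c.MonthName,
        SUM(f.NetSalesAmount) AS NetSales
FROM    dbo.FactSales         AS f
JOIN    dbo.DimProduct        AS p ON p.ProductKey        = f.ProductKey
JOIN    dbo.DimSalesTerritory AS t ON t.SalesTerritoryKey = f.SalesTerritoryKey
JOIN    dbo.DimCalendar       AS c ON c.DateKey           = f.OrderDateKey
GROUP BY p.ProductName, t.TerritoryName, c.MonthName;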

Dimension Table Structure

Almost without exception, every Dimension should have a primary key which has no meaning, essentially a numbering column. The users of a Data Warehouse won't care about these Primary Keys, and in fact may never even see them. However, something like a Customer Number attribute has meaning, and points back to a specific customer. These meaningless keys, called Surrogate Keys, insulate the data warehouse from any prior business systems. They are typically an integer data type and are usually assigned by SQL Server through the Identity property; basically, as each new record is added to a Dimension, the next number is assigned to it. Dimensions join to related Fact tables, where the dollar amounts and other measures are kept, via the Surrogate Keys. These table groups form what is called a Star Schema, which you can see below.

Dimensions in Blue, Fact in Green, joined by Surrogate Keys.


So, in more detail, Dimensions need to use Surrogate Keys for the following reasons:

  1. The source database(s) may recycle its keys. If they do so, the uniqueness of your dimension primary key is violated.
  2. Multiple sources for the same dimensional entity likely have different keys. If you extract Customer information from a Customer Relationship Management system, Shipping Software, and your Billing System, an additional key field is necessary to unify it.
  3. If you acquire another company, which has its own set of keys, how could you incorporate them into your data warehouse without Surrogate Keys?
  4. Suppose one of your attributes is a ship date which isn't always known the first time a record is loaded. Surrogate keys allow a special value, such as "Not Available," to be substituted rather than a null.
  5. Surrogate Keys are compact which saves space, especially with indexes. To that end, you should be using the smallest integer datatype appropriate for these keys.
  6. Surrogate Keys facilitate easier tracking of historical changes. More on this in a bit.

In addition to the Surrogate Key, a Dimension will have one or more Business Keys, sometimes also referred to as Alternate Keys. These are fields, usually numeric or alpha-numeric (codes), from the source system(s) which identify each entity. Some Dimensional Modelers indicate the type of key in the field name. So, in a Customer Dimension you may see fields named Customer_SK (Surrogate Key), Customer_BK (Business Key), or Customer_AK (Alternate Key). Personally, I'm a big proponent of using naming conventions religiously. Unless I'm forced to deviate, my standard method is to name the Dimension DimCustomer, the surrogate key becomes CustomerKey, and the business keys are named exactly as the user understands them. It could be CustomerId, CustomerNumber, etc. One reason I do this is that I am unlikely to run into a source which uses the exact naming convention I use for my surrogates, and I dislike having to answer the same questions over and over again. "Umm… what does Customer_AK mean again?" I would urge you to follow a standard, but as long as you're consistent, feel free to create your own conventions.

Slowly Changing Dimension Maintenance Columns

Most Dimensions will have column(s) which handle SCD Changes. The columns you’ll typically find are some variation of RowisCurrent, RowStartDate, and RowEndDate. Not everyone uses the same fields, and some only use a subset of them. Jamie Thomson (Blog/Twitter) makes a very well reasoned argument for only using RowStartDate in his “Debunking Kimball Effective Dates” Part One and Part Two. I’ll weigh in on this in a future article, but for the sake of this discussion, I will be using all three.

  • RowIsCurrent – In my implementation this is a Char(1) field which contains 'Y' or 'N'. It's also common for modelers to use a bit (1 or 0) for this same purpose. This simply indicates whether a particular record is the Current or Active record for a particular entity.
  • RowStartDate – This is some form of a Date/DateTime column which indicates when the record first became current.
  • RowEndDate – This is some form of a Date/DateTime column which indicates when the record was no longer current or active. The row was expired at this point. For the active records, some modelers leave RowEndDate NULL. I choose to assign it to the maximum date for the data type. So, if I use datetime, I may assign this field as ’12/31/9999′.

Further, it should be noted that, when expiring a record, some prefer to set its RowEndDate to a tiny increment less than the RowStartDate of the next record for that entity. They do this because they prefer to write "Between" statements in their T-SQL. However, I have found that maintaining such a structure is problematic, so I almost always set the expired record's RowEndDate and the new record's RowStartDate to be equal.
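As a quick illustration of why the equal-dates approach still works cleanly, a point-in-time lookup can use a half-open date range and never miss or double-count a row. The column names follow the conventions above; the date value and the customer number are just examples.

-- Find the DimCustomer row that was in effect for a given customer on a given date.
DECLARE @AsOfDate datetime = '2013-06-01';
 
SELECT CustomerKey, CustomerNum, CustomerName, Planet, RowStartDate, RowEndDate
FROM   dbo.DimCustomer
WHERE  CustomerNum = 102
  AND  RowStartDate <= @AsOfDate
  AND  @AsOfDate    <  RowEndDate;   -- half-open range: no gaps, no overlaps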

General Indexing Strategies

There is a lot of conflicting information on the topic of how to structure and index Dimension Tables. The most basic disagreement is whether the Primary Key of a Dimension should be clustered or non-clustered. In most examples, you'll find the Primary Key being clustered, as in Microsoft's own AdventureWorksDW. Further, in their definitive work, The Microsoft Data Warehouse Toolkit: With SQL Server 2008 R2, the Kimball group espouses the same practice.

Also, Microsoft’s own Customer Advisory Team (SQL CAT) indicates that Dimensions should have clustered Primary Keys.

I tend to deviate and create a clustered index on the business key(s), which means the Primary Key must be non-clustered. The following articles available on the web explain why this is a good idea with SQL Server, and my own personal testing supports this practice.

Regardless of which method you choose, be sure to index on the Business Key(s) and Slowly Changing Dimension columns.
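For what it's worth, a rough sketch of that deviation in DDL might look like the following, assuming a DimCustomer shaped like the one used elsewhere in these posts which doesn't yet have its primary key defined. The constraint and index names are my own.

-- Non-clustered primary key on the surrogate key...
ALTER TABLE dbo.DimCustomer
  ADD CONSTRAINT PK_DimCustomer PRIMARY KEY NONCLUSTERED (CustomerKey);
 
-- ...and a clustered index on the business key plus SCD columns.
CREATE CLUSTERED INDEX cix_DimCustomer_CustomerNum
  ON dbo.DimCustomer (CustomerNum, RowStartDate, RowEndDate);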

Why do Dimensions Slowly Change?

One of the primary problems with source systems is that many of them don’t track history. For example, if a customer moves, many of these systems simply over-write the old address information in place. If you were reporting from this database, then any sales for this customer would also “move” to the new location, which may not be desirable. Slowly Changing Dimensions allow us to track history in the data warehouse. Since Dimension Tables join Fact Tables via the Surrogate key, we can have many records per business entity (customer in our case) with its history of Dimension Attributes.

It’s important to note that Dimension Tables themselves are rarely a specific type, rather we track the SCD type per Dimensional Attribute (column). Ralph Kimball, who many consider to be the father of Data Warehousing, creatively named the types of SCDs as Type 1, Type 2, etc.

  • Type 0 – Fixed attributes that shouldn’t change like a Customer’s birth date.
  • Type 1 – These columns don’t track history and simply over-write values in place when change is encountered.
  • Type 2 – A change in these columns causes expiration of the current row and a new row to be added with the newly changed value.

Fear not dear reader, if you are new to Dimensional Modeling the handling of these types will make sense with the screen shots provided below. There are additional types of Slowly Changing Dimensions, but they are beyond the scope of this article. The following articles by The Kimball Group explain Slowly Changing Dimensions in more depth:

dimcustomer_only

So as you can see, our simplified Customer Dimension has only two attributes. CustomerName will be handled as Type 1 and Planet as Type 2. While our example is extremely simple, it is also "realistic." In many real world systems, attributes like Customer Name are often Type 1, and Planet is analogous to a location or address, which are often processed as Type 2. Let's see an example of SCD changes in action. Suppose you had the following data in DimCustomer:
base_dim_customer_jedi

Now suppose that Obi-Wan Kenobi changed his name to Ben Kenobi before he went into hiding. Well, in our system, that’s a Type 1 change. So, after we process this change, we find the following:
type_1_fini

Be sure to note the following:

  1. If there had been several records for Obi-Wan, then every one of them would be updated to “Ben Kenobi”.
  2. Notice that the SCD fields, RowIsCurrent, RowStartDate, and RowEndDate, are unaffected.

Simple huh? Well, let's move on to Type 2. In our example, we only have one Type 2 column, which is Planet. For our Type 2 change, let's say that Yoda escapes to Degobah after failing to destroy Emperor Palpatine. I have no idea why Obi-Wan and Yoda didn't attempt to destroy him together later, since Vader was beaten and incapacitated; instead, the entire galaxy was left to live in tyranny for almost 2 decades. And if Obi-Wan loved Anakin, why on earth did he allow him to burn to death (as far as he knew) rather than mercifully ending him? Maddening. Umm, sorry, I digress. So, how do we process a Type 2 change like this? The basic steps are as follows:

  1. Find the current record for each entity (per business key) in the Dimension. Most often this is done by finding the record where RowIsCurrent = 'Y' or some variation. However, in certain situations, you may need to find the current record by comparing the change date to RowStartDate and RowEndDate.
  2. If there are changes to any of the Type 2 columns for that current row, expire it. This is typically done by setting the RowIsCurrent value to ‘N’ as well as assigning the RowEndDate equal to the change date. Again remember what I said previously that there are differing opinions on this.
  3. Insert a record with a new Surrogate Key for that Dimension record and insert all of the new values into it. The RowEndDate will be ’12/31/9999′, NULL, or some other value you choose.

So, what does this look like? Well, we know that the current row for Yoda will have a Type 2 change since he’s moving planets. So, we need to expire that row:
type_2_expire
Followed by the insertion of the new record for Yoda:
type_2_final

As you can see, Yoda still has the same CustomerNum (Business Key), but now he has 2 surrogate key values. To illustrate how important this is, consider our Dimension alongside a sample Fact table.
dim_customer_and_fact
Yoda's sales are the last three in that Fact table. Those with a Customer Key of 2 will be properly attributed to Coruscant, while the final one will be attributed to Degobah. That way, even if the source system updated the Planet in place, you still retain history in your data warehouse.
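A quick query shows that attribution in action. The Fact Table and measure names here are assumed (matching the setup sketch earlier on this page), since only a screenshot of the real table is shown.

-- Each sale picks up the Planet that was current when it occurred,
-- because the join uses the surrogate key rather than the business key.
SELECT  f.CustomerKey,
        d.CustomerNum,
        d.CustomerName,
        d.Planet,
        f.SalesAmount
FROM    dbo.FactSalesDetail AS f
JOIN    dbo.DimCustomer     AS d ON d.CustomerKey = f.CustomerKey
WHERE   d.CustomerNum = 102;   -- Yoda: two surrogate keys, two planets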

In the next article, I’ll show you how to properly process both Type 1 and Type 2 Dimension changes using T-SQL Merge.

Using the Output Clause with T-SQL Merge

The Output clause, first implemented in SQL Server 2005, can be used to return information for each row modified by an Insert, Update, Delete or Merge statement. This functionality greatly increases the power and usefulness of Merge, and is required in the processing of Slowly Changing Dimensions.

This post is the second in a series called Have You Got the Urge to Merge? and is a follow up to Writing T-SQL Merge Statements the Right Way. If you just happened upon this article, feel free to jump to the beginning and follow along through the entire series.

As usual, I feel the easiest way to learn something is by example. For the sake of simplicity, I’m going to continue using the same tables and code from the first article in the series. To set up, we’ll run the following code:

 USE TempDb;
GO
 
IF OBJECT_ID ('tempdb..#Customer_Orig') IS NOT NULL DROP TABLE #Customer_Orig;
IF OBJECT_ID ('tempdb..#Customer_New')  IS NOT NULL DROP TABLE #Customer_New;
 
 
CREATE TABLE #Customer_Orig
(  CustomerNum    TINYINT NOT NULL
  ,CustomerName   VARCHAR (25) NULL
  ,Planet         VARCHAR (25) NULL);
 
CREATE TABLE #Customer_New
(  CustomerNum    TINYINT NOT NULL
  ,CustomerName   VARCHAR (25) NULL
  ,Planet         VARCHAR (25) NULL);
 
INSERT INTO #Customer_New (CustomerNum, CustomerName, Planet)
   VALUES (1, 'Anakin Skywalker', 'Tatooine')
         ,(2, 'Yoda', 'Coruscant')
         ,(3, 'Obi-Wan Kenobi', 'Coruscant');  
 
INSERT INTO #Customer_Orig (CustomerNum, CustomerName, Planet)
   VALUES (2, 'Master Yoda', 'Coruscant')
         ,(3, 'Obi-Wan Kenobi', 'Coruscant')
         ,(4, 'Darth Vader', 'Death Star');
 
SELECT * FROM #Customer_Orig Order by CustomerNum;
SELECT * FROM #Customer_New Order by CustomerNum;

When you run the code above, you should have the following tables. Remember that in the previous example, Customer_Orig and Customer_New started off being identical. I’ve skipped ahead to the point where the following changes were made to Customer_Orig in preparation for the Merge demo.

  1. The Darth Vader record was added to Customer_Orig.
  2. Yoda’s name was changed.
  3. Anakin Skywalker was deleted from Customer_Orig. I know that in the following screenshot I show the record in the table; I included it, and formatted it that way, to suggest that it was once there but has since been deleted. The effect will be to delete Anakin Skywalker from Customer_New.

Merge Changes To Base
Customer New

So, an appropriate Merge statement for these tables can be taken from the previous article as well. However, this time we will add an Output clause in its most basic form. I’ve used the T-SQL comment marks to separate the new section and help it stand out.

MERGE  #Customer_New AS Target
 USING #Customer_Orig AS Source
    ON Target.CustomerNum = Source.CustomerNum
WHEN MATCHED AND EXISTS
                    (SELECT Source.CustomerName, Source.Planet
                     EXCEPT
                     SELECT Target.CustomerName, Target.Planet)
THEN
   UPDATE SET
      Target.CustomerName = Source.CustomerName
     ,Target.Planet = Source.Planet
WHEN NOT MATCHED BY TARGET
THEN
   INSERT (CustomerNum, CustomerName, Planet)
   VALUES (CustomerNum, Source.CustomerName, Source.Planet)
WHEN NOT MATCHED BY SOURCE THEN DELETE
-------------------------------------
OUTPUT $action, inserted.*, deleted.*
-------------------------------------
;

When that code is run, you’ll receive the following:
basic_output

The results table may look confusing, but it'll make sense in a minute. First, the word OUTPUT is essentially a substitute for SELECT. Second, the $action variable obviously indicates the type of action performed on that row. The actions are Insert, Update, or Delete. If you've ever used Triggers, you'll know that they work the same way: when a record is modified, two temporary tables are created holding the Inserted and Deleted values for that record. If a record is updated, then it has a row in both of those tables.

Knowing this, you can interpret the results and see that our merge statement was effective and did make Customer_New identical to Customer_Orig. So, if you re-run the select statements from above, you see the following:
Result of Merge

Keep in mind that you don’t have to simply Output the values to the screen. You could insert those records into a physical table, temp table, or table variable as well. Next, let’s insert them into another Temp Table. Re-run the setup code above with the following modifications.

IF OBJECT_ID( 'tempdb..#CustomerChanges') IS NOT NULL DROP TABLE #CustomerChanges;
 
CREATE TABLE #CustomerChanges(
  ChangeType         NVARCHAR(10)
 ,CustomerNum        TINYINT NOT NULL
 ,NewCustomerName    VARCHAR(25) NULL
 ,PrevCustomerName   VARCHAR(25) NULL
 ,NewPlanet          VARCHAR(25) NULL
 ,PrevPlanet         VARCHAR(25) NULL
 ,UserName           NVARCHAR(100) NOT NULL
 ,DateTimeChanged    DateTime NOT NULL);
 
MERGE  #Customer_New AS Target
 USING #Customer_Orig AS Source
    ON Target.CustomerNum = Source.CustomerNum
WHEN MATCHED AND EXISTS
                    (SELECT Source.CustomerName, Source.Planet
                     EXCEPT
                     SELECT Target.CustomerName, Target.Planet)
THEN
   UPDATE SET
      Target.CustomerName = Source.CustomerName,
      Target.Planet = Source.Planet
WHEN NOT MATCHED BY TARGET
THEN
   INSERT (CustomerNum, CustomerName, Planet)
   VALUES (Source.CustomerNum, Source.CustomerName, Source.Planet)
WHEN NOT MATCHED BY SOURCE THEN DELETE
-------------------------------------
OUTPUT
   $ACTION ChangeType,
   coalesce (inserted.CustomerNum, deleted.CustomerNum) CustomerNum,
   inserted.CustomerName NewCustomerName,
   deleted.CustomerName PrevCustomerName,
   inserted.Planet NewPlanet,
   deleted.Planet PrevPlanet,
   SUSER_SNAME() UserName,
   Getdate () DateTimeChanged
    INTO #CustomerChanges
-------------------------------------
;
 
SELECT * FROM #CustomerChanges;

Output Refined
I added the last two columns because I often use the Output clause with Merge (and other DML statements) for auditing purposes.
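The same pattern works with plain DML statements, not just Merge. As a small, hypothetical example of the auditing idea, a delete against the demo tables could be captured into the same #CustomerChanges table like this:

-- Audit a plain DELETE by routing the deleted values into #CustomerChanges.
DELETE #Customer_New
OUTPUT 'DELETE',
       deleted.CustomerNum,
       deleted.CustomerName,
       deleted.Planet,
       SUSER_SNAME(),
       GETDATE()
  INTO #CustomerChanges (ChangeType, CustomerNum, PrevCustomerName, PrevPlanet,
                         UserName, DateTimeChanged)
WHERE  CustomerNum = 4;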

Now that we’ve covered the basics of Merge and the Output clause, our next article will cover how to use both to process Slowly Changing Dimensions.

Writing T-SQL Merge Statements the Right Way

In a previous article, I discussed Merge statement basics. However, in extensive testing I've come to realize that my article, like most articles I've read about Merge, leaves out or mishandles several important aspects. Rather than edit that article, I've decided to publish a series of articles which I hope will clear up some of these misconceptions. If you already read the original article, a good portion of this will be review, as I'm using the original as a base for this one.

Let’s start by glancing at the syntax portion of the Books Online T-SQL Merge Page. I’ll take the liberty of re-posting just the first 25% or so below.

[ WITH <common_table_expression> [,...n] ]
MERGE
[ TOP ( expression ) [ PERCENT ] ]
[ INTO ] <target_table> [ WITH ( <merge_hint> ) ] [ [ AS ] table_alias ]
USING <table_source> ON <merge_search_condition>
[ WHEN MATCHED [ AND <clause_search_condition> ]
THEN <merge_matched> ] [ ...n ]
[ WHEN NOT MATCHED [ BY TARGET ] [ AND <clause_search_condition> ]
THEN <merge_not_matched> ]
[ WHEN NOT MATCHED BY SOURCE [ AND <clause_search_condition> ]
THEN <merge_matched> ] [ ...n ]
[ <output_clause> ]
[ OPTION ( <query_hint> [ ,...n ] ) ]
;

<target_table> ::=
{
[ database_name . schema_name . | schema_name . ]
target_table
}

<merge_hint> ::=
{
{ [ <table_hint_limited> [ ,...n ] ]
[ [ , ] INDEX ( index_val [ ,...n ] ) ] }
}

Simple right? Great, I guess I’m done here…. No seriously, who can easily absorb that? So, what is Merge really and how do we use it?

T-SQL Merge Basics


In a nutshell, the Merge statement allows you to Insert, Update, or Delete data in an entity, referred to as the Target, with data from another entity called the Source. The entities are compared on fields which uniquely identify records in each, a Join if you will. Notice how I keep using the word entity rather than table; the reason is that the Target and Source could be many SQL Server objects such as Tables, Temp Tables, Views, Table Variables, or even Common Table Expressions. The Source could even be a complete Select statement. In this case, for the sake of simplicity, I'll use Temp Tables.

I think most people learn best from examples, by doing rather than reading descriptions of syntax, so I’ve provided a brief script to create the tables required for the following example.

USE TempDB;
GO
 
IF OBJECT_ID ('tempdb..#Customer_Orig') IS NOT NULL DROP TABLE #Customer_Orig;
IF OBJECT_ID ('tempdb..#Customer_New')  IS NOT NULL DROP TABLE #Customer_New;
 
CREATE TABLE #Customer_Orig
(  CustomerNum    TINYINT NOT NULL
  ,CustomerName   VARCHAR (25) NULL
  ,Planet         VARCHAR (25) NULL);
 
CREATE TABLE #Customer_New
(  CustomerNum    TINYINT NOT NULL
  ,CustomerName   VARCHAR (25) NULL
  ,Planet         VARCHAR (25) NULL);
 
INSERT INTO #Customer_Orig (CustomerNum, CustomerName, Planet)
   VALUES (1, 'Anakin Skywalker', 'Tatooine')
         ,(2, 'Yoda', 'Coruscant')
         ,(3, 'Obi-Wan Kenobi', 'Coruscant');
INSERT INTO #Customer_New (CustomerNum, CustomerName, Planet)
   VALUES (1, 'Anakin Skywalker', 'Tatooine')
         ,(2, 'Yoda', 'Coruscant')
         ,(3, 'Obi-Wan Kenobi', 'Coruscant');

So, I’ve created two temporary tables called Customer_New and Customer_Orig with identical data in each. In this case, Customer_New will be the Target and Customer_Orig will be the Source.
Merge Base
Now, we’re going to make the following changes to Customer_Orig. Notice that I’ve performed a single Insert, Update, and Delete.

-- Update Yoda's Name
UPDATE #Customer_Orig
   SET CustomerName = 'Master Yoda'
 WHERE CustomerNum = 2
-- Delete Anakin
DELETE #Customer_Orig
 WHERE CustomerNum = 1
--Add Darth
INSERT INTO #Customer_Orig (CustomerNum, CustomerName, Planet)
VALUES (4, 'Darth Vader', 'Death Star')

Merge Changes To Base

Now, being the geek that I am, I realize that Anakin became Darth Vader, which could have been seen as a change in name. However, Obi-Wan clearly states that Darth Vader betrayed and murdered Anakin, effectively becoming a new person. If that bothers you, then you’re a scruffy looking nerf herder.

Old School CRUD

Prior to SQL Server 2008, we could have accomplished this merge with the following code. Please note that I have used Joins in my example, which will align with my Merge code later.

--Process Updates  
Update Tgt
Set    Tgt.CustomerName = Src.CustomerName, Tgt.Planet = Src.Planet
FROM   #Customer_New Tgt Inner JOIN #Customer_Orig Src ON Tgt.CustomerNum = Src.CustomerNum
Where  Tgt.CustomerName <> Src.CustomerName Or Tgt.Planet <> Src.Planet -- Eliminates needless updates.
--Process Inserts
Insert Into #Customer_New
  SELECT Src.CustomerNum, Src.CustomerName, Src.Planet
  FROM   #Customer_Orig Src LEFT JOIN #Customer_New Tgt ON Tgt.CustomerNum = Src.CustomerNum
  Where  Tgt.CustomerNum is null;
--Process Deletes
Delete FROM Tgt
from        #Customer_New as Tgt LEFT JOIN #Customer_Orig Src ON Tgt.CustomerNum = Src.CustomerNum
Where       Src.CustomerNum is null;

This works, but it's less than optimal for a few reasons. First, writing those statements can be tedious, especially if this had been a typical table with 20+ fields to deal with. Second, this represents three separate units of work, and SQL Server processes them that way. Once you understand it, the T-SQL Merge statement is easier to write and can accomplish all of this in one transaction. It's basically a win-win.

Components of Merge Statements

So, let's break a Merge statement into its component parts to make it easy to understand. First, the Target and Source tables are specified along with the business key which identifies each record. This is the field that one would use in a join.

MERGE  #Customer_New AS Target
 USING #Customer_Orig AS Source
    ON Target.CustomerNum = Source.CustomerNum

The When Matched clause determines what will happen when records exist in both the Source and Target with the same CustomerNum. Notice the additional conditions I've added, which limit the updates to records where a value has actually changed. Strictly speaking, this isn't required, but without it every matched record in the target would be updated regardless of need, which wastes resources.

WHEN MATCHED  
                AND (Target.CustomerName <> Source.CustomerName 
                  OR Target.Planet <> Source.Planet)
Then 
  UPDATE SET --Updates Yoda's Name
      Target.CustomerName = Source.CustomerName,
      Target.Planet = Source.Planet

The When Not Matched by Target clause specifies what should be done with records in the Source that aren’t in the Target. The typical scenario is to insert records which are new. I could have added additional conditions or only added certain new records as well.

WHEN NOT MATCHED BY TARGET
THEN
   INSERT (CustomerNum, CustomerName, Planet) -- Inserts Darth
   VALUES (Source.CustomerNum, Source.CustomerName, Source.Planet)

The When Not Matched by Source clause specifies what should be done with records in the Target that aren’t in the Source. Keep in mind that if this was a Staging table which wasn’t comprehensive, perhaps the result of an incremental extraction, then you’d want to omit this portion of the statement.

WHEN NOT MATCHED BY SOURCE THEN DELETE; -- Deletes Anakin

Also, keep in mind that any Merge statement must be terminated with a semicolon. So, when you put your script together, it looks like the following. Go ahead and run it on your test data.

MERGE  #Customer_New AS Target
 USING #Customer_Orig AS Source
    ON Target.CustomerNum = Source.CustomerNum
WHEN MATCHED
                AND (Target.CustomerName <> Source.CustomerName
                     OR Target.Planet <> Source.Planet)
THEN
   UPDATE SET
      Target.CustomerName = Source.CustomerName
     ,Target.Planet = Source.Planet
WHEN NOT MATCHED BY TARGET
THEN
   INSERT (CustomerNum, CustomerName, Planet)
   VALUES (Source.CustomerNum, Source.CustomerName, Source.Planet)
WHEN NOT MATCHED BY SOURCE THEN DELETE;

Result of Merge

What about Null Values?


Ah, very astute of you to notice that, my young padawan. This is the first mistake that many people make. You’ll notice in the When Matched portion above that I also check whether a value changed before I run my update. However, depending on your settings, NULL values are not equal to each other. Therefore, a record will not be updated when either side of the comparison is NULL. A great explanation of NULL handling can be found here. First, let’s set up a NULL field comparison issue. Run the following update statement against the Customer_Orig table.

Update CO set Planet = NULL
  FROM #Customer_Orig CO
 WHERE CustomerNum = 3 -- Obi-Wan Kenobi

If you’ll re-run the original merge statement, and compare the tables, you’ll find the following:
Result of Merge without NULL Handling
Notice that Obi-Wan’s record in Customer_New has not been updated. There are a few ways I’ve seen online to deal with this. The first two methods each have problems, which I’ll briefly explain. The third method, using Select and Except, is the one I recommend if you’d like to skip ahead.

Adding “Is NULL” Statements
We could re-write the When Matched clause like this:

WHEN MATCHED
                AND (   Target.CustomerName <> Source.CustomerName
                     OR Target.CustomerName IS NULL
                     OR Source.CustomerName IS NULL
                     OR Target.Planet <> Source.Planet
                     OR Target.Planet IS NULL
                     OR Source.Planet IS NULL)

Technically this method works, in that it will update when one of your values is NULL. However, it’s really cumbersome to write; imagine writing a statement like that for 50 columns. Also, the resulting query plan is complex. Perhaps the biggest problem is that if both of your values are NULL, an update will happen anyway. Go ahead and substitute the code above and re-run the merge statement where one of the values is NULL. You should receive the message “1 row(s) affected.” However, if you repeatedly run the merge statement you will ALWAYS receive “1 row(s) affected,” again depending on your database settings. For these reasons, this method isn’t recommended.

Using ISNULL or Coalesce
Coalesce can also be used because it chooses the best datatype among its arguments based on datatype precedence. Therefore, we can use the same basic Coalesce expression for most comparisons without throwing an error, and the When Matched portion from above becomes the following:

WHEN MATCHED
                AND (COALESCE(TARGET.CustomerName, '') <> COALESCE(SOURCE.CustomerName, '') 
                  OR COALESCE(TARGET.Planet, '') <> COALESCE(SOURCE.Planet, ''))

This works as well, with some caveats. First, the second value in each Coalesce expression has to be something that should NEVER occur in that column, so an empty string ('') is most likely not a good choice. Second, notice how I emphasized that Coalesce will automatically cast most comparisons; there are some data types for which that cast will fail and which require you to use something other than a default string value. The nice thing about this method is that the T-SQL code is short and easy to understand.
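For instance, here’s a minimal sketch of what that looks like for a non-string column. The OrderDate column is purely hypothetical (it isn’t in our sample tables), and the sentinel date is just a value assumed never to occur in real data:

WHEN MATCHED
                AND (COALESCE(Target.CustomerName, '') <> COALESCE(Source.CustomerName, '')
                  -- OrderDate is a hypothetical datetime column; an empty string would implicitly
                  -- cast to 1900-01-01, so pick a sentinel that can never legitimately appear.
                  OR COALESCE(Target.OrderDate, '1753-01-01') <> COALESCE(Source.OrderDate, '1753-01-01'))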

Using Select and Except to Handle Nulls (Recommended)


I used the Coalesce method successfully for some time until I came across this excellent article by Paul White. In it, he correctly points out that when using Except or Intersect, NULL values are considered equal. Therefore, when comparing two fields that are both NULL, they evaluate as equal, and when comparing a NULL value to an actual value, they evaluate as not equal. He seems to favor using Not Exists and Intersect; however, using Exists and Except just makes more sense to my brain. Therefore, I’d write the query like the following. If you run this code against the original tables, as well as the NULL test, you’ll see that it works perfectly.

MERGE  #Customer_New AS Target
 USING #Customer_Orig AS Source
    ON Target.CustomerNum = Source.CustomerNum
WHEN MATCHED AND EXISTS
                    (SELECT Source.CustomerName, Source.Planet
                     EXCEPT
                     SELECT Target.CustomerName, Target.Planet)
THEN
   UPDATE SET
      Target.CustomerName = Source.CustomerName
     ,Target.Planet = Source.Planet
WHEN NOT MATCHED BY TARGET
THEN
   INSERT (CustomerNum, CustomerName, Planet)
   VALUES (Source.CustomerNum, Source.CustomerName, Source.Planet)
WHEN NOT MATCHED BY SOURCE THEN DELETE;

Limitations of Merge?


Of course, there are always limitations. The most important limitation is that both the Source and Target tables should reside on the same SQL Server instance. I’ve seen people use Merge with linked servers, but I wouldn’t recommend it.
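If your source really does live on another instance, one workaround is to copy the remote rows into a local staging table first and then merge from that local copy. Here’s a minimal sketch of the idea; the linked server name RemoteSrv and the four-part table name are assumptions for illustration only:

-- Stage the remote rows locally so the MERGE itself runs entirely on one instance.
IF OBJECT_ID('tempdb..#Customer_Stage') IS NOT NULL
    DROP TABLE #Customer_Stage;

SELECT CustomerNum, CustomerName, Planet
INTO   #Customer_Stage
FROM   RemoteSrv.SourceDB.dbo.Customer;   -- RemoteSrv is a hypothetical linked server

-- Then run the same MERGE shown above, simply substituting #Customer_Stage for #Customer_Orig as the Source.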

Another factor that might give you pause is that these statements are fairly complex and wordy. A long merge statement, such as one merging two tables with 25 fields each, is tedious to write, and it’s very easy to make a simple mistake. Well, stay tuned, because later in this series I’ll share some code which will practically write the statements for you.

In the next article in this series, we’ll discuss how to use the Merge statement with the Output clause, which is required to load Slowly Changing Dimensions.
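As a quick preview, here’s a minimal sketch of what adding the Output clause looks like. It is not the full Slowly Changing Dimension pattern; it just shows that OUTPUT goes after the last WHEN clause and can report what the merge did to each row via the built-in $action column:

MERGE  #Customer_New AS Target
 USING #Customer_Orig AS Source
    ON Target.CustomerNum = Source.CustomerNum
WHEN MATCHED AND EXISTS
                    (SELECT Source.CustomerName, Source.Planet
                     EXCEPT
                     SELECT Target.CustomerName, Target.Planet)
THEN
   UPDATE SET
      Target.CustomerName = Source.CustomerName
     ,Target.Planet = Source.Planet
WHEN NOT MATCHED BY TARGET
THEN
   INSERT (CustomerNum, CustomerName, Planet)
   VALUES (Source.CustomerNum, Source.CustomerName, Source.Planet)
WHEN NOT MATCHED BY SOURCE THEN DELETE
OUTPUT $action AS MergeAction,           -- 'INSERT', 'UPDATE', or 'DELETE' for each affected row
       inserted.CustomerNum AS NewCustomerNum,
       deleted.CustomerNum  AS OldCustomerNum;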

Changes in Priorities and Direction

Buy In

Everyone who knows me knows that I absolutely love Data Warehousing. I’m one of those fortunate few who finds their work fascinating, and I tend to dive head first into new projects.

I was recruited to join a team on a multi-terabyte data warehousing project late last summer. I was leery of taking the position because the project had already been attempted by numerous other teams, all of which had failed and subsequently no longer worked for the company. However, the Architect in charge was very persuasive, and the lure of a groundbreaking project, using some of the best toys out there, was something I couldn’t resist. I knew going in that this might be a short-term situation, but I calculated that the benefits outweighed the negatives.

After I started, it became obvious that the team and the project were in a state of flux. The Architect I previously mentioned left shortly after I started, leaving the team to push forward anyway. The executive in charge of the project was fond of saying, “Teams pick their own captains,” and that’s what happened here: I was soon placed in charge of the project as the new Architect.

I was running a team of a dozen in-house data professionals, as well as directing the efforts of world-class consultants in SSIS and Dimensional Modeling. Though I already knew these consultants personally, working directly with them was one of the highlights of my career so far. While I’ve already performed almost every individual role in the complete data warehouse life cycle, being in the position of Architect (and project leader) was particularly enlightening, since it gave me a deeper appreciation of the entire process.

It was an amazing, rewarding, and exhausting experience. I got the chance to prove my skills to some of the industry’s best, finding solutions to problems that had stymied some of the biggest names in our field. However, it was not all rainbows and butterflies.

Stress and Burnout

In many data warehouse projects, the technical challenges (though formidable) pale in comparison to the political and social challenges. As many of us have experienced, business decisions aren’t always based on technical reality. I worked 70 to 80 hours a week for more than six months trying to meet insanely aggressive deadlines, with no end in sight. I’ve always been a bit of a workaholic, but I had completely lost my work/life balance. Being the geek that I am, and with all of the learning that was taking place, I didn’t see the situation clearly. It was taking a toll on every other aspect of my life. I wasn’t able to sleep, I wasn’t eating properly, and I fell way behind in my personal and professional correspondence.

Moments of Clarity


My sister, Regan, and I in 2011


My sister became critically ill the third week of March, and I caught the first flight back home to Michigan to be with family. I stayed a week, but the doctors said the crisis would be long term, and they wouldn’t be able to provide a prognosis for weeks or months. I did not want to leave, but I had a multi-million dollar project with a large team that was counting on me in DFW. This incident put a spotlight on what I was doing to myself and my family due to overwork and lack of balance. After a good deal of quiet reflection, I gave two weeks notice to my previous employer near the end of March. Tragically, my sister died early Easter morning. She was only 41, had young children, and her loss devastated my entire family.

Lessons Learned

The death of my sister really put an exclamation point on the whole mess. I’ve learned a lot of lessons through this whole ordeal, and I’ll share some of them with you in the future. However, the most important thing I learned is that while your career is important, and what we do affects a lot of people, it isn’t as important as your physical and emotional well-being. Family commitments absolutely have to come first. No matter how fascinating the project is, I cannot neglect my personal life for very long. It’s not fair to myself, my family, or my clients.

New Beginnings

It’s always important to focus on what is good in our lives. My career continues to go well. As I was deciding to leave my previous gig, one of my independent consulting clients came to me needing some long-term help with their data warehouse project. It’s a fascinating project: terabytes of data, challenging text file imports, and I’m learning C# to boot. This gig should last for several months and has been perfect for me as I transition to the next phase of my career. This good fortune can be directly attributed to my work in the PASS Community. I want to sincerely thank my #SQLFamily for all of their help and support.

More to come in future posts.

Setup and Performance Issues with the Integration Services (SSIS) 2012 Catalog

For those who don’t know, I’m currently the Data Warehouse Architect for a large-scale and complex multi-tenant data warehousing project. Fortunately, the project is being created in SQL Server 2012, and I’m thrilled to be using the new features. Let me preface this blog post by admitting that my experience with the SSIS 2012 Catalog, SSISDB, is fairly limited, but then again most people are in the same boat since adoption of 2012 has been fairly slow.

However, I’ve run into some problems which aren’t getting a lot of attention yet, but I’m fairly certain they will in the near future. Let’s break them down.

Initial Catalog Setup Issues

I won’t explain step by step how to create the Integration Services Catalog, because others already have. However, you’ll notice from the following screenshot that very few settings are available when the catalog is created.
(Screenshot: the Create Catalog dialog)

I wish that Microsoft would upgrade the Create Catalog screen to enable the user to specify SSISDB settings upon creation. Instead, you have to right-click SSISDB under Integration Services Catalogs and select Properties after the fact. The following settings can and should be set:

  • Whether to Clean Logs Periodically – Defaults to True
  • Retention Period in Days – Defaults to 365
  • Default Logging Level – Defaults to Basic
  • Maximum Number of Old Versions Per Project – Defaults to 10
  • Whether to Periodically Remove Old Versions – Defaults to True

(Screenshot: the SSISDB Catalog Properties dialog)

I wish we could set those at create time. As I’ll discuss in a minute, the default values can be disastrous for many installations.
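In the meantime, you can at least see what your catalog is currently set to with a quick query against the catalog.catalog_properties view. A minimal sketch:

-- Check the current SSISDB catalog settings
SELECT property_name, property_value
FROM   SSISDB.catalog.catalog_properties
WHERE  property_name IN (N'OPERATION_CLEANUP_ENABLED',   -- clean logs periodically
                         N'RETENTION_WINDOW',            -- retention period in days
                         N'SERVER_LOGGING_LEVEL',        -- default logging level
                         N'MAX_PROJECT_VERSIONS',        -- old versions kept per project
                         N'VERSION_CLEANUP_ENABLED');    -- periodically remove old versions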

SSISDB Database Ignores Model Database Settings

For some reason, when you create the SSISDB Catalog, the corresponding SSISDB database ignores the Model database settings. It appears that no matter what my Model database is set to, SSISDB defaults to fixed sizes for the data file and log, a 10% growth rate, and the Full recovery model. I’ve reached out to Microsoft, but I haven’t received an answer as to why.

This has become problematic for us because we’ve deliberately set up our Model databases so that Simple recovery is used in our Development and QA environments, while Full recovery is used in our Production environments. I’ve had to set up periodic checks on all of our servers, looking for databases in Full recovery that aren’t getting transaction log backups, to avoid runaway T-Log growth issues.
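The exact check isn’t important, but a minimal sketch of the idea looks like the following; it lists databases in Full recovery whose most recent log backup (per the msdb backup history) is missing or stale. The 24-hour threshold is just an assumption for illustration.

-- Databases in FULL recovery with no transaction log backup in the last 24 hours
SELECT   d.name,
         d.recovery_model_desc,
         MAX(b.backup_finish_date) AS LastLogBackup
FROM     sys.databases d
         LEFT JOIN msdb.dbo.backupset b
                ON b.database_name = d.name
               AND b.type = 'L'                        -- log backups only
WHERE    d.recovery_model_desc = 'FULL'
GROUP BY d.name, d.recovery_model_desc
HAVING   MAX(b.backup_finish_date) IS NULL
      OR MAX(b.backup_finish_date) < DATEADD(HOUR, -24, GETDATE());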

The script I’ve used to reset the SSISDB database is as follows:

ALTER DATABASE [SSISDB] MODIFY FILE ( NAME = N'data', SIZE = 512000KB , FILEGROWTH = 1048576KB )
GO
ALTER DATABASE [SSISDB] MODIFY FILE ( NAME = N'log', SIZE = 512000KB , FILEGROWTH = 512000KB )
GO
ALTER DATABASE [SSISDB] SET RECOVERY SIMPLE WITH NO_WAIT
GO

SSISDB Performance Issues

I’ve come to the conclusion that the SSISDB database will not scale at all. I became painfully aware of this because of a problem we had in a “pseudo” Production environment. Our first nasty surprise was T-Log growth, because the database had incorrectly defaulted to the Full recovery model, which I fixed by shrinking the log. Before you mentally criticize me, dear reader, please keep in mind that this wasn’t an actual Production environment and the data being kept in SSISDB was inconsequential so far.

Another issue became apparent shortly afterward. We were running with the default settings: Basic-level logging, 365 days of retention, and 10 versions of the project being kept. Our SSIS project is already fairly large, with around 50 packages, but it was only being run periodically in this environment, approximately once or twice per week, so we weren’t even hitting it very hard. However, the SSISDB database was growing rapidly anyway. In this particular situation, it had grown to 60GB. While I realize that 60GB is not very large, SSISDB performance slowed to a crawl. In any of our environments, when SSISDB approaches 50GB the performance degrades dramatically. The first sign of this is that the built-in reports take longer and longer to return data, and eventually they seem to stop returning data at all.

Unfortunately, this pseudo-Production server was set up with very little hard drive space, only 100GB, so I started receiving free-space warnings after the SSISDB database reached 60GB.

I did a little research on reducing the size and performance issues with SSISDB. I lowered the Retention Period to 7 days, set the Default Logging Level to None, and reduced the Maximum Number of Old Project Versions to 1. We don’t need to save old versions of our project because we use TFS source control. You are using source control, right?

Keep in mind that changing your logging level to None is not advisable in most production environments; however, we are currently using a 3rd-party product to monitor our packages, so this isn’t an issue for us. I suspect that most of you will keep the Basic level, but you should still dial the retention window back and run the cleanup job often to avoid performance issues. Anyway, I changed my settings using the following script.

EXEC [SSISDB].[catalog].[configure_catalog] @property_name=N'MAX_PROJECT_VERSIONS', @property_value=1
EXEC [SSISDB].[catalog].[configure_catalog] @property_name=N'RETENTION_WINDOW', @property_value=7
EXEC [SSISDB].[catalog].[configure_catalog] @property_name=N'SERVER_LOGGING_LEVEL', @property_value=0
GO

Then I proceeded to start the “SSIS Server Maintenance Job” that is automatically generated when you create the catalog. Wow, that was a mistake.

Don’t Touch That Button


The job ran for two hours and blew up the transaction log, even though the database was in the Simple recovery model. I then started receiving emails that the server had completely run out of space. When I checked the error log, I found an error indicating that the log was out of space and a checkpoint record could not be written.

I immediately started researching and found this article about the cleanup job causing user packages to fail. If you read between the lines, you’ll see that SSISDB is not properly indexed to handle the cleanup job. In my case, I had to drop SSISDB, recreate it, and apply the script above to avoid similar issues in the future. Once again, this was not a huge problem for me because it wasn’t truly a Production environment, and we rely on a third-party product for our SSIS logging anyway. Don’t forget that even if you do change your settings, you must also schedule the cleanup job to run regularly. If you allow too much history to accumulate, you may suffer the same fate I did when running it.
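If you do decide to run the cleanup manually (ideally after reducing the retention window so it has far less history to purge), the job the catalog creates can be started from T-SQL. A minimal sketch:

-- Kick off the catalog cleanup job on demand; by default it is also scheduled to run daily
EXEC msdb.dbo.sp_start_job @job_name = N'SSIS Server Maintenance Job';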

My take on all of this is that SSISDB is not ready for prime time, and I don’t think I’m being unfair with that statement. The SSIS Catalog is a great addition to SSIS 2012, but I expect to be able to use it in Production, running packages every 5 or 15 minutes and retaining months’ worth of history, and have it perform regardless.

After writing this article, I ran across this excellent article on SSIS 2012 Catalog indexing; time permitting, I will implement his suggestions and report back. Are any of you having similar issues with your SSIS 2012 Catalog (SSISDB)?
