Archives

Presenting Tonight at North Texas SQL Server User Group

Hey folks, just a quick message that I’ll be presenting this evening at North Texas SQL Server User Group. I’m presenting an updated version of “Have you got the Urge to Merge?” which I gave last month in Fort Worth. I’ve made good progress with my code which generates T-SQL Merge Statements and will be able to show more of it tonight. I hope to post a beta version of the code next week. The abstract is below.

Have you got the Urge to Merge?

The T-SQL Merge statement, introduced in SQL 2008, is powerful, versatile, and performant. However, it is also one of T-SQL’s least understood statements. Most DBAs know they can use it to insert, update, and delete data in a target table in a single transaction, generally outperforming separate statements. Fewer know, however, that it can be used to load slowly changing dimensions, facilitate auditing, and serve many other purposes. Come to the user group meeting and learn the ins and outs of T-SQL Merge. I’ll even show you how to write complex Merge statements with lightning speed.

I hope to see some of you folks in Irving tonight.

What is the T-SQL Merge Statement and How do you use it?

I mentioned in a previous article that many posts on T-SQL Merge read like Books Online. Speaking of Books Online, let’s start by glancing at the syntax portion of the T-SQL Merge Page. I’ll take the liberty of re-posting just the first 25% or so below.

[ WITH <common_table_expression> [,…n] ]
MERGE
   [ TOP ( expression ) [ PERCENT ] ]
   [ INTO ] <target_table> [ WITH ( <merge_hint> ) ] [ [ AS ] table_alias ]
   USING <table_source>
   ON <merge_search_condition>
   [ WHEN MATCHED [ AND <clause_search_condition> ]
       THEN <merge_matched> ] [ …n ]
   [ WHEN NOT MATCHED [ BY TARGET ] [ AND <clause_search_condition> ]
       THEN <merge_not_matched> ]
   [ WHEN NOT MATCHED BY SOURCE [ AND <clause_search_condition> ]
       THEN <merge_matched> ] [ …n ]
   [ <output_clause> ]
   [ OPTION ( <query_hint> [ ,…n ] ) ]
;

<target_table> ::=
{
   [ database_name . schema_name . | schema_name . ]
   target_table
}

<merge_hint> ::=
{
   { [ <table_hint_limited> [ ,…n ] ]
   [ [ , ] INDEX ( index_val [ ,…n ] ) ] }
}

Simple, right? Great, I guess I’m done here… No, seriously, who can readily absorb that? So, what is Merge really, and how do we use it?

T-SQL Merge Basics?


In a nutshell, the Merge statement allows you to insert, update, or delete data in a table, referred to as the Target, using data from another table called the Source. The two tables are compared on the fields which uniquely identify records in each, a join if you will. That’s not to say the Source has to be an actual table; a view or a Select statement may be used as the source as well.

I think most people learn best from examples, by doing rather than reading, so I’ve provided a short script to create the tables required for the following example.
Slowly Changing Dimension Table Setup Scripts Jedi 1.4

Basically I’ve created two tables called DimCustomer and CustomerStaging with the following data in them:
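If you’d rather not grab the download, here is a minimal sketch of what that setup might look like. The table and column names (CustomerKey, CustomerName, Planet, Affiliation) come from the Merge statements later in this article; the Planet and Affiliation values are hypothetical stand-ins rather than the actual data from my script, and the extra slowly changing dimension columns are omitted.

CREATE TABLE dbo.DimCustomer
(
   CustomerKey    INT,          -- business key used to match the two tables
   CustomerName   VARCHAR (50),
   Planet         VARCHAR (50),
   Affiliation    VARCHAR (50)
)

CREATE TABLE dbo.CustomerStaging
(
   CustomerKey    INT,
   CustomerName   VARCHAR (50),
   Planet         VARCHAR (50),
   Affiliation    VARCHAR (50)
)

-- Hypothetical values for illustration only.
INSERT INTO dbo.DimCustomer (CustomerKey, CustomerName, Planet, Affiliation)
VALUES (1, 'Anakin Skywalker', 'Tatooine', 'Jedi'),  -- missing from the source
       (2, 'Yoda', 'Coruscant', 'Jedi'),             -- name changes in the source
       (3, 'Obi-Wan Kenobi', 'Coruscant', 'Jedi')    -- planet changes in the source

INSERT INTO dbo.CustomerStaging (CustomerKey, CustomerName, Planet, Affiliation)
VALUES (2, 'Master Yoda', 'Coruscant', 'Jedi'),      -- hypothetical new name
       (3, 'Obi-Wan Kenobi', 'Tatooine', 'Jedi'),
       (4, 'Darth Vader', 'Death Star', 'Sith')      -- new record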

Now, we’re using a CustomerStaging table for the sake of simplicity. However, I could have used a T-SQL query from an OLTP database or another source as the Source object in my merge. For this article, let’s assume there are no Slowly Changing Dimensions and ignore the fields in each table to the right of Affiliation.

There are several differences between Staging and DimCustomer which should be noted:

  • Anakin Skywalker is no longer in the source system.
  • Yoda’s name was changed.
  • Obi-Wan Kenobi changed planets.
  • Darth Vader was added.


Now, I realize that Anakin became Darth Vader, which could have been seen as a change in name. However, Obi-Wan clearly states that Darth Vader betrayed and murdered Anakin, effectively becoming a new person. If that bothers you, then you’re a scruffy-looking nerf herder. Anyway, if we used separate T-SQL statements, we’d delete Anakin, update Yoda and Obi-Wan, and add Darth Vader. With Merge, we can do all of that in one transaction. For me, the easiest way to learn something is to go through an example, so we’ll break the statement into sections to make it easy to understand.

First, the Target and Source tables are specified along with the business key which identifies each record. This is the field that one would use in a join.

MERGE  DimCustomer AS Target
 USING CustomerStaging AS Source
    ON Target.CustomerKey = Source.CustomerKey

The When Matched clause determines what will happen when records exist in both the Source and the Target with the same CustomerKey. Notice the additional conditions I’ve added. Strictly speaking, they aren’t required, but without them every record in the target would be updated whether it needed to be or not, which wastes resources.

WHEN MATCHED  
                AND (   Target.CustomerName <> Source.CustomerName -- These are optional. 
                     OR Target.Planet <> Source.Planet
                     OR Target.Affiliation <> Source.Affiliation)
THEN
   UPDATE SET --Updates Yoda and Obi-Wan records.
      Target.CustomerName = Source.CustomerName,
      Target.Planet = Source.Planet,
      Target.Affiliation = Source.Affiliation

The When Not Matched by Target clause specifies what should be done with records in the Source that aren’t in the Target. The typical scenario is to insert records which are new. I could also have added additional conditions to insert only certain new records; a quick illustration follows the code below.

WHEN NOT MATCHED BY TARGET
THEN
   INSERT (CustomerKey, -- Inserts Darth Vader
           CustomerName,
           Planet,
           Affiliation)
   VALUES (Source.CustomerKey,
           Source.CustomerName,
           Source.Planet,
           Source.Affiliation)
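Here is that quick, hedged illustration. It is not part of the example script, and the filter value is purely hypothetical; it just shows where an extra condition would go.

WHEN NOT MATCHED BY TARGET
                AND Source.Affiliation = 'Sith' -- hypothetical filter value
THEN
   INSERT (CustomerKey, CustomerName, Planet, Affiliation)
   VALUES (Source.CustomerKey, Source.CustomerName, Source.Planet, Source.Affiliation)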

The When Not Matched by Source clause specifies what should be done with records in the Target that aren’t in the Source. Keep in mind that if your Staging table isn’t comprehensive, perhaps because it’s the result of an incremental extraction, you’d want to omit this portion of the statement.

WHEN NOT MATCHED BY SOURCE THEN DELETE; -- Deletes Anakin

Also, keep in mind that any Merge statement must be terminated with a semicolon. So, when you put the whole script together, it looks like the following. Go ahead and run it against your test data.

MERGE  DimCustomer AS Target
 USING CustomerStaging AS Source
    ON Target.CustomerKey = Source.CustomerKey
WHEN MATCHED
                AND (   Target.CustomerName <> Source.CustomerName
                     OR Target.Planet <> Source.Planet
                     OR Target.Affiliation <> Source.Affiliation)
THEN
   UPDATE SET
      Target.CustomerName = Source.CustomerName,
      Target.Planet = Source.Planet,
      Target.Affiliation = Source.Affiliation
WHEN NOT MATCHED BY TARGET
THEN
   INSERT (CustomerKey,
           CustomerName,
           Planet,
           Affiliation)
   VALUES (Source.CustomerKey,
           Source.CustomerName,
           Source.Planet,
           Source.Affiliation)
WHEN NOT MATCHED BY SOURCE THEN DELETE;

What about Null Values?


Ah, very astute of you to notice that, my young padawan. You’ll notice in the When Matched portion that I also check to see whether a value changed before I run my update. However, depending on your settings, Null values are not equal to each other, so any record containing a Null would be updated even if nothing had changed. I handle this with Coalesce, which returns the data type of the argument with the highest data type precedence, so I can use the same basic Coalesce pattern for most comparisons without throwing an error. The When Matched portion from above therefore becomes the following:

WHEN MATCHED 
                AND (COALESCE(TARGET.CUSTOMERNAME, '') <> COALESCE(SOURCE.CUSTOMERNAME, '') 
                  OR COALESCE(TARGET.PLANET, '') <> COALESCE(SOURCE.PLANET, '')
                  OR COALESCE(TARGET.AFFILIATION, '') <> COALESCE(SOURCE.AFFILIATION, ''))

Limitations?


Of course, there are always limitations. The most important limitation is that both data sources need to be on the same SQL Server instance. I suppose you could use linked servers, but I wouldn’t recommend it.

Another factor that might give you some pause is that these statements are fairly complex and wordy. A long merge statement, such as merging two tables with 25 fields each, is tedious to write. Well, stay tuned because later in this series I’ll share some code which will practically write the statements for you.

In the next article we’ll discuss how to use the Output clause with Merge, and how I use it occasionally to perform auditing during data warehouse loads.
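As a quick, hedged preview of that article, the Output clause simply tacks onto the end of the Merge statement, replacing the bare semicolon, and reports what the statement actually did:

-- Hedged preview only: appended to the end of the Merge statement above.
OUTPUT $action              AS MergeAction,   -- INSERT, UPDATE, or DELETE
       Inserted.CustomerKey,
       Inserted.CustomerName,
       Deleted.CustomerName AS OldCustomerName;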

Any questions?

Got the Urge to Merge?


The T-SQL Merge statement, introduced in SQL 2008, is powerful, versatile, and performant. However, it is also one of T-SQL’s least understood statements. Most DBAs know they can use it to insert, update, and delete data in a target table in a single transaction, which generally outperforms separate statements. Fewer know, however, that it can be used to load slowly changing dimensions, facilitate auditing, and serve many other purposes. I briefly discussed how to use Merge some time ago, but I wanted to take a deeper dive into it because it is so darn useful.

Though the statement is documented on many sites, most of them cover it tersely or are little more than variations on Books Online, which I find difficult to read and absorb. So, in the following articles, we’ll cover these basic topics:

  1. What is the Merge statement and how do you use it? (This post was my original attempt; I leave the link here for tracking purposes, but most people should skip it in favor of the second link.)
  2. Writing T-SQL Merge Statements the Right Way
  3. Using the Output Clause with T-SQL Merge
  4. What exactly are Dimensions and Why do They Slowly Change?
  5. How to Properly Load Slowly Changing Dimensions using T-SQL Merge
  6. Related: Loading Dimensions Historically
  7. Related: Extracting Historical Dimension Records using T-SQL
  8. Data Warehouse Initial (Historical) Dimension Loading With T-SQL Merge

SQL Saturday Dallas and Pre-Cons

The North Texas SQL Server User Group recently announced our next SQL Saturday. SQL Saturday 163 will be another business intelligence centric event, with a limited number of attendees. The last time we had an event like this, it sold out in less than 12 hours, so I’d urge everyone to sign up right away.

I won’t bother ranting yet again about how great these events are, except to say that if you aren’t taking advantage of them, you’re making a huge mistake. You won’t find better learning and networking opportunities anywhere else, and the events are essentially free. That being said, I have another announcement to make.

Data Warehousing In A Day – Pre-Conference Seminar

I’ve been invited to give a full day pre-conference seminar on Data Warehousing. On Friday October 12th, you can join me to learn how to build a Data Warehouse from the ground up and load it efficiently. I’ll teach you the core skills you need to succeed, and the pitfalls to avoid. I’ll share the experience I’ve gained building Data Warehouses from scratch, as well as rescuing these projects from failure. For the complete abstract, head on over to Data Warehousing In A Day and take a look.

I’m really pumped about this opportunity for several reasons. The first is that I was born to teach. I really love it and people say that it shows. Second, many people have seen some of my presentations on Data Warehousing or SQL Server Integration Services and commented how they wished I would go into more depth. When giving a regular presentation of an hour to 90 minutes, it’s hard to do that. The Pre-Con is a golden opportunity to deliver both breadth and depth.

So, for the low, low price of $75, you get a full day of example and demo focused learning on Dimensional Modeling, SSIS, and more. How can you beat that?

Sheesh! I never worked for so little, except once, and that was a very noble cause. – Miracle Max

Be sure to sign up early because the space is limited and the Pre-Con will fill up fast. Also, $75 is an early bird price. On July 15th, the price goes up to $99.

So, I hope many of you will join me for Data Warehousing In A Day on Friday, October 12th and I’m looking forward to hanging out with all of you at SQL Saturday 163!

More info about SQL Saturday 163 and my Pre-Con to come.

My Father's Footsteps

Everyone has heard the quote, “The apple never falls far from the tree.” As we get older, we tend to rebel against the notion that we will eventually become our parents. As we age, it seems less a notion and more a certainty. Well, in many ways I have fought against this my entire life because of some of the consequences my father has to endure. Of course, everyone has flaws, and I strive to correct and improve mine all of the time. However, there is one that is threatening to destroy me.

Pain

Like my father, I have been fat for most of my life. Not just overweight, but fat. I can remember getting into fights with neighbor kids about it. I’d come home all marked up from a fight and my father would ask me why. “Because they called you fat,” I’d say. And he’d just respond with, “David, I am fat. Who cares what they say?” Well, as a kid, you have to care as teasing and bullying can be brutally painful. Anyway, as I got older my weight continued to grow.

Being fat is painful. It affects your social life and your career. People assume that if you’re fat, you’re undisciplined, and they discount your intelligence and abilities. I usually don’t take it personally as it’s not malicious, it’s just natural. A few years ago a friend from the PASS community asked me to apply to their company because their team could benefit from my skills. However, the Human Resources person didn’t recommend me for a second interview. My friend asked why, and the HR person stupidly admitted that I was too fat. They didn’t want to incur extra health care costs and openly doubted my skills. Keep in mind that this is a person charged with protecting employees with disabilities, and who also has no understanding of what DBAs actually do. I was rejected simply because I was fat. I was angry, but got over it.

Dieting is painful too, with its deprivation and eventual failure. I’ve gained and lost tons of weight during long-term diet and exercise stints. The most recent was 2 years ago when I lost over 50 lbs. However, I had to ride almost 200 miles a week on the bike to do it. This led to terrible saddle sores which required periodic surgical procedures to remove. Your body is programmed to hold onto weight and tends to sabotage your efforts. This can be demoralizing.

Revelation


As time goes on, I’ve become closer to my father. His health has been declining over the past decade, and his obesity and related problems have been devastating. He suffers from diabetes and has impaired vision because of it. My father, who used to throw around 400 lb tool boxes with ease, can no longer walk. His knees and hips have completely eroded, and he’s confined to a wheelchair or walker and in extreme pain. He’s on the most powerful painkillers available and still in agony. For the longest time surgeons refused to fix his problems because he “wouldn’t” lose weight. However, it’s nearly impossible to lose weight when all you can do is lie around.

A few years ago, before my last dieting and exercising stint, I had a talk with him. Although it may sound weird, I thanked him for being an example for me. My father is like the Ghost of Christmas Future for me. I have a living example of what will happen to me if I cannot fix this problem. His response was telling. Basically, he apologized for what I have to go through, but said that none of us avoids it. His father died of diabetes-related complications and presumably his grandfather did as well. My father, who was one of the strongest men I have ever met, apologized because I was doomed to follow his path.

So far, he’s been right. Despite my best efforts I am still fat, and suffering physically from it. Two months ago I tore the meniscus in my knee when I got up off the couch. Now my knee hurts constantly. I am becoming my father.


I simply cannot accept that because it’s terrifying beyond words.

My Decision

I’m having Gastric Sleeve Surgery. Basically, they are going to remove 80% of my stomach, which reduces the amount of food I can eat and suppresses my hunger. I’ll be in the hospital three days and will require some time to heal. The decision is scary and the surgery is painful, but not nearly as painful as being morbidly obese for the rest of my life.

I made the final decision after appointments with a registered dietitian and my physician. Independently, they both told me that because my metabolism is so slow, as it is with most morbidly obese people, I would need to restrict my daily caloric intake to between 800 and 1200 calories for the rest of my life. This is the only way I can remain at a normal weight. When told this, I realized that I should have done this a decade ago. I don’t know of anyone, even my thinnest friends, who would be able to voluntarily restrict their eating that much. Heck, one of my favorite meals, Prime Rib at a national chain restaurant, is at least 1700 calories. I would never be able to achieve this without the benefits of surgery.

I understand that this will permanently alter my relationship with food. The surgery is a tool which will allow me to eat less and stick to the healthy foods that will help me avoid this fate. The next year will be very hard, but I simply have to do it.

Why am I sharing all this on my blog? Well, the surgery is next Monday and I’m presenting at SQL Saturday Houston this weekend. I’m on the preparatory diet right now and that’s hard to conceal since I can’t eat anything but 800 calories of protein shakes a day. Since I’m still attending the networking events, I’ll stick out like a sore thumb. I thought I would just let my SQL Family know as you folks mean a lot to me.

I’m hoping that some of you who are struggling with similar issues will read this and take strength from it. If I can inspire someone else to make this commitment that would be great too. It doesn’t matter to me if people think my decision is rash or crazy.

It is what I need to do for my health so I don’t continue walking in my father’s footsteps.

Presenting at SQL Saturday Houston

This Saturday I’ll be down in Houston presenting at SQL Saturday #107. I can’t wait to hang out with some of the best minds in SQL Server, which just happen to be great friends of mine as well. The abstract for my session is below.

Data Warehousing – How to Convince “The Bobs”

I got a meeting with the Bobs in a couple of minutes...


Building your first Data Warehouse is a long and often difficult process. How can you get your boss to approve a Data Warehouse project? What’s the best way to explain dimensional modeling and the benefits of a Data Warehouse to a business person? What are the best and most cost-effective ways of learning it? What kind of materials, hardware, software, etc. do you need? What’s the best way to build a proof of concept that will impress your boss, as rapidly as possible?

Come to this presentation, and I’ll answer all of these questions plus the most important question of all.

“Is this good for the company?” Absolutely.

Downloads

You can download everything for the presentation here. If you’re brand new to Data Warehousing, you should read my week of Dimensional Modeling articles which I wrote for SQL University. These will bring you up to speed so you can follow this fast-paced presentation.

Sessions I’m Targeting

I’m hoping to see the following presentations:

However, these are just the sessions that most interested me. You should check out the schedule for yourself. Just look at all the “names” who will be there. Besides the folks I already listed, other notables include Sean and Jenn McCown, Jack Corbett, Wendy Pastrick, Kendal Van Dyke, Tim Mitchell, Mike Hotek, Wes Brown, Kalen Delaney, and so many others.

You just can’t beat the learning and networking opportunities of SQL Saturdays. Hope to see many of you there.

Should you Pursue a Business Intelligence Career?

The best way to answer that question is with other questions. Answer them honestly. Let me preface the questions by saying that I fully expect the majority of people who take the time to read this will answer yes to most of them.

  • Are you intelligent in general? You don’t have to be a Rocket Surgeon to be a good BI specialist, but it isn’t for everyone. Many aspects of this job involve solving puzzles, but some of them don’t have clear solutions. It’s always easier to solve a problem that is known to have a good answer than one where the answer is unknown.
  • Do you love to learn new things? Notice that I did NOT say like, or are you willing, but are you really passionate about learning? This career path requires constant learning of several very different skills to achieve mastery. Also, keep in mind that your job is to understand your users’ jobs and therefore their data needs. Consequently, you spend a lot of time learning versus actually doing. Which leads us to…
  • Do you have a business mindset? You need not have a Master’s degree in Business Administration, but you have to understand, or quickly learn, how disparate business models work. It’s not enough to write excellent T-SQL or have mad skills in SSRS; you have to understand what the user needs even if they can’t articulate it.
  • Are you well-spoken? Do you enjoy working with people? Working as a member of a team? What about public speaking? These are all valuable traits for a BI Specialist. A great deal of your time is likely to be dedicated to client interviews. If you’re socially awkward, this may not be the career for you.
  • How well do you write? Proposals, project plans, etc. are all written and how well you write influences others’ opinion of you. Those opinions determine the opportunities that will come your way.
  • Do you have the mind of a teacher? When sitting in front of a CEO, there is a fine line between boring them with deep technical details and simplifying things so much that it insults their intelligence. Executive buy-in is critical and you must be savvy enough to get it.
  • How’s your attitude? Are you generally an upbeat person? The epitome of this is Jes Borland (Blog/Twitter), who has the most enthusiastic personality I’ve ever seen. This attitude helps with all of the other attributes. People will be more interested in talking with you, helping you, etc. This is something I struggle with as my attitude can be closer to Eeyore than Tigger sometimes.
  • Are you self-managed and organized? Often the BI DBA works alone on a specific aspect of a project with little to no supervision. Will you be productive in this situation?
  • Are you detail oriented? How meticulous are you in your work habits? With experience you will learn certain patterns which increase efficiency in data modeling, ETL, and report creation. However, even more important is how meticulously you check your work and notice details. You are not the business expert, but the person requesting the work is. The first time you deliver a report which isn’t right, users will mistrust your figures for a long time. In our business, trust is everything, and it’s just too easy to miss something. Also, documentation for a BI project is crucial, and I find that only the most meticulous people are disciplined enough to do it.
  • Are you driven? Striving for BI mastery and staying on top of new products and techniques can be exhausting. It’s not a sprint, it’s a long distance run. Are you up for that?

Please don’t misunderstand me. You don’t need to be excellent in all of those categories, none of us are. However, if you want to excel in Business Intelligence, you’d better have a good grasp of the majority of them and diligently be working toward mastery.

What do you think? Have I missed anything?

Why I Chose to Become a Business Intelligence DBA

Recently I’ve received several emails with the same basic theme. The writers sought me out because I am a Business Intelligence Consultant, and they want to know whether the career would be good for them. Well, it’s hard to read a short synopsis of someone’s experience and recommend a life-altering directional change. So, I decided to explain a bit about my history and why I made that decision.

My History

Teaching PowerPivot to M2M Admins
Five years ago, I had a very comfortable position working with a software product called Made2Manage, which utilizes SQL Server, hence the name of my blog. I was an expert at it and became able to solve problems very quickly. This left me with some idle time and I got comfortable. However, over time I got bored, needed a challenge in my work life, and realized that my reluctance to grow was costing me dearly in opportunities and money. Once you reach a point where you are comfortable, your value levels off. Also, I lived in Michigan, which was an economic black hole, so I was unlikely to earn a good living there.

The first step I took was to move to a better economic area, Dallas Texas, and work for a much larger company as their Made2Manage expert. I got more involved with SQL Server and reporting and this led me to Brent Ozar. After asking some technical questions, I asked him a similar question that I’m addressing in this article. Basically, what should I do with my career? He put me in contact with experts in various disciplines, some not involving SQL Server, so I could decide my future direction. He convinced me to blog, get involved in the SQL community, and I became hooked.

Why did I choose Business Intelligence?

  • Fascination. I love working with data, cleaning it up, and creating systems to report on it. I enjoy showing a user their data in a meaningful way and watching them get excited. Data analysis is both a science and an art. Also, I enjoy practicing a skill that many don’t understand and cannot perform.
  • Variety. I don’t do well with repetitive, boring tasks. As a Business Intelligence Consultant, most areas of SQL Server are involved, so there are many different tasks to accomplish. This staves off boredom and ensures that you are always learning.
  • Challenge. Because the Business Intelligence discipline involves so many aspects of SQL Server, it’s a real challenge to gain a level of mastery. Do you think that BI is just about reporting? Think again. I have to understand data modeling, both OLTP and OLAP. I need to have a level of mastery in SSRS, SSIS, and SSAS; not to mention the new and evolving tools such as SQL Server 2012 Power View and PowerPivot. Further, you need to know how to set up SQL Servers, disaster recovery procedures, and performance tuning. In other words, my work at some point or another touches almost every aspect of SQL Server.
  • Competition. Well, really the lack of competition as compared to other areas of technology. For example, I enjoy .NET software development, but I did not want to compete with outsourced labor or kids just coming out of college. Why try to compete with people who will do quality work for very low wages? Database Administration is not a topic that is typically taught in college, so the practitioners tend to be more mature; by that I mean older, but not necessarily in temperament. One reason why I didn’t take the Production DBA path is that it’s just too easy to outsource. Someone in a foreign land can check your backups, perform maintenance such as index and query tuning, etc. Also, Microsoft is pushing everything toward the Cloud, and this may cause issues for those whose primary functions are disaster recovery, administration, and performance tuning.
  • Demand. To quote Paris Hilton, “It’s hot!” Microsoft is continuing to invest boatloads of money into Business Intelligence, and more and more companies are investing in their BI infrastructure. In the world of SQL Server, Business Intelligence skills seem to be in the highest demand to me.
  • Compensation. It’s basic supply and demand. Being a BI Professional is difficult so many people can’t do it. The demand is increasing. This disparity causes salaries to rise.
  • Freedom. My skill set allows me considerable freedom as to where I work and for whom. Ideally, I would work for a company which relies on SQL Server, which is almost everyone. However, even if a company doesn’t have SQL Server databases, dimensional modeling skills and the flexibility of Integration Services to connect to nearly any data source mean that BI professionals are still valuable.
  • Interactivity. “I’m a people person damnit!” I enjoy talking to people which is one of the reasons I’m so active in the SQL community. Business Intelligence lends itself to more interaction because you are developing your projects for users. Also, this is one of the reasons why BI is resistant to outsourcing. Does your CEO want to spend hours talking with someone he can’t understand or someone he can meet face to face and feel confident that his/her needs are being addressed?

  • Community. The SQL Server community is the best technical community I’ve ever seen or heard of. One of the primary reasons I went the SQL Server route is that our community is so much better than that of competing products. When I have a question and Google it, someone I know has already blogged about it. They’re friendly, helpful, generous, exceedingly knowledgeable, and make it easier to learn SQL Server than other technologies.

On Monday, I’ll answer the other question posed to me in those emails. Should you pursue a Business Intelligence Career?

Multithreading Within the SSIS Dataflow

We often hear about the ability of SQL Server Integration Services (SSIS) to leverage multithreading to increase performance. However, what usually comes to mind is multithreading in the control flow: executing two or more packages simultaneously. I recently ran into an issue with a large data set that I needed to multithread in the data flow. I was dealing with a large set of addresses, and the client owned a third-party SSIS product which they used to parse them, but the performance wasn’t stellar. At the same time the server wasn’t even breathing hard, so I suspected that multithreading would speed up this process.

Since the addresses had sequentially numbered ID values, the first thing that occurred to me was to use the modulo function in SSIS. Modulo returns the remainder when one number is divided by another; for example, 10 % 3 returns 1, so the expression N % 3 == 0 is true for every third sequential value. Naturally, I took to the web and found this excellent article by Josef Richberg (Blog/Twitter) regarding the use of the T-SQL modulo function, but my source was a huge text file, not SQL Server, so that option was out. Ultimately, this is what I decided to go with.

I surmised that if I used a Conditional Split and checked for numbers evenly divisible by 2 (modulo 0), I could evenly split the data flow. This worked, but I felt I could increase performance further by adding a 3rd option as shown below.

For the purposes of this article, I created a Numbers (or Tally) table using code from the great Jeff Moden. I created a table of 10,000 sequential numbers (field N) and got the following results:


The Derived Column transforms represent some kind of process, in my case a 3rd party address parser. The data flow is evenly divided into three streams and then reunited by the Union All transform.
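The original screenshots aren’t reproduced here, but a rough T-SQL sketch can approximate how the split buckets the rows. It assumes the numbers table is named dbo.Numbers with integer column N, and that the Conditional Split conditions are evaluated in the order N % 3 == 0, then N % 2 == 0, with everything else falling through to the default output:

-- Approximates the Conditional Split distribution against a Jeff Moden style
-- numbers table (assumed name dbo.Numbers, integer column N from 1 to 10,000).
SELECT CASE
          WHEN N % 3 = 0 THEN 'Case 1: N % 3 == 0'
          WHEN N % 2 = 0 THEN 'Case 2: N % 2 == 0'
          ELSE                'Conditional Split Default Output'
       END AS Stream,
       COUNT(*) AS RowCnt
FROM dbo.Numbers
GROUP BY CASE
            WHEN N % 3 = 0 THEN 'Case 1: N % 3 == 0'
            WHEN N % 2 = 0 THEN 'Case 2: N % 2 == 0'
            ELSE                'Conditional Split Default Output'
         END

Run that way, the three streams come out to roughly 3,333, 3,334, and 3,333 rows, and the same pattern extends to more streams by putting conditions with larger moduli first, as discussed next.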

What if you want to divide the data flow into more than 3 even streams? Well, you could insert N % 4 == 0 as the first flow, but the results are too uneven in my opinion.

Using 5 as your first option yields a more even division and the flow is divided into 4 streams.

This is great, but what do you do if you don’t have a sequential ID in your data set? Well, there are several options. Remember that the integer field need not be an ID. Any integer field, particularly if it has varied values, is a good candidate.

If you don’t have an integer-type field, then you can create one in the data flow using the script component. I’ve used this method several times to generate surrogate keys and it seems to have very little overhead.
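If your source happens to be a SQL query rather than a flat file, a hedged alternative, and not what I used here, is to manufacture the integer in T-SQL with ROW_NUMBER and split on that instead:

-- Hedged alternative for query sources: synthesize a sequential integer
-- to feed the modulo expressions in the Conditional Split.
-- The table and column names below are hypothetical.
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS SplitKey,
       a.AddressLine1,
       a.City,
       a.StateProvince
FROM dbo.CustomerAddress AS a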

Also, Tim Mitchell (Blog/Twitter) blogged about how to split your data flow alphabetically if you have a character-type field with sufficient variation. I think this method is less advantageous because experimentation would be required to determine cutoff values which result in evenly divided streams.

Any of you have other methods of splitting the data flow?

State Name and Abbreviation Lookup Table Script

This is a quick script which creates a table consisting of State Names and their corresponding Abbreviations. I use this table when doing basic data quality checking and also standardizing state designations.

CREATE TABLE dbo.StateLookup
(
   StateID       INT IDENTITY (1, 1),
   StateName     VARCHAR (32),
   StateAbbrev   CHAR (2)
)
 
INSERT INTO dbo.StateLookup (StateName, StateAbbrev)
VALUES ('Alabama', 'AL'),
       ('Alaska', 'AK'),
       ('Arizona', 'AZ'),
       ('Arkansas', 'AR'),
       ('California', 'CA'),
       ('Colorado', 'CO'),
       ('Connecticut', 'CT'),
       ('Delaware', 'DE'),
       ('District of Columbia', 'DC'),
       ('Florida', 'FL'),
       ('Georgia', 'GA'),
       ('Hawaii', 'HI'),
       ('Idaho', 'ID'),
       ('Illinois', 'IL'),
       ('Indiana', 'IN'),
       ('Iowa', 'IA'),
       ('Kansas', 'KS'),
       ('Kentucky', 'KY'),
       ('Louisiana', 'LA'),
       ('Maine', 'ME'),
       ('Maryland', 'MD'),
       ('Massachusetts', 'MA'),
       ('Michigan', 'MI'),
       ('Minnesota', 'MN'),
       ('Mississippi', 'MS'),
       ('Missouri', 'MO'),
       ('Montana', 'MT'),
       ('Nebraska', 'NE'),
       ('Nevada', 'NV'),
       ('New Hampshire', 'NH'),
       ('New Jersey', 'NJ'),
       ('New Mexico', 'NM'),
       ('New York', 'NY'),
       ('North Carolina', 'NC'),
       ('North Dakota', 'ND'),
       ('Ohio', 'OH'),
       ('Oklahoma', 'OK'),
       ('Oregon', 'OR'),
       ('Pennsylvania', 'PA'),
       ('Rhode Island', 'RI'),
       ('South Carolina', 'SC'),
       ('South Dakota', 'SD'),
       ('Tennessee', 'TN'),
       ('Texas', 'TX'),
       ('Utah', 'UT'),
       ('Vermont', 'VT'),
       ('Virginia', 'VA'),
       ('Washington', 'WA'),
       ('West Virginia', 'WV'),
       ('Wisconsin', 'WI'),
       ('Wyoming', 'WY')

Also, you can add common misspellings to the table such as ‘Pensylvania’ to further clean up your data. If your data set is international, you may want to add Canadian provinces as well. One way I use this table is with a SQL Server Integration Services (SSIS) Lookup transform and the following query.

SELECT StateName State, StateAbbrev FROM StateLookup
UNION ALL
SELECT StateAbbrev State, StateAbbrev FROM StateLookup

This standardizes everything to two-character state abbreviations. Any records which don’t match can be handled in another branch of the data flow through fuzzy matching or another method.
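Outside of SSIS, the same idea can be expressed as a plain T-SQL update. Here’s a hedged sketch; the address table and its State column are hypothetical stand-ins for whatever you need to clean up:

-- Hedged sketch: standardize a hypothetical address table to two-character
-- state abbreviations using the lookup query above.
UPDATE ca
   SET ca.State = sl.StateAbbrev
  FROM dbo.CustomerAddress AS ca
  JOIN (SELECT StateName State, StateAbbrev FROM StateLookup
        UNION ALL
        SELECT StateAbbrev State, StateAbbrev FROM StateLookup) AS sl
    ON ca.State = sl.State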

Do you use anything similar?
