We often hear of the ability of SQL Server Integration Services (SSIS) to leverage multithreading to increase performance. However, multithreading in the control flow, executing two or more packages simultaneously, is what usually comes to mind. I recently ran into an issue with a large data set that I needed to multithread in the data flow. I was dealing with a large set of addresses and the client owned a third party SSIS product which they used to parse them and the performance wasn’t stellar. At the same time the server wasn’t even breathing hard, so I suspected that multithreading would speed up this process.
Since the addresses had sequentially numbered id values, the first thing that occurred to me was to use the modulo function in SSIS. Modulo returns the remainder when one number is divided by another. Naturally, I took to the web and found this excellent article by Josef Richberg (Blog/Twitter) regarding the use of the T-SQL modulo function, but my source was a huge text file, not SQL Server, so that option was out. Ultimately, this is what I decided to go with.
I surmised that if I used a Conditional Split and checked for numbers evenly divisible by 2 (modulo 0), I could evenly split the data flow. This worked, but I felt I could increase performance further by adding a 3rd option as shown below.
For the purposes of this article, I created a Numbers or Table using code from the great Jeff Moden. I created a table of 10,000 sequential numbers (Field N) and got the following results:
This is great, but what do you do if you don’t have a sequential ID in your data set? Well, there are several options. Remember that the integer field need not be an ID. Any integer field, particularly if it has varied values, is a good candidate.
If you don’t have an integer type field, then you can create one using the script task. I’ve used this method several times to generate surrogate keys and it seems to have very little overhead.
Also, Tim Mitchell (Blog/Twitter) blogged about how to split your data flow alphabetically if you have a character type field with sufficient variation. I think this method is less advantageous because experimentation would be required to determine cut off values which result in evenly divided streams.
Any of you have other methods of splitting the data flow?