Archives

Multithreading Within the SSIS Dataflow

We often hear of the ability of SQL Server Integration Services (SSIS) to leverage multithreading to increase performance. However, multithreading in the control flow, executing two or more packages simultaneously, is what usually comes to mind. I recently ran into an issue with a large data set that I needed to multithread in the data flow. I was dealing with a large set of addresses and the client owned a third party SSIS product which they used to parse them and the performance wasn’t stellar. At the same time the server wasn’t even breathing hard, so I suspected that multithreading would speed up this process.

Since the addresses had sequentially numbered id values, the first thing that occurred to me was to use the modulo function in SSIS. Modulo returns the remainder when one number is divided by another. Naturally, I took to the web and found this excellent article by Josef Richberg (Blog/Twitter) regarding the use of the T-SQL modulo function, but my source was a huge text file, not SQL Server, so that option was out. Ultimately, this is what I decided to go with.

I surmised that if I used a Conditional Split and checked for numbers evenly divisible by 2 (modulo 0), I could evenly split the data flow. This worked, but I felt I could increase performance further by adding a 3rd option as shown below.

For the purposes of this article, I created a Numbers or Table using code from the great Jeff Moden. I created a table of 10,000 sequential numbers (Field N) and got the following results:


The Derived Column transforms represent some kind of process, in my case a 3rd party address parser. The data flow is evenly divided into three streams and then reunited by the Union All transform.

What if you want to divide the data flow into more than 3 even streams? Well, you could insert N % 4 == 0 as the first flow, but the results are too uneven in my opinion.

Using 5 as your first option yields a more even division and the flow is divided into 4 streams.

This is great, but what do you do if you don’t have a sequential ID in your data set? Well, there are several options. Remember that the integer field need not be an ID. Any integer field, particularly if it has varied values, is a good candidate.

If you don’t have an integer type field, then you can create one using the script task. I’ve used this method several times to generate surrogate keys and it seems to have very little overhead.

Also, Tim Mitchell (Blog/Twitter) blogged about how to split your data flow alphabetically if you have a character type field with sufficient variation. I think this method is less advantageous because experimentation would be required to determine cut off values which result in evenly divided streams.

Any of you have other methods of splitting the data flow?

Related posts:

4 comments to Multithreading Within the SSIS Dataflow

  • Toine

    I almost always use the modulo operator. Though the reason why it went wrong when you added “N % 4 == 0″ is because of the usage of the modulo.

    Normally if one would want to seperate an input into 4 streams one would create the following conditions on the conditional split:
    N % 4 == 0
    N % 4 == 1
    N % 4 == 2

    Then the one where the rest value would be 3 would go to the default output.

    This will create a perfectly even split if your incoming data is sequential.

  • Cool. This is one of the reasons I blog. That never occurred to me but it certainly works. Thanks!

  • Sushant

    Hi David,

    How about programmatically multithreading the SSIS package? Here is what i have done so far for the same.

    I have written code inside a Script Task to create a package at run time. This dynamic package would load data to multiple tables using single DataFlowTask i.e. parallel source and destinations inside dataflowtask. To achieve this I have used “for loop” to iterate (the logic of adding source & destination) for each source/destination table. This code works fine for 1st iteration but gives error in 2nd iteration at ReinitializeMetadata() right after setting up the Accessmode for destination.

    I have added my question (with code) on MSDN forum as well but was unable to get any response yet.

    Here is the question on forum
    http://social.msdn.microsoft.com/Forums/en/sqlintegrationservices/thread/daa4dcc8-7c61-4137-94e0-fb725c45c0f7

    I am not adding anything here as everything is already on forum.

    I will appreciate if you could take some time and help me by pointing me to right direction.

    I mailed you the same content on your mail id before i reached to this article. Sorry for spaming your mailbox.

    Thanks,
    Sushant

  • Sushant

    Hi,
    This problem is now solved. Details can be found in MSDN forum.

    Thank you.

Leave a Reply

 

 

 

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>