Sunday, November 4, 2007

I Want My CTE!

In my recent Best Practices post I mentioned that it was a best practice to “do it in the data source.” By this, I meant that when using SSIS to extract data from a relational database you can often get much better performance by completing tasks like sorting, filtering, grouping and aggregating in the source query as opposed to completing these tasks in the SSIS data flow. Now don’t get me wrong – the SSIS data flow performs exceptionally well, but it is not always the right tool for every job.

Here’s a specific case in point. I was recently helping someone troubleshoot a poorly performing SSIS package. The developer in question was pulling a variety of date values (different dates and times that described a business entity) from a single table in the source system and loading them into a time dimension in a data mart. As part of the data flow he was taking the seven different source columns and putting them together into a single column like this:

1) He used the Multicast transform to split the one data flow path into seven data flow paths, one for each date column.
2) He created a set of “DateValue” columns with the Derived Column transform, with each one based on one of the date columns from the source table, and replacing NULL dates with a placeholder date value.
3) He used the Sort transform to remove duplicates from each of the derived columns.
4) He used the Union All transform to bring the seven derived columns together into a single “DateValue” column.
5) He used another Sort transform to remove duplicates from the single data column.
6) He loaded the discrete dates into his dimension table.

It looked something like this:

And it performed about as well as you could imagine.

As you can probably imagine, he was a little frustrated, and was saying all sorts of bad things about SSIS which I won’t repeat here and which, for the most part, weren’t true.
And as you can probably also imagine, my solution for this problem was to perform the required logic not in the data flow, but in the source query. In fact, I said “this looks like a job for… Common Table Expressions!” That’s right – since the source system was using a SQL Server 2005 database, we could quickly and easily perform all of this logic in the SELECT statement used by the data flow, and eliminate 90% of the work being done in the data flow. The source query ended up looking something like this:

WITH FilteredDatesCTE
SELECT TimeRaised
FROM SourceTable
WHERE LastModified >= '1/1/2007'
SELECT TimeRaised AS DateTimeString FROM FilteredDatesCTE UNION
SELECT TimeOfFirstEvent FROM FilteredDatesCTE UNION
SELECT TimeOfLastEvent FROM FilteredDatesCTE UNION
SELECT TimeResolved FROM FilteredDatesCTE UNION
SELECT LastTimeStateWasModified FROM FilteredDatesCTE UNION
SELECT TimeAdded FROM FilteredDatesCTE UNION
SELECT LastModified FROM FilteredDatesCTE
SELECT DISTINCT ISNULL(DateTimeString, '1/1/9999') AS DateTimeString
FROM UnionedDatesCTE;

Of course, if we were using a different database platform for the source system we could have used derived tables instead of CTEs, but the CTE syntax is cleaner and more elegant, and certainly does the trick.

The updated data flow was much simpler – none of the transforms shown above were required anymore – and it performed superbly, completing in minutes where the original data flow with all of its asynchronous transforms took over a day to complete with the same production data volume.

And as you’ve doubtless noticed as well, this example also ties in nicely with the “Don’t use asynchronous transforms” best practice.

So what’s the moral of the story? Yes, this time I actually do have one! The moral is that in order to be an effective SSIS developer, you also need to understand the capabilities of the platforms on which your packages are based. In this case (and in many, many cases) this means understanding T-SQL or some other SQL dialect. If you don’t have a thorough understanding of SQL, you’re going to be at a significant disadvantage when using SSIS. Part of it comes down to choosing the right tool for each job, part of it comes down to understanding the strengths and weaknesses of each tool.

No comments: