Thursday, February 21, 2008

Sorting and Metadata

A recent post to the SSIS Forums on MSDN made me realize how un-intuitive it can sometimes be to get sorted data in an SSIS data flow. Let's review the obvious facts:

  • When you query data from a relational database like SQL Server, the OLE DB source component gets metadata from the database server based on the SELECT statement.
  • This metadata includes things like column names, data types and so on.
  • The SSIS data flow designer uses this metadata to build the data flow pipeline.
  • The SQL SELECT statement includes an ORDER BY clause which specifies the sort order for the data to be returned.
  • This sort order is not provided by the database server, so the SSIS data flow has no way to tell if the data is sorted or not.

Hey, wait! That last fact may not be so obvious unless you think about how the database server is presenting this metadata to the SSIS tools. Essentially the OLE DB source component is sending the SELECT statement to the server, with some modifications to limit the number of rows being returned. The server returns the rows along with metadata that describes them. If the database is SQL Server, this data and metadata will be returned in a Tabular Data Stream (TDS) protocol stream - this is the protocol that SQL Server uses, so this is what SSIS gets back.

But if you've ever looked at the TDS spec, you know that TDS doesn't have any concept of sort order, so there's no way that SSIS can get the sort order from SQL Server.

So how does SSIS know that the data is sorted, and by what columns?

It knows because you tell it. And it trusts you implicitly[1] so you need to be careful to be honest.

So how do you do it?

Consider this SELECT statement:

  SELECT [AddressID]
        ,[AddressLine1]
        ,[AddressLine2]
        ,[City]
        ,[StateProvinceID]
        ,[PostalCode]
        ,[rowguid]
        ,[ModifiedDate]
    FROM [Person].[Address]
ORDER BY [PostalCode]
        ,[ModifiedDate] DESC

So how would you tell the SSIS data flow how this data is sorted?

There are two things you need to do, both of them in the Advanced Editor for the OLE DB source component. First, set the IsSorted property of the OLE DB Source Output to True. This essentially tells SSIS that the data will be sorted, but not how it will be sorted.

01 - IsSorted

Next, set the SortKeyPosition property for the PostalCode and ModifiedDate columns in the Output Columns collection. Set SortKeyPosition to 1 for PostalCode because this is the first column in the ORDER BY clause. Set SortKeyPosition to -2 for ModifiedDate because this is the second column in the ORDER BY clause, and the column is marked as DESC - this is why it's -2 and not 2.

02 - SortKey 1 

03 - SortKey 2

Now, when this package is executed, not only will the data be sorted (because we added an ORDER BY clause to our source query) but SSIS will know that the data is sorted as well, because we told it that it was. Now any downstream components (such as a Merge or Merge Join transformation) that require sorted inputs will be able to do what they need to do.

Life is good, because SSIS loves you.

 

[1] SSIS trusts you because SSIS loves you. More than you know.

No comments: