Friday, September 14, 2007

Handling Delimited Fields

I'm working with a set of input data that has multiple logical values "encoded" within a single field for each record. The source system allows users to select multiple values from a list, and then crams all of these values in a single field in the underlying database table, with a multi-character delimiting string between them. This is (believe it or not) the first time I've had to deal with this in a production SSIS scenario, so I investigated several different approaches to splitting the values into multiple rows with one value per row.

The first approach involves loading the data from the source system into a table in the staging database without modifying the shape of the data, and then using a SQL Server table-valued function to split the delimited field during the process of loading the data from the staging database into the data warehouse. I did some searching for a decent "split" function online and found one here that did most of what I needed. The only significant thing that I added was support for a "key" field to be passed in and included in the return table, so that the records extracted from the delimited field could then be correlated easily with the data in the source record. Here's the function I ended up with:



IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[KeySplit]') AND type in (N'FN', N'IF', N'TF', N'FS', N'FT'))
DROP FUNCTION [dbo].[KeySplit]
GO

CREATE FUNCTION dbo.KeySplit
(
@Key NVARCHAR (4000)
,@String NVARCHAR (4000)
,@Delimiter NVARCHAR (10)
)
RETURNS @Results TABLE
(
KeyColumn NVARCHAR (4000)
,ItemColumn NVARCHAR (4000)
)
AS
BEGIN


DECLARE @CurrentItem VARCHAR(8000)

WHILE CHARINDEX (@Delimiter ,@String, 0) <> 0
BEGIN
SELECT
@CurrentItem = RTRIM (LTRIM (SUBSTRING (@String, 1, CHARINDEX (@Delimiter, @String, 0) -1))),
@String = RTRIM (LTRIM (SUBSTRING (@String, CHARINDEX (@Delimiter, @String, 0) + LEN (@Delimiter), LEN (@String))))



IF LEN(@CurrentItem) > 0
INSERT INTO @Results (KeyColumn, ItemColumn) SELECT @Key, @CurrentItem


END



IF LEN(@String) > 0 -- Record after final delimiter
INSERT INTO @Results (KeyColumn, ItemColumn) SELECT @Key, @String

RETURN

END
GO


I can then use it like this in the source query from to load the data warehouse:

SELECT UDF.[KeyColumn] AS [RecordKey]
,UDF.[ItemColumn] AS [ItemName]
,RFC.[DelimitedList] AS [RawItemList]
FROM dbo.RFC_DATA RFC
CROSS APPLY dbo.KeySplit (RFC.[ID], RFC.[DelimitedList], '!#!') UDF


This performs reasonably well (although as I mentioned above, I don't really have enough data to say if it performs well enough) and returns the data I need.

Next, I wanted to look at a way to perform the same "splitting by key" functionality in memory in the SSIS data flow. This was obviously a job for... the Script Component![1] I personally tend to shy away from using the Script Task and Script Component in my SSIS packages (largely because there is so rarely anything that I need done that cannot be done using the built-in tools) but this is a case that screams out for scripting. And SSIS makes this incredibly easy. Here's what I needed to do:

First, I updated the data flow I'd created to load the data into the staging database by adding a Multicast transformation, a Script Component transformation, a Row Count transformation and an OLE DB destination. The image below shows the finished data flow after everything has been configured and connected.




Next, I right-clicked on the Script Component and set up the inputs and outputs to include the columns I needed.




Next, I selected my output and set its SynchronousInputID property to None to mark this as an asynchronous transformation[2]. It's important to do this before going into Visual Studio for Applications (VSA) to write any code, because Visual Studio will put in place the method stub code appropriate for the component when you first launch VSA, and it's mildly annoying to have to change it later on.




Next, I added the Delimiter package variable (which I added earlier to my package with the data type string and the value of the delimiter in the data I needed to split) to the ReadOnlyVariables list for the Script Component.




I then clicked on the Design Script button to launch VSA, and updated the stub code with the splitting logic I needed.




As you can see, this is much cleaner in VB.NET than it is in T-SQL, since the requisite functionality is already included in the .NET System.String class, so we don't need to re-invent this particular wheel.


Finally, I executed the package and made sure everything ran as desired.





I also did a little testing in the database to verify that the two techniques produced identical results, which they did. As you can see from the final image above, there just isn't enough data at this point to reach any meaningful performance conclusions, but I'm now armed and ready with two tested techniques for once the production data is available.



[1] If you didn't read that to yourself in a superhero voice, please go back and re-read this sentence until you get the tone right.


[2] If you're not sure about the differences between synchronous and asynchronous components in SSIS data flow, check out this TechNet article, or better yet, this excellent book by former SSIS Group Program Manager Donald Farmer.

2 comments:

Matthew Roche said...

I apologize for the formatting of the T-SQL code in this post. Blogger kept killing my whitespace, and after fixing it for the 5th or 6th time I decided it wasn't worth the effort...

Mick Wagner said...

Great article and description of your solution. This saved me when I was in a tight spot.

I ran it against a 4.8M row set, and it took roughly 20 minutes.

Thanks for helping contribute to the MS BI world!