A recent post on the SSIS forums on MSDN has inspired me to finally post my first technical blog post of the year. Here's the basic scenario:
- A source system has multiple departments
- For each department there are multiple employees
- The SSIS package must extract all employees from all departments and load the employees from each department into a department-specific text file
For example, if there are Sales, Management and IT departments, the package should create sales.txt, management.txt and it.txt text files with the appropriate records in each file. Should be pretty simple, right?
Right!
Well, maybe.
The issue at hand is not that this sort of thing is difficult or complicated, but that unless you've spent a decent amount of time working with SSIS - and SSIS expressions and variables in particular - it's not really obvious where to begin. So we, of course, shall begin at the beginning, setting up our data sources.[1]
First, let's create a one-table source system in SQL Server, populated with data. Here's a SQL script that defines the source system:
USE TempDb
GO
CREATE TABLE SourceData
(
ID INT IDENTITY,
FirstName VARCHAR (50),
Department VARCHAR (50)
)
INSERT INTO SourceData VALUES ('Bob', 'Sales')
INSERT INTO SourceData VALUES ('Susie', 'Sales')
INSERT INTO SourceData VALUES ('Steve', 'Sales')
INSERT INTO SourceData VALUES ('Bill', 'Sales')
INSERT INTO SourceData VALUES ('Ken', 'Sales')
INSERT INTO SourceData VALUES ('Ted', 'Sales')
INSERT INTO SourceData VALUES ('Darren', 'IT')
INSERT INTO SourceData VALUES ('John', 'IT')
INSERT INTO SourceData VALUES ('Jamie', 'IT')
INSERT INTO SourceData VALUES ('Chuck', 'IT')
INSERT INTO SourceData VALUES ('Rafael', 'IT')
INSERT INTO SourceData VALUES ('Larry', 'IT')
INSERT INTO SourceData VALUES ('Lutz', 'Management')
INSERT INTO SourceData VALUES ('Ken', 'Management')
INSERT INTO SourceData VALUES ('Bill', 'Management')
INSERT INTO SourceData VALUES ('Mike', 'Management')
INSERT INTO SourceData VALUES ('Chris', 'Management')
Pretty simple so far, right? Right!
Next, let's create a sample text file that represents our destination files. This will make it much easier to build the SSIS package, because as we all know, SSIS thrives on metadata. Here's the sample text file:
ID,Name
1,Ken
2,Bill
3,Lutz
Again, very simple. Now let's build the SSIS package. We'll start by creating a new package in Visual Studio and defining two connection managers, like so:
The OLE DB connection manager is very straightforward - it just references the local tempdb database where the source data is stored. The Flat File connection manager is a little more interesting in that we've customized it just a little bit to say that the column names are in the first row, and that the ID column is a 4-byte integer.
Next, we need to create a few package variables to make everything else possible:
The first variable, CurrentDepartment, exists to store the current department we're loading from the source system. It will get populated by a Foreach Loop container once we start building the package logic.
The second variable, DepartmentCollection, exists to store the whole list of departments we're going to pull from the source system with an Execute SQL task.
The third variable, SqlSelectScript, exists to store the SELECT statement that will extract the employees for the current department from the source system. The fact that it has to be for the current department is what makes it interesting, and to solve that problem we need to use an SSIS expression. We set the SqlSelectScript variable's EvaluateAsExpression property to True and then set its Expression property to this expression:
"SELECT ID,FirstName FROM SourceData WHERE Department = '" + @[User::CurrentDepartment] + "'"
As you can see, the expression evaluates to a valid SQL SELECT statement, and its value will always reflect the current department from the CurrentDepartment variable.
Now that we have all of our variables, we also want to update the Flat File connection manager with an expression as well, so that the name of the file will be based on the name of the current department. Let's add an expression to the connection manager's expressions collection to base its ConnectionString property on this expression:
"C:\\Projects2005\\SSIS.BlogFodder\\SSIS.BlogFodder\\" + @[User::CurrentDepartment] + ".txt"
Obviously you should use whatever folder path makes sense for you, but make sure that you use a double backslash for any backslash characters in your path - this is necessary for the expression to be properly evaluated.
Finally, we're ready to start building the package's actual control flow and data flow. The first thing we need is an Execute SQL task that will run a SELECT statement to get a list of distinct departments and store that list in the DepartmentCollection variable. The SELECT statement looks like this:
SELECT DISTINCT Department FROM SourceData
And the Execute SQL task looks like this:
The main things to notice are that we're specifying Full Result Set for the result, and we're storing it in the DepartmentCollection variable that we created earlier on - and remember, we created the DepartmentCollection variable with the Object data type, so we can store the whole department list in a single Recordset.
Next, let's add a Foreach Loop container to the package's control flow, and connect it to the Execute SQL task with a precedence constraint. Let's configure the Foreach Loop so that it will loop over each department added to the DepartmentCollection variable, and will store the current department from the DepartmentCollection variable in the CurrentDepartment variable, like so:
And last but not least, let's add a Data Flow task to the Foreach Loop container, so that the control flow looks like this:
And at this point, all of the hard work is done. The data flow is very simple, just mapping the fields from an OLE DB source component which uses the existing OLE DB connection manager, to the a Flat File destination component, which uses the existing Flat File connection manager. The astute reader will notice at this point that there is a little "dynamic magic" going on in this simple scenario. Because the Flat File connection manager is configured with an expression on its ConnectionString property, the data flow that uses it will automatically be dynamic, and will automatically load the data into the correct flat file for the department being extracted.
The one thing to remember in the data flow is that the OLE DB source component needs to get its query from the SqlSelectScript variable, which is also dynamic because we based the variable on an expression.
With the source and destination connected, the super-simple data flow looks like this:
And when we run the package, what happens? This is what happens:
- The Execute SQL task runs, loading the list of departments into the DepartmentCollection variable
- The data flow task executes once for each department, because it is contained within the Foreach Loop container, which is looping over the DepartmentCollection variable
- Each time the data flow task executes it is extracting only employees from the current department, because the source query is based on an expression
- Each time the data flow task executes it is loading data into a different department-specific text file, because the ConnectionString property of the Flat File connection manager is based on an expression
And it all looks like this:
See how easy that was? It will probably take half as long to actually do this as it did to read this blog post. A few expressions, a decent understanding of the business requirements, and SSIS does all the hard work.
Enjoy!
[1] At this point, my laptop crashed, and I lost all of my work. This is just a reminder to myself: even if it's not "real" project work, always save all of your work regularly. It's annoying to have to recreate a bunch of SQL scripts for a blog post because they "weren't important enough to save" in the first place...