A recent post on the SSIS forums on MSDN has inspired me to finally post my first technical blog post of the year. Here's the basic scenario:
- A source system has multiple departments
- For each department there are multiple employees
- The SSIS package must extract all employees from all departments and load the employees from each department into a department-specific text file
For example, if there are Sales, Management and IT departments, the package should create sales.txt, management.txt and it.txt text files with the appropriate records in each file. Should be pretty simple, right?
Right!
Well, maybe.
The issue at hand is not that this sort of thing is difficult or complicated, but that unless you've spent a decent amount of time working with SSIS - and SSIS expressions and variables in particular - it's not really obvious where to begin. So we, of course, shall begin at the beginning, setting up our data sources.[1]
First, let's create a one-table source system in SQL Server, populated with data. Here's a SQL script that defines the source system:
USE TempDb
GO
CREATE TABLE SourceData
(
ID INT IDENTITY,
FirstName VARCHAR (50),
Department VARCHAR (50)
)
INSERT INTO SourceData VALUES ('Bob', 'Sales')
INSERT INTO SourceData VALUES ('Susie', 'Sales')
INSERT INTO SourceData VALUES ('Steve', 'Sales')
INSERT INTO SourceData VALUES ('Bill', 'Sales')
INSERT INTO SourceData VALUES ('Ken', 'Sales')
INSERT INTO SourceData VALUES ('Ted', 'Sales')
INSERT INTO SourceData VALUES ('Darren', 'IT')
INSERT INTO SourceData VALUES ('John', 'IT')
INSERT INTO SourceData VALUES ('Jamie', 'IT')
INSERT INTO SourceData VALUES ('Chuck', 'IT')
INSERT INTO SourceData VALUES ('Rafael', 'IT')
INSERT INTO SourceData VALUES ('Larry', 'IT')
INSERT INTO SourceData VALUES ('Lutz', 'Management')
INSERT INTO SourceData VALUES ('Ken', 'Management')
INSERT INTO SourceData VALUES ('Bill', 'Management')
INSERT INTO SourceData VALUES ('Mike', 'Management')
INSERT INTO SourceData VALUES ('Chris', 'Management')
Pretty simple so far, right? Right!
Next, let's create a sample text file that represents our destination files. This will make it much easier to build the SSIS package, because as we all know, SSIS thrives on metadata. Here's the sample text file:
ID,Name
1,Ken
2,Bill
3,Lutz
Again, very simple. Now let's build the SSIS package. We'll start by creating a new package in Visual Studio and defining two connection managers, like so:
The OLE DB connection manager is very straightforward - it just references the local tempdb database where the source data is stored. The Flat File connection manager is a little more interesting in that we've customized it just a little bit to say that the column names are in the first row, and that the ID column is a 4-byte integer.
Next, we need to create a few package variables to make everything else possible:
The first variable, CurrentDepartment, exists to store the current department we're loading from the source system. It will get populated by a Foreach Loop container once we start building the package logic.
The second variable, DepartmentCollection, exists to store the whole list of departments we're going to pull from the source system with an Execute SQL task.
The third variable, SqlSelectScript, exists to store the SELECT statement that will extract the employees for the current department from the source system. The fact that it has to be for the current department is what makes it interesting, and to solve that problem we need to use an SSIS expression. We set the SqlSelectScript variable's EvaluateAsExpression property to True and then set its Expression property to this expression:
"SELECT ID,FirstName FROM SourceData WHERE Department = '" + @[User::CurrentDepartment] + "'"
As you can see, the expression evaluates to a valid SQL SELECT statement, and its value will always reflect the current department from the CurrentDepartment variable.
Now that we have all of our variables, we also want to update the Flat File connection manager with an expression as well, so that the name of the file will be based on the name of the current department. Let's add an expression to the connection manager's expressions collection to base its ConnectionString property on this expression:
"C:\\Projects2005\\SSIS.BlogFodder\\SSIS.BlogFodder\\" + @[User::CurrentDepartment] + ".txt"
Obviously you should use whatever folder path makes sense for you, but make sure that you use a double backslash for any backslash characters in your path - this is necessary for the expression to be properly evaluated.
Finally, we're ready to start building the package's actual control flow and data flow. The first thing we need is an Execute SQL task that will run a SELECT statement to get a list of distinct departments and store that list in the DepartmentCollection variable. The SELECT statement looks like this:
SELECT DISTINCT Department FROM SourceData
And the Execute SQL task looks like this:
The main things to notice are that we're specifying Full Result Set for the result, and we're storing it in the DepartmentCollection variable that we created earlier on - and remember, we created the DepartmentCollection variable with the Object data type, so we can store the whole department list in a single Recordset.
Next, let's add a Foreach Loop container to the package's control flow, and connect it to the Execute SQL task with a precedence constraint. Let's configure the Foreach Loop so that it will loop over each department added to the DepartmentCollection variable, and will store the current department from the DepartmentCollection variable in the CurrentDepartment variable, like so:
And last but not least, let's add a Data Flow task to the Foreach Loop container, so that the control flow looks like this:
And at this point, all of the hard work is done. The data flow is very simple, just mapping the fields from an OLE DB source component which uses the existing OLE DB connection manager, to the a Flat File destination component, which uses the existing Flat File connection manager. The astute reader will notice at this point that there is a little "dynamic magic" going on in this simple scenario. Because the Flat File connection manager is configured with an expression on its ConnectionString property, the data flow that uses it will automatically be dynamic, and will automatically load the data into the correct flat file for the department being extracted.
The one thing to remember in the data flow is that the OLE DB source component needs to get its query from the SqlSelectScript variable, which is also dynamic because we based the variable on an expression.
With the source and destination connected, the super-simple data flow looks like this:
And when we run the package, what happens? This is what happens:
- The Execute SQL task runs, loading the list of departments into the DepartmentCollection variable
- The data flow task executes once for each department, because it is contained within the Foreach Loop container, which is looping over the DepartmentCollection variable
- Each time the data flow task executes it is extracting only employees from the current department, because the source query is based on an expression
- Each time the data flow task executes it is loading data into a different department-specific text file, because the ConnectionString property of the Flat File connection manager is based on an expression
And it all looks like this:
See how easy that was? It will probably take half as long to actually do this as it did to read this blog post. A few expressions, a decent understanding of the business requirements, and SSIS does all the hard work.
Enjoy!
[1] At this point, my laptop crashed, and I lost all of my work. This is just a reminder to myself: even if it's not "real" project work, always save all of your work regularly. It's annoying to have to recreate a bunch of SQL scripts for a blog post because they "weren't important enough to save" in the first place...
13 comments:
Love it!! Thanks!!! Can't wait to try it! :D
Thanks, Kathy!
I have been meaning to blog on something similar to this for quite some time - it was great to have your post on the forums as an incentive to get off my butt and actually post it. That's one of the reasons I love the community forums - I get to see questions and problems that I would not run into in my day-to-day work. And besides, anything that lets me talk about the beauty of SSIS expressions is a good thing. ;-)
I like the fact that I made it into the IT department :)
Darren
Darren - My lawyers have informed me that I need to add a disclaimer that says that any resemblance to any persons, living or Australian, is purely coincidence. ;-)
Just wanted to say thanks again. I went thru all the tutorials and did your example here. It works beautifully. Thanks so much! I really feel like I have a very good handle on SSIS now!! :)
I've used this solution as a basis to create a customer extract. I have just one more loop in mine. For mine I have a stored procedure that returns a list of regions. Each region consists of one or more states. I loop thru my list of regions and return a list of states for the current region I am on in the loop. With my list of states that is returned for the region, I then loop thru the state list calling my customer extract stored procedure, sending the state as a parameter and return the customers for that state. That is the data flow task portion. The customers are written to a dynamic flat file and the file gets added to (not overwritten) with the next state's customer list until the program gets to a new region. The dynamic flat file's connection string expression is based in part on the current region, so that triggers a new file when we get to a new region. This all works FANTASTIC, EXCEPT for one small but important detail. When I have more than one state per region (which is normally the case), it writes out my column names as the first row and then the customers for the first state, then writes the column names again, then the customers for the next state and so on. Since I have to add to the file until I get to a new region, I do not want column names in the middle of the file, but I HAVE to have them at the beginning of the file. Since these are such huge files, I did not even catch this until final testing. Is there a simple way to get the column names to write out at the beginning of the file, but not again until I am on a new file/new region?
Thanks Matthew - this post was very helpful to figure out using an expression to create dynamically named flat files.
One thing I got snagged on was that if your destination is a remote server (that doesn't use your drive mappings), your expression needs to start: "\\"+"\\servername\\path..." - to get the double backslash at the beginning.
Thanks again.
John
"\\"+"\\servername\\path..." - to get the double backslash at the beginning.
That is the same as putting...
\\\\servername\\path
The trick is to just copy the path you would normally use for mapping... then just replace each "\" with "\\" i.e. \\\\
Exactly what I was looking for! Thank you!
jules
This is great!! I was wondering if you had any tips on how to do this if you want to have one package read multiple database dynamically. Any info would be greatly appreciated.
Jeff
@Jeff Z: There are lots of different ways to approach this depending on how you interpret the question. So long as you have a list of databases (or a list of connection strings) stored somewhere, you can loop over that list by using the Foreach Loop container, kind of the opposite of what we're doing here. If you take a look in Books Online, there is a "Simple ETL Tutorial" that shows how to loop over multiple input files. The concepts are the same with multiple "input databases" except you would need to build the list of databases yourself...
Thanks for the reply, this will help.
hey this post was very helpful to me. but if we have to add to header to the flat file..having the department name as the header..then what is the possible way...it will be of great help to me if you can let me know the solution
Post a Comment