
Wednesday, November 30, 2011

5 Tips for a Smooth SSIS Upgrade to SQL Server 2012

The SSIS team has just released a white paper on upgrading to SSIS in SQL Server 2012:

Summary: Microsoft SQL Server 2012 Integration Services (SSIS) provides significant improvements in both the developer and administration experience. This article provides tips that can help to make the upgrade to Microsoft SQL Server 2012 Integration Services successful. The tips address editing package configurations and specifically connection strings, converting configurations to parameters, converting packages to the project deployment model, updating Execute Package tasks to use project references and parameterizing the PackageName property.

If you’ve been using SSIS in versions prior to 2012,[1] please take a moment to check it out. The evenings you save may be your own…

Direct link: http://msdn.microsoft.com/en-us/library/hh667275(d=lightweight).aspx

[1] And if you read this blog, odds are this is a safe bet.

Thursday, July 21, 2011

Your Window Improvements May Vary

I love my new job. This morning I’m sitting in a training session organized by the SQL Server team to bring Microsoft team members up to speed on new Business Intelligence capabilities in SQL Server “Denali”. The session is about to begin, and I’m reviewing the SSIS slides so I can be prepared for questions as they arise. (Did I mention I love my new job?)

I also love ambiguity.[1]

One of the features listed in the presentation is “Variable Window Improvements”. Reading this, all I could think is “we’ve really improved the windows, but they’re different all the time, so we can’t tell you what they are.”[2]

Of course, that’s not what the slide was about. The topic was the specific improvements made to the Variables Window for the SSIS package designer in SQL Server “Denali” Business Intelligence Development Studio. This may sound less than exciting[3] and even in the context of a two-day training event it only rates one slide bullet. But despite this, these tool improvements address a significant real-world need, and are certainly worthy of a blog post or two.

Before we look at the new “Denali” goodness, let’s look at a scenario in pre-“Denali” SSIS today[4]:

  1. You’re editing an Execute SQL task in your package, and realize that you need a variable in which to store the results of the query you’re running.
  2. You exit the task editor, and in the Variables window you create a new variable.
  3. Later on, you’re looking for the variable, and you can’t find it. You frown, scratch your head, and re-create the variable.
  4. When you run your package, you find that although the package (and the Execute SQL task) runs successfully, the variable is never populated with the value from the query.
  5. You get yourself another coffee and settle down for a long troubleshooting session…

Experienced SSIS developers will probably be saying something along the lines of “well, of course, you needed to click on the package Control Flow design surface between steps 1 and 2. Duh!”

Those who are newer to SSIS may be missing the key point in this scenario: When you create a new variable in SSIS, the variable is created at the scope of whatever container or task was selected in the designer. Although this behavior is consistent and documented, it often comes as a surprise, because nobody actually reads the documentation.[5] It also often comes as a frustration, because most of the time variables should be created at the package level in order to be useful.[6]
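Under the covers this makes sense: every container has its own Variables collection, and the designer adds the new variable to whichever container is currently selected. Here’s a minimal sketch of the same scoping rules expressed through the SSIS .NET API – the variable names are made up:

    using Microsoft.SqlServer.Dts.Runtime;

    class VariableScopeDemo
    {
        static void Main()
        {
            Package package = new Package();

            // Added through the package's collection: package scope,
            // visible to every task and container in the package.
            package.Variables.Add("RowCount", false, "User", 0);

            // Added through a container's collection: scoped to that container,
            // hiding the package-level variable for everything inside it.
            Sequence seq = (Sequence)package.Executables.Add("STOCK:Sequence");
            seq.Variables.Add("RowCount", false, "User", 0);
        }
    }

The designer follows the same rule: whichever container is selected supplies the Variables collection that the new variable lands in.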

In SQL Server “Denali” SSIS, all new variables are created at the package scope. This change ensures that the “what the heck” moments we experienced in previous versions of SSIS will no longer occur, and that all new variables will be defined at the most frequently used scope.

What about those situations when you actually do want a variable at a different scope? In previous versions of SSIS, variable scope could not be changed – you needed to delete the old variable and create a new one at the new scope.[7]

In SQL Server “Denali” SSIS, you can move or copy variables between scopes. Odds are you will need this functionality less than you needed it in earlier versions of SSIS (you know, when it wasn’t there) but when you need it, you’ll love it.

And that’s that. As per usual, I wrote two pages when I could have written two sentences (the ones I highlighted in red) but there you have it. The facts are always more meaningful when presented in the context of a story, and hopefully the story helped turn these two bullets into something more than a “yawn” moment. I’m sure you’ll let me know…


[1] In case you’re interested: http://en.wikipedia.org/wiki/Syntactic_ambiguity.

[2] Yes, I crack myself up.

[3] And less funny than my ambiguity joke, at least a little.

[4] This is true with SSIS in SQL Server 2005, 2008 and 2008 R2.

[5] Except you. I know you do, I meant everybody else. You know those guys…

[6] One of the “lightweight best practices” I have included in my SSIS presentations over the years is “always right-click on the package design surface and choose Variables from the context menu” because it ensures that the new variable is at the package scope regardless of what task you were working on beforehand.

[7] This functionality does exist in the amazing BIDS Helper add-in, but it was not included in SSIS.

Wednesday, October 7, 2009

Transformations are NOT Tasks

This is another one of those “I thought I blogged this years ago” posts. I was having a conversation with a fellow MCT a few days ago, and he described SSIS data flow transformations as “tasks on the data flow tab.” I didn’t say anything then, but this phrase has been sticking in my mind ever since – not unlike a nasty splinter – and I think I need to get it written down before my brain lets me move on.

Why is this seemingly innocuous phrase causing me so much mental distress?

Because transformations are not tasks!

Let’s list a few characteristics of tasks in SSIS. Tasks:

  • Are used in the SSIS Control Flow.
  • Are Containers, with all of the properties (such as the ability to control transaction scope, or to be enabled or disabled) of containers.
  • Support property expressions.
  • Support breakpoints and debugging.
  • Can be individually configured for logging.
  • Are presented as boxes connected by arrows on the BIDS design surface.

The only characteristic that data flow transformations share with tasks is the last one: they’re presented as boxes, connected by arrows.

But the similarities end there. Transformations are metadata-bound functions that operate on data stored in the memory buffers managed by the data flow engine. They are subcomponents of the Data Flow task, which is just another task. It’s an important task, and it gets its own designer because of this, but it is just another task nonetheless.
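The difference is visible in the object model itself. Here’s a minimal sketch using the SSIS .NET API – note that the task is an Executable added to the package, while the transformation is component metadata registered inside the Data Flow task’s pipeline. (The Row Count ProgID varies by SQL Server version, so treat that string as an assumption.)

    using Microsoft.SqlServer.Dts.Runtime;
    using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

    class TasksVersusTransformations
    {
        static void Main()
        {
            Package package = new Package();

            // A task: an Executable in the control flow, wrapped in a TaskHost
            // with container properties (transactions, Disable, logging, ...).
            TaskHost dataFlowTask = (TaskHost)package.Executables.Add("STOCK:PipelineTask");

            // A transformation: not an Executable at all, but a component
            // registered inside the Data Flow task's pipeline.
            MainPipe pipeline = (MainPipe)dataFlowTask.InnerObject;
            IDTSComponentMetaData100 rowCount = pipeline.ComponentMetaDataCollection.New();
            rowCount.ComponentClassID = "DTSTransform.RowCount.2"; // SQL 2008 ProgID - version-specific
            rowCount.Instantiate().ProvideComponentProperties();
        }
    }

Different collections, different interfaces, different animals.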

Why is this important to me? Some (ok, most) may argue that I’m just being pedantic. And perhaps they’re right, to some degree. But any technical field has its own terminology that allows its practitioners to communicate efficiently and precisely. Hearing an SSIS developer use “task” for “transformation” is kind of like hearing a surgeon use “abdomen” for “stomach” or a .NET developer use “component” for “assembly.” The terms are related, but their meanings are very different. In each case it’s likely that the speaker knows the difference and is simply not speaking carefully, but sometimes (particularly with the surgeon ;-) you just want to be sure.

So the next time someone tells you that he’s using “the Lookup Task” please send him here.

P.S. I should also be sure to mention that the MCT who used this offending phrase probably does not need this rant to let him know what’s going on. MCTs tend to have a solid grasp of the technologies they teach; you tend to crash and burn in front of the classroom if this is not the case. But I knew that I was going to have this stuck in my mind forever if I didn’t say something somewhere…

Monday, September 21, 2009

SSIS Package Templates: The Good, The Bad and The Ugly.

This is one of those “I could have sworn I blogged on this topic years ago” posts. I speak to this topic whenever I’m presenting on SSIS development best practices, and with my half-day mini-seminar in Stockholm last week[1] I was thinking of it – and speaking about it – again. Although SSIS is a great platform and provides pretty much all the tools you need to get the job done, it’s not always easy to know what tools to use and how to use them.

Which brings us, of course, to package templates. So before we continue, please read this Knowledge Base article on how to create a package template in BIDS.

Done? Good – let’s review, and paraphrase while we do it.

Package templates are pretty much the entire reuse story in SSIS.[2] You define the baseline package, including connection managers, log providers, configurations, variables and expressions, tasks, containers and precedence constraints and sometimes data flow components as well. When you create a new package you create a new copy of the template with this pre-defined starting point instead of starting with a blank package, so you can be sure that the package is just like the baseline you created.

In keeping with the theme established in the post title, this is the “good” side of SSIS package templates. They’re valuable, and when you’re building a large number of packages for the ETL subsystem of a DW/BI project, they’re invaluable, in that the common “plumbing” required by all packages can be defined once and used consistently everywhere. And this is particularly vital when working in a team environment, where different developers (each of whom inherently has his own style and own way of building packages) are working on the same project, which is typically the case on all but the most trivial ETL projects.

Which brings us to the “bad” side of package templates, at least as implemented when following the guidance in the KB article referenced above. This guidance is firmly targeted at the lone developer; did you notice how all the instructions about creating a package and saving it in the Visual Studio install folder never mentioned on what computer you should be performing these steps? That’s because the article assumes the context is “your machine” – the lone developer’s machine or – at scale – on every developer’s machine.

Why is this bad, you ask? Surely it can’t be so much work to copy a DTSX file to the same folder on a few computers, can it?

Of course not. The “bad” side of following this approach comes not when creating the templates, but when the real world comes into play and the project continues to grow and evolve. In a perfect world, when the needs of the project are understood 100% before development begins, and the needs of the project never change after development begins, this approach will work just fine for team development. Your mileage may vary, but I’ve yet to work on an SSIS project[3] like this.

In the real world, most successful BI projects are iterative and agile, just like most other successful development projects. If you look at SQL Server MVP John Welch or MVP and VSTS Ranger Ted Malone, their blogs focus strongly on agile BI. If you read Kimball, you will see an overarching theme of agile and iterative development. And if you’ve ever seen former SSIS program manager Michael Entin present on SSIS performance tuning[4] you’ve noticed this best practice under the heading of “Principles of Good Package Design”: “Follow Microsoft Development Guidelines - Iterative design, development & testing.”

Ok, so why did I just go off on that tangent and drop so many names? Regular readers will quickly note that my writing style practically revolves around tangents and will likely seize on this as the reason, but today it is to drive home a point: The real world is neither pretty nor neat, and people who know more about BI than I will ever forget[5] have gone out of their way to point this out.

Why is this bad, you ask yet again? Surely it can’t be so much work to copy a DTSX file to the same folder on a few computers, can it? And can you please just get to the point, Matthew?

Maybe. Picture this story:

A small team of five or six SSIS developers is working together on a project. The lead developer has created a package template for all developers on the team to use moving forward. He emails the DTSX file to each team member along with a reference to KB article 908018 for instructions on how to install and use the template.

A week later the lead developer identifies a problem with the package template.[6] He updates the template package and emails it out to his team with instructions to update their local machines with the new package right away.

Unfortunately, one of the team members is a “heads-down” kind of developer who turns off email during the day, and finishes a few more packages before he gets the message, even though he does save the new template package once he checks his email.

And another developer is on vacation when the email is sent, and by the time he gets back he has too many new messages in his inbox to ever get through, so he decides that if anything is really important then the sender will re-send it, and he just deletes anything that is unread to get a fresh start with no email backlog.[7] So he keeps on developing packages using the outdated template until the next update is sent out a few weeks later…

You see where this is going, right? This is going to the “ugly” side of SSIS package templates. The ugly part comes in downstream when integration testing begins. The packages developed by the individual team members need to be deployed into a test environment and tested together, and of course at that point things start to fall apart. Everyone works extra hours and deadlines slip by while people try to track down problems they knew had been solved long ago. Although to be honest, the worst case scenario is that the test cases don’t find the problem at this point, and the ETL solution is deployed into production without the errors being resolved.

Unfortunately, I have seen this story play out in one way or another on several different projects over the years. The details are always different, but the ending is always the same: extra work, unhappy clients, and pain.

Now picture this story retold, with a few different details:

The lead developer works with his team to establish Team Foundation Server (TFS) source code control[8] before any development begins. He develops a set of packages that serve as “template examples” and checks them into TFS.

He also establishes standards with his team, so that each team member knows to Get Latest whenever starting work on a new package. This ensures that – no matter what happens – each developer always has the most recent versions of all packages with no need for “special case” tasks like updating a template.

The lead developer also instructs each development team member to create any new packages by right-clicking in Solution Explorer and selecting Add Existing Package, and then browsing to a current file from TFS.[9] This ensures that whenever a new package is created, it is being created as a new instance of a “rolling template” that is stored in TFS and as such the same version is always available for use by all developers.

Then, whenever problems are identified in existing packages, the “problem packages” are updated immediately, and the fixes are applied to any existing packages created from a now-outdated template.[10]

And whenever anyone goes on vacation, problems are found earlier, their impact is lessened, deadlines are met, and everyone lives happily ever after.

Ok, so the results are probably not so extreme, but in my experience on multiple projects this “rolling template” approach is the best practice way to approach SSIS package templates. Any attempt to utilize the Microsoft guidance in a team environment runs into problem after problem, and at the worst extreme I have seen a development team manually re-implement dozens of SSIS packages using a new template after churning for a full week while trying to reconcile and resolve the problems disseminated by using the “official” template approach.
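As footnote 10 hints, applying a fix to dozens of existing packages doesn’t have to be a manual job, either. Here’s a rough sketch of what a PacMan-style batch update might look like using the SSIS .NET API – the folder path and the specific fix (adding a missing package-scoped variable) are hypothetical:

    using System;
    using System.IO;
    using Microsoft.SqlServer.Dts.Runtime;

    class BatchPackageFixer
    {
        static void Main()
        {
            Application app = new Application();
            foreach (string path in Directory.GetFiles(@"C:\Projects\ETL", "*.dtsx",
                SearchOption.AllDirectories))
            {
                Package package = app.LoadPackage(path, null);

                // The "fix" - here, adding a variable the old template was missing.
                if (!package.Variables.Contains("LoadDate"))
                {
                    package.Variables.Add("LoadDate", false, "User", DateTime.Now);
                }

                app.SaveToXml(path, package, null);
            }
        }
    }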

And to be fair, the definition and implementation of policies and procedures can help make following the “official” approach less bad, but the starting point is inherently flawed for team development. If anyone were to suggest to me that following KB 908018 was the right way to approach package templates, I would have to politely[11] disagree.

What do you think? Even though I have been around the block a few times with SSIS, I don’t claim to know it all. I’d love to hear your thoughts on how you use package templates – especially if I took your name in vain earlier on - including what works and what does not work in your team environment.

 

[1] I never seem to find the time to do tourist stuff when traveling for work, but for some reason I just could not say no when I was invited to speak on SSIS best practices when in Sweden for a friend’s wedding. I have the world’s most understanding wife.

[2] I say “pretty much” because using small, modular packages and executing them via the Execute Package task or some other client, or using the SSIS .NET API to dynamically construct and execute packages are the other parts. In SQL Server 2005 and SQL Server 2008 there is not a lot to say about reuse in SSIS.

[3] Or projects using any other technology, for that matter. If you know of any projects like this and are looking for a consultant, I’m pretty sure I can work out a moonlighting arrangement with Microsoft. ;-)

[4] Or have seen me blatantly plagiarize his slide decks for my own talks.

[5] No Ted, this doesn’t include you. Sorry.

[6] I won’t go into the details of the problem because it isn’t really important to the story, but it could be anything along the lines of:

  a) A missing variable, which needed to be added.
  b) A variable defined at an incorrect scope, hiding a variable at the package scope when accessed from a Script task.
  c) A missing log provider, which needed to be added to support new project requirements.

You get the idea, right?

[7] Yes, people really do this. Yes, I’m thinking about doing it right now. I’ve been on vacation for a week and have been checking email at least once a day to respond to high priority items, but I still have nearly 700 unread messages in my inbox. ;-(

[8] Or Visual SourceSafe or pretty much any modern SCC system. But TFS has so many additional benefits, it’s hard to recommend anything else.

[9] Just to be clear, I’m not implying that there is any ability to browse TFS repositories in BIDS, but instead am saying that the developers all have the most recent packages from TFS and can browse to them on their local file systems.

[10] Odds are they use a tool like PacMan to batch update the existing packages. ;-)

[11] Or perhaps not so politely depending on the hour…

Thursday, April 9, 2009

More on Package Configurations

Douglas Laudenschlager from the SSIS documentation team has posted an excellent article to his unofficial SSIS blog on how package configurations are applied in SQL Server 2005 and SQL Server 2008 Integration Services. Check it out here.

I don’t often post on my blog simply to reference someone else’s original work, but I wanted to do it today because:

  • Package configurations are a vitally important feature that is often misunderstood.
  • The manner in which configurations are applied has changed between SQL Server 2005 and SQL Server 2008.
  • These changes can cause differences in package behavior when executing packages with specific DTEXEC options (see the example below).
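The options in question are the ones that apply configurations and override values from the command line. For example, an invocation along these lines (the file names and variable are hypothetical) can pick up its configured values differently in 2008 than it did in 2005:

    dtexec /File LoadCustomers.dtsx /ConfigFile Production.dtsConfig /Set \Package.Variables[User::BatchSize].Properties[Value];500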

So if you are moving from SQL Server 2005 to SQL Server 2008, definitely be sure to read about these changes.

Sunday, April 5, 2009

Visual Studio Window Layout for SSIS Development

My friend and colleague Lynn Langit has a habit of blogging where she posts an image and a brief question and that’s it. I’m very jealous, because I can’t seem to write a blog post that’s shorter than 20 pages or so.

So in this post I’m trying to be More Like Lynn.

This is how I configure my Visual Studio development environment for SSIS development:

(Screenshot: my SSIS window layout in Visual Studio.)

How do you configure your windows?

Ok, so now if I were More Like Lynn, that would be the whole post. But since I haven’t had a mango martini in far too long, I feel compelled to elaborate:

  1. I dock the Variables window in the upper left corner in its own tab group.
  2. I dock the Toolbox window in the lower left corner in a different tab group.
    (This lets me see both the Variables and the Toolbox at the same time during package development, which is often very useful.)
  3. I dock the Log Events window in the same tab group as the Toolbox, because you generally don’t use them both at the same time.
  4. I dock the Solution Explorer window in the upper right corner.
  5. I dock the Team Explorer window in the same tab group as Solution Explorer.
  6. I dock the Properties window in the lower right corner in its own tab group.
  7. I dock the Error List window on the bottom.
  8. I dock the Output window on the bottom in the same tab group.
  9. I dock the Pending Changes window on the bottom in the same tab group as the Error List and the Output windows.
  10. I leave all of the designers in the middle, since this is where they always go. :-)

This layout gives me ready access to all of the tools I need, when I need them. And I would still love to know how you lay out your windows in your SSIS development environment…

SSIS and the Package ProtectionLevel Property

This is another one of those “I could have sworn I blogged on this years ago” posts.[1] I was looking for something I’d already written that discussed the ProtectionLevel property of the SSIS Package, but search as I might I could not find it. So I guess it’s time to write it, eh?

(As a note for the lazy reader, the moral of the story is that you should use the DontSaveSensitive ProtectionLevel for your packages, and use package configurations to apply the sensitive property values to your packages. Feel free to read the next five pages, but it all boils down to this.)

First of all, let’s start off by reviewing the property itself. Take a look at the properties of any SSIS package and you’ll see it there:

(Screenshot: the ProtectionLevel property in the package’s Properties window.)

Notice the description there at the bottom of the Properties window? “Specifies the protection mode and the protection method of the package.” Hey – that’s pretty self-explanatory, so maybe we don’t need a blog post after all. Right?

Maybe not.

Before we continue, why not take a look at what SQL Server Books Online (BOL) has to say about this property?

Ok, so you’ve read the docs. Now let’s review:

  • Component developers get to specify which properties of their components are considered sensitive.
  • For the built-in components, this includes the password property of connection managers, some system variables and the XML nodes in the underlying DTSX file that store this information.
  • SSIS users (that means us, the package developers) cannot control what is considered sensitive.

Why does this matter?

Well, SSIS is designed to be secure. One aspect of this is that any “sensitive” information that is part of your package (think about it – many packages include information like passwords in connection strings for connecting to databases and legacy systems where valuable data is stored) needs to be protected. And by protected, we mean never, ever stored in clear text.

That’s right, donkey – never, ever.[3]

But wait, you say – I’ve built many SSIS packages and I’ve never been prompted to encrypt any sensitive anything. How am I supposed to believe you?

Good question. Like most developers[4], the fine members of the SSIS product group wanted to avoid irritating their users.[5] Because of this, they wanted to make protecting sensitive data seamless, even transparent. So they did. Take another look at that screen shot of the Properties window above. Do you notice the EncryptSensitiveWithUserKey option? That’s the default.

This means that, by default, SSIS uses the Data Protection API (DPAPI) to encrypt any sensitive data with the user key of the user who saves the package. Every Windows user has a user key, so this is pretty straightforward. Then, when the user goes to open the package again, the encrypted data is automatically decrypted without the user needing to do anything. Pretty slick, eh?

Yep. Pretty slick. Especially for SSIS developers who live in a world where they never work in teams and never have their packages run in an unattended batch job that uses a service account.

Unfortunately, I don’t live in that world, and neither do many of the SSIS developers I know.

Ok, let’s be fair. That sounded awfully critical, but it wasn’t intended to be. I generally don’t like to criticize unless I can suggest a better alternative, and in this context I don’t see a more reasonable default behavior. It just works, and it lets developers be productive with SSIS without needing to worry about manually configuring security settings.

The criticism that I have is that people just don’t seem to understand this, and that causes problems. If one developer didn’t understand it, then it’s probably his fault. But if dozens or hundreds of people have this same problem (and they do – just take a look at the SSIS forums on MSDN to see how often questions related to the ProtectionLevel property come up) then the communication about the feature hasn’t been what it could be.

So what is the problem, and how does it show up? Well, consider one of these situations:

  • The package developer checks his package into source control[6] and another developer checks it out to make changes. The second developer can open the package, but gets an error saying “Failed to decrypt protected XML node "DTS:Password"” and the package will not execute.
  • The package developer deploys the package to the production server and the DBA schedules it to execute via a SQL Server Agent job. When the job is run, the job step fails with an error saying “Failed to decrypt protected XML node "DTS:Password"” and the package will not execute.

I won’t go on. You probably see the trend already. With the default settings, initial development is easy, but downstream tasks are more difficult. Only the user who saves the package can edit and execute it without error. This is probably not what we want.

What about the other options? Let’s consider:

  • EncryptAllWithUserKey – This option has all of the drawbacks of the default EncryptSensitiveWithUserKey option, but has the additional drawback of encrypting the whole darned package, so not only can no other user execute the package without re-entering the sensitive data, no other user can even open the package. Probably not what you want.
  • EncryptSensitiveWithPassword – This option relies on the package developer entering a password for the package, and then entering that password every time the package is opened or executed. Um… right. I don’t know about you, but my real-world SSIS projects tend to involve many dozens of packages. And I typically have a dozen or more open in Visual Studio at any given time. So picture me staggering to my desk on a blurry Monday morning, desperately clutching my mug of coffee, and launching Visual Studio. Visual Studio loves me, so it has remembered the packages I had open when I shut down for the weekend, and in a display of digital affection it re-opens them all.[7] Now picture the look on my face as I am prompted to enter the &^%$%@(* package $#^%#& password for every &@%$(^@ package as it opens. (And for that matter, picture the look on the faces of the poor people from accounting down the hall as they run for the fire escape.) And to make matters worse, this option pretty much negates the security it’s designed to provide, because every developer on the team needs to know the package password, and at least one developer[8] will put that password on his white board or on a Post-It Note.
  • EncryptAllWithPassword – Just when you thought it was safe to go back into the Properties window, here comes an option which combines the worst of all worlds. Not only do you have to enter the &%#*& password every time you open the package, the whole package contents are also encrypted so if you forget the password[9] you’re pretty much SOL[10].

So what does that leave us? That leaves us with two options. The first is the one that I use and advocate: the DontSaveSensitive option.

This option, like the others we’ve seen so far, is pretty much self-describing. It protects the sensitive data by simply not saving it in the package file. What’s not to like? No password, no user keys, everyone can open the file without error. It’s a little slice of perfection in our imperfect world!

Maybe.

Maybe not.

The difficulty, of course, comes in when you actually want to execute the package. The package can’t run if it doesn’t have a valid connection string to connect to that legacy database server from which it extracts data, now can it? So where does the sensitive data get stored, if not in the package? The answer is that we, the package developers, need to store that sensitive data in package configurations. This could be XML config files, a SQL Server database, or the Windows Registry – there are many options. And each one of them can be secured in its own way, through database permissions, NTFS permissions or Registry ACLs.

The problem that some developers have with this is that it doesn’t let them get down to doing what they want to do the most, which is build their packages. And what do we say to those developers?

We say “tough luck!”

We don’t have much sympathy for those developers, because they’re missing the big picture. Yes, it can slow things down up front to have to add configurations to your packages, but this saves a ton of time later on. Configurations are a required feature for packages that will need to be deployed to multiple environments (and unless your packages will only ever be run by you on your machine, this means your packages) and by adding them right at the beginning of package development you will have a much greater chance of a successful deployment than if you wait until you start deploying to start thinking about deployment.

So this is what I recommend that SSIS developers do when building their packages:

  • Set the ProtectionLevel property to DontSaveSensitive.
  • Add a package configuration for each connection string or other sensitive property in the package.
  • Test everything and make sure that it works.
  • Save the package and use it as a template package moving forward.[11]
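If you prefer to see it in code, here’s a minimal sketch of the first two bullets using the SSIS .NET API – the file paths are hypothetical, and in BIDS you’d just set the property and use the Package Configurations dialog instead:

    using Microsoft.SqlServer.Dts.Runtime;

    class ProtectionLevelSetup
    {
        static void Main()
        {
            Application app = new Application();
            Package package = app.LoadPackage(@"C:\SSIS\LoadCustomers.dtsx", null);

            // Don't persist sensitive values in the DTSX file at all.
            package.ProtectionLevel = DTSProtectionLevel.DontSaveSensitive;

            // Supply the sensitive values at load time from a configuration instead.
            package.EnableConfigurations = true;
            Configuration config = package.Configurations.Add();
            config.ConfigurationType = DTSConfigurationType.ConfigFile;
            config.ConfigurationString = @"C:\SSIS\Config\Connections.dtsConfig";

            app.SaveToXml(@"C:\SSIS\LoadCustomers.dtsx", package, null);
        }
    }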

Ahhh… life is good. Life is so much simpler when you have prescriptive guidance.

But wait… what about that last option? What about ServerStorage? Should we ever use that one?

ServerStorage can only be used when deploying packages into a SQL Server database[12]. If you deploy to SQL, then you can rely on the database to protect your sensitive data. But as you probably know, this doesn’t apply to package development. Visual Studio can only work with DTSX files on the file system – it cannot open packages from or save packages to SQL Server. This means that ServerStorage is a viable option once package development is complete, but not during development. During development you want to use DontSaveSensitive, even if you are planning to deploy to SQL.

So there you have it. I’ve spent five pages[13] saying “use the DontSaveSensitive package ProtectionLevel option when you develop SSIS packages.” But even though it took longer to read, hopefully you’ll have a better idea of why you should choose this option than if I’d just come out and told you…

 

[1] Just as a side note, re-read the subject line. Doesn’t it sound like the name of a children’s story? It was all that I could do to not write this post as an AeSISSop’s Fable or modern cautionary tale[2].

[2] I also started putting it to the tune of the Gilligan’s Island theme before I caught myself. I haven’t been feeling well lately, you see…

[3] Say it in your best Shrek voice and it may make sense. Especially in the fable context.

[4] Read the rest of the sentence and then have a good laugh with me, ok?

[5] Wait, then wipe the tears from your eyes, and then start reading again.

[6] And we all use source control, right?

[7] Awww….

[8] You know who I’m talking about.

[9] And can’t find your way down the hall to Bob’s white board.

[10] That’s “Server Offline” to you.

[11] There will likely be other things that need to be set up in your template package too, but this is a good start.

[12] I personally prefer file system deployments and so do most of the SSIS professionals I know. Check out this post from Jamie Thomson where he discusses this topic, and make sure you also click on the link at the top of his post too.

[13] And if memory serves me, set a new record for the number of footnotes in a single blog post.

Saturday, March 28, 2009

SSIS Development and Source Code Control

In my personal best practices for SSIS development, I claim that we need to treat SSIS packages like code. When I give presentations on this topic, I have a slide that elaborates and talks about how we must use source code control for our SSIS projects just like we do[1] for our “traditional” VB and C# projects. It’s one of those irrefutable best practices of software development; you wouldn’t drive your car without your seat belt, and you’d never even think of building software without source code control.

In fact, this is so obvious and so important, I had assumed that I must have blogged about it years ago. So I came here tonight to get the URL so I could reference it somewhere else.

And I couldn’t find it. So here’s the post I was expecting to find,[2] so that the next time I look I will find it here.

The points that I like to bring up in my presentations are how even though SSIS development is software development, there are inherent differences in the tools that can turn into “gotchas” if you’re not careful. In particular, walk through these basic steps just to make sure you’re following me:

  1. Open an SSIS package in Visual Studio – a package that has at least one user variable defined.
  2. Look up at the tab at the top of the package designer. You’ll see that the tab displays the name of the DTSX file followed by the word [Design] in square brackets, in case you weren’t sure you were looking at the designer and not the raw XML.
  3. Ensure that there is no “*” asterisk after the word [Design] – this asterisk shows that the file has been modified and needs to be saved, and for the purposes of this exercise we need it to not show up. If your package does show an asterisk, save it now to make the asterisk go away.
  4. In the Variables window, click on the name of a variable. Don’t change anything, just click.
  5. Then click on the design surface. Again, don’t change anything – just click.
  6. Then look up at the tab. What do you see?
  7. That’s right! An asterisk!! The package has been changed by a non-editing action on your part. Has the world gone mad?!?

Um… probably not. Or at least, not yet.

But let’s assume for the sake of argument that you followed this best practice and have been using source code control. And let’s further assume that, since no one told you any different, you set up your source control plug-in in Visual Studio and left it with the default settings.

Uh-oh.

Let’s take a look at those default settings, shall we?

(Screenshot: the default source control settings in Visual Studio – files are checked out automatically on edit.)

That’s right. The default settings are evil.

Why evil? Well, think back to the “not really editing a variable name” scenario from earlier. If this package were under source control, you just checked out that package without meaning to, and probably without realizing it. And that means that as soon as you walk out the door to go on that two-week vacation[3], you just became that guy. That’s right – the guy who gets blamed for things because he’s not there. Because the default source control settings in Visual Studio are to silently check out a file when the developer starts editing it. And while this may make sense for traditional code files, it definitely does not make sense for DTSX files.

And the worst part is, you really are to blame here, at least in part, because you did not resist, you did not fight back against the evil that consumed you and is now ravaging your team’s productivity. You’re kind of like Gollum.[4]

But do not despair – it is not too late to turn back from the path of evil! In fact, you really just need to change one drop-down, like so:

(Screenshot: the same settings with “Prompt for exclusive checkouts” selected.)

That’s right – just say “Prompt for exclusive checkouts” and the problem goes away[5]. Any time that you perform a step in the SSIS designer that would cause the package to get checked out (we walked through one with the variable name thing, but there are many more) you’ll be prompted to check out the file.

And that’s about it. As usual, I spent five pages to say what could have been done with one screen shot and ten words, but hopefully you’ll understand the context a little better this way.

So always use source code control, but also be certain that you have configured it correctly for SSIS.

[1] We do, don’t we?

[2] Yes, it’s late on a Saturday night, but hopefully you too will see that this is more than a little like time travel. Or is it just me?

[3] I did mention that you are going on vacation, right? Go on – you’ve earned it!

[4] Did I mention how much I loved you in those movies?

[5] Yes, this means that Sam and Frodo’s budding romance will probably never get a chance to bloom, but that is a price I’m willing to pay for you to enjoy your vacation.

Wednesday, December 31, 2008

Christmas Came Late!

Santa has just delivered the gift that many SSIS developers have had on their Christmas lists: more samples and options for programmatically building SSIS packages. Although in this case, the role of Santa is played by Matt Masson of the SSIS team, who must have gained quite a bit of weight since the last time I saw him. Matt posted seven new articles on his blog yesterday, including this index post here: http://blogs.msdn.com/mattm/archive/2008/12/30/samples-for-creating-ssis-packages-programmatically.aspx

There are samples for building a package with a data flow task, and for adding OLE DB Source and Destination components, adding the ADO.NET Source component, adding a Row Count transformation and adding a Conditional Split transformation. Each post includes nicely commented C# code, and each one goes a long way towards filling the documentation gap around the SSIS data flow API.

And if that's not exciting enough, Santa was just getting started. Evgeny Koblov, a tester on the SSIS team, has gone far beyond simply making it easier to work with the less-than-intuitive COM API exposed by the SSIS data flow. He has built a better API. It's called EzAPI and you can read about it on Matt Masson's blog here: http://blogs.msdn.com/mattm/archive/2008/12/30/ezapi-alternative-package-creation-api.aspx. You can also download it and start using it on the CodePlex web site here: http://www.codeplex.com/SQLSrvIntegrationSrv/Release/ProjectReleases.aspx?ReleaseId=21238

EzAPI is probably the most exciting thing I've seen coming into the world of SSIS since the release of SQL Server 2008, if not before. It's essentially a native .NET wrapper around the underlying COM API, which doesn't sound particularly interesting at first glance, but since it delivers the ability to quickly and easily build SSIS packages and data flows through code, it's sure to be a time-saver (if not life-saver) for many SSIS developers.

Now please excuse me while I download EzAPI and start to play...

Monday, December 15, 2008

Changes in SQL Server 2008

MCT Russ Loski recently shared this great link with the SQL Server trainer community:

http://msdn.microsoft.com/en-us/library/cc280407.aspx

This is the top-level "Backward Compatibility" topic from SQL Server 2008 Books Online, and includes sub-topics for the Database Engine, SSAS, SSIS, SSRS and replication, covering deprecated features, behavioral changes, breaking changes and more.

If you're looking into what's new and different (and what might bite you if you're not careful) with SQL Server 2008, this is a great place to start. Thanks, Russ!

Wednesday, December 10, 2008

"Bonus" Technical Deep Dive Session at the MCT Summit - Redmond and Prague

The vast majority of the content at the 2009 MCT Summit events is heavily weighted toward the IT Professional audience. This is largely due to the fact that the major developer and database product releases took place in 2007 and 2008, while there are significant releases in-flight for Windows and Exchange Server. But I believe there should still be some deep technical content for at least one underrepresented trainer audience: the BI developer.

So I'm going to fill in this gap by presenting an "off the schedule" technical deep dive on SQL Server Integration Services. I did something similar at the 2008 Redmond summit and it was very well received, so I'm going to base the 2009 Summit session on the same model. Here's the deal:

“Everything You Ever Wanted to Know About SQL Server Integration Services but Were Afraid Your Students Would Ask”

In this technical “deep dive” session, Matthew Roche will take attendees on a wild and sometimes horrifying ride into the dark underbelly of real world SSIS development that existing SSIS books and courseware don’t effectively cover, including development and deployment best practices, data flow internals and performance tuning and more. But be warned – there will be no fixed agenda for this session! The topics covered will be driven by attendee involvement, so the more questions you bring to the session the more everyone will get out of it. If you teach (or fear you may be asked to teach) the SSIS courses for SQL Server 2005 or SQL Server 2008, this is a session you don’t dare to miss.

Does this sound interesting to you?

If you are attending either event (Prague or Redmond) and would be willing to attend this session after the summit sessions end one day[1] then please reply here[2] to express your interest. If anyone (even one attendee) is interested then I will come prepared to spend as much time as necessary (I think we ran around three hours at the 2008 Summit in Redmond) to give each attendee everything he needs. So speak now or forever hold your data...

[1] This is a key point, as it will mean skipping on some other planned after-hours event, I'm sure.

[2] Or in the microsoft.private.mct.mctsummits newsgroup.

Thursday, December 4, 2008

Know Your Data... Werewolves

(That's right. Werewolves.)

Over a year ago, I posted a list of SSIS best practices, and I've presented on this topic a dozen or more times since then at conferences and user groups and seminars. One of these best practices was "Really know your data – really!" and the advice went something like this:

"If there is one lesson I've learned it is that source systems never behave the way you expect them to and behave as documented even less frequently. Question everything, and then test to validate the answers you retrieve. You need to understand not only the static nature of the data - what is stored where - but also the dynamic nature of the data - how it changes when it changes, and what processes initiate those changes, and when, and how, and why. Odds are you will never understand a complex source system well enough, so make sure you are very friendly (may I recommend including a line item for chocolate and/or alcohol in your project budget?) with the business domain experts for the systems from which you will be extracting data. Really."

Well, last night I stumbled across some old meeting notes that reminded me of why this is so very important. The notes were incredibly brief, and consisted largely of a quote from the client project owner. But before I share the quote, please let me share the context and story[1]...

I was helping to design and build a BI application that my client was then re-selling as a service to their customers.[2] We were loading in data from a variety of source systems that provided the same type of data, but which were implemented by different vendors. One of the systems, called FooBar,[3] was consistently causing us problems. The data import was running without error, and everything looked good on the surface, but the FooBar-using customers were unhappy[4] because the data they were seeing through the BI portal was incorrect.

Yuck.

The client project owner had, years earlier, worked with the development group that built FooBar, and with his insight we were able to discover that occasionally the data export process in FooBar was failing silently, so that we were getting only partial exports, and there was no way to tell - no error was raised, no checksum existed, and because of some nasty internal details (which I won't go into here) the effort involved in updating FooBar would be very significant. It was not a good time.

So what's the quote? It was this:

"You guys planned for wolves. FooBar is werewolves."

Even after all this time, I still remember the deadpan delivery for this gem. And it was so true. We'd gone in thinking we were prepared for all of the usual problems - and we were. But we weren't prepared for the horrifying reality of the data problems that were lying in wait. We weren't prepared for werewolves.

So what's the moral of the story? Plan for werewolves. Assume the worst. Test early and often, and test failure scenarios, not just happy-day scenarios. Because the time to learn that there really are werewolves is when there is still time to pack a crossbow and some silver bolts, not when the full moon is rising.

Trust me.

 

[1] Names have been changed to protect the innocent. And the other people too...

[2] This has been a recurring theme for me, so it's probably not the company you're thinking of.

[3] I changed the names, remember?

[4] I may also use a literary tool called "understatement" in this story...

Choosing the "Right" SSIS Configuration Location

One of the questions I get asked quite often boils down to this:

"Should I store my SSIS configuration data in SQL Server or in XML configuration files?"

And my every-so-helpful answer usually is:

"Yes!"[1]

That's right - very accurate, but not very useful.

Of course, this question isn't quite as simple as it might seem at first glance, and there really is no single answer that is both generally accurate (like mine is) and useful (which mine is not) at the same time. The truly useful answer is more complex, because it relies on the context of the question more than it relies on the question itself. There is no "best" or "correct" or "right" choice for configuration storage in general - it all depends on the packages, the project and the project team.

What does this mean? I like to summarize it something like this:

  • If you are using a file system deployment, it probably makes more sense to use XML configuration files.
  • If you are using a SQL Server deployment, it probably makes more sense to use SQL Server configurations.
  • If your ETL solution is managed by the application owner or server administrator, it probably makes more sense to use XML configuration files.
  • If your ETL solution is managed by the database administrator, it probably makes more sense to use SQL Server configurations.
  • If project team members and/or administrators have past experience and success with a given configuration type, it probably makes sense to use that type unless there is some compelling project-specific reason to do otherwise.

Notice that there's nothing in this list about the capabilities of the different configuration types. I've never found a problem that could be solved with one configuration type that could not also be solved with the other.
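One practical difference that isn't about capabilities: SQL Server configurations live in a plain table (named [SSIS Configurations] by default), so whoever manages the database can inspect and update configured values with ordinary T-SQL - one reason DBA-managed environments tend to prefer them. A hypothetical example (the filter name, connection name and server are made up):

    -- Repoint every package using the 'Prod' filter at a new server,
    -- without touching a single DTSX file.
    UPDATE dbo.[SSIS Configurations]
    SET    ConfiguredValue = 'Data Source=NEWSERVER;Initial Catalog=Staging;Integrated Security=SSPI;'
    WHERE  ConfigurationFilter = 'Prod'
    AND    PackagePath = '\Package.Connections[Staging].Properties[ConnectionString]';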

Of course, there are differences in the approaches that you use when selecting and implementing a configuration strategy for an SSIS application. And it's not always obvious how to get started, or where to go once you do get started. Fortunately, there are quite a few resources available to help you make the right decision and to implement your configuration strategy correctly the first time. In particular, there is a small set of MSDN community articles[2] written by SSIS-focused SQL Server MVPs about SSIS configurations. Take a look at these:

The articles are listed in the order of "most likely valuable" so if you don't have time to read all four, you should read the top ones first.

Check it out, and good luck!

[1] I blame my mother for this. My childhood was filled with conversations that went like this: "Are we having chicken for dinner, or fish?" "Yes!" "Um..." ;-)

[2] There is an interesting story behind these articles. Shortly before SQL Server 2008 shipped, the SSIS team asked a bunch of MVPs if they would like to write SSIS articles for MSDN. There was no guidance or suggestions provided on what topics these articles should cover. But all but one of the MVPs who provided articles independently chose to write about some aspect of configurations. To me this reinforces my belief that configurations are both one of the most important and least understood features in all of SSIS.

Friday, October 31, 2008

Just One More Reason to Love SSIS

Alternate title for this post: "Is Sarah Palin BI?" But I didn't want to go there, and I'm sure you appreciate that.

But here in the US, the presidential election season is nearly over (just a few more days to go) and many people are getting sick and tired of listening to the candidates talk. But for the last few months (or years) they have been talking a lot, and people have been transcribing what they say for the press and for posterity.

And someone (sadly I do not know who[1]) has taken those words, particularly the words of vice presidential candidate Sarah Palin, and run them through the text mining features of SQL Server Integration Services. And the UK IT news web site The Register has the scoop.

What's not to love? Politics, plus killer BI tools, plus dry British humor[2] is a surefire recipe for holiday laughs.

Happy Halloween!

[1] Although if the author of the SSIS package reads this and wants to share the DTSX with me, I would love to have it ;-)

[2] Or does that need to be "humour?"

Thursday, October 30, 2008

Join Me in Prague!

Earlier this week the MCT team here at Microsoft Learning went live with the MCT Summit 2009 web site. There's information about the two currently scheduled summits:

  • Prague, Czech Republic - Jan 6, 2009 - Jan 9, 2009
  • Redmond, Washington, US - Jan 20, 2009 - Jan 23, 2009

We're also going to be announcing a 3rd Asia-Pacific region event soon, but the dates and location aren't yet finalized.

But the excitement here[1] today is that I will be attending the Prague event. I wasn't sure until today if I would be able to make it to Prague, but everything has come together and I will be there representing Microsoft Learning in my new Quality Architect role.

And for my own personal pleasure, I'm hoping to do a repeat performance of my "SSIS deep dive" session from February's MCT Summit event in Redmond, and present an "Everything you ever wanted to know about SSIS but were afraid your students would ask" technical session in addition to my Microsoft Learning presentations. The Summit schedule is still up in the air, but even if I can't get on the official schedule I will still do the session if I can find a room and an audience.[2] If you're interested, let me know early, ok?

I'll see you there!

[1] "Here" in this case is my office in New York - I'm sure that everyone at Microsoft Learning in Redmond is probably saying "Oh no, not that guy again!" ;-)

[2] This is what happened in Redmond at the 2008 MCT Summit. Everything came together at the last minute, so there was no way to get on the real schedule, but we ended up with a few dozen MCTs who stuck around for almost four hours of SSIS goodness, even though Microsoft was supplying MCTs with free drinks and food on the other side of town. How's that for commitment?

Wednesday, October 29, 2008

Certification Offers at PASS

Are you going to be at the PASS Community Summit 2008 conference next month in Seattle? I am, and you should be too. Why? Well, if the opportunity to attend amazing sessions and rub shoulders with members of the SQL Server product group and SQL Server MVPs wasn't enough, Microsoft Learning is offering a variety of certification-related offers to help PASS attendees get certified on SQL Server 2005 or SQL Server 2008.

First of all, there are two discount offers available.

  1. If you're among the first 100 to register for an exam, you get the exam for free. This offer has been published for a while, so act now for a chance to get this 100% discount.
  2. If you're not among the first 100, you still get a 25% discount on any exams you want to take during the conference - just register and go.

Next, Microsoft Learning is also offering a set of exam prep sessions for PASS attendees who want an extra edge when taking a certification exam. Here's the tentative schedule (it's really unlikely that this is going to change, but you know how conferences are - you should always check the schedule on site once the event begins) for the exam prep sessions:

  • Tuesday, November 18
    • 10:15am - 70-432 TS: Microsoft SQL Server 2008, Installation and Maintenance
    • 1:00pm - 70-431 TS: Microsoft SQL Server 2005 - Implementation and Maintenance
    • 2:45pm - 70-447 UPGRADE: MCDBA Skills to MCITP DBA by Using Microsoft SQL Server 2005
  • Wednesday, November 19
    • 10:15am - 70-448 TS: Microsoft SQL Server 2008, Business Intelligence Development and Maintenance
    • 1:30pm - 70-445 TS: Microsoft SQL Server 2005 - Business Intelligence
    • 3:00pm - 70-455 Upgrade: Transition your MCITP Business Intelligence Developer Skills to MCITP Business Intelligence Developer 2008
    • 4:30pm - Cert 101
  • Thursday, November 20
    • 10:15am - 70-432 TS: Microsoft SQL Server 2008, Installation and Maintenance (repeat)
    • 1:30pm - 70-450 PRO: Designing, Optimizing and Maintaining a Database Server Infrastructure using Microsoft SQL Server 2008
    • 3:00pm - 70-453 Upgrade: Transition your MCITP Database Administrator Skills to MCITP Database Administrator 2008

As you can probably guess, I'm going to be presenting the three BI exam sessions on Wednesday. They asked me "can you do two?" and I said "no, but I can do three!" The other exam prep sessions are going to be delivered by veteran MCTs Telmo Sampaio and Rob Boek, and the Cert 101 session is going to be delivered by none other than Microsoft Learning's own Gerry O'Brien. What's not to love?

So - if you're going to be at PASS and are thinking about getting certified on SQL Server, don't miss these offers. And be sure to look me up during the conference. I'll be in the certification area on Wednesday, and will probably be hanging out in the Ask The Experts Lounge at least some of the time (I have so many questions! ;-) and who knows where else I'll be the rest of the time - just keep your eyes open and be sure to say hello.

Monday, September 15, 2008

Business Intelligence vConference Update

If you've been following my posts about the SSWUG Business Intelligence vConference, you know that it's an online conference with 10 world-class speakers and 30+ sessions that cover the breadth of the Microsoft SQL Server BI stack.

But it just got even better.

Based on popular demand from conference attendees, the BI vConference and its sister SharePoint, SQL Server and .NET vConferences will be consolidated into a single SSWUG Ultimate Conference[1] that includes all 120+ sessions from 40 industry-leading speakers.

And you get all this for the $100 price tag of the BI-only conference.

The new conference dates are November 5, 6 and 7, so you also have some extra time to prepare for the show. Tell your friends, and I'll see you all there.

[1] I voted for the name of "ÜberCon" but got vetoed on this one...

Wednesday, September 10, 2008

Views as "Interfaces" for SSIS

This is a technique that I have used for quite some time, and while I have mentioned it to others often enough, I have never blogged on it. A recent conversation on the MVP newsgroups made me think of it again, and hopefully I'll get the chance to type it all up before I get pulled back to the real world.

Here's the scenario:

  • An SSIS package is extracting data from a 3rd party database.
  • The package is deployed to multiple dev/test/prod environments.
  • The package uses configurations so that deploying to multiple environments does not require changes to the package file.
  • The database schema of the various environments is "identical." [1]
  • The package works in one environment, but when deployed to another environment, it fails validation with the dreaded "VS_NEEDSNEWMETADATA" error.
  • After various debugging and diagnostic steps are completed, the package suddenly works again, even though "nothing has been changed."[2]

Hey wait - if the database schemas are identical, you shouldn't get this error, right? And since nothing has changed, the error shouldn't just disappear. What's going on here? Obviously SSIS shouldn't behave like this.

Of course, the real world story here is that someone was changing the schema of the test environment and had not communicated that change to anyone. Once this is known, the solution is pretty obvious, and it's clear that SSIS is behaving as desired, enforcing strict typing in the data flow.

But if this was a simple "user error" scenario it wouldn't be appropriate blog fodder, would it?

But unfortunately, this is not a rare, one-off scenario, and the solution generally involves more communication than it involves technology. But before looking at a solution, let's see if we can summarize the problem. These are the significant problem factors that I see when I look at this scenario:

  • The SSIS data flow is incredibly strict when it comes to data types. This strictness is closely related to interfaces in object oriented programming in that the names, order and data types of the columns being consumed cannot change in any way once a client (the SSIS data flow) exists that is built to use the interface.
  • When the "interface" of the source data changes, any data flows that reference it will break with the "VS_NEEDSNEWMETADATA" error mentioned above.
  • Database administrators are generally pretty good about not changing column names and data types on tables when they know that other database objects rely on those tables.[3]
  • There is no simple/easy/straightforward way for a DBA to tell if an SSIS package uses a given table. Solutions generally rely upon documentation that is external to the database, or just waiting until something breaks.

So how does this problem description help us reach a solution? Take a look at the last two bullets and the solution practically presents itself: Use database views as an "interface layer" within the database itself.

But what does this mean? In short, it means that:

  • The SSIS data flows should extract data from views and not from base tables.
  • The views should be "self-documenting" to DBAs because they are database objects with concrete dependencies on the base tables from which they draw their data.
  • The views should explicitly CAST each column to an explicit data type and column alias - even when the alias name and data type exactly match the base table - to provide an additional layer of insulation against changes to the base tables.

The key here comes from the second and third bullets, above. Having the views as "clients" to the base tables should make it more obvious to people who are changing table schemas that the tables in question are being used. And the explicit data type casting in the view definitions will define the "interface" required by the SSIS package clients in a way that is familiar to DBAs.
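Here's a minimal sketch of such an "interface" view - the table and column names are invented, but the pattern is what matters: every column is explicitly cast and aliased, even when nothing changes:

    CREATE VIEW etl.CustomerExtract
    AS
    SELECT
        CAST(c.CustomerID   AS int)           AS CustomerID,
        CAST(c.CustomerName AS nvarchar(100)) AS CustomerName,
        CAST(c.ModifiedDate AS datetime)      AS ModifiedDate
    FROM dbo.Customer AS c;

If someone widens CustomerName in the base table, the view still presents an nvarchar(100) column to the data flow - and the view shows up in any dependency check against dbo.Customer before someone changes it.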

Now obviously this isn't always an option - many DBAs will not allow the creation of new database objects (such as views) - but when this is viable it is a technique with which I have had great success. And when it's not, you can still get some of the same benefits (namely the data type change protection) by basing your data flow on a source query that explicitly casts and aliases each column in the SELECT statement. Then, so long as data type changes to the underlying tables can still support an explicit cast to the data type required by the SSIS data flow, the package still won't break.

So there you have it - a technique to isolate SSIS data flows from changes in source databases, and to reduce the likelihood of those changes happening in the first place.

What do you think? I'd love to hear if anyone has used similar techniques...

[1] Yes, this is in quotes for a reason

[2] Same thing here

[3] This is a pretty broad statement, I know...

Saturday, August 30, 2008

SSIS Sample Data Flow Transformations

Every few months I stumble across this old blog post by Euan Garden[1] that links to a bunch of sample transformation components that the SSIS team put out in the 2005 release timeframe. Check it out: http://blogs.msdn.com/euanga/archive/2006/02/14/IsTransforms.aspx

Now obviously this isn't new news, but I can never find it when I'm looking for it, so hopefully it will help me and you the next time one of us is looking for sample transformations...

 

[1] Always when I'm looking for something else entirely.

Thursday, August 28, 2008

New SSIS Article Online on MSDN

I have a new article on the SSIS Developer Center on MSDN, focusing on Data Sources and Configurations as tools for connection reuse across sets of SSIS packages. I wrote it months ago (looking at it now it seems even longer[1]) but it just made it online today.

Check it out here: http://msdn.microsoft.com/en-us/library/cc671619.aspx

This is actually one of a set of articles by SSIS-focused SQL Server MVPs that will be published over the next week or so - I just could not wait to let people know that it was out there. I'll post links to all of the articles once they're all online (and I probably won't be the only one to do so) but for now you can get started with this one. Enjoy!

[1] For example, the "about the author" blurb at the bottom lists me as still working at Configuresoft, even though I have not been working with Configuresoft since the end of July.