Tuesday, September 29, 2009

SQL Server MVP Deep Dives

I’m not a SQL Server Most Valuable Professional anymore[1], but I was when this amazing project got underway, and I am proud to contribute as an author on SQL Server Integration Services. SQL Server MVP Deep Dives is a unique book. It has:

  • 53 authors
  • 59 chapters
  • Coverage of database design, development, administration and performance tuning
  • Coverage of the SQL Server BI suite: SSIS, SSAS and SSRS

But that’s not the coolest thing about this book. The coolest thing is where the money goes. All author royalties (that’s right – 100%) go to War Child International, a network of independent organizations working across the world to help children affected by war.[2] So not only will this book help you become a better SQL Server practitioner, it will also help children around the world.

You can pre-order the book online today at the SQL Server MVP Deep Dives web site, and get immediate access to a pre-release PDF copy of the book. (Note: if you order before the end of September and use the code pop0928 at checkout, you will also save 50%!) Also, if you’re attending the PASS Summit in Seattle in early November, you can buy a copy at the conference bookstore and have it signed by many of the authors. I’ll be there, and based on the email I’ve seen, a majority of the authors and editors should be there as well.

If you only buy one SQL Server book this year, why not make it this book? And since all proceeds go to benefit a great cause, why not buy two?

I’ll see you at PASS!

[1] You can’t be an MVP and a Microsoft employee at the same time, so when I joined Microsoft they took away my MVP status. Which was OK.

[2] Not, as some may think, a non-profit organization dedicated to helping aging Jethro Tull fans.

Monday, September 21, 2009

SSIS Package Templates: The Good, The Bad and The Ugly.

This is one of those “I could have sworn I blogged on this topic years ago” posts. I speak to this topic whenever I’m presenting on SSIS development best practices, and with my half-day mini-seminar in Stockholm last week[1] I was thinking of it – and speaking about it – again. Although SSIS is a great platform and provides pretty much all the tools you need to get the job done, it’s not always easy to know which tools to use and how to use them.

Which brings us, of course, to package templates. So before we continue, please read this Knowledge Base article on how to create a package template in BIDS.

Done? Good – let’s review, and paraphrase while we do it.

Package templates are pretty much the entire reuse story in SSIS.[2] You define the baseline package, including connection managers, log providers, configurations, variables and expressions, tasks, containers and precedence constraints – and sometimes data flow components as well. When you create a new package, you create a copy of the template with this pre-defined starting point instead of starting with a blank package, so you can be sure that the new package is just like the baseline you created.
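If it helps to picture what such a baseline boils down to, here’s a minimal sketch that builds one programmatically with the SSIS .NET API (the same API I mention in footnote 2). Every name, connection string, path and log provider ProgID in it is a placeholder I made up – the point is just to show the kinds of “plumbing” a template typically carries:

```csharp
// A hedged sketch of a baseline "template" package built with the SSIS .NET API
// (Microsoft.SqlServer.Dts.Runtime). Names, connection strings and paths below
// are illustrative placeholders, not part of the original article.
using Microsoft.SqlServer.Dts.Runtime;

class TemplateBuilder
{
    static void Main()
    {
        Package template = new Package();
        template.Name = "ETL_Template";

        // Common plumbing: a connection manager every package will need...
        ConnectionManager staging = template.Connections.Add("OLEDB");
        staging.Name = "Staging";
        staging.ConnectionString =
            "Data Source=(local);Initial Catalog=Staging;Provider=SQLNCLI10;Integrated Security=SSPI;";

        // ...a standard variable...
        template.Variables.Add("BatchID", false, "User", 0);

        // ...and a log provider wired to that connection
        // (the ProgID version suffix varies by SQL Server version).
        LogProvider log = template.LogProviders.Add("DTS.LogProviderSQLServer.2");
        log.ConfigString = "Staging";
        template.LoggingMode = DTSLoggingMode.Enabled;

        // Save the baseline; developers then copy it rather than starting from blank.
        new Application().SaveToXml(@"C:\SSIS\Templates\ETL_Template.dtsx", template, null);
    }
}
```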

In keeping with the theme established in the post title, this is the “good” side of SSIS package templates. They’re valuable, and when you’re building a large number of packages for the ETL subsystem of a DW/BI project, they’re invaluable, in that the common “plumbing” required by all packages can be defined once and used consistently everywhere. This is particularly vital when working in a team environment, where different developers (each of whom inherently has his own style and his own way of building packages) are working on the same project – which is typically the case on any ETL project of more than trivial scale.

Which brings us to the “bad” side of package templates, at least as implemented when following the guidance in the KB article referenced above. This guidance is firmly targeted at the lone developer; did you notice how all the instructions about creating a package and saving it in the Visual Studio install folder never mention which computer you should be performing these steps on? That’s because the article assumes the context is “your machine” – the lone developer’s machine or, at scale, every developer’s machine.

Why is this bad, you ask? Surely it can’t be so much work to copy a DTSX file to the same folder on a few computers, can it?

Of course not. The “bad” side of following this approach shows up not when creating the templates, but when the real world comes into play and the project continues to grow and evolve. In a perfect world, where the needs of the project are understood 100% before development begins and never change after development begins, this approach will work just fine for team development. Your mileage may vary, but I’ve yet to work on an SSIS project[3] like this.

In the real world, most successful BI projects are iterative and agile, just like most other successful development projects. Look at the blogs of SQL Server MVP John Welch or MVP and VSTS Ranger Ted Malone, and you’ll see a strong focus on agile BI. If you read Kimball, you will see an overarching theme of agile and iterative development. And if you’ve ever seen former SSIS program manager Michael Entin present on SSIS performance tuning[4], you’ve noticed this best practice under the heading of “Principles of Good Package Design”: “Follow Microsoft Development Guidelines - Iterative design, development & testing.”

Ok, so why did I just go off on that tangent and drop so many names? Regular readers will quickly note that my writing style practically revolves around tangents and will likely seize on this as the reason, but today it is to drive home a point: The real world is neither pretty nor neat, and people who know more about BI than I will ever forget[5] have gone out of their way to point this out.

Why is this bad, you ask yet again? Surely it can’t be so much work to copy a DTSX file to the same folder on a few computers, can it? And can you please just get to the point, Matthew?

Maybe. Picture this story:

A small team of five or six SSIS developers are working together on a project. The lead developer has created a package template for all developers on the team to use moving forward. He emails the DTSX file to each team member along with a reference to KB article 908018 for instructions on how to install and use the template.

A week later the lead developer identifies a problem with the package template.[6] He updates the template package and emails it out to his team with instructions to update their local machines with the new package right away.

Unfortunately, one of the team members is a “heads-down” kind of developer who turns off email during the day, and finishes a few more packages before he gets the message, even though he does save the new template package once he checks his email.

And another developer is on vacation when the email is sent, and by the time he gets back he has too many new messages in his inbox to ever get through, so he decides that if anything is really important then the sender will re-send it, and he just deletes anything that is unread to get a fresh start with no email backlog.[7] So he keeps on developing packages using the outdated template until the next update is sent out a few weeks later…

You see where this is going, right? This is going to the “ugly” side of SSIS package templates. The ugly part comes downstream, when integration testing begins. The packages developed by the individual team members need to be deployed into a test environment and tested together, and of course at that point things start to fall apart. Everyone works extra hours and deadlines slip while people try to track down problems they knew had been solved long ago. And to be honest, the true worst-case scenario is that the test cases don’t find the problem at this point, and the ETL solution is deployed into production without the errors being resolved.

Unfortunately, I have seen this story play out in one way or another on several different projects over the years. The details are always different, but the ending is always the same: extra work, unhappy clients, and pain.

Now picture this story retold, with a few different details:

The lead developer works with his team to establish Team Foundation Server (TFS) source code control[8] before any development begins. He develops a set of packages that serve as “template examples” and checks them into TFS.

He also establishes standards with his team, so that each team member knows to Get Latest whenever starting work on a new package. This ensures that – no matter what happens – each developer always has the most recent versions of all packages with no need for “special case” tasks like updating a template.

The lead developer also instructs each development team member to create any new packages by right-clicking in Solution Explorer and selecting Add Existing Package, and then browsing to a current file from TFS.[9] This ensures that whenever a new package is created, it is created as a new instance of a “rolling template” stored in TFS, so the same version is always available for use by all developers.

Then, whenever problems are identified in existing packages, the “problem packages” are updated immediately, and the fixes are applied to any existing packages created from a now-outdated template.[10]
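How might those fixes get applied in bulk? Probably with something like the PacMan-style batch update footnote 10 alludes to. Here’s a hedged sketch of the idea using the SSIS .NET API – the folder path, variable name and value are invented for illustration:

```csharp
// Hedged sketch: batch-apply a template fix (here, a missing variable) to every
// existing package in a project folder. Folder and variable details are examples.
using System.IO;
using Microsoft.SqlServer.Dts.Runtime;

class RollingTemplateFixer
{
    static void Main()
    {
        Application app = new Application();

        foreach (string path in Directory.GetFiles(@"C:\Projects\ETL", "*.dtsx"))
        {
            Package pkg = app.LoadPackage(path, null);

            // The fix from the story: a variable that should have been in the template.
            if (!pkg.Variables.Contains("BatchID"))
            {
                pkg.Variables.Add("BatchID", false, "User", 0);
                app.SaveToXml(path, pkg, null);
            }
        }
    }
}
```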

And whenever anyone goes on vacation, problems are found earlier, their impact is lessened, deadlines are met, and everyone lives happily ever after.

Ok, so the results are probably not so extreme, but in my experience on multiple projects this “rolling template” approach is the best-practice way to handle SSIS package templates. Any attempt to apply the Microsoft guidance in a team environment runs into problem after problem, and at the worst extreme I have seen a development team manually re-implement dozens of SSIS packages using a new template, after churning for a full week while trying to reconcile and resolve the problems disseminated by the “official” template approach.

And to be fair, the definition and implementation of policies and procedures can help make following the “official” approach less bad, but the starting point is inherently flawed for team development. If anyone were to suggest to me that following KB 908018 was the right way to approach package templates, I would have to politely[11] disagree.

What do you think? Even though I have been around the block a few times with SSIS, I don’t claim to know it all. I’d love to hear your thoughts on how you use package templates – especially if I took your name in vain earlier on – including what works and what does not work in your team environment.

 

[1] I never seem to find the time to do tourist stuff when traveling for work, but for some reason I just could not say no when I was invited to speak on SSIS best practices when in Sweden for a friend’s wedding. I have the world’s most understanding wife.

[2] I say “pretty much” because using small, modular packages and executing them via the Execute Package task or some other client, or using the SSIS .NET API to dynamically construct and execute packages, are the other parts. In SQL Server 2005 and SQL Server 2008 there is not a lot more to say about reuse in SSIS.

[3] Or projects using any other technology, for that matter. If you know of any projects like this and are looking for a consultant, I’m pretty sure I can work out a moonlighting arrangement with Microsoft. ;-)

[4] Or have seen me blatantly plagiarize his slide decks for my own talks.

[5] No Ted, this doesn’t include you. Sorry.

[6] I won’t go into the details of the problem because it isn’t really important to the story, but it could be anything along the lines of:

  a) A missing variable, which needed to be added.
  b) A variable defined at an incorrect scope, hiding a variable at the package scope when accessed from a Script task.
  c) A missing log provider, which needed to be added to support new project requirements.

You get the idea, right?

[7] Yes, people really do this. Yes, I’m thinking about doing it right now. I’ve been on vacation for a week and have been checking email at least once a day to respond to high priority items, but I still have nearly 700 unread messages in my inbox. ;-(

[8] Or Visual SourceSafe or pretty much any modern SCC system. But TFS has so many additional benefits, it’s hard to recommend anything else.

[9] Just to be clear, I’m not implying that there is any ability to browse TFS repositories in BIDS, but instead am saying that the developers all have the most recent packages from TFS and can browse to them on their local file systems.

[10] Odds are they use a tool like PacMan to batch update the existing packages. ;-)

[11] Or perhaps not so politely depending on the hour…

Friday, May 22, 2009

Coolest. Tool. Ever.

I’m struggling with how to best start this post. Here are a few possibilities:

  • I like to think that I’m a hard person to impress. But…
  • I swear at my computer many times a day, but usually in anger and almost never in delighted surprise.
  • Someone deserves a raise.
  • Holy crap!

All of them apply, so you can take your pick.

But what am I talking about?

I’m talking about the DocumentReflector tool that’s included with the Open XML Format SDK v2.0.

You see, I’ve been playing with a few Office development tasks today for a work-related project. I’d never looked too closely at working with the Office Open XML file format before this week; I’ve read white papers and such, but until you start writing your own code you never really “get” what an API is all about. After a few false starts and many searches, I was less than impressed.

But then I discovered DocumentReflector. This is a stand-alone SDK tool that will (wait for it) let you open an Office Open XML document, select a subset of the document or the entire document, and (here it comes) display the C# code that will create it.

Holy crap indeed.
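Just to give you a taste of the kind of code it generates, here’s a small hand-written snippet in the same style – standard Open XML SDK code that creates a trivial Word document. The path and text are mine, and the real DocumentReflector output for an actual document runs much, much longer:

```csharp
// A minimal, hand-written example in the style of the code DocumentReflector emits:
// Open XML SDK (DocumentFormat.OpenXml) code that creates a tiny .docx.
// The file path and text are placeholders.
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

class HelloOpenXml
{
    static void Main()
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Create(
            @"C:\Temp\Hello.docx", WordprocessingDocumentType.Document))
        {
            MainDocumentPart main = doc.AddMainDocumentPart();
            main.Document = new Document(
                new Body(
                    new Paragraph(
                        new Run(
                            new Text("Hello, Open XML")))));
            main.Document.Save();
        }
    }
}
```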

I’m sure I’ll be swearing at this amazing tool before the weekend is over (the new wears off pretty quickly for me) but today whoever built this app is officially my hero.

Edit: I almost forgot to add – there’s a great overview of using this tool on Erika Ehrli’s blog, here.

Now I can’t help but ask why we don’t have a tool like this for SSIS packages. Wouldn’t it be great to point a utility at a DTSX file and have it output the code required to build that package through the SSIS API? ;-)

Tuesday, May 19, 2009

Spatial Demo Goodness

If you’re someone who presents on SQL Server topics, you have probably run into something of a wall when it comes to getting interesting spatial data sets to demonstrate. The spatial data included with the SQL Server 2008 sample databases is functional, but not particularly complex or interesting. There is free spatial data available from many different sources online, but it tends to be hard to find, supplied in different formats, and annoyingly difficult to load into SQL Server.[1] And of course, the not-free spatial content out there tends to be really, really not-free; it may make sense to pay a premium price if you are developing premium software, but for demo purposes this is generally a non-starter.

Enter GeoNames.

GeoNames is an open source provider of spatial data. Essentially they have taken many disparate sources of free spatial content and aggregated them into a single location, with many different access methods. They support web service access (and publish a nice set of client libraries too), which is nice for direct application integration, but to me the cool factor comes from the ability to download text dumps of the whole database or just the countries you want. Because then you can load the data into SQL Server 2008 and let the demo goodness begin.

And Ed Katibah, PM for the SQL Server spatial team, has posted instructions for loading GeoNames data into SQL Server 2008. It’s great to have these steps documented because there are quite a few of them, but hopefully you’ll only need to perform them once.
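Ed’s post is the authoritative walkthrough, but to give a rough idea of what the load involves, here’s a hedged sketch that bulk-copies a GeoNames country dump (tab-delimited text, with latitude and longitude in the fifth and sixth columns) into a staging table. The table name, file path and connection string are placeholders, and a geography column can be computed afterward with geography::Point:

```csharp
// Hedged sketch only - see Ed Katibah's post for the real, complete steps.
// Assumes a staging table dbo.GeoNamesStaging(GeoNameId, Name, Latitude, Longitude)
// with columns in this order; all names and paths are placeholders.
using System.Data;
using System.Data.SqlClient;
using System.Globalization;
using System.IO;

class GeoNamesLoader
{
    static void Main()
    {
        DataTable rows = new DataTable();
        rows.Columns.Add("GeoNameId", typeof(int));
        rows.Columns.Add("Name", typeof(string));
        rows.Columns.Add("Latitude", typeof(double));
        rows.Columns.Add("Longitude", typeof(double));

        foreach (string line in File.ReadAllLines(@"C:\GeoNames\SE.txt"))
        {
            string[] f = line.Split('\t');   // GeoNames dumps are tab-delimited
            rows.Rows.Add(int.Parse(f[0]), f[1],
                double.Parse(f[4], CultureInfo.InvariantCulture),
                double.Parse(f[5], CultureInfo.InvariantCulture));
        }

        using (SqlConnection conn = new SqlConnection(
            "Data Source=(local);Initial Catalog=SpatialDemo;Integrated Security=SSPI"))
        using (SqlBulkCopy bulk = new SqlBulkCopy(conn))
        {
            bulk.DestinationTableName = "dbo.GeoNamesStaging";
            conn.Open();
            bulk.WriteToServer(rows);
        }
    }
}
```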

So if you have been waiting for great spatial data that’s available for free, wait no longer.

I should also point out that I became aware of this cool resource not through my own hard work and research, but because of the excellent Simple Talk newsletter that Red Gate Software produces. And I should probably mention that the primary reason I blogged about it is that my friend and colleague, senior SQL Server trainer and all-around good guy Chris Randall, has been working on building better spatial demo sets, and I’ve heard rumors that he occasionally reads this blog. Hopefully this will save him (and you, and me) some work…

 

[1] Please keep in mind that I don’t claim to be a SQL Server spatial expert, so what is “annoyingly difficult” for me may be “exceptionally simple” for someone with more experience, but it is likely to be “annoyingly difficult” for many people.

Thursday, April 9, 2009

More on Package Configurations

Douglas Laudenschlager from the SSIS documentation team has posted an excellent article to his unofficial SSIS blog on how package configurations are applied in SQL Server 2005 and SQL Server 2008 Integration Services. Check it out here.

I don’t often post on my blog simply to reference someone else’s original work, but I wanted to do it today because:

  • Package configurations are a vitally important feature that is often misunderstood.
  • The manner in which configurations are applied has changed between SQL Server 2005 and SQL Server 2008.
  • These changes can alter package behavior when executing packages with specific DTEXEC options.

So if you are moving from SQL Server 2005 to SQL Server 2008, definitely be sure to read about these changes.

Sunday, April 5, 2009

Visual Studio Window Layout for SSIS Development

My friend and colleague Lynn Langit has a habit of blogging where she posts an image and a brief question and that’s it. I’m very jealous, because I can’t seem to write a blog post that’s shorter than 20 pages or so.

So in this post I’m trying to be More Like Lynn.

This is how I configure my Visual Studio development environment for SSIS development:

SSIS Window Layout

(Click on the thumbnail for a larger image.)

How do you configure your windows?

Ok, so now if I were More Like Lynn, that would be the whole post. But since I haven’t had a mango martini in far too long, I feel compelled to elaborate:

  1. I dock the Variables window in the upper left corner in its own tab group.
  2. I dock the Toolbox window in the lower left corner in a different tab group.
    (This lets me see both the Variables and the Toolbox at the same time during package development, which is often very useful.)
  3. I dock the Log Events window in the same tab group as the Toolbox, because you generally don’t use them both at the same time.
  4. I dock the Solution Explorer window in the upper right corner.
  5. I dock the Team Explorer window in the same tab group as Solution Explorer.
  6. I dock the Properties window in the lower right corner in its own tab group.
  7. I dock the Error List window on the bottom.
  8. I dock the Output window on the bottom in the same tab group.
  9. I dock the Pending Changes window on the bottom in the same tab group as the Error List and the Output windows.
  10. I leave all of the designers in the middle, since this is where they always go. :-)

This layout gives me ready access to all of the tools I need, when I need them. And I would still love to know how you lay out your windows in your SSIS development environment…

SSIS and the Package ProtectionLevel Property

This is another one of those “I could have sworn I blogged on this years ago” posts.[1] I was looking for something I’d already written that discussed the ProtectionLevel property of the SSIS Package, but search as I might I could not find it. So I guess it’s time to write it, eh?

(As a note for the lazy reader, the moral of the story is that you should use the DontSaveSensitive ProtectionLevel for your packages, and use package configurations to apply the sensitive property values to your packages. Feel free to read the next five pages, but it all boils down to this.)

First, let’s start by reviewing the property itself. Take a look at the properties of any SSIS package and you’ll see it there:

(Screenshot: the package ProtectionLevel property in the Properties window)

Notice the description there at the bottom of the Properties window? “Specifies the protection mode and the protection method of the package.” Hey – that’s pretty self-explanatory, so maybe we don’t need a blog post after all. Right?

Maybe not.

Before we continue, why not take a look at what SQL Server Books Online (BOL) has to say about this property?

Ok, so you’ve read the docs. Now let’s review:

  • Component developers get to specify which properties of their components are considered sensitive.
  • For the built-in components, this includes the password property of connection managers, some system variables and the XML nodes in the underlying DTSX file that store this information.
  • SSIS users (that means us, the package developers) cannot control what is considered sensitive.

Why does this matter?

Well, SSIS is designed to be secure. One aspect of this is that any “sensitive” information that is part of your package (think about it – many packages include information like passwords in connection strings for connecting to databases and legacy systems where valuable data is stored) needs to be protected. And by protected, we mean never, ever stored in clear text.

That’s right, donkey – never, ever.[3]

But wait, you say – I’ve built many SSIS packages and I’ve never been prompted to encrypt any sensitive anything. How am I supposed to believe you?

Good question. Like most developers[4], the fine members of the SSIS product group wanted to avoid irritating their users.[5] Because of this, they wanted to make protecting sensitive data seamless, even transparent. So they did. Take another look at that screen shot of the Properties window above. Do you notice the EncryptSensitiveWithUserKey option? That’s the default.

This means that, by default, SSIS uses the Data Protection API (DPAPI) to encrypt any sensitive data with the user key of the user who saves the package. Every Windows user has a user key, so this is pretty straightforward. Then, when the user goes to open the package again, the encrypted data is automatically decrypted without the user needing to do anything. Pretty slick, eh?

Yep. Pretty slick. Especially for SSIS developers who live in a world where they never work in teams and never have their packages run in an unattended batch job that uses a service account.

Unfortunately, I don’t live in that world, and neither do many of the SSIS developers I know.

Ok, let’s be fair. That sounded awfully critical, but it wasn’t intended to be. I generally don’t like to criticize unless I can suggest a better alternative, and in this context I don’t see a more reasonable default behavior. It just works, and it lets developers be productive with SSIS without needing to worry about manually configuring security settings.

The criticism that I have is that people just don’t seem to understand this, and that causes problems. If one developer didn’t understand it, then it’s probably his fault. But if dozens or hundreds of people have this same problem (and they do – just take a look at the SSIS forums on MSDN to see how often questions related to the ProtectionLevel property come up) then the communication about the feature hasn’t been what it could be.

So what is the problem, and how does it show up? Well, consider one of these situations:

  • The package developer checks his package into source control[6] and another developer checks it out to make changes. The second developer can open the package, but gets an error saying “Failed to decrypt protected XML node "DTS:Password"” and the package will not execute.
  • The package developer deploys the package to the production server and the DBA schedules it to execute via a SQL Server Agent job. When the job is run, the job step fails with an error saying “Failed to decrypt protected XML node "DTS:Password"” and the package will not execute.

I won’t go on. You probably see the trend already. With the default settings, initial development is easy, but downstream tasks are more difficult. Only the user who saves the package can edit and execute it without error. This is probably not what we want.
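If you want to see the default in action from code rather than BIDS, here’s a hedged sketch (the connection string and file path are invented): the package saves cleanly and reopens silently for the developer who saved it, while anyone else – a teammate, or the SQL Agent service account – hits the decryption error from the bullets above.

```csharp
// Hedged sketch of the default behavior: EncryptSensitiveWithUserKey.
// The connection string and file path are placeholders for illustration.
using Microsoft.SqlServer.Dts.Runtime;

class ProtectionLevelDemo
{
    static void Main()
    {
        Package pkg = new Package();
        pkg.ProtectionLevel = DTSProtectionLevel.EncryptSensitiveWithUserKey; // the default

        ConnectionManager legacy = pkg.Connections.Add("OLEDB");
        legacy.ConnectionString =
            "Data Source=LegacySrv;Initial Catalog=Orders;User ID=etl;Password=s3cret;Provider=SQLNCLI10;";

        Application app = new Application();
        app.SaveToXml(@"C:\Packages\Extract.dtsx", pkg, null);

        // Reloaded by the same Windows user, the password decrypts transparently.
        // Loaded by anyone else, the sensitive value cannot be decrypted and the
        // package will not execute successfully.
        Package reloaded = app.LoadPackage(@"C:\Packages\Extract.dtsx", null);
    }
}
```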

What about the other options? Let’s consider:

  • EncryptAllWithUserKey – This option has all of the drawbacks of the default EncryptSensitiveWithUserKey option, plus the additional drawback of encrypting the whole darned package, so not only can no other user execute the package without re-entering the sensitive data, no other user can even open the package. Probably not what you want.
  • EncryptSensitiveWithPassword – This option relies on the package developer entering a password for the package, and then entering that password every time the package is opened or executed. Um… right. I don’t know about you, but my real-world SSIS projects tend to involve many dozens of packages. And I typically have a dozen or more open in Visual Studio at any given time. So picture me staggering to my desk on a blurry Monday morning, desperately clutching my mug of coffee, and launching Visual Studio. Visual Studio loves me, so it has remembered the packages I had open when I shut down for the weekend, and in a display of digital affection it re-opens them all.[7] Now picture the look on my face as I am prompted to enter the &^%$%@(* package $#^%#& password for every &@%$(^@ package as it opens. (And for that matter, picture the look on the faces of the poor people from accounting down the hall as they run for the fire escape.) And to make matters worse, this option pretty much negates the security it’s designed to provide, because every developer on the team needs to know the package password, and at least one developer[8] will put that password on his white board or on a Post-It Note.
  • EncryptAllWithPassword – Just when you thought it was safe to go back into the Properties window, here comes an option which combines the worst of all worlds. Not only do you have to enter the &%#*& password every time you open the package, the whole package contents are also encrypted so if you forget the password[9] you’re pretty much SOL[10].

So what does that leave us? That leaves us with two options. The first is the one that I use and advocate: the DontSaveSensitive option.

This option, like the others we’ve seen so far, is pretty much self-describing. It protects the sensitive data by simply not saving it in the package file. What’s not to like? No password, no user keys, everyone can open the file without error. It’s a little slice of perfection in our imperfect world!

Maybe.

Maybe not.

The difficulty, of course, comes in when you actually want to execute the package. The package can’t run if it doesn’t have a valid connection string to connect to that legacy database server from which it extracts data, now can it? So where does the sensitive data get stored, if not in the package? The answer is that we, the package developers, need to store that sensitive data in package configurations. This could be an XML config file, a SQL Server database, or the Windows registry – there are many options. And each one of them can be secured in its own way, through database permissions, NTFS permissions or registry ACLs.
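You can wire this up through the Package Configurations dialog in BIDS, or programmatically. Here’s a hedged sketch that sets DontSaveSensitive and points a package at an XML configuration file that will supply the connection string at load time – the paths and names are placeholders, and the .dtsConfig file itself would then be locked down with NTFS permissions:

```csharp
// Hedged sketch: DontSaveSensitive plus an XML configuration file that supplies
// the sensitive connection string at load time. All names and paths are placeholders.
using Microsoft.SqlServer.Dts.Runtime;

class ConfigureTemplate
{
    static void Main()
    {
        Application app = new Application();
        Package pkg = app.LoadPackage(@"C:\SSIS\Templates\ETL_Template.dtsx", null);

        // Never persist the sensitive values in the .dtsx itself...
        pkg.ProtectionLevel = DTSProtectionLevel.DontSaveSensitive;

        // ...and get them from a configuration instead.
        pkg.EnableConfigurations = true;
        Configuration cfg = pkg.Configurations.Add();
        cfg.Name = "StagingConnection";
        cfg.ConfigurationType = DTSConfigurationType.ConfigFile;
        cfg.ConfigurationString = @"C:\SSIS\Config\Staging.dtsConfig";

        app.SaveToXml(@"C:\SSIS\Templates\ETL_Template.dtsx", pkg, null);
    }
}
```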

The problem that some developers have with this is that it doesn’t let them get down to doing what they want to do the most, which is build their packages. And what do we say to those developers?

We say “tough luck!”

We don’t have much sympathy for those developers, because they’re missing the big picture. Yes, adding configurations to your packages can slow things down up front, but it saves a ton of time later on. Configurations are a required feature for any package that will need to be deployed to multiple environments (and unless your packages will only ever be run by you on your machine, this means your packages), and by adding them right at the beginning of package development you will have a much greater chance of a successful deployment than if you wait until deployment time to start thinking about deployment.

So this is what I recommend that SSIS developers do when building their packages:

  • Set the ProtectionLevel property to DontSaveSensitive
  • Add a package configuration for each connection string or other sensitive property in the package.
  • Test everything and make sure that it works.
  • Save the package and use it as a template package moving forward.[11]

Ahhh… life is good. Life is so much simpler when you have prescriptive guidance.

But wait… What about that last option? What about ServerStorage? Should we ever use that one?

ServerStorage can only be used when deploying packages into a SQL Server database[12]. If you deploy to SQL Server, then you can rely on the database to protect your sensitive data. But as you probably know, this doesn’t apply to package development. Visual Studio can only work with DTSX files on the file system – it cannot open packages from or save packages to SQL Server. This means that ServerStorage is a viable option once package development is complete, but not during development. During development you want to use DontSaveSensitive, even if you are planning to deploy to SQL Server.
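For completeness, the deployment step where ServerStorage becomes relevant looks roughly like this – the server name and path are invented. Once the package lives in msdb, protecting its contents is the database’s job:

```csharp
// Hedged sketch: deploying a finished package to SQL Server (msdb) storage,
// where ServerStorage applies. Server name and path are placeholders.
using Microsoft.SqlServer.Dts.Runtime;

class DeployToMsdb
{
    static void Main()
    {
        Application app = new Application();
        Package pkg = app.LoadPackage(@"C:\SSIS\Release\LoadOrders.dtsx", null);

        // Save into msdb on the target server using Windows authentication.
        app.SaveToSqlServer(pkg, null, "PRODSQL01", null, null);
    }
}
```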

So there you have it. I’ve spent five pages[13] saying “use the DontSaveSensitive package ProtectionLevel option when you develop SSIS packages.” But even though it took longer to read, hopefully you’ll have a better idea of why you should choose this option than if I’d just come out and told you…

 

[1] Just as a side note, re-read the subject line. Doesn’t it sound like the name of a children’s story? It was all that I could do to not write this post as an AeSISSop’s Fable or modern cautionary tale[2].

[2] I also started putting it to the tune of the Gilligan’s Island theme before I caught myself. I haven’t been feeling well lately, you see…

[3] Say it in your best Shrek voice and it may make sense. Especially in the fable context.

[4] Read the rest of the sentence and then have a good laugh with me, ok?

[5] Wait, then wipe the tears from your eyes, and then start reading again.

[6] And we all use source control, right?

[7] Awww….

[8] You know who I’m talking about.

[9] And can’t find your way down the hall to Bob’s white board.

[10] That’s “Server Offline” to you.

[11] There will likely be other things that need to be set up in your template package too, but this is a good start.

[12] I personally prefer file system deployments, and so do most of the SSIS professionals I know. Check out this post from Jamie Thomson where he discusses this topic, and make sure you click on the link at the top of his post too.

[13] And if memory serves me, set a new record for the number of footnotes in a single blog post.