Monday, March 30, 2009

Hey, That Was Fun!

As you probably know, I’m a Microsoft Certified Trainer (MCT). As you may not know, MCTs are required to renew their certification each year, and the 2009 renewal period is currently underway. I got my MCT 2009 welcome kit a week or so ago.

And with that welcome kit came a nice surprise – a voucher for one free Microsoft Certified Application Specialist (MCAS) exam.

Now, if you’re like me, you probably like to think of yourself as being a geek. Someone who builds software that has never existed before, and who dominates the poor software that is unlucky enough to cross your path. (Or something like that anyway.) So although I had been aware of the MCAS certification for some time, I had never really considered trying to achieve it. But hey – free is free, right?

So last Friday I scheduled my exam (77-601: Using Microsoft Office Word 2007) for this morning. I was going to be out of the office for a dentist appointment anyway, and the test center is on the way home, so why not, right?

Well, to make a long story short, I passed the exam and had a great time doing it. The exam was delivered in Word 2007 – each question gave me a list of things to accomplish, and I then used Word 2007 to achieve the required goals. It didn’t matter what steps you used (I’m a big keyboard shortcut user and don’t use the Ribbon too often) so long as you achieved the requirements as written.

The thing that struck me here was that this exam really tested my ability to use Word 2007 as a “power user” – not just someone who types documents, but someone who knows what features are there and how to use them. I doubt that I’ll stop taking my technical certification exams any time soon, but I think that I may just need to try my hand at the Excel 2007 and PowerPoint 2007 exams too…

Saturday, March 28, 2009

SSIS Development and Source Code Control

In my personal best practices for SSIS development, I claim that we need to treat SSIS packages like code. When I give presentations on this topic, I have a slide that elaborates and talks about how we must use source code control for our SSIS projects just like we do[1] for our “traditional” VB and C# projects. It’s one of those irrefutable best practices of software development; you wouldn’t drive your car without your seat belt, and you’d never even think of building software without source code control.

In fact, this is so obvious and so important that I assumed I must have blogged about it years ago. So I came here tonight to get the URL so I could reference it somewhere else.

And I couldn’t find it. So here’s the post I was expecting to find,[2] so that the next time I look I will find it here.

The point that I like to bring up in my presentations is that even though SSIS development is software development, there are inherent differences in the tools that can turn into “gotchas” if you’re not careful. In particular, walk through these basic steps just to make sure you’re following me:

  1. Open an SSIS package in Visual Studio – a package that has at least one user variable defined.
  2. Look up at the tab at the top of the package designer. You’ll see that the tab displays the name of the DTSX file followed by the word [Design] in square brackets, in case you weren’t sure you were looking at the designer and not the raw XML.
  3. Ensure that there is no asterisk (*) after the word [Design] – this asterisk shows that the file has been modified and needs to be saved, and for the purposes of this exercise we need it to not show up. If your package does show an asterisk, save it now to make the asterisk go away.
  4. In the Variables window, click on the name of a variable. Don’t change anything, just click.
  5. Then click on the design surface. Again, don’t change anything – just click.
  6. Then look up at the tab. What do you see?
  7. That’s right! An asterisk!! The package has been changed by a non-editing action on your part. Has the world gone mad?!?

Um… probably not. Or at least, not yet.

But let’s assume for the sake of argument that you followed this best practice and have been using source code control. And let’s further assume that, since no one told you any different, you set up your source control plug-in in Visual Studio and left it with the default settings.

Uh-oh.

Let’s take a look at those default settings, shall we?

[Screenshot: “SSIS SCC Evil” – the default Visual Studio source control settings, which silently check files out when editing begins]

That’s right. The default settings are evil.

Why evil? Well, think back to the “not really editing a variable name” scenario from earlier. If this package were under source control, you would have just checked it out without meaning to, and probably without realizing it. And that means that as soon as you walk out the door to go on that two-week vacation[3], you become that guy. That’s right – the guy who gets blamed for things because he’s not there. Because the default source control setting in Visual Studio is to silently check out a file when the developer starts editing it. And while this may make sense for traditional code files, it definitely does not make sense for DTSX files.

And the worst part is, you really are to blame here, at least in part, because you did not resist, you did not fight back against the evil that consumed you and is now ravaging your team’s productivity. You’re kind of like Gollum.[4]

But do not despair – it is not too late to turn back from the path of evil! In fact, you really just need to change one drop-down, like so:

[Screenshot: “SSIS SCC Good” – the same settings with “Prompt for exclusive checkouts” selected]

That’s right – just select “Prompt for exclusive checkouts” and the problem goes away[5]. Any time you perform a step in the SSIS designer that would cause the package to get checked out (we walked through one with the variable name thing, but there are many more), you’ll be prompted to check out the file.

And that’s about it. As usual, I spent five pages to say what could have been done with one screen shot and ten words, but hopefully you’ll understand the context a little better this way.

So always use source code control, but also be certain that you have configured it correctly for SSIS.

[1] We do, don’t we?

[2] Yes, it’s late on a Saturday night, but hopefully you too will see that this is more than a little like time travel. Or is it just me?

[3] I did mention that you are going on vacation, right? Go on – you’ve earned it!

[4] Did I mention how much I loved you in those movies?

[5] Yes, this means that Sam and Frodo’s budding romance will probably never get a chance to bloom, but that is a price I’m willing to pay for you to enjoy your vacation.

Thursday, March 26, 2009

Going to Tech·Ed 2009 North America!

I’ve been to Tech·Ed North America every year for the last four years. It’s Microsoft’s biggest technical conference and is an amazing opportunity to learn and to network – and it’s always a ton of fun as well. People have been asking me a lot lately, “Are you going to be in Los Angeles?” and I haven’t been able to answer them.

Until today.

[Image: Tech·Ed 2009 email signature graphic]

I learned this week that I will indeed be going to Tech·Ed. I’ll be in Los Angeles the whole week, from May 11 through May 16, helping to represent Microsoft Learning at the conference. I’ll have more details to share about what I’ll be doing as the conference approaches, but for now I just think that I’ll be staying away from places where I see this sign:

[Photo: Tech·Ed signage inviting MCTs to “hang” here]

Hang here? Yikes! ;-)

Seriously, interacting with Microsoft Certified Trainers is probably going to be a big part of my role at the conference. I just wonder how “hang” ever got through the layers of review.

So I’ll be there – will you?

Tuesday, March 24, 2009

Get on the Bus!

if (you.Location.Match("Atlanta",
    "Charlotte", "Philadelphia", "Columbus",
    "Indianapolis", "St. Louis", "Kansas City",
    "Denver", "Albuquerque", "Phoenix",
    "Las Vegas", "Los Angeles") ==
        MatchResults.CloseEnough)
{
    you.EnterContest("Get on the Bus");
}

if (you.WinContest("Get on the Bus"))
{
    you.AttendConference(Microsoft.Conference.TechEd,
        Cost.Free);
}

Sound like a good deal? Learn more and enter here: http://www.microsoft.com/learning/mcp/offers/getonthebus/default.mspx

[Image: “Get on the Bus” contest banner]

(I just wish that I were eligible…)

Friday, March 13, 2009

Indexes and SSIS Data Flow Performance

Wow – it’s been a while, hasn’t it?

As you may know, I took a job with Microsoft back in October. It’s been a fun (and insanely busy) five months; the work is challenging and diverse, the team I work with is talented and motivated, and there is always something new and interesting to learn and/or do. What’s not to like?

In fact, even though BI isn’t a core component of my job role, I have been working with SSIS and the rest of the SQL Server 2008 BI tools to get easier access to some of the data I need to make better decisions and to support the recommendations I make with some “real numbers.”[1] It’s an easy way to add some value to existing data, and for me it’s a fun way to keep my BI skills from getting too rusty.

So… I re-learned an interesting lesson the other day: Don’t load data into indexed tables.

One of the SSIS packages I’ve built extracts data from a set of text files and loads it into a set of SQL Server tables.[2] There is around 20 GB of text file data that ends up in one particular table. When I ran this package Monday evening, I found it was still running Tuesday morning, which I had not expected. I looked around in SSMS and noticed that there was one non-clustered index on this table; I had added it recently and had not dropped or disabled it before running the package. So when the package finished running, I took a look at the execution results and found this:

DTExec: The package execution returned DTSER_SUCCESS (0).
Started:  6:03:50 PM
Finished: 9:54:20 AM
Elapsed:  57028.7 seconds

That’s right – the package took nearly 16 hours to run. Ugh!

So I re-ran the package this morning, after having ensured that all indexes on this table had been dropped. And the result?

DTExec: The package execution returned DTSER_SUCCESS (0).
Started:  9:49:21 AM
Finished: 10:41:15 AM
Elapsed:  3114.05 seconds

That’s right – the package completed in under an hour (about 52 minutes), taking roughly 5.5% of the time it took to execute the package against the indexed table.

But wait, you say – that’s not a fair comparison. You still need to rebuild that index!

You’re right, of course – the time to rebuild the index needs to be considered before we can make any apples-to-apples comparison. So what I did was to re-create all of the indexes on this table (four indexes, each with included columns, for a total index size of 15.8 GB), which took an hour and 39 minutes – another 5,940 seconds. Even though this is longer than the package execution itself, the total end-to-end execution time is still only around 15.9% of the package execution time with one index on the target table. In the end we achieved an 84% reduction in execution time.
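If you want to double-check my math, here’s a quick back-of-the-envelope calculation. The numbers are copied from the log output above, and the T-SQL is just serving as a convenient calculator:

-- Sanity-checking the comparison; 1 hour 39 minutes = 5,940 seconds
DECLARE @WithIndex    decimal(10, 2) = 57028.70; -- load with one index in place
DECLARE @WithoutIndex decimal(10, 2) = 3114.05;  -- load with all indexes dropped
DECLARE @Rebuild      decimal(10, 2) = 5940.00;  -- re-creating the four indexes

SELECT
    CAST(100 * @WithoutIndex / @WithIndex AS decimal(5, 1))
        AS LoadOnlyPercent,   -- roughly 5.5
    CAST(100 * (@WithoutIndex + @Rebuild) / @WithIndex AS decimal(5, 1))
        AS EndToEndPercent;   -- roughly 15.9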

So what’s the lesson?

Drop or disable your indexes before loading data into your tables, and then re-create or rebuild them once you’re done. If you don’t, then SQL Server will incrementally maintain each index for every row your package inserts, thus killing your performance.
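In T-SQL terms, the pattern looks something like this. This is just a minimal sketch – the table name dbo.LoadTarget and the index name IX_LoadTarget_Example are hypothetical placeholders – and in an SSIS package these statements would typically live in Execute SQL Tasks before and after the data flow:

-- Before the load: disable the non-clustered indexes on the (hypothetical)
-- target table. Don't disable the clustered index – that would make the
-- table inaccessible – so disable non-clustered indexes by name rather
-- than using ALTER INDEX ALL.
ALTER INDEX IX_LoadTarget_Example ON dbo.LoadTarget DISABLE;

-- ...run the data flow to load the table...

-- After the load: rebuild the index, which also re-enables it.
ALTER INDEX IX_LoadTarget_Example ON dbo.LoadTarget REBUILD;

One nice thing about disabling and rebuilding, as opposed to dropping and re-creating, is that the index definitions stay in the database, so your package doesn’t need to know how to re-create them.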

This isn’t new advice – this has been a best practice for bulk loading data into SQL Server for as long as I can remember. But it is something that I have not blogged on before, and since failing to follow this advice bit me this week I thought I would take the time to share my pain with the world in hopes that someone else can avoid it.

And, of course, it’s good to blog on SSIS again. Hopefully I’ll be posting again before another few months have passed…

[1] Please note that I’m not calling this a “BI solution” here. At this point I am more in a discovery phase than anything else. The components I’ve developed, while functional, will probably get thrown away once they have served their purpose as prototypes and learning tools. All of the performance numbers come from running the packages on a laptop, with both the target database’s data file and the source text files on the same external 5,400 RPM USB hard drive. No, I wouldn’t do this in production either. ;-)

[2] The astute reader will note that this is in flagrant violation of the “each package should only load data into one table” best practice that I talk about whenever I have someone who will listen. The scenario in which I’m working here provides an interesting exception to this guideline, and please trust that I did think hard before going against my own advice.