BI Polar: November 2007

Friday, November 30, 2007

The Pain of Living on the Edge

If you're like me, you tend to listen to Manowar really loudly while you write code and built ETL solutions.

No, wait, that's another post. Please let me begin again.

If you're like me, you tend to stay on the cutting edge, upgrading to the latest and greatest development tools as soon as they're available.

Yes, that's the right post.

The great things about this tendency are that you get new features and functionality early, you are constantly learning new things, and when the products are actually released you already have months experience using them, so not only do you have a competitive edge, you can also help others who are less adventurous learn things too. All around it's a real win-win scenario.

But it is not without its pain.

Visual Studio 2008 has recently been released, with important (and cool) new capabilities for people developing Windows and web applications using VB.NET, C# and other "traditional" development tasks. But what about us Business Intelligence developers? Well, do you remember that song from the opening of "Four Weddings and a Funeral?"[1] Well, we've been left out in the cold, like Hugh Grant. The current CTP5 release of SQL Server 2008 still relies on Visual Studio 2005, so if you want to get the goodness of Visual Studio 2008 when building SSIS packages, you're out of luck.

...but like with Hugh Grant, there is a happy ending in sight. Michael Entin, a member of the SSIS product team, posted this information today on the SSIS Forums on MSDN:

1) SSIS 2005 requires VS 2005, and will not work with VS 2008.
2) SSIS 2008 RTM will work with VS 2008 (and only with VS 2008 - there will be no option to use it from VS 2005).
3) It is not final yet when this will be implemented, most likely the next CTP will use VS 2005, and there will be a refresh that provides VS 2008 support.

Michael was careful to point out these these plans are tentative and subject to change, but it's still exciting to know that there is a plan and a roadmap for getting the tools we want hooked up with the BI projects we need and love. Now let's just hope that Visual Studio doesn't marry a rich old Scottish gentleman before then...

[1] In case you missed that one, it was "But Not For Me" by Elton John and went "they're singing songs of love, but not for me..."

Thursday, November 29, 2007

SSIS Performance Tuning Strategy

Similar to my post on SSIS Performance Tuning Techniques, this resource is neither mine nor new, but it looks like a lot of people don't know that it exists. Check out this "SQL Server 2005 Integration Services: A Strategy for Performance" white paper: http://www.microsoft.com/technet/prodtechnol/sql/2005/technologies/ssisperfstrat.mspx#E6G

According to the abstract, "This white paper covers several interesting and unique methods for managing metadata using built-in features in SQL Server 2005 Integration Services, Analysis Services, and Reporting Services. The paper explains such concepts as data lineage, business and technical metadata, auditing, and impact analysis. " This is really weird to me, because there's nothing about SSAS or SSRS in here at all, especially since there's also nothing about data lineage or metadata management. Copy and paste gone wrong, perhaps?

To me the most interesting thing here is the "OVAL" (Operations , Volume, Application, Location) approach for evaluating and tuning SSIS performance. Lots of people on the MSDN SSIS Forums have been asking lately about SSIS performance, and either no one is bothering to look, or else this white paper is difficult to find.

And for people too lazy (or should that be "visually oriented") to read through the whole thing, Donald Farmer (then the SSIS Program Manager) did an excellent presentation at TechEd last year about this OVAL approach, and you can view it on demand here: http://msevents.microsoft.com/cui/WebCastEventDetails.aspx?culture=en-US&EventID=1032298087&CountryCode=US

But remember: Performance tuning is a complex art that is a superset of all of the other things that go into building software on any platform, and SSIS is no exception. In order to do it right, you need to understand how the platform works and how your software interacts with it, and have a methodical approach for identifying and refining the scope of performance problems. Although it's wonderful to have an approach like OVAL to use, without an understanding of how SSIS works under the hood, it's not going to do you much good...

Wednesday, November 28, 2007

Calling All Trainers!

If you're a Microsoft Certified Trainer, this is a very exciting time. With new versions of SQL Server, Visual Studio and Windows Server right around the corner (or in the case of Visual Studio 2008, upon us already) you are faced with a massive influx of new tools, technologies and (hopefully) courses to teach. By the time these products ship, your students will expect you to be the expert, and there's never enough time to prepare...

Fortunately, Microsoft understands.[1] And to help out, they're putting on a series of MCT Summit conferences designed to help prepare MCTs with both hard (technical) and soft (presentation) skills they'll need to prepare for the rough waters ahead. There is an MCT Summit in Berlin, Germany on January 14 to January 16, and another in Redmond, Washington on February 4 to February 6[2] so regardless of which side of the ocean you're on, Microsoft has you covered.

But how does this relate to this blog? Funny you should ask!

I will be presenting three different SQL Server-focused sessions at each of the Berlin and Redmond Summit events. Here are the details:

What's New in SQL Server 2008 for Database Developers
This technical session presents information on the key new features for database developers in the SQL Server 2008 product release. Emphasis is placed on the value proposition of the new features, how the features enhance and extend the functionality in SQL Server 2005, and how to communicate the importance of the features to your students. Where possible, the new features are also discussed in the context of the courseware in which they are presented.

What's New in SQL Server 2008 for Business Intelligence Developers
This technical session presents information on the key new features for business intelligence (BI) developers in the SQL Server 2008 product release. Emphasis is placed on the value proposition of the new features, how the features enhance and extend the functionality in SQL Server 2005, and how to communicate the importance of the features to your students. Where possible, the new features are also discussed in the context of the courseware in which they are presented.

So you want to be a SQL Server MCT? Jumping into the Deep End with SQL Server 2008
This session presents information for MCTs who do not currently teach SQL Server courses with the information that they need to be successful delivering SQL Server 2008 courses. The first portion of this session covers a technical introduction to the SQL Server product suite with an overview of core products and features. The second portion covers existing training materials and presents a proposed plan of self-study to help attendees build the skills they need to succeed in the classroom with the SQL Server 2008 courseware.

The Summit events look like they're going to be a lot of fun, and if you're planning on teaching the SQL Server 2008 courseware when it's ready, this is a great way to get up to speed on the new tools, technologies and features heading our way. I'll see you there!

[1] Well, sort of.
[2] There are also summits in Sydney, Australia and Singapore, but they didn't fit into my schedule. :-)

Speaking at SQL Connections 2008

I just received news that my abstract for the session "SQL Server Integration Services Deployment Best Practices" was selected for the spring 2008 SQL Connections conference. As you can imagine, I'm awfully excited! Here's the full abstract:

"So you’ve built a set of SSIS packages to populate your data warehouse, and everything works great on your development environment – but what next? SSIS has a great set of tools, but when it comes to preparing for and executing a seamless deployment from development into test and finally into production, it’s not always obvious what tools to use or how to best use them. In this session you’ll learn how to prepare for and execute a seamless deployment of complex ETL solutions through a repeatable, testable process. If you’re tired of saying “but it worked on my machine!” then this is one session you can’t afford to miss."

I submitted a handful of different abstracts, but this is the one that got selected, and there's probably no surprise there. Many of the questions posted on the SSIS MSDN Forums are related to deployment, and when people complain about SSIS (like that would ever happen, right?) deployment almost always comes up. Hopefully before too long the SSIS team will have a set of powerful deployment tools that take away the pain of deployment, so people would look at an abstract like this and say "why would anyone ever attend a session on this?" But from what I see today it's not happening in SQL Server 2008, so you'll probably want to attend this one.

I'll see you in Orlando!

Tuesday, November 27, 2007

Let Your Voice Be Heard!

The SQL Server Integration Services Team is looking for your feedback on how to prioritize their work on future versions of SSIS. How cool is that? It's not quite as cool as being invited to an SDR (yeah, this is kind of like saying Hammerfall isn't quite as cool as Manowar - they're somewhat similar, but orders of magnitude different in how cool they are) but you still get to share your opinions with the people who build the tools you love.[1] Or not. Either way, it's cool.

And to make it even cooler, they're giving away five 80GB Zune MP3 players to people who complete their survey by December 20th.

Here's the survey: https://mscuillume.smdisp.net/Collector/Survey.ashx?Name=SqlETLSurvey2

How important is this? Consider the fact that I'm posting a link to it on my blog, even though each person who enters is reducing my chances of winning that Zune. So let your voice be heard!

[1] For all SSIS team members reading this post, yes, I would love to come to your next SDR. You don't even need to give me a Zune. Although I'd take one. ;-)

Tuesday, November 20, 2007

SQL Down Under Show Online

Remember back on Halloween when I recorded an episode of SQL Down Under with SQL Server MVP Greg Low? Well, the episode is now online! You can download it in MP3 or WMA format here: http://www.sqldownunder.com/

Enjoy!

Monday, November 19, 2007

Things are Looking Up in CTP5

One of the things I've really been looking forward to with SSIS in SQL Server 2008 CTP5 is the enhanced Lookup transform. And related to that was looking forward to blogging about it.

But of course, I'm a day late and a dollar short. Or as the case may be, three days late and a few bucks ahead. Because Jamie Thompson has already blogged about it, and I don't see anything for me to add:

http://blogs.conchango.com/jamiethomson/archive/2007/11/16/Katmai_3A00_-SSIS_3A00_-Lookup-component-gets-a-makeover.aspx

So now I don't need to blog it, I can just link to Jamie's post, and get back to work. Thanks, Jamie!

Katmai CTP5 - Public Download Available!

Microsoft has just posted the SQL Server 2008 CTP5 downloads to the public Download Center:

http://www.microsoft.com/downloads/details.aspx?FamilyId=3BF4C5CA-B905-4EBC-8901-1D4C1D1DA884&displaylang=en

Please note that there are both installers and ISOs for x86, x64 and IA64 as well as an Express Edition download. Very cool stuff - Go get it!

Line Numbers, Where Art Thou?

So as mentioned in my last post, I'm delighted to be able to use C# in the Script Task and Script Component in SSIS for Katmai CTP5. I was a bit less that delighted, however, to not have any line numbers in the the VSTA editor. And to make matters worse, there were no options to turn on line numbers in the Tools Options dialog box!

Yes, I become alarmed very easily.

Then I noticed the "Show all settings" check box, and checked it. This, of course revealed the option I was looking for:

And this, in turn, allowed me to make my code window look like I need it to look:

I know this isn't high-tech or difficult, but it took me a minute or five to figure out what was going on, so I figured I might not be alone in my distress...

Katmai Scripting: A Step in the Right Direction

In a recent thread on the SSIS forums on MSDN, a poster was bemoaning the number of steps required to open the VSA script editor for the Script Task in SSIS. And I was sharing his pain.

Fortunately, there is a little improvement on the horizon, but just a little at this point. In CTP5 of SQL Server 2008 the "Edit Script" button (note that it has been renamed from "Design Script" but that the keyboard shortcut key remains "S") is now on the first tab of the editor dialog box, so one fewer steps is required to open VSTA.

Also note that C# is the default language in the Editor dialog box. Oh, how I look forward to being able to use C# for my SSIS development needs...

But we still need that "Edit Script" context menu item, don't we?

BIDS Helper 1.2 Released

BIDS Helper is a CodePlex project that delivers a set of Visual Studio add-ins for developers working on BI projects. It's been around for a while, but until now has not had any SSIS-focused features.

Until now!

The BIDS Helper team has just released version 1.2 of their tool, and it now includes (in addition to a bunch of SSAS-related stuff as well) these new SSIS features:

Create Fixed Width Columns - Simplifies the definition of a fixed width flat file connection manager
dtsConfig File Formatter - As far as I can tell, this is just another shortcut for the ctrl-k, ctrl-d "prettify" keyboard chord that is built into Visual Studio
Expression and Configuration Highlighter - Color-codes tasks and connection managers that have properties set by configurations or expressions, making your packages more self-documenting.
Expression List - Adds a new "Expression List" window to Visual Studio's "Other Windows" menu; this window displays all expressions defined within the package.
Smart Diff - Provides a pre-processor for the built-in VSS or TFS differencing tools, to make it easier to diff two DTSX files' XML content.

All in all, this looks like a major step forward in the SSIS developer's toolbox. The Expression and Configuration Highlighter and Smart Diff features in particular provide functionality I have often wanted, and even if they're not exactly what I would have liked, they're still better than what we had before. Check it out!

Saturday, November 17, 2007

Installer++

So SQL Server 2008 CTP5 is hitting the public distribution channels, and is hopefully coming soon to a virtual machine near you. I've got my Katmai VM rebuilt with CTP5 already, and with a little luck (including my getting over this darned cold) we'll have a bunch of new Katmai SSIS posts here in the days ahead.

But before we can get to SSIS we have to mention the cool new installer. I remember being delighted with the installer in SQL Server 2005 (although admittedly I was comparing it to the BizTalk 2004 installer, and just about anything would be a huge advance over that monster) but the new SQL Server 2008 installer is a big improvement. Check it out:

Bah! I was planning on having a lot more text to go with the flood of images, but I can't seem to make the Blogger editor do what I want it to do, and I'm sick, and it's late. So for now, please enjoy the pretty pictures and we'll see what we can do with other posts tomorrow...

Thursday, November 15, 2007

SQL Server 2008 CTP5 - I'm Ready - Are You?

It looks like the next CTP release of SQL Server 2008 (CTP5) will be available for download tomorrow:

http://weblogs.sqlteam.com/billg/archive/2007/11/15/SQL-Server-2008-CTP5-on-Friday.aspx

This release is due to have significant improvements over CTP4, both in general and in the context of SSIS, so make sure you get your sandbox environment ready today.

Guess what I'll be doing this weekend?

Wednesday, November 14, 2007

Precompile, Recompile, Decompile

As I mentioned in a post yesterday (or the day before... it's been a long week already) I've been spending a lot of time lately in the Script Task in SSIS. Well, I have the task logic implemented and tested, so now it's time to copy that task into all of the packages that need it.[1] So I now have 20 or so packages with the same script in their OnPreExecute event handler, and everything runs great.

Until you look at the package output. During testing I generally pipe DTEXEC's standard output to a text file and use that as my "to do list", and with these updated packages I have a 1500+ line file filled with this warning, over and over:

Warning: 2007-11-14 13:53:07.67
Code: 0x00000003
Source: Script Task Name Here
Description: Precompiled script failed to load. Attempting to recompile.
For more information, see the Microsoft Knowledge Base article, KB931846 (http://go.microsoft.com/fwlink/?LinkId=81885).
End Warning

Not a bad warning, all things considered, and it even has a URL - cool!

Of course, the information in the KB article in question doesn't apply to my scenario, as I'm on a 32-bit machine and am fully patched with SP2 + hotfixes. Hmmm... Time to do some searching...

Well, I'm not alone: http://forums.microsoft.com/TechNet/ShowPost.aspx?PostID=2030858&SiteID=17

And people have been having this problem with Script tasks in packages created from template packages: http://bobp1339.blogspot.com/2007_09_01_archive.html

A-ha!

I copied and pasted each of these script tasks from the package in which I performed the initial development and testing. Not quite a template package in this case, but it's close. So...

For Each Package in AffectedProjects.Packages
Package.Open
Package.EventHandlers.Select
Right-click(ScriptTask)
KeyPress(e)
Ctrl-Tab
Alt-S
Wait(5)
Alt-F4
Ctrl-S
Ctrl-F4
Next

The good news: it works! The warnings are gone!

The bad news: that was MScript - the Matthew Script Language, which performs horribly and does not scale, because it relies on my poor typing skills.

Maybe some time I'll find time to figure out how to automate this in the SSIS .NET API and add it to PacMan. Maybe some time I'll find time to sleep too...

[1] Yes, I know about how custom SSIS components have a much better reuse story, but for now it's more important for me to keep my deployment as simple as possible, but thank you for caring.

Monday, November 12, 2007

Three Lock Box... One Lock Variable...

I've been working in the SSIS Script Task for the past few days, and have been both loving and hating it. I've been loving it because it's been so long since I've written "real" .NET code and hating it because it's VB.NET and not C#. Yes, and also hating it because the developer and debugger experience for writing SSIS code is less than ideal. That too.

Anyway, I've been making great progress when I got this delightful error:

Error: 0xC001405C at <>: A deadlock was detected while trying to lock variables "<>" for read/write access. A lock cannot be acquired after 16 attempts. The locks timed out.

WTF, I asked myself.[1] I haven't changed any code since the last time (the last 10 times) I ran this package successfully. What on earth could have changed?

So I spent a little time digging around online and found lots of references to installing SQL Server 2005 SP1 and post-SP1 hotfixes, but since I'm running SP2 plus hotfixes already this didn't seem particularly relevant. So what had changed?

Well, since it wasn't the code, I thought, it was probably the inputs to that code. The Script Task code in question was reading in name/value pairs from an external file[2] and updating the variable that matches the name with the value from the file. So what had I changed that caused this error?

I had added the same variable more than once to the file.

Apparently if you use VariableDispenser.LockOneForWrite more than once in the same Script Task for the same variable, SSIS doesn't handle it particularly well. I haven't really had time to explore the full boundaries of the problem behavior (and probably won't for the next week, based on what my project schedule looks like) but I wanted to take the time to post this online in case someone else ran into the same thing.

And now, back to my regularly scheduled overtime...

[1] It's true. I'm working on integrating Windows Teamware Foundation into my data warehouse.
[2] It's a little more involved than this, but this is the important part. And yes, I know all about SSIS Package Configurations, thank you.

Developer's Crossing

Miller's Crossing is one of my favorite movies, and is in my opinion the brightest gem in the Coen brothers' crown, despite the fact that few people know about it when compared to Fargo or Raising Arizona and the rest. If you've never seen it, and you like complex plots, flawed anti-heroes and a bit of odd humor, and don't mind a little violence (ok, a lot of violence) here and there, you owe it to yourself to watch it.

But what does this have to do with SQL Server and BI?

Complexity, that's what.

Consider this quote from Tom Reagan (played by Gabriel Byrne)

"Nobody knows anybody. Not that well."

Sound familiar? When was the last time you felt like you really knew a lot (or knew "everything" for the really smart readers out there) about SQL Server? I remember when I decided to "specialize in SQL Server." Then I decided to "specialize in SQL Server development." And now I'm on the cusp of specializing in SQL Server BI (which actually seems like an increase in scope, but perhaps that's just me) because the breadth of the product has gotten so large that one brain (ok, my brain) cannot encompass it all.

This all struck home today when someone I know (I won't name any names, but he doesn't have a blog) asked me a question that I thought fell very well into his personal specialization. As this is a person whose technical knowledge I respect greatly, it got me thinking, and in the process of working with him to find the solution, this quote filtered up from the depths of my brain.

And that's it. So as the USB hub said to the thumb drive... "Take your flunky and dongle."

You fought with my father in the BI Wars?

(I totally stole the whole "BI Wars" idea from Ted Malone, but since he doesn't have a blog, I don't feel at all bad about it.)

"IBM to Acquire Cognos to Accelerate Information on Demand Business Initiative"

That's the title of the press release from IBM. (http://www-03.ibm.com/press/us/en/pressrelease/22572.wss) Following Microsoft's acquisition of ProClarity (now very old news, I know), Oracle's purchase of Hyperion and SAP's offer for Business Objects, this is just one more loose end being tied up (or snatched up, as the case may be) in the BI software landscape.

So what does this mean to those of us who make our living working with the Microsoft BI toolset? Well, I certainly don't claim to know (my track record for predicting industry trends is less than stellar) but I sure hope that it means Microsoft will be putting even more emphasis on BI in the months and years to come. If other database vendors start offering comprehensive and integrated BI offerings like Microsoft already does today, then Microsoft is likely to lose core differentiators that make the SQL Server BI stack unique.

And we don't want that to happen, now do we?

Thursday, November 8, 2007

Widening the Divide?

Microsoft has recently announced that it is splitting its massive TechEd US conference into two week-long events, starting in 2008. One week will be focused on Developers, the other focused on IT Professionals. In some ways this is probably going to be a good thing, because the event will be able to provide more content to more attendees without getting even more insanely large than it already is.

But in other ways, this worries me. And I'm not simply worrying about people (like myself) who fall into both categories and who (probably like myself) will not be able to take off two full weeks to attend both conferences. Although that is a concern.

The real worry comes down to the cultural divide that separates software developers from the people who support the software that the developers build. These two groups of people speak different languages, have different priorities and have very little context in common, for all that we are all "computer people." As they say, developers are from Neptune, and IT pros are from Uranus.

So why does this worry me?

Well, whenever people (especially people from different factions) get together and meet face-to-face, and talk without the artificial barriers that so often divide them, they tend to "get it" and to get along. TechEd has always been a great place for me to meet up with my developer friends from around the world AND to meet up with my IT pro friends from around the world as well.

What will we (that's the collective "we" in case that doesn't come across clearly in text) do? How will these two interdependent but distinct communities get everything they need from each other without the "United Nations" function that TechEd has traditionally served?

Of course, there's no answer to this question today. Only time will tell...

You Can't Get There From Here

Yes Yukon![1]

I've had several people ask me about this offline in the last few weeks: Why can't I connect to SQL server 2008 from Visual Studio 2008? My answer, of course, has been "because it doesn't work in the current builds." It doesn't, but I haven't been in a position where I've needed to make it work, so I've been happy connecting to SQL 2008 from Visual Studio 2005.

My friend and colleague Ted Malone[2], however did need to figure this out, and was kind enough to share with me what he found: A pretty table!

So even if there isn't as much green as we might want, it's still good to know exactly where we can get from where. Now when are we going to see that next SQL 2008 CTP, eh?

[1] Say it phonetically. It's funny to me, but then again, I have had an awfully long week.

[2] Ted has a blog? Who knew?

PerformancePoint Server is Live!

Microsoft this morning announced that its performance management server product, PerformancePoint Server 2007 is now generally available. You can download evaluation copies from the Microsoft Download Center (32 bit and 64 bit versions are available) and can get the real deal from the MSDN and TechNet subscriber downloads.

Bad Math Day

I've always been fond of the quote "the lottery is a tax on people who are bad at math." It's sort of like "There are 10 types of people in the world - those who understand binary and those who do not," but without the geeky overtones.

But The Register today has a news story that really drives home someone having a bad math day: http://www.theregister.co.uk/2007/11/08/scratchcard_anarchy/

My favorite quote from the story:

"I phoned Camelot and they fobbed me off with some story that -6 is higher, not lower, than -8, but I'm not having it."

Right...

Wednesday, November 7, 2007

SSIS Myths - Phil May Have Beaten Me To It

Last week I posted a "call for myths" about SSIS. My schedule has been insane since then and I've found little time to follow up on this myself, but today I stumbled across a delightful source of misinformation about SSIS, on Oren Eini's blog: http://ayende.com/Blog/archive/2007/07/15/SSIS-15-Faults.aspx.

Hooray, I thought, here is a pre-built source of "myths" about SSIS that I can use instead of needing to build the entire list myself.

But sadly (for me, not for the SSIS community), Phil Brammer (SQL Server MVP and vocal member of the SSIS community) has already responded to Oren's attack:

http://www.ssistalk.com/2007/07/27/ssis-15-faults-rebuttal/#more-60

I may still use some of Oren's points later on, but for now, you should check it out and see what Phil has to say in response.

Sunday, November 4, 2007

Viewing Data Flow Metadata

This is a pretty simple tip, but I’ve seen a few different questions on the MSDN SSIS Forums lately where this technique was part of the solution, so I’ll post it despite its simplicity.

How do you view the buffer metadata for a given path in your data flow? How do you tell which columns are flowing through the logical data flow pipeline, and what data types and sizes each one uses?

The answer is: very easily!

Simply right-click on any data flow path arrow and select “Edit” from the context menu. Then select the Metadata tab and you will be presented with information that looks something like this:

Please note that you cannot edit the metadata here, as the buffer is defined based on the output properties of the component at the “root” of the current execution tree, but often simply being able to view the metadata is exactly what you need in order to understand the cause of a problem.

Of course, it’s important to understand just what you’re looking at when you view this metadata. What you’re seeing is only those columns in the data flow buffer that are “in scope” for the given data flow path arrow. If you view the metadata for the path immediately before a Derived Column transform, for example, you will not see the columns that it adds to the data flow, but if you view the metadata for the path immediately after, you will. While this may seem obvious and logical, it can also be misleading. The actual memory buffer that the data flow runtime creates for that execution tree contains all columns that exist at any point within the tree. This means that the columns added by the Derived Column transform physically exist in the buffer before the transform, but because they are not available to components upstream from the Derived Column transform they do not appear in the metadata viewer.

In any event, hopefully this tip is helpful and will make your SSIS troubleshooting efforts less painful. Enjoy!

Data Source Annotations

This post is something of a follow-up on the “Annotate like you mean it” best practice I mentioned in my best practices post last week. I always figured that the technique was something of a no-brainer, but when I mentioned it to Greg Low in our SQL Down Under conversation last week he seemed a little surprised (in a good sort of way) and if he hadn’t thought of it[1] odds are there are a few people out in SSIS-land who might not have thought of it either.

In my last post I talked about updating an SSIS package to use Common Table Expressions in the source query to replace a complex set of asynchronous transforms in the package’s data flow. The upside of this change was that the package performed better by several orders of magnitude. The downside is that the source query became more complex.

Now this isn’t inherently a bad thing, but it is undesirable in that it makes the package less self-documenting. In addition, because I tend to base my source queries on expressions[2], it is not always a simple task to figure out exactly what query is being executed. Because of this, and because self-description is an admirable goal for any development artifact, I will generally update my data flow with an annotation on the design surface that shows the query next to the data source component. In this scenario, the data flow ended up looking something like this:

As you can see, it is now much easier to understand the data being extracted than it would be to examine the data source component to find out on which variable its query was based, and then examine the variable to understand the expression on which it, in turn, was based.

The obvious drawback of this approach is that as soon as the annotation is created, it is instantly out of date. There is no way to create a “dynamic annotation” in SSIS[3], so there is no enforced relationship between the annotation and the data source component that it describes. This then puts something of a maintenance demand on the package developers, as they must remember to update the annotation each time the source query is updated.

I personally think that this is an acceptable tradeoff. The maintenance cost itself is low, and the benefits of having a well-documented package are considerable. Also, if you follow the best practice of “Get your metadata right first, not later” then you will likely not need to update the source query (or the annotation) very often.

[1] I have a great deal of respect for Greg’s knowledge and experience, so if there’s something that is new to him, it’s worth blogging about.
[2] Which will likely be a post of its own before too long.
[3] Although having annotations based on expressions would be a cool new feature, wouldn’t it?

I Want My CTE!

In my recent Best Practices post I mentioned that it was a best practice to “do it in the data source.” By this, I meant that when using SSIS to extract data from a relational database you can often get much better performance by completing tasks like sorting, filtering, grouping and aggregating in the source query as opposed to completing these tasks in the SSIS data flow. Now don’t get me wrong – the SSIS data flow performs exceptionally well, but it is not always the right tool for every job.

Here’s a specific case in point. I was recently helping someone troubleshoot a poorly performing SSIS package. The developer in question was pulling a variety of date values (different dates and times that described a business entity) from a single table in the source system and loading them into a time dimension in a data mart. As part of the data flow he was taking the seven different source columns and putting them together into a single column like this:

1) He used the Multicast transform to split the one data flow path into seven data flow paths, one for each date column.
2) He created a set of “DateValue” columns with the Derived Column transform, with each one based on one of the date columns from the source table, and replacing NULL dates with a placeholder date value.
3) He used the Sort transform to remove duplicates from each of the derived columns.
4) He used the Union All transform to bring the seven derived columns together into a single “DateValue” column.
5) He used another Sort transform to remove duplicates from the single data column.
6) He loaded the discrete dates into his dimension table.

It looked something like this:

And it performed about as well as you could imagine.

As you can probably imagine, he was a little frustrated, and was saying all sorts of bad things about SSIS which I won’t repeat here and which, for the most part, weren’t true.
And as you can probably also imagine, my solution for this problem was to perform the required logic not in the data flow, but in the source query. In fact, I said “this looks like a job for… Common Table Expressions!” That’s right – since the source system was using a SQL Server 2005 database, we could quickly and easily perform all of this logic in the SELECT statement used by the data flow, and eliminate 90% of the work being done in the data flow. The source query ended up looking something like this:

WITH FilteredDatesCTE
AS
(
SELECT TimeRaised
,TimeOfFirstEvent
,TimeOfLastEvent
,TimeResolved
,LastTimeStateWasModified
,TimeAdded
,LastModified
FROM SourceTable
WHERE LastModified >= '1/1/2007'
),
UnionedDatesCTE
AS
(
SELECT TimeRaised AS DateTimeString FROM FilteredDatesCTE UNION
SELECT TimeOfFirstEvent FROM FilteredDatesCTE UNION
SELECT TimeOfLastEvent FROM FilteredDatesCTE UNION
SELECT TimeResolved FROM FilteredDatesCTE UNION
SELECT LastTimeStateWasModified FROM FilteredDatesCTE UNION
SELECT TimeAdded FROM FilteredDatesCTE UNION
SELECT LastModified FROM FilteredDatesCTE
)
SELECT DISTINCT ISNULL(DateTimeString, '1/1/9999') AS DateTimeString
FROM UnionedDatesCTE;

Of course, if we were using a different database platform for the source system we could have used derived tables instead of CTEs, but the CTE syntax is cleaner and more elegant, and certainly does the trick.

The updated data flow was much simpler – none of the transforms shown above were required anymore – and it performed superbly, completing in minutes where the original data flow with all of its asynchronous transforms took over a day to complete with the same production data volume.

And as you’ve doubtless noticed as well, this example also ties in nicely with the “Don’t use asynchronous transforms” best practice.

So what’s the moral of the story? Yes, this time I actually do have one! The moral is that in order to be an effective SSIS developer, you also need to understand the capabilities of the platforms on which your packages are based. In this case (and in many, many cases) this means understanding T-SQL or some other SQL dialect. If you don’t have a thorough understanding of SQL, you’re going to be at a significant disadvantage when using SSIS. Part of it comes down to choosing the right tool for each job, part of it comes down to understanding the strengths and weaknesses of each tool.

Saturday, November 3, 2007

SSIS Raw File Viewer

SQL Server Integration Services has its own high-performance file format for persisting data - it's the "raw file." The downside of these files is that they can only be used from within an SSIS data flow - there is no other tool to open them.

Or is there?

Simon Sabin has developed a "Raw File Viewer" and has made it available for a free download - check it out here: http://sqlblogcasts.com/blogs/simons/archive/2007/01/11/SSIS-Rawfile-viewer---now-available.aspx

Yes, I realize that this is really old news as he posted this back in January (and it is now November) but as it has managed to slip under my radar for the last eleven months I figure that I'm not the only one who has missed it...

Thursday, November 1, 2007

Evangelists, Evangelists, Evangelists!

I consider myself an evangelist on SQL Server and BI technologies, but my friend and colleague Lynn Langit has this in her job title. She's a Developer Evangelists with Microsoft, and right now she's on the home page of the MSDN Channel 9 web site, talking about her job.

Check it out: http://channel9.msdn.com/Showpost.aspx?postid=352786

Edit: Damn, that's funny! I blogged this while I was downloading the WMV version of the webcast, before I actually watched it. Now I'm watching it, and laughing my head off.

CozyRoc SSIS Products

Although there is a "3rd party developer ecosystem" for SSIS components, it's not particularly active. I don't know if this is because there's no interest, if people just don't understand the platform well enough, or if the SSIS product group just did such a good job building the core platform that people don't see the need to turn to 3rd party vendors (or perhaps some other reason entirely) but I don't see a ton of add-on components out there.

But CozyRoc has just announced beta versions of a set of interesting-looking control flow and data flow components, including:

Package SSIS Connection Manager - Provides access to the current or different package at runtime.
Data Flow Source SSIS Data Flow Component - Very fast, non-blocking, in-memory data read from another data flow.
Data Flow Destination SSIS Data Flow Component - Exposes data from a data flow for in-memory read in another data flow.

You can check it out here (http://www.cozyroc.com/) - and they're free while in beta, so strike while the iron is hot.

SSIS Best Practices, Part 2

And as promised, here is my personal list of SQL Server Integration Services best practices. This is by no means comprehensive (please see my previous post for links to other best practice resources online) but it's a great starting point to avoiding my mistakes (and the pain that came with them) when you're working with SSIS. Here goes:

Get your metadata right first, not later: The SSIS data flow is incredibly dependent on the metadata it is given about the data sources it uses, including column names, data types and sizes. If you change the data source after the data flow is built, it can be difficult (kind of like carrying a car up a hill can be difficult) to correct all of the dependent metadata in the data flow. Also, the SSIS data flow designer will often helpfully offer to clean up the problems introduced by changing the metadata. Sadly, this "cleanup" process can involve removing the mappings between data flow components for the columns that are changed. This can cause the package to fail silently - you have no errors and no warnings but after the package has run you also have no data in those fields.

Use template packages whenever possible, if not more often: Odds are, if you have a big SSIS project, all of your packages have the same basic "plumbing" - tasks that perform auditing or notification or cleanup or something. If you define these things in a template package (or a small set of template packages if you have irreconcilable differences between package types) and then create new packages from those templates you can reuse this common logic easily in all new packages you create.

Use OLE DB connections unless there is a compelling reason to do otherwise: OLE DB connection managers can be used just about anywhere, and there are some components (such as the Lookup transform and the OLE DB Command transform) that can only use OLE DB connection managers. So unless you want to maintain multiple connection managers for the same database, OLE DB makes a lot of sense. There are also other reasons (such as more flexible deployment options than the SQL Server destination component) but this is enough for me.

Only Configure package variables: If all of your package configurations target package variables, then you will have a consistent configuration approach that is self-documenting and resistant to change. You can then use expressions based on these variables to use them anywhere within the package.

If it’s external, configure it: Of all of the aspects of SSIS about which I hear people complain, deployment tops the list. There are plenty of deployment-related tools that ship with SSIS, but there is not a lot that you can do to ease the pain related to deployment unless your packages are truly location independent. The design of SSIS goes a long way to making this possible, since access to external resources (file system, database, etc.) is performed (almost) consistently through connection managers, but that does not mean that the package developer can be lazy. If there is any external resource used by your package, you need to drive the values for the connection information (database connection string, file or folder path, URL, whatever) in a package configuration so they can be updated easily in any environment without requiring modification to the packages.

One target table per package: This is a tip I picked up from the great book The Microsoft Data Warehouse Toolkit by Joy Mundy of The Kimball Group, and it has served me very well over the years. By following this best practice you can keep your packages simpler and more modular, and much more maintainable.

Annotate like you mean it: You've heard of "test first development," right? This is good, but I believe in "comment first development." I've learned over the years that if I can't describe something in English, I'm going to struggle doing it in C# or whatever programming language I'm using, so I tend to go very heavy on the comments in my procedural code. I've carried this practice over into SSIS, and like to have one annotation per task, one annotation per data flow component and any additional annotations that make sense for a given design surface. This may seem like overkill, but think of what you would want someone to do if you were going to open up their packages and try to figure out what they were trying to do. So annotate liberally and you won't be "that guy" - the one everyone swears about when he's not around.

Avoid row-based operations (think “sets!”): The SSIS data flow is a great tool for performing set-based operations on huge volumes of data - that's why SSIS performs so well. But there are some data flow transformations that perform row-by-row operations, and although they have their uses, they can easily cause data flow performance to grind to a halt. These transformations include the OLE DB Command transform, the Fuzzy Lookup transform, the Slowly Changing Dimension transform and the tried-and-true Lookup transform when used in non-cached mode. Although there are valid uses for these transforms, they tend to be very few and far between, so if you find yourself thinking about using them, make sure that you've exhausted the alternatives and that you do performance testing early with real data volumes.

Avoid asynchronous transforms: In short, any fully-blocking asynchronous data flow transformation (such as Sort and Aggregate) is going to hold the entire set of input rows in memory before it produces any output rows to be consumed by downstream components. This just does not scale for larger (or even "large-ish") data volumes. As with row-based operations, you need to aggressively pursue alternative approaches, and make sure that you're testing early with representative volumes of data. The danger here is that these transforms will work (and possibly even work well) with small number of records, but completely choke and die when you need them to do the heavy lifting.

Really know your data – really! If there is one lesson I've learned (and learned again and again - see my previous blog post about real world experience and the value of pain in learning ;-) it is that source systems never behave the way you expect them to and behave as documented even less frequently. Question everything, and then test to validate the answers you retrieve. You need to understand not only the static nature of the data - what is stored where - but also the dynamic nature of the data - how it changes when it changes, and what processes initiate those changes, and when, and how, and why. Odds are you will never understand a complex source system well enough, so make sure you are very friendly (may I recommend including a line item for chocolate and/or alcohol in your project budget?) with the business domain experts for the systems from which you will be extracting data. Really.

Do it in the data source: Relational databases have been around forever (although they did not write the very first song - I think that was Barry Manilow) and have incredibly sophisticated capabilities work efficiently with huge volumes of data. So why would you consider sorting, aggregating, merging or performing other expensive operations in your data flow when you could do it in the data source as part of your select statement? It is almost always significantly faster to perform these operations in the data source, if your data source is a relational database. And if you are pulling data from sources like flat files which do not provide any such capabilities there are still occasions when it is faster to load the data into SQL Server and sort, aggregate and join your data there before pulling it back into SSIS. Please do not think that SSIS data flow doesn't perform well - it has amazing performance when used properly - but also don't think that it is the right tool for every job. Remember - Microsoft, Oracle and the rest of the database vendors have invested millions of man years and billions of dollars[1] in tuning their databases. Why not use that investment when you can?

Don’t use Data Sources: No, I don't mean data source components. I mean the .ds files that you can add to your SSIS projects in Visual Studio in the "Data Sources" node that is there in every SSIS project you create. Remember that Data Sources are not a feature of SSIS - they are a feature of Visual Studio, and this is a significant difference. Instead, use package configurations to store the connection string for the connection managers in your packages. This will be the best road forward for a smooth deployment story, whereas using Data Sources is a dead-end road. To nowhere.

Treat your packages like code: Just as relational databases are mature and well-understood, so is the value of using a repeatable process and tools like source code control and issue tracking software to manage the software development lifecycle for software development projects. All of these tools, processes and lessons apply to SSIS development as well! This may sound like an obvious point, but with DTS it was very difficult to "do things right" and many SSIS developers are bringing with them the bad habits that DTS taught and reinforced (yes, often through pain) over the years. But now we're using Visual Studio for SSIS development and have many of the same capabilities to do things right as we do when working with C# or C++ or Visual Basic. Some of the details may be different, but all of the principles apply.

In the days and weeks ahead I hope to have some more in-depth posts that touch on specific best practices, but this will have to do for now. It's late, and I have work to do...

[1] My wife has told me a million times never to exaggerate, but I never listen.

SSIS Best Practices, Part 1

One of the topics on which I've presented quite a bit lately has been that of SQL Server Integration Services best practices. The body of knowledge on how to best use SSIS is small compared to more mature development technologies, but there is a growing number of resources out there related to SSIS best practices.

And here it is. ;-)

In any event, here is a list of links to those online resources I have found related to SSIS best practices:

SQL Server Integration Services (SSIS) 10 Quick Best Practices from Kuldeep Chauhan (http://blogs.msdn.com/sqllive/archive/2007/05/21/sql-server-integration-services-ssis-10-quick-best-practices.aspx): This is a relatively straightforward list, and presents some "low hanging fruit" for avoiding common mistakes.
SQL SSIS Best Practices from Roy Ashbrook (http://drowningintechnicaldebt.com/blogs/royashbrook/archive/2007/05/04/sql-ssis-best-practices.aspx): This is more of a collection of links than any truly original material, but it's still good stuff.
Best Practices from the SQLIS.com Wiki (http://wiki.sqlis.com/default.aspx/SQLISWiki/BestPractices.html): Another short collection of links - hopefully it will grow sooner rather than later.
SSIS Best Practices - Performance from Ashvini Sharma (http://blogs.msdn.com/ashvinis/archive/2005/09/27/474563.aspx): This one predates the release of SQL Server 2005 so it's a little dated and a little sparse, but it still covers practices that I see people overlooking today.
SSIS: Suggested Best Practices and naming conventions from Jamie Thompson (http://blogs.conchango.com/jamiethomson/archive/2006/01/05/SSIS_3A00_-Suggested-Best-Practices-and-naming-conventions.aspx): Yes, I've saved the best for last. This is the most comprehensive list of SSIS best practices that I have found online to date. Jamie is a great voice for the SSIS community, and has obviously put a lot of thought and work into this post.

I'd also like to point out that there is a decent amount of overlap and a decent amount of contradiction between the various best practices referenced above. Not everyone agrees on what's best, and to a large extent, "best" is determined as much by the context in which you're using SSIS as it is by SSIS itself. Read all of these resources (and everything else, for that matter) with an open mind and apply their guidance with a healthy dose of pragmatism.

And although this doesn't fall into the same "lists of things to do and to not do" category as the links above, anyone who is interested in doing SSIS correctly needs to check out Microsoft's Project REAL (http://www.microsoft.com/sql/solutions/bi/projectreal.mspx) as well.

Of course, I have no intention of stopping here. (The whole "Part 1" thing in the title probably clued you in on this already, right?) At some point - hopefully today, but possibly as late as this weekend - I will be posting my own personal list of best practices, but I wanted to give credit where credit is due before doing so.

SSIS Performance Tuning Techniques

This article is neither new nor my own, but I've shown it to a few people lately and despite their having worked with SSIS for months, none of them had seen it.

http://www.microsoft.com/technet/prodtechnol/sql/2005/ssisperf.mspx

This is a white paper that describes how the SSIS run time engine and data flow engine work, and how to take advantage of that knowledge to build better performing SSIS packages. It has excellent coverage of execution trees and memory buffers, and how they are used "under the hoods" in your data flow.

If you haven't read it before, please check it out today and tell a friend. It's required reading for anyone building SSIS packages in the real world.