Wednesday, September 10, 2008

Views as "Interfaces" for SSIS

This is a technique that I have used for quite some time, and while I have mentioned it to others often enough, I have never blogged on it. A recent conversation on the MVP newsgroups made me think of it again, and hopefully I'll get the chance to type it all up before I get pulled back to the real world.

Here's the scenario:

  • An SSIS package is extracting data from a 3rd party database.
  • The package is deployed to multiple dev/test/prod environments.
  • The package uses configurations so that deploying to multiple environments does not require changes to the package file.
  • The database schema of the various environments is "identical." [1]
  • The package works in one environment, but when deployed to another environment, it fails validation with the dreaded "VS_NEEDSNEWMETADATA" error.
  • After various debugging and diagnostic steps are completed, the package suddenly works again, even though "nothing has been changed."[2]

Hey wait - if the database schemas are identical, you shouldn't get this error, right? And since nothing has changed, the error shouldn't just disappear. What's going on here? Obviously SSIS shouldn't behave like this.

Of course, the real world story here is that someone was changing the schema of the test environment and had not communicated that change to anyone. Once this is known, the solution is pretty obvious, and it's clear that SSIS is behaving as desired, enforcing strict typing in the data flow.

But if this was a simple "user error" scenario it wouldn't be appropriate blog fodder, would it?

But unfortunately, this is not a rare, one-off scenario, and the solution generally involves more communication than it involves technology. But before looking at a solution, let's see if we can summarize the problem. These are the significant problem factors that I see when I look at this scenario:

  • The SSIS data flow is incredibly strict when it comes to data types. This strictness is closely related to interfaces in object oriented programming in that the names, order and data types of the columns being consumed cannot change in any way once a client (the SSIS data flow) exists that is built to use the interface.
  • When the "interface" of the source data changes, any data flows that reference it will break with the "VS_NEEDSNEWMETADATA" error mentioned above.
  • Database administrators are generally pretty good about not changing column names and data types on tables when they know that other database objects rely on those tables.[3]
  • There is no simple/easy/straightforward way for a DBA to tell if an SSIS package uses a given table. Solutions generally rely upon documentation that is external to the database, or just waiting until something breaks.

So how does this problem description help us reach a solution? Take a look at the last two bullets and the solution practically presents itself: Use database views as an "interface layer" within the database itself.

But what does this mean? In short, it means that:

  • The SSIS data flows should extract views from views and not from base tables.
  • The views should "self-documenting" to DBAs because they are database objects with concrete dependencies with the base tables from which they draw their data.
  • The views should explicitly CAST each column to an explicit data type and column alias - even when the alias name and data type exactly match the base table - to provide an additional layer of insulation against changes to the base tables.

The key here comes from the second and third bullets, above. Having the views as "clients" to the base tables should make it more obvious to people who are changing table schemas that the the tables in question are being used. And the explicit data type casting in the view definitions will define the "interface" required by the SSIS package clients in a way that is familiar to DBAs.

Now obviously this isn't always an option - many DBAs will not allow the creation of new database objects (such as views) but when this is viable it is a technique with which I have had great success. And when it's not, you can still get some of the same benefits (namely the data type change protection) by basing your data flow on a source query that explicitly casts and aliases each column in the SELECT statement. Then, so long as data type changes to the underlying tables can still support an explicit cast to the data type required by the SSIS data flow, the package still won't break.

So there you have it - a technique to isolate SSIS data flows from changes in source databases, and to reduce the likelihood of those changes happening in the first place.

What do you think? I'd love to hear if anyone has used similar techniques...

[1] Yes, this is in quotes for a reason

[2] Same thing here

[3] This is a pretty broad statement, I know...


Davide said...

Hi Mattew, I use the very same tecnique, and I'm really happy with it. Views are great to way to smooth the "friction" you have between database and anything outsite it, that shields the external world from database schema changes and doesn't incour in performance penalities.

Davide said...

The comment above is mine - Davide Mauri - don't know why google put the nickname "manowar" :-)
I starting to think that google knows too much about me! :-)

Matthew Roche said...

It's not Google - it's the Power of Steel that cannot be denied...