Saturday, June 7, 2008

Sharing The Lookup Cache

Sometimes I'm the last person to know. I like to think that I know at least a little about SQL Server Integration Services, but it still seems that I'm always discovering some technique or tidbit of knowledge that had previously escaped my notice. I learned one such lesson this week during the TechEd Developers conference in Orlando. I was co-presenting with Donald Farmer[1] during his "Microsoft SQL Server 2008 Integration Services: From D'oh to Wow" session Friday morning. Here's the deal:

in the SSIS data flow (both 2005 and 2008) the in-memory cache for a fully-cached Lookup Transformation can be shared between multiple Lookup Transformations, so long as they all use identical queries to populate their cache. This means that if two Lookups can be implemented with the same query (not a particularly common scenario, but it does happen from time to time) the performance of the package can be improved by only loading the lookup data into memory once.

There are opportunities here to refactor the lookup queries in existing data flows - consider a scenario where a data flow contains two Lookup Transformations, one to add a Department Name field to the data flow an done to and one to add a Department Contact field, both looking up based on the Department Key. A natural approach to solve this problem would be to have each Lookup use just the two columns it needs, thus making the lookup data as narrow as possible. (This is a best practice after all, right?) But by having both Lookups use the same three-column lookup query you can reduce the volume of data being read and cached, and potentially improve package performance. What's not to like?

Funny you should ask.

After the session was over, I went to visit my friend and colleague John Welch, who was staffing one of the SQL Server Business Intelligence booths at the conference. I explained to him what I'd just learned (I was all excited in being able to share this with someone who loves SSIS as much as I do) and asked him if he had every heard of such a feature. I'll paraphrase here, but his response was basically "uh, yeah." And John filled me in on the dark side of this cool little feature.

Take a look here: http://support.microsoft.com/kb/924016

That's right, there is a known bug in SSIS related to this behavior. Basically if there are multiple data flows in the same package, or in multiple packages that run in the same process, it is possible for the cache to be cleared before all of the Lookup transformations are done using it. There are some good workarounds (and some not-so-good ones too - I would probably not want to disable caching for my Lookups to work around this bug) included in the KB article above that can make this feature usable, but you need to be careful using it. And of course if you're using SQL Server 2008 SSIS, you can explicitly cache your data and share it in many more interesting and complex scenarios thanks to the new Cache Connection Manager options added to the Lookup transformation.

And of course this wasn't the only thing I learned yesterday, but it's the only one for which I have blogging time today. I've got the others filed away so once TechEd is done (maybe before, but as I have two breakout sessions for which to prepare this week I would not hope for too much) I'll have more SSIS goodies to blog.

 

[1] Donald was the Program Manager for SSIS during the SQL Server 2005 release, and is now the Principal Program Manager for SQL Server Data Mining. It was a great honor (and a ton of fun!) to share the spotlight with such an engaging, knowledgeable and funny speaker. And to make things even cooler, we're going to team up again for a repeat performance on Wednesday during the TechEd IT Professionals conference.

No comments: