Thursday, November 1, 2007

SSIS Best Practices, Part 2

And as promised, here is my personal list of SQL Server Integration Services best practices. This is by no means comprehensive (please see my previous post for links to other best practice resources online), but it's a great starting point for avoiding my mistakes (and the pain that came with them) when you're working with SSIS. Here goes:
  • Get your metadata right first, not later: The SSIS data flow is incredibly dependent on the metadata it is given about the data sources it uses, including column names, data types and sizes. If you change the data source after the data flow is built, it can be difficult (kind of like carrying a car up a hill can be difficult) to correct all of the dependent metadata in the data flow. Also, the SSIS data flow designer will often helpfully offer to clean up the problems introduced by changing the metadata. Sadly, this "cleanup" process can involve removing the mappings between data flow components for the columns that changed. This can cause the package to fail silently: you get no errors and no warnings, but after the package has run you also have no data in those fields.

  • Use template packages whenever possible, if not more often: Odds are, if you have a big SSIS project, all of your packages have the same basic "plumbing" - tasks that perform auditing or notification or cleanup or something. If you define these things in a template package (or a small set of template packages if you have irreconcilable differences between package types) and then create new packages from those templates you can reuse this common logic easily in all new packages you create.

  • Use OLE DB connections unless there is a compelling reason to do otherwise: OLE DB connection managers can be used just about anywhere, and there are some components (such as the Lookup transform and the OLE DB Command transform) that can only use OLE DB connection managers. So unless you want to maintain multiple connection managers for the same database, OLE DB makes a lot of sense. There are other reasons as well (such as the OLE DB destination offering more flexible deployment options than the SQL Server destination component), but this is enough for me.

  • Only configure package variables: If all of your package configurations target package variables, you will have a consistent configuration approach that is self-documenting and resistant to change. You can then use expressions based on these variables anywhere within the package.
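To make that concrete, here is a minimal sketch (the variable and file names are hypothetical): the configuration sets a single package variable, and a property expression carries that value wherever it's needed. For example, the ConnectionString property expression on a Flat File connection manager could be:

```
@[User::SourceFolder] + "\\Customers.csv"
```

Now a configuration only ever has to touch User::SourceFolder, no matter how many connection managers or tasks depend on it.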

  • If it’s external, configure it: Of all of the aspects of SSIS about which I hear people complain, deployment tops the list. There are plenty of deployment-related tools that ship with SSIS, but there is not a lot that you can do to ease the pain related to deployment unless your packages are truly location independent. The design of SSIS goes a long way to making this possible, since access to external resources (file system, database, etc.) is performed (almost) consistently through connection managers, but that does not mean that the package developer can be lazy. If there is any external resource used by your package, you need to drive the values for the connection information (database connection string, file or folder path, URL, whatever) from a package configuration so they can be updated easily in any environment without requiring modification to the packages.
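As a sketch of what this looks like on disk (the variable, server and database names here are made up for illustration), an XML package configuration (.dtsConfig) that drives a connection string variable might look like this:

```xml
<DTSConfiguration>
  <Configuration ConfiguredType="Property"
                 Path="\Package.Variables[User::SourceConnectionString].Properties[Value]"
                 ValueType="String">
    <ConfiguredValue>Provider=SQLNCLI.1;Data Source=PRODSQL01;Initial Catalog=SourceDB;Integrated Security=SSPI;</ConfiguredValue>
  </Configuration>
</DTSConfiguration>
```

Point the package at a different .dtsConfig file in each environment, and nothing inside the package itself ever needs to change.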

  • One target table per package: This is a tip I picked up from the great book The Microsoft Data Warehouse Toolkit by Joy Mundy of The Kimball Group, and it has served me very well over the years. By following this best practice you can keep your packages simpler and more modular, and much more maintainable.

  • Annotate like you mean it: You've heard of "test first development," right? This is good, but I believe in "comment first development." I've learned over the years that if I can't describe something in English, I'm going to struggle doing it in C# or whatever programming language I'm using, so I tend to go very heavy on the comments in my procedural code. I've carried this practice over into SSIS, and like to have one annotation per task, one annotation per data flow component and any additional annotations that make sense for a given design surface. This may seem like overkill, but think of what you would want someone to do if you were going to open up their packages and try to figure out what they were trying to do. So annotate liberally and you won't be "that guy" - the one everyone swears about when he's not around.

  • Avoid row-based operations (think “sets!”): The SSIS data flow is a great tool for performing set-based operations on huge volumes of data - that's why SSIS performs so well. But there are some data flow transformations that perform row-by-row operations, and although they have their uses, they can easily cause data flow performance to grind to a halt. These transformations include the OLE DB Command transform, the Fuzzy Lookup transform, the Slowly Changing Dimension transform and the tried-and-true Lookup transform when used in non-cached mode. Although there are valid uses for these transforms, they tend to be very few and far between, so if you find yourself thinking about using them, make sure that you've exhausted the alternatives and that you do performance testing early with real data volumes.
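For example, instead of using the OLE DB Command transform to fire one UPDATE statement per row, you can land the incoming rows in a staging table and issue a single set-based UPDATE. This is just a sketch; the table and column names are made up:

```sql
-- One statement updates all matching rows at once,
-- instead of one round trip to the database per row
UPDATE dim
SET    dim.CustomerName = stg.CustomerName,
       dim.City         = stg.City
FROM   dbo.DimCustomer AS dim
JOIN   dbo.StgCustomer AS stg
  ON   stg.CustomerKey = dim.CustomerKey;
```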

  • Avoid asynchronous transforms: In short, any fully-blocking asynchronous data flow transformation (such as Sort and Aggregate) is going to hold the entire set of input rows in memory before it produces any output rows to be consumed by downstream components. This just does not scale for larger (or even "large-ish") data volumes. As with row-based operations, you need to aggressively pursue alternative approaches, and make sure that you're testing early with representative volumes of data. The danger here is that these transforms will work (and possibly even work well) with small numbers of records, but completely choke and die when you need them to do the heavy lifting.

  • Really know your data – really! If there is one lesson I've learned (and learned again and again - see my previous blog post about real world experience and the value of pain in learning ;-) it is that source systems never behave the way you expect them to and behave as documented even less frequently. Question everything, and then test to validate the answers you retrieve. You need to understand not only the static nature of the data - what is stored where - but also the dynamic nature of the data - how it changes when it changes, and what processes initiate those changes, and when, and how, and why. Odds are you will never understand a complex source system well enough, so make sure you are very friendly (may I recommend including a line item for chocolate and/or alcohol in your project budget?) with the business domain experts for the systems from which you will be extracting data. Really.

  • Do it in the data source: Relational databases have been around forever (although they did not write the very first song - I think that was Barry Manilow) and have incredibly sophisticated capabilities to work efficiently with huge volumes of data. So why would you consider sorting, aggregating, merging or performing other expensive operations in your data flow when you could do it in the data source as part of your select statement? It is almost always significantly faster to perform these operations in the data source, if your data source is a relational database. And if you are pulling data from sources like flat files, which do not provide any such capabilities, there are still occasions when it is faster to load the data into SQL Server and sort, aggregate and join your data there before pulling it back into SSIS. Please do not think that the SSIS data flow doesn't perform well - it has amazing performance when used properly - but also don't think that it is the right tool for every job. Remember - Microsoft, Oracle and the rest of the database vendors have invested millions of man years and billions of dollars[1] in tuning their databases. Why not use that investment when you can?
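As a sketch (the fact table and column names here are hypothetical), this pushes both the sort and the aggregation into the source query rather than using the Sort and Aggregate transforms in the data flow:

```sql
-- Let the database engine do the expensive work
SELECT   CustomerKey,
         SUM(SalesAmount) AS TotalSales
FROM     dbo.FactSales
GROUP BY CustomerKey
ORDER BY CustomerKey;
```

If a downstream component such as Merge Join needs the rows sorted, remember to mark the source output as already sorted (the IsSorted and SortKeyPosition properties in the advanced editor) so the data flow knows it doesn't need to sort again.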

  • Don’t use Data Sources: No, I don't mean data source components. I mean the .ds files that you can add to your SSIS projects in Visual Studio in the "Data Sources" node that is there in every SSIS project you create. Remember that Data Sources are not a feature of SSIS - they are a feature of Visual Studio, and this is a significant difference. Instead, use package configurations to store the connection string for the connection managers in your packages. This will be the best road forward for a smooth deployment story, whereas using Data Sources is a dead-end road. To nowhere.

  • Treat your packages like code: Just as relational databases are mature and well-understood, so is the value of using a repeatable process and tools like source code control and issue tracking software to manage the software development lifecycle for software development projects. All of these tools, processes and lessons apply to SSIS development as well! This may sound like an obvious point, but with DTS it was very difficult to "do things right" and many SSIS developers are bringing with them the bad habits that DTS taught and reinforced (yes, often through pain) over the years. But now we're using Visual Studio for SSIS development and have many of the same capabilities to do things right as we do when working with C# or C++ or Visual Basic. Some of the details may be different, but all of the principles apply.
In the days and weeks ahead I hope to have some more in-depth posts that touch on specific best practices, but this will have to do for now. It's late, and I have work to do...

[1] My wife has told me a million times never to exaggerate, but I never listen.


dave morris said...

Could you elaborate on the "One Target Table per package" rule? I am having trouble understanding how this would work in a real world scenario when you want to send exception rows to a separate table. This would have to be done all in a single data flow, and would result in having at least two database destinations.

In order to support transactions, you apparently need to have a separate connection manager for each destination table, not each destination database. This turns out to be a good reason to have only one target table per package, but I find it almost impossible to build a solid package without at least two destinations in each data flow task.

dave morris said...

One more thing before I forget. I have had a really hard time getting an Oracle connection manager to work without using a Shared Data Source file. I have found that if I use a connection manager without a shared data source, configure the connection string in a dtsConfig file, and then go in and hard code the password after the dtsConfig file is created, I have to retype the password into the connection manager every time I open the package, otherwise the DFTs just show warnings...

When I do the same exact thing with a Shared Data Source, it works every time without any trouble.

I think I'm just complaining because I just read an enlightening blog post called "I hate SSIS":

SSISBlogger71 said...

Could you elaborate more on why using a data source is a bad idea. I do understand that that it is a feature of VS. But what makes it a bad idea to use with the packages?

Matthew Roche said...

@SSISBlogger71 - Take a look here:
