Friday, March 13, 2009

Indexes and SSIS Data Flow Performance

Wow – it’s been a while, hasn’t it?

As you may know, I took a job with Microsoft back in October. It’s been a fun (and insanely busy) five months; the work is challenging and diverse, the team I work with is talented and motivated and there is always something new and interesting to learn and/or do. What’s not to like?

In fact, even though BI isn’t a core component of my job role, I have been working with SSIS and the rest of the SQL Server 2008 BI tools to get easier access to some of the data I need to make better decisions and to support the recommendations I make with some “real numbers.”[1] It’s an easy way to add some value to existing data, and for me it’s a fun way to keep my BI skills from getting too rusty.

So… I re-learned an interesting lesson the other day: Don’t load data into indexed tables.

One of the SSIS packages I’ve built extracts data from a set of text files and loads it into a set of SQL Server tables.[2] Around 20 GB of text file data ends up in one particular table. When I ran this package Monday evening, I found it was still running Tuesday morning, which was not what I expected. I looked around in SSMS and noticed that there was one non-clustered index on this table; I had added it recently and had not dropped or disabled it before running the package. So when the package finished running, I took a look at the execution results and found this:

DTExec: The package execution returned DTSER_SUCCESS (0).
Started:  6:03:50 PM
Finished: 9:54:20 AM
Elapsed:  57028.7 seconds

That’s right – the package took nearly 16 hours to run. Ugh!
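As an aside: rather than hunting through SSMS, a quick query against the catalog views will tell you whether a target table still has active indexes before you kick off a load. Here’s a minimal sketch, with dbo.FactStaging standing in as a placeholder table name:

-- List every index on a table, and whether each one is disabled.
-- dbo.FactStaging is a hypothetical table name.
SELECT i.name, i.type_desc, i.is_disabled
FROM sys.indexes AS i
WHERE i.[object_id] = OBJECT_ID(N'dbo.FactStaging')
  AND i.index_id > 0; -- index_id 0 is the heap itself, not an index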

So I re-ran the package this morning, after having ensured that all indexes on this table had been dropped. And the result?

DTExec: The package execution returned DTSER_SUCCESS (0).
Started:  9:49:21 AM
Finished: 10:41:15 AM
Elapsed:  3114.05 seconds

That’s right – the package completed in just under 52 minutes, taking roughly 5.5% of the time it took to execute the package against the indexed table.

But wait, you say – that’s not a fair comparison. You still need to rebuild that index!

You’re right, of course – the time to rebuild the index needs to be considered before we can make any apples-to-apples comparison. So I re-created all of the indexes on this table (four indexes, each with included columns, for a total index size of 15.8 GB), which took an hour and 39 minutes. Even though the rebuild took longer than the package execution itself, the total end-to-end time – roughly 52 minutes of loading plus 99 minutes of index rebuilding – is still only around 15.9% of the package execution time with one index on the target table. In the end we achieved an 84% reduction in execution time.
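If you want to check the same sort of numbers on your own tables, the catalog views will add up index sizes for you. Here’s a rough sketch of the kind of query I mean – again, dbo.FactStaging is just a placeholder name:

-- Approximate per-index size for one table, converting 8 KB pages to MB.
-- dbo.FactStaging is a hypothetical table name.
SELECT i.name AS index_name,
       SUM(ps.used_page_count) * 8 / 1024.0 AS used_mb
FROM sys.indexes AS i
JOIN sys.dm_db_partition_stats AS ps
  ON ps.[object_id] = i.[object_id]
 AND ps.index_id = i.index_id
WHERE i.[object_id] = OBJECT_ID(N'dbo.FactStaging')
  AND i.index_id > 1 -- non-clustered indexes only
GROUP BY i.name
ORDER BY used_mb DESC;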

So what’s the lesson?

Drop or disable your indexes before loading data into your tables, and then re-create or rebuild them once the load is done. If you don’t, SQL Server will incrementally maintain every index as each row is inserted, and that per-row overhead is what kills your performance.
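In T-SQL terms, the pattern looks something like this. It’s only a sketch – the table and index names (dbo.FactStaging and IX_FactStaging_LoadDate) are hypothetical:

-- Before the load: disabling a non-clustered index keeps its definition
-- in the catalog but drops its data, so the load no longer has to
-- maintain it row by row.
ALTER INDEX IX_FactStaging_LoadDate ON dbo.FactStaging DISABLE;

-- ... run the SSIS package / bulk load here ...

-- After the load: rebuilding re-creates the index in a single pass over
-- the freshly loaded data, which also re-enables it.
ALTER INDEX IX_FactStaging_LoadDate ON dbo.FactStaging REBUILD;

-- Or rebuild every index on the table at once:
ALTER INDEX ALL ON dbo.FactStaging REBUILD;

One caveat: don’t use ALTER INDEX ALL ... DISABLE on a table that has a clustered index, since disabling the clustered index makes the table itself inaccessible until it is rebuilt. Disable the non-clustered indexes individually instead.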

This isn’t new advice – this has been a best practice for bulk loading data into SQL Server for as long as I can remember. But it is something that I have not blogged on before, and since failing to follow this advice bit me this week I thought I would take the time to share my pain with the world in hopes that someone else can avoid it.

And, of course, it’s good to blog on SSIS again. Hopefully I’ll be posting again before another few months have passed…

[1] Please note that I’m not calling this a “BI solution” here. At this point I am more in a discovery phase than anything else. The components I’ve developed, while functional, will probably get thrown away once they have served their purpose as prototypes and learning tools. All of the performance numbers come from running the packages on a laptop, with both the target database’s data file and the source text files on the same external 5,400 RPM USB hard drive. No, I wouldn’t do this in production either. ;-)

[2] The astute reader will note that this is in flagrant violation of the “each package should only load data into one table” best practice that I talk about whenever I have someone who will listen. The scenario in which I’m working here provides an interesting exception to this guideline, and please trust that I did think hard before going against my own advice.

1 comment:

Carolus Holman said...

Lesson Learned!
We usually bulk load into a staging table: drop the indexes, load the data, run the transformations, and then rebuild the indexes.

Could you imagine a table with 12 billion rows? It currently takes 2 hours to load 115 million rows into 2 separate servers.