Thursday, August 23, 2007

Loading Multiple Excel Files with SSIS

Now that I have the time to look up and see what's going on in the world around me, I'm trying to spend more time on the SSIS Forums on MSDN. This is a great community-driven resource where lots of really bright and dedicated people (and sometimes me, too) help the growing body of SSIS developers solve their problems. And looking around the forums I've found several posts (including this one) asking about how to load data from multiple Excel files using a single SSIS package. This is an exceptionally easy task for SSIS, but I could not find[1] any online resource that walked you through step by step.

So here you go: Step by step instructions on how to load data from multiple Excel files with a single SSIS package.

1. Create an SSIS project in Visual Studio


2. Create multiple Excel files in the project folder – make sure that they have an identical format but different data




3. Add an Excel connection manager to the package – point it at of one Excel file
4. Add a Data Flow task to package


5. Add an Excel source component to the Data Flow
6. Configure the Excel source to use Excel connection manager and the correct worksheet (Sheet1 in this example)



7. Click the Preview button to be certain that it works



8. Build the rest of data flow – in this case we’re simply counting the rows, but you might want to put the data into a database or something ;-)



9. Make sure the Excel file is not open – this will cause errors every time if Excel has the file open when you try to load it
10. Execute the package



11. At this point we know that we have everything built correctly to load a single Excel file – now we need to modify the package to work with an arbitrary set of files
12. Copy and paste into notepad at the ConnectionString property of the connection manager – it should look something like this: Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\Projects2005\SSIS_Excel_Loop\SSIS_Excel_Loop\email_book_01.xls;Extended Properties="Excel 8.0;HDR=YES";
13. Create a package variable called FileName and give it the value of the file name from the ConnectionString



14. Update the connection manager’s ConnectionString property to be based on an expression that uses the FileName variable to duplicate its tested value - "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + @[User::FileName] + ";Extended Properties=\"Excel 8.0;HDR=YES\";"
(Note how we're using the backslash "\" character to escape the double-quotes that are part of the connection string without breaking the expression!)



15. Run the package again to make sure it still works – By testing after each change you will catch errors early, and not need to try to figure out where you went wrong if you get to the last step and things don’t work




16. Add a Foreach Loop Container to the control flow and drag the Data Flow task to be contained within it



17. Edit the Foreach Loop Container so that it loops through all xls files in the folder and retrieves the fully qualified file name and maps that value to the FileName variable





18. Run the package again – Voila!! We've now loaded multiple Excel files with a single SSIS package



19. If you're feeling particularly skeptical (after all, how can you be certain that really looped through the files?) you can easily add a Data Viewer to the data flow and look at the data being loaded from each Excel file





And there you have it. It's not exactly a complicated scenario, but it is very common, very "real world" and it shows off some of SSIS's most powerful features, namely variables and property expressions. We could also easily update this example to use a configuration to set the path of the folder through which the Foreach Loop Container enumerates, but I'll leave hat as an exercise for the reader.

[1] If you do find an existing online resource that goes through this step-by-step, please don't tell me. Loading images using Blogger is painful, and I'd hate to think I went through that torture for naught. ;-)

38 comments:

Igorblackbelt said...

Is it possible to do the same if the tab names across the files are different?

Romain said...

Hi,

I've been working with SSIS for a week, trying to migrate some DTS packages.

My present problem is the following:
How can I load datas from several workbooks (contained in one file), wich contains several worksheets?

Thanks to your article Matthew, I can load the first worksheet's data of all my workbooks. But how can I add a new loop to load all my worksheets?

Thx,

Romain

Romain said...

Just a little add about the process:

if you have a problem with the connectionString while using the variable, that's because the connectionString is readen before the foreach loop, and it is then invalid because the variable has no value at this moment.

to make sure the foreach loop processed before, set package's parameter DelayValidation to True

Romain said...

http://www.developpez.net/forums/showthread.php?t=410671

answer is here... in french

Matthew Roche said...

Take a look at this post I made today:

http://bi-polar23.blogspot.com/2007/09/loading-multiple-excel-files-with-ssis.html

It won't get you 100% of the way (I don't have time to write the Excel code, sorry) but hopefully it will get you most of the way there.

Brian said...

Great article, very detailed and well written, you probably save me 3 days of work. Really great job.

Matthew Roche said...

Thanks, Brian!

I've been so swamped with work lately that I've been neglecting my blog (again) but it's great to know that someone finds this useful. Maybe that's the inspiration I need to get blogging on more technical topics again. Of course, this means I'll have to give up sleep... ;-)

Peter said...

Hi Mathew,

Very interesting post. I've followed your example exactly as it is, but each time I get an aquire connection manager error.

It's infuriating.

Any suggestions on what I might try would be gratefully received.

Peter

Matthew Roche said...

Peter - The first thing I would do is ensure that you do not have the file open in Excel when you attempt to run the SSIS package. This will cause the package to fail every time.

If this is not the problem, then you'll need to post the exact steps you're taking (it's not enough to simply say that you're following these instructions exactly, as there are several steps involved) and the exact error message you're receiving if I'm going to be able to help.

Peter said...

Mathew,

Thanks for the quick response.

I made a very silly error. Instead of copying the file path to the variable, I copied the connection string. I guess I shouldn't work on my laptop in the evening in front of the TV.

Apologies for wasting your time.

Matthew Roche said...

No worries, Peter!

This is the sort of mistake I personally make all the time. That's the reason that I always try to to test the package with each incremental change I make, so that when I do something stupid and break it (as I invariably do) there is a smaller delta that I need to consider.

Woody said...

Matthew,

Thanks for the great example of how to load multiple excel files.

I have a lot more to overcome for my current need. Now that I have the data coming in from the excel sheets I have to do so much more than row counting.

I now need to load that data into a table (I think I know how to do this part) however, once it's in the table, I need to sort the data and eliminate any duplicate rows, then take the data from the table and use it to create multiple output files (I have the script written to take data from a pre-sorted flat file and do this split currently, so I should be able to modify that part as well).

After all that I need to zip those output files and then ftp the zip file to an outside location..

You responded to my post on msdn, so thanks again for the help so far, if you can offer any more help it's certainly welcome.

Woody

Matthew Roche said...

Woody - If you recall my response to your post on the MSDN forums, you'll see that I referenced another blog post where I showed a technique for exporting to multiple text files, and to the SQL Server ranking functions as potential tools to solve the other parts of your problem. If you want more valuable input, you're going to need to provide more complete and precise problem descriptions.

Also, the MSDN forums are the ideal place for continuing this conversation. Not only do the forums serve a much broader audience (so the entire SSIS community can benefit from the exchange) but you can also benefit from a much broader group of "experts." I'm just one of many people who help out on the MSDN forums, and if you post there, you can get more help from more people, more often (for example, I'm going offline for the weekend now ;-) than you will get here.

Woody said...

I will definitely update the original post on msdn as your first blog sample was great for learning the For Each loop!

Thanks, have a great weekend!

Woody

Lothar said...

Hi Matthew,

do you know how I can apply the same procedure to Excel files which are located in a SharePoint library? I'm not really sure where to involve a http-connection manager...

Thx,
Lo

Clarice said...

Hi, I went through the whole looping process for xls file and read in the FileName variable. BUT when i tried to pass this filename value to the derived column, somehow I get an error as the data is not passed in. How can I pass this filename for each file(which has been read by the file enumerator) to the derived column?

thanks

Matthew Roche said...

Lothar - There is no native (and by "native" here I mean "where you don't need to write .NET code yourself")) way to use this technique for Excel files hosted on a SharePoint portal. The Foreach File enumerator in SSIS only works with local or LAN (SMB) file systems and cannot work with an HTTP source.

I've seen this question (or variations thereof) posted on the SSIS forums on MSDN several times. The best thing to do is to look there and if you cannot find an existing solution, post your question there where a large community can help answer it.

http://forums.microsoft.com/MSDN/ShowForum.aspx?ForumID=80&SiteID=1

Matthew Roche said...

Clarice - It's really not clear what you're doing, what error you're getting or where the Derived Column transformation comes into play. There is no Derived Column in the example I posted, so I must be missing something.

The best thing to do is to look on the MSDN Forums for SSIS and post your question there where a large community can help answer it.

http://forums.microsoft.com/MSDN/ShowForum.aspx?ForumID=80&SiteID=1

I've been travelling a lot lately and things aren't looking any less hectic in the weeks ahead, so you're much more likely to get timely assistance if you post your question (hopefully with more details) there.

Good luck!

Clarice said...

Actually i managed to debug the code. It was because the file path was too long. And I managed to find a way to get the actual filename. - I redefined the value of the variable fileName which was passed over to the derived column.
Did a substring and hey presto, it works!
thanks lots for the illustration. Ur post really was very helpful indeed:)!

Rajesh Krishnamoorthy said...

This post really helped me a lot in solving my problems today. thanks!

Vikram said...

I have been trying to find the expression builder to debug the error in my Connection String value but can't seem to find a it in SS BIDS. Any pointer on how to start the Expression builder will be very helpful.

Thanks,
Vik.

Christian said...

Seriously, this was exactly what I needed and it worked!

Duli said...

Hi Matthew,

I'm trying to figure out the answer to a different problem but not altogether too different.

If you were told you will be receiving spreadsheets with different layouts (1 of 50 lets say) and the layout specs were given to you previously (which reside in their own excel files), but you dont want to create 50 different packages for each layout (since they have different delimiters, mappings etc), is there a way to dynamically get that info and dump it into a package?

Mike Agafonov said...

Hi!

I do have a problem: the foreach enumerator works only if there exists a file which I specified during package configuration (C:\temp\aaa.xls). If it's present, then enumerator takes all other excel files too with any names, but if it's not, then it stops with error "Opening a rowset 'AA' failed. Check if object exists in the database."
Any thoughts?

J.Fuqua said...

Excellent post, exactly what I was looking for. Thanks for saving me time.

Alex said...

For realize this plan there is nice tool-ms Excel repairing tool,application has many other resources and is free as far as I know,it keep several copies of your workbook in different places, you can avoid the loss of information as a result of data corruption or virus attack,ecover corrupted information in Microsoft Excel format, please download Excel files repairing tools right now and try to repair your Excel files with MS Excel repair tool,can open and recover damaged documents in Microsoft Excel format,sove next problems on example if your document was seriously damaged, for example due to HDD failure, MS Excel repairing tool will show these areas to be pretty large,very good solution to repair Excel files.

subba said...

can Any one tell me how can we load data from multiple sheets in a singl excel file.

Matthew Roche said...

@subba - You'll need to add a Foreach Loop container that loops over the sheets in the workbook file. The tricky thing is that for this to work, each of the sheets needs to have exactly the same metadata (same columns with the same data types, etc.) and this is often not the case.

David said...

I am receiving many errors with this process in SSIS 2008.
1. The tool wants an extension.
2. I typed in the expression as you have it here and I get a Property Value is not valid.

Matthew Roche said...

@David - The best place to go for assistance is the SSIS forums on MSDN. Share as much information as possible, including the expression and the exact error messages you're getting. Good luck!

anonymous said...

Hi Matthew,

Well, your approach worked in a number of others I tried. Thanks.

I have one question. The names of excel files, I have, change everyday. As in you example, I initialized the filename variable with the first filename. But when I changed the files in that folder with some other, it was not working. The only change in my file names is the date.

Matthew Roche said...

@Anonymous - Try setting the DelayValidation property of your package to True. This will keep you from needing a valid file name when you execute the package.

Dustin said...

Hi Matthew,

Thanks for the step by step instructions. My package ran successfuly. However, it looped and loaded the 1st file 3 times instead of loading 3 files in the directory. Any insight of what might have happened?

Thanks,
Dustin

Dustin said...
This comment has been removed by a blog administrator.
Matthew Roche said...

@Dustin - The most likely explanation for the behavior you've described is that the ConnectionString property for your Excel connection manager is not properly configured to use the file path expression. In this scenario, the Foreach Loop container will run once for each file, but will not change the file path.

piamars said...
This comment has been removed by the author.
ssis 2008 said...

Tried the same example with SSIS 2008 but unfortunately not working . error in connection manager variable

Matthew Roche said...

This works in 2008 (and 2008 R2 and 2012) just as it did in 2005. Whenever I have had problems implementing this type of logic and have run into errors, it has been an improperly configured connection string. If you're testing your changes after each step after each change you should be able to catch where the error is introduced. If this doesn't help, I'd recommend posting the details of your problem to the SSIS forums on MSDN...