Friday, September 14, 2007

Loading Multiple Excel Files with SSIS - Part Two

I got a comment yesterday (well, it was yesterday as I typed this post, but I'm not actually publishing it until the next day... damned air travel...) on my “Loading Multiple Excel Files with SSIS” post asking if it was possible to use the same technique to loop over multiple Excel workbook files with different worksheet names and still have a single package to do the work. The short answer is yes, but it depends on these workbooks that have different worksheet names to have identical worksheet structures, and that’s a pretty big if.

I’m personally skeptical that you’re going to run into this situation in the real world. From my experience, unless files come from a single source, there are almost always meaningful differences (although they may be small) between them. And if these hypothetical Excel workbook files come from a single source, they’re probably going to have the same name, right?

In any event, let’s work under the assumption that we have Excel workbook files with the same data but with a different worksheet name. The steps below will demonstrate how to update the sample package from my earlier post to work in this scenario. I’m not going to write all of the VB.NET code that will be required (partly because I’m writing this in the airport while waiting for my flight and don’t have the necessary components installed to make it work, and partly because this sounds like an ideal “exercise for the reader”) but I’ll cover everything else. Here’s what you need to do:

First, we need a new XLS file to serve as input. Copy one of the existing XLS files (I’m using the last post as the starting point, so if you didn’t follow those steps then, you’ll need to do it now in order to follow along) and then open the copy in Excel. Rename the first worksheet from Sheet1 to DifferentName. Save the file and close it.

Open the SSIS project created in the previous post in Visual Studio, and open the SSIS_Excel_Loop.dtsx package.

Add a new variable named WorksheetName of type string with the value “Sheet1$”

Edit the Excel data source in the data flow to use the “Table name or view name variable” Data access mode and select the WorksheetName variable from the drop-down list.

Click on the Preview button to verify that the data source still works.

Add a Script task to the control flow inside the Foreach Loop container and connect it to the Data Flow task with a success precedence constraint.

Right-click on the newly-added Script task and select Edit from the pop-up menu. On the Script tab within the Scrip Task Editor window, add WorksheetName to the ReadWriteVariables list and FileName to the ReadOnlyVariables list.

Click on the Design Script button to open the Visual Studio for Applications (VSA) development environment.

Edit the code to look like this:
Public Sub Main()

Dim sheetName As String
Dim fileName As String = CStr(Dts.Variables("FileName").Value)

If fileName.Contains("email_book_03.xls") Then
sheetName = "DifferentName$"
sheetName = "Sheet1$"
End If

Dts.Variables("WorksheetName").Value = sheetName

Dts.TaskResult = Dts.Results.Success

End Sub
This is where you’ll need to do a little work on your own. Although this scenario will execute and work as desired for the sample files, it’s obviously not going to be sufficient in the real world. Instead, you’ll need to use the Excel object model, perhaps through a custom .NET assembly that wraps the COM Office DLLs, perhaps through Visual Studio Tools for Office (I’m not going to take the time to build the former and can’t download and install the latter without an internet connection) and use that API to retrieve the worksheet name of interest. This will probably be the first worksheet in the workbook, but your requirements may vary.
The key portions of the VB.NET code you’ll need to write for the Script task are that you read from the FileName package variable to know what XLS file you’re working with, and then write to the WorksheetName package variable to tell the rest of the package (specifically the Excel data source component in the data flow) the name of the sheet to use. Pretty simple, right?
Well, it works on paper, but as I mentioned early on I suspect that you’ll find that there are differences beyond the sheet name that you’ll need to handle. Because of this you’ll need to ensure that your package has robust error handling so that when these differences appear at runtime the package can handle them gracefully.
Good luck!

Postscript: I also see that while I was in the air yesterday I got a few additional comments posted from "Romain" on looping over the tabs in a workbook to load all of them instead of just the first tab. I'm afraid I cannot read the French solution to which he linked, but the same technique not listed in the code above is what you're going to need to do that. You'll need to loop through each Worksheet object in the Workbook object for the current XLS file, using the Excel object model to do so. Probably the easiest way to do this is to have a For Loop Container in your control flow contained within the existing Foreach Loop Container that loops through the multiple Excel files. Within the For Loop have a script task that fetches the name of the next sheet following the last-fetched "current" sheet (or the first sheet in the book on the first pass through the loop) and also sets a Boolean flag indicating that the sheet name being fetched is the name of the last sheet in the workbook. Configure the For Loop to exit when this flag variable is set to true.
Again, good luck! I'm not going to have time to write the Excel sheet-looping code myself, but when you get it working I'd love to see it. ;-)


Matthew Roche said...

This is the English version of an MSDN article that may provide an alternative to the Excel object model code I didn't include in this post:

Apparently you can use the built-in Tables enumerator provided with the Foreach Loop Container to loop through the Worksheets *and* named ranges in an Excel workbook file.

SA said...

Hello Matthew,
I would like to take this a little further. I am in a situation where i need to read the list excel files from a database table, instead of reading the excel files from a folder.
Can you give me some idea on how to achive this?

- Sharjeel

Matthew Roche said...

Sharjeel - the main difference is that you will need to use a different enumerator for your Foreach Loop container. The big risk, of course, is that the database table and the reality of the file system can easily get out of sync. You'll need to ensure that your package is smart enough to handle situations where the database says that there is an file but there really is not.

SA said...

Hello Michael,
Here i am again :)
Well I have a SSIS package which imports data from Excel files on the network.Its so happening that one of those files are opened by someone else & package is giving AcquireConnection error.
As Package is only reading data from the files, i want to open the files in ReadOnly mode.
How can i open files in ReadOnly mode & remove this error ?

Matthew Roche said...

Hi again!

Unfortunately, there is no way that I'm aware of to open a file in read-only mode through the Jet OLE DB provider, which is what SSIS uses. The Excel file format is not designed for concurrent multi-user access, so this is an unsupported scenario regardless of whether you're using SSIS or VB or the Excel client application.

There may be ways to work around this, perhaps by copying the files you need to load into a secondary folder so that when the SSIS packages run, the copies of the files are in a folder where no one will use them. Or, you could use logic in your control flow to continue execution and simply skip those Excel files that are locked.

Good luck!

Unknown said...

Hi Matthew

How can I make the package re-runable? The FileName variable need initialize file name. My package will do clean up the .xls file after loading successfully or found duplicate file.Next time, when new file(different name) come in, the package can not pass the validation. I try to assign the value as *.xls to filename variable but still failed.


Unknown said...

I just wanted to say thanks since this was very helpful on a recent project. I did not implement it the way you did, but it showed me how to access arbitrary sheet names which was enough.

Alex said...

For work with office and excel files try use-download Excel files repairing tools,tool helped me many times and as far as i know it is free,utility keep several copies of your workbook in different places, you can avoid the loss of information as a result of data corruption or virus attack,ecover corrupted information in Microsoft Excel format, please download Excel files repairing tools right now and try to repair your Excel files with MS Excel repair tool,can open and recover damaged documents in Microsoft Excel format,sove next problems on example if your document was seriously damaged, for example due to HDD failure, MS Excel repairing tool will show these areas to be pretty large,very good solution to repair Excel files.

Alex said...

In this state advise try-repair Excel files,this tool helped me many times and as far as i know it is free,program may try to open the necessary *.xls, *,xlt, *.xlsx, xlsm, *.xltm, *.xltx, *.xlam file and see an error message on the screen instead of it: Microsoft Excel unable to read file, This file is not in a recognizable format, etc,recover after attack viruses, problems with media (a hard drive failure, a corrupted CD or DVD, etc.), errors in transferring the file via the LAN or the Internet, etc,extract information from corrupted Microsoft Excel files,can filter files of the *.xls, *,xlt, *.xlsx, xlsm, *.xltm, *.xltx, *.xlam formats,will display a dialog box with the question whether this worksheet should actually be scanned,allows you to view the entire information in a convenient form and to make sure that the tool reproduced it correctly,exporting information recovered from the corrupted worksheet into Microsoft Excel.

Anonymous said...

I really really need help with this. I've been stuck on the same issue for days now.

Basically i have 6 excel files which i need to loop through.
Each excel file has 1 sheet each but the sheet names are different.
EG: Sheet1, Sheet2, IRE, ....

I need to be able to loop through all these 6 files and read the data of these sheets and insert into 1 table.

I am able to loop through if the sheet names are the same, but not when the sheet name changes.

I am loading the data into 1 sql server table. I tried following the above blog, but im stuck.

sridevi(pragnya) said...

I am importing excel files into mysql table using ssis, excel files are in one folder(xxx),using foreach container i am able to get fullpath and importing.after that i am moving all the files from 'xxx' folder to 'yyyy' folder and deleting in 'xxx' . I gave default path to excel connection manager.once files are deleted and if i run the package its throwing error that connection manager doen't have valid path. what i have to do in this case.please suggest me,thanks in advance

SA said...

Hello Sri,
I would suggest you to have a temp xls file & give your connection manager its path. Put a foreach loop on the folder which contains the real xls files & copy the files from folder to the temp file path (overwrite temp file by each of the real files). This way there will always be a file available for connection mgr & no errors should be reported

Matthew Roche said...

@sri - I would suggest setting the DelayValidation property on your data flow task to True. If this is not the case, then the package will fail validation if the connection string is not valid at package execution.