Code Mortem

"Old code never have to kill it." - Grady Booch

Saturday, December 17, 2005

SSIS and Static Metadata

SQL Server Integration Services (SSIS) is a powerful new ETL platform that is part of SQL Server 2005. I've been working with it for several years through its beta and preview release phases, and even though it's a great tool, one of the frustrating things about its implementation is that ETL plan metadata is basically static. Jamie Thomson, a prolific SSIS blogger, recently stated that making the metadata dynamic wasn't possible, and that you wouldn't really want or need the metadata to change during plan execution. I think I understand where he is coming from, but I don't agree with the conclusion. Keeping the metadata static during execution allows the engine to do a lot of optimization, but I think it should be dynamic for a short period before the plan is compiled.

In DTS, it was fairly common to use a script object to manipulate a data pump task at runtime, so that the metadata could be set on the fly. This is essential when building reusable ETL plans. SSIS is not quite as flexible in this one regard. SSIS goes through a compilation and optimization process when you first start up a data movement plan, and as a result the plan cannot be changed at runtime. You can't use a script task in the process to have the task modify itself. This is an understandable architectural approach: static compilation allows the designers to put a lot of optimizations into plan execution so that it can be as fast as possible. Those benefits seem to have been realized, since SSIS is very fast. As a side note, at one of the presentations I saw a few years ago it was mentioned that at least one of the people who worked on the core SSIS engine came from the Microsoft compiler optimization team.

Because of its power, SSIS has the potential to be a core piece of the data movement infrastructure of an enterprise. This means that you need maintainable, understandable, reusable code. Take an operation such as copying data from one database to another as an example. You can define a data flow task from each source table to each destination table. This may be an easy thing or a hard thing, depending on how many tables you have. If it is 5 tables, then just create 5 data flows and you are done. However, if you have 500 tables, 1000 tables, or even just 50, the job gets much harder. Then, when the schema in those databases changes, you have to go update those movement plans to reflect the modified table structure. Additionally, when you need to change a setting on one of those plans, you'll have to open up every one of those data flows and make the changes. A much better approach would be to have a "template" data flow task that you could run over and over again for each table in the database. Then you only have one place to go when changes need to be made.

Even though it sounds like you can't do this in SSIS, you actually can with a workaround. The script task in SSIS is powerful enough that you can open up a template package and use the DTS object model to dynamically change the task definition as needed. The key is that you must then save out the modified package and call a separate copy of SSIS to load and run it as a sub-process using dtexec. To the SSIS engine, it looks like we are just running static SSIS plans.
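To make the sub-process half of this workaround concrete, here is a rough sketch of driving dtexec once per table. This is a simpler variant of the technique: instead of rewriting the package metadata through the object model, it uses dtexec's real /Set option to override a package variable on a reusable template plan. The package path, the table list, and the User::TableName variable are all hypothetical placeholders for whatever your template package defines.

```python
# Sketch: run one "template" SSIS package per table by launching dtexec
# as a sub-process. Assumes a template package that reads its target
# table from a User::TableName variable -- both are illustrative, not
# part of any shipped sample.
import subprocess

TEMPLATE_PACKAGE = r"C:\ETL\CopyTable.dtsx"  # hypothetical template plan


def build_dtexec_args(package_path, table_name):
    """Build the dtexec command line that runs the template package
    with its User::TableName variable overridden for one table."""
    return [
        "dtexec",
        "/File", package_path,
        "/Set",
        rf"\Package.Variables[User::TableName].Properties[Value];{table_name}",
    ]


def copy_tables(tables):
    """Run the template plan once per table, stopping on the first failure."""
    for table in tables:
        result = subprocess.run(build_dtexec_args(TEMPLATE_PACKAGE, table))
        if result.returncode != 0:
            raise RuntimeError(f"dtexec failed for table {table}")


if __name__ == "__main__":
    copy_tables(["dbo.Customers", "dbo.Orders"])
```

Each dtexec invocation is a fresh, fully static plan as far as the SSIS engine is concerned; the "dynamic" part lives entirely in the driver loop.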

I don't expect the static compilation aspect of the SSIS plan architecture to change in the future; in fact, I prefer it the way it is. However, I'd ask that the designers allow the metadata to be changed and resolved before the plan is compiled. There are ways to accomplish this, and here are a couple of ideas. Without access to the architecture, I can't really speculate on how easy or hard these upgrades would be.

Possible Approaches

1. Create a special type of "initialization" script task that is allowed to run before the plan is compiled. This script could handle setting up dynamic metadata. It should be able to access variable values as well, so that a variable could be used to select the table to use.
2. Allow a data source to have multiple possible source tables (or commands) that could be selected from at runtime using a loop or a passed-in variable. I think this idea would be more problematic to implement (and use), but it would be interesting.
3. Have some sort of wildcard syntax to specify source schema elements and pass them through.

Possible Difficulties

1. The design-time environment is probably going to be the hardest problem to solve. How do you design a data flow without knowing what data is coming through? One idea would be to have "template" columns in addition to "permanent" columns. "Template" columns would be selected from whatever the data source provides during the initialization phase of the operation. You can't do much to those columns except copy them to the output. "Permanent" columns would be required to be supplied by the data flow source, so they could be referenced in subsequent transforms. "Template" columns could be referenced in data sources as well, but it would require pattern-based grouping of columns or something like that. "Permanent" columns might be conditionally available, in case they are not always present.

All this being said, SSIS is ground-breaking in terms of the ease of creating custom components, the ease of debugging plans, and the combination of control flow and data flow in a seamless package. It will be very useful for a long time to come. Kudos to the team. I hope some of you are out there listening.


At 12/18/2005 11:53 AM, Anonymous Jamie said...

Someone is listening :)

Those are some interesting ideas. Your suggestion about an initialisation step that occurs prior to execution (and also, I would assume, prior to validation) could be worth exploring although this seems, to me, to be a fundamental change to the architecture. I wonder...

Another possible solution may be something that I think will be coming in katmai - package parts. Package parts are just what they sound like - parts of a package saved externally to the package itself. This is being put in to support reusability - the most obvious example being putting a commonly used data-flow into a "package part file" which can then be called from different packages. If you could modify that external "package part file" at runtime prior to calling it then that would satisfy this requirement. The idea has got "workaround" plastered all over it but, in theory, it would work.
