Code Mortem: December 2005

Friday, December 30, 2005

The Cookie Disease Is Spreading

Now the cookie-monitoring is spreading to other government agencies as "security researchers" try to make a name for themselves.

Here is a quote from the article (Cookiegate Hits White House), speaking of cookies.

"They are helpful, for the most part, but they carry the potential for abuse because they can monitor and document the activities of web surfers."

Cookies cannot monitor and document the activities of web surfers. The cancer is spreading.

Thursday, December 29, 2005

Persistent Rumors about Persistent Cookies

I'm still amazed that after all this time the popular media insists on clinging to the old myths about browser cookies. The Associated Press is reporting that the NSA is somehow spying on visitors to its web site. Their evidence? The NSA site leaves a cookie on your machine! Huh??

These articles constantly claim that web sites can somehow "spy" on you or track your web browsing activities by leaving a cookie on your computer. I don't understand why this claim persists despite its complete inaccuracy. A cookie does not actively do anything on your computer. It is simply a way for a web site to store settings in your browser so that when you return the site can continue where it left off. It is not possible to "spy" using a cookie. The privacy concerns are entirely unfounded and based in myth rather than fact. If you disagree with me, please explain how someone can spy on me using a cookie. I'd love to know.

"Privacy advocates" claim that your web surfing can be tracked by using cookies. A cookie can only be used by a site that it is enabled for. Web site owners could track your browsing activity within their own site, but that can be done anyway without cookies. All web servers keep a log of your browsing activities. In fact, whole companies are built on analyzing web logs. Google just bought Urchin, a software company devoted to web analytics.

There are possible issues where an advertising company (such as DoubleClick) would leave behind information in a cookie and since the advertising content shows up across many other sites, they have a way to "track" your usage across sites that use their advertising. Still, since they can use web logs and IP address tracking to do the same thing without cookies, the argument is pretty pointless.

Saturday, December 17, 2005

SSIS and Static Metadata

SQL Server Integration Services (SSIS) is a powerful new ETL platform that is part of SQL Server 2005. I've been working with it for several years during its beta and preview release phases and even though it's a great tool, one of the frustrating things about its implementation is the fact that ETL plan metadata is basically static. Jamie Thomson, a prolific SSIS blogger, recently made the statement that making the metadata dynamic wasn't possible, and that you wouldn't really want to or need to have the metadata change during plan execution. I think I understand where he is coming from when he makes this statement, but I don't agree with the conclusion. Keeping the metadata static during execution allows for the engine to do a lot of optimization, but I think it should be dynamic for a short period before the plan is compiled.

In DTS, it was fairly common to use a script object to manipulate a data pump task at runtime, so that the metadata could be set on the fly. This is essential when building reusable ETL plans. SSIS is not quite as flexible in this one regard. SSIS goes through a compilation and optimization process when you first start up a data movement plan, and as a result cannot be changed at runtime. You can't use a script task in the process to have the task modify itself. This is an understandable architectural approach. The static compilation approach allows the designers to put a lot of optimizations into the plan execution so that it can be as fast as possible. Those benefits seem to have been realized, since SSIS is very fast. As a side note, at one of the presentations I saw a few years ago it was mentioned that at least one of the people who worked on the core SSIS engine came from the Microsoft compiler optimization team.

Because of its power, SSIS has the potential to be a core piece of the data movement infrastructure of an enterprise. This means that you need maintainable, understandable, reusable code. Take an operation such as copying data from one database to another as an example. You can define a data flow task from each source table to each destination table. This may be an easy thing, or a hard think depending on how many tables you have. If it is 5 tables, then just create 5 data flows and you are done. However, if you have 500 tables, 1000 tables, or even just 50 the job gets much harder. Then, when the schema in those databases change, you have to go update those movement plans to add the modified table structure. Additionally, when you find need to change a setting on one of those plans, you'll have to open up every one of those data flows and make the changes. A much better approach, would be to have a "template" data flow task that you could run over and over again for each table in the database. Then you only have one place to go to when changes need to be made.

Even though it sounds like you can't do this in SSIS, you actually can with a workaround. The script task in SSIS is powerful enough that you can open up a template task and use the DTS object model to dynamically change the task definition as needed. The key is, you must then save out this task and call a separate copy of SSIS to load up and run that task as a sub-process using dtexec. To the SSIS engine, it looks like we are just running static SSIS plans.

I don't expect the static compilation aspect of the SSIS plan architecture to change in the future, in fact I prefer it the way it is. However, I'd ask that the designers allow the metadata to be changed and resolved before the plan is compiled. There are ways to accomplish this. Here are a couple ideas. Without access to the architecture, I can't really speculate on how easy or hard these upgrades would be.

Possible Approaches

1. Create a special type of "initialization" script task that is allowed to run before the plan is compiled. This script could handle setting up dynamic metadata. It should be able to access variable values as well, so that a variable could be used to set a table to use.
2. Allow a data source to have multiple possible source tables (or commands) that could be selected from at runtime using a loop or passed in variable. I think this idea would be more problematic to implement (and use), but it would be interesting.
3. Have some sort of wildcard syntax to specify source schema elements and pass them through.

Possible Difficulties

1. The design-time environment is probably going to be the hardest problem to solve. How do you design a data flow without knowing what data is coming through? One idea would be to have "template" columns in addition to "permanent" columns. "Template" columns would be selected from whatever the data source provides during the initialization phase of the operation. You can't do much to those columns except copy them to the output. "Permanent" columns would be required to be supplied by the data flow source, so they could be referenced in subsequent transforms. "Template" columns could be referenced in data sources as well, but it would require pattern-based grouping of columns or something like that. "Permanent" columns might be conditionally available, in case they are not always present.

All this being said, SSIS is ground-breaking in terms of the ease of creating custom components, the ease of debugging plans, and the combination of control flow and data flow in a seamless package. It will be very useful for a long time to come. Kudos to the team. I hope some of you are out there listening.

Saturday, December 03, 2005

Microsoft Restart Manager and Disingenuous Reporting

I came across this doozy of an article this evening and was just stunned. About halfway down the article there is a quote from a guy named Anthony Risicato, from a company in New York
named 360i. The article says,

Anthony Risicato, the general manager for search and contextual at 360i LLC in New York, told eWEEK that Restart Manager is a concept that, in a vacuum, seems like a wonderful idea.
[Hmmm...starting up the left-handed compliments]
"Get the 'latest and "greatest' the minute it's available. But I do not like it, because it is trying to solve the effect, and not the cause," he said.
[The restart feature is solving the problem of how to replace or upgrade actively running software in an operating system without restarting. What exactly is the "cause" that they are not trying to solve anyway?]
He also said he is cautiously optimistic that Microsoft would be able to get Restart Manager to do what they say it will if they "decide to take the time to do it right, with an emphasis on testing and quality assurance. But they must still overcome the inherent weaknesses on the Windows platform as related to file corruption, shared memory space, etc."
[Yeah, no kidding, Einstein. All software, especially on such a widely used platform must be well tested and done right. My question is, what file corruption and shared memory space problem are you talking about? Sounds like you're still on Windows 98 to me. Have you heard of NTFS and process isolation? You might want to check out Windows XP before you start commenting on Windows Vista.]

So, who is this guy and what does he know anyway? You'd think eWeek would think twice before publishing a quote like this. If you look at the comments on the article, they are essentially all from Microsoft haters who rail on Microsoft for ... I'm not sure exactly what. With opposition like this though, no wonder Microsoft has leading market share. I'm no Microsoft evangelist, but this is ridiculous.

Message to eWeek's editors, keep the uninformed comments out of the articles and relegate them to the comments section.

On another note, I wonder why Microsoft attracts so much of this kind of uninformed opposition. You don't see this kind of thing against Google, or Sun, or Oracle for example.

Code Mortem