Perils of Converting a Lot of Data In-House
 

The Volume Problem

The story we all know...

Most of us know how well Jack fared after he cut the beanstalk. After all, he walked away with the goose that lays the golden egg. Every morning, another golden egg would be waiting for him. Those eggs saved him and his mother from poverty. Before long, they were contented suburban homeowners.

Until that fateful day when Jack took up rollerblading. He was having so much fun that he left the golden egg under the goose all day. That evening, the egg hatched! Jack was dejected about his lost revenue until the next day, when he discovered that both geese had laid golden eggs. He could hardly believe his good fortune. If he harvested the eggs every other day instead of every day, he would double the number of gold-laying geese every two days.

40 days later, he had 1,048,576 geese to take care of and gold was so common that nobody wanted it.

The lesson is simple: volume always complicates matters. Most recipes will work if you double the ingredients. But try multiplying by 50 or 100 and all you'll have is a mess in the kitchen and a big room full of hungry people.

The SGML Expert

High technology is no exception to the problem of volume. Consider Gus, for example. He is Acme Corporation's resident SGML expert, hired as part of Acme's initiative to have all of its product documentation stored as SGML. Gus is a technical wizard. He designed a DTD for Acme in two weeks, and proudly shows off chapter 1 of the Acme Dustscraper Repair Manual, which he tagged himself in just one day.

A commendable effort, but there are 10 chapters in the Acme Dustscraper Repair Manual and Acme has 100 manuals. It would take Gus over 4 years to get all that documentation into SGML. Even if Acme could wait 4 years, they need Gus for other things. After all, he's crucial to ramping up the rest of the company to the new SGML system.

Gus Days

So far we've determined that having Gus convert all the data is unacceptable. But what are the other options? Well, the work can be divided up among Acme's staff, or temporary employees can be hired specifically for this project. Before we make any such decisions, however, it's important to determine just how much effort is involved.

About 1,000 chapters need to be converted (100 manuals of 10 chapters each). It takes Gus one day to tag a chapter. We can therefore assume an effort of 1,000 Gus-days (the four years mentioned above). So, hire 100 Guses and you'll be done in two weeks. Easy!
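For readers who want to check the arithmetic, here is a minimal sketch of the Gus-day estimate. The chapter and manual counts come from the story; the 250 working days per year and 5 working days per week are assumptions of this sketch, and the function names are made up for illustration.

```python
import math

CHAPTERS_PER_MANUAL = 10      # from the story
MANUALS = 100                 # from the story
DAYS_PER_CHAPTER = 1          # Gus tagged chapter 1 in one day
WORKING_DAYS_PER_WEEK = 5     # assumption
WORKING_DAYS_PER_YEAR = 250   # assumption: about 50 five-day weeks


def gus_days(chapters, days_per_chapter=DAYS_PER_CHAPTER):
    """Total effort, measured in days of work at Gus's pace."""
    return chapters * days_per_chapter


def calendar_estimate(total_days, workers):
    """Return (working days, weeks, years) if the effort splits evenly."""
    days_each = math.ceil(total_days / workers)
    return (days_each,
            days_each / WORKING_DAYS_PER_WEEK,
            days_each / WORKING_DAYS_PER_YEAR)


effort = gus_days(CHAPTERS_PER_MANUAL * MANUALS)
solo = calendar_estimate(effort, workers=1)     # Gus alone
team = calendar_estimate(effort, workers=100)   # 100 people at Gus's pace

print(effort, solo, team)
```

Under these assumptions Gus alone needs four working years and a team of 100 needs two working weeks. Holidays and vacation only push the solo figure higher, and the estimate ignores training and coordination entirely.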

Except for the volume problem. Where are you going to find 100 SGML experts who are willing to work for only two weeks? And even if you could, can you afford to pay 100 people what you're paying Gus? And when you do hire them, how are you going to get all 100 to tag the data the same way? Everyone will have his or her own interpretation. The only way to get usable SGML from these experts is to have Gus train them in his DTD.

Aha! If you're going to need training anyway, hire unskilled or semi-skilled workers at one third the cost of Gus. That's fine, but it will take them three times as long, so the bill comes out about the same and the schedule slips.
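The trade-off can be put in numbers. The sketch below is illustrative only: costs are measured in units of Gus's daily rate, the team size of 100 carries over from the earlier example, training costs are left out, and the function and parameter names are hypothetical.

```python
from fractions import Fraction


def plan(effort_gus_days, workers, slowdown, daily_rate):
    """Return (duration in working days, total cost in Gus daily rates).

    slowdown   -- how many times longer than Gus a worker takes per chapter
    daily_rate -- a worker's daily cost as a fraction of Gus's daily cost
    """
    worker_days = effort_gus_days * slowdown
    duration_days = Fraction(worker_days, workers)
    total_cost = worker_days * daily_rate
    return duration_days, total_cost


# 100 experts at Gus's pace and Gus's pay.
experts = plan(1000, 100, slowdown=1, daily_rate=Fraction(1))

# 100 cheaper workers: one third the pay, three times as long per chapter.
trainees = plan(1000, 100, slowdown=3, daily_rate=Fraction(1, 3))

print(experts, trainees)
```

With these inputs both plans cost the equivalent of 1,000 days of Gus's pay, but the cheaper team takes 30 working days instead of 10. The lower rate buys nothing except a longer schedule.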

The point is, what works for low volume doesn't work for high volume. New solutions are required.

Software

An automated solution is ideally suited for high volumes of data. The computer is about 1,000 times faster than Gus. You've finally solved the volume problem. All you have to do is find or develop software that will completely and accurately convert your data to SGML.

Guess what? You'd have an easier time cloning Gus than getting such a program. Why? Because this isn't just a conversion. You are adding structure to your documents, which requires inference and subjective decision-making.

The Best of Both Worlds

Ah, but surely the computer can do most of the grunt work and then Gus can fix it up afterwards. Yes, combining automation with expert review seems to be the best approach. But only if it's done right.

If you do enough damage to your car, the insurance company will give you money to buy another one rather than fix the one you have. Similarly, fixing cookie-cutter SGML can actually take longer than tagging it by hand. It's clear that one key to a successful conversion is to automate as much as you can as cleanly as you can.

Here is where Acme makes a frightening discovery: an SGML expert is not a conversion expert. Gus doesn't know how best to develop or configure a conversion program. Why should he? That's like asking a race car driver to fix your car: it's simply a different field of expertise.

What Does a Conversion Expert Do?

Conversion is not a standard field of knowledge. As far as I know, there are no degrees available: the most reliable indicator of expertise is a track record. So, even though there is no universally accepted methodology, I can cover some guiding principles used at BCA for managing a large conversion.

Standardization

Large volumes require standardization to prevent chaos. Otherwise, different interpretations will generate inconsistent results. BCA implements "conversion specifications," which detail every element in a document and how it should be coded in the new format. These specifications are used as a standards document throughout the project. Also, BCA uses a project team approach, with one data analyst per project. This analyst is solely responsible for interpreting how data should be coded. All exceptions to the written rules are brought to him. Even details such as file naming conventions are standardized, because the smallest discrepancy can snowball at large volumes.

Customized Software

One key to successfully using conversion software is to customize it. BCA has developed its own suite of conversion filters that it configures to the specifications of each project. It has even created its own generic intermediate formats. These robust "hub" formats divide the conversion in half so that changes in specs require only partial rework of data that's already been converted.
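To see why a hub format limits rework, consider a toy two-stage pipeline. This is a sketch of the general idea, not BCA's actual filters or formats: the record layout, the "all capitals means heading" rule, the tag names, and both function names are assumptions made up for illustration.

```python
def to_hub(raw_lines):
    """First half: infer structure from a hypothetical plain-text source.

    Assumed rule: a line in all capitals is a heading, any other
    non-blank line is a paragraph.
    """
    hub = []
    for line in raw_lines:
        text = line.strip()
        if not text:
            continue
        if text.isupper():
            hub.append({"role": "heading", "text": text.title()})
        else:
            hub.append({"role": "paragraph", "text": text})
    return hub


def from_hub(hub, tag_map):
    """Second half: render hub records using the tag names in the spec.

    Escaping of markup characters is omitted to keep the sketch short.
    """
    out = []
    for record in hub:
        tag = tag_map[record["role"]]
        out.append(f"<{tag}>{record['text']}</{tag}>")
    return "\n".join(out)


source = ["REPLACING THE FILTER", "", "Unplug the unit first."]
hub = to_hub(source)  # the expensive, inference-heavy half runs once

spec_v1 = {"heading": "title", "paragraph": "para"}
spec_v2 = {"heading": "head", "paragraph": "p"}  # the spec changed

first_delivery = from_hub(hub, spec_v1)
revised_delivery = from_hub(hub, spec_v2)  # only the second half is redone
print(revised_delivery)
```

When the specification changes from spec_v1 to spec_v2, the stored hub records are reused as they are. Only the cheap rendering step runs again.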

Quality Control

As discussed earlier, it is crucial to minimize the amount of cleanup necessary after the conversion is finished. While it is true that BCA's editors know nothing about Acme Dustscrapers, they know plenty about SGML (and all the other standard electronic formats). These editors parse the new SGML and then do a "format review." This second review is necessary because parsed SGML is not necessarily correct SGML: a parser confirms only that the markup conforms to the DTD, not that each piece of text received the right tag.

The SGML is filtered into a viewing package. Tags, which require slow, tedious checking, are converted to visual cues. It then becomes immediately apparent to an editor if something is tagged right or not, simply by comparing it to the original hard copy.
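As a rough illustration of turning tags into visual cues, the sketch below renders a few flat elements as plain-text proofs. It is not BCA's viewing package: the tag names, the cue styles, and the function name are hypothetical, and the regular expression handles only simple, non-nested elements with explicit end tags and no attributes.

```python
import re

# Hypothetical cue styles: each tag gets a distinct look on the proof.
CUES = {
    "title": "== {} ==",
    "para": "    {}",
    "warning": "!! {} !!",
}

# Matches <tag>content</tag> for flat elements only.
ELEMENT = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def to_proof_view(sgml):
    """Replace each tagged element with its visual cue, one per line."""
    lines = []
    for tag, content in ELEMENT.findall(sgml):
        # Unknown tags are flagged rather than silently dropped.
        template = CUES.get(tag, "[?{}?] {{}}".format(tag))
        lines.append(template.format(content.strip()))
    return "\n".join(lines)


sample = (
    "<title>Replacing the Filter</title>\n"
    "<para>Unplug the unit first.</para>\n"
    "<para>Do not immerse the motor.</para>"
)
view = to_proof_view(sample)
print(view)
```

The sample is well-formed, yet an editor comparing the proof with the hard copy would notice that the last line appears as ordinary indented text when the printed manual shows it as a warning. That is the kind of error a parser cannot catch.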

Customer Feedback

The most critical element of quality control is customer feedback. BCA keeps the entire conversion process open to Acme, so that a misunderstanding doesn't result in thousands of mistagged pages. Normally, two samples are provided to the customer before the volume work begins. These samples, along with the conversion specifications, must be approved by the client at the start.

Once the conversion is underway, partial deliveries are sent to the client as they are completed. This is more than just checking BCA's work. "Live" data gives Acme a better understanding of how best to implement the converted data on its new system.

Experience

For most companies, conversion is a rare occurrence. Therefore, no past experience exists to provide guideposts and warning signs. BCA has converted millions of pages to and from every major format. Which brings us to our conclusion.

No Surprises

Perhaps the most pernicious problem of large volumes is that the work involved is impossible to predict. In other words, even if you do budget for all the Gus-days you think you need, you might very well need more. This could lead to disgruntled workers and even more disgruntled executives.

BCA has learned, through experience, to make its process flexible enough to stay on schedule. Problems are either avoided or prepared for in advance. Potential concerns are brought to the customer before they multiply. To put it simply, you can get away with a little sloppiness when you have one goose, but a million geese demand serious attention.

Your company is not set up to be a conversion house. I recommend you hire someone who is. Otherwise, you just might lay an egg.