by Mark Gross and John Lynch
 
If you have the luxury to author your materials directly into XML, then you're already in good shape. Actually, you might even skip this article. However, if you're like most people, you've probably already amassed huge volumes of materials, produced in a multitude of electronic formats. Worse still, you most likely have mountains of documents, research reports and catalogues sitting somewhere in paper gathering dust. Even for new materials, it may be expecting too much to get your authors to do things exactly the way you want. Who are you to tell that Nobel Prize winner that he can't use his 1985 Macintosh? Is it really appropriate for your domain experts to have to learn the various rules of XML? Clearly, there are challenges. The bottom line? If you're like most people, moving your material into XML makes sense, but it can be a daunting task.

XML and Structure

Whether you're building an eCommerce site with an on-line catalogue containing 1,000,000 products, publishing journals online or on CD-ROM, or maintaining a huge legal citation database, you need a way to maintain your data in a more structured, easily reconfigurable format; XML is the ideal way to do that. One of the ways XML provides that ability is in the way it applies structure to information. Furthermore, getting the full benefits of XML requires that you add structure that did not previously exist. In order to do that with legacy data, usually amassed over a period of many years and written by multiple authors, you'll need to standardize those materials into a common mold. Adding to the challenge, you'll need to add structural information at the same time. Getting this done in any kind of volume requires comprehensive planning.
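To make "adding structure" concrete, here is a minimal Python sketch of turning one flat catalogue line into XML. The input layout (sku | name | price), the element names and the currency attribute are assumptions made up for illustration; a real project would take them from its own DTD or schema.

```python
import xml.etree.ElementTree as ET


def catalogue_line_to_xml(line):
    """Convert a 'sku | name | price' text line into a <product> element.

    The field order and element names are hypothetical; they only show
    how implicit structure in legacy text becomes explicit markup.
    """
    sku, name, price = (part.strip() for part in line.split("|"))
    product = ET.Element("product", {"sku": sku})
    ET.SubElement(product, "name").text = name
    ET.SubElement(product, "price", {"currency": "USD"}).text = price
    return product


legacy_line = "A-100 | Brass hinge | 4.95"
print(ET.tostring(catalogue_line_to_xml(legacy_line), encoding="unicode"))
```

Even in this toy case the code has to decide things the source never stated, such as the currency. Decisions of that kind are what the planning phases are meant to surface before volume production.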

While there are specific technical issues to be determined prior to project commencement, those are not dealt with here. Rather, we deal most extensively with how to plan and manage a large scale conversion effort. Utilizing a proven methodology, you can build a process that allows what you accomplished in your small demo to be extended into timely, accurate and consistent deliveries over the next tens of thousands or millions of pages.

Why should I bother with a plan?

Be it catalogue assimilation for eCommerce, or the conversion of scientific or technical documents, most large scale document conversions fall victim to bad planning. This usually has unacceptable effects on schedule, quality and costs. The methodology described here incorporates upfront data analysis, a Proof of Concept phase, intelligent software development, a structured approach to testing the conversion concept before you commit to a full production process, and an explicit production planning task.

The approach described is a four phase plan, which, in Data Conversion Laboratory's experience, ensures that potential problems are encountered and addressed early on in the process. You will want to identify them earlier rather than later; getting them fixed now means that you'll be able to automate your project with much greater accuracy.

While our approach espouses significant preplanning which might get in the way of people's tendency to "just get started", it does reliably deal with the major issues in any large scale conversion effort:

  • Time to Market - How long can your users wait for you to be ready? Can you guarantee that you meet your deadline? What are the implications if you don't?
  • Quality - Will it be right? Will people be getting what they thought they would?
  • Cost - Will you be able to reliably predict cost? Are you sure?
  • Scalability - Will you be able to convert and integrate a million of them the same way you did your demo?

 
Figure I: In large-scale conversion projects, people usually underestimate the task at hand.

You need a blueprint.

While you might build a tool shed without a blueprint, it is not likely that you would construct a 20 story building without one. Large conversion efforts are quite similar to large construction projects and many of the techniques to ensure that the final product meets the end-user's needs are common to both.

There is a tendency, when the pressure's on, to just get right on with it. After all, that demo you did impressed everyone upstairs. In truth, if it's a small project that you can do yourself over the weekend, you probably don't need a sophisticated plan. If it's a larger, more complex project, however, you will need a structured approach. There will be unknowns in technology, and there will be a need to communicate effectively among a number of people.

It's clearly too risky to forego the planning stages. You'll need to describe in advance exactly what the end product will look like, before you manufacture a million of them. That means you'll need to build a formal system to share information.

There are other important questions to ask yourself. What are the principal risk factors? How do I precisely describe to dozens, or perhaps hundreds of people, what to do in a variety of cases? And how do I make sure they all listen?

OK, when don't you need a plan? How large is large?

While there is a gray area in which you might be able to bypass some of the steps, Data Conversion Laboratory's experience shows that you always need at least some of the planning stages. It's just too risky to play it by ear. This is a sizable investment you are about to make; consider the formal approach described here as a very inexpensive insurance policy.

The number of people that will be involved in the project can make a difference to the importance of a plan. But that's not the only thing upon which to base your decision. Whenever there's significant complexity, new technologies being explored, or you're simply doing something you've never done before, it's critical to plan and "Proof of Concept" the process before you start those machines rolling.

A project plan overview.

No matter how well you think it's going to go, a conversion project has many unknowns. Following this suggested methodology will help to bring many of those unknowns to the fore.

Data Conversion Laboratory's conversion project experience has taught us that a typical project's lifecycle can be divided cleanly into four phases. See Figure II.

 
Figure II: The lifecycle of a conversion project can be divided into four distinct phases.

A key benefit of this phased approach is that there are a number of specific checkpoints at which you can reconsider the project in terms of new information and redirect it to best fit where you're going.

Phase I: Concept and Planning - The purpose of this phase is to get everyone to agree to a common definition of what the project is. You'll want to lay out the project objectives and expectations, define the success criteria, lay out a preliminary approach, identify the risk areas, estimate approximate cost ranges and define a preliminary budget.
Phase II: Proof of Concept - The purpose of this vital step is to test your approach on a limited scale, paying particular attention to the areas identified as potential risk areas. The results of this phase will help you arrive at a more detailed plan, while further fleshing out functional requirements. Based on the results of the test, preliminary software is prepared, and cost projections are fine-tuned.
Phase III: Analysis, Design, and Engineering - This is the critical step where all the details get worked out and the project gets prepared for volume production. Specifically, keying and conversion specifications are finalized, cleanup and review guidelines are defined, and final production costs are confirmed. More generally, the entire conversion process is finalized and tested, and production ramp-up begins.
Phase IV: Production - This is the step we've all been waiting for - data flowing smoothly at 500 or 50,000 pages a week. Provided that the preceding 3 steps were done well, there's not a whole lot to say about this phase. What you will need, however, are some tools in place to closely monitor quality and productivity.

In addition to the four phases, there are two other important aspects to this methodology. These are the underlying disciplines shown as two stripes at the bottom of Figure II. As in any large project, management and quality control are critical and apply to every phase. Ideally, a single person will oversee both disciplines in order to guarantee continuity.

OK. This sounds like a plan, but who does the work?

It all depends on the staffing and the experience that you have available.

While the nuts and bolts of doing the conversion require some specialized skills and facilities, the actual planning and management process requires much the same skills required to manage any large, complex project. If you have people with experience available to dedicate to this effort, then you can probably do it internally. If not, you may want to consider outsourcing. The key issue is not to overlook the fact that this effort will need dedicated project management talent.

Phase I: Concept and Planning

Although it is an important step, this can be a pretty short one if you've carefully thought through exactly what you want to happen. It may just be a day or two, though it's more likely to be several weeks. The major elements of this phase are described below:

Project Concept - Everybody needs to be on the same page. The first step is to clearly define the project, and to get an agreement that people's various expectations are the same. You simply cannot meet a goal that you don't know about in advance. At this point, the project concept is discussed at a high level, without getting bogged down in detail. The following are the critical questions that need to be answered honestly.

  • What do you need to do, and how quickly do you need to do it?
  • Do you have a technical approach in mind?
  • What are the goals and what are the success criteria?
  • What's critical and what's nice to have?
  • What's the expected budget? And what are estimated costs?
  • Where are the tradeoffs in time, budget, and functionality?

The end result of this analysis is a Project Concept document.

Materials Evaluation - While a detailed inventory of materials does not usually get done until the Proof of Concept phase, it is critical to get an early understanding of the project's scope. Design and implementation decisions on where best to focus resources will be based on this information. This is illustrated by the chart in Figure III, and while the specific questions will vary from project to project, typical questions are:

  • How big is the project? You need to quantify it in terms you're used to thinking in: pages, books, journal issues, products, etc.
  • How much source variation is there? Materials may have been produced in a multitude of electronic formats, on different computer operating systems, or by different typesetters. Some of it may even live as paper, under dust, in huge warehouses.
  • How much format variation is there? How often has the layout format changed over the years? Invariably, different authors choose to lay out in different ways; while it would be nice to have a strictly enforced template, if you're dealing with legacy data, you're bound to find a lot of formatting inconsistency.
  • What are the special issues? Tables, formulas, cross-referencing and graphics are all areas that need special attention in the planning process.

All of these critical issues will differ slightly from project to project; it's a good idea to lay them out explicitly in a format like Figure III.

Figure III: The materials evaluation sheet will be invaluable in helping you understand the scope of the conversion task.

Rough-Cut Pricing Estimate - Usually, there is not enough information available this early in the process to allow accurately predicting the project's overall production costs. There are simply too many variables that will not be finalized until well into Phase II. However, it is possible (and useful) to start assembling rough-cut costing parameters.

Figure IV: It's important, early on, to get a feel for what the rough-cut numbers will be.

It's generally a good idea to use a chart like the one shown in Figure IV to lay out what the major tasks in the production process are. Alongside each task, lay out the cost range that the task has historically incurred. This gives you some very broad ranges. These are useful, both for feasibility analysis ("I didn't know we were talking about a $2,000,000 project!"), and for sensitivity analysis ("If we didn't have to do that step we could save $2.00 per page"). If budgeting has not yet been done for your project, these ranges will also prove to be useful guides for setting budgets.
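The arithmetic behind such a chart can be sketched in a few lines of Python. The task names, per-page cost ranges and page count below are invented placeholders, not figures taken from Figure IV; the sketch only shows how broad per-task ranges roll up into a feasibility range and how dropping one step changes the total. Costs are held in whole cents to avoid rounding surprises.

```python
# Rough-cut costing sketch. Every task name and figure here is a
# hypothetical placeholder, not data from the article.

# Each task maps to a (low, high) historical cost range in cents per page.
TASK_COST_CENTS_PER_PAGE = {
    "scanning": (10, 25),
    "keying": (100, 250),
    "tagging": (75, 200),
    "review": (50, 125),
}


def project_range(pages, tasks):
    """Feasibility: (low, high) total cost in cents for `pages` pages."""
    low = sum(lo for lo, _ in tasks.values()) * pages
    high = sum(hi for _, hi in tasks.values()) * pages
    return low, high


def savings_if_dropped(pages, tasks, task_name):
    """Sensitivity: (low, high) cents saved if one task is not needed."""
    lo, hi = tasks[task_name]
    return lo * pages, hi * pages


if __name__ == "__main__":
    pages = 500_000
    low, high = project_range(pages, TASK_COST_CENTS_PER_PAGE)
    print(f"Feasibility range: ${low / 100:,.0f} to ${high / 100:,.0f}")
    s_low, s_high = savings_if_dropped(pages, TASK_COST_CENTS_PER_PAGE, "review")
    print(f"Dropping review saves ${s_low / 100:,.0f} to ${s_high / 100:,.0f}")
```

The ranges are deliberately wide at this stage; the Proof of Concept is where they get narrowed.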

Project Feasibility Analysis - While the information collected so far is fairly sketchy, this is an early opportunity to assess, based on those broad parameters, whether the project is still feasible. You need to answer some of the following questions. If this is a $1-$2 million project, does it still make good business sense to proceed? If it's way over budget, can we redefine the project's scope? Is there another way to do this? Can we do without certain elements so as to bring down the cost? If so, does the project make sense at a reduced level? And most importantly, does it make sense to go on with Phase II?

Phase II: Proof of Concept

So Phase I has told you that this may actually be worth pursuing. You've got the rough cut estimate, and even your CFO admits that it sounds like a pretty sound business model. Most importantly, everyone is agreed on what the broad strokes of the project are. The next step is the Proof of Concept.

The purpose of the Proof of Concept phase is to test your planned approach on a limited scale. This will be your opportunity to test out the areas that were identified as being particularly risky, and to test, on a small scale, the hypothesis developed in Phase I. The results of this phase will provide a more detailed plan including the following: fleshed out functional requirements, preliminary software development, a converted sample set, and more finely-tuned cost projections.

Returning to the building analogy again, this is the step where the preliminary design is laid out, and a model built so that everyone can get an idea of what the building will look like. Additionally, a test boring is done to ensure that the soil will be able to support the building.

Figure V: Establishing a project timeline will help ensure that you accomplish what you need on schedule.

Figure V shows a typical project timeline for this phase. For a significantly larger project, this phase might take 6-10 weeks. The key stages and deliverables are described below:

Project Initiation - You always need a project kickoff meeting. One of the main purposes of this meeting is to make sure that everyone on the expanded team has the same understanding of the project concept. The team will probably include a project manager, a domain expert, a data analyst, a programmer and a senior editor.

The project initiation is also where the detailed task plan is created and reviewed. The task plan will help ensure that everyone understands their roles and responsibilities as a member of the team.

Defining the Sample Set - Important questions need to be answered in order to define the Proof of Concept. Be patient here; you probably won't be ready after the kickoff meeting.

Ask yourself the following questions: What's intended to be proven? How big should the sample be? Which project elements are known technology and therefore don't need to be part of this exercise? Which elements are particularly risky or unknown and need special focus?

Beware the common mistakes. While there may be a tendency to try to do everything at once, or to do the easy parts first, remember that the real purpose is to focus on a small data set, and on the risky and unknown areas. Fail to identify where your project's critical challenges are now, and the hypothesis of your whole project might be off.

With this in mind, it may be better to focus on 10 pages of difficult bibliographic references or complex tables, rather than 100's of pages of straightforward or repetitive text. And if there are 20 major variations of material, don't try to analyze them all; instead, pick the 2 or 3 that are most representative of the issues.

Inventory Materials - This task invariably evokes groans, but someone has got to do it. You need to have a good idea of how big the pile is, and a clear understanding of the variation contained within the pile. The exact methodology you use to collect this information will vary depending on the project.

While it would be ideal to get a detailed list of everything that needs to be done, that's not usually the case. What you are trying to do at this stage is get an understanding of how much of each type of material there might be. The reason is that each type of material will probably require its own programming and conversion process. And while building conversion software to help automate much of the conversion makes sense, you don't want to invest lots of programming time automating for a particularly difficult type of material of which you have only 10 pages.
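The trade-off described above is a simple break-even comparison, sketched here in Python. The material types, page counts and dollar figures are hypothetical; in practice they would come from the inventory and from timings gathered during the Proof of Concept.

```python
from dataclasses import dataclass


@dataclass
class MaterialType:
    """One line of the materials inventory (all values hypothetical)."""
    name: str
    pages: int                    # pages of this type found in the inventory
    manual_cost_per_page: int     # dollars per page when converted by hand
    automated_cost_per_page: int  # dollars per page once software exists
    programming_cost: int         # one-time dollars to build that software


def worth_automating(material):
    """True if building software is cheaper than converting by hand."""
    manual_total = material.pages * material.manual_cost_per_page
    automated_total = (
        material.programming_cost
        + material.pages * material.automated_cost_per_page
    )
    return automated_total < manual_total


inventory = [
    MaterialType("body text", 40_000, 3, 1, 8_000),
    MaterialType("complex tables", 10, 20, 2, 5_000),
]

for material in inventory:
    decision = "automate" if worth_automating(material) else "convert manually"
    print(f"{material.name}: {decision}")
```

With these made-up numbers, the 10 pages of complex tables never repay the programming effort, which is the situation the inventory is meant to reveal.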

Developing Decision-Making Guidelines - This is usually the heart of the Proof of Concept phase. The extent to which you can develop rules and guidelines for transforming your source materials into "properly tagged data" will be the most important determinant of the final cost of this project. In other words, it needs to be done with care.

The domain expert and the data analyst will work closely together here, to try to generalize the rules and condense them into as small a set as possible. What you are trying to do at this stage is build a functional set of rules. Don't make the mistake of turning this into a programming exercise; that will just bog you down at this stage. Equally important, don't give up too early. While the usual tendency is to think there are no rules - "it's just common sense and you either know it or you don't" - that's probably not the case.

The Conversion Specification Document - It is useful at this stage to formalize the guidelines derived to this point into a single document. The Conversion Specification Document will become the primary repository of project information; it will be continually consulted and reviewed by the end user, the domain expert, the analyst, and the programmer. This document expands the previously established guidelines into a set of rules that can be programmed for. It also identifies areas that are ambiguous or difficult to define; these areas will then need to be reviewed by the domain expert. The Conversion Specification document typically circulates among the various parties involved, and becomes the central discussion document until issues are resolved. It is also the document that defines the programming efforts.
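As a rough illustration of what "rules that can be programmed for" might look like, the Python sketch below pairs text patterns with target tags and sets aside any line that no rule covers, so that it can go to the domain expert. The two patterns and the tag names are assumptions invented for this example; a real specification would define its own.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical rules: (pattern that recognizes a line, tag to apply).
RULES = [
    (re.compile(r"^Figure [IVX]+:\s*(.+)$"), "caption"),
    (re.compile(r"^Phase [IVX]+:\s*(.+)$"), "heading"),
]


def tag_line(line):
    """Return the tagged line, or None if no rule applies."""
    for pattern, tag in RULES:
        match = pattern.match(line)
        if match:
            return f"<{tag}>{escape(match.group(1))}</{tag}>"
    return None


def convert(lines):
    """Split lines into tagged output and lines needing expert review."""
    tagged, needs_review = [], []
    for line in lines:
        result = tag_line(line)
        if result is None:
            needs_review.append(line)
        else:
            tagged.append(result)
    return tagged, needs_review


sample = ["Phase II: Proof of Concept", "Some text that no rule covers"]
tagged, needs_review = convert(sample)
print(tagged)
print(needs_review)
```

Keeping the unmatched lines in a separate list mirrors the document's role in flagging ambiguous areas for review instead of guessing at them.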

POC Software and Sample Set Conversion - OK, so you've written the conversion specification document; hopefully it addresses all the major issues of the conversion. Now it's time to see if you can really use those guidelines and specifications to convert anything.

As in the Project Initiation task, you need to be cautious here. While most successful conversion projects combine automation with manual effort, programming should be done sparingly at this point. There simply isn't time during the Proof of Concept to program for everything you'd like to. In addition, there will be a tendency to program for the easy things first. The best approach is to select a few complex areas which people doubt can be converted in an automated manner. For these areas, invest time testing out programmatic approaches to their unique problems; this learning process will be invaluable and will help tremendously when you move on to Phase III. For the rest of the set, however, it probably makes sense for people to follow the conversion specification manually, rather than investing heavily in writing and testing programs.

The end result of this phase should give you a good feel for what can or should be automated, and what will need to be done manually. It will also yield some valuable timings on what the labor elements of this project are likely to be.

Future Phase Planning and Pricing - If you've done your job right thus far, you'll now be able to more closely estimate the project's costs, and lay out a realistic timeframe in which it can be done. As more materials are tested and converted in the next phase, these estimates will be further refined. Keep in mind that programming costs will rise in the next phase as you start to expand your efforts toward automation. However, if the materials you initially selected for the sample are truly representative, and you've taken into account people's learning curves as they started working with your sample data, what you have now is pretty close.

This phase is also the checkpoint at which to determine whether the project still makes sense. This checkpoint lets you make a go/no-go decision based on the outcome of the POC. Ask yourself the following questions: Are the costs still in line with our budget? Were we able to prove the concept that we came up with in the first phase? Are we getting the quality we expected? Was our original time estimate (or the promises I made to our backers) doable?

In deciding whether this project is feasible or not, figure out what the Proof of Concept has yielded.

What did this exercise buy you?

  • Time to Market - you'll have a realistic estimate of how long this project will take as well as your options for speeding it up.
  • Quality - you'll be able to demonstrate expected results while there's still time to make modifications.
  • Cost - you'll have an understanding of the project costs and what the tradeoffs are.
  • Scalability - you'll have the tools in place to create a process that scales as big as you want.

Phase III: Analysis, Design and Engineering.

By now, we should have a clear understanding of what we want to achieve from this conversion. The Proof of Concept will have yielded valuable clues as to where and how to refine our conversion guidelines. Mistakes will be identified and concepts will be proven. You'll understand the true complexity of the project and the steps you need to take to manage it. The Proof of Concept, more than anything else, should land the entire conversion team on the same page. And, it will become the foundation upon which our entire process will be built.

The following are the primary results of the Proof of Concept:

  • Improved conversion guidelines.
  • Refined conversion specification.
  • Refined conversion software.
  • More finely tuned cost projections.

Phase III is primarily a matter of refining the various deliverables done on the conversion sample, for the fuller set of materials. Here we build upon what was done during the Proof of Concept and expand the analysis, design, and engineering components to handle the full set of materials. Additionally, we go back and program for all the things we did not have time for (or did not need to prove) during the Proof of Concept. Planning for gradual ramp-up and full volume production processing will also be done at this point.

A typical project timeline for this phase is shown in Figure VI:

Figure VI: It's important to lay out a timeline for the analysis, design and engineering phase of the project; for the typical larger project, this phase will likely take 6-10 weeks.

Phase III is made up of the following key tasks:

Production Process Planning - Integrating the various elements of the conversion process is too often an afterthought. That can be an expensive mistake. The most mundane things, such as agreeing upon filename conventions and basic data trafficking procedures, are too often not properly planned in advance. Typically, a large conversion effort consists of between 30 and 50 independent steps requiring multiple skills, and often multiple vendors. There are also time dependencies that need to be integrated in order to ensure a smooth production flow.

In planning the production process, there are also a number of important logistical considerations:

  • How many pages a week can each step in the process handle?
  • What's the weak link in the chain?
  • Can you keep up with reviewing and inspecting converted materials as they're delivered?
  • Technical questions will arise; will there be a dedicated point of contact on both sides?
  • How will materials be transported back and forth?
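The first two questions in the list above reduce to a small calculation, sketched here in Python. The step names and weekly capacities are hypothetical, and the estimate assumes steady-state flow: it ignores ramp-up and assumes every page passes through every step.

```python
import math

# Hypothetical capacity of each step, in pages per week.
CAPACITY = {
    "scanning": 20_000,
    "keying": 8_000,
    "tagging": 12_000,
    "review": 5_000,
}


def weakest_link(capacity):
    """Return (step, pages_per_week) for the slowest step."""
    return min(capacity.items(), key=lambda item: item[1])


def weeks_to_finish(total_pages, capacity):
    """Weeks needed when the slowest step limits the whole chain."""
    _, pages_per_week = weakest_link(capacity)
    return math.ceil(total_pages / pages_per_week)


step, rate = weakest_link(CAPACITY)
print(f"Weak link: {step} at {rate:,} pages per week")
print(f"100,000 pages take about {weeks_to_finish(100_000, CAPACITY)} weeks")
```

With these invented figures the review step sets the pace, which ties directly to the third question in the list: whether you can keep up with inspecting converted materials as they are delivered.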

Another important question to ask is - how much time will it take? With an ongoing production facility that handles thousands of pages a day, you still need to allow 4-6 weeks for the integration to take place. You'll therefore need to start early in Phase III. And if you're going to be building a process from scratch, you should allow at least 6 months.

Production Quality Planning - While many of the standard quality control processes apply to conversion projects, there is a significant difference. Chips Ahoy may make the best cookies, but they'll tell you that they only use the choicest chocolate chips and the finest flour. Unlike a cookie factory, you can't really control the quality of ingredients coming into your machine. No matter how well you select your samples and trial materials, you are unlikely to find every significant variation; therefore, it's probably expecting too much to hope to account for all the possibilities in advance. The documents will typically have been written by many individuals, at several different locations, in many different editing packages, over a long period of time, and on a variety of systems. So, like the people who made them, the documents will have personalities. And, like people, their behavior may not always be exemplary.

Ensuring quality control in this environment means building feedback loops at each step of the process. These checkpoints are designed to report when things are not meeting expectations, and provide guidelines, rather than rules, to the people inspecting the results. Information needs to flow back and forth easily in order to allow refinement of this process. You'll also need to collect statistics in order to tell how much sampling will be needed as the process improves.
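One way to use those statistics is sketched below in Python: inspect a sample, compute the observed error rate, and let that rate decide how much of the next batch to inspect. The thresholds and sampling fractions are arbitrary assumptions chosen for illustration, not recommended values.

```python
def observed_error_rate(pages_inspected, pages_with_errors):
    """Fraction of inspected pages found to contain at least one error."""
    if pages_inspected <= 0:
        raise ValueError("at least one page must be inspected")
    return pages_with_errors / pages_inspected


def next_sampling_fraction(error_rate):
    """Choose how much of the next batch to inspect.

    The thresholds are hypothetical: inspect everything while the
    process is unreliable, and sample less as it improves.
    """
    if error_rate > 0.05:
        return 1.0   # inspect every page
    if error_rate > 0.01:
        return 0.25  # inspect a quarter of the batch
    return 0.05      # spot checks only


rate = observed_error_rate(pages_inspected=200, pages_with_errors=4)
print(f"Observed error rate {rate:.1%}, inspect {next_sampling_fraction(rate):.0%} next")
```

Recording the rate per step, not just per batch, makes it easier to see which part of the process needs attention.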

Production Ramp Up - Just as we advised in Phase I, caution is critical at this stage. We usually find that the best approach is to plan for a few weeks of low volume production through the initial production process. This will help to identify any weaknesses in our process.

The entire production team needs to be aware that the purpose of the first weeks of production is to provide feedback in order to help engineer a smoother process. This is not yet the time to put dozens of people to work, but rather a time to assign a select few individuals who are capable of figuring out where improvements can be made.

Phase IV: Production

You're almost there, but you do need to continually monitor results and make sure that quality and productivity stay where you expect them to be.

Full-Volume Production - Even after the production ramp-up stage, it is not necessarily prudent to plan for full production volumes immediately. We've found it best to gradually increase volumes, thereby allowing ample time for people to be trained and to come fully up to speed.

Production Process Control - You need a method to track production through the various phases. For smaller projects, Excel spreadsheets may be sufficient. But for larger projects, you probably need something more sophisticated. At a minimum you need to know where in the process each batch of materials is, how long a batch is taking to go through, and how much material is awaiting each phase.
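The three minimum questions can be answered from a very small record per batch, as this Python sketch suggests. The field names, step names and dates are hypothetical; a production system would more likely keep these records in a database than in memory.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Batch:
    batch_id: str
    pages: int
    step: str      # step the batch is currently in or waiting for
    started: date  # date the batch entered the process


def pages_by_step(batches):
    """How much material is sitting at each step."""
    totals = defaultdict(int)
    for batch in batches:
        totals[batch.step] += batch.pages
    return dict(totals)


def days_in_process(batch, today):
    """How long a batch has been going through the process so far."""
    return (today - batch.started).days


batches = [
    Batch("B001", 400, "keying", date(2024, 1, 1)),
    Batch("B002", 300, "keying", date(2024, 1, 8)),
    Batch("B003", 300, "review", date(2023, 12, 18)),
]

today = date(2024, 1, 15)
print(pages_by_step(batches))
for batch in batches:
    print(batch.batch_id, batch.step, days_in_process(batch, today), "days")
```

A pile of pages building up at one step is often the earliest visible sign of a weak link in the chain.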

Materials Trafficking - It is very rare that you have everything that needs conversion ready in a pile at the beginning of a project. More likely, materials will be readied gradually as the project progresses. In order to avoid slowdowns later in the process, someone needs to be in charge of trafficking the materials, making sure that materials are ready and complete, and forwarding them appropriately.

Process Improvement Feedback - It certainly won't be perfect when you first go into production. You will need a method to formally collect information on exceptions and on what's not working properly. This method will need to be quite flexible as different parts of the process will report exceptions at different times.

Packaging and Delivery - This doesn't seem like a big deal, but you need to get the finished materials to the right person. The right materials! Otherwise frustration can set in. This is also a convenient point at which to do some final quality checking, and to document any specific procedures the person you're delivering to needs to follow.

Exception Handling Mechanisms - You'll also have to allow for exception reporting. Exception reports are delivered to the end user along with the completed data. Because of the wide variance and inconsistencies of the materials being converted, there will inevitably be materials that need special handling by the recipient. And it would seem wise to have a mechanism more sophisticated than yellow stickies to deal with this.

Where do I go from here?

There is a lot to absorb in this article. If you only take six things from what you've read, take the following:
 
  • Don't just do it. Figure out exactly what you want to accomplish, before you start planning, then plan it before you do it.
  • Project Management and Quality Assurance are key issues. To ensure that the project proceeds properly and on time, you'll need a dedicated team. Don't think of this as a part-time job; your critical team members need to be dedicated to the project.
  • Communications are key. The domain expert, the technical expert and management all need to be on the same page.
  • Select your sample set carefully. It's better to pick 200 pages that are truly representative of the key issues, than 5000 pages on a hit or miss basis.
  • Don't underestimate the value of a well crafted production process; work through the details of the production process before you ramp up to volume processing.
  • You'll need to be sure that it's working; build feedback loops into every step to monitor, control and continually improve the production process.

Figure VII: Maybe it's a good idea for everyone involved to read this article.

Just as your company relies on good management and planning, so too will your project. By following the detailed plan outlined in this white paper, it's possible to plan and engineer a smooth project, guided by the disciplines of a proven methodology.