ABSTRACT:
As the Internet evolves and
information becomes increasingly
valuable, so too does the competition
for revenues from that information.
As publishers and societies
seek new ways to yield revenues
(and maintain existing ones),
repurposing of data for licensing
reasons becomes a critical part
of the overall strategy. To
that end, SGML/XML is the way
to go – but which SGML?
And how?
One of the questions that I’m
often asked is: “If I’ve
already paid to have my documents
converted to SGML, and now ‘all’
I want to do is convert them
to another SGML/XML DTD, shouldn’t
that be easy? And shouldn’t
it be inexpensive?” These
are perfectly logical questions.
In theory, if all documents
were converted to SGML ‘properly’,
then the process of converting
from one structured markup to
another should be as simple
as remapping one set of tags
to another. However, the reality
can be quite different. Consider
DTD design, for example. DTD
designers have considerable
latitude in the level of granularity
that the tag structure captures,
and many factors can affect
the design. Real-world issues,
such as cost and the need to
republish the document to paper
or a new electronic format,
can also influence or restrict
the designer. When converting
from one DTD to another, this
means that structural components
are often not tagged at a consistent
level of granularity. That
inconsistency, combined with
compromises made in the design
phase, makes the conversion
considerably more complex.
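To make the 'in theory' case concrete, here is a minimal sketch of pure tag remapping in Python. It assumes the source is well-formed XML (true SGML would first need an SGML-aware parser), and every tag name in it is invented for illustration rather than taken from a real DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical source-to-target tag mapping; none of these names
# come from a real DTD.
TAG_MAP = {
    "article": "art",
    "title": "atl",
    "para": "p",
    "reference": "bib",
}


def remap_tags(root, tag_map):
    """Rename elements in place and return the tags that had no mapping."""
    unmapped = set()
    for elem in root.iter():
        if elem.tag in tag_map:
            elem.tag = tag_map[elem.tag]
        else:
            unmapped.add(elem.tag)
    return unmapped


# Invented sample document; the reference is a single undecomposed string.
source = (
    "<article><title>On Markup</title>"
    "<para>Body text.</para>"
    "<reference>Doe J. An invented title. 1999.</reference>"
    "</article>"
)
root = ET.fromstring(source)
unmapped = remap_tags(root, TAG_MAP)
converted = ET.tostring(root, encoding="unicode")
print(converted)
print(sorted(unmapped))
```

Renaming works only when both DTDs capture the same level of detail. The mapping can turn the reference element into its target equivalent, but it cannot create author or title elements that the source never tagged.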
I usually explain to people
that one of the basic concepts
of structured markup is that
you say what something is, not
how it looks. In practice, however,
implementers face limits when
trying to reach this goal. To
illustrate, let’s take a look
at a real-world issue that we
often come across.
Figure 1: A typical
journal reference as it exists
on the printed page.
In typical scholarly journal
publishing, each article contains
a reference section at the end
pointing the reader to reference
sources used in the article.
These references are typically
well structured, and contain
information such as the author's
name, article title, journal
title, page number, date and
place of publication, etc. In
an ideal world, when such references
are converted to SGML, each
reference would be completely
decomposed to its component
pieces, with all emphasis and
punctuation removed (after all,
what we said above is that in
SGML, we want to say what something
is, not how it looks). However,
this can add significant cost
and complexity to the conversion
process. Because most of us
have to function in the real
world (budgets to meet, etc.),
some electronic publishers may
decide not to bother decomposing
these references at all, while
other publishers may decompose
the references but leave in
emphasis and punctuation (this
makes the act of republishing
the reference easier since the
display engine has less work
to do). Yet others may go ‘all
the way’ and produce a
completely tagged reference
(this is SGML at its most granular).
Figure 2: A partially
decomposed SGML instance of
the reference in Figure 1.
Figure 3: A more fully
decomposed SGML instance of
the same reference.
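As a rough illustration of what decomposing a reference involves, the Python sketch below splits one reference string into named fields. The reference text, the field names, and the punctuation style are all invented for this example and are not the content of the figures; a real journal would need one pattern per reference style, plus handling for the exceptions.

```python
import re

# Pattern for ONE invented reference style:
#   Surname Initials. Article title. Journal title. Year;Volume:First-Last.
REFERENCE_PATTERN = re.compile(
    r"^(?P<author>[^.]+)\.\s+"
    r"(?P<article_title>[^.]+)\.\s+"
    r"(?P<journal_title>[^.]+)\.\s+"
    r"(?P<year>\d{4});(?P<volume>\d+):"
    r"(?P<first_page>\d+)-(?P<last_page>\d+)\.$"
)


def decompose_reference(text):
    """Return a dict of component fields, or None if the style is not recognized."""
    match = REFERENCE_PATTERN.match(text.strip())
    return match.groupdict() if match else None


# A reference in the expected style decomposes cleanly.
fields = decompose_reference(
    "Doe J. An invented article title. Journal of Examples. 1999;12:34-56."
)
print(fields)

# A reference in any other style is not recognized at all.
unparsed = decompose_reference("Doe J, Roe R (1999) Another style entirely")
print(unparsed)
```

The second call shows why this is costly in practice: any reference that departs from the expected punctuation falls through and has to be handled by another pattern or by hand.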
Herein lies the problem. Suppose
someone already has an SGML
representation of a journal
in which the references have
NOT been decomposed. Now they
want to license the journal
to someone else, and the new
DTD requires that the references
be fully decomposed. A considerable
amount of engineering work is
needed to accomplish this. Similarly,
if the source SGML representation
to be converted is fully decomposed
into a set of markup tags, and
the target DTD does not support
this, then the conversion procedure
must reinsert into the target
document all of the punctuation
and emphasis that were removed
from the source SGML. Effectively,
this means that the conversion
software is required to ‘play’
composition engine. This can
get particularly complex if
there are a number of tags involved,
because the conversion software
and process must be engineered
to deal with many different
combinations of tags that may
appear in a variety of sequences.
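The opposite direction, where the converter has to 'play' composition engine, can be sketched in Python as follows. The element names and the punctuation rules are assumptions made up for this example, and the input is assumed to be well-formed XML. The point is that each optional element changes which punctuation must be emitted.

```python
import xml.etree.ElementTree as ET


def compose_reference(ref):
    """Rebuild display text, adding punctuation only for parts that exist.

    The element names (author, atitle, jtitle, year, volume, pages) are
    hypothetical and stand in for whatever the source DTD defines.
    """
    def text_of(tag):
        node = ref.find(tag)
        return node.text.strip() if node is not None and node.text else None

    parts = []
    for tag in ("author", "atitle", "jtitle"):
        value = text_of(tag)
        if value:
            parts.append(value + ".")

    # The separators depend on which neighbouring elements are present.
    year, volume, pages = text_of("year"), text_of("volume"), text_of("pages")
    citation = year or ""
    if volume:
        citation += (";" if citation else "") + volume
    if pages:
        citation += (":" if citation else "") + pages
    if citation:
        parts.append(citation + ".")
    return " ".join(parts)


full = ET.fromstring(
    "<ref><author>Doe J</author><atitle>An invented article title</atitle>"
    "<jtitle>Journal of Examples</jtitle><year>1999</year>"
    "<volume>12</volume><pages>34-56</pages></ref>"
)
partial = ET.fromstring(
    "<ref><author>Doe J</author><jtitle>Journal of Examples</jtitle>"
    "<year>1999</year><pages>34-56</pages></ref>"
)
full_text = compose_reference(full)
partial_text = compose_reference(partial)
print(full_text)
print(partial_text)
```

Even this toy version needs conditional logic for three optional elements. A production converter has to cover every combination and ordering that the source DTD allows.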
Decomposing references in
SGML-to-SGML conversion is just one
of the challenges involved.
There are many other similar
issues, but the reference example
above should help illustrate
the point that these conversions,
while doable, are far from trivial.
Depending on the complexity of the
conversion, manual editorial review
is often required following the
automated conversion processing. This
is because unexpected input
can occasionally produce unacceptable
results (e.g. punctuation or
a space in the wrong place).
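One way to focus that editorial review is to flag converted text that looks suspicious, so reviewers see the likely problems first. The Python sketch below is only a heuristic under assumed rules: the three checks are invented for illustration, would need tuning to a publisher's house style, and do not replace a human read-through.

```python
import re

# Hypothetical checks; a real project would tune these to its own house style.
SUSPICIOUS_PATTERNS = {
    "space before punctuation": re.compile(r"\s[,.;:]"),
    "doubled punctuation": re.compile(r"([,.;:])\1"),
    "doubled space": re.compile(r" {2,}"),
}


def flag_for_review(text):
    """Return the names of the checks that the text trips, in a fixed order."""
    return [
        name
        for name, pattern in SUSPICIOUS_PATTERNS.items()
        if pattern.search(text)
    ]


clean = flag_for_review(
    "Doe J. An invented article title. Journal of Examples. 1999;12:34-56."
)
dirty = flag_for_review(
    "Doe J . An invented article title.. Journal of Examples"
)
print(clean)
print(dirty)
```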
In the follow-up article, we’ll
discuss some steps that can
be taken up front to minimize
the pain and potentially reduce
the costs involved in doing
SGML-to-SGML/XML conversion.