Vol. 24, No. 4
- How to Structure Chemistry-Related Documents
Peter Murray-Rust and Henry S. Rzepa
Although the use of markup languages in publishing goes back to the 1960s, when
IBM introduced GML (Generalized Markup Language), which subsequently
evolved into the standard SGML, most authors are nowadays more familiar
with the more recent implementation, referred to as HTML (HyperText
Markup Language). The rapid rise in the use of HTML in conjunction with
the growth of the World Wide Web was in large measure due to its ease
of use for achieving presentational and visual effect. However, its
limitations as a mechanism for expressing precisely defined data and
meanings were not always adequately recognized. These limitations meant
that in areas such as molecular sciences where precise meanings are
essential, a variety of often proprietary solutions continued to be
used to define and manipulate molecular "data" and information.
processes were seen as quite separate and the process of translating
data, information, and knowledge into a published entity remained an
activity requiring much human perception. It is also worth noting that
the reverse process of converting the published materials back into
usable data remained equally human intensive and hence expensive.
The need to reconcile these two extremes was recognized at the first World Wide
Web conference in 1994. A solution gelled shortly after the conference,
as a remarkable communal effort resulted in the specification of the Extensible
Markup Language, or XML. The ultimate vision of XML, as described by
Berners-Lee, is the creation of a "Semantic Web" [1].
The rationale for this impressive effort included the following:
of a more universal infrastructure for publishing
that the use of XML will require subject-specific vocabularies called
"ontologies." An ontology is defined as a description (such as a formal
specification of a program) of the concepts and relationships
that can exist for a software agent or a community of agents.
of a mechanism for enhancing quality ("validation")
Promotion of the creation of dynamic hyper-documents
of the need to be able to reuse components of documents for other
of a mechanism for creating smart archives, in which the re-usable
components (information objects) can be readily identified
of an infrastructure for underpinning the emerging areas of e-business
to chemistry included, therefore, the creation of a new generation of
ontologically rich, primary publication and a clear division of the
respective roles of humans and software agents (robots). Thus,
humans should be able to:
all their data automatically
errors from publications
the published literature as a database
information from other domains
Software agents (robots) should be able to:
publications (on whatever scale)
chemicals from literature
To achieve this, we argue that a number of prerequisites must be in place:
data capture, especially from instruments. We note that in 30 years
we have moved from instruments that often captured data only
in analogue form (chart paper), to standard computers that capture
and process data, and most recently to an increasing tendency to place
these computers online and connect them to centralized data stores.
ontologies for a specific community (e.g., molecular science)
Involved in "Capturing" Chemistry
- The Current Position of XML
- Global Open Activity in Scientific XML
- Some Essentials of an XML System
- Creating Valid XML Documents
- Ontologies of Relevance to Chemistry
Key to Making It Unique
- Dictionaries and Schemas
- Document Structure and Metadata
Involved in "Capturing" Chemistry
The following extract [2] from a typical science journal illustrates
both how precisely data and information must be represented and
how much human perception is required to translate this information
(e.g., to a reproducible experiment or a mechanistic interpretation):
"Thiamin phosphate synthase catalyzes the formation of thiamin phosphate from 4-amino-5-(hydroxymethyl)-2-methylpyrimidine
pyrophosphate and 5-(hydroxyethyl)-4-methylthiazole phosphate. The
reaction involves... dissociative mechanism... carbenium ion intermediate...
and pyrimidine iminemethide observed in the crystal..."
Note the profusion of chemical structure information, concepts, and terms, which
only a trained human chemist could easily process. Quantitative concepts
and units are also ubiquitous:
"A 500 ml aliquot of 0.8 mM TP synthase in 50 mM
Tris-HCl (pH 7.5) and 6 mM MgCl2 incubated at room temperature
with 50 µM CF3HMP-PP."
An even greater degree of human perception is required when handling graphical
chemical representations, which may contain many, often fuzzy and dangerous,
human-only semantics (e.g., 2-D representations of 3-D properties, relative
stereochemistry, aromaticity, hydrogen and other "weak" bonding, use
of generic and "R" groups, reaction arrows, and mechanisms, etc.). The
challenge, therefore, is to develop an infrastructure that can be routinely
used to capture, store, and appropriately filter and display such information.
The Current Position of XML
As of 2002, XML offers a general, powerful, and extensible mechanism
for handling both the "capture" and the publication of chemical information.
In particular, XML for the first time allows this process to operate
equally well in both directions. Our basis for stating this derives
from the following observations:
XML is increasingly accepted as an information infrastructure.
The protocols are all public and many of the tools are open source.
XML is vendor neutral, but with heavy vendor involvement.
There is a large communal investment in generic tools (e.g., business2business,
XML has a modular approach; an application is built from components.
are expected to create domain-specific XML protocols and tools.
XML is increasingly universal in back-ends, middleware, and servers.
Support for XML from database vendors is rapidly increasing.
XML has close interoperability with other informatics standards such as
UML, OMG/CORBA, etc.
There is increasing support for "XML over the net" and from browsers (e.g.,
Internet Explorer, Netscape 6, etc.).
XML is very well supported by books, tutorials, etc.
Global Open Activity in Scientific XML
How has the scientific community adopted these concepts? As noted above,
the first World Wide Web conference specifically identified mathematics
and chemistry as requiring specific markup languages. With this spark,
CML (Chemical Markup Language) evolved between 1995 and 1997 to become the
first scientific extended markup language. A concurrent effort led
to MathML becoming formalized as such in 1998 [3]. We
estimate that by 2002, perhaps 50 specifically scientific applications
have been described to some degree. For example, 37 scientific applications
are quoted at <www.xml.com/pub/rg/Science>
and a more general listing is at <www.oasis-open.org/cover/xml.html#applications>.
The Science Citation Index shows around 570 references to the keyword
XML, and SciFinder retrieves 38 references to the term "XML in chemistry."
We emphasize that XML is designed to allow markup languages to be combined,
at whatever level of granularity, so that documents can contain any
number of components deriving from specific XML languages. HTML, which
we noted above, has evolved into one such language (XHTML), and in its
latest development has been modularized into smaller, more easily implemented
components (e.g., XForms, a data-entry and validation component, can
be implemented separately from other, more display-oriented components).
XHTML can co-exist in a document with languages such as SVG (a scalable
vector graphics language), MathML, and CML. We elaborate on this when
discussing namespaces (vide infra) [4].
Some Essentials of an XML System
The following tasks will have to be accomplished in order to implement an XML solution
to publishing chemical information:
of documents from both legacy sources of data and de novo by
and capture of metadata (dictionaries of terms, tables of contents,
of namespaces (a reserved addressing scheme for information)
validation of the system (conformance to agreed specifications)
validation of documents (according to a specified and agreed upon
and display (XSL-FO, domain-specific such as molecular representations)
of an XML-based markup language should provide for the following:
simple, extensible document type definition (DTD) or schema
(modular and not over-complicated)
or more agreed and published ontologies
examples and conformance tests
community of critical mass
tools for accomplishing this should be identified. These might include
readers (more difficult than readers since the
may not be normalized to a single form)
converters (difficult because of variation and ambiguity in the original
data which may require some degree of perception for an accurate conversion)
XSLT style sheets and generic editors will accomplish some of these,
but a document object model (DOM), which represents a syntax-free abstraction
of the data in memory, is probably essential for many subjects.
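To make the idea of a syntax-free abstraction concrete, the following minimal sketch parses two differently written serializations of the same molecule with Python's standard-library `xml.dom.minidom` and compares the resulting trees. The element and attribute names and the `summarize` helper are illustrative assumptions, not taken from a published CML schema.

```python
from xml.dom.minidom import parseString

# Two serializations that differ only in syntax (quote style, attribute
# order, empty-element form). Element and attribute names are hypothetical.
DOC_A = '<molecule id="m1" title="water"><atom id="a1" elementType="O"/></molecule>'
DOC_B = "<molecule title='water' id='m1'><atom elementType='O' id='a1'></atom></molecule>"


def summarize(node):
    """Return a syntax-independent view of an element: name, attributes, children."""
    attrs = dict(node.attributes.items())
    children = [summarize(c) for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]
    return (node.tagName, attrs, children)


tree_a = summarize(parseString(DOC_A).documentElement)
tree_b = summarize(parseString(DOC_B).documentElement)

# Both documents carry the same information once the syntax is parsed away.
print(tree_a == tree_b)
```

Software that works on the in-memory tree, rather than on the text, does not need to care which of the two spellings an author or an instrument happened to produce.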
Ontologies of Relevance to Chemistry*
* In this context, the term ontology refers to a machine-readable set
of definitions that create a taxonomy of classes and subclasses
and the relationships between them. Ref: <www.w3.org/2001/sw/
An overview of the types of ontologies required is shown in Table 1. Of the chemically
specific information types, support should be included for:
information, especially spectra
and simulation (QM, mechanics, dynamics, etc.)
concepts (numbers, units, arrays, matrices, etc.)
software for display, editing, searching, etc.
disciplines such as bio areas, materials science, etc.
Table 1: Types of Ontologies Relevant to XML in Chemistry and Tasks for the Chemical Community

| Type of ontology | Examples | Task for the chemical community |
|---|---|---|
| General Non-Chemical Informatics | Business and Commerce, Government, Regulatory, Academic, | Reuse existing or emerging approaches |
| | Mathematics (MathML), Healthcare (HL7/XML), Genomics | Collaborate to reuse existing or emerging approaches |
| Chemical-Specific but Generic Information | Numeric data, descriptive prose, safety | Create ontologies and reuse generic |
| Chemical-Specific Information Types | Chemical substances, molecules, analytical and spectroscopic, reactions, computational chemistry | Build the complete tool set |
Creating Valid XML Documents
Tools and protocols already exist to create valid XML documents. In
particular, the use of DTDs (Document Type Definitions) and Schemas
can bring enormous benefits, including eliminating or reducing software
failure due to the use of invalid data and reducing the difficulty of (human)
understanding due to invalid publications. The DTD is a concept rooted
in SGML, and is still used in XML to constrain the markup vocabulary
(i.e., the basic elements used for markup) and to some extent the (sub)structure
of documents (i.e., what element can be a parent or child of another).
Schemas are a more recent development, and unlike DTDs, are themselves
expressed using XML. Of particular relevance to chemistry, they provide
advantages over DTDs in that they can also be used for:
numbers and user-defined types
(for example to specify the list of chemical elements)
Schemas allow for additional user-created rules (Schematron/XSLT) and,
with dictionaries, support the conversion to software (e.g., CML-DOM),
authoring (e.g., in editors), validation of the data on entry by the
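The kind of constraint a schema adds over a DTD (an enumerated list of chemical elements, a numeric datatype) can be imitated in a few lines of Python. This is only a hand-written stand-in for real schema validation, under stated assumptions: the `ALLOWED_ELEMENTS` subset, the `validate_atoms` helper, and the sample data are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# A deliberately short, illustrative subset of element symbols; a real
# schema would enumerate the full periodic table.
ALLOWED_ELEMENTS = {"H", "C", "N", "O", "Na", "Cl"}


def validate_atoms(xml_text):
    """Return a list of human-readable problems found in <atom> elements."""
    problems = []
    root = ET.fromstring(xml_text)
    for atom in root.iter("atom"):
        atom_id = atom.get("id", "?")
        symbol = atom.get("elementType")
        # Enumeration check: the symbol must be one of the allowed values.
        if symbol not in ALLOWED_ELEMENTS:
            problems.append(f"{atom_id}: unknown elementType {symbol!r}")
        charge = atom.get("formalCharge", "0")
        try:
            int(charge)  # datatype check: the charge must be an integer
        except ValueError:
            problems.append(f"{atom_id}: formalCharge {charge!r} is not an integer")
    return problems


# Made-up sample data: the third atom violates both constraints.
sample = """<atomArray>
  <atom id="a1" elementType="Na" formalCharge="+1"/>
  <atom id="a2" elementType="Cl" formalCharge="-1"/>
  <atom id="a3" elementType="Xx" formalCharge="one"/>
</atomArray>"""

for problem in validate_atoms(sample):
    print(problem)
```

In a schema-based system these rules would live in the schema itself, so that every tool reading the document applies the same checks without custom code.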
Namespaces: Key to Making It Unique
Each object must be uniquely named to avoid collision and ambiguity. This
is achieved using XML namespacing.
The example below shows a paragraph of text (derived from XHTML, which inherits
the default namespace), within which components of CML are embedded,
with prefixes that refer to the defined namespaces:
<p>We can supply the following set of molecules:</p>
<ul>
<li><cml:molecule id="p1" title="phosphine"/></li>
<li><cml:molecule id="p2" title="penguinone"/></li>
</ul>
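As a rough illustration of how software can pick apart such a mixed document, the sketch below parses an XHTML fragment with embedded CML using Python's standard-library `xml.etree.ElementTree`. The XHTML namespace URI is the W3C one; the CML namespace URI, the enclosing `<div>`, and the list structure are assumptions made so that the fragment is complete.

```python
import xml.etree.ElementTree as ET

# The XHTML URI is the W3C one; the CML URI below is an assumption for
# illustration and should be replaced by whatever the schema in use declares.
NS = {
    "h": "http://www.w3.org/1999/xhtml",
    "cml": "http://www.xml-cml.org/schema",
}

DOCUMENT = """<div xmlns="http://www.w3.org/1999/xhtml"
     xmlns:cml="http://www.xml-cml.org/schema">
  <p>We can supply the following set of molecules:</p>
  <ul>
    <li><cml:molecule id="p1" title="phosphine"/></li>
    <li><cml:molecule id="p2" title="penguinone"/></li>
  </ul>
</div>"""

root = ET.fromstring(DOCUMENT)

# The same document answers questions from two vocabularies without clashes.
paragraphs = [p.text for p in root.findall("h:p", NS)]
molecules = [m.get("title") for m in root.findall(".//cml:molecule", NS)]

print(paragraphs)
print(molecules)
```

A search for an unqualified `molecule` element finds nothing here, which is the point: the namespace is part of the name, so a CML `molecule` cannot be confused with an element of the same local name from another vocabulary.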
A markup language for domain-independent components of Scientific-Technical-Medical information,
or STMML, contains key elements such as units, dictionary, metadata,
item, array, and matrix, and supports datatypes such as numbers,
max/min, ranges, errors, etc. The next example illustrates how CML can
be used in conjunction with the STMML namespace [5] to
specify units and their constraints:
<stm:scalar title="a" errorValue="0.001"
<stm:scalar title="b" errorValue="0.001"
<stm:scalar title="c" errorValue="0.001"
<atomArray>
<atom id="a1" xyzFract="0.0 0.0 0.0" xy2="+23.2 -21.0"/>
<atom id="a2" elementType="Cl" formalCharge="-1" xyzFract="0.5 0.0 0.0"/>
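A reader of such a document might turn each scalar into a typed value with its error and units attached. The following sketch shows one way to do this in Python; the namespace URI, the `units` attribute, the enclosing `<crystal>` element, and the numeric values are all hypothetical, since only `title` and `errorValue` survive in the fragment above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical namespace URI and "units" attribute, chosen for illustration;
# only "title" and "errorValue" appear in the fragment above.
STM = "http://example.org/stmml"

# Made-up cell lengths, used purely as sample data.
FRAGMENT = f"""<crystal xmlns:stm="{STM}">
  <stm:scalar title="a" errorValue="0.001" units="angstrom">5.640</stm:scalar>
  <stm:scalar title="b" errorValue="0.001" units="angstrom">5.640</stm:scalar>
  <stm:scalar title="c" errorValue="0.001" units="angstrom">5.640</stm:scalar>
</crystal>"""


@dataclass
class Scalar:
    title: str
    value: float
    error: float
    units: str

    def interval(self):
        """Return the (low, high) range implied by the quoted error."""
        return (self.value - self.error, self.value + self.error)


def read_scalars(xml_text):
    """Collect every namespaced scalar element as a typed Scalar record."""
    root = ET.fromstring(xml_text)
    return [
        Scalar(
            title=node.get("title"),
            value=float(node.text),
            error=float(node.get("errorValue")),
            units=node.get("units"),
        )
        for node in root.iter(f"{{{STM}}}scalar")
    ]


cell = read_scalars(FRAGMENT)
for s in cell:
    print(s.title, s.value, "+/-", s.error, s.units)
```

Because the error and the units travel with the number, downstream software can compare or convert values without a human having to interpret the surrounding prose.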
An extended example of this concatenation of namespaces [6]
contains up to eight namespaced components and illustrates how a complete
publication in XML/CML could be achieved. The use of namespaces can
be seen in a more general context in Figure 1, which illustrates how
the various specific XML components might relate to each other.
Figure 1: The use of namespaces in CML.
We note how the original CML specification [7] can be
modularized into a core namespace and extended via other
schemas into the following:
A reaction, containing reactantLists, productLists and links between
A container for computational and simulation input and results.
A generic query language.
Hooks for other Schemas, such as SpectHook,
for spectral parameters and data and links to molecular details (assignment).
It is useful to separate the domain ontology from the Schema/DTD, which allows
the schema to be more abstract and helps extensibility. Thus,
with the instance document referring to NAMESPACE dictionaries, a three- or
four-level hierarchy can be envisaged:
XMLSchema describing the instance
dictionary/ies describing the instance
schema describing the dictionaries
Figure 2: Validation scheme using dictionaries.
and referring processes add semantics and ontology. An overview of this
process is shown in Figure 2, where, for example, units are themselves
verified by the UNITS dictionary.
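The idea that an instance document is checked against a dictionary, rather than against rules hard-wired into the schema, can be sketched as follows. The dictionary entries, the `units` attribute, the `unresolved_units` helper, and the sample values are hypothetical and stand in for whatever a real units dictionary would define.

```python
import xml.etree.ElementTree as ET

# Hypothetical units dictionary: entry id -> the quantity it measures.
UNITS_DICTIONARY = {
    "angstrom": "length",
    "degree": "angle",
    "kelvin": "temperature",
}

# Made-up instance document; the last scalar uses an undefined unit.
INSTANCE = """<data>
  <scalar title="a" units="angstrom">5.640</scalar>
  <scalar title="alpha" units="degree">90.0</scalar>
  <scalar title="t" units="furlong">1.0</scalar>
</data>"""


def unresolved_units(xml_text, dictionary):
    """Return (title, units) pairs whose units have no dictionary entry."""
    root = ET.fromstring(xml_text)
    return [
        (node.get("title"), node.get("units"))
        for node in root.iter("scalar")
        if node.get("units") not in dictionary
    ]


print(unresolved_units(INSTANCE, UNITS_DICTIONARY))
```

Keeping the vocabulary in a dictionary means that a community can add or revise units by curating the dictionary, while the schema and the checking code stay unchanged.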
Document Structure and Metadata
Dictionaries and compendia usually have some of the following features:
consist of curated entries and many are "flat" (e.g., the IUPAC Gold Book).
are compiled within a single hierarchy:
generic ("is A"):
eukaryote <--vertebrate <--mammal <--human
partitive ("has A"):
<--leg <--foot <--toe
can now be associated with a namespace for uniqueness and navigation.
must have curatorial information.
should support versioning.
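The generic and partitive hierarchies listed above map naturally onto child-to-parent links, which software can then traverse. The sketch below is one minimal way to do so in Python; the function names are illustrative, and the partitive chain starts at "leg" because the broader term is missing from the text.

```python
# Child -> parent links for the generic ("is A") hierarchy from the text.
IS_A = {"human": "mammal", "mammal": "vertebrate", "vertebrate": "eukaryote"}

# Child -> parent links for the partitive ("has A") hierarchy from the text.
PART_OF = {"toe": "foot", "foot": "leg"}


def ancestors(term, parents):
    """Walk child -> parent links and return every broader term, nearest first."""
    chain = []
    while term in parents:
        term = parents[term]
        chain.append(term)
    return chain


def is_a(term, candidate, parents):
    """True if `candidate` is `term` itself or one of its broader terms."""
    return candidate == term or candidate in ancestors(term, parents)


print(ancestors("human", IS_A))
print(ancestors("toe", PART_OF))
print(is_a("human", "vertebrate", IS_A))
```

A flat dictionary is simply the special case in which no entry has a parent, so the same lookup code serves both kinds of compendium.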
Metadata is an important component of a document or information object and it
can serve a number of purposes:
is a piece of information to be discovered (e.g., Dublin Core and
What does the information mean and how is it to be used?
What constraints are there on the structure and content of the information?
Is it valid? This would be accomplished using mainly XML Schemas.
(hyper-) data added from metadata
can be made from metadata (e.g., using Schematron, XSLT, and RDF).
For example, medicinal, physical organic chemistry, Gold Book, stereochemistry.
For example, theoretical chemistry and CIF.
For example, tables of atomic weights, dictionaries of compounds, etc.
For example, theoretical chemistry and CIF.
Agreed-upon schemas for defining such metadata are again seen as an
essential component of the XML infrastructure.
IUPAC compendia provide a natural foundation for creating XML-based
machine-processable resources. They fall into three broad categories:
descriptive (e.g., medicinal chemistry, physical organic chemistry,
stereochemistry, etc.), validating (e.g., theoretical chemistry) and
supplemental (e.g., atomic weights). Their availability for XML-based
processes would be a considerable asset.
In this brief review of the application of XML in chemistry, we have summarized
the essential advantages of adopting the XML approach. We have discussed
in particular the benefits in creating reusable namespaced information
components or objects, how these can be created and validated using
subject-specific ontologies and dictionaries, and then how they can
be enhanced with appropriate metadata. The role of communities and global
organizations, such as IUPAC, is crucial to this endeavour. The use
of such XML-based documents opens the prospect of creating avenues for
the reversible flow of data and information between the scientific publication
processes and the discovery, research, and learning processes in molecular
sciences; a reversibility that has hitherto only been achieved with
considerable human effort and expense.
1. T. Berners-Lee and M. Fischetti, Weaving the Web: The Original Design
and the Ultimate Destiny of the World Wide Web, Orion Business Books,
2. D. H. Peapus, H. J. Chiu, and N. Campobasso, Biochemistry, 2001,
3. See www.w3.org for details of all XML
4. G. V. Gkoutos, P. Murray-Rust, H. S. Rzepa, and M. Wright, Internet
J. Chemistry, 2001, article 13.
5. P. Murray-Rust and H. S. Rzepa, 2002, submitted for publication. For
the previous article in this series, see P. Murray-Rust and H. S. Rzepa,
Data Science 2002, 1, 84-98.
6. P. Murray-Rust, H. S. Rzepa, and M. Wright, New J. Chem., 2001,
7. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 1999,
39, 928; P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci.,
2001, 1113; G. Gkoutos, P. Murray-Rust, H. S. Rzepa, and M. Wright,
J. Chem. Inf. Comp. Sci., 2001, 1124.
Peter Murray-Rust is a lecturer at the Unilever Centre for Molecular Informatics, Cambridge
University, United Kingdom. Henry Rzepa <firstname.lastname@example.org>
is a reader in the Department of Chemistry, Imperial College of Science,
Technology, and Medicine, London, United Kingdom.