An Approach for Evolving XML Vocabularies
Using XML Schema
Noah Mendelsohn
IBM Corporation
June 15, 2004
ROUGH DRAFT: DAVID HAS ASKED ME TO GET THIS OUT
ASAP. I’LL PROOFREAD TOMORROW AND
REPUBLISH IF NECESSARY. COMMENTS ARE
WELCOME. THANK YOU.
Introduction
At the May 2004 face-to-face meeting of the XML Schema WG I
discussed some ideas for managing evolving XML vocabularies using XML
Schema. This note is an attempt to
summarize that proposal. These ideas
remain at best incomplete, and they do not necessarily represent the preferred
approach of my employer, IBM. I do hope
they are helpful in facilitating workgroup discussions of XML versioning.
Assumptions & Rationale
In this exposition, the term “vocabulary” is used broadly
and somewhat informally to refer to a user-defined XML language that is used for
some purpose or other. It is assumed
that a given version of such a vocabulary will be describable by some XML
schema that can validate element information items conforming to the
vocabulary. Though informal, this
definition should suffice to introduce the design. Indeed, some of the characteristics of a vocabulary (i.e., what
can be validated, what can be versioned, etc.) become clearer from the design
itself. Note that a vocabulary is not
necessarily a single namespace, and a single namespace may embody all or part
of more than one vocabulary. Changes to
a vocabulary may or may not be made using the same namespace(s) as the
original.
There seems to be wide divergence of opinion as to how
vocabularies will evolve in various scenarios, who will be involved in making
changes, how often the same vocabulary will be revised over time, what sorts of
incompatibilities may be introduced, which users will have which versions of a
schema, how documents that use multiple evolving namespaces are to be managed,
and so on. Some of the specific
proposals that have been discussed make particular assumptions about the
answers to these questions. A thesis of
this proposal is that XML Schema can relatively easily be adapted to support a
quite broad range of answers to the above questions.
Specifically, this proposal is based on the following
assumptions and design goals:
·
The same vocabulary may be versioned or fixed repeatedly. Accordingly, the design should be convenient
to use even after 20 or 30 such revisions.
·
The versioning mechanisms should not presume particular
instance constructions such as <extension> elements.
·
In some but not all cases, forward and/or backward compatibility
is required; i.e., it should be
possible but not essential to write early schemas that will somehow accept
content that is not fully defined until later, and schemas for later versions
will often but not always validate earlier forms of the vocabulary.
·
Conversely, breaking changes should not be
forbidden. For example, it may be that an
early construct is deprecated at some later time, and perhaps completely
disallowed eventually. Likewise, later
versions may introduce constructs that are rejected outright by earlier ones.
·
It should be possible to check for or force various sorts
of forward or backward compatibility when desired.
·
Schemas for versions of a vocabulary may but need not
form a sequence or tree, in which later versions somehow directly reference
particular schema documents for earlier versions. This choice allows for redefinition of the same vocabulary by
multiple organizations or in multiple schema files, as XML does today. In this design, you can rewrite the schema
for my vocabulary to adapt it for your own use (this may or may not be good common
practice, but the versioning mechanisms do not prohibit it).
·
A consequence of the point above is that the schema for
version x is not necessarily expressed as a delta to the schema for version x-1,
if in fact the versions form a sequence at all. Such incremental definition schemes are convenient, but do not
necessarily scale to the case where the same vocabulary is revised 20 or 30
times. In such a case one would need up
to 30 schema documents to assemble the effective schema. Accordingly, this design allows for but does
not require such incremental definition.
·
As discussed above, no unnecessary assumptions should
be made regarding the relationships between vocabularies and XML
Namespaces. Often, a vocabulary will
be expressed primarily as a single XML namespace. Often, to maintain forward and backward compatibility, that same
namespace will be used in subsequent versions as well. Nothing in this design prohibits the use or
coordinated evolution of multiple namespaces, the addition of new namespaces in
subsequent versions of a vocabulary, etc.
Separation of Concerns and Goals of the Core Mechanism
Obviously, this design is targeted at a very flexible set of
assumptions as to how XML is used. This
is achieved through more careful attention to separation of concerns than
seems to be the case in some other proposals.
Indeed, the key feature of this approach is that its core
mechanisms are not fundamentally aimed at defining “versions”, but
rather at these two goals:
1)
Making it convenient to write a schema that, when
validating, distinguishes in the PSVI content that is truly expected from
content that is tolerated but not fully understood (i.e., to accommodate later
changes to the vocabulary), and distinguishes both from content that is to be
completely rejected.
2)
Given two such schemas, making it convenient to check
whether there is any content that one will allow that the other will not
tolerate. This is to facilitate the
situations in which you do want such compatibility.
Note that the above refers to schemas, not schema documents,
so all of the above applies to mixed-namespace scenarios, and is independent of
how many schema documents may have been imported or included to make the
schema. The means by which this is
achieved are discussed below under “Core Design”. We then allow for but do not require the invention and optional
use of mechanisms that:
·
Facilitate the writing of schema documents that embody
incremental changes to earlier versions (e.g., mechanisms like
<xsd:redefine>). Indeed, the
design specifically allows for one to make say 5 or 6 successive changes using
such an incremental mechanism, and then for cleanliness rewrite the next
version of the schema document from scratch.
·
Declare an intensional tree or other graph of schema documents
or schemas to facilitate management of evolving definitions. These would include mechanisms such as
version=”x.y” attributes, attributes asserting that one schema is an evolution
of some other designated schema etc.
The next section discusses the means by which the core
mechanism achieves the two goals set out above.
The Core Design
The two main goals of the core design are listed above. These are achieved as follows:
Distinguishing expected from tolerated or disallowed content using wildcards
XML Schema provides a so-called “wildcard”, which is
expressed in schema documents as <xsd:any>. The fundamental thesis of this design is that content truly
expected by an application can usually be validated by a non-wildcard particle;
wildcards can be used to designate
places in the content model where additional content is tolerated to facilitate
interoperation with other versions of the vocabulary.
The Unique Particle Attribution (UPA) constraint of XML Schema ensures
that applications can tell from the PSVI which content has been validated by a
wildcard and which by an element declaration.
Consider the following two schemas:
Version A:
<xsd:sequence>
  <xsd:element name="x" type="xsd:integer"/>
  <xsd:any minOccurs="0" maxOccurs="unbounded"
           processContents="skip"/>
</xsd:sequence>
Given the instance:
<x>123</x>
<y>abc</y>
this schema will create a PSVI associating <x> with
the element declaration and <y> with the wildcard. Now let’s assume that the reason this instance showed up was that
it was created by an application that knew about schema version B:
Version B:
<xsd:sequence>
  <xsd:element name="x" type="xsd:integer"/>
  <xsd:element name="y" type="xsd:string"/>
  <xsd:any minOccurs="0" maxOccurs="unbounded"
           processContents="skip"/>
</xsd:sequence>
An application validating with version B accepts the same
instance but associates y with the second element declaration in the PSVI. This application presumably has first-class
knowledge of both x and y.
Weak Wildcards avoid UPA conflicts
Use of wildcards in roughly this manner has been considered
on and off for years, but has until now been inhibited by UPA, which prohibits
schemas such as the following:
<xsd:sequence>
  <xsd:element name="x" type="xsd:integer"
               minOccurs="0"/>
  <xsd:any minOccurs="0" maxOccurs="unbounded"
           processContents="skip"/>
</xsd:sequence>
The above causes a UPA violation, because a single element
<x> matches either the first or the second particle.
The schema workgroup has recently given serious
consideration, for other reasons, to changing wildcards to behave in a
so-called weak manner. This would not
change the behavior of existing schemas, but would allow schemas such as the
one shown above. In such cases, the
explicit element declaration would always take precedence, thus removing the
ambiguity. This proposal presumes the existence
of weak wildcards in the schema design.
Indeed, this is the only incompatible change that is absolutely required
in comparison to XML Schema 1.0.
Given that we have weak wildcards for other reasons, it
becomes possible to use wildcards freely at any point in a content model. Thus, this proposal allows at user
discretion for extension of vocabularies not just at the end of each model, but
anywhere that a weak wildcard can be used.
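As an illustration of extension at interior points, the following sketch places a wildcard between two declared elements as well as at the end. It assumes the weak wildcard semantics described above, under which the declaration for y takes precedence whenever both it and a wildcard could match; under XML Schema 1.0 this content model would be rejected as a UPA violation. The element names are illustrative only.

```xml
<xsd:sequence>
  <xsd:element name="x" type="xsd:integer"/>
  <!-- tolerated content from other versions may appear between x and y -->
  <xsd:any minOccurs="0" maxOccurs="unbounded" processContents="skip"/>
  <xsd:element name="y" type="xsd:string" minOccurs="0"/>
  <!-- and again after y -->
  <xsd:any minOccurs="0" maxOccurs="unbounded" processContents="skip"/>
</xsd:sequence>
```

Given an instance containing <x>, then an undeclared <z>, then <y>, one would expect <z> to be attributed to the first wildcard and <y> to its element declaration, although the exact attribution rules depend on how weak wildcards are finally specified.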
Possible further changes to facilitate wildcard use
The above analysis shows how weak wildcards can be used to
facilitate construction of extensible content models. In fact there are at least two sorts of further changes to XML
Schema that we would want to consider in conjunction with this proposal:
·
Our current wildcards allow content from any namespace,
other namespaces, or a list of designated namespaces. These may not necessarily be the most useful options for our
versioning scenarios. One proposal is
to introduce a wildcard that would validate any element not explicitly declared
elsewhere in the schema (regardless of namespace, or perhaps intersected with
the existing namespace controls). This
supports an idiom in which: if I know
about an element and I don’t explicitly call for it, that means I don’t want
it. I personally think we would want to
use something like this in the schema for schemas.
·
Purely as a convenience, we could introduce wildcard defaulting
mechanisms into the transfer syntax.
So, we might have something like:
  <xsd:schema … defaultExtensionModel="{openAtEnd, openEverywhere}">
where “openAtEnd”
causes by default a weak wildcard to appear at the end of every content model,
and “openEverywhere” puts one between adjacent particles. This is an admittedly vague proposal, but it’s just a convenience
and we can decide later what if anything we want along these lines.
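To make the two suggested values more concrete, here is one possible reading of what they might expand to for a content model that declares x followed by y. Neither the attribute nor its expansion rules are defined anywhere yet, so the fragments below are only a guess at the intent. In particular, whether openEverywhere would also add wildcards before the first particle and after the last one is an open question; the sketch follows the literal wording and inserts one only between particles.

```xml
<!-- As written by the schema author -->
<xsd:sequence>
  <xsd:element name="x" type="xsd:integer"/>
  <xsd:element name="y" type="xsd:string"/>
</xsd:sequence>

<!-- Assumed effective model under openAtEnd -->
<xsd:sequence>
  <xsd:element name="x" type="xsd:integer"/>
  <xsd:element name="y" type="xsd:string"/>
  <xsd:any minOccurs="0" maxOccurs="unbounded" processContents="skip"/>
</xsd:sequence>

<!-- Assumed effective model under openEverywhere -->
<xsd:sequence>
  <xsd:element name="x" type="xsd:integer"/>
  <xsd:any minOccurs="0" maxOccurs="unbounded" processContents="skip"/>
  <xsd:element name="y" type="xsd:string"/>
</xsd:sequence>
```

The choice of processContents="skip" here simply mirrors the earlier examples; a real defaulting mechanism would need to say which value is used.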
Comparing two schemas to determine subsumption
The sections above outline an approach that uses weak
wildcards to facilitate the construction of schemas that distinguish among tolerated,
expected, and disallowed content.
The second goal is to facilitate checking of whether one schema
will fail to tolerate any of the content allowed by another.
The schema workgroup has recently devoted significant effort
to proving that the subsumption relation between any two content models can be
tractably checked. That work was done
to support a simplification of the rules for complexType refinement.
This design proposes to use the same subsumption algorithm
to achieve our second goal.
Specifically, given an element declaration from schema 1 and a declaration
for a similarly named element from schema 2, the subsumption algorithm allows
one to determine whether any content accepted by one of the schemas will be
rejected by the other.
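As a small worked case, consider the Version A and Version B content models shown earlier. Every instance that B accepts is also accepted by A, since A's wildcard tolerates y and anything that follows it. The reverse does not hold: B requires y, so an instance containing only x is accepted by A and rejected by B. A checker built on the subsumption algorithm would be expected to report that asymmetry. If the author of B wanted to remove it, one option, sketched here as a hypothetical revision and assuming weak wildcards, is to make y optional:

```xml
<!-- Version B, revised (hypothetical): y is optional, so an instance
     containing only x is accepted, as it is by Version A.
     This relies on weak wildcards: under XML Schema 1.0 an optional y
     followed by the wildcard violates UPA. -->
<xsd:sequence>
  <xsd:element name="x" type="xsd:integer"/>
  <xsd:element name="y" type="xsd:string" minOccurs="0"/>
  <xsd:any minOccurs="0" maxOccurs="unbounded" processContents="skip"/>
</xsd:sequence>
```

Even the revised B remains stricter than A in one respect: where y is present it must be a valid xsd:string, whereas A's skip wildcard accepts a y with any content. Whether that difference is acceptable is a judgment for the schema authors; the value of the check is that it surfaces the difference before deployment.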
This design does not require that such checking be
done. The assumption is that in
many scenarios application developers will wish to enforce such discipline
between evolving versions, and we show how development tools can do the necessary
checking as schemas are developed and before they are deployed. Conversely, if a user wishes for whatever
reason to introduce an intentional incompatibility, then a suitably written
checker can be used to ensure that only the intended incompatibilities have
been created.
Note also that nothing in the design as presented so far
states that versions must evolve in linear or tree-like form. Indeed, completely independent organizations
can create schemas that purport to validate similar or identical vocabularies,
can check the degree to which the goal is achieved, and can deploy as
appropriate. Debug versions can be
checked against production versions, and so on.
Optional Features
As implied above, optional layers can be defined to meet
additional goals such as the following:
·
Facilitating the creation of one schema document based
on another, particularly if the changes are small. We should see whether or not redefine meets the whole need.
·
Facilitating the automatic insertion of weak wildcards
to create content models that are by default open.
·
Supporting some sort of standard labeling for versions
(e.g. version=”x.y”). Note that the
mechanisms above are oblivious to such labeling, but various schema document
management and deployment systems may find it useful.
While not a separate layer, we should also consider the
proposal above to:
·
Provide new options on wildcards to accept only content
likely to be used in evolution of a given vocabulary (e.g. from the current
namespace but not explicitly declared in the current schema)
Conclusion
Careful readers will note that the only truly required
incompatibility with XML Schema 1.0 is the introduction of weak wildcards, a
change that is already contemplated.
Other changes to facilitate version management or incremental
development of schemas may also prove desirable, but are not strictly
required. Accordingly, an interesting
feature of this design is that it seems to achieve a broader range of goals
than many others, using a single design change that has already been
contemplated for other reasons. Indeed,
that one change does not invalidate any existing schemas, but merely allows the
use of wildcards in situations where they were previously prohibited.
Among the next steps I suggest are:
·
Check this proposal against a broad range of use cases
and with our user community.
·
Check with the schema implementation community to gauge
their willingness to deploy weak wildcards and to deal with the attendant
incompatibility with XML Schema 1.0.
·
Consider the optional features above, whether they are
worth implementing, and if so whether implementers of the schema language will
support them.