An Approach for Evolving XML Vocabularies Using XML Schema
Noah Mendelsohn
IBM Corporation
June 15, 2004
ROUGH DRAFT: DAVID HAS ASKED ME TO GET THIS OUT ASAP. I’LL PROOFREAD TOMORROW AND REPUBLISH IF
NECESSARY. COMMENTS ARE
WELCOME. THANK YOU.
Introduction
At the May 2004 face-to-face meeting of the XML Schema WG, I
discussed some ideas for managing evolving XML vocabularies using XML
Schema. This note is an attempt to summarize that proposal. These
ideas remain at best incomplete, and they do not necessarily represent the
preferred approach of my employer, IBM. I do hope they are helpful in
facilitating workgroup discussions of XML versioning.
Assumptions & Rationale
In this exposition, the term “vocabulary” is used broadly and
somewhat informally to refer to a user-defined XML language that is used for
some purpose or other. It is
assumed that a given version of such a vocabulary will be describable by some
XML schema that can validate element information items conforming to the
vocabulary. Though informal, this definition should suffice to introduce
the design. Indeed, some of the characteristics of a vocabulary (i.e., what
can be validated, what can be versioned, etc.) become clearer from the
design itself.
Note that a vocabulary is not necessarily a single namespace, and a
single namespace may embody all or part of more than one vocabulary. Changes to a vocabulary may or may not
be made using the same namespace(s) as the original.
There seems to be wide divergence of opinion as to how
vocabularies will evolve in various scenarios, who will be involved in making
changes, how often the same vocabulary will be revised over time, what sorts of
incompatibilities may be introduced, which users will have which versions of a
schema, how documents that use multiple evolving namespaces are to be managed,
and so on. Some of the specific
proposals that have been discussed make particular assumptions about the answers
to these questions. A thesis of
this proposal is that XML Schema can relatively easily be adapted to support a
quite broad range of answers to the above questions.
Specifically, this proposal is based on the following
assumptions and design goals:
· The same vocabulary may be versioned or fixed repeatedly. Accordingly, the
design should be convenient to use even after 20 or 30 such revisions.
· The versioning mechanisms should not presume particular instance
constructions such as <extension> elements.
· In some but not all cases, forward and/or backward compatibility is
required: i.e., it should be possible but not essential to write early
schemas that will somehow accept content that is not fully defined until
later, and schemas for later versions will often but not always validate
earlier forms of the vocabulary.
· Conversely, breaking changes should not be forbidden. For example, it may
be that an early construct is deprecated at some later time, and perhaps
completely disallowed eventually. Likewise, later versions may introduce
constructs that are rejected outright by earlier ones.
· It should be possible to check for or force various sorts of forward or
backward compatibility when desired.
· Schemas for versions of a vocabulary may but need not form a sequence or
tree, in which later versions somehow directly reference particular schema
documents for earlier versions. This choice allows for redefinition of the
same vocabulary by multiple organizations or in multiple schema files, as
XML does today. In this design, you can rewrite the schema for my vocabulary
to adapt it for your own use (this may or may not be good common practice,
but the versioning mechanisms do not prohibit it).
· A consequence of the point above is that the schema for version x is not
necessarily expressed as a delta to the schema for version x-1, if in fact
the versions form a sequence at all. Such incremental definition schemes are
convenient, but do not necessarily scale to the case where the same
vocabulary is revised 20 or 30 times. In such a case one would need up to 30
schema documents to assemble the effective schema. Accordingly, this design
allows for but does not require such incremental definition.
· As discussed above, no unnecessary assumptions should be made regarding
the relationships between vocabularies and XML Namespaces. Often, a
vocabulary will be expressed primarily as a single XML namespace. Often, to
maintain forward and backward compatibility, that same namespace will be
used in subsequent versions as well. Nothing in this design prohibits the
use or coordinated evolution of multiple namespaces, the addition of new
namespaces in subsequent versions of a vocabulary, etc.
Separation of Concerns and Goals of the Core Mechanism
Obviously, this design is targeted at a very flexible set of
assumptions as to how XML is used. This is achieved through more careful
attention to separation of concerns than seems to be the case in some other
proposals. Indeed, the key feature of this approach is that its core
mechanisms are not fundamentally aimed at defining “versions”, but rather
at these two goals:
1) Making it convenient to write a schema that, when validating,
distinguishes in the PSVI content that is truly expected from content that
is tolerated but not fully understood (i.e., to tolerate later changes to
the vocabulary), and distinguishes both from content that is to be
completely rejected.
2) Given two such schemas, making it convenient to check
whether there is any content that one will allow that the other will not
tolerate. This is to facilitate the
situations in which you do want such compatibility.
Note that the above refers to schemas, not schema documents,
so all of the above applies to mixed-namespace scenarios, regardless of how
many schema documents may have been imported or included to make the
schema. The means by which this is achieved are discussed below under
“The Core Design”. We then allow for but do not require the
invention and optional use of mechanisms that:
· Facilitate the writing of schema documents that embody incremental
changes to earlier versions (e.g., mechanisms like <xsd:redefine>). Indeed,
the design specifically allows for one to make, say, 5 or 6 successive
changes using such an incremental mechanism, and then for cleanliness
rewrite the next version of the schema document from scratch.
· Declare an intensional tree or other graph of schema documents or schemas
to facilitate management of evolving definitions. These would include
mechanisms such as version="x.y" attributes, attributes asserting that one
schema is an evolution of some other designated schema, etc.
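As an illustration of the first kind of optional mechanism, the following
sketch uses the existing XML Schema 1.0 <xsd:redefine> element to express one
version of a schema document as a delta on an earlier one. The file name
record-v1.xsd, the type name RecordType, and the version label are all
hypothetical, and the earlier document is assumed to declare RecordType, in no
target namespace, as a sequence containing only <x>.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" version="1.1">
  <!-- Pull in the (hypothetical) earlier schema document and revise one type -->
  <xsd:redefine schemaLocation="record-v1.xsd">
    <xsd:complexType name="RecordType">
      <xsd:complexContent>
        <!-- base refers to RecordType as defined in record-v1.xsd -->
        <xsd:extension base="RecordType">
          <xsd:sequence>
            <!-- content added in this version, appended after <x> -->
            <xsd:element name="y" type="xsd:string"/>
          </xsd:sequence>
        </xsd:extension>
      </xsd:complexContent>
    </xsd:complexType>
  </xsd:redefine>
</xsd:schema>
```

A from-scratch rewrite of the same version would simply declare RecordType
with both <x> and <y> and drop the <xsd:redefine>. Because the core mechanism
is concerned with schemas rather than schema documents, either authoring style
should serve equally well.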
The next section discusses the means by which the core
mechanism achieves the two goals set out above.
The Core Design
The two main goals of the core design are listed above. These are achieved as follows:
Distinguishing expected from tolerated or disallowed content using wildcards
XML Schema provides a so-called “wildcard”, which is
expressed in schema documents as <xsd:any>. The fundamental thesis of this
design is that content truly expected by an application can usually be
validated by a non-wildcard particle; wildcards can be used to designate
places in the content model where additional content is tolerated to
facilitate interoperation with other versions of the vocabulary.
The Unique Particle Attribution Constraint of XML schema
ensures that applications can tell from the PSVI which content has been
validated by a wildcard and which by an element declaration. Consider the following two schemas:
Version A:
<xsd:sequence>
  <xsd:element name="x" type="xsd:integer"/>
  <xsd:any minOccurs="0" maxOccurs="unbounded" processContents="skip"/>
</xsd:sequence>
Given the instance:
<x>123</x>
<y>abc</y>
this schema will create a PSVI associating <x> with the
element declaration and <y> with the wildcard. Now let’s assume that the
reason this instance showed up was that it was created by an application
that knew about schema version B:
Version B:
<xsd:sequence>
  <xsd:element name="x" type="xsd:integer"/>
  <xsd:element name="y" type="xsd:string"/>
  <xsd:any minOccurs="0" maxOccurs="unbounded" processContents="skip"/>
</xsd:sequence>
An application validating with version B accepts the same
instance but associates y with the second element declaration in the PSVI. This application presumably has
first-class knowledge of both x and y.
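To make the three-way distinction concrete, here is a hedged walk-through of
how a few instances would presumably be attributed under the two versions
above. The element <z> is a hypothetical addition from some still later
version; the comments describe the expected outcome and are not output from
any tool.

```xml
<!-- Instance 1: content from a hypothetical later version that adds <z> -->
<x>123</x>
<y>abc</y>
<z>new</z>
<!-- Version A: <x> expected (element declaration);
                <y> and <z> tolerated (wildcard) -->
<!-- Version B: <x> and <y> expected (element declarations);
                <z> tolerated (wildcard) -->

<!-- Instance 2: <x> is missing -->
<y>abc</y>
<!-- Version A and Version B: rejected, since <x> is required
     and must come first -->
```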
Weak Wildcards avoid UPA conflicts
Use of wildcards in roughly this manner has been considered
on and off for years, but has until now been inhibited by UPA, which prohibits
schemas such as the following:
<xsd:sequence>
  <xsd:element name="x" type="xsd:integer" minOccurs="0"/>
  <xsd:any minOccurs="0" maxOccurs="unbounded" processContents="skip"/>
</xsd:sequence>
The above causes a UPA violation, because a single element
<x> could be matched by either the first or the second particle.
The schema workgroup has recently given serious
consideration, for other reasons, to changing wildcards to behave in a so-called
weak manner. This would not change
the behavior of existing schemas, but would allow schemas such as the one shown
above. In such cases, the explicit
element declaration would always take precedence, thus removing the ambiguity.
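Assuming weak wildcards behave as described, with the element declaration
always winning when both particles could match, the schema above would
presumably attribute instances as follows. The element <z> is hypothetical,
and the treatment of the last instance is an inference from the precedence
rule rather than something the workgroup has specified.

```xml
<!-- Instance 1 -->
<x>123</x>
<z>new</z>
<!-- <x>: element declaration (takes precedence over the wildcard) -->
<!-- <z>: wildcard -->

<!-- Instance 2: the optional <x> is omitted -->
<z>new</z>
<!-- <z>: wildcard -->

<!-- Instance 3: <x> does not satisfy xsd:integer -->
<x>abc</x>
<!-- <x> is still attributed to the element declaration, so the instance is
     presumably rejected rather than tolerated via the wildcard -->
```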
This proposal presumes the existence of weak
wildcards in the schema design.
Indeed, this is the only incompatible change that is absolutely required
in comparison to XML Schema 1.0.
Given that we have weak wildcards for other reasons, it
becomes possible to use wildcards freely at any point in a content model. Thus, this proposal allows at user
discretion for extension of vocabularies not just at the end of each model, but
anywhere that a weak wildcard can be used.
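For example, a content model could leave room for extension between two
declared elements. The fragment below is a sketch that is legal only under the
assumed weak-wildcard rule; under XML Schema 1.0 it violates UPA, because <y>
could match either the wildcard or its own declaration.

```xml
<xsd:sequence>
  <xsd:element name="x" type="xsd:integer"/>
  <!-- extension point in the middle of the content model -->
  <xsd:any minOccurs="0" maxOccurs="unbounded" processContents="skip"/>
  <xsd:element name="y" type="xsd:string"/>
</xsd:sequence>
```

Given <x>1</x><note/><y>abc</y>, the hypothetical <note> element would be
attributed to the wildcard, and <x> and <y> to their declarations.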
Possible further changes to facilitate wildcard use
The above analysis shows how weak wildcards can be used to
facilitate construction of extensible content models. In fact there are at least two sorts of
further changes to XML schema that we would want to consider in conjunction with
this proposal:
· Our current wildcards allow content from any namespace, other namespaces,
or a list of designated namespaces. These may not necessarily be the most
useful options for our versioning scenarios. One proposal is to introduce a
wildcard that would validate any element not explicitly declared elsewhere
in the schema (regardless of namespace, or perhaps intersected with the
existing namespace controls). This supports an idiom in which, if I know
about an element and I don’t explicitly call for it, that means I don’t
want it. I personally think we would want to use something like this in the
schema for schemas.
· Purely as a convenience, we could introduce wildcard defaulting mechanisms
into the transfer syntax. So, we might have something like:
<xsd:schema … defaultExtensionModel="{openAtEnd, openEverywhere}">
where “openAtEnd” causes by default a weak wildcard to appear at the end of
every content model, and “openEverywhere” puts one between particles. This
is an admittedly vague proposal, but it’s just a convenience and we can
decide later what if anything we want along these lines.
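Since the defaulting proposal is only loosely specified, the following is just
one possible reading of it. The attribute name defaultExtensionModel comes
from the text above; the idea that it expands to skip-processed, unbounded
weak wildcards is an assumption made for illustration.

```xml
<!-- As authored, in a schema document declaring
     defaultExtensionModel="openAtEnd" -->
<xsd:sequence>
  <xsd:element name="x" type="xsd:integer"/>
  <xsd:element name="y" type="xsd:string"/>
</xsd:sequence>

<!-- Effective content model under "openAtEnd" (assumed expansion) -->
<xsd:sequence>
  <xsd:element name="x" type="xsd:integer"/>
  <xsd:element name="y" type="xsd:string"/>
  <xsd:any minOccurs="0" maxOccurs="unbounded" processContents="skip"/>
</xsd:sequence>
```

Under “openEverywhere”, a similar wildcard would additionally appear between
<x> and <y>; whether one would also be placed before <x> is left open by the
proposal.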
Comparing two schemas to determine subsumption
The sections above outline an approach that uses weak
wildcards to facilitate the construction of schemas that distinguish tolerated
content from expected content, and both from disallowed content.
The second goal is to facilitate checking of whether one schema
will fail to tolerate any of the content allowed by another.
The schema workgroup has recently devoted significant effort
to proving that the subsumption relation between any two content models can be
tractably checked. That work was
done to support a simplification of the rules for complexType refinement.
This design proposes to use the same subsumption algorithm to
achieve our second goal.
Specifically, given an element declaration from schema 1 and a
declaration for a similarly named element from schema 2, the subsumption
algorithm allows one to determine whether any content accepted by one of the
schemas will be rejected by the other.
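As a hedged illustration of what such a check might report, consider Version A
from earlier alongside a hypothetical Version C that declares <y> as an
optional string (legal only with weak wildcards). The conclusions in the
comments are hand-derived from the content models, not produced by the
workgroup's algorithm.

```xml
<!-- Version A -->
<xsd:sequence>
  <xsd:element name="x" type="xsd:integer"/>
  <xsd:any minOccurs="0" maxOccurs="unbounded" processContents="skip"/>
</xsd:sequence>

<!-- Version C (hypothetical) -->
<xsd:sequence>
  <xsd:element name="x" type="xsd:integer"/>
  <xsd:element name="y" type="xsd:string" minOccurs="0"/>
  <xsd:any minOccurs="0" maxOccurs="unbounded" processContents="skip"/>
</xsd:sequence>

<!-- Everything C accepts, A accepts: <y> and any later elements
     fall to A's wildcard. -->
<!-- The converse fails. A accepts the instance below, because its wildcard
     skips <y>; C attributes <y> to the xsd:string declaration and rejects
     the element child. -->
<x>1</x>
<y><nested/></y>
```

A checker built on the subsumption algorithm would thus flag C as narrower
than A, leaving the schema author to decide whether that incompatibility is
intended.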
This design does not require that such checking be
done. The assumption is that in
many scenarios application developers will wish to enforce such discipline
between evolving versions, and we show how development tools can do the
necessary checking as schemas are developed and before they are deployed. Conversely, if a user wishes for
whatever reason to introduce an intentional incompatibility, then a suitably
written checker can be used to ensure that only the intended incompatibilities
have been created.
Note also that nothing in the design as presented so far
states that versions must evolve in linear or tree-like form. Indeed, completely independent
organizations can create schemas that purport to validate similar or identical
vocabularies, can check the degree to which the goal is achieved, and can deploy
as appropriate. Debug versions can
be checked against production versions, and so on.
Optional Features
As implied above, optional layers can be defined to meet
additional goals such as the following:
· Facilitating the creation of one schema document based on another,
particularly if the changes are small. We should see whether or not
<xsd:redefine> meets the whole need.
· Facilitating the automatic insertion of weak wildcards to create content
models that are by default open.
· Supporting some sort of standard labeling for versions (e.g.,
version="x.y"). Note that the mechanisms above are oblivious to such
labeling, but various schema document management and deployment systems may
find it useful.
While not a separate layer, we should also consider the
proposal above to:
· Provide new options on wildcards to accept only content likely to be used
in evolution of a given vocabulary (e.g., from the current namespace but not
explicitly declared in the current schema).
Conclusion
Careful readers will note that the only truly required
incompatibility with XML Schema 1.0 is the introduction of weak wildcards, a
change that is already contemplated.
Other changes to facilitate version management or incremental development
of schemas may also prove desirable, but are not strictly required. Accordingly, an interesting feature of
this design is that it seems to achieve a broader range of goals than many
others, using a single design change that has already been contemplated for
other reasons. Indeed, that one
change does not invalidate any existing schemas, but merely allows the use of
wildcards in situations where they were previously prohibited.
Among the next steps I suggest are:
· Check this proposal against a broad range of use cases and with our user
community.
· Check with the schema implementation community to gauge their willingness
to deploy weak wildcards and to deal with the attendant incompatibility with
XML Schema 1.0.
· Consider the optional features above, whether they are worth implementing,
and if so whether implementers of the schema language will support them.