Introduction

Structured data conforms to a schema, and XML documents are no exception. The XML 1.0 standard defines schemas using a subset of SGML's Document Type Definition (DTD). The DTD defines which elements appear in a document, which attributes can be assigned to an element, and which elements appear inside other elements.

DTDs are simplistic and limited. Their successor, XML Schema, will allow designers to specify the acceptable datatypes for elements and attributes. Since XML Schema is not finished yet, dg should use DTDs, which are much more standard and supported by practically all XML tools.

Developers should define DTDs for all internal XML documents. Designing a schema is an insightful exercise during design. Validating documents also catches bugs at development time and saves developers from writing validation code themselves. Furthermore, DTDs contain valuable project documentation, particularly if designers comment them liberally.

Datatypes

See here

Creating DTDs

Few good explanations of XML 1.0 DTDs exist; so far the best one I have found is in Chapter 5 of XML: A Primer. Information, of course, is also available on the web. There are some DTD examples in XMLSpy's Samples folder.

Including a DTD in Another DTD

Including a DTD inside another DTD facilitates the modularization of schemas: large schemas can be broken down into smaller ones. Use external parameter entities to include a document inside a DTD. The main concern with modularization is naming. DTDs do not support XML 1.0 namespaces, so all element names must be unique across the included DTDs.

This example shows how to include a DTD in another DTD.

first.dtd:
second.dtd:
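As a minimal sketch of the technique (the element names Address, Street, City, Customer and Name are hypothetical and not taken from the original listings), the two files might look like this. second.dtd declares an external parameter entity that points at first.dtd and then references it, which pulls in every declaration from first.dtd:

```xml
<!-- first.dtd (hypothetical content) -->
<!ELEMENT Address (Street, City)>
<!ELEMENT Street (#PCDATA)>
<!ELEMENT City (#PCDATA)>

<!-- second.dtd (hypothetical content) -->
<!-- Declare an external parameter entity whose system identifier is first.dtd -->
<!ENTITY % first SYSTEM "first.dtd">
<!-- Reference the entity; the declarations from first.dtd are included here -->
%first;
<!ELEMENT Customer (Name, Address)>
<!ELEMENT Name (#PCDATA)>
```

A document that declares `<!DOCTYPE Customer SYSTEM "second.dtd">` would then be validated against the declarations from both files. Because all element names share one scope, neither file may declare an element that the other already declares.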
The second DTD includes the first DTD and adds the

Linking to DTDs

Documents can refer to DTDs via HTTP. Since support for relative URIs is still weak, absolute URLs are commonly used to link to DTDs. Several parsers, such as Oracle's Java parser, let you specify the base URL of the DTD before parsing a document so that relative URIs can be used. This is rather cumbersome, however, and the standard DOM API doesn't support it.

One way to deal with this is to create a corporate-wide server name that acts as a DTD repository. For example:

Example DTD:

Example Document:

<!DOCTYPE Doc SYSTEM "http://dtdrepository/dtd/doc.dtd">

Note: If you're using IE5, be aware that it caches DTDs. If you change a DTD, IE5 does not pick up the new version; you have to clear IE5's cache or close all open Explorer windows and components.

PCDATA and CDATA

PCDATA stands for "parsed character data"; CDATA stands for "character data" that contains no markup. PCDATA is used for element content and CDATA is used for attributes.

CDATA Blocks

Sometimes it's annoying to encode element content that contains less-than signs, ampersands, and so on. This is the purpose of a CDATA block. A CDATA block is a "wrapper" around data that would otherwise cause the XML parser to choke. For example, a CDATA block can surround an HTML snippet. CDATA blocks cannot be used in attribute values. A CDATA block starts with <![CDATA[ and ends with ]]>.
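For illustration, here is a sketch of a CDATA block wrapping an HTML snippet. The Description element is a hypothetical name; only the `<![CDATA[` and `]]>` delimiters are part of XML itself:

```xml
<Description><![CDATA[
  <p>Use the <b>&</b> and <b><</b> characters freely here;
  the parser passes everything through as plain text.</p>
]]></Description>
```

Without the CDATA block, the same content would have to be written with `&lt;` and `&amp;` escapes. The one sequence that cannot appear inside a CDATA block is `]]>`, since it ends the block.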
Mixed Content (elements and content)

DANGER! If you want mixed content (elements and text), the DTD must contain a declaration of the form <!ELEMENT Body (#PCDATA | Font)*>, as required for a document such as:

<Body>This is one text node <Font face="helvetica">Hello</Font> This is another text node </Body>

I don't know why the DTD is so limiting, but it is. This means that elements that have #PCDATA can be extended to have elements, but at the cost of making the DTD less precise: you can't say that an element can only have, say, one <Foo> tag. All you can say is that the element can have any number of <Foo> tags and content.

The alternative is to put the text in its own element:

<Body><Data>This is one text node</Data> <Font face="helvetica">Hello</Font> </Body>

The bottom line is that you should plan ahead if you think an element may have mixed content and it doesn't make sense for the element to have multiple text nodes. Unless you want a headache, put (#PCDATA) in its own node. Generally, mixed content is a very bad thing. If you're not seeing (#PCDATA) all by itself, you should worry about it. Also, (#PCDATA) by itself implies that the element should not be modified down the road to contain sub-elements... unless you want a mess on your hands.

Basic DTD Guidelines
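One guideline that follows from the mixed-content discussion above is to give text its own element. As a sketch, the two designs might be declared as follows. The content models and the #IMPLIED default on face are assumptions, since only the instance markup appears in the text above.

Mixed content, where the number and order of Font children cannot be constrained:

```xml
<!DOCTYPE Body [
  <!-- XML 1.0 only allows mixed content in the form (#PCDATA | name | ...)* -->
  <!ELEMENT Body (#PCDATA | Font)*>
  <!ELEMENT Font (#PCDATA)>
  <!ATTLIST Font face CDATA #IMPLIED>
]>
<Body>This is one text node <Font face="helvetica">Hello</Font> This is another text node </Body>
```

Text in its own element, where the content model can state exactly what is allowed:

```xml
<!DOCTYPE Body [
  <!-- Exactly one Data element, optionally followed by one Font element -->
  <!ELEMENT Body (Data, Font?)>
  <!ELEMENT Data (#PCDATA)>
  <!ELEMENT Font (#PCDATA)>
  <!ATTLIST Font face CDATA #IMPLIED>
]>
<Body><Data>This is one text node</Data> <Font face="helvetica">Hello</Font> </Body>
```

In the second design a validator rejects a Body with two Data elements or with stray text between the children, which the mixed-content declaration cannot express.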
Tools

XML Authority seems to be the best schema tool available. It saves schemas in multiple formats (DCD, DTD, XML-Schema, and others) and can display them graphically. XMLSpy can also process DTDs and validate XML documents. eXcelon ships with a schema tool called eXcelon studio, although it does not save to DTDs.

ID and IDREF

ID and IDREF are XML 1.0's mechanism for building graph structures within a single document. An element is marked with an ID attribute. An element refers to another element via an IDREF attribute (or to multiple elements with an IDREFS attribute). It all works with attributes.

An ID must be a valid XML name; XML names cannot start with a digit or contain spaces. Each ID value must be unique within the document. An IDREF value is a single ID, and an IDREFS value is a space-delimited list of IDs. When parsers encounter ID attributes, they may index the document by ID so that look-ups are fast.

<!DOCTYPE Plan [
<!ELEMENT Plan (Step|StepLink)*>
<!ELEMENT Step EMPTY>
<!ATTLIST Step id ID #REQUIRED>
<!ELEMENT StepLink EMPTY>
<!ATTLIST StepLink steps IDREFS #IMPLIED>
]>
<Plan>
<Step id="step1"/>
<Step id="step2"/>
<StepLink steps="step1 step2"/>
</Plan>
XML Namespaces

Elements are defined within a global scope. This becomes a problem when combining elements from multiple documents, because name collisions are hard to avoid. Namespaces were created to deal with this problem. Without namespaces, DTD designers must add a prefix to every element name, which obfuscates the document. Unfortunately, namespaces don't work with DTDs. If you want to use DTD validation, you can't use XML namespaces.

Attributes are defined within the scope of an element. Two elements in the same document can have attributes with identical names.

Attributes vs. Elements

I prefer to use attributes whenever I can because I like the notion of containment. There is no clear-cut way to decide when to use an element or an attribute. The main decision factors are personal taste and whether your tool set provides support for attributes. The following paragraphs discuss the pros and cons of attributes and elements.

The Argument for Elements

More people know about element content than about attributes; attributes are more obscure. A document with elements is generally easier to read than a document with a lot of attributes. Elements support long text strings, and linefeeds can be included in the text, unlike attributes.

The Argument for Attributes

ID and IDREF must be attributes. ID/IDREF provide useful functionality, because a DTD validator can check the validity of ID references; this functionality is not available to elements. Using attributes results in smaller files. It is also easier to read attributes from the DOM.

Use attributes when you can, and use elements when you have to.

Element content can be ambiguous. For example, if you have

How do you know that the content is the book's title? It's a matter of documentation. This data is better represented inside an attribute called title, or at least inside the

Use attributes when:
Use elements when:
Use element content when:
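To make the ambiguity discussed above concrete, here is a sketch using a hypothetical Book element. The Library wrapper and the Title child element are assumed names, and the book title is borrowed from the primer recommended under Creating DTDs:

```xml
<Library>
  <!-- Ambiguous: nothing says whether the text is a title, an author or a summary -->
  <Book>XML: A Primer</Book>

  <!-- Clearer: the attribute name documents what the value means -->
  <Book title="XML: A Primer"/>

  <!-- Also clearer: a named child element carries the same information -->
  <Book><Title>XML: A Primer</Title></Book>
</Library>
```

The attribute form is the most compact of the three. The child-element form leaves room to break the title down further later on.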
eXcelon Persistence Considerations
The store must allocate space to represent a hierarchy of elements. Since attributes only contain text, they require much less overhead.

More Information

http://www.oasis-open.org/cover/elementAttr9804.html

Shamelessly stolen from XML Authority's documentation:

Elements and attributes are both containers for information. Many times the choice between an element and an attribute seems very arbitrary, almost a matter of style. While the choice may indeed be arbitrary in some cases, the 'typical' roles of elements and attributes and the different types of content models and constraints these two containers support may tip the scales in one direction or the other.

One way to look at elements and attributes suggests that elements are the 'real' containers of data, while attributes annotate elements with additional information describing the content of the element. In the case of empty elements, attributes provide additional information about why the element is present and possibly about what content it represents. This approach has been used extensively in HTML and a variety of other document-oriented schemas, and works very well with style tools like Cascading Style Sheets, which make this structural assumption a key part of their display model.

Using elements to store content and restricting attributes to annotation does have a few drawbacks, however. Element markup is much more verbose than attribute markup, with start and end tags rather than a name and some quotes. Child elements can provide more flexibility, but that flexibility isn't always necessary. In some situations, like the exchange of large quantities of small chunks of information between databases, attributes may be more efficient containers.

The 'intrinsic' differences between elements and attributes in XML 1.0 tend to define the limits of what the two containers can be used for. The most fundamental difference, which will likely continue to hold through future iterations of XML development, is that elements can contain child elements as well as content, while attributes can only hold content. If it seems at all likely that you'll need to break down the information stored in a container, make that container an element.

Attributes do have some advantages over elements, however. In XML 1.0, and in the XML Schema working draft, only attributes may have default values assigned to them by the schema. In XML 1.0, attributes also have much stronger constraints: you can limit the acceptable values of an attribute to a class of values (notation or entity names) or provide an enumerated list of acceptable values. XML Schemas may change this, and XML Authority provides a larger list of possible constraints (data types) for elements than is available in XML 1.0.

Another field to explore before settling on whether to use elements or attributes is your processing software and parser. While element and attribute information is equally accessible in a tree-based interface (like those using the W3C's Document Object Model - DOM), developers working with stream-based interfaces (like the Simple API for XML - SAX) may have their own set of preferences for document structures.

For more resources describing the issues involved in element and attribute usage, see Robin Cover's SGML/XML Elements versus Attributes article.
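The attribute constraints mentioned above (default values and enumerated lists of acceptable values) can be sketched in a DTD as follows. The Task element and its attribute names are hypothetical:

```xml
<!DOCTYPE Task [
  <!ELEMENT Task (#PCDATA)>
  <!ATTLIST Task
    priority (low | medium | high) "medium"
    owner    CDATA                 #IMPLIED
    id       ID                    #REQUIRED>
]>
<!-- priority is omitted below, so the DTD default "medium" applies -->
<Task id="t1">Review the DTD</Task>
```

A validating parser would reject priority="urgent" because the value is not in the enumerated list. XML 1.0 DTDs offer no comparable way to restrict or default the text content of an element.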