From 2f36224e54d7d9875da4c22740f7f51632d1a265 Mon Sep 17 00:00:00 2001
From: Daniel Veillard
Date: Fri, 2 Mar 2001 17:36:21 +0000
Subject: [PATCH] - parser.c: one must report spaces even if the Dtd element
 content proves that this is not part of the element content. -
 result/valid/*.xml: this changed the output slightly Daniel

---
 ChangeLog                         |    6 +
 parser.c                          |    7 +-
 result/valid/REC-xml-19980210.xml | 1505 ++++++++++++++++++++---------
 result/valid/dia.xml              |  192 ++--
 result/valid/xhtml1.xhtml         |  674 ++++++++-----
 result/valid/xlink.xml            |  529 ++++++----
 result/valid/xlink.xml.err        |    4 +-
 7 files changed, 1907 insertions(+), 1010 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index e44c85fd..891141ea 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,9 @@
+Sat Mar 3 02:10:24 CET 2001 Daniel Veillard
+
+    * parser.c: one must report spaces even if the Dtd element
+      content proves that this is not part of the element content.
+    * result/valid/*.xml: this changed the output slightly
+
 Thu Mar 1 17:53:39 CET 2001 Daniel Veillard
 
     * configure.in: bumped to 2.3.3
diff --git a/parser.c b/parser.c
index b6f61449..f1617a3c 100644
--- a/parser.c
+++ b/parser.c
@@ -1393,6 +1393,9 @@ static int areBlanks(xmlParserCtxtPtr ctxt, const xmlChar *str, int len) {
     int i, ret;
     xmlNodePtr lastChild;
 
+    if (ctxt->keepBlanks)
+        return(0);
+
     /*
      * Check for xml:space value.
      */
@@ -1417,8 +1420,6 @@ static int areBlanks(xmlParserCtxtPtr ctxt, const xmlChar *str, int len) {
     /*
      * Otherwise, heuristic :-\
      */
-    if (ctxt->keepBlanks)
-        return(0);
     if (RAW != '<') return(0);
     if (ctxt->node == NULL) return(0);
     if ((ctxt->node->children == NULL) &&
@@ -1641,7 +1642,7 @@ xmlParseName(xmlParserCtxtPtr ctxt) {
             return(ret);
         }
     }
-    xmlParseNameComplex(ctxt);
+    return(xmlParseNameComplex(ctxt));
 }
 
 xmlChar *
diff --git a/result/valid/REC-xml-19980210.xml b/result/valid/REC-xml-19980210.xml
index 2d4f035f..45d941e7 100644
--- a/result/valid/REC-xml-19980210.xml
+++ b/result/valid/REC-xml-19980210.xml
@@ -39,11 +39,8 @@ all text and legal notices remain intact">
 REC-xml-&iso6.doc.date;
 W3C Recommendation
-
-&draft.day;
-&draft.month;
-&draft.year;
-
+&draft.day;&draft.month;&draft.year;
+
 http://www.w3.org/TR/1998/REC-xml-&iso6.doc.date;
@@ -76,21 +73,15 @@ http://www.w3.org/TR/WD-xml-970807
 http://www.w3.org/TR/WD-xml-971117-->
-
-Tim Bray
+Tim Bray
 Textuality and Netscape
-tbray@textuality.com
-
-
-Jean Paoli
+tbray@textuality.com
+Jean Paoli
 Microsoft
-jeanpa@microsoft.com
-
-
-C. M. Sperberg-McQueen
+jeanpa@microsoft.com
+C. M. Sperberg-McQueen
 University of Illinois at Chicago
-cmsmcq@uic.edu
-
+cmsmcq@uic.edu

The Extensible Markup Language (XML) is a subset of @@ -128,6 +119,8 @@ available at xml-editor@w3.org.

+ +

Chicago, Vancouver, Mountain View, et al.: World-Wide Web Consortium, XML Working Group, 1996, 1997.

@@ -358,7 +351,7 @@ do some housekeeping - + Introduction

Extensible Markup Language, abbreviated XML, describes a class of @@ -386,6 +379,7 @@ doing its work on behalf of another module, called the application. This specification describes the required behavior of an XML processor in terms of how it must read XML data and the information it must provide to the application.

+ Origin and Goals

XML was developed by an XML Working Group (originally known as the @@ -397,11 +391,21 @@ Interest Group (previously known as the SGML Working Group) also organized by the W3C. The membership of the XML Working Group is given in an appendix. Dan Connolly served as the WG's contact with the W3C.

-

The design goals for XML are:

XML shall be straightforwardly usable over the -Internet.

XML shall support a wide variety of applications.

XML shall be compatible with SGML.

It shall be easy to write programs which process XML -documents.

The number of optional features in XML is to be kept to the -absolute minimum, ideally zero.

XML documents should be human-legible and reasonably -clear.

The XML design should be prepared quickly.

The design of XML shall be formal and concise.

XML documents shall be easy to create.

Terseness in XML markup is of minimal importance.

+

The design goals for XML are: +

XML shall be straightforwardly usable over the +Internet.

+

XML shall support a wide variety of applications.

+

XML shall be compatible with SGML.

+

It shall be easy to write programs which process XML +documents.

+

The number of optional features in XML is to be kept to the +absolute minimum, ideally zero.

+

XML documents should be human-legible and reasonably +clear.

+

The XML design should be prepared quickly.

+

The design of XML shall be formal and concise.

+

XML documents shall be easy to create.

+

Terseness in XML markup is of minimal importance.

This specification, together with associated standards @@ -415,23 +419,44 @@ and construct computer programs to process it.

This version of the XML specification &doc.distribution;.

+
+ + + + Terminology +

The terminology used to describe XML documents is defined in the body of this specification. The terms defined in the following list are used in building those definitions and in describing the actions of an XML processor: -

Conforming documents and XML + + + +

Conforming documents and XML processors are permitted to but need not behave as -described.

Conforming documents and XML processors +described.

+
+ + +

Conforming documents and XML processors are required to behave as described; otherwise they are in error. -

A violation of the rules of this +

+
+ + +

A violation of the rules of this specification; results are undefined. Conforming software may detect and report an error and may -recover from it.

An error +recover from it.

+
+ + +

An error which a conforming XML processor must detect and report to the application. After encountering a fatal error, the @@ -444,16 +469,33 @@ Once a fatal error is detected, however, the processor must not continue normal processing (i.e., it must not continue to pass character data and information about the document's logical structure to the application in the normal way). -

Conforming software may or must (depending on the modal verb in the +

+
+ + +

Conforming software may or must (depending on the modal verb in the sentence) behave as described; if it does, it must provide users a means to enable or disable the behavior -described.

A rule which applies to all +described.

+
+ + +

A rule which applies to all valid XML documents. Violations of validity constraints are errors; they must, at user option, be reported by -validating XML processors.

A rule which applies to all well-formed XML documents. +validating XML processors.

+
+ + +

A rule which applies to all well-formed XML documents. Violations of well-formedness constraints are -fatal errors.

(Of strings or names:) +fatal errors.

+
+ + + +

(Of strings or names:) Two strings or names being compared must be identical. Characters with multiple possible representations in ISO/IEC 10646 (e.g. characters with @@ -470,29 +512,42 @@ An element matches its declaration when it conforms in the fashion described in the constraint . -

A feature of +

+
+ + +

A feature of XML included solely to ensure that XML remains compatible with SGML. -

A +

+
+ + +

A non-binding recommendation included to increase the chances that XML documents can be processed by the existing installed base of SGML processors which predate the -&WebSGML;.

+&WebSGML;.

+ +

+ +
+ Documents -

- + +

A data object is an XML document if it is well-formed, as defined in this specification. A well-formed XML document may in addition be valid if it meets certain further -constraints. -

+constraints.

+

Each XML document has both a logical and a physical structure. Physically, the document is composed of units called entities. An entity may refer to other entities to cause their inclusion in the document. A document begins in a "root" or document entity. @@ -505,42 +560,57 @@ markup. The logical and physical structures must nest properly, as described in .

+ Well-Formed XML Documents +

A textual object is a well-formed XML document if: -

Taken as a whole, it -matches the production labeled document.

It -meets all the well-formedness constraints given in this specification.

Each of the parsed entities + +

Taken as a whole, it +matches the production labeled document.

+

It +meets all the well-formedness constraints given in this specification.

+
+

Each of the parsed entities which is referenced directly or indirectly within the document is -well-formed.

+well-formed.

+

-Documentdocumentprolog + +Document +document +prolog element -Misc* +Misc* +

Matching the document production implies that: -

It contains one or more -elements.

There is exactly +stronger. -MSM *--> +

There is exactly one element, called the root, or document element, no part of which appears in the content of any other element. For all other elements, if the start-tag is in the content of another element, the end-tag is in the content of the same element. More simply stated, the elements, delimited by start- and end-tags, nest properly within each other. -

+

+

-

-As a consequence +

As a consequence of this, for each non-root element C in the document, there is one other element P @@ -550,11 +620,11 @@ the content of any other element that is in the content of P. P is referred to as the parent of C, and C as a -child of P. -

-
+child of P.

+ Characters +

A parsed entity contains text, a sequence of characters, @@ -567,10 +637,18 @@ graphic characters of Unicode and ISO/IEC 10646. The use of "compatibility characters", as defined in section 6.8 of , is discouraged. -Character RangeChar#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] -| [#x10000-#x10FFFF]any Unicode character, excluding the -surrogate blocks, FFFE, and FFFF. + +Character Range + +Char +#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] +| [#x10000-#x10FFFF] +any Unicode character, excluding the +surrogate blocks, FFFE, and FFFF. + +

+

The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors must accept the UTF-8 and UTF-16 encodings of 10646; the mechanisms for signaling which of @@ -584,13 +662,22 @@ equivalent of its UCS-4 code value.

-->
+ Common Syntactic Constructs +

This section defines some symbols used widely in the grammar.

S (white space) consists of one or more space (#x20) characters, carriage returns, line feeds, or tabs. -White SpaceS(#x20 | #x9 | #xD | #xA)+

+ +White Space + +S +(#x20 | #x9 | #xD | #xA)+ + + +

Characters are classified for convenience as letters, digits, or other characters. Letters consist of an alphabetic or syllabic base character possibly @@ -622,13 +709,26 @@ should accept the colon as a name character.

An Nmtoken (name token) is any mixture of name characters. -Names and TokensNameCharLetter + +Names and Tokens +NameChar +Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar -| ExtenderName(Letter | '_' | ':') -(NameChar)*NamesName -(S Name)*Nmtoken(NameChar)+NmtokensNmtoken (S Nmtoken)* +| Extender + +Name +(Letter | '_' | ':') +(NameChar)* +Names +Name +(S Name)* +Nmtoken +(NameChar)+ +Nmtokens +Nmtoken (S Nmtoken)* +

Literal data is any quoted string not containing the quotation mark used as a delimiter for that string. @@ -640,34 +740,56 @@ and external identifiers (SystemLiteral). Note that a SystemLiteral can be parsed without scanning for markup. -LiteralsEntityValue'"' + +Literals +EntityValue +'"' ([^%&"] | PEReference | Reference)* '"' -|  + +|  "'" ([^%&'] | PEReference | Reference)* -"'"AttValue'"' +"'" + +AttValue +'"' ([^<&"] | Reference)* '"' -|  + +|  "'" ([^<&'] | Reference)* -"'"SystemLiteral('"' [^"]* '"') | ("'" [^']* "'") -PubidLiteral'"' PubidChar* +"'" + +SystemLiteral +('"' [^"]* '"') | ("'" [^']* "'") + + +PubidLiteral +'"' PubidChar* '"' -| "'" (PubidChar - "'")* "'"PubidChar#x20 | #xD | #xA +| "'" (PubidChar - "'")* "'" + +PubidChar +#x20 | #xD | #xA | [a-zA-Z0-9] -| [-'()+,./:=?;!*#@$_%] +| [-'()+,./:=?;!*#@$_%] + +

+
+ Character Data and Markup +

Text consists of intermingled character data and markup. @@ -683,11 +805,9 @@ data and markup. processing instructions.

-

-All text that is not markup +

All text that is not markup constitutes the character data of -the document. -

+the document.

The ampersand character (&) and the left angle bracket (<) may appear in their literal form only when used as markup delimiters, or within a comment, a @@ -727,13 +847,20 @@ To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;". -Character DataCharData[^<&]* - ([^<&]* ']]>' [^<&]*) + +Character Data + +CharData +[^<&]* - ([^<&]* ']]>' [^<&]*) + +

+ Comments -

-Comments may + +

Comments may appear anywhere in a document outside other markup; in addition, they may appear within the document type declaration @@ -745,28 +872,41 @@ retrieve the text of comments. For compatibility, the string "--" (double-hyphen) must not occur within comments. -CommentsComment'<!--' + +Comments +Comment +'<!--' ((Char - '-') | ('-' (Char - '-')))* -'-->' - -

+'-->' + + +

An example of a comment: <!&como; declarations for <head> & <body> &comc;>

+ Processing Instructions +

Processing instructions (PIs) allow documents to contain instructions for applications. -Processing InstructionsPI'<?' PITarget + +Processing Instructions +PI +'<?' PITarget (S (Char* - (Char* &pic; Char*)))? -&pic;PITargetName - -(('X' | 'x') ('M' | 'm') ('L' | 'l')) +&pic; +PITarget +Name - +(('X' | 'x') ('M' | 'm') ('L' | 'l')) + + PIs are not part of the document's character data, but must be passed through to the application. The PI begins with a target (PITarget) used @@ -780,8 +920,10 @@ may be used for formal declaration of PI targets.

+ CDATA Sections +

CDATA sections may occur anywhere character data may occur; they are @@ -789,11 +931,24 @@ used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string "<![CDATA[" and end with the string "]]>": -CDATA SectionsCDSectCDStart + +CDATA Sections +CDSect +CDStart CData -CDEndCDStart'<![CDATA['CData(Char* - +CDEnd +CDStart +'<![CDATA[' + +CData +(Char* - (Char* ']]>' Char*)) -CDEnd']]>' + + +CDEnd +']]>' + + Within a CDATA section, only the CDEnd string is recognized as markup, so that left angle brackets and ampersands may occur in @@ -801,6 +956,7 @@ their literal form; they need not (and cannot) be escaped using "&lt;" and "&amp;". CDATA sections cannot nest.

+

An example of a CDATA section, in which "<greeting>" and "</greeting>" are recognized as character data, not @@ -808,8 +964,10 @@ are recognized as character data, not <![CDATA[<greeting>Hello, world!</greeting>]]>

+ Prolog and Document Type Declaration +

XML documents may, and should, begin with an XML declaration which specifies @@ -824,6 +982,7 @@ and so is this: Hello, world! ]]>

+

The version number "1.0" should be used to indicate conformance to this version of this specification; it is an error for a document to use the value "1.0" @@ -852,18 +1011,39 @@ declaration and if the document complies with the constraints expressed in it.

The document type declaration must appear before the first element in the document. -PrologprologXMLDecl? + +Prolog + +prolog +XMLDecl? Misc* (doctypedecl -Misc*)?XMLDecl&xmlpio; +Misc*)? +XMLDecl +&xmlpio; VersionInfo EncodingDecl? SDDecl? S? -&pic;VersionInfoS 'version' Eq +&pic; + +VersionInfo +S 'version' Eq (' VersionNum ' -| " VersionNum ")EqS? '=' S?VersionNum([a-zA-Z0-9_.:] | '-')+MiscComment | PI | -S

+| " VersionNum ") + +Eq +S? '=' S? + +VersionNum +([a-zA-Z0-9_.:] | '-')+ + +Misc +Comment | PI | +S + +

+

The XML document type declaration contains or points to @@ -896,8 +1076,7 @@ For fuller information, see Document Type Definition - -doctypedecl +doctypedecl '<!DOCTYPE' S Name (S ExternalID)? @@ -909,8 +1088,7 @@ For fuller information, see S?)? '>' - -markupdecl +markupdecl elementdecl | AttlistDecl | EntityDecl @@ -921,8 +1099,10 @@ For fuller information, see + +

The markup declarations may be made up in whole or in part of the replacement text of parameter entities. @@ -931,6 +1111,7 @@ individual nonterminals (elementdecl, AttlistDecl, and so on) describe the declarations after all the parameter entities have been included.

+ Root Element Type

@@ -938,6 +1119,7 @@ The Name in the document type declaration must match the element type of the root element.

+ Proper Declaration/PE Nesting

Parameter-entity @@ -974,13 +1156,22 @@ by using the conditional section construct; this is not allowed in the internal subset. -External SubsetextSubsetTextDecl? -extSubsetDeclextSubsetDecl( + +External Subset + +extSubset +TextDecl? +extSubsetDecl +extSubsetDecl +( markupdecl | conditionalSect | PEReference | S -)*

+)* + + +

The external subset and external parameter entities also differ from the internal subset in that in them, parameter-entity references @@ -1008,6 +1199,7 @@ This has the effect that entity and attribute-list declarations in the internal subset take precedence over those in the external subset.

+ Standalone Document Declaration

Markup declarations can affect the content of the document, @@ -1018,11 +1210,18 @@ The standalone document declaration, which may appear as a component of the XML declaration, signals whether or not there are such declarations which appear external to the document entity. -Standalone Document DeclarationSDDecl + +Standalone Document Declaration + +SDDecl + S 'standalone' Eq (("'" ('yes' | 'no') "'") | ('"' ('yes' | 'no') '"')) -

+ + + +

In a standalone document declaration, the value "yes" indicates that there @@ -1050,21 +1249,16 @@ which may be desirable for some network delivery applications.

Standalone Document Declaration

The standalone document declaration must have the value "no" if any external markup declarations -contain declarations of:

- - -

attributes with default values, if +contain declarations of:

+

attributes with default values, if elements to which these attributes apply appear in the document without -specifications of values for these attributes, or

-
- -

entities (other than &magicents;), +specifications of values for these attributes, or

+

entities (other than &magicents;), if references to those entities appear in the document, or

- -

attributes with values subject to +

attributes with values subject to normalization, where the attribute appears in the document with a value which will change as a result of normalization, or

@@ -1073,14 +1267,15 @@ change as a result of normalization, or

element types with element content, if white space occurs directly within any instance of those types. -

-
+

+

An example XML declaration with a standalone document declaration:<?xml version="&XML.version;" standalone='yes'?>

White Space Handling +

In editing XML documents, it is often convenient to use "white space" (spaces, tabs, and blank lines, denoted by the nonterminal S in this specification) to @@ -1119,6 +1314,7 @@ is considered to have signaled no intentions as regards application space handling, unless it provides a value for this attribute or the attribute is declared with a default value.

+
End-of-Line Handling @@ -1152,19 +1348,38 @@ In valid documents, this attribute, like any other, must be declared if it is used. The values of the attribute are language identifiers as defined by , "Tags for the Identification of Languages": -Language IdentificationLanguageIDLangcode -('-' Subcode)*LangcodeISO639Code | + +Language Identification +LanguageID +Langcode +('-' Subcode)* +Langcode +ISO639Code | IanaCode | -UserCodeISO639Code([a-z] | [A-Z]) ([a-z] | [A-Z])IanaCode('i' | 'I') '-' ([a-z] | [A-Z])+UserCode('x' | 'X') '-' ([a-z] | [A-Z])+Subcode([a-z] | [A-Z])+ +UserCode + +ISO639Code +([a-z] | [A-Z]) ([a-z] | [A-Z]) +IanaCode +('i' | 'I') '-' ([a-z] | [A-Z])+ +UserCode +('x' | 'X') '-' ([a-z] | [A-Z])+ +Subcode +([a-z] | [A-Z])+ + The Langcode may be any of the following: -

a two-letter language code as defined by + +

a two-letter language code as defined by , "Codes -for the representation of names of languages"

a language identifier registered with the Internet +for the representation of names of languages"

+

a language identifier registered with the Internet Assigned Numbers Authority ; these begin with the -prefix "i-" (or "I-")

a language identifier assigned by the user, or agreed on +prefix "i-" (or "I-")

+

a language identifier assigned by the user, or agreed on between parties in private use; these must begin with the prefix "x-" or "X-" in order to ensure that they do not conflict -with names later standardized or registered with IANA

+with names later standardized or registered with IANA

+

There may be any number of Subcode segments; if the first subcode segment exists and the Subcode consists of two @@ -1224,11 +1439,14 @@ notes in English, the xml:lang attribute might be declared this way: ]]>

+
+ Logical Structures +

Each XML document contains one or more elements, the boundaries of which are either delimited by start-tags @@ -1238,13 +1456,9 @@ identifier" (GI), and may have a set of attribute specifications. Each attribute specification has a name and a value.

- -Element - -element - -EmptyElemTag - +Element +element +EmptyElemTag | STag content ETag @@ -1274,40 +1488,47 @@ there is a declaration matching Name matches the element type, and one of the following holds:

- -

The declaration matches EMPTY and the element has no -content.

-
- -

The declaration matches children and +

The declaration matches EMPTY and the element has no +content.

+

The declaration matches children and the sequence of child elements belongs to the language generated by the regular expression in the content model, with optional white space (characters matching the nonterminal S) between each pair -of child elements.

-
- -

The declaration matches Mixed and +of child elements.

+

The declaration matches Mixed and the content consists of character data and child elements -whose types match names in the content model.

-
- -

The declaration matches ANY, and the types +whose types match names in the content model.

+

The declaration matches ANY, and the types of any child elements have -been declared.

-
+been declared.

+ Start-Tags, End-Tags, and Empty-Element Tags +

The beginning of every non-empty XML element is marked by a start-tag. -Start-tagSTag'<' Name + +Start-tag + +STag +'<' Name (S Attribute)* -S? '>'AttributeName Eq -AttValue +S? '>' + + +Attribute +Name Eq +AttValue + + + + + The Name in the start- and end-tags gives the element's type. @@ -1351,39 +1572,55 @@ to external entities. referred to directly or indirectly in an attribute value (other than "&lt;") must not contain a <. -

- +

An example of a start-tag: <termdef id="dt-dog" term="dog">

-

-The end of every element +

The end of every element that begins with a start-tag must be marked by an end-tag containing a name that echoes the element's type as given in the start-tag: -End-tagETag'</' Name -S? '>' - -

+ +End-tag + +ETag +'</' Name +S? '>' + + +

An example of an end-tag:</termdef>

-

-The +

The text between the start-tag and end-tag is called the element's content: -Content of Elementscontent(element | CharData + +Content of Elements + +content +(element | CharData | Reference | CDSect -| PI | Comment)* - -

+| PI | Comment)* +
+ +
+

If an element is empty, it must be represented either by a start-tag immediately followed by an end-tag or by an empty-element tag. An empty-element tag takes a special form: -Tags for Empty ElementsEmptyElemTag'<' Name (S + +Tags for Empty Elements + +EmptyElemTag +'<' Name (S Attribute)* S? -'/>' +'/>' + + + +

Empty-element tags may be used for any element which has no content, whether or not it is declared using the keyword @@ -1397,8 +1634,10 @@ tag must be used, and can only be used, for elements which are <br></br> <br/>

+ Element Type Declarations +

The element structure of an XML document may, for validation purposes, @@ -1407,6 +1646,7 @@ using element type and attribute-list declarations. An element type declaration constrains the element's content.

+

Element type declarations often constrain which element types can appear as children of the element. At user option, an XML processor may issue a warning @@ -1414,31 +1654,45 @@ when a declaration mentions an element type for which no declaration is provided, but this is not an error.

An element type declaration takes the form: -Element Type Declarationelementdecl'<!ELEMENT' S + +Element Type Declaration + +elementdecl +'<!ELEMENT' S Name S contentspec -S? '>'contentspec'EMPTY' +S? '>' + +contentspec +'EMPTY' | 'ANY' | Mixed | children - + + + + where the Name gives the element type being declared.

+ Unique Element Type Declaration

No element type may be declared more than once.

+

Examples of element type declarations: <!ELEMENT br EMPTY> <!ELEMENT p (#PCDATA|emph)* > <!ELEMENT %name.para; %content.para; > <!ELEMENT container ANY>

+ Element Content +

An element type has element content when elements of that type must contain only child @@ -1454,16 +1708,31 @@ The grammar is built on content particles (cps), which consist of names, choice lists of content particles, or sequence lists of content particles: -Element-content Modelschildren(choice + +Element-content Models + +children +(choice | seq) -('?' | '*' | '+')?cp(Name +('?' | '*' | '+')? +cp +(Name | choice | seq) -('?' | '*' | '+')?choice'(' S? cp +('?' | '*' | '+')? +choice +'(' S? cp ( S? '|' S? cp )* -S? ')'seq'(' S? cp +S? ')' + +seq +'(' S? cp ( S? ',' S? cp )* -S? ')' +S? ')' + + + + where each Name is the type of an element which may appear as a child. Any content @@ -1518,8 +1787,10 @@ character of the replacement text should be a connector <!ELEMENT div1 (head, (p | list | note)*, div2*)> <!ELEMENT dictionary-body (%div.mix; | %dict.mix;)*>

+ Mixed Content +

An element type has mixed content when elements of that type may contain @@ -1527,15 +1798,25 @@ character data, optionally interspersed with child elements. In this case, the types of the child elements may be constrained, but not their order or their number of occurrences: -Mixed-content DeclarationMixed'(' S? + +Mixed-content Declaration + +Mixed +'(' S? '#PCDATA' (S? '|' S? Name)* S? -')*' | '(' S? '#PCDATA' S? ')' - +')*' +| '(' S? '#PCDATA' S? ')' + + + + + + where the Names give the types of elements that may appear as children.

@@ -1543,16 +1824,17 @@ that may appear as children. No Duplicate Types

The same name must not appear more than once in a single mixed-content declaration. -

- +

Examples of mixed content declarations: <!ELEMENT p (#PCDATA|a|ul|b|i|em)*> <!ELEMENT p (#PCDATA | %font; | %phrase; | %special; | %form;)* > <!ELEMENT b (#PCDATA)>

+ Attribute-List Declarations +

Attributes are used to associate name-value pairs with elements. Attribute specifications may appear only within start-tags @@ -1561,29 +1843,39 @@ thus, the productions used to recognize them appear in . Attribute-list declarations may be used: -

To define the set of attributes pertaining to a given -element type.

To establish type constraints for these -attributes.

To provide default values -for attributes.

+ +

To define the set of attributes pertaining to a given +element type.

+

To establish type constraints for these +attributes.

+

To provide default values +for attributes.

+

-

- +

Attribute-list declarations specify the name, data type, and default value (if any) of each attribute associated with a given element type: -Attribute-list DeclarationAttlistDecl'<!ATTLIST' S + +Attribute-list Declaration +AttlistDecl +'<!ATTLIST' S Name AttDef* -S? '>'AttDefS Name +S? '>' + +AttDef +S Name S AttType -S DefaultDecl +S DefaultDecl + + The Name in the AttlistDecl rule is the type of an element. At user option, an XML processor may issue a warning if attributes are declared for an element type not itself declared, but this is not an error. The Name in the AttDef rule is -the name of the attribute. -

+the name of the attribute.

When more than one AttlistDecl is provided for a given element type, the contents of all those provided are merged. When @@ -1601,16 +1893,45 @@ provided for a given element type, or more than one attribute definition is provided for a given attribute, but this is not an error.

+ Attribute Types +

XML attribute types are of three kinds: a string type, a set of tokenized types, and enumerated types. The string type may take any literal string as a value; the tokenized types have varying lexical and semantic constraints, as noted: -Attribute TypesAttTypeStringType + +Attribute Types + +AttType +StringType | TokenizedType | EnumeratedType -StringType'CDATA'TokenizedType'ID'| 'IDREF'| 'IDREFS'| 'ENTITY'| 'ENTITIES'| 'NMTOKEN'| 'NMTOKENS' + + +StringType +'CDATA' + +TokenizedType +'ID' + + + +| 'IDREF' + +| 'IDREFS' + +| 'ENTITY' + +| 'ENTITIES' + +| 'NMTOKEN' + +| 'NMTOKENS' + + +

ID @@ -1672,9 +1993,14 @@ passing them to the application, as described in

Enumerated attributes can take one of a list of values provided in the declaration. There are two kinds of enumerated types: -Enumerated Attribute TypesEnumeratedTypeNotationType + +Enumerated Attribute Types +EnumeratedType +NotationType | Enumeration -NotationType'NOTATION' + +NotationType +'NOTATION' S '(' S? @@ -1682,19 +2008,25 @@ kinds of enumerated types: (S? '|' S? Name)* S? ')' -Enumeration'(' S? + + +Enumeration +'(' S? Nmtoken (S? '|' S? Nmtoken)* S? -')' +')' + + A NOTATION attribute identifies a notation, declared in the DTD with associated system and/or public identifiers, to be used in interpreting the element to which the attribute is attached.

+ Notation Attributes

@@ -1717,14 +2049,28 @@ declaration. enumerated attribute types of a single element type.

+ Attribute Defaults +

An attribute declaration provides information on whether the attribute's presence is required, and if not, how an XML processor should react if a declared attribute is absent in a document. -Attribute DefaultsDefaultDecl'#REQUIRED' -| '#IMPLIED' | (('#FIXED' S)? AttValue) + +Attribute Defaults + +DefaultDecl +'#REQUIRED' +| '#IMPLIED' +| (('#FIXED' S)? AttValue) + + + + + + +

In an attribute declaration, #REQUIRED means that the @@ -1751,8 +2097,7 @@ the declared default value.

If the default declaration is the keyword #REQUIRED, then the attribute must be specified for all elements of the type in the attribute-list declaration. -

- +

Attribute Default Legal

@@ -1765,8 +2110,8 @@ default value must meet the lexical constraints of the declared attribute type.

If an attribute has a default value declared with the #FIXED keyword, instances of that attribute must match the default value. -

-
+

+

Examples of attribute-list declarations: <!ATTLIST termdef id ID #REQUIRED @@ -1781,14 +2126,19 @@ match the default value.

Before the value of an attribute is passed to the application or checked for validity, the XML processor must normalize it as follows: -

a character reference is processed by appending the referenced -character to the attribute value

an entity reference is processed by recursively processing the -replacement text of the entity

a whitespace character (#x20, #xD, #xA, #x9) is processed by + +

a character reference is processed by appending the referenced +character to the attribute value

+

an entity reference is processed by recursively processing the +replacement text of the entity

+

a whitespace character (#x20, #xD, #xA, #x9) is processed by appending #x20 to the normalized value, except that only a single #x20 is appended for a "#xD#xA" sequence that is part of an external parsed entity or the literal entity value of an internal parsed -entity

other characters are processed by appending them to the normalized -value

+entity

+

other characters are processed by appending them to the normalized +value

+
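The four normalization rules listed above can be sketched as a small recursive function. This is only an illustration under stated assumptions, not the code in parser.c: `normalize_attribute_value` and its `entities` mapping are hypothetical names, the Name pattern is simplified, and every "#xD#xA" pair is collapsed to a single #x20 without checking which entity it came from.

```python
import re

# One character reference or general-entity reference, matched at a given offset.
# The Name part is a simplified stand-in for the spec's Name production.
_REFERENCE = re.compile(r"&(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z_:][\w.:-]*);")

WHITESPACE = "\x20\x0D\x0A\x09"


def normalize_attribute_value(raw, entities=None, _open=()):
    """Apply the four normalization rules to a raw attribute value.

    `entities` is a hypothetical mapping from general-entity name to
    replacement text; it stands in for a real parser's entity table.
    """
    entities = entities or {}
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "&":
            match = _REFERENCE.match(raw, i)
            if match is None:
                raise ValueError("malformed reference at offset %d" % i)
            ref = match.group(1)
            if ref.startswith("#x"):
                # Character reference (hexadecimal): append the referenced character.
                out.append(chr(int(ref[2:], 16)))
            elif ref.startswith("#"):
                # Character reference (decimal): append the referenced character.
                out.append(chr(int(ref[1:], 10)))
            else:
                # Entity reference: recursively process the replacement text.
                if ref in _open:
                    raise ValueError("entity %r refers to itself" % ref)
                if ref not in entities:
                    raise KeyError(ref)
                out.append(
                    normalize_attribute_value(entities[ref], entities, _open + (ref,))
                )
            i = match.end()
        elif raw.startswith("\r\n", i):
            # Assumption: every CR LF pair yields a single space.
            out.append(" ")
            i += 2
        elif ch in WHITESPACE:
            # Any other whitespace character becomes #x20.
            out.append(" ")
            i += 1
        else:
            # Other characters are appended unchanged.
            out.append(ch)
            i += 1
    return "".join(out)
```

Note that a character reference such as `&#xA;` appends the line feed itself, while a literal line feed in the value becomes a space; the sketch keeps those two paths separate for that reason.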

If the declared value is not CDATA, then the XML processor must further process the normalized attribute value by discarding any @@ -1810,20 +2160,39 @@ by a non-validating parser as if declared which are included in, or excluded from, the logical structure of the DTD based on the keyword which governs them. -Conditional SectionconditionalSectincludeSect + +Conditional Section + +conditionalSect +includeSect | ignoreSect -includeSect'<![' S? 'INCLUDE' S? '[' + + +includeSect +'<![' S? 'INCLUDE' S? '[' extSubsetDecl ']]>' -ignoreSect'<![' S? 'IGNORE' S? '[' + + +ignoreSect +'<![' S? 'IGNORE' S? '[' ignoreSectContents* -']]>'ignoreSectContentsIgnore +']]>' + + +ignoreSectContents +Ignore ('<![' ignoreSectContents ']]>' -Ignore)*IgnoreChar* - +Ignore)* +Ignore +Char* - (Char* ('<![' | ']]>') Char*) - + + + +

Like the internal and external DTD subsets, a conditional section may contain one or more complete declarations, @@ -1861,6 +2230,8 @@ include or ignore the conditional section.

+ + +--> +
+ Physical Structures +

An XML document may consist of one or many storage units. These are called entities; they all have content and are all @@ -1903,6 +2277,7 @@ contents are referred to as its replacement text; this text is considered an integral part of the document.

+

An unparsed entity is a resource whose contents may or may not be @@ -1931,16 +2306,27 @@ are recognized in different contexts. Furthermore, they occupy different namespaces; a parameter entity and a general entity with the same name are two distinct entities.

+ Character and Entity References

A character reference refers to a specific character in the ISO/IEC 10646 character set, for example one not directly accessible from available input devices. -Character ReferenceCharRef'&#' [0-9]+ ';' | '&hcro;' [0-9a-fA-F]+ ';' -Legal Character

Characters referred to using character references must + +Character Reference +CharRef +'&#' [0-9]+ ';' +| '&hcro;' [0-9a-fA-F]+ ';' + + + + +Legal Character +

Characters referred to using character references must match the production for -Char.

+Char.

+ If the character reference begins with "&#x", the digits and letters up to the terminating ; provide a hexadecimal representation of the character's code point in ISO/IEC 10646. @@ -1962,27 +2348,24 @@ semicolon
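As a rough illustration of the decimal and hexadecimal forms and of the Legal Character constraint, the sketch below decodes a single character reference and rejects code points outside the Char production. The function names are hypothetical, and the code-point ranges are those of the Char production in XML 1.0, which is defined in a part of the specification not shown in this hunk.

```python
import re

# "&#" digits ";" is decimal, "&#x" hex-digits ";" is hexadecimal.
_CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def is_xml_char(code_point):
    """True if the code point matches the Char production of XML 1.0."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def decode_char_ref(ref):
    """Return the character named by one character reference."""
    match = _CHAR_REF.fullmatch(ref)
    if match is None:
        raise ValueError("not a character reference: %r" % ref)
    hex_digits, dec_digits = match.groups()
    if hex_digits is not None:
        code_point = int(hex_digits, 16)
    else:
        code_point = int(dec_digits, 10)
    if not is_xml_char(code_point):
        # Well-formedness constraint "Legal Character".
        raise ValueError("reference to a character outside Char: %r" % ref)
    return chr(code_point)
```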

Entity Reference - -Reference +Reference EntityRef -| CharRef - - -EntityRef +| CharRef +EntityRef '&' Name ';' - -PEReference +PEReference '%' Name ';' + Entity Declared

In a document without any DTD, a document with only an internal @@ -2052,20 +2435,44 @@ is classified &security-level;.

%ISOLat2;]]>

+ Entity Declarations +

Entities are declared thus: -Entity DeclarationEntityDeclGEDecl | PEDeclGEDecl'<!ENTITY' S Name + +Entity Declaration + +EntityDecl +GEDecl | PEDecl + + +GEDecl +'<!ENTITY' S Name S EntityDef -S? '>'PEDecl'<!ENTITY' S '%' S +S? '>' + +PEDecl +'<!ENTITY' S '%' S Name S -PEDef S? '>'EntityDefEntityValue +PEDef S? '>' + + +EntityDef +EntityValue | (ExternalID -NDataDecl?)PEDefEntityValue -| ExternalID +NDataDecl?) + + + +PEDef +EntityValue +| ExternalID + + The Name identifies the entity in an entity reference or, in the case of an unparsed entity, in the value of an ENTITY or ENTITIES @@ -2074,8 +2481,10 @@ If the same entity is declared more than once, the first declaration encountered is binding; at user option, an XML processor may issue a warning if entities are declared multiple times.

+ Internal Entities +

If the entity definition is an EntityValue, @@ -2094,25 +2503,35 @@ entity.

<!ENTITY Pub-Status "This is a pre-release of the specification.">

+ External Entities -

-If the entity is not + +

If the entity is not internal, it is an external entity, declared as follows: -External Entity DeclarationExternalID'SYSTEM' S -SystemLiteral| 'PUBLIC' S + --> +ExternalID +'SYSTEM' S +SystemLiteral +| 'PUBLIC' S PubidLiteral S SystemLiteral -NDataDeclS 'NDATA' S -Name + + +NDataDecl +S 'NDATA' S +Name + + If the NDataDecl is present, this is a general unparsed -entity; otherwise it is a parsed entity. -

+entity; otherwise it is a parsed entity.

Notation Declared

@@ -2162,17 +2581,26 @@ and leading and trailing white space must be removed.

SYSTEM "../grafix/OpenHatch.gif" NDATA gif >

+
+ Parsed Entities The Text Declaration

External parsed entities may each begin with a text declaration. -Text DeclarationTextDecl&xmlpio; + +Text Declaration + +TextDecl +&xmlpio; VersionInfo? EncodingDecl -S? &pic; +S? &pic; + + +

The text declaration must be provided literally, not by reference to a parsed entity. @@ -2189,9 +2617,17 @@ parsed entity is well-formed if it matches the production labeled An external parameter entity is well-formed if it matches the production labeled extPE. -Well-Formed External Parsed EntityextParsedEntTextDecl? -contentextPETextDecl? -extSubsetDecl + +Well-Formed External Parsed Entity +extParsedEnt +TextDecl? +content + +extPE +TextDecl? +extSubsetDecl + + An internal general parsed entity is well-formed if its replacement text matches the production labeled content. @@ -2212,6 +2648,7 @@ can begin in one entity and end in another.

Character Encoding in Entities +

Each external parsed entity in an XML document may use a different encoding for its characters. All XML processors must be able to read entities in either UTF-8 or UTF-16. @@ -2231,11 +2668,20 @@ to read entities that use them. Parsed entities which are stored in an encoding other than UTF-8 or UTF-16 must begin with a text declaration containing an encoding declaration: -Encoding DeclarationEncodingDeclS + +Encoding Declaration +EncodingDecl +S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" ) -EncName[A-Za-z] ([A-Za-z0-9._] | '-')*Encoding name contains only Latin characters + + +EncName +[A-Za-z] ([A-Za-z0-9._] | '-')* +Encoding name contains only Latin characters + + In the document entity, the encoding declaration is part of the XML declaration. The EncName is the name of the encoding used. @@ -2278,6 +2724,7 @@ declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.

+

It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process.

Examples of encoding declarations: @@ -2292,29 +2739,44 @@ entity references, and invocations of unparsed entities might appear and the required behavior of an XML processor in each case. The labels in the leftmost column describe the recognition context: -

as a reference + + +

as a reference anywhere after the start-tag and before the end-tag of an element; corresponds -to the nonterminal content.

as a reference within either the value of an attribute in a +to the nonterminal content.

+
+ + +

as a reference within either the value of an attribute in a start-tag, or a default value in an attribute declaration; corresponds to the nonterminal -AttValue.

as a Name, not a reference, appearing either as +AttValue.

+ + +

as a Name, not a reference, appearing either as the value of an attribute which has been declared as type ENTITY, or as one of the space-separated tokens in the value of an attribute which has been -declared as type ENTITIES.

as a reference +declared as type ENTITIES.

+
+ +

as a reference within a parameter or internal entity's literal entity value in the entity's declaration; corresponds to the nonterminal -EntityValue.

as a reference within either the internal or external subsets of the +EntityValue.

+ +

as a reference within either the internal or external subsets of the DTD, but outside of an EntityValue or -AttValue.

+AttValue.

+ +

- - + Entity Type Character @@ -2327,99 +2789,50 @@ General Unparsed + Reference in Content - -Not recognized - - -Included - - -Included if validating - - -Forbidden - - -Included - +Not recognized +Included +Included if validating +Forbidden +Included Reference in Attribute Value - -Not recognized - - -Included in literal - - -Forbidden - - -Forbidden - - -Included - +Not recognized +Included in literal +Forbidden +Forbidden +Included Occurs as Attribute Value - -Not recognized - - -Forbidden - - -Forbidden - - -Notify - - -Not recognized - +Not recognized +Forbidden +Forbidden +Notify +Not recognized Reference in EntityValue - -Included in literal - - -Bypassed - - -Bypassed - - -Forbidden - - -Included - +Included in literal +Bypassed +Bypassed +Forbidden +Included Reference in DTD - -Included as PE - - -Forbidden - - -Forbidden - - -Forbidden - - -Forbidden - +Included as PE +Forbidden +Forbidden +Forbidden +Forbidden @@ -2434,8 +2847,7 @@ when they appear in the value of an appropriately declared attribute.
Included -

-An entity is +

An entity is included when its replacement text is retrieved and processed, in place of the reference itself, @@ -2452,8 +2864,7 @@ data. (The string "AT&amp;T;" expands to as an entity-reference delimiter.) A character reference is included when the indicated character is processed in place of the reference itself. - -

+

Included If Validating @@ -2482,11 +2893,16 @@ presence and retrieve it for display only on demand. Forbidden

The following are forbidden, and constitute fatal errors: -

the appearance of a reference to an + +

the appearance of a reference to an unparsed entity. -

the appearance of any character or general-entity reference in the +

+

the appearance of any character or general-entity reference in the DTD except within an EntityValue or -AttValue.

a reference to an external entity in an attribute value.

+AttValue.

+

a reference to an external entity in an attribute value.

+
+

@@ -2505,8 +2921,7 @@ For example, this is well-formed: while this is not: <!ENTITY EndAttr "27'" > <element attribute='a-&EndAttr;> -

-
+

Notify

When the name of an unparsed @@ -2538,6 +2953,7 @@ text of parameter entities to contain an integral number of grammatical tokens in the DTD.

+
Construction of Internal Entity Replacement Text @@ -2553,6 +2969,7 @@ text is the content of the entity, after replacement of character references and parameter-entity references.

+

The literal entity value as given in an internal entity declaration (EntityValue) may contain character, @@ -2582,11 +2999,11 @@ content or an attribute value.

discussion of a difficult example, see .

+
Predefined Entities -

-Entity and character +

Entity and character references can both be used to escape the left angle bracket, ampersand, and other delimiters. A set of general entities (&magicents;) is specified for this purpose. @@ -2595,8 +3012,7 @@ expanded immediately when recognized and must be treated as character data, so the numeric character references "&#60;" and "&#38;" may be used to escape < and & when they occur -in character data. -

+in character data.

All XML processors must recognize these entities whether they are declared or not. For interoperability, @@ -2618,34 +3034,38 @@ are doubly escaped to meet the requirement that entity replacement be well-formed.

+ Notation Declarations -

-Notations identify by + +

Notations identify by name the format of unparsed entities, the format of elements which bear a notation attribute, or the application to which a processing instruction is -addressed. -

-

- +addressed.

+

Notation declarations provide a name for the notation, for use in entity and attribute-list declarations and in attribute specifications, and an external identifier for the notation which may allow an XML processor or its client application to locate a helper application capable of processing data in the given notation. -Notation DeclarationsNotationDecl'<!NOTATION' S Name + +Notation Declarations +NotationDecl +'<!NOTATION' S Name S (ExternalID | PublicID) -S? '>'PublicID'PUBLIC' S +S? '>' +PublicID +'PUBLIC' S PubidLiteral - - -

+ + +

XML processors must provide applications with the name and external identifier(s) of any notation declared and referred to in an attribute value, attribute definition, or entity declaration. They may @@ -2657,8 +3077,11 @@ is not an error, however, for XML documents to declare and refer to notations for which notation-specific applications are not available on the system where the XML processor or application is running.)

+ + Document Entity +

The document entity serves as the root of the entity tree and a starting-point for an XML @@ -2669,10 +3092,14 @@ processor; unlike other entities, the document entity has no name and might well appear on a processor input stream without any identification at all.

+ +
+ Conformance + Validating and Non-Validating Processors

Conforming XML processors fall into two @@ -2725,7 +3152,8 @@ validity violations. Less is required of a non-validating processor; it need not read any part of the document other than the document entity. This has two effects that may be important to users of XML processors: -

Certain well-formedness errors, specifically those that require + +

Certain well-formedness errors, specifically those that require reading external entities, may not be detected by a non-validating processor. Examples include the constraints entitled Entity Declared, @@ -2733,7 +3161,8 @@ Examples include the constraints entitled No Recursion, as well as some of the cases described as forbidden in -.

The information passed from the processor to the application may +.

+

The information passed from the processor to the application may vary, depending on whether the processor reads parameter and external entities. For example, a non-validating processor may not @@ -2742,7 +3171,8 @@ For example, a non-validating processor may not internal entities, or supply default attribute values, where doing so depends on having read declarations in -external or parameter entities.

+external or parameter entities.

+

For maximum reliability in interoperating between different XML processors, applications which use non-validating processors should not @@ -2752,8 +3182,10 @@ attributes or internal entities which are declared in external entities should use validating XML processors.

+ Notation +

The formal grammar of XML is given in this specification using a simple Extended Backus-Naur Form (EBNF) notation. Each rule in the grammar defines one symbol, in the form @@ -2764,9 +3196,13 @@ otherwise. Literal strings are quoted.

+

Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters: -

where N is a hexadecimal integer, the + + + +

where N is a hexadecimal integer, the expression matches the character in ISO/IEC 10646 whose canonical (UCS-4) code value, when interpreted as an unsigned binary number, has @@ -2774,36 +3210,105 @@ the value indicated. The number of leading zeros in the #xN form is insignificant; the number of leading zeros in the corresponding code value is governed by the character -encoding in use and is not significant for XML.

matches any character -with a value in the range(s) indicated (inclusive).

matches any character +encoding in use and is not significant for XML.

+
+ + +

matches any character +with a value in the range(s) indicated (inclusive).

+
+ + +

matches any character with a value outside the -range indicated.

matches any character -with a value not among the characters given.

matches a literal string matching -that given inside the double quotes.

matches a literal string matching -that given inside the single quotes.

+range indicated.

+ + + +

matches any character +with a value not among the characters given.

+
+ + +

matches a literal string matching +that given inside the double quotes.

+
+ + +

matches a literal string matching +that given inside the single quotes.

+
+ These symbols may be combined to match more complex patterns as follows, where A and B represent simple expressions: -

expression is treated as a unit -and may be combined as described in this list.

matches A or nothing; optional A.

matches A followed by B.

matches A or B but not both.

matches any string that matches A but does not match + + + +

expression is treated as a unit +and may be combined as described in this list.

+
+ + +

matches A or nothing; optional A.

+
+ + +

matches A followed by B.

+
+ + +

matches A or B but not both.

+
+ + +

matches any string that matches A but does not match B. -

matches one or more occurrences of A.

matches zero or more occurrences of A.

+

+ + + +

matches one or more occurrences of A.

+
+ + +

matches zero or more occurrences of A.

+
+ + Other notations used in the productions are: -

comment.

well-formedness constraint; this identifies by name a + + + +

comment.

+
+ + +

well-formedness constraint; this identifies by name a constraint on well-formed documents -associated with a production.

validity constraint; this identifies by name a constraint on +associated with a production.

+
+ + +

validity constraint; this identifies by name a constraint on valid documents associated with -a production.

-

-
+a production.

+ + +

+ + + + References Normative References + (Internet Assigned Numbers Authority) Official Names for @@ -2811,18 +3316,21 @@ Character Sets, ed. Keld Simonsen et al. See ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets. + IETF (Internet Engineering Task Force). RFC 1766: Tags for the Identification of Languages, ed. H. Alvestrand. 1995. + (International Organization for Standardization). ISO 639:1988 (E). Code for the representation of names of languages. [Geneva]: International Organization for Standardization, 1988. + (International Organization for Standardization). ISO 3166-1:1997 (E). @@ -2830,6 +3338,7 @@ Codes for the representation of names of countries and their subdivisions — Part 1: Country codes [Geneva]: International Organization for Standardization, 1997. + ISO (International Organization for Standardization). ISO/IEC 10646-1993 (E). Information technology — Universal @@ -2838,24 +3347,31 @@ Architecture and Basic Multilingual Plane. [Geneva]: International Organization for Standardization, 1993 (plus amendments AM 1 through AM 7). + The Unicode Consortium. The Unicode Standard, Version 2.0. Reading, Mass.: Addison-Wesley Developers Press, 1996. + + - -Other References + +Other References + + Aho, Alfred V., Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Reading: Addison-Wesley, 1986, rpt. corr. 1988. + Berners-Lee, T., R. Fielding, and L. Masinter. Uniform Resource Identifiers (URI): Generic Syntax and Semantics. 1997. (Work in progress; see updates to RFC1738.) + Brüggemann-Klein, Anne. Regular Expressions into Finite Automata. Extended abstract in I. Simon, Hrsg., LATIN 1992, @@ -2863,12 +3379,14 @@ S. 97-98. Springer-Verlag, Berlin 1992. Full Version in Theoretical Computer Science 120: 197-213, 1993. + Brüggemann-Klein, Anne, and Derick Wood. Deterministic Regular Languages. Universität Freiburg, Institut für Informatik, Bericht 38, Oktober 1991. + James Clark. Comparison of SGML and XML. 
See http://www.w3.org/TR/NOTE-sgml-xml-971215. @@ -2879,18 +3397,21 @@ IETF (Internet Engineering Task Force). ed. T. Berners-Lee, L. Masinter, M. McCahill. 1994. + IETF (Internet Engineering Task Force). RFC 1808: Relative Uniform Resource Locators, ed. R. Fielding. 1995. + IETF (Internet Engineering Task Force). RFC 2141: URN Syntax, ed. R. Moats. 1997. + ISO (International Organization for Standardization). ISO 8879:1986(E). Information processing — Text and Office @@ -2898,6 +3419,8 @@ Systems — Standard Generalized Markup Language (SGML). First edition — 1986-10-15. [Geneva]: International Organization for Standardization, 1986. + + ISO (International Organization for Standardization). ISO/IEC 10744-1992 (E). Information technology — @@ -2909,6 +3432,9 @@ Standardization, 1992. [Geneva]: International Organization for Standardization, 1996. + + + @@ -2921,8 +3447,14 @@ diacritics), ideographic characters, and combining characters (among others, this class contains most diacritics); these classes combine to form the class of letters. Digits and extenders are also distinguished. -CharactersLetterBaseChar -| IdeographicBaseChar[#x0041-#x005A] + +Characters + +Letter +BaseChar +| Ideographic +BaseChar +[#x0041-#x005A] | [#x0061-#x007A] | [#x00C0-#x00D6] | [#x00D8-#x00F6] @@ -3124,10 +3656,14 @@ also distinguished. | [#x30A1-#x30FA] | [#x3105-#x312C] | [#xAC00-#xD7A3] -Ideographic[#x4E00-#x9FA5] + +Ideographic +[#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029] -CombiningChar[#x0300-#x0345] + +CombiningChar +[#x0300-#x0345] | [#x0360-#x0361] | [#x0483-#x0486] | [#x0591-#x05A1] @@ -3222,7 +3758,9 @@ also distinguished. | [#x302A-#x302F] | #x3099 | #x309A -Digit[#x0030-#x0039] + +Digit +[#x0030-#x0039] | [#x0660-#x0669] | [#x06F0-#x06F9] | [#x0966-#x096F] @@ -3237,7 +3775,9 @@ also distinguished. | [#x0E50-#x0E59] | [#x0ED0-#x0ED9] | [#x0F20-#x0F29] -Extender#x00B7 + +Extender +#x00B7 | #x02D0 | #x02D1 | #x0387 @@ -3248,26 +3788,61 @@ also distinguished. 
| [#x3031-#x3035] | [#x309D-#x309E] | [#x30FC-#x30FE] - + + + +

The character classes defined here can be derived from the Unicode character database as follows: -

Name start characters must have one of the categories Ll, Lu, -Lo, Lt, Nl.

Name characters other than Name-start characters -must have one of the categories Mc, Me, Mn, Lm, or Nd.

Characters in the compatibility area (i.e. with character code + + +

Name start characters must have one of the categories Ll, Lu, +Lo, Lt, Nl.

+
+ +

Name characters other than Name-start characters +must have one of the categories Mc, Me, Mn, Lm, or Nd.

+
+ +

Characters in the compatibility area (i.e. with character code greater than #xF900 and less than #xFFFE) are not allowed in XML -names.

Characters which have a font or compatibility decomposition (i.e. those +names.

+
+ +

Characters which have a font or compatibility decomposition (i.e. those with a "compatibility formatting tag" in field 5 of the database -- -marked by field 5 beginning with a "<") are not allowed.

The following characters are treated as name-start characters +marked by field 5 beginning with a "<") are not allowed.

+
+ +

The following characters are treated as name-start characters rather than name characters, because the property file classifies -them as Alphabetic: [#x02BB-#x02C1], #x0559, #x06E5, #x06E6.

Characters #x20DD-#x20E0 are excluded (in accordance with -Unicode, section 5.14).

Character #x00B7 is classified as an extender, because the -property list so identifies it.

Character #x0387 is added as a name character, because #x00B7 -is its canonical equivalent.

Characters ':' and '_' are allowed as name-start characters.

Characters '-' and '.' are allowed as name characters.

+them as Alphabetic: [#x02BB-#x02C1], #x0559, #x06E5, #x06E6.

+ + +

Characters #x20DD-#x20E0 are excluded (in accordance with +Unicode, section 5.14).

+
+ +

Character #x00B7 is classified as an extender, because the +property list so identifies it.

+
+ +

Character #x0387 is added as a name character, because #x00B7 +is its canonical equivalent.

+
+ +

Characters ':' and '_' are allowed as name-start characters.

+
+ +

Characters '-' and '.' are allowed as name characters.

+
+

XML and SGML +

XML is designed to be a subset of SGML, in that every valid XML document should also be a conformant SGML document. @@ -3318,29 +3893,34 @@ solely for reference. 8 This sample shows a &tricky; method. ]]> This produces the following: -

in line 4, the reference to character 37 is expanded immediately, + +

in line 4, the reference to character 37 is expanded immediately, and the parameter entity "xx" is stored in the symbol table with the value "%zz;". Since the replacement text is not rescanned, the reference to parameter entity "zz" is not recognized. (And it would be an error if it were, since -"zz" is not yet declared.)

in line 5, the character reference "&#60;" is +"zz" is not yet declared.)

+

in line 5, the character reference "&#60;" is expanded immediately and the parameter entity "zz" is stored with the replacement text "<!ENTITY tricky "error-prone" >", -which is a well-formed entity declaration.

in line 6, the reference to "xx" is recognized, +which is a well-formed entity declaration.

+

in line 6, the reference to "xx" is recognized, and the replacement text of "xx" (namely "%zz;") is parsed. The reference to "zz" is recognized in its turn, and its replacement text ("<!ENTITY tricky "error-prone" >") is parsed. The general entity "tricky" has now been -declared, with the replacement text "error-prone".

+declared, with the replacement text "error-prone".

+

in line 8, the reference to the general entity "tricky" is recognized, and it is expanded, so the full content of the "test" element is the self-describing (and ungrammatical) string This sample shows a error-prone method. -

+

+

-
+ Deterministic Content Models

For compatibility, it is @@ -3409,9 +3989,35 @@ In reading this list, it may help to know that in UCS-4, '<' is "#x0000003C" and '?' is "#x0000003F", and the Byte Order Mark required of UTF-16 data streams is "#xFEFF".

-

00 00 00 3C: UCS-4, big-endian machine (1234 order)

3C 00 00 00: UCS-4, little-endian machine (4321 order)

00 00 3C 00: UCS-4, unusual octet order (2143)

00 3C 00 00: UCS-4, unusual octet order (3412)

FE FF: UTF-16, big-endian

FF FE: UTF-16, little-endian

00 3C 00 3F: UTF-16, big-endian, no Byte Order Mark -(and thus, strictly speaking, in error)

3C 00 3F 00: UTF-16, little-endian, no Byte Order Mark -(and thus, strictly speaking, in error)

3C 3F 78 6D: UTF-8, ISO 646, ASCII, some part of ISO 8859, + + +

00 00 00 3C: UCS-4, big-endian machine (1234 order)

+
+ +

3C 00 00 00: UCS-4, little-endian machine (4321 order)

+
+ +

00 00 3C 00: UCS-4, unusual octet order (2143)

+
+ +

00 3C 00 00: UCS-4, unusual octet order (3412)

+
+ +

FE FF: UTF-16, big-endian

+
+ +

FF FE: UTF-16, little-endian

+
+ +

00 3C 00 3F: UTF-16, big-endian, no Byte Order Mark +(and thus, strictly speaking, in error)

+
+ +

3C 00 3F 00: UTF-16, little-endian, no Byte Order Mark +(and thus, strictly speaking, in error)

+
+ +

3C 3F 78 6D: UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the characters of ASCII have their normal positions, width, @@ -3419,11 +4025,19 @@ and values; the actual encoding declaration must be read to detect which of these applies, but since all of these encodings use the same bit patterns for the ASCII characters, the encoding declaration itself may be read reliably -

4C 6F A7 94: EBCDIC (in some flavor; the full +

+
+ +

4C 6F A7 94: EBCDIC (in some flavor; the full encoding declaration must be read to tell which code page is in -use)

other: UTF-8 without an encoding declaration, or else +use)

+
+ +

other: UTF-8 without an encoding declaration, or else the data stream is corrupt, fragmentary, or enclosed in -a wrapper of some kind

+a wrapper of some kind

+ +
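The byte patterns listed above lend themselves to a simple prefix lookup. The following is a minimal sketch, assuming the caller has already read the first four octets of the entity; `sniff_encoding_family` is a hypothetical name, and the function only reports the encoding family. It does not read the encoding declaration.

```python
# (prefix, description) pairs, in the same order as the list above.
_SIGNATURES = [
    (b"\x00\x00\x00\x3C", "UCS-4, big-endian machine (1234 order)"),
    (b"\x3C\x00\x00\x00", "UCS-4, little-endian machine (4321 order)"),
    (b"\x00\x00\x3C\x00", "UCS-4, unusual octet order (2143)"),
    (b"\x00\x3C\x00\x00", "UCS-4, unusual octet order (3412)"),
    (b"\xFE\xFF", "UTF-16, big-endian"),
    (b"\xFF\xFE", "UTF-16, little-endian"),
    (b"\x00\x3C\x00\x3F", "UTF-16, big-endian, no Byte Order Mark"),
    (b"\x3C\x00\x3F\x00", "UTF-16, little-endian, no Byte Order Mark"),
    (b"\x3C\x3F\x78\x6D", "ASCII-compatible; read the encoding declaration"),
    (b"\x4C\x6F\xA7\x94", "EBCDIC; read the encoding declaration"),
]

_OTHER = "other: UTF-8 without an encoding declaration, or unusable data"


def sniff_encoding_family(head):
    """Classify an entity by its leading octets (pass at least four)."""
    for prefix, description in _SIGNATURES:
        if head.startswith(prefix):
            return description
    return _OTHER
```

For the ASCII-compatible and EBCDIC cases the result is only a family; the actual encoding still has to come from the encoding declaration, as the list above notes.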

This level of autodetection is enough to read the XML encoding @@ -3469,97 +4083,64 @@ MIME-type label in an external header, for example, should be part of the RFC document defining the text/xml and application/xml MIME types. In the interests of interoperability, however, the following rules are recommended. -

If an XML entity is in a file, the Byte-Order Mark + +

If an XML entity is in a file, the Byte-Order Mark and encoding-declaration PI are used (if present) to determine the character encoding. All other heuristics and sources of information are solely for error recovery. -

If an XML entity is delivered with a +

+

If an XML entity is delivered with a MIME type of text/xml, then the charset parameter on the MIME type determines the character encoding method; all other heuristics and sources of information are solely for error recovery. -

If an XML entity is delivered +

+

If an XML entity is delivered with a MIME type of application/xml, then the Byte-Order Mark and encoding-declaration PI are used (if present) to determine the character encoding. All other heuristics and sources of information are solely for error recovery. -

+

+ These rules apply only in the absence of protocol-level documentation; in particular, when the MIME types text/xml and application/xml are defined, the recommendations of the relevant RFC will supersede these rules.
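The three recommended rules amount to a small decision table. The sketch below is one hedged reading of them: `pick_encoding` and its parameters are hypothetical, `in_band` stands for whatever the Byte-Order Mark and encoding declaration yielded, and treating a text/xml entity without a charset parameter as an error-recovery case is an assumption, since the rules above do not cover it.

```python
def pick_encoding(mime_type=None, charset=None, in_band=None):
    """Return (encoding, source) following the recommended rules.

    mime_type -- None for an entity read from a file, else the MIME type
    charset   -- the charset parameter of the MIME type, if any
    in_band   -- encoding derived from the Byte-Order Mark and the
                 encoding declaration, if present
    """
    if mime_type == "text/xml":
        if charset is not None:
            # The charset parameter determines the encoding.
            return charset, "charset parameter"
        # Assumption: without a charset parameter, fall back to in-band data.
        return in_band, "error recovery"
    if mime_type is None or mime_type == "application/xml":
        # Files and application/xml rely on in-band information.
        return in_band, "byte order mark / encoding declaration"
    raise ValueError("rules cover only files, text/xml and application/xml")
```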

+
+ W3C XML Working Group +

This specification was prepared and approved for publication by the W3C XML Working Group (WG). WG approval of this specification does not necessarily imply that all WG members voted for its approval. The current and former members of the XML WG are:

+ - -Jon Bosak, Sun -Chair - - -James Clark -Technical Lead - - -Tim Bray, Textuality and Netscape -XML Co-editor - - -Jean Paoli, Microsoft -XML Co-editor - - -C. M. Sperberg-McQueen, U. of Ill. -XML -Co-editor - - -Dan Connolly, W3C -W3C Liaison - - -Paula Angerstein, Texcel - - -Steve DeRose, INSO - - -Dave Hollander, HP - - -Eliot Kimber, ISOGEN - - -Eve Maler, ArborText - - -Tom Magliery, NCSA - - -Murray Maloney, Muzmo and Grif - - -Makoto Murata, Fuji Xerox Information Systems - - -Joel Nava, Adobe - - -Conleth O'Connell, Vignette - - -Peter Sharpe, SoftQuad - - -John Tigue, DataChannel - +Jon Bosak, SunChair +James ClarkTechnical Lead +Tim Bray, Textuality and NetscapeXML Co-editor +Jean Paoli, MicrosoftXML Co-editor +C. M. Sperberg-McQueen, U. of Ill.XML +Co-editor +Dan Connolly, W3CW3C Liaison +Paula Angerstein, Texcel +Steve DeRose, INSO +Dave Hollander, HP +Eliot Kimber, ISOGEN +Eve Maler, ArborText +Tom Magliery, NCSA +Murray Maloney, Muzmo and Grif +Makoto Murata, Fuji Xerox Information Systems +Joel Nava, Adobe +Conleth O'Connell, Vignette +Peter Sharpe, SoftQuad +John Tigue, DataChannel +
diff --git a/result/valid/dia.xml b/result/valid/dia.xml
index c1959845..f7f18531 100644
--- a/result/valid/dia.xml
+++ b/result/valid/dia.xml
@@ -39,100 +39,100 @@ ]>
[hunk body: every removed and added line consisted of markup only and did not survive extraction]
diff --git a/result/valid/xhtml1.xhtml b/result/valid/xhtml1.xhtml
index 75644faf..6089e4e2 100644
--- a/result/valid/xhtml1.xhtml
+++ b/result/valid/xhtml1.xhtml
@@ -65,16 +65,33 @@ The Extensible HyperText Markup Language

W3C Proposed Recommendation 10 December 1999

-
This version:
+
+
This version:
+ +
http://www.w3.org/TR/1999/PR-xhtml1-19991210
(Postscript version, PDF version, ZIP archive, or Gzip'd TAR archive) -
Latest version:
-http://www.w3.org/TR/xhtml1
Previous versions:
-http://www.w3.org/TR/1999/WD-xhtml1-19991124
-http://www.w3.org/TR/1999/PR-xhtml1-19990824
Authors:
See acknowledgements.
+
+ +
Latest version:
+ +
+http://www.w3.org/TR/xhtml1
+ +
Previous versions:
+ +
+http://www.w3.org/TR/1999/WD-xhtml1-19991124
+
+http://www.w3.org/TR/1999/PR-xhtml1-19990824
+ +
Authors:
+ +
See acknowledgements.
+

+

Abstract

+

This specification defines XHTML 1.0, a reformulation of HTML 4.0 as an XML 1.0 application, and three DTDs corresponding to the ones defined by HTML 4.0. The semantics of the elements and @@ -93,12 +112,13 @@ their attributes are defined in the W3C Recommendation for HTML 4.0. These semantics provide the foundation for future extensibility of XHTML. Compatibility with existing HTML user agents is possible by following a small set of guidelines.

+

Status of this document

-

-This section describes the status of this document at the time + +

This section describes the status of this document at the time of its publication. Other documents may supersede this document. The -latest status of this document series is maintained at the W3C. -

+latest status of this document series is maintained at the W3C.

+

This specification is a Proposed Recommendation of the HTML Working Group. It is a revision of the Proposed Recommendation dated 24 August 1999 incorporating changes as a result of comments from the Proposed @@ -106,6 +126,7 @@ Recommendation review, and comments and further deliberations of the W3C HTML Working Group. A diff-marked version from the previous proposed recommendation is available for comparison purposes.

+

On 10 December 1999, this document enters a Proposed Recommendation review period. From that date until 8 January @@ -115,63 +136,115 @@ to review this specification and return comments in their completed ballots to w3c-html-review@w3.org. Please send any comments of a confidential nature in separate email to w3t-html@w3.org, which is visible to the Team only.

+

No sooner than 14 days after the end of the review period, the Director will announce the document's disposition: it may become a W3C Recommendation (possibly with minor changes), it may revert to Working Draft status, or it may be dropped as a W3C work item.

+

Publication as a Proposed Recommendation does not imply endorsement by the W3C membership. This is still a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite W3C Proposed Recommendation as other than "work in progress."

+

This document has been produced as part of the W3C HTML Activity. The goals of the HTML Working Group (members only) are discussed in the HTML Working Group charter (members only).

+

A list of current W3C Recommendations and other technical documents can be found at http://www.w3.org/TR.

+

Public discussion on HTML features takes place on the mailing list www-html@w3.org (archive). The W3C staff contact for work on HTML is Dave Raggett.

+

Please report errors in this document to www-html-editor@w3.org.

+

The list of known errors in this specification is available at http://www.w3.org/1999/12/PR-xhtml1-19991210-errata.

-

-Contents -

+ +

Contents

+ + -

-1. What is XHTML? -

+

1. What is XHTML?

+

XHTML is a family of current and future document types and modules that reproduce, subset, and extend HTML 4.0 [HTML]. XHTML family document types are XML based, and ultimately are designed to work in conjunction with XML-based user agents. The details of this family and its evolution are discussed in more detail in the section on Future Directions.

+

XHTML 1.0 (this specification) is the first document type in the XHTML family. It is a reformulation of the three HTML 4.0 document types as applications of XML 1.0 [XML]. It is intended @@ -179,6 +252,7 @@ to be used as a language for content that is both XML-conforming and, if some simple guidelines are followed, operates in HTML 4.0 conforming user agents. Developers who migrate their content to XHTML 1.0 will realize the following benefits:

+ +

The XHTML family is the next step in the evolution of the Internet. By migrating to XHTML today, content developers can enter the XML world with all of its attendant benefits, while still remaining confident in their content's backward and future compatibility.

-

-1.1 What is HTML 4.0? -

+ +

1.1 What is HTML 4.0?

+

HTML 4.0 [HTML] is an SGML (Standard Generalized Markup Language) application conforming to International Standard ISO 8879, and is widely regarded as the standard publishing language of the World Wide Web.

+

SGML is a language for describing markup languages, particularly those used in electronic document exchange, document management, and document publishing. HTML is an example of a language defined in SGML.

+

SGML has been around since the middle 1980's and has remained quite stable. Much of this stability stems from the fact that the language is both feature-rich and flexible. This flexibility, however, comes at a price, and that price is a level of complexity that has inhibited its adoption in a diversity of environments, including the World Wide Web.

+

HTML, as originally conceived, was to be a language for the exchange of scientific and other technical documents, suitable for use by non-document specialists. HTML addressed the problem @@ -220,6 +298,7 @@ semantic tags suitable for authoring relatively simple documents. In addition to simplifying the document structure, HTML added support for hypertext. Multimedia capabilities were added later.

+

In a remarkably short space of time, HTML became wildly popular and rapidly outgrew its original purpose. Since HTML's inception, there has been rapid invention of new elements for use @@ -227,27 +306,31 @@ within HTML (as a standard) and for adapting HTML to vertical, highly specialized, markets. This plethora of new elements has led to compatibility problems for documents across different platforms.

+

As the heterogeneity of both software and platforms rapidly proliferates, it is clear that the suitability of 'classic' HTML 4.0 for use on these platforms is somewhat limited.

-

-1.2 What is XML? -

+ +

1.2 What is XML?

+

XML is the shorthand for Extensible Markup Language [XML].

+

XML was conceived as a means of regaining the power and flexibility of SGML without most of its complexity. Although a restricted form of SGML, XML nonetheless preserves most of SGML's power and richness, and yet still retains all of SGML's commonly used features.

+

While retaining these beneficial features, XML removes many of the more complex features of SGML that make the authoring and design of suitable software both difficult and costly.

-

-1.3 Why the need for XHTML? -

+ +

1.3 Why the need for XHTML?

+

The benefits of migrating to XHTML 1.0 are described above. Some of the benefits of migrating to XHTML in general are:

+
  • Document developers and user agent designers are constantly discovering new ways to express their ideas through new markup. In XML, it is @@ -258,6 +341,7 @@ modules (described in the forthcoming XHTML Modularization specification). These modules will permit the combination of existing and new feature sets when developing content and when designing new user agents.
  • +
  • Alternate ways of accessing the Internet are constantly being introduced. Some estimates indicate that by the year 2002, 75% of Internet document viewing will be carried out on these alternate @@ -267,49 +351,62 @@ mechanism, servers, proxies, and user agents will be able to perform best effort content transformation. Ultimately, it will be possible to develop XHTML-conforming content that is usable by any XHTML-conforming user agent.
  • +
-

-2. Definitions -

-

-2.1 Terminology -

+

2. Definitions

+ +

2.1 Terminology

+

The following terms are used in this specification. These terms extend the definitions in [RFC2119] in ways based upon similar definitions in ISO/IEC 9945-1:1990 [POSIX.1]:

+
Implementation-defined
+
A value or behavior is implementation-defined when it is left to the implementation to define [and document] the corresponding requirements for correct document construction.
+
May
+
With respect to implementations, the word "may" is to be interpreted as an optional feature that is not required in this specification but can be provided. With respect to Document Conformance, the word "may" means that the optional feature must not be used. The term "optional" has the same definition as "may".
+
Must
+
In this specification, the word "must" is to be interpreted as a mandatory requirement on the implementation or on Strictly Conforming XHTML Documents, depending upon the context. The term "shall" has the same definition as "must".
+
Reserved
+
A value or behavior is unspecified, but it is not allowed to be used by Conforming Documents nor to be supported by a Conforming User Agent.
+
Should
+
With respect to implementations, the word "should" is to be interpreted as an implementation recommendation, but not a requirement. With respect to documents, the word "should" is to be interpreted as recommended programming practice for documents and a requirement for Strictly Conforming XHTML Documents.
+
Supported
+
Certain facilities in this specification are optional. If a facility is supported, it behaves as specified by this specification.
+
Unspecified
+
When a value or behavior is unspecified, the specification defines no portability requirements for a facility on an implementation even when faced with a document that uses the @@ -317,68 +414,85 @@ facility. A document that requires specific behavior in such an instance, rather than tolerating any behavior when using that facility, is not a Strictly Conforming XHTML Document.
-

-2.2 General Terms -

+ +

2.2 General Terms

+
Attribute
+
An attribute is a parameter to an element declared in the DTD. An attribute's type and value range, including a possible default value, are defined in the DTD.
+
DTD
+
A DTD, or document type definition, is a collection of XML declarations that, as a collection, defines the legal structure, elements, and attributes that are available for use in a document that complies to the DTD.
+
Document
+
A document is a stream of data that, after being combined with any other streams it references, is structured such that it holds information contained within elements that are organized as defined in the associated DTD. See Document Conformance for more information.
+
Element
+
An element is a document structuring unit declared in the DTD. The element's content model is defined in the DTD, and additional semantics may be defined in the prose description of the element.
-
-Facilities -
+ +
Facilities
+
Functionality includes elements, attributes, and the semantics associated with those elements and attributes. An implementation supporting that functionality is said to provide the necessary facilities.
+
Implementation
+
An implementation is a system that provides a collection of facilities and services that supports this specification. See User Agent Conformance for more information.
+
Parsing
+
Parsing is the act whereby a document is scanned, and the information contained within the document is filtered into the context of the elements in which the information is structured.
+
Rendering
+
Rendering is the act whereby the information in a document is presented. This presentation is done in the form most appropriate to the environment (e.g. aurally, visually, in print).
+
User Agent
+
A user agent is an implementation that retrieves and processes XHTML documents. See User Agent Conformance for more information.
+
Validation
+
Validation is a process whereby documents are verified against the associated DTD, ensuring that the structure, use of elements, and use of attributes are consistent with the definitions in the DTD.
-
-Well-formed -
+ +
Well-formed
+
A document is well-formed when it is structured according to the rules defined in Section 2.1 of the XML 1.0 Recommendation [XML]. @@ -386,42 +500,45 @@ Basically, this definition states that elements, delimited by their start and end tags, are nested properly within one another.
+ -

-3. Normative Definition of -XHTML 1.0 -

-

-3.1 Document -Conformance -

+

3. Normative Definition of +XHTML 1.0

+ +

3.1 Document +Conformance

+

This version of XHTML provides a definition of strictly conforming XHTML documents, which are restricted to tags and attributes from the XHTML namespace. See Section 3.1.2 for information on using XHTML with other namespaces, for instance, to include metadata expressed in RDF within XHTML documents.

-

-3.1.1 Strictly Conforming -Documents -

+ +

3.1.1 Strictly Conforming +Documents

+

A Strictly Conforming XHTML Document is a document that requires only the facilities described as mandatory in this specification. Such a document must meet all of the following criteria:

+
  1. It must validate against one of the three DTDs found in Appendix A.

  2. +
  3. The root element of the document must be <html>.

  4. +
  5. The root element of the document must designate the XHTML namespace using the xmlns attribute [XMLNAMES]. The namespace for XHTML is defined to be http://www.w3.org/1999/xhtml.

  6. +
  7. There must be a DOCTYPE declaration in the document prior to the root element. The public identifier included in @@ -445,7 +562,9 @@ local system conventions.

+

Here is an example of a minimal XHTML document.

+
 <?xml version="1.0" encoding="UTF-8"?>
@@ -461,22 +580,25 @@ local system conventions.

</body> </html>
+

Note that in this example, the XML declaration is included. An XML declaration like the one above is not required in all XML documents. XHTML document authors are strongly encouraged to use XML declarations in all their documents. Such a declaration is required when the character encoding of the document is other than the default UTF-8 or UTF-16.

-

-3.1.2 Using XHTML with -other namespaces -

+ +

3.1.2 Using XHTML with +other namespaces

+

The XHTML namespace may be used with other XML namespaces as per [XMLNAMES], although such documents are not strictly conforming XHTML 1.0 documents as defined above. Future work by W3C will address ways to specify conformance for documents involving multiple namespaces.

+

The following example shows the way in which XHTML 1.0 could be used in conjunction with the MathML Recommendation:

+
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
@@ -497,8 +619,10 @@ be used in conjunction with the MathML Recommendation:

</html>
+

The following example shows the way in which XHTML 1.0 markup could be incorporated into another XML namespace:

+
 <?xml version="1.0" encoding="UTF-8"?>
@@ -516,33 +640,40 @@ could be incorporated into another XML namespace:

</book>
-

-3.2 User Agent -Conformance -

+ +

3.2 User Agent +Conformance

+

A conforming user agent must meet all of the following criteria:

+
  1. In order to be consistent with the XML 1.0 Recommendation [XML], the user agent must parse and evaluate an XHTML document for well-formedness. If the user agent claims to be a validating user agent, it must also validate documents against their referenced DTDs according to [XML].
  2. +
  3. When the user agent claims to support facilities defined within this specification or required by this specification through normative reference, it must do so in ways consistent with the facilities' definition.
  4. +
  5. When a user agent processes an XHTML document as generic XML, it shall only recognize attributes of type ID (e.g. the id attribute on most XHTML elements) as fragment identifiers.
  6. +
  7. If a user agent encounters an element it does not recognize, it must render the element's content.
  8. +
  9. If a user agent encounters an attribute it does not recognize, it must ignore the entire attribute specification (i.e., the attribute and its value).
  10. +
  11. If a user agent encounters an attribute value it doesn't recognize, it must use the default attribute value.
  12. +
  13. If it encounters an entity reference (other than one of the predefined entities) for which the User Agent has processed no declaration (which could happen if the declaration @@ -550,12 +681,19 @@ is in the external subset which the User Agent hasn't read), the entity reference should be rendered as the characters (starting with the ampersand and ending with the semi-colon) that make up the entity reference.
  14. +
  15. When rendering content, User Agents that encounter characters or character entity references that are recognized but not renderable should display the document in such a way that it is obvious to the user that normal rendering has not taken place.
  16. +
  17. The following characters are defined in [XML] as whitespace characters: -
    • Space (&#x0020;)
    • Tab (&#x0009;)
    • Carriage return (&#x000D;)
    • Line feed (&#x000A;)
    +
      +
    • Space (&#x0020;)
    • +
    • Tab (&#x0009;)
    • +
    • Carriage return (&#x000D;)
    • +
    • Line feed (&#x000A;)
    • +

    The XML processor normalizes different system's line end codes into one @@ -563,7 +701,10 @@ single line-feed character, that is passed up to the application. The XHTML user agent in addition, must treat the following characters as whitespace:

    -
    • Form feed (&#x000C;)
    • Zero-width space (&#x200B;)
    +
      +
    • Form feed (&#x000C;)
    • +
    • Zero-width space (&#x200B;)
    • +

    In elements where the 'xml:space' attribute is set to 'preserve', the user @@ -573,20 +714,26 @@ Otherwise, whitespace is handled according to the following rules:

    -
    • +
        +
      • All whitespace surrounding block elements should be removed. -
      • +
      • +
      • Comments are removed entirely and do not affect whitespace handling. One whitespace character on either side of a comment is treated as two white space characters. -
      • +
      • +
      • Leading and trailing whitespace inside a block element must be removed. -
      • Line feed characters within a block element must be converted into a +
      • +
      • Line feed characters within a block element must be converted into a space (except when the 'xml:space' attribute is set to 'preserve'). -
      • +
      • +
      • A sequence of white space characters must be reduced to a single space character (except when the 'xml:space' attribute is set to 'preserve'). -
      • +
      • +
      • With regard to rendition, the User Agent should render the content in a manner appropriate to the language in which the content is written. @@ -602,134 +749,146 @@ look like words in Arabic to an English eye frequently encode several words, e.g. 'kitAbuhum' = 'kitAbu-hum' = 'book them' == their book); and languages in the Chinese script tradition typically neither encode such delimiters nor use typographic whitespace in this way. -
      +
    • +

    Whitespace in attribute values is processed according to [XML].

+ -

-4. Differences with HTML -4.0 -

+

4. Differences with HTML +4.0

+

Due to the fact that XHTML is an XML application, certain practices that were perfectly legal in SGML-based HTML 4.0 [HTML] must be changed.

-

-4.1 Documents must be -well-formed -

+ +

4.1 Documents must be +well-formed

+

Well-formedness is a new concept introduced by [XML]. Essentially this means that all elements must either have closing tags or be written in a special form (as described below), and that all the elements must nest.

+

Although overlapping is illegal in SGML, it was widely tolerated in existing browsers.

+

CORRECT: nested elements.

<p>here is an emphasized <em>paragraph</em>.</p>

+

INCORRECT: overlapping elements

<p>here is an emphasized <em>paragraph.</p></em>

-

-4.2 Element and attribute -names must be in lower case -

+ +

4.2 Element and attribute +names must be in lower case

+

XHTML documents must use lower case for all HTML element and attribute names. This difference is necessary because XML is case-sensitive e.g. <li> and <LI> are different tags.

-

-4.3 For non-empty elements, -end tags are required -

+ +

4.3 For non-empty elements, +end tags are required

+

In SGML-based HTML 4.0 certain elements were permitted to omit the end tag; with the elements that followed implying closure. This omission is not permitted in XML-based XHTML. All elements other than those declared in the DTD as EMPTY must have an end tag.

+

CORRECT: terminated elements

<p>here is a paragraph.</p><p>here is another paragraph.</p>

+

INCORRECT: unterminated elements

<p>here is a paragraph.<p>here is another paragraph.

-

-4.4 Attribute values must -always be quoted -

+ +

4.4 Attribute values must +always be quoted

+

All attribute values must be quoted, even those which appear to be numeric.

+

CORRECT: quoted attribute values

<table rows="3">

+

INCORRECT: unquoted attribute values

<table rows=3>

-

-4.5 Attribute -Minimization -

+ +

4.5 Attribute +Minimization

+

XML does not support attribute minimization. Attribute-value pairs must be written in full. Attribute names such as compact and checked cannot occur in elements without their value being specified.

+

CORRECT: unminimized attributes

<dl compact="compact">

+

INCORRECT: minimized attributes

<dl compact>

-

-4.6 Empty Elements -

+ +

4.6 Empty Elements

+

Empty elements must either have an end tag or the start tag must end with />. For instance, <br/> or <hr></hr>. See HTML Compatibility Guidelines for information on ways to ensure this is backward compatible with HTML 4.0 user agents.

+

CORRECT: terminated empty tags

<br/><hr/>

+

INCORRECT: unterminated empty tags

<br><hr>

-

-4.7 Whitespace handling in -attribute values -

+ +

4.7 Whitespace handling in +attribute values

+

In attribute values, user agents will strip leading and trailing whitespace from attribute values and map sequences of one or more whitespace characters (including line breaks) to a single inter-word space (an ASCII space character for western scripts). See Section 3.3.3 of [XML].

-

-4.8 Script and Style -elements -

+ +

4.8 Script and Style +elements

+

In XHTML, the script and style elements are declared as having #PCDATA content. As a result, < and & will be treated as the start of markup, and @@ -739,6 +898,7 @@ will be recognized as entity references by the XML processor to the content of the script or style element within a CDATA marked section avoids the expansion of these entities.

+
 <script>
@@ -748,18 +908,21 @@ entities.

</script>
+

CDATA sections are recognized by the XML processor and appear as nodes in the Document Object Model, see Section 1.3 of the DOM Level 1 Recommendation [DOM].

+

An alternative is to use external script and style documents.

-

-4.9 SGML exclusions -

+ +

4.9 SGML exclusions

+

SGML gives the writer of a DTD the ability to exclude specific elements from being contained within an element. Such prohibitions (called "exclusions") are not possible in XML.

+

For example, the HTML 4.0 Strict DTD forbids the nesting of an 'a' element within another 'a' element to any descendant depth. It is not possible to spell out such @@ -768,10 +931,10 @@ defined in the DTD, certain elements should not be nested. A summary of such elements and the elements that should not be nested in them is found in the normative Appendix B.

-

-4.10 The elements with 'id' and 'name' -attributes -

+ +

4.10 The elements with 'id' and 'name' +attributes

+

HTML 4.0 defined the name attribute for the elements a, applet, frame, @@ -794,190 +957,197 @@ compatible when serving XHTML documents as media type text/html.

Note that in XHTML 1.0, the name attribute of these elements is formally deprecated, and will be removed in a subsequent version of XHTML.

+ -

-5. Compatibility Issues -

+

5. Compatibility Issues

+

Although there is no requirement for XHTML 1.0 documents to be compatible with existing user agents, in practice this is easy to accomplish. Guidelines for creating compatible documents can be found in Appendix C.

-

-5.1 Internet Media Type -

+ +

5.1 Internet Media Type

As of the publication of this recommendation, the general recommended MIME labeling for XML-based applications has yet to be resolved.

+

However, XHTML Documents which follow the guidelines set forth in Appendix C, "HTML Compatibility Guidelines" may be labeled with the Internet Media Type "text/html", as they are compatible with most HTML browsers. This document makes no recommendation about MIME labeling of other XHTML documents.

+ -

-6. Future Directions -

+

6. Future Directions

+

XHTML 1.0 provides the basis for a family of document types that will extend and subset XHTML, in order to support a wide range of new devices and applications, by defining modules and specifying a mechanism for combining these modules. This mechanism will enable the extension and sub-setting of XHTML 1.0 in a uniform way through the definition of new modules.

-

-6.1 Modularizing HTML -

+ +

6.1 Modularizing HTML

+

As the use of XHTML moves from the traditional desktop user agents to other platforms, it is clear that not all of the XHTML elements will be required on all platforms. For example a hand held device or a cell-phone may only support a subset of XHTML elements.

+

The process of modularization breaks XHTML up into a series of smaller element sets. These elements can then be recombined to meet the needs of different communities.

+

These modules will be defined in a later W3C document.

-

-6.2 Subsets and -Extensibility -

+ +

6.2 Subsets and +Extensibility

+

Modularization brings with it several advantages:

+
  • It provides a formal mechanism for sub-setting XHTML.

  • +
  • It provides a formal mechanism for extending XHTML.

  • +
  • It simplifies the transformation between document types.

  • +
  • It promotes the reuse of modules in new document types.

-

-6.3 Document -Profiles -

+ +

6.3 Document +Profiles

+

A document profile specifies the syntax and semantics of a set of documents. Conformance to a document profile provides a basis for interoperability guarantees. The document profile specifies the facilities required to process documents of that type, e.g. which image formats can be used, levels of scripting, style sheet support, and so on.

+

For product designers this enables various groups to define their own standard profile.

+

For authors this will obviate the need to write several different versions of documents for different clients.

+

For special groups such as chemists, medical doctors, or mathematicians this allows a special profile to be built using standard HTML elements plus a group of elements geared to the specialist's needs.

+

Appendix A. DTDs

-

-This appendix is normative. -

+ +

This appendix is normative.

+

These DTDs and entity sets form a normative part of this specification. The complete set of DTD files together with an XML declaration and SGML Open Catalog is included in the zip file for this specification.

-

-A.1 Document Type -Definitions -

+ +

A.1 Document Type +Definitions

+

These DTDs approximate the HTML 4.0 DTDs. It is likely that when the DTDs are modularized, a method of DTD construction will be employed that corresponds more closely to HTML 4.0.

+ -

-A.2 Entity Sets -

+ +

A.2 Entity Sets

+

The XHTML entity sets are the same as for HTML 4.0, but have been modified to be valid XML 1.0 entity declarations. Note the entity for the Euro currency sign (&euro; or &#8364; or &#x20AC;) is defined as part of the special characters.

+ + -

-Appendix B. Element -Prohibitions -

-

-This appendix is normative. -

+

Appendix B. Element +Prohibitions

+ +

This appendix is normative.

+

The following elements have prohibitions on which elements they can contain (see Section 4.9). This prohibition applies to all depths of nesting, i.e. it contains all the descendant elements.

-
-
-a -
+ +
a
cannot contain other a elements.
-
-pre -
+
pre
cannot contain the img, object, big, small, sub, or sup elements.
-
-button -
+ +
button
cannot contain the input, select, textarea, label, button, form, fieldset, iframe or isindex elements.
-
-label -
+
label
cannot contain other label elements.
-
-form -
+
form
cannot contain other form elements.
+ -

-Appendix C. -HTML Compatibility Guidelines -

-

-This appendix is informative. -

+

Appendix C. +HTML Compatibility Guidelines

+ +

This appendix is informative.

+

This appendix summarizes design guidelines for authors who wish their XHTML documents to render on existing HTML user agents.

+

C.1 Processing Instructions

Be aware that processing instructions are rendered on some user agents. However, also note that when the XML declaration is not included in a document, the document can only use the default character encodings UTF-8 or UTF-16.

+

C.2 Empty Elements

Include a space before the trailing / and > of empty elements, e.g. @@ -986,12 +1156,14 @@ or UTF-16.

src="karen.jpg" alt="Karen" />. Also, use the minimized tag syntax for empty elements, e.g. <br />, as the alternative syntax <br></br> allowed by XML gives uncertain results in many existing user agents.

+

C.3 Element Minimization and Empty Element Content

Given an empty instance of an element whose content model is not EMPTY (for example, an empty title or paragraph) do not use the minimized form (e.g. use <p> </p> and not <p />).

+

C.4 Embedded Style Sheets and Scripts

Use external style sheets if your style sheet uses < or & or ]]> or --. Use @@ -1001,18 +1173,22 @@ are permitted to silently remove the contents of comments. Therefore, the histor practice of "hiding" scripts and style sheets within comments to make the documents backward compatible is likely to not work as expected in XML-based implementations.

+

C.5 Line Breaks within Attribute Values

Avoid line breaks and multiple whitespace characters within attribute values. These are handled inconsistently by user agents.

+

C.6 Isindex

Don't include more than one isindex element in the document head. The isindex element is deprecated in favor of the input element.

+

C.7 The lang and xml:lang Attributes

Use both the lang and xml:lang attributes when specifying the language of an element. The value of the xml:lang attribute takes precedence.

+

C.8 Fragment Identifiers

In XML, URIs [RFC2396] that end with fragment identifiers of the form "#foo" do not refer to elements with an attribute @@ -1022,6 +1198,7 @@ id attribute in HTML 4.0. Many existing HTML clients don't support the use of ID-type attributes in this way, so identical values may be supplied for both of these attributes to ensure maximum forward and backward compatibility (e.g., <a id="foo" name="foo">...</a>).

+

Further, since the set of legal values for attributes of type ID is much smaller than for those of type CDATA, the type of the name @@ -1039,6 +1216,7 @@ conversion.

name attribute of the a, applet, frame, iframe, img, and map elements, and it will be removed from XHTML in subsequent versions.

+

C.9 Character Encoding

To specify a character encoding in the document, use both the encoding attribute specification on the xml declaration (e.g. @@ -1048,6 +1226,7 @@ encoding="EUC-JP"?>) and a meta http-equiv statement content='text/html; charset="EUC-JP"' />). The value of the encoding attribute of the xml processing instruction takes precedence.

+

C.10 Boolean Attributes

Some HTML user agents are unable to interpret boolean attributes when these appear in their full (non-minimized) form, @@ -1058,6 +1237,7 @@ ismap, declare, noshade, checked, disabled, readonly, multiple, selected, noresize, defer.

+

C.11 Document Object Model and XHTML

The Document Object Model level 1 Recommendation [DOM] @@ -1089,6 +1269,7 @@ extraneous elements, XHTML has made the elements optional. Applications need to adapt to this accordingly. +

C.12 Using Ampersands in Attribute Values

When an attribute value contains an ampersand, it must be expressed as a character @@ -1101,44 +1282,51 @@ CGI script that takes parameters, it must be expressed as rather than as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user.

+

C.13 Cascading Style Sheets (CSS) and XHTML

+

The Cascading Style Sheets level 2 Recommendation [CSS2] defines style properties which are applied to the parse tree of the HTML or XML document. Differences in parsing will produce different visual or aural results, depending on the selectors used. The following hints will reduce this effect for documents which are served without modification as both media types:

+
  1. CSS style sheets for XHTML should use lower case element and attribute names.
  2. + +
In tables, the tbody element will be inferred by the parser of an HTML user agent, but not by the parser of an XML user agent. Therefore you should always explicitly add a tbody element if it is referred to in a CSS selector.
  4. +
  5. Within the XHTML name space, user agents are expected to recognize the "id" attribute as an attribute of type ID. Therefore, style sheets should be able to continue using the shorthand "#" selector syntax even if the user agent does not read the DTD.
  6. +
  7. Within the XHTML name space, user agents are expected to recognize the "class" attribute. Therefore, style sheets should be able to continue using the shorthand "." selector syntax.
  8. +
  9. CSS defines different conformance rules for HTML and XML documents; be aware that the HTML rules apply to XHTML documents delivered as HTML and the XML rules apply to XHTML documents delivered as XML.
-

-Appendix D. -Acknowledgements -

-

-This appendix is informative. -

+

Appendix D. +Acknowledgements

+ +

This appendix is informative.

+

This specification was written with the participation of the members of the W3C HTML working group:

+
Steven Pemberton, CWI (HTML Working Group Chair)
Murray Altheim, Sun Microsystems
@@ -1165,108 +1353,87 @@ Chris Wilson, Microsoft
Ted Wugofski, Gateway 2000
Dan Zigmond, WebTV Networks
+ -

-Appendix E. References -

-

-This appendix is informative. -

+

Appendix E. References

+ +

This appendix is informative.

+
-
- -[CSS2] - -
+ +
[CSS2]
+
"Cascading Style Sheets, level 2 (CSS2) Specification", B. Bos, H. W. Lie, C. Lilley, I. Jacobs, 12 May 1998.
Available at: http://www.w3.org/TR/REC-CSS2
-
- -[DOM] - -
+ +
[DOM]
+
"Document Object Model (DOM) Level 1 Specification", Lauren Wood et al., 1 October 1998.
Available at: http://www.w3.org/TR/REC-DOM-Level-1
-
- -[HTML] - -
+ +
[HTML]
+
"HTML 4.01 Specification", D. Raggett, A. Le Hors, I. Jacobs, 24 August 1999.
Available at: http://www.w3.org/TR/1999/PR-html40-19990824
-
- -[POSIX.1] - -
+ +
[POSIX.1]
+
"ISO/IEC 9945-1:1990 Information Technology - Portable Operating System Interface (POSIX) - Part 1: System Application Program Interface (API) [C Language]", Institute of Electrical and Electronics Engineers, Inc, 1990.
-
- - -[RFC2046] - -
+ +
+[RFC2046]
+
"RFC2046: Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types", N. Freed and N. Borenstein, November 1996.
Available at http://www.ietf.org/rfc/rfc2046.txt. Note that this RFC obsoletes RFC1521, RFC1522, and RFC1590.
-
- - -[RFC2119] - -
+ +
+[RFC2119]
+
"RFC2119: Key words for use in RFCs to Indicate Requirement Levels", S. Bradner, March 1997.
Available at: http://www.ietf.org/rfc/rfc2119.txt
-
- - -[RFC2376] - -
+ +
+[RFC2376]
+
"RFC2376: XML Media Types", E. Whitehead, M. Murata, July 1998.
Available at: http://www.ietf.org/rfc/rfc2376.txt
-
- - -[RFC2396] - -
+ +
+[RFC2396]
+
"RFC2396: Uniform Resource Identifiers (URI): Generic Syntax", T. Berners-Lee, R. Fielding, L. Masinter, August 1998.
This document updates RFC1738 and RFC1808.
Available at: http://www.ietf.org/rfc/rfc2396.txt
-
- -[XML] - -
+ +
[XML]
+
"Extensible Markup Language (XML) 1.0 Specification", T. Bray, J. Paoli, C. M. Sperberg-McQueen, 10 February 1998.
Available at: http://www.w3.org/TR/REC-xml
-
- -[XMLNAMES] - -
+ +
[XMLNAMES]
+
"Namespaces in XML", T. Bray, D. Hollander, A. Layman, 14 January 1999.
XML namespaces provide a simple method for qualifying names used @@ -1274,11 +1441,10 @@ in XML documents by associating them with namespaces identified by URI.
Available at: http://www.w3.org/TR/REC-xml-names
+
-

- -Level Triple-A conformance icon, W3C-WAI Web Content Accessibility Guidelines 1.0 -

+

+Level Triple-A conformance icon, W3C-WAI Web Content Accessibility Guidelines 1.0