This appendix is informative.
This appendix summarizes design guidelines for authors who wish their XHTML documents to render on existing HTML user agents. Note that this recommendation does not define how HTML conforming user agents should process HTML documents. Nor does it define the meaning of the Internet Media Type text/html. For these definitions, see [HTML4] and [RFC2854] respectively.
Be aware that processing instructions are rendered on some user agents. Also, some user agents interpret the XML declaration to mean that the document is unrecognized XML rather than HTML, and therefore may not render the document as expected. For compatibility with these types of legacy browsers, you may want to avoid using processing instructions and XML declarations. Remember, however, that when the XML declaration is not included in a document, the document can only use the default character encodings UTF-8 or UTF-16.
Include a space before the trailing / and > of empty elements, e.g. <br />, <hr /> and <img src="karen.jpg" alt="Karen" />. Also, use the minimized tag syntax for empty elements, e.g. <br />, as the alternative syntax <br></br> allowed by XML gives uncertain results in many existing user agents.
Given an empty instance of an element whose content model is not EMPTY (for example, an empty title or paragraph) do not use the minimized form (e.g. use <p> </p> and not <p />).
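The two guidelines above on elements without content can be sketched as a small serializer. This is a minimal illustration, not part of the recommendation: the function name serialize_childless is hypothetical, and the EMPTY_ELEMENTS set is assumed to cover the elements declared EMPTY across the XHTML 1.0 DTDs.

```python
from xml.sax.saxutils import quoteattr

# Assumption: elements declared EMPTY in the XHTML 1.0 Strict,
# Transitional and Frameset DTDs.
EMPTY_ELEMENTS = frozenset({
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
})


def serialize_childless(name, attrs=None):
    """Serialize an element that has no content (hypothetical helper).

    EMPTY elements use the minimized form with a space before "/>".
    All other elements get explicit start and end tags, never "<p />".
    """
    attr_text = "".join(
        f" {key}={quoteattr(value)}" for key, value in (attrs or {}).items()
    )
    if name in EMPTY_ELEMENTS:
        return f"<{name}{attr_text} />"
    return f"<{name}{attr_text}></{name}>"


print(serialize_childless("br"))
print(serialize_childless("img", {"src": "karen.jpg", "alt": "Karen"}))
print(serialize_childless("p"))
```

The sketch emits an element with no content at all as <p></p>; a caller that wants the single space shown in the guideline's own example would need to add it.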
Use external style sheets if your style sheet uses < or & or ]]> or --. Use external scripts if your script uses < or & or ]]> or --. Note that XML parsers are permitted to silently remove the contents of comments. Therefore, the historical practice of "hiding" scripts and style sheets within "comments" to make the documents backward compatible is likely to not work as expected in XML-based user agents.
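As a rough aid for applying this guideline, the check below flags style sheet or script text that contains any of the four sequences named above. The function name needs_external_file is hypothetical, and the check is a plain substring test rather than a parser, so it may flag sequences that are harmless in context.

```python
# The four sequences the guideline singles out as unsafe in embedded
# style sheets and scripts.
RISKY_SEQUENCES = ("<", "&", "]]>", "--")


def needs_external_file(source: str) -> bool:
    """Return True if the text contains any of the risky sequences."""
    return any(sequence in source for sequence in RISKY_SEQUENCES)


print(needs_external_file("code { color: green; }"))
print(needs_external_file("if (a < b && c) { i--; }"))
```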
Avoid line breaks and multiple white space characters within attribute values. These are handled inconsistently by user agents.
Don't include more than one isindex element in the document head. The isindex element is deprecated in favor of the input element.
lang and xml:lang Attributes
Use both the lang and xml:lang attributes when specifying the language of an element. The value of the xml:lang attribute takes precedence.
In XML, URI-references [RFC2396] that end with fragment identifiers of the form "#foo" do not refer to elements with an attribute name="foo"; rather, they refer to elements with an attribute defined to be of type ID, e.g., the id attribute in HTML 4. Many existing HTML clients don't support the use of ID-type attributes in this way, so identical values may be supplied for both of these attributes to ensure maximum forward and backward compatibility (e.g., <a id="foo" name="foo">...</a>).
Further, since the set of legal values for attributes of type ID is much smaller than for those of type CDATA, the type of the name attribute has been changed to NMTOKEN. This attribute is constrained such that it can only have the same values as type ID, or as the Name production in XML 1.0 Section 2.3, production 5. Unfortunately, this constraint cannot be expressed in the XHTML 1.0 DTDs. Because of this change, care must be taken when converting existing HTML documents. The values of these attributes must be unique within the document, valid, and any references to these fragment identifiers (both internal and external) must be updated should the values be changed during conversion.
Note that the collection of legal values in XML 1.0 Section 2.3, production 5 is much larger than that permitted to be used in the ID and NAME types defined in HTML 4.
When defining fragment identifiers to be backward-compatible, only strings matching the pattern [A-Za-z][A-Za-z0-9:_.-]* should be used. See Section 6.2 of [HTML4] for more information.
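The pattern above can be applied directly as a regular expression. The following is a minimal sketch; the function name is_backward_compatible_id is hypothetical, and the whole string is required to match.

```python
import re

# The pattern given in the guideline, applied to the entire value.
BACKWARD_COMPATIBLE_ID = re.compile(r"[A-Za-z][A-Za-z0-9:_.-]*")


def is_backward_compatible_id(value: str) -> bool:
    """Return True if value is usable as an HTML 4 and XHTML fragment identifier."""
    return BACKWARD_COMPATIBLE_ID.fullmatch(value) is not None


for candidate in ("foo", "section-1.2", "1st", "_private", "naïve"):
    print(candidate, is_backward_compatible_id(candidate))
```

Values such as "_private" or "naïve" are legal XML names but fail this check, which reflects the note that XML permits a much larger set of values than HTML 4.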
Finally, note that XHTML 1.0 has deprecated the name attribute of the a, applet, form, frame, iframe, img, and map elements, and it will be removed from XHTML in subsequent versions.
Historically, the character encoding of an HTML document is either specified by a web server via the charset parameter of the HTTP Content-Type header, or via a meta element in the document itself. In an XML document, the character encoding of the document is specified on the XML declaration (e.g., <?xml version="1.0" encoding="EUC-JP"?>).
In order to portably present documents with specific character encodings, the best approach is to ensure that the web server provides the correct headers. If this is not possible, a document that wants to set its character encoding explicitly must include both the XML declaration with an encoding declaration and a meta http-equiv statement (e.g., <meta http-equiv="Content-type" content="text/html; charset=EUC-JP" />). In XHTML-conforming user agents, the value of the encoding declaration of the XML declaration takes precedence.
Note: be aware that if a document must include the character encoding declaration in a meta http-equiv statement, that document may always be interpreted by HTTP servers and/or user agents as being of the internet media type defined in that statement. If a document is to be served as multiple media types, the HTTP server must be used to set the encoding of the document.
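When the server headers cannot be controlled, one way to keep the two in-document declarations consistent is to generate both from a single encoding name. The sketch below assumes a hypothetical helper, encoding_declarations, and reuses the exact forms shown in the examples above; it does not validate the encoding name.

```python
def encoding_declarations(encoding: str) -> tuple[str, str]:
    """Build the XML declaration and the matching meta http-equiv statement.

    Hypothetical helper: both strings are derived from one encoding name
    so that they cannot disagree.
    """
    xml_declaration = f'<?xml version="1.0" encoding="{encoding}"?>'
    meta_statement = (
        '<meta http-equiv="Content-type" '
        f'content="text/html; charset={encoding}" />'
    )
    return xml_declaration, meta_statement


for line in encoding_declarations("EUC-JP"):
    print(line)
```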
Some HTML user agents are unable to interpret boolean attributes when these appear in their full (non-minimized) form, as required by XML 1.0. Note this problem doesn't affect user agents compliant with HTML 4. The following attributes are involved: compact, nowrap, ismap, declare, noshade, checked, disabled, readonly, multiple, selected, noresize, defer.
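For reference, the full form that XML requires repeats the attribute name as its value (for example, checked="checked"). The sketch below shows one way to produce it; the function name xhtml_attribute is hypothetical, and the set of names is taken from the list above.

```python
from xml.sax.saxutils import quoteattr

# The boolean attributes listed in the guideline above.
BOOLEAN_ATTRIBUTES = frozenset({
    "compact", "nowrap", "ismap", "declare", "noshade", "checked",
    "disabled", "readonly", "multiple", "selected", "noresize", "defer",
})


def xhtml_attribute(name, value=None):
    """Serialize one attribute in full (non-minimized) form.

    A boolean attribute passed without a value is expanded to
    name="name"; any other attribute must come with a value.
    """
    if value is None:
        if name not in BOOLEAN_ATTRIBUTES:
            raise ValueError(f"attribute {name!r} requires a value")
        value = name
    return f"{name}={quoteattr(value)}"


print(xhtml_attribute("checked"))
print(xhtml_attribute("type", "checkbox"))
```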
The Document Object Model level 1 Recommendation [DOM] defines document object model interfaces for XML and HTML 4. The HTML 4 document object model specifies that HTML element and attribute names are returned in upper-case. The XML document object model specifies that element and attribute names are returned in the case they are specified. In XHTML 1.0, elements and attributes are specified in lower-case. This apparent difference can be addressed in two ways:
1. Applications that access XHTML documents served as Internet media type text/html via the DOM can use the HTML DOM, and can rely upon element and attribute names being returned in upper-case from those interfaces.
2. Applications that access XHTML documents served as Internet media types text/xml, application/xml, or application/xhtml+xml can also use the XML DOM. Elements and attributes will be returned in lower-case. Also, some XHTML elements may or may not appear in the object tree because they are optional in the content model (e.g. the tbody element within table). This occurs because in HTML 4 some elements were permitted to be minimized such that their start and end tags are both omitted (an SGML feature). This is not possible in XML. Rather than require document authors to insert extraneous elements, XHTML has made the elements optional. User agents need to adapt to this accordingly. For further information on
this topic, see [DOM2].
In both SGML and XML, the ampersand character ("&") declares the beginning of an entity reference (e.g., &reg; for the registered trademark symbol "®"). Unfortunately, many HTML user agents have silently ignored incorrect usage of the ampersand character in HTML documents - treating ampersands that do not look like entity references as literal ampersands. XML-based user agents will not tolerate this incorrect usage, and any document that uses an ampersand incorrectly will not be "valid", and consequently will not conform to this specification. In order to ensure that documents are compatible with historical HTML user agents and XML-based user agents, ampersands used in a document that are to be treated as literal characters must be expressed themselves as an entity reference (e.g. "&amp;"). For example, when the href attribute of the a element refers to a CGI script that takes parameters, it must be expressed as http://my.site.dom/cgi-bin/myscript.pl?class=guest&amp;name=user rather than as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user.
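The effect of the ampersand rule on an XML parser can be observed with a short experiment. This sketch uses Python's standard xml.etree.ElementTree parser as a stand-in for an XML-based user agent, which is an assumption for illustration; the helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

url = "http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user"


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# The literal ampersand is written as-is in the first link and as an
# entity reference in the second.
raw_link = f'<a href="{url}">link</a>'
escaped_link = f'<a href="{escape(url)}">link</a>'

print(is_well_formed(raw_link))
print(is_well_formed(escaped_link))

# After parsing, the attribute value is the original URL again.
print(ET.fromstring(escaped_link).get("href") == url)
```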
The Cascading Style Sheets level 2 Recommendation [CSS2] defines style properties which are applied to the parse tree of the HTML or XML documents. Differences in parsing will produce different visual or aural results, depending on the selectors used. The following hints will reduce this effect for documents which are served without modification as both media types:
In HTML 4 and XHTML, the style element can be used to define document-internal style rules. In XML, an XML stylesheet declaration is used to define style rules. In order to be compatible with this convention, style elements should have their fragment identifier set using the id attribute, and an XML stylesheet declaration should reference this fragment. For example:
<?xml-stylesheet href="W3C-REC.css" type="text/css"?>
<?xml-stylesheet href="#internalStyle" type="text/css"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>An internal stylesheet example</title>
<style type="text/css" id="internalStyle">
  code {
    color: green;
    font-family: monospace;
    font-weight: bold;
  }
</style>
</head>
<body>
<p>
  This is text that uses our
  <code>internal stylesheet</code>.
</p>
</body>
</html>
Some characters that are legal in HTML documents are illegal in XML documents. For example, in HTML the Formfeed character (U+000C) is treated as white space, whereas in XHTML it is illegal because of XML's definition of characters.
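Characters of this kind can be located before a document is served as XML. The sketch below assumes the character ranges of the Char production in XML 1.0 and reports every character outside them; the function name find_illegal_xml_chars is hypothetical.

```python
import re

# Complement of the XML 1.0 Char production (assumed here to be):
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_xml_chars(text: str) -> list[tuple[int, str]]:
    """Return (offset, code point) pairs for characters XML does not allow."""
    return [
        (match.start(), f"U+{ord(match.group()):04X}")
        for match in ILLEGAL_XML_CHARS.finditer(text)
    ]


print(find_illegal_xml_chars("page one\fpage two"))
print(find_illegal_xml_chars("tabs\tand\nnewlines are fine"))
```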
The named character reference &apos; (the apostrophe, U+0027) was introduced in XML 1.0 but does not appear in HTML. Authors should therefore use &#39; instead of &apos; to work as expected in HTML 4 user agents.
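A document that already contains the named reference can be adjusted with a simple substitution. This is a minimal sketch with a hypothetical function name; it is a plain text replacement, so it assumes the reference does not occur inside CDATA sections or comments, where it would be literal text.

```python
def make_apostrophes_html_safe(markup: str) -> str:
    """Replace the XML-only named reference with the numeric reference."""
    return markup.replace("&apos;", "&#39;")


print(make_apostrophes_html_safe("<p>It&apos;s here</p>"))
```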