Unicode Character Database |
Revision | 5.1.0 |
Authors | Asmus Freytag, Ken Whistler |
Date | 2007-06-13 |
This Version | http://www.unicode.org/Public/5.1.0/ucd/NamesList.html |
Previous Version | http://www.unicode.org/Public/5.0.0/ucd/NamesList.html |
Latest Version | http://www.unicode.org/Public/UNIDATA/NamesList.html |
This file describes the format and contents of NamesList.txt.
The file and the files described herein are part of the Unicode Character Database (UCD) and are governed by the UCD Terms of Use stated at the end.
The Unicode name list file NamesList.txt (also NamesList.lst) is a plain text file used to drive the layout of the character code charts in the Unicode Standard. The information in this file is a combination of several fields from the UnicodeData.txt and Blocks.txt files, together with additional annotations for many characters.
This document describes the syntax rules for the file format, but also gives brief information on how each construct is rendered when laid out for the code charts. Some of the syntax elements are used only in preparation of the drafts of the code charts and are not present in the final, released form of the NamesList.txt file.
The syntax for formal aliases and index tabs was introduced with Unicode 5.0. The syntax for marginal sidebar comments is utilized extensively in draft versions of the NamesList.txt file.
The same input file can be used for the draft preparation for ISO/IEC 10646 (referred to below as ISO-style). This requires the name list file to carry some information that is not needed for the Unicode code charts and is in fact removed during parsing.
With access to the layout program (unibook.exe), it is a simple matter to create name lists for formatting working drafts that contain proposed characters.
The content of the NamesList.txt file is optimized for code chart creation. Some information that can be inferred by the reader from context has been suppressed to make the code charts more readable.
The NamesList files are plain text files which in their simplest form look like this:
@@<tab>0020<tab>BASIC LATIN<tab>007F
; this is a file comment (ignored)
0020<tab>SPACE
0021<tab>EXCLAMATION MARK
0022<tab>QUOTATION MARK
. . .
007F<tab>DELETE
The semicolon (as first character), @ and <tab> characters are used by the file syntax and must be provided as shown. Hexadecimal digits must be in UPPER CASE. A double @@ introduces a block header, with the title, and start and ending code of the block provided as shown.
For a minimal name list, only the NAME_LINE and BLOCKHEADER and their constituent syntax elements are needed.
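As an illustration of the minimal form described above, the following sketch reads only BLOCKHEADER and NAME_LINE entries and skips file comments and empty lines. The function name `parse_minimal_namelist` and the dictionary layout are hypothetical choices made for this example and are not part of the file format; every other line type is silently ignored.

```python
import re

# CHAR is 4 to 6 upper-case hexadecimal digits, per the file syntax.
CHAR = r"[0-9A-F]{4,6}"
# "@@" <tab> BLOCKSTART <tab> BLOCKNAME <tab> BLOCKEND
BLOCKHEADER = re.compile(rf"^@@\t+({CHAR})\t+(.+?)\t+({CHAR})$")
# CHAR <tab> NAME
NAME_LINE = re.compile(rf"^({CHAR})\t+(.+)$")


def parse_minimal_namelist(text):
    """Return a list of blocks, each with its title, range, and names.

    Hypothetical helper: only the minimal subset of the syntax is handled.
    """
    blocks = []
    for line in text.splitlines():
        if not line or line.startswith(";"):
            continue  # empty line or file comment
        header = BLOCKHEADER.match(line)
        if header:
            start, title, end = header.groups()
            blocks.append({
                "title": title,
                "start": int(start, 16),
                "end": int(end, 16),
                "names": {},
            })
            continue
        name = NAME_LINE.match(line)
        if name and blocks:
            # Map the code point to its character name.
            blocks[-1]["names"][int(name.group(1), 16)] = name.group(2)
    return blocks


sample = (
    "@@\t0020\tBASIC LATIN\t007F\n"
    "; this is a file comment (ignored)\n"
    "0020\tSPACE\n"
    "0021\tEXCLAMATION MARK\n"
    "007F\tDELETE\n"
)
blocks = parse_minimal_namelist(sample)
```

Because lower-case hexadecimal digits do not match `CHAR`, a line such as `007f<tab>DELETE` is skipped rather than accepted, which mirrors the upper-case requirement stated above.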
The full syntax with all the options is provided in the following sections.
This section defines the overall file structure
NAMELIST:     TITLE_PAGE* BLOCK*

TITLE_PAGE:   TITLE
            | TITLE_PAGE SUBTITLE
            | TITLE_PAGE SUBHEADER
            | TITLE_PAGE IGNORED_LINE
            | TITLE_PAGE EMPTY_LINE
            | TITLE_PAGE COMMENT_LINE
            | TITLE_PAGE NOTICE_LINE
            | TITLE_PAGE PAGEBREAK

BLOCK:        BLOCKHEADER
            | BLOCKHEADER INDEX_TAB
            | BLOCK CHAR_ENTRY
            | BLOCK SUBHEADER
            | BLOCK NOTICE_LINE
            | BLOCK EMPTY_LINE
            | BLOCK IGNORED_LINE
            | BLOCK SIDEBAR_LINE
            | BLOCK PAGEBREAK

CHAR_ENTRY:   NAME_LINE
            | RESERVED_LINE
            | CHAR_ENTRY ALIAS_LINE
            | CHAR_ENTRY FORMALALIAS_LINE
            | CHAR_ENTRY COMMENT_LINE
            | CHAR_ENTRY CROSS_REF
            | CHAR_ENTRY DECOMPOSITION
            | CHAR_ENTRY COMPAT_MAPPING
            | CHAR_ENTRY IGNORED_LINE
            | CHAR_ENTRY EMPTY_LINE
            | CHAR_ENTRY NOTICE_LINE
In other words:
Neither TITLE nor SUBTITLE may occur after the first BLOCKHEADER.
Only TITLE, SUBTITLE, SUBHEADER, PAGEBREAK, COMMENT_LINE, NOTICE_LINE, EMPTY_LINE and IGNORED_LINE may occur before the first BLOCKHEADER.
Directly following either a NAME_LINE or a RESERVED_LINE an uninterrupted sequence of the following lines may occur (in any order and repeated as often as needed): ALIAS_LINE, CROSS_REF, DECOMPOSITION, COMPAT_MAPPING, FORMALALIAS_LINE, NOTICE_LINE, EMPTY_LINE and IGNORED_LINE.
Except for EMPTY_LINE, NOTICE_LINE, SIDEBAR_LINE and IGNORED_LINE, none of these lines may occur in any other place.
A PAGEBREAK may appear anywhere except in the middle of a CHAR_ENTRY. A PAGEBREAK before the file title lines may not be supported. An INDEX_TAB may appear after any block header.
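The ordering rules above can be checked mechanically once each line has been labeled with its element name. The sketch below is one possible reading of those rules, not a reference validator. The function `check_order` and its message strings are hypothetical. It assumes the caller has already dropped FILE_COMMENT lines, and it treats COMMENT_LINE as belonging to a character entry after the first BLOCKHEADER, following the CHAR_ENTRY productions in the grammar.

```python
# Element kinds allowed before the first BLOCKHEADER.
TITLE_PAGE_KINDS = {
    "TITLE", "SUBTITLE", "SUBHEADER", "PAGEBREAK",
    "COMMENT_LINE", "NOTICE_LINE", "EMPTY_LINE", "IGNORED_LINE",
}
# Kinds that open a character entry.
ENTRY_STARTERS = {"NAME_LINE", "RESERVED_LINE"}
# Kinds that may only continue a character entry (COMMENT_LINE is an
# assumption taken from the grammar's CHAR_ENTRY productions).
ENTRY_ONLY = {
    "ALIAS_LINE", "CROSS_REF", "DECOMPOSITION",
    "COMPAT_MAPPING", "FORMALALIAS_LINE", "COMMENT_LINE",
}
# Kinds that may appear inside an entry without ending it.
ENTRY_NEUTRAL = {"NOTICE_LINE", "EMPTY_LINE", "IGNORED_LINE"}


def check_order(kinds):
    """Return (line_number, message) pairs for ordering violations."""
    problems = []
    seen_block = False
    in_entry = False
    for number, kind in enumerate(kinds, start=1):
        if kind == "BLOCKHEADER":
            seen_block, in_entry = True, False
        elif not seen_block:
            if kind not in TITLE_PAGE_KINDS:
                problems.append((number, f"{kind} before the first BLOCKHEADER"))
        elif kind in ("TITLE", "SUBTITLE"):
            problems.append((number, f"{kind} after the first BLOCKHEADER"))
            in_entry = False
        elif kind in ENTRY_STARTERS:
            in_entry = True
        elif kind in ENTRY_ONLY:
            if not in_entry:
                problems.append((number, f"{kind} outside a character entry"))
        elif kind not in ENTRY_NEUTRAL:
            # SUBHEADER, SIDEBAR_LINE, PAGEBREAK, INDEX_TAB end the entry.
            in_entry = False
    return problems
```

With this reading, a PAGEBREAK in the middle of an entry is reported indirectly: it ends the entry, so the continuation line that follows it is flagged as lying outside a character entry.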
This section provides the details of the syntax for the individual elements.
ELEMENT             SYNTAX    // How rendered

NAME_LINE:          CHAR <tab> NAME LF
                        // The CHAR and the corresponding image are echoed,
                        // followed by the name as given in NAME
                    CHAR <tab> "<" LCNAME ">" LF
                        // Control and non-characters use this form of
                        // lower case, bracketed pseudo character name
                    CHAR <tab> NAME SP COMMENT LF
                        // Names may have a comment, which is stripped off
                        // unless the file is parsed for an ISO style list

RESERVED_LINE:      CHAR <tab> <reserved>
                        // The CHAR is echoed followed by an icon for the
                        // reserved character and a fixed string e.g. <reserved>

COMMENT_LINE:       <tab> "*" SP EXPAND_LINE
                        // * is replaced by BULLET, output line as comment
                    <tab> EXPAND_LINE
                        // Output line as comment

ALIAS_LINE:         <tab> "=" SP LINE
                        // Replace = by itself, output line as alias

FORMALALIAS_LINE:   <tab> "%" SP LINE
                        // Replace % by U+203B, output line as formal alias

CROSS_REF:          <tab> "X" SP CHAR SP LCNAME
                        // X is replaced by a right arrow
                    <tab> "X" SP "(" LCNAME SP "-" SP CHAR ")"
                        // X is replaced by a right arrow,
                        // the "(", "-", ")" are removed, and the
                        // order of CHAR and LCNAME is reversed;
                        // i.e. both inputs result in the same output

FILE_COMMENT:       ";" LINE
EMPTY_LINE:         LF
                        // Empty and ignored lines as well as
                        // file comments are ignored

SIDEBAR_LINE:       ";;" LINE
                        // Skip ';;' characters, output line
                        // as marginal note

IGNORED_LINE:       <tab> ";" EXPAND_LINE
                        // Skip ';' character, ignore text

DECOMPOSITION:      <tab> ":" SP EXPAND_LINE
                        // Replace ':' by EQUIV, expand line into
                        // decomposition

COMPAT_MAPPING:     <tab> "#" SP EXPAND_LINE
COMPAT_MAPPING:     <tab> "#" SP "<" LCTAG ">" SP EXPAND_LINE
                        // Replace '#' by APPROX, output line as mapping;
                        // check the <tag> for balanced < >

NOTICE_LINE:        "@+" <tab> LINE
                        // Skip '@+', output text as notice
                    "@+" <tab> "*" SP LINE
                        // Skip '@+', output text as notice
                        // "*" expands to a bullet character
                        // Notices following a character code apply to the
                        // character and are indented. Notices not following
                        // a character code apply to the page/block/column
                        // and are italicized, but not indented

SUBTITLE:           "@@@+" <tab> LINE
                        // Skip "@@@+", output text as subtitle

SUBHEADER:          "@" <tab> LINE
                        // Skip '@', output line as column header

BLOCKHEADER:        "@@" <tab> BLOCKSTART <tab> BLOCKNAME <tab> BLOCKEND
                        // Skip "@@", cause a page break and optional
                        // blank page, then output one or more charts
                        // followed by the list of character names.
                        // Use BLOCKSTART and BLOCKEND to define
                        // what characters belong to a block.
                        // Use blockname in page and table headers
                    "@@" <tab> BLOCKSTART <tab> BLOCKNAME COMMENT <tab> BLOCKEND
                        // If a comment is present it replaces the blockname
                        // when an ISO-style namelist is laid out

BLOCKNAME:          LABEL
                    LABEL SP "(" LABEL ")"
                        // If an alternate label is present it replaces
                        // the blockname when an ISO-style namelist is
                        // laid out; it is ignored in the Unicode charts

BLOCKSTART:         CHAR    // First character position in block
BLOCKEND:           CHAR    // Last character position in block

PAGEBREAK:          "@@"    // Insert a (column) break

INDEX_TAB:          "@@+"   // Start a new index tab at latest BLOCKSTART

TITLE:              "@@@" <tab> LINE
                        // Skip "@@@", output line as text
                        // Title is used in page headers

EXPAND_LINE:        {CHAR | STRING}+ LF
                        // All instances of CHAR *) are replaced by
                        // CHAR NBSP x NBSP where x is the single Unicode
                        // character corresponding to CHAR.
                        // If character is combining, it is replaced with
                        // CHAR NBSP <circ> x NBSP where <circ> is the
                        // dotted circle

Notes:
The following are the primitives and terminals for the NamesList syntax.
LINE:       STRING LF

COMMENT:    "(" LABEL ")"
            "(" LABEL ")" "*"

NAME:       <sequence of uppercase ASCII letters, digits, space and hyphen>

LCNAME:     <sequence of lowercase ASCII letters, digits, space and hyphen>
            LCNAME "-" CHAR

LCTAG:      <sequence of lowercase ASCII letters>

STRING:     <sequence of Latin-1 characters, except space and controls>

LABEL:      <sequence of Latin-1 characters, except controls, "(" or ")">

CHAR:       X X X X
          | X X X X X
          | X X X X X X

X:          "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"|"A"|"B"|"C"|"D"|"E"|"F"

<tab>:      <sequence of one or more ASCII tab characters 0x09>

SP:         <ASCII 20>

LF:         <any sequence of ASCII 0A and 0D>
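Taken together, the element syntax and the primitives make it possible to label a line by its leading characters alone. The following is a minimal sketch of such a classifier, not the logic used by the layout program. The function name `classify` and the `UNKNOWN` label are hypothetical. Accepting a lower-case "x" for CROSS_REF in addition to the "X" written in the syntax table is an assumption made for this example.

```python
import re

# CHAR <tab> ... where CHAR is 4 to 6 upper-case hexadecimal digits.
CHAR_TAB = re.compile(r"^([0-9A-F]{4,6})\t+(.*)$")

# Prefixes for lines that start in the first column. Longer prefixes are
# listed before their shorter counterparts so they are matched first.
TOP_LEVEL_PREFIXES = [
    ("@@@+", "SUBTITLE"),
    ("@@@", "TITLE"),
    ("@@+", "INDEX_TAB"),
    ("@@\t", "BLOCKHEADER"),
    ("@@", "PAGEBREAK"),
    ("@+", "NOTICE_LINE"),
    ("@\t", "SUBHEADER"),
    (";;", "SIDEBAR_LINE"),
    (";", "FILE_COMMENT"),
]

# Prefixes for lines that start with <tab>, checked after the tabs.
INDENTED_PREFIXES = [
    (";", "IGNORED_LINE"),
    ("* ", "COMMENT_LINE"),
    ("= ", "ALIAS_LINE"),
    ("% ", "FORMALALIAS_LINE"),
    ("X ", "CROSS_REF"),
    ("x ", "CROSS_REF"),  # assumption: lower-case marker also accepted
    (": ", "DECOMPOSITION"),
    ("# ", "COMPAT_MAPPING"),
]


def classify(line):
    """Return the element name for one line of a name list."""
    line = line.rstrip("\r\n")
    if line == "":
        return "EMPTY_LINE"
    for prefix, kind in TOP_LEVEL_PREFIXES:
        if line.startswith(prefix):
            return kind
    if line.startswith("\t"):
        body = line.lstrip("\t")  # <tab> is one or more tab characters
        for prefix, kind in INDENTED_PREFIXES:
            if body.startswith(prefix):
                return kind
        return "COMMENT_LINE"  # second COMMENT_LINE form: <tab> EXPAND_LINE
    match = CHAR_TAB.match(line)
    if match:
        return "RESERVED_LINE" if match.group(2) == "<reserved>" else "NAME_LINE"
    return "UNKNOWN"
```

The classifier only inspects prefixes. It does not verify that the remainder of a line matches NAME, LCNAME, or EXPAND_LINE, so a stricter tool would add those checks per element.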
Notes:
Version 5.1.0
Version 5.0.0
Version 4.0.0
Version 3.2.0
Version 3.1.0 (2)
The Unicode Character Database is provided as is by Unicode, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. If this file has been purchased on magnetic or optical media from Unicode, Inc., the sole remedy for any claim will be exchange of defective media within 90 days of receipt.
This disclaimer is applicable for all other data files accompanying the Unicode Character Database, some of which have been compiled by the Unicode Consortium, and some of which have been supplied by other sources.
Recipient is granted the right to make copies in any form for internal distribution and to freely use the information supplied in the creation of products supporting the Unicode™ Standard. The files in the Unicode Character Database can be redistributed to third parties or other organizations (whether for profit or not) as long as this notice and the disclaimer notice are retained. Information can be extracted from these files and used in documentation or programs, as long as there is an accompanying notice indicating the source.