Tuesday, 5 June 2007

Unicode and ISO/IEC 10646

The thing about blogging is that you never quite know who is reading your blog, and you certainly don't know whether any readers you do have find what you write of any great interest or not (other than the history and rules of long s, which are my most googled posts), so when a few days ago I got an email suggesting that I write a post about current Unicode developments (my first and only blog request) I was very pleased. As I'm going to be away for two or three months, with only intermittent internet access, I won't be doing much blogging over the summer, so I think now would be a good time to take a break from the detailed script stuff that I have been engrossed in recently, and end with a six-part series on Unicode current affairs.

In order to understand how the Unicode character repertoire grows it is necessary to have at least a basic understanding of the relationship between the Unicode Standard and the corresponding international standard ISO/IEC 10646 and the work of the respective committees. However, very few people who are not actively involved in Unicode or 10646 ("ten-six-four-six") have any real appreciation of the complex relationship between these standards, and so I thought that it would be useful to start off by discussing the process by which characters get added to Unicode and 10646.



ISO/IEC 10646

ISO/IEC 10646 "Information technology -- Universal Multiple-Octet Coded Character Set (UCS)" is an international standard that defines a character set that is exactly equivalent to the Unicode repertoire. At the start of their lives Unicode and 10646 were separate and incompatible attempts to solve the same problem, but in 1991, after much blood, sweat and tears, an agreement was reached to merge the two standards (see Chronology of Unicode Version 1.0 for a brief overview of the years leading up to the Unicode/10646 merger), and since then Unicode and 10646 have co-existed in a symbiotic relationship.

Until recently the 10646 standard was only available if you bought it from ISO, but since September last year it has been freely available for download from the ISO website :

  • ISO/IEC 10646:2003 [Version française] (82MB) [no longer available]
  • Amendment 1 (2005) [Version française] [no longer available]
  • Amendment 2 (2006) [Version française] [no longer available]
  • Amendment 3 (2008) [no longer available]
  • Amendment 4 (2008) [no longer available]
  • Amendment 5 (2008) [no longer available]
  • Amendment 6 (2009) [no longer available]
  • Amendment 7 (2010) [no longer available]
  • ISO/IEC 10646:2011 (75MB) [electronic inserts] [no longer available]
  • ISO/IEC 10646:2012 (128MB) [electronic inserts]
  • Amendment 1 (2013) [electronic inserts]

ISO/IEC 10646 is referenced as a particular published edition, with amendments as appropriate. 10646 was originally published in two parts: Part 1 covering the BMP (characters in the range U+0000..U+FFFF) was published as 10646-1 in 1993, with a second edition in 2000; whereas Part 2 covering the supplementary planes (characters in the range U+10000..U+10FFFF) was published as 10646-2 in 2001. 10646-1 and 10646-2 were superseded by the first combined edition of 10646, published in 2003. A second combined edition (incorporating ISO/IEC 10646:2003 Amds. 1-8) was published in 2011, and a third edition that fixed a defect with CJK code charts in the 2nd edition (no multi-column charts for CJK-B because of font problems) was published in 2012.
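
To make the Part 1 / Part 2 split concrete, here is a minimal Python sketch that works out which plane a code point falls in, and therefore which part of the old two-part standard would have covered it. The function names are my own invention for illustration, and it relies only on the ranges given above (each plane holds 0x10000 code points).

```python
MAX_CODE_POINT = 0x10FFFF  # last code point of the last supplementary plane


def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) that a code point belongs to."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a valid code point: {code_point:#x}")
    return code_point >> 16  # each plane holds 0x10000 code points


def original_part(code_point: int) -> str:
    """Name the part of the old two-part standard that covered a code point."""
    if plane_of(code_point) == 0:
        return "10646-1 (BMP)"
    return "10646-2 (supplementary planes)"


for cp in (0x0041, 0xFFFF, 0x10000, 0x10FFFF):
    print(f"U+{cp:04X}: plane {plane_of(cp)}, {original_part(cp)}")
```

Since the 2003 combined edition the distinction no longer matters for referencing the standard, but the plane arithmetic is unchanged.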

  • ISO/IEC 10646-1:1993 [1st edition]
    • Amd. 1 (1996) : Transformation Format for 16 planes of group 00 (UTF-16)
    • Amd. 2 (1996) : UCS Transformation Format 8 (UTF-8)
    • Amd. 3 (1996) : Code positions for control characters
    • Amd. 4 (1996) : Removal of annex G (UTF-1)
    • Amd. 5 (1998) : Hangul syllables
    • Amd. 6 (1997) : Tibetan
    • Amd. 7 (1997) : 33 additional characters
    • Amd. 8 (1997) : New annex on CJK Ideographs
    • Amd. 9 (1997) : Identifiers for characters
    • Amd. 10 (1998) : Ethiopic
    • Amd. 11 (1998) : Unified Canadian Aboriginal Syllabics
    • Amd. 12 (1998) : Cherokee
    • Amd. 13 (1998) : CJK unified ideographs with supplementary sources
    • Amd. 14 (1999) : Yi syllables and Yi radicals
    • Amd. 15 (2000) : Kang Xi radicals and CJK radicals supplement
    • Amd. 16 (1998) : Braille patterns
    • Amd. 17 (1999) : CJK Unified Ideographs Extension A
    • Amd. 18 (1999) : Symbols and other characters
    • Amd. 19 (1998) : Runic
    • Amd. 20 (1998) : Ogham
    • Amd. 21 (1999) : Sinhala
    • Amd. 22 (1999) : Keyboard symbols
    • Amd. 23 (1999) : Bopomofo Extended and other characters
    • Amd. 24 (1999) : Thaana
    • Amd. 25 (1999) : Khmer
    • Amd. 26 (1999) : Myanmar
    • Amd. 27 (1999) : Syriac
    • Amd. 28 (2000) : Ideographic description characters
    • Amd. 29 (1999) : Mongolian
    • Amd. 30 (1999) : Additional Latin and other characters
    • Amd. 31 (1999) : Tibetan extension
  • ISO/IEC 10646-1:2000 (Part 1: Architecture and Basic Multilingual Plane) [2nd edition]
    • Amd. 1 (2002) : Mathematical symbols and other characters
  • ISO/IEC 10646-2:2001 (Part 2: Supplementary Planes)
  • ISO/IEC 10646:2003 [1st edition, 2003-12-15]
    • Amd. 1 (2005) : Glagolitic, Coptic, Georgian and other characters
    • Amd. 2 (2006) : N'Ko, Phags-pa, Phoenician and other characters
    • Amd. 3 (2008) : Lepcha, Ol Chiki, Saurashtra, Vai, and other characters
    • Amd. 4 (2008) : Cham, Game Tiles, and other characters
    • Amd. 5 (2008) : Tai Tham, Tai Viet, Avestan, Egyptian Hieroglyphs, CJK Unified Ideographs Extension C, and other characters
    • Amd. 6 (2009) : Bamum, Javanese, Lisu, Meetei Mayek, Samaritan, and other characters
    • Amd. 7 (2010) : Mandaic, Batak, Brahmi, and other characters
    • Amd. 8 (2011) : Additional symbols, Bamum supplement, CJK Unified Ideographs Extension D, and other characters (NB Amd. 8 was never published as it was incorporated into the 2nd edition)
  • ISO/IEC 10646:2011 [2nd edition, 2011-03-15]
  • ISO/IEC 10646:2012 [3rd edition, 2012-06-01]
    • Amd. 1 (2013) : Linear A, Palmyrene, Manichaean, Khojki, Khudawadi, Bassa Vah, Duployan, and other characters
    • Amd. 2 (2014) : Caucasian Albanian, Psalter Pahlavi, Mahajani, Grantha, Modi, Pahawh Hmong, Mende Kikakui, and other characters
  • ISO/IEC 10646:2014 [4th edition, 2014-09-01]
    • Amd. 1 (2014) : Cherokee supplement and other characters
    • Amd. 2 (2015) : Bhaiksuki, Marchen, Tangut, Zanabazar Square, and other characters

As can be seen, the amendments to the first two-part edition (1993) were done piecemeal, with one script per amendment. Since the publication of the first combined edition (2003) an amendment covers a basket of different scripts and character additions which often corresponds to the additions to a particular version of Unicode.

10646 is maintained by Subcommittee 2 (SC2) "Coded character sets" of Joint Technical Committee 1 (JTC1) of ISO (International Organization for Standardization -- contrary to popular belief ISO is not an acronym but a word derived from the Greek ισος "equal") and IEC (International Electrotechnical Commission). The membership of SC2 comprises standards organisations of various countries (e.g. ANSI for USA and BSI for UK). SC2 currently has 29 P-members (participating members who have voting rights), 20 O-members (observing members who do not have voting rights) and a number of Liaison members such as TCA (Taipei Computer Association, representing Taiwan), the Script Encoding Initiative and the Unicode Consortium.

The actual technical work on 10646 is done by SC2's Working Group 2 (WG2), which normally meets for five days twice a year (spring and autumn) at various exotic locations throughout the world (the meetings tend to alternate between North America, Europe and the Far East). WG2 comprises experts representing SC2's member national bodies and liaison members, although not all of the SC2 P-members regularly send experts to WG2 meetings -- currently Canada, China, Ireland, Japan, Republic of Korea, TCA, Unicode, UK and USA are the most active members of WG2. Since 2011, due to a change in the ballot process, WG2 meets once every 9 months.



Unicode

The Unicode Standard (often abbreviated as TUS) is produced by an American-based consortium whose membership mostly comprises large US corporations such as Adobe, Apple, Google, IBM, Microsoft, Sun and Yahoo.

Major versions of the Unicode standard were published in printed book form up to version 5.0, but since version 5.2 the core specification has been published as print-on-demand, with the full specification available in PDF form on-line. The data files, character charts and PDF files for the latest version are all available on-line from the Unicode web site.

Unicode Versions

Version | Date | Book | Corresponding ISO/IEC 10646 Edition
1.0.0 | October 1991 | ISBN 0-201-56788-1 (Vol.1) |
1.0.1 | June 1992 | ISBN 0-201-60845-6 (Vol.2) |
1.1 | June 1993 | | ISO/IEC 10646-1:1993
2.0 | July 1996 | ISBN 0-201-48345-9 | ISO/IEC 10646-1:1993 plus Amds. 5, 6 and 7
2.1 | May 1998 | | ISO/IEC 10646-1:1993 plus Amds. 5, 6 and 7, and two characters from Amd. 18 (Euro Sign and Object Replacement Character)
3.0 | September 1999 | ISBN 0-201-61633-5 | ISO/IEC 10646-1:2000
3.1 | March 2001 | | ISO/IEC 10646-1:2000; ISO/IEC 10646-2:2001
3.2 | March 2002 | | ISO/IEC 10646-1:2000 plus Amd. 1; ISO/IEC 10646-2:2001
4.0 | April 2003 | ISBN 0-321-18578-1 | ISO/IEC 10646:2003
4.1 | March 2005 | | ISO/IEC 10646:2003 plus Amd. 1
5.0 | July 2006 | ISBN 0-321-48091-0 | ISO/IEC 10646:2003 plus Amds. 1 and 2, and four characters from Amd. 3 (Devanagari Letters GGA, JJA, DDDA and BBA)
5.1 | April 2008 | | ISO/IEC 10646:2003 plus Amds. 1, 2, 3 and 4
5.2 | October 2009 | ISBN 978-1-936213-00-9 | ISO/IEC 10646:2003 plus Amds. 1, 2, 3, 4, 5 and 6
6.0 | October 2010 | ISBN 978-1-936213-01-6 | ISO/IEC 10646:2011 (equivalent to ISO/IEC 10646:2003 plus Amds. 1 through 8) and one character not yet in ISO/IEC 10646 (Indian Rupee Sign)
6.1 | January 2012 | ISBN 978-1-936213-02-3 | ISO/IEC 10646:2012
6.2 | September 2012 | ISBN 978-1-936213-07-8 | ISO/IEC 10646:2012 and one character not yet in ISO/IEC 10646 (Turkish Lira Sign, included in ISO/IEC 10646:2012 Amd. 1)
6.3 | September 2013 | ISBN 978-1-936213-08-5 | ISO/IEC 10646:2012 and five characters not yet in ISO/IEC 10646 (Arabic Letter Mark, Left-To-Right Isolate, Right-To-Left Isolate, First Strong Isolate, Pop Directional Isolate, to be included in ISO/IEC 10646:2012 Amd. 2)
7.0 | June 2014 | ISBN 978-1-936213-09-2 | ISO/IEC 10646:2012 plus Amds. 1 and 2, and one character not yet in ISO/IEC 10646 (Ruble sign, to be included in ISO/IEC 10646:2014)
8.0 | June 2015 | | ISO/IEC 10646:2014 plus Amd. 1

Versions of Unicode are normally synchronised to a particular edition (plus amendments) of ISO/IEC 10646, although Unicode versions 2.1, 5.0, 6.0, 6.2, 6.3 and 7.0 contained one or a few characters from a 10646 edition or amendment that had not yet been published.
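
If you need this correspondence in a program, the table can be held as a simple lookup. The following is only a sketch: the `Sync` record and `ran_ahead` helper are hypothetical names of my own, and I have only transcribed the rows from Unicode 4.0 onwards (the combined editions), so versions such as 2.1 are deliberately absent.

```python
from typing import NamedTuple


class Sync(NamedTuple):
    edition: str            # 10646 edition the Unicode version is based on
    amendments: tuple       # amendments to that edition that are included
    extra_characters: int   # characters not yet published in 10646


# Transcribed from the Unicode Versions table (rows 4.0 to 8.0 only).
SYNC = {
    "4.0": Sync("ISO/IEC 10646:2003", (), 0),
    "4.1": Sync("ISO/IEC 10646:2003", (1,), 0),
    "5.0": Sync("ISO/IEC 10646:2003", (1, 2), 4),
    "5.1": Sync("ISO/IEC 10646:2003", (1, 2, 3, 4), 0),
    "5.2": Sync("ISO/IEC 10646:2003", (1, 2, 3, 4, 5, 6), 0),
    "6.0": Sync("ISO/IEC 10646:2011", (), 1),
    "6.1": Sync("ISO/IEC 10646:2012", (), 0),
    "6.2": Sync("ISO/IEC 10646:2012", (), 1),
    "6.3": Sync("ISO/IEC 10646:2012", (), 5),
    "7.0": Sync("ISO/IEC 10646:2012", (1, 2), 1),
    "8.0": Sync("ISO/IEC 10646:2014", (1,), 0),
}


def ran_ahead(table: dict) -> list:
    """List the Unicode versions that included characters ahead of 10646."""
    return [version for version, sync in table.items() if sync.extra_characters > 0]


print(ran_ahead(SYNC))
```

For the transcribed rows this picks out the same versions from 5.0 onwards that are named in the paragraph above.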

The technical committee that makes decisions about character additions to the Unicode Standard is the Unicode Technical Committee (UTC), which meets for about 4 or 5 days every three months (always somewhere on the West Coast). Thus some lucky members of the UTC who are also participants of WG2 get to spend up to six weeks of every year in Unicode/10646 committee meetings.



The Ballot Process

For a new character or set of characters to be accepted into either Unicode or 10646 it has to be accepted by both Unicode and 10646. Thus proposals for character additions go before both the UTC and WG2 (it does not matter much which of these two committees sees the proposal document first). The UTC process is very straightforward: a proposal is reviewed at one of the quarterly UTC meetings, and is accepted, rejected or otherwise dealt with (for example, returned to the submitter for changes or with a request for additional information). If a proposal is accepted by the UTC it is then forwarded to WG2 for review. In theory the UTC and WG2 could be unable to reach a common decision on a particular proposal, and the process would become deadlocked, but this has never yet happened, largely because the members of the two committees are reasonable people, but also because there is considerable overlap in the composition of WG2 and the UTC, so that neither committee will do something that would be unacceptable to the other committee.

A proposal submitted to WG2 will be reviewed at one of its biannual meetings. If it is the first time that a major proposal (e.g. for a new script) has been seen by the committee, it may be simply noted and national bodies requested to review it and provide feedback for the next meeting. However, if the proposal is not controversial and/or has already been accepted by Unicode it may be accepted for inclusion in an amendment straight away. Draft editions and amendments go through a series of ballots, each lasting several months :


Stage | Ballot | Scope | Ballot Length | Notes
Committee | CD (Committee Draft) or PDAM (Proposed Draft Amendment) | SC2 | 3 months | May be 2, 3 or 4 months, but is always 3 months in SC2.
Enquiry | DIS (Draft International Standard) or DAM (Draft Amendment) | JTC1 | 3 months | Prior to 2011 there was a 4-month FCD or FPDAM ballot at SC2 level instead of the DIS/DAM ballot at JTC1 level.
Approval | FDIS (Final Draft International Standard) or FDAM (Final Draft Amendment) | JTC1 | 2 months | This ballot may be skipped under certain circumstances.

The first two ballots are technical ballots in which P-members of SC2 or JTC1 may request technical changes to the edition or amendment in the form of ballot comments accompanying a positive or negative vote. Typically a negative vote will be accompanied by proposed changes which if accepted will change the negative vote to a positive vote. Ballot comments are disposed of at the next WG2 meeting, and where possible changes are made so that negative votes can be eliminated. Occasionally disagreements cannot be resolved, and an amendment is carried by a majority vote (e.g. ISO/IEC 10646:2003 FPDAM2 was carried despite unchanged no votes by Canada and Germany in relation to N'Ko and Phoenician respectively). It is unheard of (within SC2 at least) for an amendment to be voted down as it would not get to the ballot stage unless there was a consensus in support of the amendment (consensus is the key word with WG2).

New submissions that have been favourably reviewed by WG2 normally go into a CD/PDAM ballot together with a basket of other proposed character additions, but sometimes a submission may get added directly to an (FP)DAM ballot if it is urgent (as was the case recently with U+9FC3, which was added directly to ISO/IEC 10646:2003 FPDAM4). At the CD/PDAM ballot stage the repertoire of the amendment is still very much in flux, and ballot comments may request that characters be added, deleted, moved or otherwise amended. If the PDAM ballot results indicate that there is controversy over a particular set of characters they may be removed entirely or else moved back to the following amendment (as was the case with Phags-pa which was moved back from ISO/IEC 10646:2003 PDAM1 to PDAM2, and CJK-C which was moved back from ISO/IEC 10646:2003 PDAM4 to PDAM5, and most recently Tangut which was moved back from ISO/IEC 10646:2003 PDAM6 to PDAM7 and then removed for addition to a future amendment). And if major changes are made to an amendment after the PDAM ballot, it may even be resubmitted for a second PDAM ballot (as was the case with ISO/IEC 10646:2003 PDAM3 and PDAM6). Technical changes may also be requested in the DIS/DAM ballot comments made by JTC1 P-members (a wider membership than SC2), but as the major problems should already have been sorted out at the first ballot, DIS/DAM ballot comments are generally fewer and of less significance.

Once an amendment has passed the DIS/DAM ballot no further technical changes may be made, and the amendment is submitted for a final FDIS/FDAM ballot by JTC1 members, which is largely a formality as it is a simple Yes/No vote, and no comments requesting changes to the amendment are allowed. When the FDIS/FDAM ballot is approved the amendment will be published within two months, and implementations of 10646 (e.g. GB-18030) can use the new characters. However, it is not until a new major version of Unicode which includes the new characters is officially released that most software and fonts will start (usually very slowly) to support them.
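
The three-stage sequence described above can be summarised as data. This is just a sketch using the ballot lengths from the table (the post-2011 process): the `Ballot` record and the helper functions are names I have made up, and the total counts ballot time only, ignoring the gaps between ballots while comments are dealt with at WG2 meetings and any repeated PDAM ballot.

```python
from typing import NamedTuple


class Ballot(NamedTuple):
    stage: str    # ISO stage name
    drafts: str   # what is balloted (edition / amendment)
    scope: str    # which body's P-members vote
    months: int   # ballot length, as given in the table


BALLOTS = [
    Ballot("Committee", "CD/PDAM", "SC2", 3),
    Ballot("Enquiry", "DIS/DAM", "JTC1", 3),
    Ballot("Approval", "FDIS/FDAM", "JTC1", 2),
]


def allows_technical_changes(ballot: Ballot) -> bool:
    """Only the first two ballots accept comments requesting technical changes."""
    return ballot.stage != "Approval"


def total_ballot_months(skip_approval: bool = False) -> int:
    """Sum the ballot lengths, optionally skipping the final Yes/No ballot."""
    stages = [b for b in BALLOTS if not (skip_approval and b.stage == "Approval")]
    return sum(b.months for b in stages)


print(total_ballot_months())                     # all three ballots
print(total_ballot_months(skip_approval=True))   # approval ballot skipped
```

Even in the best case an amendment therefore spends the better part of a year under ballot before it can be published.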

Because it may take a long time to complete the actual publication of a standard or an amendment, the corresponding Unicode version is often released several months earlier than the 10646 publication date. For example, Unicode 4.0 (April 2003) was released a full year before the corresponding ISO/IEC 10646:2003 was published (April 2004), and Unicode 4.1 (31st March 2005) was released seven and a half months before the corresponding Amd. 1 was published (2005-11-15). On the other hand, Unicode 5.0 (14th July 2006) was not released until two weeks after the corresponding Amd. 2 was published (2006-07-01).
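
The lead and lag figures quoted above are easy to check with a little date arithmetic. The sketch below uses only the exact dates given in this paragraph (Unicode 4.0 is left out because I have only quoted months for it), and the `lead_in_days` helper is a name of my own.

```python
from datetime import date


def lead_in_days(unicode_release: date, iso_publication: date) -> int:
    """Days by which the Unicode release preceded the 10646 publication.

    A negative result means that Unicode was released after 10646.
    """
    return (iso_publication - unicode_release).days


# Unicode 4.1 versus ISO/IEC 10646:2003 Amd. 1
print(lead_in_days(date(2005, 3, 31), date(2005, 11, 15)))

# Unicode 5.0 versus ISO/IEC 10646:2003 Amd. 2
print(lead_in_days(date(2006, 7, 14), date(2006, 7, 1)))
```

The first figure comes out at 229 days (roughly seven and a half months), and the second at minus 13 days, i.e. Unicode 5.0 trailed Amd. 2 by about two weeks.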



What's under Ballot ?

[This section is obviously completely out of date now, but I am not going to update it as what is currently under ballot changes every six months or so]


So now it should make some sense if I say that there are three amendments to ISO/IEC 10646:2003 currently [at the time that I originally wrote this post] going through the ballot process (list of open SC2 ballots) :

  • Amd. 3 (Lepcha, Ol Chiki, Saurashtra, Vai, and other characters) : submitted for FDAM ballot
  • Amd. 4 (Lanna, Cham, Game Tiles, and other characters) : under FPDAM ballot (due 2007-09-11)
  • Amd. 5 (Meitei Mayek, Bamum, Tai Viet, Avestan, Egyptian Hieroglyphs, CJK Unified Ideographs Extension C, and other characters) : under PDAM ballot (due 2007-09-10)

Amd. 3 has now completed its technical ballots, and its character repertoire is stable. Amd. 4 is on its final technical ballot, and there are unlikely to be major changes to its character repertoire. Amd. 5 is on its first technical ballot, and there will almost certainly be changes to its character repertoire.

Because people are still trying to come to terms with Unicode 5.0, a new version of Unicode corresponding to Amd. 3 will not be released, but instead Unicode will synchronise its repertoire with ISO/IEC 10646 on Amd. 4 with Unicode 5.1. That is to say, the character additions for Unicode 5.1 will correspond to Amds. 3 and 4 (less the four Devanagari letters for Sindhi from Amd. 3 that are already in Unicode 5.0). In the next post I will look at the contents of Unicode 5.1 in more detail [if you want more up-to-date information please read What's new in Unicode 5.2 ?].



[Last updated : 2014-09-10]


3 comments:

Yan Han said...

Andrew,

I scan your post on unicode v.s. ISO/IEC 10646. This is an excellent post to clear out the confusion between the two standards. The Unicode standard book (published by the Unicode Consortium) is another excellent source to find more info.

Yan

jedi787plus said...

This post is outdated. Amendments 4 & 5 are complete, and Amd 6 has just started.

Andrew West said...

This post is outdated.

Which is the nature of blogging. Any blog post is necessarily a reflection of when it was written.

I actually revise and update my blog posts regularly (see list of recently updated posts at the bottom of the page), but in this case I made a conscious decision not to update the final What's under ballot ? section as it would destroy the consequential linkage between posts in this linked series of posts.