The thing about blogging is that you never quite know who is reading your blog, or whether the readers you do have find what you write of any great interest (other than the history and rules of long s, which are my most googled posts). So when I got an email a few days ago suggesting that I write a post about current Unicode developments (my first and only blog request), I was very pleased. As I'm going to be away for two or three months, with only intermittent internet access, I won't be doing much blogging over the summer, so now seems a good time to take a break from the detailed script stuff that I have been engrossed in recently, and end with a six-part series on Unicode current affairs.
In order to understand how the Unicode character repertoire grows it is necessary to have at least a basic understanding of the relationship between the Unicode Standard and the corresponding international standard ISO/IEC 10646 and the work of the respective committees. However, very few people who are not actively involved in Unicode or 10646 ("ten-six-four-six") have any real appreciation of the complex relationship between these standards, and so I thought that it would be useful to start off by discussing the process by which characters get added to Unicode and 10646.
ISO/IEC 10646 "Information technology -- Universal Multiple-Octet Coded Character Set (UCS)" is an international standard that defines a character set that is exactly equivalent to the Unicode repertoire. At the start of their lives Unicode and 10646 were separate and incompatible attempts to solve the same problem, but in 1991, after much blood, sweat and tears, an agreement was reached to merge the two standards (see Chronology of Unicode Version 1.0 for a brief overview of the years leading up to the Unicode/10646 merger), and since then Unicode and 10646 have co-existed in a symbiotic relationship.
- ISO/IEC 10646:2003 [Version française] (82MB) [no longer available]
- Amendment 1 (2005) [Version française] [no longer available]
- Amendment 2 (2006) [Version française] [no longer available]
- Amendment 3 (2008) [no longer available]
- Amendment 4 (2008) [no longer available]
- Amendment 5 (2008) [no longer available]
- Amendment 6 (2009) [no longer available]
- Amendment 7 (2010) [no longer available]
- ISO/IEC 10646:2011 (75MB) [electronic inserts] [no longer available]
- ISO/IEC 10646:2012 (128MB) [electronic inserts]
ISO/IEC 10646 is referenced as a particular published edition, with amendments as appropriate. 10646 was originally published in two parts: Part 1 covering the BMP (characters in the range U+0000..U+FFFF) was published as 10646-1 in 1993, with a second edition in 2000; whereas Part 2 covering the supplementary planes (characters in the range U+10000..U+10FFFF) was published as 10646-2 in 2001. 10646-1 and 10646-2 were superseded by the first combined edition of 10646, published in 2003. A second combined edition (incorporating ISO/IEC 10646:2003 Amds. 1-8) was published in 2011, and a third edition that fixed a defect with CJK code charts in the 2nd edition (no multi-column charts for CJK-B because of font problems) was published in 2012.
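The two code point ranges mentioned above can be told apart with a little arithmetic: each plane holds 0x10000 code points, so the plane number is just the code point shifted right by 16 bits. The following is only a rough sketch in Python; the function names `plane_of` and `original_part` are my own inventions for illustration and do not come from either standard.

```python
# Illustrative sketch only: plane_of and original_part are hypothetical helper names.

MAX_CODE_POINT = 0x10FFFF  # last code point of the last supplementary plane


def plane_of(code_point: int) -> int:
    """Return the plane number (0 = BMP, 1..16 = supplementary planes)."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a valid code point: {code_point:#x}")
    # Each plane spans 0x10000 code points, so the plane is the value above bit 16.
    return code_point >> 16


def original_part(code_point: int) -> str:
    """Say which part of the old two-part 10646 covered this code point."""
    # Part 1 covered U+0000..U+FFFF, Part 2 covered U+10000..U+10FFFF.
    return "10646-1" if plane_of(code_point) == 0 else "10646-2"


if __name__ == "__main__":
    for cp in (0x0041, 0xFFFF, 0x10000, 0x10FFFF):
        print(f"U+{cp:04X}: plane {plane_of(cp)}, {original_part(cp)}")
```

Since the first combined edition of 2003 the distinction between the two parts no longer matters for citing the standard, but the BMP/supplementary split is still what decides whether a character needs a surrogate pair in UTF-16.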
- ISO/IEC 10646-1:1993 [1st edition]
- Amd.1 (1996) : Transformation Format for 16 planes of group 00 (UTF-16)
- Amd.2 (1996) : UCS Transformation Format 8 (UTF-8)
- Amd.3 (1996) : Code positions for control characters
- Amd.4 (1996) : Removal of annex G (UTF-1)
- Amd.5 (1998) : Hangul syllables
- Amd.6 (1997) : Tibetan
- Amd.7 (1997) : 33 additional characters
- Amd.8 (1997) : New annex on CJK Ideographs
- Amd.9 (1997) : Identifiers for characters
- Amd.10 (1998) : Ethiopic
- Amd.11 (1998) : Unified Canadian Aboriginal Syllabics
- Amd.12 (1998) : Cherokee
- Amd.13 (1998) : CJK unified ideographs with supplementary sources
- Amd.14 (1999) : Yi syllables and Yi radicals
- Amd.15 (2000) : Kang Xi radicals and CJK radicals supplement
- Amd.16 (1998) : Braille patterns
- Amd.17 (1999) : CJK Unified Ideographs Extension A
- Amd.18 (1999) : Symbols and other characters
- Amd.19 (1998) : Runic
- Amd.20 (1998) : Ogham
- Amd.21 (1999) : Sinhala
- Amd.22 (1999) : Keyboard symbols
- Amd.23 (1999) : Bopomofo Extended and other characters
- Amd.24 (1999) : Thaana
- Amd.25 (1999) : Khmer
- Amd.26 (1999) : Myanmar
- Amd.27 (1999) : Syriac
- Amd.28 (2000) : Ideographic description characters
- Amd.29 (1999) : Mongolian
- Amd.30 (1999) : Additional Latin and other characters
- Amd.31 (1999) : Tibetan extension
- ISO/IEC 10646-1:2000 (Part 1: Architecture and Basic Multilingual Plane) [2nd edition]
- Amd.1 (2002) : Mathematical symbols and other characters
- ISO/IEC 10646-2:2001 (Part 2: Supplementary Planes)
- ISO/IEC 10646:2003 [1st edition, 2003-12-15]
- Amd.1 (2005) : Glagolitic, Coptic, Georgian and other characters
- Amd.2 (2006) : N'Ko, Phags-pa, Phoenician and other characters
- Amd.3 (2008) : Lepcha, Ol Chiki, Saurashtra, Vai, and other characters
- Amd.4 (2008) : Cham, Game Tiles, and other characters
- Amd.5 (2008) : Tai Tham, Tai Viet, Avestan, Egyptian Hieroglyphs, CJK Unified Ideographs Extension C, and other characters
- Amd.6 (2009) : Bamum, Javanese, Lisu, Meetei Mayek, Samaritan, and other characters
- Amd.7 (2010) : Mandaic, Batak, Brahmi, and other characters
- Amd.8 (2011) : Additional symbols, Bamum supplement, CJK Unified Ideographs Extension D, and other characters (NB Amd.8 was never published as it was incorporated into the 2nd edition)
- ISO/IEC 10646:2011 [2nd edition, 2011-03-15]
- ISO/IEC 10646:2012 [3rd edition, 2012-06-01]
- Amd.1 (2013) : Linear A, Palmyrene, Manichaean, Khojki, Khudawadi, Bassa Vah, Duployan, and other characters
- Amd.2 (2014) : Caucasian Albanian, Psalter Pahlavi, Old Hungarian, Mahajani, Grantha, Modi, Pahawh Hmong, Mende, and other characters
As can be seen, the amendments to the first two-part edition (1993) were done piecemeal, with one script per amendment. Since the publication of the first combined edition (2003) an amendment covers a basket of different scripts and character additions which often corresponds to the additions to a particular version of Unicode.
10646 is maintained by Subcommittee 2 (SC2) "Coded character sets" of Joint Technical Committee 1 (JTC1) of ISO (International Organization for Standardization -- contrary to popular belief ISO is not an acronym but a word derived from the Greek ισος "equal") and IEC (International Electrotechnical Commission). The membership of SC2 comprises standards organisations of various countries (e.g. ANSI for USA and BSI for UK). SC2 currently has 29 P-members (participating members who have voting rights), 20 O-members (observing members who do not have voting rights) and a number of Liaison members such as TCA (Taipei Computer Association, representing Taiwan), the Script Encoding Initiative and the Unicode Consortium.
The actual technical work on 10646 is done by SC2's Working Group 2 (WG2), which normally meets for five days twice a year (spring and autumn) at various exotic locations throughout the world (the meetings tend to alternate between North America, Europe and the Far East). WG2 comprises experts representing SC2's member national bodies and liaison members, although not all of the SC2 P-members regularly send experts to WG2 meetings -- currently Canada, China, Ireland, Japan, Republic of Korea, TCA, Unicode, UK and USA are the most active members of WG2. Since 2011, however, due to a change in the ballot process, WG2 has met only once every nine months.
The Unicode Standard (often abbreviated as TUS) is produced by an American-based consortium whose membership mostly comprises large US corporations such as Adobe, Apple, Google, IBM, Microsoft, Sun and Yahoo.
Major versions of the Unicode standard were published in printed book form up to version 5.0, but since version 5.2 the core specification has been published as print-on-demand, with the full specification available in PDF form on-line. The data files, character charts and PDF files for the latest version are all available on-line from the Unicode web site.
|Version||Date||Book||Corresponding ISO/IEC 10646 Edition|
|1.0.0||October 1991||ISBN 0-201-56788-1 (Vol.1)|
|1.0.1||June 1992||ISBN 0-201-60845-6 (Vol.2)|
|1.1||June 1993||ISO/IEC 10646-1:1993|
|2.0||July 1996||ISBN 0-201-48345-9||ISO/IEC 10646-1:1993 plus Amds. 5, 6 and 7|
|2.1||May 1998||ISO/IEC 10646-1:1993 plus Amds. 5, 6 and 7, and two characters from Amd.18 (Euro Sign and Object Replacement Character)|
|3.0||September 1999||ISBN 0-201-61633-5||ISO/IEC 10646-1:2000|
|3.1||March 2001||ISO/IEC 10646-1:2000|
|3.2||March 2002||ISO/IEC 10646-1:2000 plus Amd.1|
|4.0||April 2003||ISBN 0-321-18578-1||ISO/IEC 10646:2003|
|4.1||March 2005||ISO/IEC 10646:2003 plus Amd.1|
|5.0||July 2006||ISBN 0-321-48091-0||ISO/IEC 10646:2003 plus Amds. 1 and 2, and four characters from Amd.3 (Devanagari Letters GGA, JJA, DDDA and BBA)|
|5.1||April 2008||ISO/IEC 10646:2003 plus Amds. 1, 2, 3 and 4|
|5.2||October 2009||ISBN 978-1-936213-00-9||ISO/IEC 10646:2003 plus Amds. 1, 2, 3, 4, 5 and 6|
|6.0||October 2010||ISBN 978-1-936213-01-6||ISO/IEC 10646:2011 (equivalent to ISO/IEC 10646:2003 plus Amds. 1 through 8) and one character not yet in ISO/IEC 10646 (Indian Rupee Sign)|
|6.1||January 2012||ISBN 978-1-936213-02-3||ISO/IEC 10646:2012|
|6.2||September 2012||ISBN 978-1-936213-07-8||ISO/IEC 10646:2012 and one character not yet in ISO/IEC 10646 (Turkish Lira Sign, included in ISO/IEC 10646:2012 Amd.1)|
|6.3||Spring 2013||ISBN 978-1-936213-08-5||ISO/IEC 10646:2012 and five characters not yet in ISO/IEC 10646 (Arabic Letter Mark, Left-To-Right Isolate, Right-To-Left Isolate, First Strong Isolate, Pop Directional Isolate, to be included in ISO/IEC 10646:2012 Amd.2)|
Versions of Unicode are normally synchronised to a particular edition (plus amendments) of ISO/IEC 10646, although Unicode 2.1, 5.0 and 6.0 contained a few characters from a 10646 amendment that had not yet been published.
The technical committee that makes decisions about character additions to the Unicode Standard is the Unicode Technical Committee (UTC), which meets for about 4 or 5 days every three months (always somewhere on the West Coast). Thus some lucky members of the UTC who are also participants in WG2 get to spend up to six weeks of every year in Unicode/10646 committee meetings.
The Ballot Process
For a new character or set of characters to be accepted into either Unicode or 10646 it has to be accepted by both Unicode and 10646. Thus proposals for character additions go before both the UTC and WG2 (it does not matter much which of these two committees sees the proposal document first). The UTC process is very straightforward: a proposal is reviewed at one of the quarterly UTC meetings, and is accepted, rejected or otherwise dealt with (for example, returned to the submitter for changes or with a request for additional information). If a proposal is accepted by the UTC it is then forwarded to WG2 for review. In theory the UTC and WG2 could be unable to reach a common decision on a particular proposal, and the process would become deadlocked, but this has never yet happened, largely because the members of the two committees are reasonable people, but also because there is considerable overlap in the composition of WG2 and the UTC, so that neither committee will do something that would be unacceptable to the other committee.
A proposal submitted to WG2 will be reviewed at one of its biannual meetings. If it is the first time that a major proposal (e.g. for a new script) has been seen by the committee, it may be simply noted and national bodies requested to review it and provide feedback for the next meeting. However, if the proposal is not controversial and/or has already been accepted by Unicode it may be accepted for inclusion in an amendment straight away. Draft editions and amendments go through a series of ballots, each lasting several months :
- CD (Committee Draft) or PDAM (Proposed Draft Amendment) : 3 month ballot at SC2 level
- DIS (Draft International Standard) or DAM (Draft Amendment) : 5 month ballot at JTC1 level (prior to 2011 this was a 4 month FCD or FPDAM ballot at the SC2 level)
- FDIS (Final Draft International Standard) or FDAM (Final Draft Amendment) : 2 month ballot at JTC1 level
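To get a feel for how long this takes, the three ballot stages listed above can be added up. This is only a back-of-the-envelope sketch in Python: the `BallotStage` class and the `minimum_ballot_months` function are names I have made up for illustration, and the durations are simply the post-2011 figures from the list.

```python
# Illustrative sketch only: BallotStage and minimum_ballot_months are hypothetical names.
from dataclasses import dataclass


@dataclass(frozen=True)
class BallotStage:
    edition_name: str    # name of the stage when balloting a whole edition
    amendment_name: str  # name of the stage when balloting an amendment
    months: int          # length of the ballot period
    level: str           # committee level at which the ballot is held


# Durations and levels as listed above (post-2011 process).
STAGES = (
    BallotStage("CD", "PDAM", 3, "SC2"),
    BallotStage("DIS", "DAM", 5, "JTC1"),
    BallotStage("FDIS", "FDAM", 2, "JTC1"),
)


def minimum_ballot_months(stages=STAGES) -> int:
    """Sum the ballot periods, ignoring the gaps between ballots."""
    return sum(stage.months for stage in stages)


if __name__ == "__main__":
    for stage in STAGES:
        print(f"{stage.edition_name}/{stage.amendment_name}: "
              f"{stage.months} months at {stage.level} level")
    print("Minimum time under ballot:", minimum_ballot_months(), "months")
```

The total is very much a lower bound: it leaves out the time between ballots while comments are dealt with at a WG2 meeting, and it assumes that no ballot has to be repeated.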
The first ballot is a technical ballot in which P-members of SC2 may request technical changes to the edition or amendment in the form of ballot comments accompanying a positive or negative vote. Typically a negative vote will be accompanied by proposed changes which if accepted will change the negative vote to a positive vote. Ballot comments are disposed of at the next WG2 meeting, and where possible changes are made so that negative votes can be eliminated. Occasionally disagreements cannot be resolved, and an amendment is carried by a majority vote (e.g. ISO/IEC 10646:2003 FPDAM2 was carried despite unchanged no votes by Canada and Germany in relation to N'Ko and Phoenician respectively). It is unheard of (within SC2 at least) for an amendment to be voted down as it would not get to the ballot stage unless there was a consensus in support of the amendment (consensus is the key word with WG2).
New submissions that have been favourably reviewed by WG2 normally go into a CD/PDAM ballot together with a basket of other proposed character additions, but sometimes a submission may get added directly to an (FP)DAM ballot if it is urgent (as was the case recently with U+9FC3, which was added directly to ISO/IEC 10646:2003 FPDAM4). At the CD/PDAM ballot stage the repertoire of the amendment is still very much in flux, and ballot comments may request that characters be added, deleted, moved or otherwise amended. If the PDAM ballot results indicate that there is controversy over a particular set of characters they may be removed entirely or else moved back to the following amendment (as was the case with Phags-pa which was moved back from ISO/IEC 10646:2003 PDAM1 to PDAM2, and CJK-C which was moved back from ISO/IEC 10646:2003 PDAM4 to PDAM5, and most recently Tangut which was moved back from ISO/IEC 10646:2003 PDAM6 to PDAM7 and then removed for addition to a future amendment). And if major changes are made to an amendment after the PDAM ballot, it may even be resubmitted for a second PDAM ballot (as was the case with ISO/IEC 10646:2003 PDAM3 and PDAM6). Technical changes may also be requested in the DIS/DAM ballot comments made by JTC1 P-members (a wider membership than SC2), but as the major problems should already have been sorted out at the first ballot, DIS/DAM ballot comments are generally fewer and of less significance.
Once an amendment has passed the DIS/DAM ballot no further technical changes may be made, and the amendment is submitted for a final FDIS/FDAM ballot by JTC1 members, which is largely a formality. When the FDIS/FDAM ballot is approved the amendment will be published, and implementations of 10646 (e.g. GB-18030) can use the new characters. However, it is not until a new major version of Unicode which includes the new characters is officially released that most software and fonts will start (usually very slowly) to support them.
Because it may take a long time to complete the actual publication of a standard or an amendment, the corresponding Unicode version is often released several months earlier than the 10646 publication date. For example, Unicode 4.0 (April 2003) was released a full year before the corresponding ISO/IEC 10646:2003 was published (April 2004), and Unicode 4.1 (31st March 2005) was released seven and a half months before the corresponding Amd.1 was published (2005-11-15). On the other hand, Unicode 5.0 (14th July 2006) was not released until two weeks after the corresponding Amd.2 was published (2006-07-01).
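The gaps quoted above are easy to check with a bit of date arithmetic. Here is a small sketch in Python using only the two pairs of dates for which an exact day is given; the `RELEASES` table and the `lead_days` function are my own illustrative names.

```python
# Illustrative sketch only: RELEASES and lead_days are hypothetical names.
from datetime import date

# (Unicode release date, 10646 amendment publication date), as quoted above.
RELEASES = {
    "Unicode 4.1 / Amd.1": (date(2005, 3, 31), date(2005, 11, 15)),
    "Unicode 5.0 / Amd.2": (date(2006, 7, 14), date(2006, 7, 1)),
}


def lead_days(unicode_release: date, iso_publication: date) -> int:
    """Days by which the Unicode release preceded the 10646 publication.

    A negative result means that Unicode was released after the amendment.
    """
    return (iso_publication - unicode_release).days


if __name__ == "__main__":
    for label, (unicode_release, iso_publication) in RELEASES.items():
        print(f"{label}: {lead_days(unicode_release, iso_publication)} days")
```

For Unicode 4.1 this gives 229 days (about seven and a half months), and for Unicode 5.0 it gives -13 days, i.e. just under two weeks after Amd.2 was published.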
What's under Ballot ?
[This section is obviously completely out of date now, but I am not going to update it as what is currently under ballot changes every six months or so]
So now it should make some sense if I say that there are three amendments to ISO/IEC 10646:2003 currently [at the time that I originally wrote this post] going through the ballot process (list of open SC2 ballots) :
- Amd.3 (Lepcha, Ol Chiki, Saurashtra, Vai, and other characters) : submitted for FDAM ballot
- Amd.4 (Lanna, Cham, Game Tiles, and other characters) : under FPDAM ballot (due 2007-09-11)
- Amd.5 (Meitei Mayek, Bamum, Tai Viet, Avestan, Egyptian Hieroglyphs, CJK Unified Ideographs Extension C, and other characters) : under PDAM ballot (due 2007-09-10)
Amd.3 has now completed its technical ballots, and its character repertoire is stable. Amd.4 is on its final technical ballot, and there are unlikely to be major changes to its character repertoire. Amd.5 is on its first technical ballot, and there will almost certainly be changes to its character repertoire.
Because people are still trying to come to terms with Unicode 5.0, a new version of Unicode corresponding to Amd.3 will not be released, but instead Unicode will synchronise its repertoire with ISO/IEC 10646 on Amd.4 with Unicode 5.1. That is to say, the character additions for Unicode 5.1 will correspond to Amds.3 and 4 (less the four Devanagari letters for Sindhi from Amd.3 that are already in Unicode 5.0). In the next post I will look at the contents of Unicode 5.1 in more detail [if you want more up-to-date information please read What's new in Unicode 5.2 ?].
[Last updated : 2013-04-18]