Unicode normalization on MacOS/OSX (internationalization and interoperabili
msg#00000, network.gnutella.limewire.core.devel
To Adam Fisk (LimeWire),
To other LimeWire GUI/Core developers,
For information, to the GDF subscribers too,

This email is quite long, sorry, but this is a complex issue which may require immediate changes in LimeWire (or in other servents as well), first because of a bug in Java for MacOS (and in MacOS itself for servents written in C/C++), but also because it exposes immediate internationalization issues and already existing interoperability problems. It will also discuss some future evolutions in a critical part of the search algorithm used on Gnutella.

If you have followed the discussions about the problem found in MacOS/OSX, where HFS+ filenames are exposed to the application as decomposed Unicode strings (in fact in FCD form, as I found after looking into the Apple VFS sources, and not in NFD form as I initially thought), you know that we are concerned by two facts: MANY filename strings exposed by MacOS do NOT even use the simple ISO-8859-1 codes for ISO-8859-1 composite characters, and some of these decomposed characters are displayed correctly on the Mac but will never be displayed correctly on other systems.

This raises many interoperability problems in LimeWire, even among MacOS users themselves, because this behavior depends on the filesystem used on each Mac (it may be HFS+, encoded and exposed with Unicode FCD; or HFS, encoded with a legacy reduced 8-bit Mac character set and exposed as Unicode NFC; or UFS/NTFS/FAT32, encoded with Unicode NFC). So when reading filenames from filesystems, we really are affected by these differences, and by the propagation on the network of different binary encodings of Unicode strings, even though they are all valid and canonically equivalent to their NFC or NFD form. MacOS/OSX filesystems all accept to store files with names in NFC, NFD, NFKD or NFKC form, but all of them will be forced to the denormalized FCD form.
Sharing FCD strings on the network is supported nowhere else than on MacOS/OSX, and there only for file storage on HFS+ volumes. We are also concerned by the fact that Asian input methods or keyboard drivers expose strings to Java in other denormalized forms, which may be interoperable only within these systems, mainly on Unix and Windows.

The vast majority of servents on the network expect the NFC form, which is canonically equivalent to the NFD and FCD forms but is much more widely supported than any other encoding form. It therefore seems that we cannot avoid, even in the case of Western European languages, the need for a canonicalizing NFC converter for filenames read from the filesystem.

On Mac OS/OSX, using such a converter will always be safe, given that HFS+ will always force its own "canonicalization" (which does not conform to any Unicode standard, except for a technical note on the "fast decomposition" techniques used to perform some internal processing of strings, in which the FCC and FCD forms are discussed). It will also enhance the user experience on the Mac, because the Mac system itself cannot even correctly display filenames using Apple's FCD encoding form, due to limitations in font renderers that cannot compose characters to find their appropriate glyph in fonts. This is true even for French or German MacOS users!

We will then probably need to do things in two steps, first integrating a Unicode NFC normalizer in LimeWire, for example the one in the ICU4J package (open-sourced by IBM under an X license and already licensed by Sun in Java), or a reduced version of ICU4J (from which we would import just the Unicode normalizer functions and their NFC/NFD conversion tables).
Note that character composition in the current Unicode 3.2 version only concerns fewer than 2000 pairs of base+combining characters (if performed recursively to handle the case of multiple compositions) and the 11172 compositions of Korean Hangul syllables (performed algorithmically, without a table).

There are some tricky cases when performing any string normalization in Unicode:
- multiple diacritics that must be reordered after first converting the input string to NFD;
- multiple diacritics with the same combining class, which must not be reordered and whose composition must be tried in their existing order;
- the more complex case of intermediate diacritics with different combining classes, which may not combine with the base character but do not block combining with a further diacritic;
- the associated case where two remaining diacritics may combine with each other if not blocked by another intermediate diacritic.

In a first attempt I looked at an Apple Technote on the HFS+ internal encoding. I then found that this technote was really old and only considered the Unicode 2.0 rules, and that it also omitted many combinations that were not documented (notably the Japanese voicing marks applied to Hiragana/Katakana syllables to modify their leading consonant sound, which one of our users reported as a bug in LimeWire). So I looked at another associated document, on the integration of HFS+ into the Apple VFS (Virtual File System), which allows MacOS/OSX to use many Apple and non-Apple filesystems through a common Mac API in applications. However, the VFS design forgot that case of interoperability; even in MacOSX 10.2 such support is still missing, and its conversion routines are missing important Unicode additions.
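The algorithmic Hangul composition mentioned above can be sketched in a few lines of Java. This is only an illustration: the constants are the ones published in the Unicode standard (19 leading consonants × 21 vowels × 28 trailing positions, counting "no trailing consonant", which gives the 11172 syllables), while the class and method names are hypothetical and belong neither to LimeWire nor to ICU4J.

```java
final class HangulCompositionSketch {
    // Constants from the Unicode standard's Hangul syllable algorithm.
    static final int S_BASE = 0xAC00; // first precomposed syllable
    static final int L_BASE = 0x1100; // first leading consonant jamo
    static final int V_BASE = 0x1161; // first vowel jamo
    static final int T_BASE = 0x11A7; // one before the first trailing jamo
    static final int L_COUNT = 19, V_COUNT = 21, T_COUNT = 28;

    /** Number of syllables reachable by the algorithm (no table needed). */
    static int syllableCount() {
        return L_COUNT * V_COUNT * T_COUNT;
    }

    /**
     * Composes a leading consonant, a vowel and an optional trailing
     * consonant into one precomposed syllable. Pass 0 as the trailing
     * consonant when there is none.
     */
    static char compose(char leading, char vowel, char trailing) {
        int l = leading - L_BASE;
        int v = vowel - V_BASE;
        int t = (trailing == 0) ? 0 : trailing - T_BASE;
        boolean trailingOk = (trailing == 0) || (t >= 1 && t < T_COUNT);
        if (l < 0 || l >= L_COUNT || v < 0 || v >= V_COUNT || !trailingOk) {
            throw new IllegalArgumentException("not a composable jamo sequence");
        }
        return (char) (S_BASE + (l * V_COUNT + v) * T_COUNT + t);
    }

    public static void main(String[] args) {
        // U+1112 + U+1161 + U+11AB compose to a single syllable.
        char syllable = compose('\u1112', '\u1161', '\u11AB');
        System.out.println(Integer.toHexString(syllable));
        System.out.println(syllableCount());
    }
}
```

The table-driven part of a normalizer (the base+combining pairs) cannot be replaced by arithmetic in this way, which is why only the Hangul range is handled without a table.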
The current Apple FCD algorithm is flawed, because it is not stable and does not work well with all other GUI elements, including the Finder. Apple expects to change its policy regarding the support of Unicode strings in HFS+, preferably using an NFC or NFD form, whose stability is guaranteed across versions of Unicode, with far fewer interoperability issues. The initial "performance" gains when handling Unicode strings in filenames for operations like ordering and B-tree storage are now irrelevant: the B-tree just needs a coherent and stable form for equivalent strings, and logical ordering (in the GUI) is locale-dependent and thus does not have to match the binary ordering of strings in HFS+ volumes.

This issue is critical for many reasons:
- the QRP algorithm does not match canonically equivalent filenames across different platforms (for example a French, German or Spanish Windows user cannot exchange files whose names contain French, German or Spanish accents with a French, German or Spanish Macintosh user);
- the GUI font renderer appropriate for the users' locales cannot (most often) correctly display strings that are not normalized to their canonical NFC form.

So the idea would be to force the NFC composition of strings both in the GUI and in all strings used in the protocol.

This issue is *independent* of whether we use UTF-8, ISO-8859-1 or other legacy encodings in Gnutella messages. It is also *independent* of whether other servents can handle messages using Unicode. For example, look at the FCD encoding form {e; acute} used on MacOS to store a simple "é" (U+00E9): if we send it with ISO-8859-1, we will send the {e} (U+0065) but not the {acute} (U+0301), which is not convertible to ISO-8859-1, or we will send a wrong character such as a {SOH control character} (U+0001).
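The {e; acute} example above can be reproduced with a short Java sketch. It assumes a modern JDK, where java.text.Normalizer (Java 6+) and StandardCharsets (Java 7+) exist; neither is available on the Java 1.1.8 / MRJ 2.5 target discussed in this email, so this illustrates the problem and is not a proposal for the LimeWire code. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

final class DecomposedAccentSketch {
    /** Canonical composition (NFC) of a string. */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** ISO-8859-1 bytes; unmappable characters become '?' (0x3F). */
    static byte[] latin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Naive conversion that keeps only the low 8 bits of each char. */
    static byte truncate(char c) {
        return (byte) c;
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301"; // what an HFS+ volume exposes
        String composed = nfc(decomposed); // single character U+00E9

        // Without normalization: 2 bytes, the accent is lost as '?'.
        System.out.println(latin1(decomposed).length);
        // With normalization first: 1 byte, 0xE9.
        System.out.println(latin1(composed).length);
        // Naive truncation of U+0301 yields the SOH control code.
        System.out.println(truncate('\u0301'));
    }
}
```

Normalizing to NFC before any charset conversion is what makes the ISO-8859-1 path usable at all for such names, which is the sense in which the issue does not depend on the wire encoding.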
This will potentially break some messages, and will definitely not match in QRP or with other servents that expect to see the single "é" character (U+00E9).

So I propose this migration scheme:

1) First integrate and test an NFC/NFD converter in LimeWire. Its compliance with Unicode can be tested using the test file provided on the Unicode.org web site. I have started on this job, looking at possible issues, because the ICU4J module will not be easy to adapt to Java 1.1.8 (on MacOS with MRJ 2.5), and because a full integration of ICU4J would import many classes that LimeWire currently does not need (notably those related to NFKD/NFKC composition, charset converters, and String parsing utilities) or that already have a partial implementation in Java 1.1 (extended in 1.4, and that may be extended in the future to include some features found in ICU4J, already licensed by Sun and Apple in Java).

2) Then, on MacOS/MacOSX *only*, force all strings returned by File.getName() to their NFC form. For example:

    String name = file.getName();

would be followed by:

    if (CommonUtils.isAnyMac()) name = UnicodeString.NFC(name);

This change would be performed throughout the code. An alternative approach would be to add a method in CommonUtils, and to replace the first line above by:

    String name = CommonUtils.getFileName(file);

where the new method would call File.getName() and apply the UnicodeString.NFC converter. This will solve most of the issues related to the MacOS behavior (which neither MRJ 2.5 for Mac OS 8/9 nor Java2 for Mac OSX corrects for now). But there will still be interoperability issues with other systems.

3) Change the way we store and display the filenames in the library, so that the physical filename reported by the filesystem can be distinct from the NFC form we use to display and share them.
Or ensure that possible conflicts (mainly on Unix, where string normalization is not performed accurately) are handled gracefully, to avoid the case of distinct files with canonically equivalent filenames. In that case, all strings received from the network would first need to be canonicalized to their NFC form, as would all strings entered in the Search form or typed by the user when renaming a file in the Library.

4) Discuss with the GDF how to specify that QRP tables will be created by hashing Unicode strings in a normalized form. Here we have 4 choices, with distinct semantics with respect to search operations:

- a.1) The NFC form is simpler to implement, but it does not provide the feature users often want, which is to find "cafe", "café", "CAFE" or "CAFÉ" when searching for any one of these keywords. Note that supporting the NFC form already implies supporting the NFD form.

- a.2) The NFD form (with decomposed accents) can first be used to detect and remove diacritics (i.e. all Unicode characters that have a non-zero combining class, the non-starter characters), before converting keywords to lowercase. Supporting it is good only in the case where all combining characters are removed from the decomposed string (in that case, the filtered NFD string also becomes a simpler NFC string).

- b.1) The NFKC form has some merits (as it creates compatibility equivalences for minor distinctions such as A-ring and the Ångström sign, or full-width/half-width variants of Japanese Hiragana/Katakana or Korean Jamo characters). Note that supporting the NFKC form already implies supporting the NFKD form.

- b.2) The NFKD form is probably the best form, as it combines the advantages of NFD and NFKC for search operations. Supporting it is good only in the case where all combining characters are removed from the decomposed string (in that case, the filtered NFKD string also becomes a simpler NFD string and a simpler NFC string, which is also an NFKC string).
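Steps 2 and 4 (option b.2) of the scheme above can be sketched as follows. This is a minimal illustration under several assumptions: it uses java.text.Normalizer from a modern JDK rather than the ICU4J-derived converter discussed in this email; it applies NFC unconditionally instead of only on the Mac (normalizing an already composed name leaves it unchanged); it approximates "non-zero combining class" by the Unicode "Mark" general categories, since the JDK exposes no combining-class lookup; and it uses toLowerCase(Locale.ROOT) as a stand-in for a formally defined case folding. All names are hypothetical.

```java
import java.io.File;
import java.text.Normalizer;
import java.util.Locale;

final class SearchKeySketch {
    /** Step 2: the name to display and share, in NFC form. */
    static String fileNameNfc(File file) {
        return Normalizer.normalize(file.getName(), Normalizer.Form.NFC);
    }

    /** Step 4, option b.2: NFKD, drop combining marks, then lowercase. */
    static String searchKey(String keyword) {
        String decomposed = Normalizer.normalize(keyword, Normalizer.Form.NFKD);
        StringBuilder key = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);
            int type = Character.getType(cp);
            // Assumption: general category Mark approximates "non-starter".
            boolean mark = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;
            if (!mark) {
                key.appendCodePoint(cp);
            }
        }
        return key.toString().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The four spellings from option a.1 all yield the same key.
        System.out.println(searchKey("cafe"));
        System.out.println(searchKey("caf\u00E9"));
        System.out.println(searchKey("CAFE"));
        System.out.println(searchKey("CAF\u00C9"));
    }
}
```

With such a key, the string hashed into a QRP table and the string typed in the Search form go through the same function, so the decomposed Mac spelling and the composed Windows spelling of a keyword hash identically.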
I would advocate solution b.2, but this requires a second conversion table for NFKD (also needed for solution b.1), larger than the conversion table for NFD (also needed for NFC). However, if we want to mask case differences, then the case-folding operation needed in all 4 forms would require a large table too. For QRP and search string matching, if case folding is preferable, then combining the case-folding conversion table with the NFKD table will not create a much larger table.

Fast conversion is possible using tables compacted with a "Trie" (with Unicode 3.2, such a table requires only about 8000 useful entries out of the 1.1 million possible Unicode characters and sequences, compacted in a Trie with a table of 10000 ints). Of course, a Trie table will not be computed by hand. It must be generated either in a spreadsheet or by a program that parses the "Unicode Character Database" text file, which contains all the decompositions and their canonical/compatibility status, the "Unicode Combining Character" class text file (needed for correct canonical reordering of diacritics in decomposed strings), and the "Unicode Combining Exclusion" text file (described in UTR #15, and needed for the stability of normalized strings across versions of Unicode).

I have already computed such a table, and compared it with the ICU4J implementation (which for now only complies with Unicode 3.1, which already integrates post-publication corrections, but does not handle characters added in Unicode 3.2, whose support by applications would require an updated and compliant implementation in ICU4J). However, I note that ICU4J was designed to use the same tables as ICU4C (from which they are imported); these perform better in C/C++ than in 100% pure Java, where their construction time is significant. It also has too many classes for our purposes.
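As a rough illustration of the table-generation step, here is a sketch of how one line of the Unicode Character Database could be parsed to extract what a normalizer table needs. It assumes the semicolon-separated layout of the UnicodeData.txt file, where field 0 is the code point, field 3 the canonical combining class and field 5 the decomposition mapping (prefixed by a tag such as <wide> when the mapping is a compatibility one). The class and field names are hypothetical; this is not the program used to build the table mentioned above, and it does not handle the composition exclusions.

```java
final class UnicodeDataLineSketch {
    final int codePoint;
    final int combiningClass;
    final boolean compatibility; // true when the mapping carries a <tag>
    final int[] mapping;         // empty when there is no decomposition

    private UnicodeDataLineSketch(int cp, int ccc, boolean compat, int[] map) {
        this.codePoint = cp;
        this.combiningClass = ccc;
        this.compatibility = compat;
        this.mapping = map;
    }

    /** Parses one semicolon-separated line in UnicodeData.txt layout. */
    static UnicodeDataLineSketch parse(String line) {
        String[] fields = line.split(";", -1); // keep trailing empty fields
        if (fields.length < 6) {
            throw new IllegalArgumentException("unexpected line: " + line);
        }
        int cp = Integer.parseInt(fields[0], 16);
        int ccc = Integer.parseInt(fields[3]);
        String decomposition = fields[5].trim();
        boolean compat = decomposition.startsWith("<");
        if (compat) {
            // Drop the leading tag, keep the code points that follow it.
            decomposition = decomposition.substring(decomposition.indexOf('>') + 1).trim();
        }
        int[] map = new int[0];
        if (!decomposition.isEmpty()) {
            String[] parts = decomposition.split("\\s+");
            map = new int[parts.length];
            for (int i = 0; i < parts.length; i++) {
                map[i] = Integer.parseInt(parts[i], 16);
            }
        }
        return new UnicodeDataLineSketch(cp, ccc, compat, map);
    }

    public static void main(String[] args) {
        UnicodeDataLineSketch e = parse(
            "00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;;;00C9;;00C9");
        System.out.println(e.mapping.length + " code points, compatibility=" + e.compatibility);
    }
}
```

A generator would feed every parsed line into the NFD table (canonical mappings only) or the NFKD table (all mappings), and keep the combining classes separately for canonical reordering.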
I think that we should first concentrate on implementing NFD/NFC and, later, NFKD/NFKC for search enhancements, with an updated QRP table exchange format in which the UltraPeer and the Leaf Node can communicate which formats they best support and agree on a common hashing algorithm (truncation compatible with ISO-8859-1, Unicode NFD packed to NFC, Unicode NFKD packed to NFKC)...

It is also notable that a case-folding algorithm remains to be defined more formally for QRP. This is a more complex issue that other servents would not like to discuss or implement for now, but they will change their minds in the future, as support for normalization forms and case folding will be integrated in all OSes and in most C and Java libraries: complete support for Unicode 3.2 is already required by JISX in Japan or GB in Taiwan, and is mandatory for ALL systems sold in China to conform to the GB18030 standard, which has superseded the previous GBK and older GB2312 standards. Many other national standardization bodies will also require it in operating systems (notably in Europe, where support for all official languages of the European Union is needed in most applications, including languages with non-Latin scripts such as Greek, future members using a Cyrillic alphabet, and languages using rarer Latin letters such as Turkish, Romanian, or Maltese).

The consequence of these requirements is that the Unicode normalization forms will soon be available everywhere, and there will be no good reason not to support them.

-- Philippe.