Re: Unicode Normalization on MS-Windows (msg#00368, text.unicode.devel)
Jane Liu wrote:

> Thanks for sharing some background. Yes, the character "U+FA19" was
> originally from the JIS standard, level-3, IBM extensions and NEC
> selected extensions and it has been assigned two code points: 0xFB7E
> (IBM portion) and 0xEE62 (NEC portion) in the native encoding

... i.e., in Code Page 932. But in EUC-JP (proper), there is *no* encoding for this. In the EUC-JP vendor extensions to interoperate with Code Page 932, there is *one* encoding: 0xFB 0x7E. In the IBM host encoding Code Page 930, there is *one* encoding: (0x0E) 0x5F 0xD5 (0x0F). And in GB 18030, U+FA19 converts to: 0x84 0x30 0x9B 0x39.

So even interoperating between systems without normalizing, you have to be concerned about the retention or substitution of this character. Most other systems cannot roundtrip the two encodings of the character in Code Page 932. And if you have a Windows system hooked up to an EUC-JP back end, U+FA19 might or might not survive, depending on the level of that system and the support or non-support for EUC-JP extensions.

> Correct me if I'm wrong, it seems to me, not only for this case,
> actually in general, neither Microsoft Windows nor those popular UNIX
> systems (AIX, Solaris, HP-UX) currently supply the explicit support
> of Unicode normalization at the encoding/conversion level. I suspect
> this would also apply to all major databases.

You have to be cautious here, too. Major databases may or may not normalize Unicode data when their internal storage is in Unicode. This may or may not be a user-definable setting. The choice to normalize (for canonical equivalences) in databases is often made for performance reasons, because optimizing comparisons across table joins for non-normalized data can be very messy and can kill database performance on queries.

> The bottom line would
> be "WYSIWYG = What You See Is What You Get", Right?

Nope. The point of most canonical equivalences is that a typical end-user cannot tell the difference in what they get.
Canonical equivalences usually refer to two different sequences for representing the "same" thing (where, in a few cases, as for CJK compatibility characters, the *sequence* may consist only of a single character).

You have picked out a particularly problematical subset of the CJK compatibility characters, however. U+FA19 consists of a user-visible and distinguishable variant of U+795E. From the point of view of Han unification, the two forms are simply variants of the *same* unified character. Only a requirement for roundtrip convertibility with Code Page 932 resulted in separate encoding for U+FA19. But in this case, as for others of the IBM 32, the separate encoding of the variant was to make it visible and usable by end users, distinct from U+795E.

Some of the discussion underway about finding ways to declare tailorings of normalization is precisely to enable retention of these variant distinctions for some CJK compatibility characters.

> If that's true, can we conclude that in order to maintain the
> transparency and round-trip safety between application and OS, the
> application should not use normalization?

Yes, but...

The problem often is that there is no clear boundary to "the application". Applications these days are often distributed, and parts may operate on different platforms. It may not be easy to define what parts are or are not using normalization of Unicode data.

The conformance requirement for the Unicode Standard is that one process cannot *demand* that another process maintain a distinction between canonically equivalent sequences. So even if your application doesn't normalize, if you interact with any other application (and the OS platforms and databases also constitute complex applications, in their own ways), you cannot guarantee that they will not normalize.
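To make the U+FA19 case concrete, here is a minimal Python sketch of both effects described above: the two Code Page 932 encodings collapsing into one on a round trip, and normalization replacing U+FA19 with U+795E. It assumes Python's built-in `cp932` codec and `unicodedata` module as stand-ins for whatever converter and normalizer a real system uses; the helper name `roundtrip_cp932` is hypothetical, and the byte values are the ones quoted in the message.

```python
import unicodedata

COMPAT = "\uFA19"   # CJK compatibility ideograph discussed above
UNIFIED = "\u795E"  # unified ideograph it is canonically equivalent to


def roundtrip_cp932(data: bytes) -> bytes:
    """Decode Code Page 932 bytes to Unicode, then encode them back."""
    return data.decode("cp932").encode("cp932")


# The two Code Page 932 encodings quoted in the message (IBM and NEC portions).
ibm, nec = b"\xFB\x7E", b"\xEE\x62"
same_char = ibm.decode("cp932") == nec.decode("cp932")
print("both decode to one character:", same_char)
print("IBM form survives round trip:", roundtrip_cp932(ibm) == ibm)
print("NEC form survives round trip:", roundtrip_cp932(nec) == nec)

# Any process in the chain that normalizes replaces U+FA19 with U+795E,
# so the bytes written back to a Code Page 932 system change as well.
normalized = unicodedata.normalize("NFC", COMPAT)
print(f"NFC(U+FA19) = U+{ord(normalized):04X}")
print("bytes unchanged after NFC:",
      normalized.encode("cp932") == COMPAT.encode("cp932"))
```

Since both byte sequences decode to a single Unicode character, at most one of them can come back unchanged, whichever one the encoder happens to prefer.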
The defensive way to program is to write one's own application in such a way that it does not maintain distinctions between canonically equivalent sequences, or when it does, it does not break if interoperating with some other process that does not maintain such distinctions.

--Ken

> Also, it would be nice to give the application user the flexibility
> to turn the normalization process on or off; however, this may sound
> useless since the majority of those systems don't even care.
>
> Jane
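The defensive approach described in the reply can be sketched as follows. This is one possible reading, not a prescribed method: it assumes that comparing and keying strings by their NFC form is acceptable for the application, and the function names `canonical_key` and `canonically_equal` are hypothetical. The stored text itself is left untouched, so a variant such as U+FA19 is still retained wherever no other process has normalized it.

```python
import unicodedata


def canonical_key(text: str) -> str:
    """Return a key that is identical for canonically equivalent strings."""
    return unicodedata.normalize("NFC", text)


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings without distinguishing canonical equivalents."""
    return canonical_key(a) == canonical_key(b)


# Index records by canonical key, but keep the original text as the value.
records = {}
for name in ["\uFA19", "e\u0301"]:
    records[canonical_key(name)] = name

# A lookup succeeds even if another process normalized the string in transit.
print(canonical_key("\u795E") in records)
print(canonically_equal("e\u0301", "\u00E9"))
```

Keying on the normalized form while storing the original means the application does not break when a database or remote component hands back the normalized sequence instead.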