osdir.com
mailing list archive

Subject: Re: NFS4 requires UTF-8 [NFC versus NFD] - msg#00177

List: internationalization.linux

"McDonald, Ira" wrote on 2002-02-23 21:45 UTC:
> told us to use NFKC (which folds compatibility equivalents into
> their base characters).

Well, NFKC is a subset of NFC, and it is certainly a more "proper" form
of Unicode. If given as advice to people who enter new Unicode strings,
sticking to NFKC is certainly a good idea, as it eliminates the use of a
number of compatibility characters such as the much hated ANGSTROEM
SIGN.

NFKC is just not suitable for applications that have to deal with
already existing text (e.g., a file system) and have to take and
preserve whatever information they are provided. NFKC also takes away a
number of characters that are perfectly useable for terminal emulator
applications, e.g. the subscript/superscript digits, but which should
not be used in a proper word processing environment where there are
better ways to select such presentation forms.
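
To make the difference concrete, here is a minimal sketch using Python's
standard unicodedata module (the helper names all_forms and codepoints are
made up for this illustration). It prints what each normalization form does
to the ANGSTROM SIGN and to a superscript digit. Note that the ANGSTROM SIGN
has a canonical mapping, so NFC and NFD already replace it (as the follow-up
in this thread points out), whereas SUPERSCRIPT TWO only has a compatibility
mapping and therefore survives NFC but is folded to a plain "2" by NFKC.

```python
import unicodedata

ANGSTROM_SIGN = "\u212b"    # U+212B ANGSTROM SIGN (canonical mapping to U+00C5)
SUPERSCRIPT_TWO = "\u00b2"  # U+00B2 SUPERSCRIPT TWO (compatibility mapping to "2")


def all_forms(text):
    """Return the four standard normalization forms of text, keyed by name."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}


def codepoints(text):
    """Render a string as U+XXXX code points so invisible differences show."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    samples = (("ANGSTROM SIGN", ANGSTROM_SIGN),
               ("SUPERSCRIPT TWO", SUPERSCRIPT_TWO),
               ("x + SUPERSCRIPT TWO as a file name", "x\u00b2"))
    for label, sample in samples:
        print(label)
        for form, result in all_forms(sample).items():
            print(f"  {form:<4} {codepoints(result)}")
```

A file system that applied NFKC to the last sample could no longer hand back
the name it was given, which is the "take and preserve" problem described
above.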

So it really depends on the exact application. There are good reasons
why different normalization forms exist, even though I am sure there are
purists who will say that NFKD is the only clean and proper form of
Unicode.

Markus

--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>



Thread at a glance:

Previous Message by Date:

Re: utf8 on irc?

On Fri, 22 Feb 2002, Erika Pacholleck wrote:
> But from time to time there are just small questions arising
> and I do not want to disturb, so is there any irc serv/chan
> for discussion? I already tried #utf8 on some nets but they
> were all negative.

Some of us once used #unicode on irc.gnome.org...

roozbeh

Next Message by Date:

Re: Jamo

Jungshik Shin wrote on 2002-02-23 22:30 UTC:
> In addition, due to its another
> not-so-insightful decision, whatever NF we use, we still are left with
> multiple representation of Hangul syllables as Kent noted.

Well, you can lobby with the Unicode consortium to at least formally
define a Normalization Form J that uses only Jamos. Costs a factor three
memory, not that it really matters in practice though.

Markus

--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>

Previous Message by Thread:

RE: NFS4 requires UTF-8 [NFC versus NFD]

Hi Markus,

This becomes even murkier. W3C _was_ using NFC, as you say, but:

a) When the SLP Project (successor to IETF Service Location WG)
   recently asked for advice about which normalization to use for SLP
   string compares, Harald Alvestrand -- author of RFC 2277 "IETF Policy
   on Character Sets and Languages" and RFC 3066 "Tags for the
   Identification of Languages" -- told us to use NFKC (which folds
   compatibility equivalents into their base characters). Note that
   SLP service attributes frequently contain URLs, so this amounts to
   advice to use NFKC for comparing URLs.

b) The latest "Stringprep Profile for Internationalized Host Names"
   <draft-ietf-idn-nameprep-07.txt> (9 January 2002) by Paul Hoffman
   (a Unicode and IETF guru) also uses NFKC. Paul is co-author of
   RFC 2781 "UTF-16, an encoding of ISO 10646". Note that IDN WG core
   specs are now in working group 'last call'.

NFC and NFD are at least reconcilable, without data loss. NFKC makes
life much harder, if it creeps into file systems (because it loses the
ability to make round-trip transcoding _back_ to the local system's
legacy charset).

By the way, Harald Alvestrand is now the _Chair_ of the IESG, so his
recommendations carry considerable weight in IETF standards.

Cheers,
- Ira McDonald
  High North Inc

-----Original Message-----
From: Markus Kuhn [mailto:Markus.Kuhn@xxxxxxxxxxxx]
Sent: Saturday, February 23, 2002 1:38 PM
To: linux-utf8@xxxxxxxxxxxx
Subject: Re: NFS4 requires UTF-8

"Kent Karlsson" wrote on 2002-02-23 13:33 UTC:
> Also of interest here may be that, IIRC, HFS+ and UFS (the Apple
> file systems) represent all file names in NFD (and for UFS: in UTF-8).
> NFD, not NFC.

Oops, I didn't know that. That's far more of a concern when files are
exchanged between Macs and Linux. In particular since MacOS is in its
latest incarnation just running on top of Berkeley Unix, I expect the
Mac platform to be far more frequently integrated with Unix systems, via
NFS, tar, pkzip, etc.

Alternative solutions:

  a) Linux goes NFD.
  b) MacOS goes NFC.
  c) Normalization when transferring files between the two worlds.
  d) Both sides learn to work well with either form.

The reasons for Linux preferring NFC were

  - That's far closer to existing practice with ISO 8859, JIS, etc.
  - The W3C has said that NFC shall be what the Web uses

and are as far as I can see still valid.

The Linux world will in the long run have to learn how to use combining
characters anyway, as some scripts depend on them (Thai most notably),
so the occasional NFD file from a Mac shouldn't cause major disruption.
GUI file selection will run as before, independent of coding variants,
and for the shell I can see numerous tiny improvements to globbing and
the TAB filename expansion mechanism to make handling the NFC/NFD
difference far more convenient.

It would be nice if the MacOS world and the Linux world used the same
convention, but if not, I think it is a matter of user interface
maturity how easy it will be to deal with the difference.

Example: You have two files

  Müller
  Müllerin

in a directory, the first in NFD, the second in NFC. If you press M+TAB
in a yet to be written UTF-8 aware version of bash, it will fail to
expand to Müller, as the two strings differ after the first letter.
Typing Mu+TAB will expand one, and typing Mü+TAB will expand the other,
so there is a solution for experienced users. A user interface
improvement would be to provide two control keys that allow the user to
scroll through the list of files that are available in the current state
of the TAB selection. I could also imagine bash doing a normalization,
such that entering a prefix in one normalization will include the file
name in the other one as well. There are lots of ways to implement this
in a convenient way, and the only real problem is to get the bash
maintainers interested in UTF-8 at all ...

Markus

--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
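
The idea of a shell "doing a normalization" during TAB expansion, as
suggested in the message above, could look roughly like the following
sketch. It is not how bash works; complete_prefix and raw_prefix are
hypothetical helpers written for this illustration, and the choice of NFC
as the comparison form is an assumption. Both the typed prefix and each
candidate name are normalized for comparison only, and the names are
returned exactly as stored.

```python
import unicodedata


def raw_prefix(prefix, names):
    """Plain code-point prefix match, with no normalization at all."""
    return [name for name in names if name.startswith(prefix)]


def complete_prefix(prefix, names, form="NFC"):
    """Prefix match after normalizing both sides; returns names unchanged."""
    key = unicodedata.normalize(form, prefix)
    return [name for name in names
            if unicodedata.normalize(form, name).startswith(key)]


# The two files from the example: the first stored in NFD, the second in NFC.
nfd_name = unicodedata.normalize("NFD", "M\u00fcller")
nfc_name = unicodedata.normalize("NFC", "M\u00fcllerin")
names = [nfd_name, nfc_name]

if __name__ == "__main__":
    typed = "M\u00fc"  # the user types a precomposed u-umlaut
    print(len(raw_prefix(typed, names)), "match(es) without normalization")
    print(len(complete_prefix(typed, names)), "match(es) with normalization")
```

Comparing in NFC rather than NFD means that a typed "Mu" does not match
either file; whether a bare base letter should match a name that continues
with a combining mark is a user interface decision this sketch does not
settle.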

Next Message by Thread:

RE: NFS4 requires UTF-8 [NFC versus NFD]

> -----Original Message-----
> From: linux-utf8-bounce@xxxxxxxxxxxx
> [mailto:linux-utf8-bounce@xxxxxxxxxxxx]On Behalf Of Markus Kuhn
...
> sticking to NFKC is certainly a good idea, as it eliminates
> the use of a
> number of compatibility characters such as the much hated ANGSTROEM
> SIGN.

The ANGSTROM SIGN (properly spelled 'Ångström sign') has a CANONICAL
mapping to Å, just as it should. So it is "normalised away" already
by NFC and NFD.

Kind regards
/kent k