ASCII and JIS X 0201 Roman - the backslash problem: msg#00070 internationalization.linux
Hi all,

Tomohiro Kubota, in http://www.debian.or.jp/~kubota/unicode-symbols-yen.html, explains the YEN SIGN versus REVERSE SOLIDUS problem. He writes:

  "Solution is very simple. Just regard YEN SIGN and REVERSE SOLIDUS as a different glyphs of the same character. Then, distinction between ASCII and JIS X 0201 Roman can be neglected."

I don't think it is a good solution. It will never allow Japanese users to use the same fonts for ASCII as other users elsewhere.

The way to make it possible for Japanese users to work in a UTF-8 locale consists of

1) Admit that YEN SIGN and REVERSE SOLIDUS are different things.

2) Never use backslash as a directory separator.

3) For programs that interpret backslash as some kind of escape character and use Unicode internally but should work with text in Shift_JIS encoding, consider the multibyte character 0x5C as being the escape trigger, not [only] the Unicode character U+005C. This is already done in bash and gettext. For example, in GNU gettext, we have the code

    static bool
    mb_iseq (mbc, sc)
         const mbchar_t mbc;
         char sc;
    {
      /* Note: It is wrong to compare only mbc->uc, because when the encoding is
         SHIFT_JIS, mbc->buf[0] == '\\' corresponds to mbc->uc == 0x00A5, but we
         want to treat it as an escape character, although it looks like a Yen
         sign.  */
    #if HAVE_ICONV && 0
      if (mbc->uc_valid)
        return (mbc->uc == sc); /* wrong! */
      else
    #endif
        return (mbc->bytes == 1 && mbc->buf[0] == sc);
    }

4) When people convert files from Shift_JIS to Unicode, they need to disambiguate the two uses of the character that Tomohiro mentions:

  "When a Japanese person is a writer, it means YEN SIGN in most cases. When a non-Japanese person is a writer, it always means REVERSE SOLIDUS."

These "most cases" need to be distinguished - in a financial text the use is likely different than in a shell script. It can not be done by the iconv program.

Bruno
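To illustrate point 3 at the byte level, here is a minimal sketch (not taken from bash or gettext) of how a program scanning raw Shift_JIS text might decide which 0x5C bytes are escape triggers. Only a 0x5C that forms a single-byte character qualifies. A 0x5C that is the second byte of a two-byte character must be skipped. The function names are hypothetical, and the lead-byte ranges 0x81-0x9F and 0xE0-0xFC are an assumption that covers common Shift_JIS variants; a real program would rely on its own multibyte iterator, as gettext does with mbchar_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if b starts a two-byte Shift_JIS character.
   The ranges are an assumption covering common Shift_JIS variants. */
static bool sjis_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Hypothetical helper: count the single-byte 0x5C characters in the
   Shift_JIS byte string s of length len.  A 0x5C that is the trail byte
   of a two-byte character is part of that character and is not counted. */
static size_t sjis_count_escape_triggers(const char *s, size_t len)
{
    size_t count = 0;
    size_t i = 0;

    assert(s != NULL);
    while (i < len) {
        unsigned char b = (unsigned char) s[i];

        if (sjis_is_lead_byte(b) && i + 1 < len) {
            i += 2;            /* skip the whole two-byte character */
        } else {
            if (b == 0x5C)
                count++;       /* single-byte 0x5C: an escape trigger */
            i += 1;
        }
    }
    return count;
}
```

For instance, the two bytes 0x95 0x5C encode a single kanji in Shift_JIS, so the sketch reports no escape trigger there, whereas the 0x5C in the three bytes of "a\n" as written in source text is reported once.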