
Subject: New UTF-8 decoder stress test file - msg#00141

List: internationalization.linux

I have updated the UTF-8 decoder stress test file to also cover
overlong sequences, which a good UTF-8 decoder should reject just
like malformed sequences for security reasons.

One part of UTF-8's ASCII compatibility is the property:

ASCII compatibility of the first kind:

ASCII bytes (00-7f) will only represent ASCII characters and will not
show up in other contexts.

Equally important is another property:

ASCII compatibility of the second kind:

ASCII characters can only be represented with a single ASCII
byte (00-7f) and cannot be decoded from other multi-byte sequences.
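
To see why the second property matters, here is a minimal sketch of a 2-byte decoder that lacks any overlong check. The function name naive_decode2 is made up for this illustration and is not taken from any real decoder. It simply strips the marker bits, so the overlong pair C0 AF comes out as U+002F ('/'), even though neither input byte is the ASCII byte 2F that a filter scanning the raw bytes would look for.

```c
/* Hypothetical naive decoder for one 2-byte UTF-8 sequence.
 * It strips the 110 and 10 marker bits and joins the payload bits,
 * but never asks whether a shorter encoding exists. */
unsigned naive_decode2(unsigned char b0, unsigned char b1)
{
    return ((b0 & 0x1fu) << 6) | (b1 & 0x3fu);
}
```

Called with the bytes C3 A4 it returns 0xE4 as intended, but called with C0 AF it returns 0x2F, which is exactly the case the second property forbids.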

Section 4 in the test file helps you to establish the robustness of your
decoder here. Testing for overlong UTF-8 sequences is very easy, once
you have fully understood that all overlong sequences fall into one of
the following patterns:

1100000x 10xxxxxx
11100000 100xxxxx 10xxxxxx
11110000 1000xxxx 10xxxxxx 10xxxxxx
11111000 10000xxx 10xxxxxx 10xxxxxx 10xxxxxx
11111100 100000xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
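
The five patterns above can be turned into a direct test on the start byte and the first continuation byte. The following is only a sketch: the function name utf8_is_overlong is hypothetical, and it assumes the caller has already checked that the second byte really is a continuation byte (10xxxxxx).

```c
/* Returns 1 if start byte b0 and first continuation byte b1 match one
 * of the five overlong patterns, 0 otherwise.
 * Assumption: b1 has already been verified to be of the form 10xxxxxx. */
int utf8_is_overlong(unsigned char b0, unsigned char b1)
{
    if ((b0 & 0xfe) == 0xc0)               return 1; /* 1100000x          */
    if (b0 == 0xe0 && (b1 & 0xe0) == 0x80) return 1; /* 11100000 100xxxxx */
    if (b0 == 0xf0 && (b1 & 0xf0) == 0x80) return 1; /* 11110000 1000xxxx */
    if (b0 == 0xf8 && (b1 & 0xf8) == 0x80) return 1; /* 11111000 10000xxx */
    if (b0 == 0xfc && (b1 & 0xfc) == 0x80) return 1; /* 11111100 100000xx */
    return 0;
}
```

Only the first two bytes ever need to be inspected, because the remaining continuation bytes are unconstrained (10xxxxxx) in every pattern.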

For instance, in xterm, all I had to add were the two lines

  /* continuation byte 10xxxxxx found */
  if (screen->utf_char == 0 && ((c & 0x7f) >> (7 - screen->utf_count)) == 0) {
    screen->utf_char = UCS_REPL;
  }

and

  /* start byte 110xxxxx found */
  if ((c & 0x1e) == 0)
    screen->utf_char = UCS_REPL; /* overlong sequence */

at the right place to catch all overlong UTF-8 sequences and replace
them with the REPLACEMENT CHARACTER. (The second "if" checks the start
byte of a 2-byte sequence; the first "if" checks the first continuation
byte of any sequence. Here c is the input byte, screen->utf_char is the
UCS-2 word accumulated so far, and screen->utf_count is the expected
number of remaining continuation bytes, including the current one.)
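
To show where the two checks might sit, here is a self-contained sketch of a decoder built around them. It is not the xterm code: the function utf8_decode, its buffer-based interface, and the separate "overlong" flag are assumptions made for this illustration, and only the names utf_char, utf_count and UCS_REPL are borrowed from the fragments above. It handles the original 1- to 6-byte forms and does not attempt other validity checks such as rejecting surrogates.

```c
#include <stddef.h>

#define UCS_REPL 0xfffdUL  /* REPLACEMENT CHARACTER */

/* Decodes one character from p[0..n-1] (n >= 1), stores the number of
 * bytes consumed in *used, and returns the code point, or UCS_REPL for
 * malformed or overlong input. Hypothetical interface, not xterm's. */
unsigned long utf8_decode(const unsigned char *p, size_t n, size_t *used)
{
    unsigned long utf_char;  /* value accumulated so far */
    int utf_count;           /* continuation bytes still expected */
    int overlong = 0;        /* set when one of the two checks fires */
    size_t i;
    unsigned char c = p[0];

    *used = 1;
    if (c < 0x80) return c;                      /* plain ASCII */
    if (c < 0xc0 || c >= 0xfe) return UCS_REPL;  /* stray 10xxxxxx, FE, FF */

    /* start byte: work out how many continuation bytes must follow */
    if      (c < 0xe0) { utf_count = 1; utf_char = c & 0x1f; }
    else if (c < 0xf0) { utf_count = 2; utf_char = c & 0x0f; }
    else if (c < 0xf8) { utf_count = 3; utf_char = c & 0x07; }
    else if (c < 0xfc) { utf_count = 4; utf_char = c & 0x03; }
    else               { utf_count = 5; utf_char = c & 0x01; }

    /* start byte 110xxxxx found: 1100000x is always overlong */
    if (c < 0xe0 && (c & 0x1e) == 0)
        overlong = 1;

    for (i = 1; utf_count > 0; i++, utf_count--) {
        if (i >= n || (p[i] & 0xc0) != 0x80) {
            *used = i;               /* sequence cut short: keep next byte */
            return UCS_REPL;
        }
        c = p[i];
        /* continuation byte 10xxxxxx found: nothing but zero bits so far
         * and none in the top payload bits of this byte means overlong */
        if (utf_char == 0 && ((c & 0x7f) >> (7 - utf_count)) == 0)
            overlong = 1;
        utf_char = (utf_char << 6) | (c & 0x3f);
    }
    *used = i;
    return overlong ? UCS_REPL : utf_char;
}
```

The separate flag is a design choice of this sketch: it lets the loop keep consuming the remaining continuation bytes, so that an overlong sequence is replaced by a single REPLACEMENT CHARACTER and decoding resumes cleanly at the next start byte.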

Summary: Adding a safety check to a UTF-8 decoder such that ASCII
compatibility of the second kind is ensured is really trivial, and the
example code on the Unicode ftp site should definitely be corrected
accordingly.

Test your UTF-8 decoder with the attached file! It is very likely that
you will discover strange bugs this way. I haven't yet seen a UTF-8
decoder that was really correct the first time I tested it. Most I saw
treat malformed UTF-8 sequences very badly, xterm being a notable
exception. Netscape is one of the worst and does not even get past test
2.1.1.

The test file is also available as

http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

or in

http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts.tar.gz

in the examples/ directory. Both directories contain many more
interesting UTF-8 test files, especially for font proof-reading.

Happy decoder testing ...

Markus

--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>

Attachment: UTF-8-test.txt
Description: UTF-8-test.txt
