Re: javascript and unicode: msg#00680

text.unicode.general

Subject: Re: javascript and unicode

From: "Markus Scherer" <markus.scherer@xxxxxxxxx>
> Paul Hastings wrote:
> > would it be correct to say that javascript "natively" supports unicode?
>
> ECMAScript, of which JavaScript and JScript are implementations, is defined
> on 16-bit Unicode scripts and using 16-bit Unicode strings.
>
> In other words, the basic encoding support is there, but there are basically
> no Unicode-specific APIs in the standard. No character properties, no
> collation that is guaranteed to do more than strcmp, etc. Script writers have
> to rely on implementation-specific functions or supply their own.

It would be more correct to say that ECMAScript handles text using the UTF-16
encoding form on most platforms, and so can handle any Unicode character.
However, it's true that ECMAScript will allow you to create invalid Unicode
strings, as it allows you to create strings in which surrogate code units are
not paired.
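
As a rough illustration of that point, here is a minimal sketch that checks
whether the surrogates in a string are correctly paired. The helper name
isWellFormedUtf16 is hypothetical (it is not a function defined by the
standard discussed in this thread); only charCodeAt and length are assumed.

```javascript
// Hypothetical helper: returns true when every surrogate code unit in s
// belongs to a correctly ordered high/low pair.
function isWellFormedUtf16(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: the next code unit must be a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return false;
      i++; // skip the low surrogate that completes the pair
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate that was not preceded by a high surrogate.
      return false;
    }
  }
  return true;
}

const paired = "\uD834\uDD1E"; // U+1D11E written as a surrogate pair
const lone = "a\uD834b";       // a high surrogate with no partner

console.log(paired.length);             // 2 (code units, one code point)
console.log(isWellFormedUtf16(paired)); // true
console.log(isWellFormedUtf16(lone));   // false, yet the string is accepted
```

The language accepts both literals; telling them apart is left to script or
library code such as the check above.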

This says nothing about the internal encoding of strings within ECMAScript
engines: an engine could just as well use CESU-8 internally, but this internal
encoding will be hidden.

So the situation of ECMAScript is exactly similar to that of Java (in which the
builtin type "char" is an unsigned 16-bit integer, and the String type is
handled in terms of "char" code units with UTF-16). However, the serialization
of compiled Java classes internally encodes these strings with a modified form
of UTF-8, which is decoded to UTF-16 when the class is loaded.

You will have a similar situation on Windows with the Win32 API, and in its
C/C++ binding using TCHAR (and the _T() macro for string constants) with the
_UNICODE compile-time define, or on any system where the ANSI C type wchar_t
is defined as a 16-bit integer.

Note that we are speaking here about code units, not code points. Code units,
not code points, are what programming languages use to handle strings. As code
units are well defined in Unicode in relation to an encoding form, any
language or system can be made to fully support Unicode, if it also
provides library functions for string handling that implement the
Unicode-defined algorithms (described in terms of code points).
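
To make the distinction concrete, here is a small sketch of a library-style
function that counts code points while the string itself is stored as 16-bit
code units. The name countCodePoints is made up for this example, and counting
an unpaired surrogate as one code point is an assumption of the sketch.

```javascript
// Hypothetical helper: counts code points in a string of UTF-16 code units.
// An unpaired surrogate is counted as one code point (sketch assumption).
function countCodePoints(s) {
  let count = 0;
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    const isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    if (isHigh && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      // A high surrogate followed by a low surrogate is a single code point.
      if (next >= 0xDC00 && next <= 0xDFFF) i++;
    }
    count++;
  }
  return count;
}

const text = "A\uD834\uDD1EB"; // "A", U+1D11E, "B"

console.log(text.length);           // 4 code units
console.log(countCodePoints(text)); // 3 code points
```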

It's up to the library (not the language) to make its implementation of Unicode
with code units comply with the standard algorithms based on code points. Of
course it is much easier to implement these algorithms with 16-bit code units
than with 8-bit code units. But the language itself has no other special
characteristic with respect to Unicode compliance.
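
One place where this matters is ordering. The built-in relational operators
compare strings code unit by code unit, which does not give code point order
once supplementary characters are involved. Below is a sketch of a comparison
in code point order; compareByCodePoint is a hypothetical name, and the sketch
assumes an engine that provides String.prototype.codePointAt (ES2015 or
later).

```javascript
// Hypothetical helper: compares two strings in code point order.
// Returns -1, 0 or 1. Assumes codePointAt is available (ES2015+).
function compareByCodePoint(a, b) {
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    const ca = a.codePointAt(i);
    const cb = b.codePointAt(j);
    if (ca !== cb) return ca < cb ? -1 : 1;
    // Supplementary code points occupy two code units, others one.
    i += ca > 0xFFFF ? 2 : 1;
    j += cb > 0xFFFF ? 2 : 1;
  }
  if (i < a.length) return 1;  // a has code points left, so it sorts after b
  if (j < b.length) return -1; // b has code points left, so it sorts after a
  return 0;
}

const bmp = "\uFF5E";        // U+FF5E, a single code unit
const supp = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair

console.log(bmp < supp);                    // false: 0xFF5E > 0xD83D
console.log(compareByCodePoint(bmp, supp)); // -1: U+FF5E < U+1F600
```

This is only binary ordering, not collation; a real collation algorithm would
be a much larger piece of library code.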


