
Re: [OT] Character Sets (Was Re: ContentHander characters discrepancy): msg#00092

Carl Hume wrote:

On January 14, 2004 03:34 pm, Bob Foster wrote:

Of course not. The default encoding is UTF-8. An erroneous encoding
declaration won't make things better. I mean, specify the actual encoding.


I'm going to take a step back for a second, because I'm (obviously) confused on several fronts.

When I receive the XML file, it currently declares a UTF-8 encoding. Some of the bytes in it are not valid UTF-8, so they seem to be interpreted according to the locale settings of the operating system. As I understand it, this can be avoided by working out which character set the file is actually written in and declaring that instead of UTF-8.

Could someone point me to a reference / tutorial for deciding the appropriate character set to use? Christopher mentioned Cp1252 in an earlier email. Would that be more appropriate?

Thanks for all the help.

And you seem to be getting a lot of it! ;-}

I don't know that any reference / tutorial is going to tell you what character set was used to write a document, since that depends on the program that writes it and the platform the program runs on. CP-1252 is the default encoding on US English Windows systems, so if the program is running on one of these and doesn't do anything special about the encoding, chances are that's what it is writing. But from this distance, that is just speculation.
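
One way to test that speculation is to look at the raw bytes. The sketch below is only an illustration: the class name, the method names and the sample bytes are made up, and it assumes the JDK in use ships the windows-1252 charset. It checks whether a byte sequence is well-formed UTF-8 and, if not, shows what it looks like when decoded as CP-1252.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingGuess {
    /** Returns true if the bytes form well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Decodes the bytes as windows-1252 (assumed to be available in this JDK). */
    static String decodeCp1252(byte[] bytes) {
        return new String(bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // Hypothetical sample: "it's" written with a curly apostrophe,
        // which CP-1252 stores as the single byte 0x92.
        byte[] sample = { 'i', 't', (byte) 0x92, 's' };
        System.out.println("valid UTF-8? " + isValidUtf8(sample));
        System.out.println("as CP-1252:  " + decodeCp1252(sample));
    }
}
```

A file that fails the UTF-8 check but reads sensibly as CP-1252 is a hint, not proof: several single-byte encodings will decode the same bytes without complaint.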

Anyway, the way to fix this problem is to fix the program that writes the document so that the XML declaration it writes names the encoding the document is actually written in. (If that sentence isn't too confusing.)
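
In Java terms, that fix might look roughly like the sketch below. I don't know what the writing program actually looks like, so the class and method names are hypothetical; the point is only that one encoding name drives both the declaration text and the byte encoding, so the two cannot drift apart. It assumes the body passed in is already well-formed XML and that the name is one the JDK recognises.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

public class DeclaredEncodingWriter {
    /**
     * Serializes an XML body with a declaration that names the same
     * encoding used to turn the characters into bytes.
     */
    static byte[] write(String xmlBody, String encodingName) {
        Charset charset = Charset.forName(encodingName);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, charset)) {
            // The declaration and the writer both come from encodingName.
            w.write("<?xml version=\"1.0\" encoding=\"" + encodingName + "\"?>\n");
            w.write(xmlBody);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = write("<note>it\u2019s fine</note>", "windows-1252");
        System.out.println(bytes.length + " bytes written");
    }
}
```

Real code would normally let an XML serializer emit the declaration instead of concatenating strings, but the same rule applies: set the output encoding in one place.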

BTW, my handy table says the IANA name for CP-1252 is "WINDOWS-1252" while the Java name for it is "Cp1252". I would hope Xerces accepts either one.
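
For what it's worth, the two names can be compared on the Java side with `java.nio.charset.Charset`. This is a small sketch with a made-up class name; it only says what the JDK's charset lookup does with the names, and says nothing about which encoding names Xerces accepts in an XML declaration.

```java
import java.nio.charset.Charset;

public class CharsetNames {
    /** Returns true if both names resolve to the same JDK charset. */
    static boolean sameCharset(String a, String b) {
        return Charset.forName(a).equals(Charset.forName(b));
    }

    public static void main(String[] args) {
        // Charset.forName is case-insensitive and also accepts aliases.
        System.out.println(sameCharset("WINDOWS-1252", "Cp1252"));
        // Print the canonical name and the aliases this JDK knows about.
        Charset cs = Charset.forName("Cp1252");
        System.out.println(cs.name() + " " + cs.aliases());
    }
}
```

If a name is unknown to the JDK, `Charset.forName` throws `UnsupportedCharsetException`, so a quick run of this also tells you whether the charset is installed at all.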

Bob Foster
http://xmlbuddy.com/

