Carl Hume wrote:
On January 14, 2004 03:34 pm, Bob Foster wrote:
Of course not. The default encoding is UTF-8. An erroneous encoding
declaration won't make things better. I mean, specify the actual encoding.
I'm going to take a step back for a second, because I'm (obviously) confused
on several fronts.
When I receive the XML file, it currently specifies a UTF-8 encoding. Some
characters are invalid for UTF-8, so they are interpreted based on the locale
settings of the operating system. This interpretation can be avoided by
using the correct character set for the interpretation, and specifying it
instead of UTF-8.
Could someone point me to a reference / tutorial for deciding the appropriate
character set to use? Christopher mentioned Cp 1252 in an earlier email.
Would that be more appropriate?
Thanks for all the help.
And you seem to be getting a lot of it! ;-}
I don't know that any reference / tutorial is going to tell you what
character set is used to write a document, since that depends on the
program that writes it and the platform the program runs on. CP-1252 is
the default encoding for Windows US English systems, so if the program
is running on one of these and it doesn't do anything special about the
encoding, chances are that's what it is writing. But from this distance,
that is just speculation.
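
One thing you can check from your end is whether the bytes are valid UTF-8 at all. Here is a minimal sketch using the JDK's strict decoder. The class and method names are made up for illustration, the sample bytes are an assumed example rather than data from your file, and it assumes your JDK ships the windows-1252 charset. Keep in mind that a successful decode as windows-1252 proves very little, since almost any byte sequence decodes under a single-byte encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reports whether a byte array decodes cleanly in a charset.
public class EncodingCheck {

    // Returns true only if every byte sequence is valid in the given charset.
    static boolean decodesStrictly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed sample: "caf" followed by 0xE9, which is e-acute in windows-1252
        // but an incomplete multi-byte sequence in UTF-8.
        byte[] sample = {0x63, 0x61, 0x66, (byte) 0xE9};

        System.out.println("valid UTF-8?        "
                + decodesStrictly(sample, StandardCharsets.UTF_8));
        System.out.println("valid windows-1252? "
                + decodesStrictly(sample, Charset.forName("windows-1252")));
    }
}
```

If the UTF-8 check fails on your file, the declaration is wrong. Which encoding is right still has to come from knowing the program that wrote it.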
Anyway, the way to fix this problem is to fix the program that writes
the document so that the XML declaration it writes names the encoding the
document is actually written in. (If that sentence isn't too confusing.)
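
In case code is less confusing than that sentence, here is a rough sketch of what I mean, assuming the writing program is in Java. I have no idea what the real program looks like, so the class and method names are hypothetical. The point is only that the same Charset object drives both the bytes written and the name in the declaration, so the two cannot drift apart.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical writer: the declared encoding is derived from the charset
// actually used to encode the output.
public class DeclaredEncodingWriter {

    static byte[] writeXml(String body, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, charset)) {
            // charset.name() is the canonical name the JDK reports for this charset
            w.write("<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>\n");
            w.write(body);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Assumes this JDK provides windows-1252; only a handful of charsets
        // (UTF-8, ISO-8859-1, and a few others) are guaranteed on every platform.
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] bytes = writeXml("<name>Caf\u00e9</name>\n", cp1252);
        System.out.print(new String(bytes, cp1252));
    }
}
```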
BTW, my handy table says the IANA name for CP-1252 is "WINDOWS-1252"
while the Java name for it is "Cp1252". I would hope Xerces accepts
either one.
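
If you want to see how the JDK itself treats the two names, a small sketch like the following would do it (the class name is made up). It only exercises java.nio.charset.Charset and assumes the JDK provides this charset. It says nothing about which names Xerces accepts in an encoding declaration, so that part remains my hope rather than a fact.

```java
import java.nio.charset.Charset;

// Hypothetical check: do two encoding names resolve to the same JDK charset?
public class CharsetNames {

    static boolean sameCharset(String a, String b) {
        // Charset.forName is case-insensitive and accepts registered aliases;
        // it throws if the JDK does not know the name.
        return Charset.forName(a).equals(Charset.forName(b));
    }

    public static void main(String[] args) {
        Charset cs = Charset.forName("Cp1252");
        System.out.println("canonical name: " + cs.name());
        System.out.println("aliases:        " + cs.aliases());
        System.out.println("same charset:   " + sameCharset("Cp1252", "WINDOWS-1252"));
    }
}
```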
Bob Foster
http://xmlbuddy.com/