Hi Peter,
Thank you for the response. I have a specific problem and will try to
be clearer this time.
What I can see in memory (I use the C/C++ version of the parser in
Visual Studio) is that the XML buffer is represented as one byte per
character.
For example, the space character is represented as 32 (decimal), and so on.
Now this buffer also contains the character ¤, which has the Latin-1 code
164 (decimal) or A4 (hex).
When the parser starts parsing this buffer to build a DOM tree, it
replaces ¤ (code 164, represented as one byte) with two bytes (which,
when displayed as Latin-1, are the characters Â¤). I think this is a
UTF-8 representation, where two bytes are used to represent ¤, whereas
in Latin-1 it would be represented with just one byte.
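To make the byte values concrete, here is a minimal sketch of the re-encoding I believe is taking place, assuming the input buffer is Latin-1. The name `latin1_to_utf8` is a hypothetical helper for illustration only, not a function of the parser.

```cpp
#include <string>

// Hypothetical helper (not part of any parser API): re-encode a
// Latin-1 byte string as UTF-8. Assumes every input byte is one
// Latin-1 character.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);
    for (unsigned char c : in) {
        if (c < 0x80) {
            // 0x00..0x7F: identical single byte in both encodings.
            out.push_back(static_cast<char>(c));
        } else {
            // 0x80..0xFF: two bytes, 110000xx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

For the byte A4 (hex) this yields the two bytes C2 A4, and a Latin-1 viewer shows C2 as Â, which would explain the extra character.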
So when I get the XML back from the DOM, expecting Latin-1
encoding, I get Â¤ wherever ¤ was expected.
So is it the normal behaviour of an XML parser to store XML internally in
UTF-8, and is it then the responsibility of the application using the
parser to convert this UTF-8 XML to whatever encoding it
wants?
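If the conversion is indeed left to the application, I imagine it would look roughly like the sketch below. It assumes the parser hands back well-formed UTF-8 and that only characters up to U+00FF need to survive. `utf8_to_latin1` is a hypothetical name, not something the parser provides.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of any parser API): convert UTF-8 to
// Latin-1. Throws if the input is malformed or contains a character
// above U+00FF, which Latin-1 cannot represent.
std::string utf8_to_latin1(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            // Plain ASCII byte: copy unchanged.
            out.push_back(static_cast<char>(c));
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < in.size()) {
            // Lead bytes C2/C3 cover U+0080..U+00FF.
            unsigned char next = static_cast<unsigned char>(in[i + 1]);
            if ((next & 0xC0) != 0x80) {
                throw std::runtime_error("malformed UTF-8 sequence");
            }
            out.push_back(static_cast<char>(((c & 0x03) << 6) | (next & 0x3F)));
            ++i;  // skip the continuation byte just consumed
        } else {
            throw std::runtime_error("malformed or not representable in Latin-1");
        }
    }
    return out;
}
```

With this, the two bytes C2 A4 coming out of the DOM would collapse back to the single byte A4.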
Looking forward to your comments.
Thanks
Vinu