Encoding detection happens when the document is opened; after that, a
conversion error may surface as a well-formedness error, but it can no
longer be identified as a charset problem.
Most likely the parser isn't detecting the non-UTF-8 characters because
Java's converter isn't reporting them. I have seen it mentioned that you
can ask Java's encoding converters to throw if they encounter invalid
character sequences. Does anyone know if this is true? And if so, why
doesn't Xerces do it?
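
For what it's worth, here is a minimal sketch of what "asking the converter
to throw" could look like with the java.nio.charset API. It assumes a JDK
recent enough to have StandardCharsets; the class and method names
(StrictUtf8Check, isValidUtf8) are made up for illustration, and nothing
here is taken from Xerces itself.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Xerces or the JDK.
public class StrictUtf8Check {

    // Returns true only if the bytes decode as UTF-8 without any error.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting
        // a replacement character for malformed input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "für" correctly encoded as UTF-8 (ü becomes the two bytes C3 BC).
        byte[] good = "f\u00fcr".getBytes(StandardCharsets.UTF_8);
        // "für" with ü as the single byte FC, which is not valid UTF-8.
        byte[] bad = {0x66, (byte) 0xFC, 0x72};
        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));
    }
}
```

A decoder configured this way can also be handed to
new InputStreamReader(in, decoder), in which case reads fail with an
IOException on the first malformed sequence. Whether Xerces could use this
internally is a separate question that this sketch does not answer.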
Bob
DeSmet_Ringo@xxxxxxx wrote:
Maybe because the bad character is in a comment. I suspect the parser
skips everything until the closing comment delimiter. What happens when
the bad character is in an attribute value, for example?
Ringo
-----Original Message-----
From: Berchner Matthias ICM Berlin
[mailto:matthias.berchner@xxxxxxxxxxx]
Sent: Friday, 20 February 2004 15:15
To: 'xerces-j-user@xxxxxxxxxxxxxx'
Subject: UTF-8 encoding errors are not always detected
Hi,
I'm using Xerces 1.4.2. Unfortunately, UTF-8 encoding errors are not always
detected:
Example:
--------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<Project>
<!-- für ONC -->
</Project>
--------------------------------------------
<!-- für ONC --> corresponds to
hex 3C 21 2D 2D 20 66 FC 72 20 4F 4E 43 20 2D 2D 3E
Non-UTF-8 character: ü <-> FC
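
The byte dump above can be reproduced in a few lines of Java. This is only
an illustrative sketch (the class name CommentBytesDemo and its methods are
invented here): the same 16 bytes read cleanly as ISO-8859-1, while a
lenient UTF-8 decode replaces the lone FC byte with U+FFFD instead of
raising an error.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the bytes are copied from the hex dump above.
public class CommentBytesDemo {

    static final byte[] COMMENT = {
        0x3C, 0x21, 0x2D, 0x2D, 0x20, 0x66, (byte) 0xFC, 0x72,
        0x20, 0x4F, 0x4E, 0x43, 0x20, 0x2D, 0x2D, 0x3E
    };

    // In ISO-8859-1 every byte maps to one character, so FC is read as ü.
    static String asLatin1() {
        return new String(COMMENT, StandardCharsets.ISO_8859_1);
    }

    // new String(bytes, charset) never throws; malformed input is
    // replaced with U+FFFD, which hides the encoding error.
    static String asLenientUtf8() {
        return new String(COMMENT, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(asLatin1());
        System.out.println(asLenientUtf8());
        // Index of the replacement character in the leniently decoded text.
        System.out.println(asLenientUtf8().indexOf('\uFFFD'));
    }
}
```

A correctly encoded UTF-8 file would carry ü as the two bytes C3 BC, so the
file shown was most likely saved as ISO-8859-1 (or a similar single-byte
encoding) while declaring UTF-8.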
Kind Regards,
Matthias