RE: high value unicode characters: msg#00036
Subject: RE: high value unicode characters
Hello,
Any more advice or info on this one?
Can anyone say whether these characters ( � ) are added because the
original data is not encoded correctly (missing the surrogate pairs or
whatever) or whether this is a bug in the SAX2Print utility? It seems odd
to me that these characters are added later on in the statement and not
right next to the Unicode characters (𝖢𝖧) that are fouling up the
process.
Thanks again,
josh
At 12:26 PM 4/8/2004 -0400, Joshua Santelli wrote:
Processing this file with the encoding as ASCII produces these strange
characters (see the � characters below). Are these the "surrogate
pairs"? I guess I need to look into this a little more.
Can anyone advise how to encode these characters, if the original is not
encoded correctly? If SAX2Print puts these in during the first pass, why
does it not like them after the first pass?
# SAX2Print -v=always -x=ASCII test1.xml
[ snip ]
Assuming 𝖢𝖧, Hindman [ht1]� �showed
that the existence
[ snip ]
Thanks for the help,
josh
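
For comparison, here is a minimal sketch of what escaped ASCII output would be expected to look like, assuming the "escape sequence" in question is an XML numeric character reference. It uses Python's standard `xmlcharrefreplace` error handler rather than anything from SAX2Print, and the helper name `to_ascii_xml` is hypothetical. The code points U+1D5A2 and U+1D5A7 are assumed from how the two characters render in this thread.

```python
def to_ascii_xml(text):
    """Encode text as ASCII, replacing every non-ASCII character
    with a decimal XML numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# Assumed code points for the two sans-serif letters in the sample text.
sample = "Assuming " + chr(0x1D5A2) + chr(0x1D5A7) + ", Hindman"
print(to_ascii_xml(sample))  # prints: Assuming &#120226;&#120231;, Hindman
```

Each character above U+FFFF should come out as a single reference to the full code point, with no stray characters elsewhere in the line.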
At 03:29 PM 4/7/2004 -0700, Christopher Ebert wrote:
See:
http://uk.geocities.com/BabelStone1357/Unicode/surrogates.html
Unicode values above U+FFFF, such as 𝖧, are encoded in UTF-16 as
'surrogate pairs' (Unicode has a slightly messy history, consequently
there's a slight messiness about it). There may be a bug in SAX2Print's
handling of these. I would suggest checking the file in other ways: see
if Xerces will parse it correctly; see if SAX2Print will output it
correctly encoded in UTF-16 or ASCII (which should encode anything that's
not ASCII with an escape sequence). There might be something wrong with
the document, but the error sounds like it's somewhere in the handling of
surrogate pairs.
Checking the bytes in memory is often useful.
Chris
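
To make the byte check concrete, here is a minimal sketch of how a code point above U+FFFF maps to a UTF-16 surrogate pair, and what its UTF-8 bytes look like. The function name `to_surrogate_pair` is hypothetical, and the three code points (U+1D5A2, U+1D5A7, U+1D4AB) are assumed from how the characters render in this thread. This is the standard UTF-16 arithmetic, not Xerces code.

```python
def to_surrogate_pair(code_point):
    """Return the (leading, trailing) UTF-16 surrogates for a
    supplementary-plane code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000           # 20-bit value
    leading = 0xD800 + (offset >> 10)       # top 10 bits
    trailing = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return leading, trailing


# Assumed code points for the three characters mentioned in the thread.
for cp in (0x1D5A2, 0x1D5A7, 0x1D4AB):
    leading, trailing = to_surrogate_pair(cp)
    utf8_hex = chr(cp).encode("utf-8").hex()
    print(f"U+{cp:05X}  UTF-16: {leading:04X} {trailing:04X}  UTF-8: {utf8_hex}")
```

Comparing these values against a hex dump of the input and output files shows whether the surrogates or UTF-8 bytes were already wrong in the source or were damaged on output.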
-----Original Message-----
From: Joshua Santelli [mailto:js434@xxxxxxxxxxx]
Sent: Wednesday, April 07, 2004 14:00
To: xerces-j-user@xxxxxxxxxxxxxx
Subject: high value unicode characters
Hello,
We're using Xerces SAX2Print, version 2.5.0
(xerces-c_2_5_0-solaris_27-cc_62) and have run into a problem with a few
"high value" unicode characters. What we would like to do is validate the
file and convert it to UTF-8. The SAX2Print process completes with no
error but there appear to be some strange characters after the high value
unicode characters (𝖢, 𝖧 and 𝒫) in the output.
The command is: # SAX2Print -v=always -x=UTF-8 test1.xml
The error that I get using SAX2Print on the output XML file is:
Fatal Error at file test1-out.xml, line 5, char 35
Message: Got an unexpected trailing surrogate character
Any idea what is going wrong here?
Thanks in advance,
josh
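
As an illustration of what a message like "unexpected trailing surrogate" means, here is a minimal sketch that scans a sequence of UTF-16 code units for surrogates that are not part of a valid leading/trailing pair. The function name `find_unpaired_surrogates` and the message strings are hypothetical. This is not the check Xerces performs internally, only the general rule that a trailing surrogate (DC00 to DFFF) must immediately follow a leading one (D800 to DBFF).

```python
def find_unpaired_surrogates(units):
    """Return (index, description) for every surrogate code unit
    that is not part of a valid leading/trailing pair."""
    problems = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A leading surrogate must be followed by a trailing one.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            problems.append((i, "lone leading surrogate"))
        elif 0xDC00 <= unit <= 0xDFFF:
            # A trailing surrogate with no leading surrogate before it.
            problems.append((i, "unexpected trailing surrogate"))
        i += 1
    return problems


well_formed = [0x41, 0xD835, 0xDDA2, 0x42]   # "A", one valid pair, "B"
damaged = [0xD835, 0xDDA2, 0xDDA2]           # trailing surrogate repeated
print(find_unpaired_surrogates(well_formed))
print(find_unpaired_surrogates(damaged))
```

If the output file contains a trailing surrogate on its own, or surrogate values written out as separate characters, a parser would be expected to reject it in this way.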
=========================
<?xml version="1.0"?>
<!DOCTYPE test SYSTEM "test.dtd">
<test>
<testPara>
<head>1. high value Unicode characters and some
punctuation as entities</head>
<p>Assuming 𝖢𝖧, Hindman [ht1] showed
that
the existence of certain ultrafilters on the power set of the natural
numbers is equivalent to Hindman’s Theorem. Adapting this work
to a
countable setting formalized in RCA<sub>0</sub>, this article proves the
equivalence of the existence of certain ultrafilters on countable
Boolean
algebras and an iterated form of Hindman’s Theorem, which is
closely
related to Milliken’s Theorem.</p>
</testPara>
<testPara>
<head>2. high value Unicode char and some Greek as
entities</head>
<p>This article is a continuation of our search for
tautologies that are hard even for strong propositional proof systems
like
EF, cf. [Kra-wphp,Kra-tau]. The particular tautologies we study, the
τ-formulas, are obtained from any 𝒫/poly map g; they
express
that a string is outside of the range of g. Maps g considered here are
particular pseudorandom generators. The ultimate goal is to deduce the
hardness of the τ-formulas for at least EF from some general,
plausible computational hardness hypothesis.</p>
</testPara>
</test>
=========================
<!ELEMENT test (testPara+) >
<!ELEMENT testPara (head, p) >
<!ELEMENT head (#PCDATA) >
<!ELEMENT p (#PCDATA | b | i | sub)* >
<!ELEMENT b (#PCDATA) >
<!ELEMENT i (#PCDATA) >
<!ELEMENT sub (#PCDATA) >
=========================