
RE: high value unicode characters: msg#00036

Subject: RE: high value unicode characters
Hello,

Any more advice or info on this one?

Can anyone say whether these characters ( � ) are added because the original data is not encoded correctly (missing the surrogate pairs or whatever) or whether this is a bug in the SAX2Print utility? It seems odd to me that these characters are added later on in the statement and not right next to the Unicode characters (𝖢𝖧) that are fouling up the process.

Thanks again,
josh


At 12:26 PM 4/8/2004 -0400, Joshua Santelli wrote:

Processing this file with the encoding as ASCII produces these strange characters (see the � characters below). Are these the "surrogate pairs"? I guess I need to look into this a little more.

Can anyone advise how to encode these characters, if the original is not encoded correctly? If SAX2Print puts these in during the first pass, why does it not like them after the first pass?


    # SAX2Print -v=always -x=ASCII test1.xml
    [ snip ]
Assuming 𝖢𝖧, Hindman [ht1]� �showed that the existence
    [ snip ]
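
For comparison, here is a minimal Python sketch of what pure-ASCII output with escape sequences could look like for this sentence. It uses Python's standard `xmlcharrefreplace` error handler, not Xerces, so it only illustrates the expected shape of the output; `to_ascii_xml` is a name made up for this example.

```python
def to_ascii_xml(text: str) -> bytes:
    """Encode text as ASCII, replacing every non-ASCII character
    with a decimal XML character reference (&#NNNN;)."""
    return text.encode("ascii", errors="xmlcharrefreplace")


sample = "Assuming \U0001D5A2\U0001D5A7, Hindman [ht1] showed"
print(to_ascii_xml(sample).decode("ascii"))
# prints: Assuming &#120226;&#120231;, Hindman [ht1] showed
```

The decimal references &#120226; and &#120231; are the same code points as &#x1D5A2; and &#x1D5A7;. Each character becomes exactly one reference, with no extra characters added elsewhere in the line.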


Thanks for the help,
josh


At 03:29 PM 4/7/2004 -0700, Christopher Ebert wrote:

        See:
http://uk.geocities.com/BabelStone1357/Unicode/surrogates.html

        Unicode values above U+FFFF, such as &#x1D5A7;, are encoded as
'surrogate pairs' in UTF-16 (Unicode has a slightly messy history,
consequently there's a slight messiness about it). There may be a bug in
SAX2Print's handling of these: I would suggest checking the file in other
ways: see if Xerces will parse it correctly; see if SAX2Print will output
it correctly encoded in UTF-16 or ASCII (which should encode anything
that's not ASCII with an escape sequence). There might be something wrong
with the document, but the error sounds like it's somewhere in the
handling of surrogate pairs.
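
As an illustration of the surrogate-pair arithmetic, a small Python sketch follows. It applies the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves); the function name `to_surrogates` is invented for this example.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Return the (leading, trailing) UTF-16 surrogates for a
    code point in the range U+10000..U+10FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    leading = 0xD800 + (offset >> 10)      # top 10 bits
    trailing = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return leading, trailing


lead, trail = to_surrogates(0x1D5A7)
print(f"U+1D5A7 -> {lead:04X} {trail:04X}")
# prints: U+1D5A7 -> D835 DDA7
```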

        Checking the bytes in memory is often useful.
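
One way to do that check outside the parser is a short Python sketch like the one below, which prints the bytes a correct encoder should produce for U+1D5A7 so they can be compared with what is actually in the file. The helper name `hex_bytes` is made up for this example.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of text as space-separated hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


ch = "\U0001D5A7"
print("UTF-8:   ", hex_bytes(ch, "utf-8"))      # F0 9D 96 A7
print("UTF-16BE:", hex_bytes(ch, "utf-16-be"))  # D8 35 DD A7
```

In correct UTF-8 the character is a single four-byte sequence; the surrogate values D835 and DDA7 should only ever appear in UTF-16.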

        Chris


-----Original Message-----
From: Joshua Santelli [mailto:js434@xxxxxxxxxxx]
Sent: Wednesday, April 07, 2004 14:00
To: xerces-j-user@xxxxxxxxxxxxxx
Subject: high value unicode characters


Hello,

We're using Xerces SAX2Print, version 2.5.0
(xerces-c_2_5_0-solaris_27-cc_62) and have run into a problem with a few
"high value" Unicode characters.  What we would like to do is validate the
file and convert it to UTF-8.  The SAX2Print process completes with no
error, but there appear to be some strange characters after the high value
Unicode characters (𝖢, 𝖧 and 𝒫) in the output.

     The command is: # SAX2Print -v=always -x=UTF-8 test1.xml

The error that I get using SAX2Print on the output XML file is:

     Fatal Error at file test1-out.xml, line 5, char 35
       Message: Got an unexpected trailing surrogate character
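
One hypothesis consistent with this message is that the output file contains surrogate code units that were each written out as their own UTF-8 sequence, rather than one four-byte sequence per character. Whether SAX2Print 2.5.0 actually does this is an assumption, not something confirmed here. A minimal Python sketch to test for it (the function name `find_encoded_surrogates` is invented for this example):

```python
def find_encoded_surrogates(data: bytes) -> list[tuple[int, str]]:
    """Return (character offset, hex value) for every surrogate code
    point found in UTF-8 data. Valid UTF-8 never contains any."""
    # 'surrogatepass' lets the decoder accept encoded surrogates
    # instead of raising, so they can be located afterwards.
    text = data.decode("utf-8", errors="surrogatepass")
    return [
        (i, hex(ord(ch)))
        for i, ch in enumerate(text)
        if 0xD800 <= ord(ch) <= 0xDFFF
    ]


good = "\U0001D5A7".encode("utf-8")    # F0 9D 96 A7
bad = b"\xed\xa0\xb5\xed\xb6\xa7"      # D835 and DDA7 encoded separately
print(find_encoded_surrogates(good))   # []
print(find_encoded_surrogates(bad))    # [(0, '0xd835'), (1, '0xdda7')]
```

To run it against the real output, read the file in binary mode and pass the bytes in, for example `find_encoded_surrogates(open("test1-out.xml", "rb").read())`. Bytes that are invalid for other reasons will still raise `UnicodeDecodeError`.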


Any idea what is going wrong here?

Thanks in advance,
josh


=========================
<?xml version="1.0"?>
<!DOCTYPE test SYSTEM "test.dtd">
<test>
         <testPara>
                 <head>1. high value Unicode characters and some punctuation as entities</head>
                 <p>Assuming &#x1D5A2;&#x1D5A7;, Hindman [ht1] showed that
the existence of certain ultrafilters on the power set of the natural
numbers is equivalent to Hindman&#x2019;s Theorem.  Adapting this work to a
countable setting formalized in RCA<sub>0</sub>, this article proves the
equivalence of the existence of certain ultrafilters on countable Boolean
algebras and an iterated form of Hindman&#x2019;s Theorem, which is closely
related to Milliken&#x2019;s Theorem.</p>
         </testPara>
         <testPara>
                 <head>2. high value Unicode char and some Greek as entities</head>
                 <p>This article is a continuation of our search for
tautologies that are hard even for strong propositional proof systems like
EF, cf. [Kra-wphp,Kra-tau].  The particular tautologies we study, the
&#x03C4;-formulas, are obtained from any &#x1D4AB;/poly map g; they express
that a string is outside of the range of g. Maps g considered here are
particular pseudorandom generators. The ultimate goal is to deduce the
hardness of the &#x03C4;-formulas for at least EF from some general,
plausible computational hardness hypothesis.</p>
         </testPara>
</test>
=========================
<!ELEMENT test (testPara+) >
<!ELEMENT testPara (head, p) >
<!ELEMENT head (#PCDATA) >
<!ELEMENT p (#PCDATA | b | i | sub)* >
<!ELEMENT b (#PCDATA) >
<!ELEMENT i (#PCDATA) >
<!ELEMENT sub (#PCDATA) >
=========================


