
RE: RFC: REST-RPC: msg#00012

Subject: RE: RFC: REST-RPC
> ...
> > > My recommendation to implementers is generally to emit XML with no
> > > encoding declaration and to use numeric character references for
> > > those Unicode characters with code points > 0X7F. This seems to
> > > maximise the interoperability.
> >
> > This is the best solution I could come up with, too.
> > The only drawback is that there is no guarantee that the characters
> > sent will be 'understood' successfully by the receiver (e.g. when
> > sending a Japanese Unicode char to some program that uses iso-8859
> > internally).
>
> Yes but the advantage is that if they are not understood the other  
> end will break. Other options run the risk of the other end silently  
> accepting garbled data.

We agree on the best solution, but imho its advantage is exactly the opposite:

When the character sent is not supported in the native charset of the receiving
end:
- if the other end does not decode XML entity references, it will pass to the
xmlrpc toolkit/app strings with &#233; instead of é, which is garbled - but
valid - data
- if the other end does decode the entity reference and finds that the character
does not exist in the toolkit/app charset, it will usually pass on some
weird-looking character (an empty square, or a question mark in a diamond :)
that stands for 'a character not supported in your charset'

Otoh, if an XML parser finds raw (non-entity-encoded) UTF-8 characters inside a
document that it assumes to be ASCII or ISO-8859, the parser will blow up with
an invalid-XML error.

Note that this was my experience with PHP 4 (expat-based) and PHP 5
(libxml-based); Java- or Perl-based libraries might behave differently.
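
To illustrate the sending side discussed above, here is a minimal sketch (not the library's actual code) that emits pure-ASCII XML text, using numeric character references for every code point above 0x7F. Both function names are made up for this example, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: decode one UTF-8 encoded character to its code point.
function utf8_char_to_codepoint($char)
{
    $bytes = array_values(unpack('C*', $char));
    $count = count($bytes);
    if ($count === 1) {
        return $bytes[0];
    }
    // The lead byte of an N-byte sequence carries (7 - N) payload bits.
    $codepoint = $bytes[0] & (0xFF >> ($count + 1));
    for ($i = 1; $i < $count; $i++) {
        // Each continuation byte carries 6 payload bits.
        $codepoint = ($codepoint << 6) | ($bytes[$i] & 0x3F);
    }
    return $codepoint;
}

// Hypothetical helper: escape XML markup characters first, then replace
// every non-ASCII character with a decimal numeric character reference.
function xml_escape_to_ascii($utf8)
{
    $escaped = htmlspecialchars($utf8, ENT_NOQUOTES, 'UTF-8');
    return preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function ($m) {
            return '&#' . utf8_char_to_codepoint($m[0]) . ';';
        },
        $escaped
    );
}

// "caf\xC3\xA9" is "café" written as raw UTF-8 bytes.
echo xml_escape_to_ascii("caf\xC3\xA9 <b>"), "\n"; // prints: caf&#233; &lt;b&gt;
```

Since the output contains only ASCII bytes, it stays well-formed whatever encoding the receiver assumes, as long as that encoding is ASCII-compatible.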

> ...
> That looks a pretty good way of working

Thanks!

>
> > for receiving:
> >
> > 'guesstimate' the charset of the incoming payload by looking at the
> > HTTP Content-Type and the xml prologue. If none is found assume UTF-8 by
> > default, because, despite what the specs say, it is what most
> > implementations out there will be sending (but the user can
> > configure it to use iso-8859-1 as default if he prefers)
> 
> I presume you use the heuristics described in the XML spec to guess  
> the encoding of the document with sufficient precision to read the  
> xml declaration (i.e. you know that the first byte is either part of  
> the BOM or all or part of the encoding of '<')

I tried my best to follow that and the advice found in 
http://www.yale.edu/pclt/encoding/, BOM and all.

Unfortunately I am not 100% confident in the results, since PHP seems to use 
different ereg engines on unix and windows and some users reported strange 
results...
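
For what it's worth, the BOM-less part of those heuristics (Appendix F of the XML 1.0 spec) boils down to a lookup on the first four bytes of the document. A minimal sketch follows; the function name and the returned labels are made up for illustration, and this is not what the library currently does:

```php
<?php
// Hypothetical helper: guess the encoding family of a BOM-less XML document
// from its first four bytes, which are expected to encode the start of the
// XML declaration. Returns null when no known pattern matches.
function sniff_xml_encoding_family($xmlchunk)
{
    $patterns = array(
        "\x00\x00\x00\x3C" => 'UCS-4BE',          // '<' as a big-endian 32-bit unit
        "\x3C\x00\x00\x00" => 'UCS-4LE',          // '<' as a little-endian 32-bit unit
        "\x00\x3C\x00\x3F" => 'UTF-16BE',         // '<?' as big-endian 16-bit units
        "\x3C\x00\x3F\x00" => 'UTF-16LE',         // '<?' as little-endian 16-bit units
        "\x3C\x3F\x78\x6D" => 'ASCII-COMPATIBLE', // '<?xm' in UTF-8, ISO-8859-x, ...
        "\x4C\x6F\xA7\x94" => 'EBCDIC',           // '<?xm' in some EBCDIC flavour
    );
    $start = (string) substr($xmlchunk, 0, 4);
    return isset($patterns[$start]) ? $patterns[$start] : null;
}

echo sniff_xml_encoding_family("<\x00?\x00x\x00m\x00l\x00"), "\n"; // prints: UTF-16LE
```

For the 'ASCII-COMPATIBLE' and 'EBCDIC' cases this only tells you how to read the XML declaration; the actual charset still has to come from its encoding attribute.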

Bye
Gaetano

ps: here's da code - trying to figure out the charset of the incoming message

        /**
        * XML charset encoding guessing helper function.
        * Tries to determine the charset encoding of an XML chunk
        * received over HTTP.
        *
        * NB: according to the spec (RFC 3023), if a text/xml content-type is
        * received over HTTP without a charset parameter, we SHOULD assume it
        * is strictly US-ASCII. But we try to be more tolerant of non-conforming
        * (legacy?) clients/servers, which will most probably be using UTF-8
        * anyway...
        *
        * @param string $httpheader the HTTP Content-Type header
        * @param string $xmlchunk XML content buffer
        * @param string $encoding_prefs comma-separated list of character
        *        encodings to be used as default (when the mb extension is enabled)
        *
        * @todo explore usage of mb_http_input(): does it detect http headers +
        *       post data? if so, use it instead of hand-detection!!!
        */
        function guess_encoding($httpheader='', $xmlchunk='', $encoding_prefs=null)
        {
                // discussion: see http://www.yale.edu/pclt/encoding/
                // 1 - test if encoding is specified in HTTP HEADERS

                // Details:
                // LWS:           (\13\10)?( |\t)+
                // token:         (any char but excluded stuff)+
                // header:        Content-type = ...; charset=value(; ...)*
                //   where value is of type token, no LWS allowed between 'charset' and value
                // Note: we do not check for invalid chars in VALUE:
                //   this had better be done using a pure regexp as below

                /// @todo this test will pass if ANY header has a charset specification, not only Content-Type. Fix it?
                if(preg_match("/;((\\xD\\xA)?[ \\x9]+)*charset=([^;]*)/i", $httpheader, $matches))
                {
                        // the match is case-insensitive, so an uppercase CHARSET= works too;
                        // the value runs up to the next ';' or the end of the header
                        return strtoupper(trim($matches[3]));
                }

                // 2 - scan the first bytes of the data for a UTF-16 (or other) BOM pattern
                //     (source: http://www.w3.org/TR/2000/REC-xml-20001006)
                //     NOTE: actually, according to the spec, even if we find the BOM and determine
                //     an encoding, we should check if there is an encoding specified
                //     in the xml declaration, and verify if they match.
                /// @todo implement check as described above?
                /// @todo implement check for first bytes of string even without a BOM? (It sure looks harder than for cases WITH a BOM)
                if(preg_match("/^(\\x00\\x00\\xFE\\xFF|\\xFF\\xFE\\x00\\x00|\\x00\\x00\\xFF\\xFE|\\xFE\\xFF\\x00\\x00)/", $xmlchunk))
                {
                        return 'UCS-4';
                }
                elseif(preg_match("/^(\\xFE\\xFF|\\xFF\\xFE)/", $xmlchunk))
                {
                        return 'UTF-16';
                }
                elseif(preg_match("/^(\\xEF\\xBB\\xBF)/", $xmlchunk))
                {
                        return 'UTF-8';
                }

                // 3 - test if encoding is specified in the xml declaration
                // Details:
                // SPACE:         (#x20 | #x9 | #xD | #xA)+ === [ \x9\xD\xA]+
                // EQ:            SPACE?=SPACE? === [ \x9\xD\xA]*=[ \x9\xD\xA]*
                if (preg_match("/^<\?xml".
                        "[ \\x9\\xD\\xA]+" . "version"  . "[ \\x9\\xD\\xA]*=[ \\x9\\xD\\xA]*" . "((\"[a-zA-Z0-9_.:-]+\")|('[a-zA-Z0-9_.:-]+'))".
                        "[ \\x9\\xD\\xA]+" . "encoding" . "[ \\x9\\xD\\xA]*=[ \\x9\\xD\\xA]*" . "((\"[A-Za-z][A-Za-z0-9._-]*\")|('[A-Za-z][A-Za-z0-9._-]*'))/",
                        $xmlchunk, $regs))
                {
                        // $regs[4] holds the encoding name including its quotes: strip them
                        return strtoupper(substr($regs[4], 1, strlen($regs[4])-2));
                }

                // 4 - if mbstring is available, let it do the guesswork
                // NB: we favour finding an encoding that is compatible with what we can process
                if(extension_loaded('mbstring'))
                {
                        if($encoding_prefs)
                        {
                                $enc = mb_detect_encoding($xmlchunk, $encoding_prefs);
                        }
                        else
                        {
                                $enc = mb_detect_encoding($xmlchunk);
                        }
                        // NB: mb_detect likes to call it ascii, xml parser likes to call it US_ASCII...
                        // IANA also likes better US-ASCII, so go with it
                        if($enc == 'ASCII')
                        {
                                $enc = 'US-'.$enc;
                        }
                        return $enc;
                }
                else
                {
                        // no encoding specified: as per HTTP 1.1 assume it is iso-8859-1?
                        // Both RFC 2616 (HTTP 1.1) and RFC 1945 (HTTP 1.0) clearly state that for text/xxx content types
                        // this should be the standard. And we should be getting text/xml as request and response.
                        // BUT we have to be backward compatible with the lib, which always used UTF-8 as default...
                        return $GLOBALS['xmlrpc_defencoding'];
                }
        }
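
Regarding the @todo in step 1 (the test passes if ANY header carries a charset), one possible fix is to look at the Content-Type header only. The sketch below is just an idea, not tested against the library: the function name is hypothetical, and $headers is assumed to be an array of raw "Name: value" header lines. It also accepts a quoted charset value, which the code above does not.

```php
<?php
// Hypothetical helper: return the upper-cased charset found in the
// Content-Type header, or null if that header or its charset is missing.
function content_type_charset($headers)
{
    foreach ($headers as $line) {
        // Header names are case-insensitive; skip everything but Content-Type.
        if (!preg_match('/^Content-Type[ \t]*:(.*)$/is', $line, $m)) {
            continue;
        }
        // Accept both charset=token and charset="quoted-string".
        if (preg_match('/;[ \t\r\n]*charset=("?)([^";]+)\1/i', $m[1], $c)) {
            return strtoupper(trim($c[2]));
        }
        return null;
    }
    return null;
}

$headers = array(
    'X-Other: a; charset=koi8-r',
    'Content-Type: text/xml; CHARSET="utf-8"',
);
echo content_type_charset($headers), "\n"; // prints: UTF-8
```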



 