> ...
> > > My recommendation to implementers is generally to emit XML with no
> > > encoding declaration and to use numeric character references for
> > > those Unicode characters with code points > 0X7F. This seems to
> > > maximise the interoperability.
> >
> > This is the best solution I could come up with, too.
> > The only drawback is that there is no guarantee that the characters
> > sent will be 'understood' successfully by the receiver (e.g. when
> > sending a Japanese Unicode char to some program that uses ISO-8859
> > internally).
>
> Yes but the advantage is that if they are not understood the other
> end will break. Other options run the risk of the other end silently
> accepting garbled data.
While I agree that this is the best solution, its advantage is exactly the
opposite, imho. When the character sent is not supported in the native charset
of the receiving end:
- if the other end does not decode XML character references, it will pass to
the xmlrpc toolkit/app strings containing &#233; instead of é, which is
garbled, but valid, data
- if the other end does decode the character reference and finds that it does
not exist in the toolkit/app charset, it will usually pass along some
weird-looking placeholder (an empty square or a question mark in a diamond :)
that stands for 'a character not supported in your charset'
Otoh, if an XML parser finds raw (not entity-encoded) UTF-8 characters
inside an XML document that it assumes to be ASCII or ISO-8859,
the parser will blow up with an invalid-XML error.
Note that this was my experience with PHP 4 (expat based) and 5 (libxml based);
Java/Perl based libraries might behave differently.
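To make the sending side of that recommendation concrete, here is a rough
sketch of escaping a string so that only ASCII goes on the wire. It is an
illustration only, not the code of any existing library: the function name
xml_escape_to_ascii() is made up, and it assumes PHP 7 or later and input
that is already valid UTF-8.

```php
<?php
// Hypothetical helper (not part of any library): escape a UTF-8 string so
// that it can be emitted as pure ASCII inside an XML document.
function xml_escape_to_ascii(string $utf8): string
{
    // Assumption: the input is valid UTF-8; bail out early if it is not.
    if (preg_match('//u', $utf8) !== 1) {
        throw new InvalidArgumentException('input is not valid UTF-8');
    }
    // Escape the markup-significant ASCII characters first ('&' must go first).
    $escaped = str_replace(
        array('&', '<', '>', '"'),
        array('&amp;', '&lt;', '&gt;', '&quot;'),
        $utf8
    );
    // Then replace every code point above 0x7F with a numeric character reference.
    return preg_replace_callback('/[^\x00-\x7F]/u', function ($m) {
        $bytes = array_values(unpack('C*', $m[0]));
        $n = count($bytes);
        // Strip the length-marker bits from the lead byte...
        $cp = $bytes[0] & (0xFF >> ($n + 1));
        // ...and append the 6 payload bits of each continuation byte.
        for ($i = 1; $i < $n; $i++) {
            $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
        }
        return '&#' . $cp . ';';
    }, $escaped);
}

echo xml_escape_to_ascii("caf\xC3\xA9 & co"), "\n"; // caf&#233; &amp; co
```

Since the output contains no byte above 0x7F, it reads the same whether the
receiver assumes US-ASCII, ISO-8859-1 or UTF-8, which is the whole point of
sending no encoding declaration.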
> ...
> That looks a pretty good way of working
Thanks!
>
> > for receiving:
> >
> > 'guesstimate' the charset of the incoming payload by looking at the
> > HTTP Content-Type header and the XML prologue. If none is found, assume
> > UTF-8 by default because, despite what the specs say, it is what most
> > implementations out there will be sending (but the user can
> > configure it to use ISO-8859-1 as the default if they prefer)
>
> I presume you use the heuristics described in the XML spec to guess
> the encoding of the document with sufficient precision to read the
> xml declaration (i.e. you know that the first byte is either part of
> the BOM or all or part of the encoding of '<')
I tried my best to follow that and the advice found in
http://www.yale.edu/pclt/encoding/, BOM and all.
Unfortunately I am not 100% confident in the results, since PHP seems to use
different ereg engines on unix and windows and some users reported strange
results...
Bye
Gaetano
ps: here's da code - trying to figure out the charset of the incoming message
/**
 * XML charset encoding guessing helper function.
 * Tries to determine the charset encoding of an XML chunk
 * received over HTTP.
 * NB: according to the spec (RFC 3023), if a text/xml content-type is
 * received over HTTP without a charset parameter,
 * we SHOULD assume it is strictly US-ASCII. But we try to be more
 * tolerant of nonconforming (legacy?) clients/servers,
 * which will most probably be using UTF-8 anyway...
 *
 * @param string $httpheader the http Content-type header
 * @param string $xmlchunk xml content buffer
 * @param string $encoding_prefs comma separated list of character
 *        encodings to be used as default (when mb extension is enabled)
 * @return string the guessed encoding name
 *
 * @todo explore usage of mb_http_input(): does it detect http headers +
 *       post data? if so, use it instead of hand-detection!!!
 */
function guess_encoding($httpheader='', $xmlchunk='', $encoding_prefs=null)
{
    // discussion: see http://www.yale.edu/pclt/encoding/
    // 1 - test if encoding is specified in HTTP HEADERS
    // Details:
    // LWS: (\13\10)?( |\t)+
    // token: (any char but excluded stuff)+
    // header: Content-type = ...; charset=value(; ...)*
    //   where value is of type token, no LWS allowed between
    //   'charset' and value
    // Note: we do not check for invalid chars in VALUE:
    //   this had better be done using a single regexp, as below
    /// @todo this test will pass if ANY header has charset
    ///       specification, not only Content-Type. Fix it?
    if(preg_match('/;((\x0D\x0A)?[ \x09]+)*charset=([^;]*)/i', $httpheader, $regs))
    {
        // the match is case-insensitive, so 'Charset=' and 'CHARSET=' work too;
        // the value runs up to the next ';' or to the end of the header
        return strtoupper(trim($regs[3]));
    }
    // 2 - scan the first bytes of the data for a UTF-16 (or other) BOM pattern
    //     (source: http://www.w3.org/TR/2000/REC-xml-20001006)
    //     NOTE: actually, according to the spec, even if we find the BOM and
    //     determine an encoding, we should check if there is an encoding
    //     specified in the xml declaration, and verify if they match.
    /// @todo implement check as described above?
    /// @todo implement check for first bytes of string even without a BOM?
    ///       (It sure looks harder than for cases WITH a BOM)
    if(preg_match('/^(\x00\x00\xFE\xFF|\xFF\xFE\x00\x00|\x00\x00\xFF\xFE|\xFE\xFF\x00\x00)/', $xmlchunk))
    {
        return 'UCS-4';
    }
    elseif(preg_match('/^(\xFE\xFF|\xFF\xFE)/', $xmlchunk))
    {
        return 'UTF-16';
    }
    elseif(preg_match('/^(\xEF\xBB\xBF)/', $xmlchunk))
    {
        return 'UTF-8';
    }
    // 3 - test if encoding is specified in the xml declaration
    // Details:
    // SPACE: (#x20 | #x9 | #xD | #xA)+ === [ \x9\xD\xA]+
    // EQ: SPACE?=SPACE? === [ \x9\xD\xA]*=[ \x9\xD\xA]*
    if (preg_match('/^<\?xml'.
        '[ \x09\x0D\x0A]+' . 'version' . '[ \x09\x0D\x0A]*=[ \x09\x0D\x0A]*' . '(("[a-zA-Z0-9_.:-]+")|(\'[a-zA-Z0-9_.:-]+\'))'.
        '[ \x09\x0D\x0A]+' . 'encoding' . '[ \x09\x0D\x0A]*=[ \x09\x0D\x0A]*' . '(("[A-Za-z][A-Za-z0-9._-]*")|(\'[A-Za-z][A-Za-z0-9._-]*\'))/',
        $xmlchunk, $regs))
    {
        // $regs[4] is the quoted encoding name: strip the surrounding quotes
        return strtoupper(substr($regs[4], 1, strlen($regs[4])-2));
    }
    // 4 - if mbstring is available, let it do the guesswork
    // NB: we favour finding an encoding that is compatible with
    //     what we can process
    if(extension_loaded('mbstring'))
    {
        if($encoding_prefs)
        {
            $enc = mb_detect_encoding($xmlchunk, $encoding_prefs);
        }
        else
        {
            $enc = mb_detect_encoding($xmlchunk);
        }
        // NB: mb_detect likes to call it ascii, xml parser likes to call it US_ASCII...
        // IANA also likes better US-ASCII, so go with it
        if($enc == 'ASCII')
        {
            $enc = 'US-'.$enc;
        }
        return $enc;
    }
    else
    {
        // no encoding specified: as per HTTP1.1 assume it is iso-8859-1?
        // Both RFC 2616 (HTTP 1.1) and 1945 (HTTP 1.0) clearly state that
        // for text/xxx content types this should be the standard. And we
        // should be getting text/xml as request and response.
        // BUT we have to be backward compatible with the lib,
        // which always used UTF-8 as default...
        return $GLOBALS['xmlrpc_defencoding'];
    }
}
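
The @todo about sniffing the first bytes when there is no BOM could probably
be approached along the lines below. This is only a sketch, not something the
lib does today: the function name sniff_encoding_family() and the labels it
returns are made up, it assumes PHP 7.1 or later, and the byte patterns are the
BOM-less rows of the autodetection table in appendix F of the XML 1.0 spec
(the unusual UCS-4 byte orders are left out).

```php
<?php
// Hypothetical helper: guess the encoding family of an XML document that has
// no BOM by looking at how its first characters ('<' or '<?xm') are encoded
// in the first four bytes. The returned labels are made up for this sketch.
function sniff_encoding_family(string $xmlchunk): ?string
{
    $signatures = array(
        "\x00\x00\x00\x3C" => 'UCS-4BE',          // '<' on 4 bytes, big-endian
        "\x3C\x00\x00\x00" => 'UCS-4LE',          // '<' on 4 bytes, little-endian
        "\x00\x3C\x00\x3F" => 'UTF-16BE',         // '<?' on 2+2 bytes, big-endian
        "\x3C\x00\x3F\x00" => 'UTF-16LE',         // '<?' on 2+2 bytes, little-endian
        "\x3C\x3F\x78\x6D" => 'ASCII-COMPATIBLE', // '<?xm' in UTF-8, ISO-8859-x, ...
        "\x4C\x6F\xA7\x94" => 'EBCDIC',           // '<?xm' in some EBCDIC flavour
    );
    // The cast covers PHP 7, where substr() may return false on empty input.
    $head = (string) substr($xmlchunk, 0, 4);
    return isset($signatures[$head]) ? $signatures[$head] : null;
}
```

A result of 'ASCII-COMPATIBLE' only says that the xml declaration can be read
byte by byte; the actual charset still has to come from its encoding attribute
(step 3 above) or from a default. A null result means the chunk does not start
with an xml declaration in any of the recognised forms.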