Re: FWD: I-D ACTION:draft-klensin-unicode-escapes-00.txt: msg#00013

ietf.apps-discuss

Subject: Re: FWD: I-D ACTION:draft-klensin-unicode-escapes-00.txt

John C Klensin said:
>> - it should be clear that this is for newly-designed protocols
>> only. it shouldn't be interpreted as a request to change
>> existing protocols (including deployed and nonstandard
>> protocols being standardized by IETF), as this would generally
>> break backward compatibility by changing the meaning of '\'
>
> That was intended to be clear already. If it is not
> sufficiently so, suggested text, or at least a place to put it,
> would be welcome.

How about adding "new" before "protocols" in the middle paragraph of
section 1.1 and in the abstract?

>> - it should be clear that this is for occasional use of
>> non-ASCII characters within a protocol field that is
>> constrained to contain only ASCII characters (or a subset),
>> rather than a recommendation for how to represent non-ASCII
>> characters in a protocol field that is capable of carrying,
>> say, UTF-8.

> I don't know if it is clear enough or not. At some level, if
> you didn't conclude that it was clear on reading the draft, then
> that is evidence that it isn't clear enough... but I don't know
> how carefully you read it.

I don't think it would hurt to add something in 1.1. I'm not sure how to
word it, but something along the lines of "Some protocols already accept
native UTF-8 or some other encoding of Unicode, and this recommendation
does not apply to such protocols."

> I've looked at several RFCs
> and U+NNNN seems to be the preferred format for character
> literals and, more commonly, for identifying the code point
> associated with a named character. It is also, fwiw, the one I
> prefer for that purpose. But it is fairly poor for inline use
> in a protocol. The authoritative definition and reference for
> that form is the "Code Points" section of "Appendix A:
> Notational Conventions" of Unicode 5.0 (the reference to the
> book is the I-D).

I don't have that book. The online version 4.1 suggests the notation
<U+0061, U+0300>, which can be abbreviated to <0061, 0300>. This would
still need some kind of introductory indicator (like \u) to show that it's
a Unicode escape.
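
As a rough illustration of that notation (a Python sketch; the function
name is made up for this example and nothing here comes from the draft),
both the full and the abbreviated forms can be read as a list of hex code
points inside angle brackets:

```python
import re

def parse_code_point_sequence(text):
    """Turn '<U+0061, U+0300>' or '<0061, 0300>' into a Python string.

    Hypothetical helper: accepts 4 to 6 hex digits per item, with an
    optional 'U+' prefix, separated by commas inside angle brackets.
    """
    inner = text.strip()
    if not (inner.startswith("<") and inner.endswith(">")):
        raise ValueError("expected angle-bracket delimiters")
    chars = []
    for item in inner[1:-1].split(","):
        m = re.fullmatch(r"(?:U\+)?([0-9A-Fa-f]{4,6})", item.strip())
        if m is None:
            raise ValueError(f"bad code point: {item!r}")
        # chr() itself raises ValueError for values above 0x10FFFF
        chars.append(chr(int(m.group(1), 16)))
    return "".join(chars)

full = parse_code_point_sequence("<U+0061, U+0300>")
short = parse_code_point_sequence("<0061, 0300>")
print(full == short)  # both are 'a' followed by a combining grave accent
```

Note that the angle brackets only delimit the sequence; nothing in the
abbreviated form says "this is Unicode", which is why an indicator is
still needed for inline use.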

>> one more caveat: protocol specifications need to specify this
>> notation explicitly (either directly or by reference to the
>> published RFC) if they are going to use it. conversely, this
>> notation SHOULD NOT (maybe MUST NOT) be used unless it is part
>> of the protocol specification.
>
> Please suggest text for specifying those rules. I constructed
> this rather more as advice to protocol designers and, to a
> lesser extent, to document authors, rather than a base for
> notational definitions to be included by reference. That could
> be changed, but I'd welcome textual suggestions.

"This specification is a recommendation to protocol designers and document
authors. A protocol or other specification MUST NOT be interpreted as
using it unless it explicitly copies this syntax or refers to this RFC
as normative."

> But it is also, if I have done
> the calculation correctly, %C3%83 and that form (used in URIs
> and IRIs) is seriously non-intuitive and certainly can't be
> converted visually.

I certainly agree that encoding of UTF-8 sequences is the wrong thing to
do.
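
To make the point concrete, here is a small Python sketch. It assumes the
character in question is U+00C3, since C3 83 is the UTF-8 encoding of that
code point; the quoted text above does not say which character was meant.

```python
from urllib.parse import quote, unquote

ch = "\u00c3"  # assumed example character, LATIN CAPITAL LETTER A WITH TILDE

# URI-style form: percent-encode each byte of the UTF-8 encoding
percent_form = quote(ch)

# Code point form: just the scalar value in hex
code_point_form = f"U+{ord(ch):04X}"

print(percent_form)     # %C3%83
print(code_point_form)  # U+00C3

# Round trip back to the character
assert unquote(percent_form) == ch
```

The code point form can be looked up directly in the Unicode charts; the
percent-encoded form has to be run through a UTF-8 decoder first.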

Oh: you should explicitly forbid the use of surrogates to encode characters
above U+FFFF.
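
A decoder could enforce that along these lines (a Python sketch under my
own assumptions; the function name is hypothetical and the draft does not
specify any such check):

```python
def code_point_to_char(hex_digits):
    """Convert the hex digits of an escape to a character.

    Rejects surrogate code points (U+D800..U+DFFF), so a character above
    U+FFFF must be written as its own code point, not as a surrogate pair.
    """
    cp = int(hex_digits, 16)
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"surrogate code point U+{cp:04X} not allowed")
    if cp > 0x10FFFF:
        raise ValueError(f"U+{cp:X} is outside the Unicode code space")
    return chr(cp)

# U+1D11E written directly is accepted
clef = code_point_to_char("1D11E")

# The first half of its UTF-16 surrogate pair (D834 DD1E) is rejected
try:
    code_point_to_char("D834")
except ValueError as exc:
    print(exc)
```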

> But I
> have no particularly strong commitment to any particular
> recommendation as long as we establish a recommendation.

(1) I agree that anything is better than nothing.

(2) While \uXXXX is better than encoded UTF-8, it's far worse than
something explicitly delimited.
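
A sketch of why, in Python. The \u{...} syntax below is purely
hypothetical, chosen only to show what an explicit delimiter buys; it is
not what the draft proposes, and both function names are made up.

```python
import re

def decode_fixed(text):
    # Undelimited form: \u followed by exactly four hex digits.
    # Any further hex digits are treated as ordinary text.
    return re.sub(r"\\u([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), text)

def decode_delimited(text):
    # Hypothetical delimited form: \u{...} with one to six hex digits.
    # The closing brace marks where the escape ends.
    return re.sub(r"\\u\{([0-9A-Fa-f]{1,6})\}",
                  lambda m: chr(int(m.group(1), 16)), text)

# Intended character: U+1D11E
fixed = decode_fixed(r"\u1D11E")          # U+1D11 followed by a literal 'E'
delimited = decode_delimited(r"\u{1D11E}")  # the single character U+1D11E

print(len(fixed), len(delimited))
```

With the undelimited form, a reader (or parser) cannot tell where the
escape stops without knowing the fixed length in advance, and characters
above U+FFFF need either a second syntax or surrogates.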

--
Clive D.W. Feather | Work: <clive@xxxxxxxxx> | Tel: +44 20 8495 6138
Internet Expert | Home: <clive@xxxxxxxxxx> | Fax: +44 870 051 9937
Demon Internet | WWW: http://www.davros.org | Mobile: +44 7973 377646
THUS plc | |



