osdir.com
mailing list archive

Subject: Re: [charmodReview-17] replacing all URIs with IRIs - msg#00150

List: org.w3c.tag

On Friday, May 24, 2002, at 05:04 PM, Misha.Wolf@xxxxxxxxxxx wrote:

> Are you aware that a number of W3C specifications already support IRIs,
> though not under that name? Examples are XML, XML Schema, XPointer and
> XLink. For example, the XML specification states[1]:

Yes, I was going to mention that...

>> break many utilities which have made the assumption that RDF identifiers
>
> Which utilities?

All the current RDF tools, I think. I don't think any of them have been updated to support normalization or Unicode storage. Certainly all the tools I've written don't support it. If you take a look at the RDF Validator[1] you'll find that it %-encodes characters like ü, as most of the RDF tools I know do.

>> I can understand presenting strings this way for user-display and
>> user-entry but storing them this way and making them the official
>> encoding seems to be going too far. I would think that simply using
>> UTF-8 %-encoding would be fine for these purposes.
>
> How do you propose to display these strings in a meaningful manner?
> %HH encoding is not invertible, except in the case of ASCII characters.
> This is because the character encoding is not, in general, known.

That is why I said UTF-8. I am fine with requiring a specific character encoding to make the process reversible.

--
Aaron Swartz [http://www.aaronsw.com/]
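
The reversibility point in the message above can be illustrated with a short sketch. It uses Python's standard `urllib.parse` module; the identifier `http://example.org/Dürst` and the helper name `escape` are made up for illustration and do not come from the thread. The sketch only shows that %HH escaping round-trips when both sides agree on UTF-8, and that decoding with a different charset than the one used for escaping gives the wrong characters.

```python
from urllib.parse import quote, unquote

# Hypothetical identifier containing one non-ASCII character.
ORIGINAL = "http://example.org/Dürst"


def escape(text: str, charset: str) -> str:
    """%HH-escape text, encoding non-ASCII characters with the given charset."""
    return quote(text, safe=":/", encoding=charset)


# Escaping with two different charsets yields two different URIs.
utf8_escaped = escape(ORIGINAL, "utf-8")         # ü becomes %C3%BC
latin1_escaped = escape(ORIGINAL, "iso-8859-1")  # ü becomes %FC

# With an agreed charset (UTF-8 here) the escaping is reversible.
round_trip = unquote(utf8_escaped, encoding="utf-8")

# Without that agreement the receiver has to guess, and can guess wrong.
wrong_guess = unquote(utf8_escaped, encoding="iso-8859-1")
undecodable = unquote(latin1_escaped, encoding="utf-8", errors="replace")

print(utf8_escaped, latin1_escaped)
print(round_trip == ORIGINAL, repr(wrong_guess), repr(undecodable))
```

The last two values show both sides of the argument: a fixed charset makes the process invertible, while an escaped URI of unknown origin cannot be reliably decoded.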




Thread at a glance:

Previous Message by Date:

Re: replacing all URIs with IRIs [charmodReview-17]

On Friday, May 24, 2002, at 06:11 PM, Martin Duerst wrote:

> Hello Aaron,

Hi there.

> First some procedural points, starting with the end of your mail:
>
>> I'm considering appealing this decision,
>
> The Character Model is in last call, so you can raise a comment.

Oops, I should have been more clear. It was the RDF decision I was thinking of appealing. I assume that charmod will be decided in its own way.

>> I can understand presenting strings this way for user-display and
>> user-entry but storing them this way and making them the official
>> encoding seems to be going too far.
>
> XML can 'store' them without problems. N3 also should be able to do it.

XML and N3 are interchange formats, I meant storage in the sense of databases and APIs.

>> I would think that simply using UTF-8 %-encoding would be fine for
>> these purposes.
>
> Why do you think so? Would you think it would make sense to replace
> mailto:me@xxxxxxxxxxx with something like
> mailto:%6d%65@%a1%a1%72%6f%6e%73%77.%63%6f%6d
> or maybe even more appropriately, with something like the above but
> using Greek letters instead of Latin ones? This is just about how people
> using another script than Latin in their day-to-day work would feel.
> Why should they have to use special tools (having to do syntax analysis
> so that they can figure out where a % is an escape character and when
> not,...) just to be able to read the text, just because some tools make
> too restrictive assumptions?

I totally understand the feeling and agree with it. It's silly to have to enter something in like that. But that's why I have a computer to convert it for me. I already have my computer convert "Aar" to "mailto:me@xxxxxxxxxxx" and "Dür" to "mailto:duerst@xxxxxxx". I don't expect them folks to use any special tools. In fact, requiring Unicode would require me to go and replace a lot of my software with special i18nized tools.

--
Aaron Swartz [http://www.aaronsw.com/]

Next Message by Date:

Re: [charmodReview-17] replacing all URIs with IRIs

On 25/05/2002 01:18:20 Aaron Swartz wrote:

> On Friday, May 24, 2002, at 05:04 PM, Misha.Wolf@xxxxxxxxxxx wrote:

[...]

> >> break many utilities which have made the assumption that RDF
> >> identifiers
> >
> > Which utilities?
>
> All the current RDF tools, I think. I don't think any of them have been
> updated to support normalization or Unicode storage. Certainly all the
> tools I've written don't support it. If you take a look at the RDF
> Validator[1] you'll find that it %-encodes characters like ü, as most of
> the RDF tools I know do.

On the other hand, the description of N-Triples says (in section 3.3 URI References)[1]:

| Characters above the US-ASCII range are made available by the
| \u or \U escapes as described in section Strings for ranges
| [#x80-#xFFFF] and [#x10000-#x10FFFF] respectively.

> >> I can understand presenting strings this way for user-display and
> >> user-entry but storing them this way and making them the official
> >> encoding seems to be going too far. I would think that simply using
> >> UTF-8 %-encoding would be fine for these purposes.
> >
> > How do you propose to display these strings in a meaningful manner?
> > %HH encoding is not invertible, except in the case of ASCII characters.
> > This is because the character encoding is not, in general, known.
>
> That is why I said UTF-8. I am fine with requiring a specific character
> encoding to make the process reversible.

RFC 2396, in specifying the use of %HH escaping, does not confine its use to UTF-8. There are plenty of URIs out there which use %HH to escape other character encodings. Once you have a %HH-escaped URI, there is no way back, unless you know how it was created. If an RDF database contains some %HH-escaped URIs, how can anyone know whether they arrived %HH-escaped, or whether the %HH-escaping was applied just before their insertion in the database?

[1] http://www.w3.org/TR/rdf-testcases/#sec-uri-encoding

Misha Wolf
I18N WG Chair

> --
> Aaron Swartz [http://www.aaronsw.com/]
>

--------------------------------------------------------------
--
Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of the individual sender, except where the sender specifically states them to be the views of Reuters Ltd.
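
As a rough illustration of the N-Triples passage quoted in the message above, the following sketch maps characters in the two quoted ranges to \u and \U escapes. The function name `ntriples_escape` and the sample strings are hypothetical. The sketch deliberately covers only the non-ASCII ranges mentioned in the quote and passes ASCII characters through untouched, so it is not a complete N-Triples serializer.

```python
def ntriples_escape(text: str) -> str:
    """Escape characters above US-ASCII using \\uHHHH or \\UHHHHHHHH.

    Hypothetical helper: only the ranges named in the quoted passage are
    handled; ASCII characters are copied through unchanged.
    """
    parts = []
    for ch in text:
        code_point = ord(ch)
        if code_point < 0x80:
            # US-ASCII range: left as-is in this sketch.
            parts.append(ch)
        elif code_point <= 0xFFFF:
            # Range [#x80-#xFFFF]: four hex digits after \u.
            parts.append(f"\\u{code_point:04X}")
        else:
            # Range [#x10000-#x10FFFF]: eight hex digits after \U.
            parts.append(f"\\U{code_point:08X}")
    return "".join(parts)


print(ntriples_escape("http://example.org/Dürst"))
print(ntriples_escape("\U00010400"))
```

Unlike %HH escaping, these escapes name the Unicode code point directly, so no character encoding has to be guessed when reading them back.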

Previous Message by Thread:

Re: [charmodReview-17] replacing all URIs with IRIs

On 24/05/2002 20:58:57 Aaron Swartz wrote:

> I would like to draw the TAG's attention to this requirement in charmod:
>
> """
> W3C specifications that define protocol or format elements (e.g. HTTP
> headers, XML attributes, etc.) which are to be interpreted as URI
> references (or specific subsets of URI references, such as absolute URI
> references, URIs, etc.) SHOULD use Internationalized Resource
> Identifiers (IRI) [I-D IRI] (or an appropriate subset thereof).
> """
> - http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-URIs
>
> RDF, for example, has recently moved to replace URIs with IRIs (or
> something like them). I find this seriously problematic since it will

Are you aware that a number of W3C specifications already support IRIs, though not under that name? Examples are XML, XML Schema, XPointer and XLink. For example, the XML specification states[1]:

| System identifiers (and other XML strings meant to be used as URI
| references) may contain characters that, according to [IETF RFC 2396]
| and [IETF RFC 2732], must be escaped before a URI can be used to
| retrieve the referenced resource. The characters to be escaped are the
| control characters #x0 to #x1F and #x7F (most of which cannot appear in
| XML), space #x20, the delimiters '<' #x3C, '>' #x3E and '"' #x22, the
| unwise characters '{' #x7B, '}' #x7D, '|' #x7C, '\' #x5C, '^' #x5E and
| '`' #x60, as well as all characters above #x7F. Since escaping is not
| always a fully reversible process, it must be performed only when
| absolutely necessary and as late as possible in a processing chain.

The last sentence above is very important. Note both the "only when absolutely necessary" and the "as late as possible in a processing chain". The XML specification continues:

| In particular, neither the process of converting a relative URI to an
| absolute one nor the process of passing a URI reference to a process or
| software component responsible for dereferencing it should trigger
| escaping. When escaping does occur, it must be performed as follows:
|
| 1. Each disallowed character to be escaped is represented in UTF-8
| [IETF RFC 2279] as one or more bytes.
|
| 2. The resulting bytes are escaped with the URI escaping mechanism
| (that is, converted to %HH, where HH is the hexadecimal notation of
| the byte value).
|
| 3. The original character is replaced by the resulting character sequence.

> break many utilities which have made the assumption that RDF identifiers

Which utilities?

> are ASCII strings with no spaces, etc.

I suppose spaces could safely be converted to %20, as they are invertible.

> I can understand presenting strings this way for user-display and
> user-entry but storing them this way and making them the official
> encoding seems to be going too far. I would think that simply using
> UTF-8 %-encoding would be fine for these purposes.

How do you propose to display these strings in a meaningful manner? %HH encoding is not invertible, except in the case of ASCII characters. This is because the character encoding is not, in general, known. RFC 2396 says[2]:

| In the simplest case, the original character sequence contains only
| characters that are defined in US-ASCII, and the two levels of
| mapping are simple and easily invertible: each 'original character'
| is represented as the octet for the US-ASCII code for it, which is,
| in turn, represented as either the US-ASCII character, or else the
| "%" escape sequence for that octet.
|
| For original character sequences that contain non-ASCII characters,
| however, the situation is more difficult. Internet protocols that
| transmit octet sequences intended to represent character sequences
| are expected to provide some way of identifying the charset used, if
| there might be more than one [RFC2277]. However, there is currently
| no provision within the generic URI syntax to accomplish this
| identification. An individual URI scheme may require a single
| charset, define a default charset, or provide a way to indicate the
| charset used.

> What does the TAG think about changing the standard Web identifier from
> URIs to IRIs, essentially allowing arbitrary Unicode characters into the
> body of these identifiers. An example from the RDF test cases shows an
> HTTP URI with embedded accented characters in Unicode.
>
> I'm considering appealing this decision, but I wanted to hear the TAG's
> position first,
>
> Thanks,
> --
> Aaron Swartz [http://www.aaronsw.com/]

[1] http://www.w3.org/XML/xml-V10-2e-errata#E26
[2] http://www.ietf.org/rfc/rfc2396.txt

Misha Wolf
I18N WG Chair
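
The three escaping steps quoted from the XML erratum in the message above can be sketched in a few lines. The function name `escape_uri_reference`, the constant `DISALLOWED_ASCII`, and the example string are assumptions made for illustration; the set of escaped characters follows the list in the quoted passage. Note that "%" itself is not in that list, so existing %HH sequences pass through unchanged.

```python
# ASCII characters the quoted passage says must be escaped: control
# characters #x0-#x1F and #x7F, space, the delimiters < > ", and the
# "unwise" characters { } | \ ^ `.
DISALLOWED_ASCII = (
    set(range(0x00, 0x20))
    | {0x7F, 0x20}
    | {ord(c) for c in '<>"{}|\\^`'}
)


def escape_uri_reference(text: str) -> str:
    """Apply the three quoted steps to every disallowed character."""
    parts = []
    for ch in text:
        code_point = ord(ch)
        if code_point > 0x7F or code_point in DISALLOWED_ASCII:
            # Step 1: represent the character in UTF-8 as one or more bytes.
            utf8_bytes = ch.encode("utf-8")
            # Step 2: convert each byte to %HH.
            # Step 3: the %HH sequence takes the place of the character.
            parts.append("".join(f"%{byte:02X}" for byte in utf8_bytes))
        else:
            parts.append(ch)
    return "".join(parts)


print(escape_uri_reference("http://example.org/Dürst page"))
```

Following the "as late as possible" advice in the same passage, a tool would call something like this only at the point of dereferencing, and keep the unescaped string for storage and comparison.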

Next Message by Thread:

Re: [charmodReview-17] replacing all URIs with IRIs

[I'm copying www-rdf-validator@xxxxxx, because there is an error report for the validator, and some suggestions of how to fix it.]

At 19:18 02/05/24 -0500, Aaron Swartz wrote:

> On Friday, May 24, 2002, at 05:04 PM, Misha.Wolf@xxxxxxxxxxx wrote:
>
>> Which utilities?
>
> All the current RDF tools, I think. I don't think any of them have been
> updated to support normalization or Unicode storage. Certainly all the
> tools I've written don't support it. If you take a look at the RDF
> Validator[1] you'll find that it %-encodes characters like ü, as most of
> the RDF tools I know do.

How much work would it be for the RDF Validator to change this? My guess is that it would be quite easy, and it would result in overall less code. I would be very glad to help.

By the way, I just tested the RDF Validator with some simple input. While it gets to the correct %hh escaping in URIs, it messes up the literals. That's because the validator input page is labeled as being in iso-8859-1, and the output is labeled as being in UTF-8, but for literals, there is no conversion in between. To fix it, the following steps are needed:

- Set the encoding of http://www.w3.org/RDF/Validator/Overview.html to UTF-8. I can do that in about one minute. Please tell me when to do it.
- Find the place in the code where the URIs are converted from iso-8859-1 to UTF-8. Remove that conversion. This should be rather easy. Please tell me if you need help.
- Fix graphVis. This seems to currently run under the assumption that everything (.dot files,...) is in iso-8859-1. In the short run, it could be called by converting from UTF-8 to iso-8859-1 and replacing characters not representable in iso-8859-1 with something like a ? or so. In the long term, it should be changed so that it can correctly render more than just iso-8859-1. This applies only to PNG and GIF; for SVG, graphVis currently does gigo (garbage in, garbage out), but feeding it UTF-8 would do the right thing. For the others, the easiest would be to use a batch SVG renderer.
- Go through the collection of RDF saved for test purposes, and change the first line of anything that contains bytes higher than 0x7F from
  <?xml version="1.0"?>
  to
  <?xml version="1.0" encoding='iso-8859-1'?>
  and additionally check the data for garbage cases. I may be able to help with this, too.

My conclusions from this are:

- Yes, there are indeed problems with RDF tools and i18n.
- Such problems should be fixed asap.
- The problems start with literals, not with resource identifiers.
- Fixing the problems with literals will fix the problems with resource identifiers too, in most cases.
- For most part, fixing the problems probably takes less time than this discussion.

Regards, Martin.
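
A few of the points in the message above can be made concrete with a small sketch. None of this is the validator's actual code; the function names (`mislabeled_literal`, `to_latin1_lossy`, `declare_latin1`) and sample data are invented for illustration. The sketch shows what happens when iso-8859-1 bytes are labeled as UTF-8 without conversion, the suggested short-term fallback of replacing unrepresentable characters with "?", and the suggested change to the XML declaration of saved test files.

```python
PLAIN_DECLARATION = b'<?xml version="1.0"?>'
LATIN1_DECLARATION = b"<?xml version=\"1.0\" encoding='iso-8859-1'?>"


def mislabeled_literal(text: str) -> str:
    """Simulate the reported bug: iso-8859-1 bytes read as if they were UTF-8."""
    raw = text.encode("iso-8859-1")
    return raw.decode("utf-8", errors="replace")


def to_latin1_lossy(text: str) -> bytes:
    """Short-term fallback: characters outside iso-8859-1 become '?'."""
    return text.encode("iso-8859-1", errors="replace")


def declare_latin1(data: bytes) -> bytes:
    """Add an explicit encoding to files that contain bytes above 0x7F."""
    if data.startswith(PLAIN_DECLARATION) and any(b > 0x7F for b in data):
        return LATIN1_DECLARATION + data[len(PLAIN_DECLARATION):]
    return data


print(repr(mislabeled_literal("Dürst")))   # ü is lost without a conversion step
print(to_latin1_lossy("Dürst 日本"))        # non-Latin-1 characters become '?'
print(declare_latin1(b'<?xml version="1.0"?><a>D\xfcrst</a>'))
```

The check for "garbage cases" mentioned in the last step still needs a human: a byte above 0x7F only shows that the file is not pure ASCII, not that it really is iso-8859-1.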