Re: Pre-PEP: Easy Text File Decoding (msg#00142, python.python-3000.devel)
"Martin v. Löwis" <martin@xxxxxxxxxxx> writes:

> Marcin 'Qrczak' Kowalczyk wrote:
>> It is true that it can change the interpretation of file contents.
>> This is unavoidable. Unless someone uses unpaired surrogates for this
>> purpose (or code points above U+10FFFF) - I've seen such proposals,
>> but IMHO they are abusing rules too far.
>
> It's not exactly unavoidable: any escaping mechanism can support the
> full range of valid input. In your escaping mechanism, you could
> duplicate 0 bytes on decoding, and write a null byte if you have two
> subsequent NUL characters on encoding.

This is exactly what I am doing. The encoding is able to decode
arbitrary byte sequences, including '\0' bytes, and encodes them back
losslessly.

The point is that it differs from true UTF-8 for strings which contain
'\0' or U+0000. It's unavoidable that it differs from UTF-8 for some
strings, unless code points not encodable in UTF-8 are used.

It doesn't differ from true UTF-8 when there is no '\0' or U+0000.
The fact that it doesn't differ from UTF-8 for some strings means that
for such strings it fires only when a UTF-8 decoder would have reported
an error, i.e. that it only changes the behavior of code which would
fail otherwise, and that it doesn't break what would work in UTF-8.

My encoder is injective: it accepts U+0000 prefixes only in sequences
which would have been invalid UTF-8.

I agree that it's not suitable for showing the filename to a user.

> I still think that PUA characters would be a better use

What if the filename contains the correct UTF-8 encoding of such a PUA
character?

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@xxxxxxxxxx
    ^^     http://qrnik.knm.org.pl/~qrczak/
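
For readers who want to see the idea concretely, here is a rough Python sketch of a NUL-prefixed escaping scheme of the kind discussed above. It is not the code from the message. The function names `decode_nul_escaped` and `encode_nul_escaped` are made up for illustration, and the choice to represent an undecodable byte as U+0000 followed by the code point with the same numeric value is an assumption, since the message does not spell out that detail. The injectivity rule is enforced here by a simple round-trip check.

```python
def _expected_len(lead: int) -> int:
    """Length of the UTF-8 sequence a lead byte announces, or 0 if it cannot start one."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def decode_nul_escaped(data: bytes) -> str:
    """Decode arbitrary bytes; never fails (hypothetical helper, see lead-in)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0:
            # A real NUL byte is doubled so that U+0000 can serve as the escape prefix.
            out.append("\x00\x00")
            i += 1
        elif b < 0x80:
            out.append(chr(b))
            i += 1
        else:
            length = _expected_len(b)
            try:
                if length == 0:
                    raise ValueError("not a valid lead byte")
                # Strict decoding rejects truncated, overlong and surrogate sequences.
                out.append(data[i:i + length].decode("utf-8"))
                i += length
            except ValueError:
                # Assumed representation: U+0000 followed by the byte value as a code point.
                out.append("\x00" + chr(b))
                i += 1
    return "".join(out)


def encode_nul_escaped(text: str) -> bytes:
    """Inverse of decode_nul_escaped; rejects strings the decoder could not have produced."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\x00":
            if i + 1 >= len(text):
                raise ValueError("dangling U+0000 escape prefix")
            nxt = ord(text[i + 1])
            if nxt == 0:
                out.append(0)        # doubled U+0000 -> one NUL byte
            elif 0x80 <= nxt <= 0xFF:
                out.append(nxt)      # escaped raw byte
            else:
                raise ValueError("invalid escape after U+0000")
            i += 2
        else:
            out += ch.encode("utf-8")
            i += 1
    result = bytes(out)
    # Injectivity: escapes are accepted only where the bytes would have been
    # invalid UTF-8, i.e. where decoding gives back exactly the same string.
    if decode_nul_escaped(result) != text:
        raise ValueError("escaped bytes would form valid UTF-8")
    return result
```

With this sketch, input that is valid UTF-8 and contains no NUL byte decodes exactly as it would under plain UTF-8, while any other byte string still decodes and encodes back to the same bytes. A string such as U+0000 U+00C3 U+0000 U+00A9 is refused by the encoder, because the bytes C3 A9 are already valid UTF-8 for a single character.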