Subject: RFC: Standard for automatic insertion of bidi control characters

List: linux.region.israel.ivrix.discuss


Hi,

As you all know, BiDi text calls for occasional use of control characters. The
preferred option is to have them inserted automatically as the user types. I
have described methods to insert control characters automatically in my
previous mail[1].

In short, we need control characters to:

1. Force base direction.
Base direction is determined by the first strong character. If a different base
direction is desired, an LRM / RLM can be inserted as the first character.

2. Force direction of neutral characters.
Can be accomplished either with LRM / RLM or RLE / LRE / PDF. See my previous
mail[1] for examples.
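
As a rough illustration of point 1, here is a minimal Python sketch of the
first-strong-character rule and of forcing a base direction with a leading
LRM / RLM. The function names are hypothetical, and the sketch deliberately
ignores embeddings and the rest of the Unicode bidi algorithm; it only looks
at each character's bidi class.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def base_direction(text, default="L"):
    """Return "L" or "R" according to the first strong character.

    Falls back to `default` (an assumption of this sketch) when the text
    contains no strong character at all.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "L"
        if bidi_class in ("R", "AL"):
            return "R"
    return default


def force_base_direction(text, wanted):
    """Prefix an LRM / RLM only when the first strong character disagrees."""
    if base_direction(text) == wanted:
        return text
    return (LRM if wanted == "L" else RLM) + text
```

Note that this variant adds the mark only when it is needed; whether a mark
should be added unconditionally is decision 1 further down.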


If all toolkits would implement "automatic control characters", it'd be a good
idea for them to use the same algorithm. Here's why:
1. Control characters can be used in file names and similar identifiers. If
one toolkit would use RLE/LREs while the other would use LRM/RLMs, it'd be
impossible to type one toolkit's file name in another toolkit's edit box. If
you think filenames should be forbidden from having control characters, tell
me: why should a file name read "GNINRAEL c++" or "++c GNINRAEL" instead of
proper "C++ GNINRAEL" (caps for RTL text)?
2. Even in text documents, where it might seem the method doesn't matter,
interoperability is also an issue, since the same text file can be opened in
different editors or copied and pasted.

My goal: Forming a standard for automatic insertion of BiDi control characters
which'd be published on FreeDesktop.org and/or Unicode.org.

Decisions to make:
1. Should an initial RLM/LRM be inserted when the base direction is already
correct thanks to the first strong character policy?
2. Neutral characters' direction can be changed by either adding LRM/RLM after
them or by wrapping them in LRE/RLE - PDF. Which should we use? (Using
LRM/RLM has the benefit of avoiding the need for PDF, but LRE/RLE fit better
logically for what we're doing and generally feel like The Right Thing(tm) to
do to me.)
3. Should the standard require using control characters in the most compact
way, e.g. requiring *not* to use LREs when base direction is LTR, or requiring
joining adjacent embedding blocks?
4. "Normalization" of control characters. Drop out control characters when
comparing identifiers? Compare identifiers in their "visual" form?
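
To make the first option in decision 4 concrete, here is a minimal Python
sketch of comparing identifiers with the control characters dropped. The
function names are hypothetical, and the set covers only the five control
characters discussed in this mail; it is one possible reading of
"normalization", not a proposal for the final rule.

```python
# The five control characters discussed in this mail.
BIDI_CONTROLS = {
    "\u200e",  # LRM
    "\u200f",  # RLM
    "\u202a",  # LRE
    "\u202b",  # RLE
    "\u202c",  # PDF
}


def strip_bidi_controls(name):
    """Return `name` with the bidi control characters above removed."""
    return "".join(ch for ch in name if ch not in BIDI_CONTROLS)


def same_identifier(a, b):
    """Compare two identifiers while ignoring bidi control characters."""
    return strip_bidi_controls(a) == strip_bidi_controls(b)
```

With such a comparison, a file name typed as <BET> C+<LRM>+<LRM> in one
toolkit and as <BET> <LRE>C++<PDF> in another would be treated as the same
identifier.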

*** I'm looking forward to your comments. ***

One approach might be to implement automatic control characters in rich text
widgets only (like Microsoft did) but I'm against copying the worst solutions
from Microsoft.

[1] http://toast.unwind.co.il/docs/Automatic%20addition%20of%20BiDi
control%20characters
----
Ivrix-discuss list. See http://ivrix.org.il.
To unsubscribe, please send mail to
ivrix-discuss-request-0GYx4uFImdMQhDRD3CWCHA@xxxxxxxxxxxxxxxx with
only the following line in the message body (NOT SUBJECT!): unsubscribe



Thread at a glance:

Previous Message by Date:

Automatic addition of BiDi control characters

If you've ever used Microsoft's RichEdit control for some time, in Wordpad or
in Outlook, you might have noticed it handles BiDi smarter than the regular
edit control. One particular difference is that it takes note of the keyboard
language, e.g. if you type "C" "+" "+", in Hebrew keyboard it'll look like
"++C" while in English keyboard it'll look like "C++". That allows the user to
hint what direction he really wants by picking the keyboard language -- a
pretty natural thing to do.

I wish to implement this functionality on top of Qt's QTextEdit. Here I
propose a few approaches for automatically adding BiDi control characters.
Your input is welcome.

I picked the strings "C++" and "C#" as my pet peeves in the following
examples. Strings of code can serve as an even better example, being full of
punctuation.

Strict mode vs. loose mode
====================

All approaches can choose to insert control chars only if keyboard direction
!= paragraph direction. I'd call this mode of operation "loose mode". The
opposite of this is "strict mode".

"Strict mode", although containing redundant control chars, guarantees that
even if the base direction changes, the string's meaning would remain the
same.

For example, in loose mode (assume LTR base direction):
    <RLE><BET><PDF> C++
when switched to RTL base direction, would render:
    ++C <BET>
while in strict mode:
    <RLE><BET><PDF><LRE> C++<PDF>
when switched to RTL base direction, would render:
    C++ <BET>

* Microsoft's RichEdit works in "strict mode".

Approaches
=========

1. Primitive LRM/RLM approach.

Rules:
1. When inserting neutral CHAR, insert MARKER(keyboard direction) after CHAR.
    <BET> C+<LRM>+<LRM>.

Considerations:
1. Prevent cursor from being placed between CHAR and its MARKER.
2. Erasing a CHAR should erase the adjacent MARKER.
3. Erasing a MARKER should erase the adjacent CHAR.
4. Some neutral chars will reorder when paragraph dir. changes, since we only
set explicit dir. on chars which are entered with keyboard dir != paragraph
dir.
5. No optimization of redundant MARKERs possible. e.g. if we optimize
C+<LRM>+<LRM> to C++<LRM>, someone inserting an <ALEF> between the "+"s will
make the first "+" move to the left of the "C".

2. Optimized LRM/RLM approach:
    <BET> C++, Java and C#<LRM>.
(note: no markers were added after the "++" since "J" is a strong LTR char)

Rules:
1. When inserting neutral CHAR, search forward to find the first strong
directionality char (markers included):
1.1. If no such char was found, assume a found char with direction =
paragraph direction.
1.2. If the found char dir = current keyboard dir, do nothing.
1.3. If the found char dir != current keyboard dir, insert MARKER(keyboard
dir) after CHAR.

Considerations:
1. Same as approach [1], without item 5.

3. LRE/RLE/PDF approach:
* This is the closest to Microsoft RichEdit's approach. The difference is that
RTF has only 1 embedding level and thus no need for a <PDF> command.

Strict mode:
    <RLE><BET> <PDF><LRE>C++<PDF><RLE>.<PDF>
Loose mode:
    <BET> <LRE>C++<PDF>.

Rules:
1. When inserting any CHAR, scan backwards for the nearest RLE/LRE (noting
PDFs along the way, to ensure we won't find an RLE/LRE which was already
popped).
1.1. If the found embedding marker matches keyboard direction, do nothing.
1.2. Otherwise, add embedding marker (RLE/LRE) before the CHAR and PDF after
the CHAR.

Considerations:
1. Prevent cursor from being placed between RLE/LRE and CHAR.
2. Prevent cursor from being placed between CHAR and PDF.
3. Erasing an RLE/LRE will erase the CHAR after it.
4. Erasing a PDF will erase the CHAR before it.
5. If after applying 3 or 4, the RLE/LRE and PDF are adjacent, remove them.
6. As a final cleanup (after erasing and applying 3-5, if required), if we
have a PDF before the insertion point and an RLE/LRE after the insertion
point, try to join them (search back from the PDF to make sure the embedding
markers match).
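
The rules of approach 2 (optimized LRM/RLM) in the message above can be
sketched in Python roughly as follows. This is only an illustration under
stated assumptions: `insert_char` and `strong_dir` are hypothetical names, the
text is a plain string with an integer insertion point, "neutral" is taken to
mean any character that is not strong, and none of the cursor or erasing
considerations are handled.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def strong_dir(ch):
    """Return "L" or "R" for a strong character (markers included), else None."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "L"
    if bidi_class in ("R", "AL"):
        return "R"
    return None


def insert_char(text, pos, ch, keyboard_dir, paragraph_dir):
    """Insert `ch` at `pos`; return the new text and the new insertion point."""
    if strong_dir(ch) is not None:
        # Strong characters carry their own direction; no marker is needed.
        return text[:pos] + ch + text[pos:], pos + 1

    # Rule 1: search forward for the first strong character.
    found_dir = paragraph_dir  # Rule 1.1: nothing found, assume paragraph dir.
    for following in text[pos:]:
        direction = strong_dir(following)
        if direction is not None:
            found_dir = direction
            break

    marker = ""
    if found_dir != keyboard_dir:  # Rule 1.3 (rule 1.2 is the no-op case).
        marker = LRM if keyboard_dir == "L" else RLM
    return text[:pos] + ch + marker + text[pos:], pos + 1 + len(marker)
```

For instance, typing "#" on an English keyboard at the end of "<BET> C" in an
RTL paragraph appends "#<LRM>", while typing "+" in front of an existing
"Java" adds no marker because "J" is already a strong LTR char.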

Next Message by Date:

Re: RFC: Standard for automatic insertion of bidi control characters

On Tue, May 13, 2003, Ilya Konstantinov wrote about "RFC: Standard for
automatic insertion of bidi control characters":
> If all toolkits would implement "automatic control characters", it'd be a
> good idea for them to use the same algorithm. Here's why:

Please correct me if I'm wrong, but these control characters only work with
UTF-8, right? For somebody that uses ISO8859-8, all this discussion will be
irrelevant.

> 1. Control characters can be used in file names and similar identifiers. If
> one toolkit would use RLE/LREs while the other would use LRM/RLMs, it'd be
> impossible to type one toolkit's file name in another toolkit's edit box. If
> you think filenames should be forbidden from having control characters, tell
> me: why should a file name read "GNINRAEL c++" or "++c GNINRAEL" instead of
> proper "C++ GNINRAEL" (caps for RTL text)?

You are making a valid point, but perhaps the conclusion should be stronger -
no invisible characters (including control characters, LRMs, etc.) or even
overlaying characters (like niqqud) should be comfortably used in user-typed
file names.

I mean, surely these characters could "legally" be found in file names, just
like in the good-old-ascii days you could create a file name with ^A, TAB,
newline, and other weird characters in it (at one time, even a space was
considered a weird character). When you had such filenames with "invisible" or
"control" characters, you could usually only use them through a file menu or
globbing - you could not hope to be able to type such a name again in a shell
to access the file directly, or understand the exact characters in the file
name just from "ls" output.

Since all these mixed-language filename issues normally apply only to GUI
users anyway, and these users always use menus anyway, the users might not
even notice this problem.

> 2. Even in text documents, where it might seem the method doesn't matter,
> interoperability is also an issue, since the same text file can be opened in
> different editors or copied and pasted.

Indeed, I believe that in this case, the control-character-insertion method is
not *the* problem. The bigger issue is 100% compatibility of the bidi
algorithm. If both editors were 100% compatible in their bidi algorithm, then
if you create a file in one (no matter what sort of control characters the
editor adds automatically) and open it in another editor, the result will look
100% the same.

However, here your control character insertion might play a role. The problem
that currently prevents 100% bidi compatibility of plain text (besides bugs in
various bidi implementations that don't do what the Unicode standard says) is
the differing heuristics for deciding base directions. If you can devise a way
to add control characters (say, in the beginning of every line or every
paragraph) that will cause the result to be identical in all known bidi
implementations, this might be worthwhile. But personally, I'd rather see a
standard heuristic for the base-direction issue emerge...

> My goal: Forming a standard for automatic insertion of BiDi control
> characters which'd be published on FreeDesktop.org and/or Unicode.org.
>
> Decisions to make:
> 1. Should an initial RLM/LRM be inserted when the base direction is already
> correct thanks to the first strong character policy?

This might be useful as I pointed out above, to account for different
base-direction heuristics in different applications.

> 2. Neutral characters' direction can be changed by either adding LRM/RLM
> after them or by wrapping them in LRE/RLE - PDF. Which should we use? (Using
> LRM/RLM has the benefit of avoiding the need for PDF, but LRE/RLE fit better
> logically for what we're doing and generally feel like The Right Thing(tm) to
> do to me.)

Shouldn't we go for "be lenient in what you accept"? A good Unicode-supporting
editor will need to support all ways of doing the same thing in its input.
When it writes output, it will be allowed to "canonize" these things to the
form it likes best, or try to leave them as similar as possible to what was
originally read (compare, for example, Windows editors that can work with
either LF or CR-LF line separators).

> 3. Should the standard require using control characters in the most compact
> way, e.g. requiring *not* to use LREs when base direction is LTR, or
> requiring joining adjacent embedding blocks?

Wouldn't that contradict the Unicode standard, which does allow one to write
any sort of bizarre sequence of Unicode characters?

> 4. "Normalization" of control characters. Drop out control characters when
> comparing identifiers? Compare identifiers in their "visual" form?

See what I wrote above about canonization, and why I believe that
"identifiers" (like file names, variable names, etc.) should not contain any
control characters - or if they do, the user should not expect to be able to
type their name by hand, only to choose them from a menu.

--
Nadav Har'El                        | Tuesday, May 13 2003, 12 Iyyar 5763
nyh-TS7m/3hpY0sOpacJJkBjfT4kX+cae0hd@xxxxxxxxxxxxxxxx |-----------------------------------------
Phone: +972-53-245868, ICQ 13349191 |If con is the opposite of pro, is
http://nadav.harel.org.il           |congress the opposite of progress?
