Subject: RFC: Standard for automatic insertion of bidi control characters - msg#00031
Hi,
As you all know, BiDi text calls for occasional use of control characters. The
preferred option is to have them inserted automatically as the user types. I
have described methods to insert control characters automatically in my
previous mail[1].
In short, we need control characters to:
1. Force base direction.
Base direction is determined by the first strong character. If different base
direction is desired, an LRM / RLM can be inserted as the first character.
2. Force direction of neutral characters.
Can be accomplished either with LRM / RLM or RLE / LRE / PDF. See my previous
mail[1] for examples.
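
A minimal sketch of point 1, assuming Python's standard `unicodedata` module is an acceptable stand-in for a toolkit's own character database. The function names are hypothetical, and the first-strong scan is simplified (it does not skip over embedded runs):

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def first_strong_direction(text):
    """Return 'ltr' or 'rtl' for the first strong character, or None."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


def force_base_direction(text, wanted):
    """Prefix LRM/RLM only when the first strong char disagrees with `wanted`."""
    if first_strong_direction(text) == wanted:
        return text
    return (LRM if wanted == "ltr" else RLM) + text
```

This variant leaves the text untouched when the first strong character already gives the wanted direction; whether a mark should be added anyway is one of the open decisions listed further down.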
If all toolkits would implement "automatic control characters", it'd be a good
idea for them to use the same algorithm. Here's why:
1. Control characters can be used in file names and similar identifiers. If
one toolkit would use RLE/LREs while the other would use LRM/RLMs, it'd be
impossible to type one toolkit's file name in another toolkit's edit box. If
you think filenames should be forbidden from having control characters, tell
me: why should a file name read "GNINRAEL c++" or "++c GNINRAEL" instead of
proper "C++ GNINRAEL" (caps for RTL text)?
2. Even in text documents, where it might seem the method doesn't matter,
interoperability is also an issue, since the same text file can be opened in
different editors or copied and pasted.
My goal: Forming a standard for automatic insertion of BiDi control characters
which'd be published on FreeDesktop.org and/or Unicode.org.
Decisions to make:
1. Should an initial RLM/LRM be inserted when the base direction is already
correct thanks to the first strong character policy?
2. Neutral characters' direction can be changed by either adding LRM/RLM after
them or by wrapping them in LRE/RLE - PDF. Which should we use? (Using
LRM/RLM has the benefit of avoiding the need for PDF, but LRE/RLE fit better
logically for what we're doing and generally feel like The Right Thing(tm) to
do to me.)
3. Should the standard require using control characters in the most compact way,
e.g. requiring *not* to use LREs when base direction is LTR, requiring
joining adjacent embedding blocks?
4. "Normalization" of control characters. Drop out control characters when
comparing identifiers? Compare identifiers in their "visual" form?
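
To make the first option in point 4 concrete, here is a rough sketch of "drop out control characters when comparing". The set of characters and the function names are assumptions for illustration; it does not attempt the "visual form" comparison:

```python
# Bidi control characters discussed in this thread, plus the two overrides.
BIDI_CONTROLS = {
    "\u200e", "\u200f",            # LRM, RLM
    "\u202a", "\u202b", "\u202c",  # LRE, RLE, PDF
    "\u202d", "\u202e",            # LRO, RLO
}


def strip_bidi_controls(identifier):
    """Return the identifier with all bidi control characters removed."""
    return "".join(ch for ch in identifier if ch not in BIDI_CONTROLS)


def same_identifier(a, b):
    """Compare two identifiers while ignoring bidi control characters."""
    return strip_bidi_controls(a) == strip_bidi_controls(b)
```

With this, a name typed as C+<LRM>+<LRM> in one toolkit and as <LRE>C++<PDF> in another would compare equal.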
*** I'm looking forward to your comments. ***
One approach might be to implement automatic control characters in rich text
widgets only (like Microsoft did) but I'm against copying the worst solutions
from Microsoft.
[1] http://toast.unwind.co.il/docs/Automatic%20addition%20of%20BiDi%20control%20characters
----
Ivrix-discuss list. See http://ivrix.org.il.
To unsubscribe, please send mail to
ivrix-discuss-request-0GYx4uFImdMQhDRD3CWCHA@xxxxxxxxxxxxxxxx with
only the following line in the message body (NOT SUBJECT!): unsubscribe
Previous Message by Date:
Automatic addition of BiDi control characters
If you ever used Microsoft's RichEdit control for some time, in Wordpad or in
Outlook, you might have noticed it handles BiDi smarter than the regular edit
control. One particular difference is that it takes note of the keyboard
language: e.g. if you type "C" "+" "+" with a Hebrew keyboard it'll look like
"++C", while with an English keyboard it'll look like "C++". That allows the user
to hint what direction he really wants by picking the keyboard language -- a
pretty natural thing to do.
I wish to implement this functionality on top of Qt's QTextEdit. Here I
propose a few approaches for automatically adding BiDi control characters.
Your input is welcome.
I picked the strings "C++" and "C#" as my pet peeves in the following
examples. Strings of code can serve an even better example, being full of
punctuation.
Strict mode vs. loose mode
====================
All approaches can choose to insert control chars only if keyboard direction
!= paragraph direction. I'd call this mode of operation "loose mode". The
opposite of this is "strict mode".
"Strict mode", although containing redundant control chars, guarantees that
even if the base direction changes, the string's meaning would remain the
same.
For example, in loose mode (assume LTR base direction):
<RLE><BET><PDF> C++
when switched to RTL base direction, would render:
++C <BET>
while in strict mode:
<RLE><BET><PDF><LRE> C++<PDF>
when switched to RTL base direction, would render:
C++ <BET>
* Microsoft's RichEdit works in "strict mode".
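
The two modes can be sketched as follows. This is only an illustration using embeddings: the `runs` representation (text plus the keyboard direction it was typed with) and the function name are assumptions, not part of any toolkit:

```python
LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"


def wrap_runs(runs, base_dir, strict):
    """Join (text, keyboard_dir) runs, adding embeddings per the chosen mode.

    Loose mode wraps only runs whose direction differs from base_dir;
    strict mode wraps every run, so the result survives a base-direction change.
    """
    out = []
    for text, direction in runs:
        if strict or direction != base_dir:
            opener = LRE if direction == "ltr" else RLE
            out.append(opener + text + PDF)
        else:
            out.append(text)
    return "".join(out)


BET = "\u05d1"
runs = [(BET, "rtl"), (" C++", "ltr")]
loose = wrap_runs(runs, "ltr", strict=False)   # <RLE><BET><PDF> C++
strict = wrap_runs(runs, "ltr", strict=True)   # <RLE><BET><PDF><LRE> C++<PDF>
```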
Approaches
=========
1. Primitive LRM/RLM approach.
Rules:
1. When inserting neutral CHAR, insert MARKER(keyboard direction) after
CHAR.
<BET> C+<LRM>+<LRM>.
Considerations:
1. Prevent cursor from being placed between CHAR and its MARKER
2. Erasing a CHAR should erase the adjacent MARKER.
3. Erasing a MARKER should erase the adjacent CHAR.
4. Some neutral chars will reorder when paragraph dir. changes, since we only
set explicit dir. on chars which are entered with keyboard dir != paragraph
dir.
5. No optimization of redundant MARKERs possible. e.g. if we optimize
C+<LRM>+<LRM> to C++<LRM>, someone inserting an <ALEF> between the "+"s will
make the first "+" move to the left of the "C".
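
A rough sketch of the insertion rule of approach 1, covering only typing (not the cursor and erase considerations). It assumes that "neutral" means any character whose bidi class is not strong (L, R, AL), which also sweeps in digits; a real implementation would probably want to treat those separately. The function name and the `strict` flag are hypothetical:

```python
import unicodedata

LRM, RLM = "\u200e", "\u200f"
MARKER = {"ltr": LRM, "rtl": RLM}
STRONG = {"L", "R", "AL"}


def type_char(text, pos, ch, keyboard_dir, paragraph_dir, strict=False):
    """Insert ch at pos and return (new_text, new_cursor_pos).

    A non-strong ch gets MARKER(keyboard_dir) right after it. In loose mode
    the marker is skipped when the keyboard and paragraph directions agree.
    """
    neutral = unicodedata.bidirectional(ch) not in STRONG
    needs_marker = neutral and (strict or keyboard_dir != paragraph_dir)
    piece = ch + (MARKER[keyboard_dir] if needs_marker else "")
    # The cursor lands after the marker, never between CHAR and its MARKER.
    return text[:pos] + piece + text[pos:], pos + len(piece)
```

Typing <BET>, space and "." with a Hebrew keyboard and "C++" with an English one, in an RTL paragraph and loose mode, yields the <BET> C+<LRM>+<LRM>. string shown above.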
2. Optimized LRM/RLM approach:
<BET> C++, Java and C#<LRM>.
(note: no markers were added after the "++" since "J" is a strong LTR char)
Rules:
1. When inserting neutral CHAR, search forward to find the first strong
directionality char (markers included):
1.1. If no such char was found, assume a found char with direction =
paragraph direction.
1.2. If the found char dir = current keyboard dir, do nothing.
1.3. If the found char dir != current keyboard dir, insert MARKER(keyboard
dir) after CHAR.
Considerations:
1. Same as approach [1], without item 5.
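
The forward search of approach 2 might look roughly like this. As before, "neutral" is assumed to mean "not strong", and the function names are hypothetical:

```python
import unicodedata

LRM, RLM = "\u200e", "\u200f"


def strong_direction(ch):
    """Return 'ltr'/'rtl' for strong chars (LRM/RLM included), else None."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "ltr"
    if bidi_class in ("R", "AL"):
        return "rtl"
    return None


def type_char_optimized(text, pos, ch, keyboard_dir, paragraph_dir):
    """Insert ch at pos and return (new_text, new_cursor_pos)."""
    piece = ch
    if strong_direction(ch) is None:
        # Rule 1: first strong char after the insertion point; rule 1.1:
        # fall back to the paragraph direction when there is none.
        following = next(
            (d for d in map(strong_direction, text[pos:]) if d),
            paragraph_dir,
        )
        # Rules 1.2 / 1.3: add a marker only when the directions differ.
        if following != keyboard_dir:
            piece += LRM if keyboard_dir == "ltr" else RLM
    return text[:pos] + piece + text[pos:], pos + len(piece)
```

One thing the sketch makes visible: when text is typed at the very end of an RTL paragraph, the search finds nothing, so a "+" typed with an English keyboard still gets an <LRM>. Arriving at the compact form in the example would seem to need an extra step that drops a marker once a matching strong character is typed after it, which the rules as written do not spell out.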
3. LRE/RLE/PDF approach:
* This is the closest to Microsoft RichEdit's approach. The difference is that
RTF has only 1 embedding level and thus no need for a <PDF> command.
Strict mode: <RLE><BET> <PDF><LRE>C++<PDF><RLE>.<PDF>
Loose mode: <BET> <LRE>C++<PDF>.
Rules:
1. When inserting any CHAR, scan backwards for the nearest RLE/LRE (noting
PDFs along the way, to ensure we won't find an RLE/LRE which was already
popped).
1.1. If the found embedding marker matches keyboard direction, do nothing.
1.2. Otherwise, add embedding marker (RLE/LRE) before the CHAR and PDF after
the CHAR.
Considerations:
1. Prevent cursor from being placed between RLE/LRE and CHAR.
2. Prevent cursor from being placed between CHAR and PDF.
3. Erasing an RLE/LRE will erase the CHAR after it.
4. Erasing a PDF will erase the CHAR before it.
5. If after applying 3 or 4, the RLE/LRE and PDF are adjacent, remove them.
6. As a final cleanup (after erasing and applying 3-5, if required), if we
have a PDF before the insertion point and an RLE/LRE after the insertion
point, try to join them (search back from the PDF to make sure the embedding
markers match).
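
A sketch of the insertion rule and the joining step of approach 3, for typing only (the erase considerations are left out). Two things here are assumptions rather than something stated above: in loose mode a CHAR outside any embedding is left bare when the keyboard direction equals the paragraph direction, and the join of consideration 6 is also run after every insertion. All function names are hypothetical:

```python
LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"
OPENER = {"ltr": LRE, "rtl": RLE}


def enclosing_embedding(text, pos):
    """Scan backwards from pos for the nearest RLE/LRE not yet popped."""
    pending_pdfs = 0
    for i in range(pos - 1, -1, -1):
        if text[i] == PDF:
            pending_pdfs += 1
        elif text[i] in (LRE, RLE):
            if pending_pdfs == 0:
                return text[i]
            pending_pdfs -= 1  # this opener was closed by a later PDF
    return None


def type_char_embedding(text, pos, ch, keyboard_dir, paragraph_dir, strict=False):
    """Insert ch at pos and return (new_text, new_cursor_pos)."""
    opener = OPENER[keyboard_dir]
    enclosing = enclosing_embedding(text, pos)
    if enclosing == opener:
        piece = ch                    # rule 1.1: already in a matching run
    elif enclosing is None and not strict and keyboard_dir == paragraph_dir:
        piece = ch                    # loose mode: base direction suffices
    else:
        piece = opener + ch + PDF     # rule 1.2
    return text[:pos] + piece + text[pos:], pos + len(piece)


def join_adjacent(text):
    """Merge <X>..<PDF><X>..<PDF> into <X>....<PDF> when both openers match."""
    out, stack, last_closed = [], [], None
    for c in text:
        if c in (LRE, RLE):
            stack.append(c)
            if last_closed == c and out and out[-1] == PDF:
                out.pop()             # drop the PDF and continue the old run
            else:
                out.append(c)
            last_closed = None
        elif c == PDF:
            last_closed = stack.pop() if stack else None
            out.append(c)
        else:
            out.append(c)
            last_closed = None
    return "".join(out)


def type_string(keystrokes, paragraph_dir, strict=False):
    """Type (char, keyboard_dir) pairs at the end of an empty paragraph."""
    text = ""
    for ch, keyboard_dir in keystrokes:
        text, _ = type_char_embedding(
            text, len(text), ch, keyboard_dir, paragraph_dir, strict)
        text = join_adjacent(text)
    return text
```

Typing <BET>, space and "." with a Hebrew keyboard and "C++" with an English one in an RTL paragraph gives the loose-mode and strict-mode strings shown at the top of this approach.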
Next Message by Date:
Re: RFC: Standard for automatic insertion of bidi control characters
On Tue, May 13, 2003, Ilya Konstantinov wrote about "RFC: Standard for
automatic insertion of bidi control characters":
> If all toolkits would implement "automatic control characters", it'd be a good
> idea for them to use the same algorithm. Here's why:
Please correct me if I'm wrong, but these control characters only work
with UTF-8, right? For somebody that uses ISO8859-8, all this discussion
will be irrelevant.
> 1. Control characters can be used in file names and similar identifiers. If
> one toolkit would use RLE/LREs while the other would use LRM/RLMs, it'd be
> impossible to type one toolkit's file name in another toolkit's edit box. If
> you think filenames should be forbidden from having control characters, tell
> me: why should a file name read "GNINRAEL c++" or "++c GNINRAEL" instead of
> proper "C++ GNINRAEL" (caps for RTL text)?
You are making a valid point, but perhaps the conclusion should be stronger -
no invisible characters (including control characters, LRM's, etc.) or even
overlaying characters (like niqqud) should be comfortably used in user-typed
file names.
I mean, surely these characters could "legally" be found in file names, just
like in the good-old-ASCII days you could create a file name with ^A, TAB,
newline, and other weird characters in it (at one time, even a space
was considered a weird character). When you had such filenames with
"invisible" or "control" characters, you could usually only use them through
a file menu or globbing - you could not hope to be able to type such a name
again in a shell to access the file directly, or understand the exact
characters in the file name just from "ls" output.
Since all these mixed-language filename issues normally apply only to
GUI users, and these users always use menus anyway, the users
might not even notice this problem.
> 2. Even in text documents, where it might seem the method doesn't matter,
> interoperability is also an issue, since the same text file can be opened in
> different editors or copied and pasted.
Indeed, I believe that in this case, the control-character-insertion method
is not *the* problem. The bigger issue is 100% compatibility of the bidi
algorithm. If both editors were 100% compatible in their bidi algorithm,
then if you create a file in one (no matter what sort of control characters
the editor adds automatically) and open it in another editor, the result
will look 100% the same.
However, here your control character insertion might play a role. The problem
that currently prevents 100% bidi compatibility (besides bugs in various bidi
implementations that don't do what the Unicode standard says) of plain text
is the differing heuristics for deciding base directions. If you can devise
a way to add control characters (say, in the beginning of every line or
every paragraph) that will cause the result to be identical in all known
bidi implementations, this might be worthwhile. But personally, I'd rather
see a standard heuristic for the base-direction issue emerge...
> My goal: Forming a standard for automatic insertion of BiDi control characters
> which'd be published on FreeDesktop.org and/or Unicode.org.
>
> Decisions to make:
> 1. Should an initial RLM/LRM be inserted when the base direction is already
> correct thanks to the first strong character policy?
This might be useful as I pointed out above, to account for different
base-direction heuristics in different applications.
> 2. Neutral characters' direction can be changed by either adding LRM/RLM after
> them or by wrapping them in LRE/RLE - PDF. Which should we use? (Using
> LRM/RLM has the benefit of avoiding the need for PDF, but LRE/RLE fit better
> logically for what we're doing and generally feel like The Right Thing(tm) to
> do to me.)
Shouldn't we go for "be lenient in what you accept"? A good unicode-supporting
editor will need to support all ways of doing the same thing in its input.
When it writes output, it will be allowed to "canonize" these things to
the form it likes best, or try to leave them as similar as possible to what
was originally read (compare, for example, Windows editors that can work
with either LF or CR-LF line separators).
> 3. Should the standard require using control characters in the most compact way,
> e.g. requiring *not* to use LREs when base direction is LTR, requiring
> joining adjacent embedding blocks?
Wouldn't that contradict the Unicode standard, which does allow one to
write any sort of bizarre sequence of Unicode characters?
> 4. "Normalization" of control characters. Drop out control characters when
> comparing identifiers? Compare identifiers in their "visual" form?
See what I wrote above about canonization, and why I believe that
"identifiers" (like file names, variable names, etc.) should not contain any
control characters - or if they do, the user should not expect to be able
to type their name by hand, only to choose them from a menu.
--
Nadav Har'El | Tuesday, May 13 2003, 12 Iyyar 5763
nyh-TS7m/3hpY0sOpacJJkBjfT4kX+cae0hd@xxxxxxxxxxxxxxxx
|-----------------------------------------
Phone: +972-53-245868, ICQ 13349191 |If con is the opposite of pro, is
http://nadav.harel.org.il |congress the opposite of progress?