Re: multi-byte character sets in sed: msg#00032

editors.sed.user


On Thu, 08 Dec 2005 01:19:34 +0800, Paolo Bonzini wrote:

> Yes, starting from version 4.1 on sed has full support for MBCS. The
> environment variables are LC_CTYPE and LC_COLLATE. Though, if you use
> them you may encounter weird behavior when a script expects an
> environment with the default values of these variables, i.e. LC_CTYPE=C
> LC_COLLATE=C: for example some locales demand that ranges (e.g. [A-Z])
> match case-insensitively, and this is by now the most reported sed
> non-bug (this behavior is mandated by POSIX).

I set LC_CTYPE to GBK, but I cannot see any difference in my example.
Consider the following:

$ od -t x1 sample
0000000 61 b0 a2 0a
0000004

; the first byte, '\x61', is the letter 'a'
; the second and third bytes, '\xb0\xa2', form one Chinese character.
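
For anyone who wants to reproduce this, here is a minimal sketch that
recreates the same four bytes. It assumes a POSIX shell with printf,
od, tr and mktemp available; the temporary file stands in for 'sample',
and the variable names are only illustrative.

```shell
# Recreate the test file: 'a' (0x61), then 0xb0 0xa2, then newline.
# printf takes octal escapes: \260 is 0xb0 and \242 is 0xa2.
sample=$(mktemp)
printf 'a\260\242\n' > "$sample"

# Dump the bytes as hex, with the offset column and whitespace removed.
dump=$(od -An -t x1 "$sample" | tr -d ' \n')
echo "$dump"

rm -f "$sample"
```

The dump should list the bytes in the same order as the od output above.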

With 'set encoding=cp936', the substitution 's/\(.\)\(.\)/\2\1/' in gVim
(Windows version) yields:
:.!od -t x1
0000000 b0 a2 61 0a
0000004

Once the encoding is set, gVim knows that the second and third bytes
represent one Chinese character and treats them as a unit, so the
second '.' on the LHS actually matches two bytes (this is what I want).
As you can see, '\xb0\xa2' was moved to the front. Compare this with
sed in Cygwin:
----------------------
hq00e@fzpdbl3icwkpg7a ~
$ export LC_CTYPE=zh_CN.GBK

hq00e@fzpdbl3icwkpg7a ~
$ sed 's/\(.\)\(.\)/\2\1/' sample |od -t x1
0000000 b0 61 a2 0a
0000004

hq00e@fzpdbl3icwkpg7a ~
$ sed --version
GNU sed version 4.1.4
...
----------------------
In sed, the double-byte character '\xb0\xa2' was split: only its first
byte was swapped with 'a'. I get the same result after changing
LC_CTYPE to zh_CN.GB2312 or zh_CN.UTF-8, so I cannot see the MBCS
support doing anything here. Is this a Cygwin problem, a Windows
problem, or my own mistake?
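
One way to narrow this down, as a rough sketch only: run the same
substitution under the C locale and under the requested locale, then
compare the bytes. The helper name swap_hex is hypothetical. The sketch
assumes that a program whose requested locale is not installed stays in
the C locale, where every byte counts as one character.

```shell
# swap_hex LOCALE: run the swap under the given LC_CTYPE and print the
# resulting bytes as one hex string. LC_ALL is emptied so that it
# cannot override LC_CTYPE.
swap_hex() {
    printf 'a\260\242\n' |
        LC_ALL= LC_CTYPE="$1" sed 's/\(.\)\(.\)/\2\1/' |
        od -An -t x1 | tr -d ' \n'
}

for loc in C zh_CN.GBK; do
    printf '%s: %s\n' "$loc" "$(swap_hex "$loc")"
done
```

If both lines print the same bytes, the zh_CN.GBK setting most likely
never took effect, which would point at the platform's locale support
rather than at sed's regex code.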

As Ruud suggested, I could first convert the file to UTF-16 and then
use '..' to match a character, but that adds unnecessary complexity.

Note: given a UTF-16LE file, I can swap two letters with
's/\(..\)\(..\)/\2\1/' using the native port of gsed in a DOS box. But
the same command does not work with the Cygwin version of gsed.
What is the problem?

What exactly is MBCS, and where can I find documentation on it (or
which keywords should I search for in Google)? I tried 'man setlocale'
but found little information. With MBCS support on, which sed commands
or metacharacters behave differently under different encodings?

--
Regards,
hq00e


