Re: multi-byte character sets in sed: msg#00032 (editors.sed.user)
On Thu, 08 Dec 2005 01:19:34 +0800, Paolo Bonzini wrote:

> Yes, starting from version 4.1 on sed has full support for MBCS. The
> environment variables are LC_CTYPE and LC_COLLATE. Though, if you use
> them you may encounter weird behavior when a script expects an
> environment with the default values of these variables, i.e. LC_CTYPE=C
> LC_COLLATE=C: for example some locales demand that ranges (e.g. [A-Z])
> match case-insensitively, and this is by now the most reported sed
> non-bug (this behavior is mandated by POSIX).

I set LC_CTYPE to GBK, but I cannot see any difference in my example.
Consider the following file:

    $ od -t x1 sample
    0000000 61 b0 a2 0a
    0000004

The first byte, 61, is the letter 'a'; the second and third bytes,
b0 a2, together form one Chinese character.

With 'set encoding=cp936', the script 's/\(.\)\(.\)/\2\1/' in gVim
(Windows version) yields:

    :.!od -t x1
    0000000 b0 a2 61 0a
    0000004

Because the encoding is set, gVim knows that the second and third bytes
represent one Chinese character and treats them as a unit, so the second
'.' on the LHS actually matches two bytes (which is what I want). As you
can see, the two-byte character was moved to the front. When I use sed
in Cygwin, however:

----------------------
hq00e@fzpdbl3icwkpg7a ~
$ export LC_CTYPE=zh_CN.GBK

hq00e@fzpdbl3icwkpg7a ~
$ sed 's/\(.\)\(.\)/\2\1/' sample |od -t x1
0000000 b0 61 a2 0a
0000004

hq00e@fzpdbl3icwkpg7a ~
$ sed --version
GNU sed version 4.1.4
...
----------------------

Here sed split the double-byte character b0 a2 in two. I get the same
result after changing LC_CTYPE to zh_CN.GB2312 and zh_CN.UTF-8, so I
cannot see the MBCS support doing anything. Is this a Cygwin problem, a
Windows problem, or my own mistake?

As Ruud suggested, I can first convert the file to UTF-16 and then use
'..' to match one character, but that adds unnecessary complexity.

Note: given a UTF-16LE file, I can swap two letters with
's/\(..\)\(..\)/\2\1/' using the native port of gsed in a DOS box, but
the same command doesn't work with the Cygwin version of gsed. What's
the problem?

What exactly is MBCS? Where can I find documentation on it (or which
keywords should I search for)? I tried 'man setlocale' but found little
information. With MBCS support on, which sed commands or metacharacters
behave differently in different encodings?

--
Regards,
hq00e
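The byte-by-byte result reported above can be reproduced without the sample file. What follows is a minimal sketch, not something taken from the thread: the helper names make_sample, hex, and swap_in_locale are made up for illustration, and it assumes a POSIX shell with printf, od, tr, grep, locale, and sed available. It runs the same substitution under the C locale, where every byte counts as one character, and then tries zh_CN.GBK only if 'locale -a' lists it.

```shell
#!/bin/sh
# Sample line from the post: 'a' (0x61) followed by the two bytes
# 0xb0 0xa2 (one GBK character), then a newline.
make_sample() {
    printf 'a\260\242\n'
}

# Dump stdin as a compact hex string, e.g. "61b0a20a".
hex() {
    od -An -t x1 | tr -d ' \n'
}

# Swap the first two "characters" of the sample under the locale in $1.
# (Hypothetical helper; LC_ALL overrides LC_CTYPE and LC_COLLATE.)
swap_in_locale() {
    make_sample | LC_ALL="$1" sed 's/\(.\)\(.\)/\2\1/' | hex
}

# In the C locale each byte is a character, so the pair b0 a2 is split.
echo "C:         $(swap_in_locale C)"

# Only try the GBK locale if the system actually provides it.
if locale -a 2>/dev/null | grep -qix 'zh_CN\.gbk'; then
    echo "zh_CN.GBK: $(swap_in_locale zh_CN.GBK)"
else
    echo "zh_CN.GBK: locale not installed; sed will most likely work on single bytes"
fi
```

If the zh_CN.GBK line shows the same bytes as the C line, or the locale is reported as missing, the locale setting is probably not reaching sed's regex matcher. That would be one possible reason for the output in the post, though nothing in the thread confirms it.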