Re: Grep for Unicode (was: Grep for Windows): msg#00120 science.linguistics.corpora
Hi,

On Dec 16, 2006, at 7:26 PM, Mike Maxwell wrote:

> Gnu grep 2.5.1 supports Unicode, though I guess it's debatable just how useful it is. The next version is supposed to be much better on that front.

It doesn't do anything special with Unicode itself, but if the locale is set to a multibyte encoding it uses the wide character support routines in libc. So, for example, if the LANG environment variable is set to en_US.utf8, it treats the input as UTF-8. It works, in the sense that "." matches a single character rather than a single byte, the character classes like "[:alpha:]" and "[:lower:]" are handled correctly, and so on, but it's not as flexible as one might like.

> I did google some Red Hat info on updates to grep, which do speak about a Unicode issue (apparently an earlier version had an extreme inefficiency in the way it searched UTF-8 streams).

Using mbstowcs and co. is much, much slower than grep's internal byte matching, which makes grep something like 100 times slower if the locale is set to use wide characters. I just tried this on a machine running Fedora Core 5:

    bulba% egrep -V
    egrep (GNU grep) 2.5.1

    Copyright 1988, 1992-1999, 2000, 2001 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions. There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

    bulba% export LANG=en_US.iso8859-1
    bulba% time egrep "a[^a]+b" nyt9601.txt >/dev/null

    real    0m0.068s
    user    0m0.062s
    sys     0m0.006s

    bulba% export LANG=en_US.utf8
    bulba% time egrep "a[^a]+b" nyt9601.txt >/dev/null

    real    0m2.695s
    user    0m2.688s
    sys     0m0.007s

This is supposed to be fixed in the next version.

> The other is to search for a particular character sequence. For that, two things seem to be necessary: it needs to know the encoding of the incoming stream (UTF-8, UTF-16 big-end/little-end,...), and it needs to handle normalization.

It doesn't really do either of these, unfortunately. It gets the encoding from the locale, not the input file, and as far as I know it doesn't do any normalization at all. As I say, it's debatable just how useful it is.

---
Rob Malouf <rmalouf@xxxxxxxxxxxxx>
Department of Linguistics and Asian/Middle Eastern Languages
San Diego State University
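
The two behaviours discussed above (whether "." consumes a byte or a character, and what happens when nothing normalizes the text) can be illustrated with a small Python sketch. This is not how grep is implemented; it only uses Python's standard `re` and `unicodedata` modules, and the helper names `matches_n_units` and `find_normalized` are hypothetical, made up for this illustration.

```python
import re
import unicodedata

composed = "caf\u00e9"      # "café" with é as one code point (U+00E9)
decomposed = "cafe\u0301"   # "café" as e + combining acute accent (U+0301)


def matches_n_units(data, n):
    """Return True if `data` is exactly n units long, where each unit is
    whatever "." matches: a byte for bytes input, a character for str."""
    dot = b"." if isinstance(data, bytes) else "."
    return re.fullmatch(dot * n, data) is not None


def find_normalized(needle, haystack, form="NFC"):
    """Substring search after normalizing both sides to the same form."""
    return (unicodedata.normalize(form, needle)
            in unicodedata.normalize(form, haystack))


raw = composed.encode("utf-8")  # é becomes two bytes in UTF-8

print(matches_n_units(composed, 4))           # True: four characters
print(matches_n_units(raw, 4))                # False: five bytes, not four
print(composed in decomposed)                 # False: no normalization
print(find_normalized(composed, decomposed))  # True: both sides in NFC
```

The byte view corresponds roughly to searching under a single-byte locale, the character view to searching under a UTF-8 locale. The last two lines show why a search tool that skips normalization can miss text that looks identical on screen.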