Re: a simple algorithm problem: msg#00002 search.snowball
Olly,

I would answer 'yes' to all the points made in your last email. Thanks for reporting the error in the Snowball manual. It will be fixed in the next cvs commit.

There are various ways to proceed ...

In the 2-byte character version of Snowball (standard for the Java codegenerator) you can define characters as decimal or hex numbers in the range 257 to 64K. These characters can go into character tables, which are implemented as bitmaps. Of course, working with 256 characters, the bitmap never exceeds 32 bytes, and will frequently be smaller, since the bitmap is truncated at both ends by removing runs of zeros. Working with 64K characters, a bitmap might go up to 8K in size, which is not an intolerable overhead. In practice they are much smaller, since the codes we need in the stemmers do not have high Unicode values.

So one idea is to declare 'utf8' in the Snowball script, allowing character defs in the range 0-64K, as in the 2-byte character version. Characters could be written with their Unicode values, and encoded in utf-8 form in strings.

Looking again at the cursor movement issues, if I've got my thinking right:

In goto and gopast, cursor movement can be done one byte at a time (cursor++; or cursor--;). If expression C is made up of well-formed utf-8 strings, 'goto C' must either fail or end on a valid character boundary.

'next' requires its own implementation, and as you say, backward movement is not a problem. 'next' is implicit in character tests (vowel, non-vowel) and in hop N (= 'next' done N times).

I'm not sure 'size' is used in the Snowball scripts: it could be defined to give an approximate answer in the utf-8 case, or implemented exactly.

Presumably a 'utf8' declaration would simply be ignored by the Java codegenerator (Richard B to confirm).

Obviously we are working towards a standard header:

    utf8
    define GREEK_CAPITAL_LETTER_OMICRON hex 039F
    . . . .

of Unicode characters, and it would be nice to use the Unicode names, were they not (as this example shows) so very long.

- - -

Another idea I had was just to create modified versions of the existing scripts so they will work with utf-8 encoded strings, even while Snowball knows nothing about utf-8. That could be done with no further changes to Snowball.

Incidentally, do you have a view on the use of free-floating accents (Unicode 0300-036F)?

Martin
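
The cursor movement and bitmap sizes discussed in the message above can be sketched roughly as follows. This is only an illustration in Python, not Snowball's actual runtime: the names `next_char`, `hop` and `bitmap_size` are hypothetical, the input is assumed to be well-formed utf-8, and the bitmap is assumed to be trimmed in whole bytes.

```python
def next_char(buf: bytes, cursor: int) -> int:
    """Move forward over one utf-8 character; return -1 if already at the end."""
    if cursor >= len(buf):
        return -1
    cursor += 1
    # Continuation bytes have the bit pattern 10xxxxxx; skip them so the
    # cursor always stops on a character boundary.
    while cursor < len(buf) and (buf[cursor] & 0xC0) == 0x80:
        cursor += 1
    return cursor


def hop(buf: bytes, cursor: int, n: int) -> int:
    """'hop N' treated as 'next' done N times; -1 if any step fails."""
    for _ in range(n):
        cursor = next_char(buf, cursor)
        if cursor < 0:
            return -1
    return cursor


def bitmap_size(codes) -> int:
    """Bytes needed for a one-bit-per-character table, with the all-zero
    bytes below the lowest code and above the highest code removed
    (assumption: truncation happens in whole bytes)."""
    codes = list(codes)
    if not codes:
        return 0
    return max(codes) // 8 - min(codes) // 8 + 1


# Three Greek letters, two bytes each in utf-8.
word = "\u039f\u03bc\u03b7".encode("utf-8")
after_first = next_char(word, 0)
after_all = hop(word, 0, 3)
```

With a full 256-character table `bitmap_size` gives 32 and with a full 64K table it gives 8192, matching the figures quoted in the message; a small vowel table such as "aeiouy" needs only a few bytes.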