Re: a simple algorithm problem: msg#00000search.snowball
Ayhan,

Thank you for the sample Snowball script. (I would have defined the utf-8 characters through stringdefs, following section 3 of the Snowball manual.)

You ask if there is any way around 'size' being wrong when utf-8 characters are included. This is all part of the problem of character representation. The best solution would be an additional compilation mode for utf-8, but of course that would mean yet another elaboration of Snowball itself ...

Right now there is 8-bit working, and 16-bit working. 16-bit working fits in well with the Java codegenerator scheme, and can be used with the ANSI C codegenerator, although it has given rise to confusion on at least one occasion.

In Snowball the concept of 'character' only turns up in a few contexts:

a) hop N - to hop forward N characters
b) next = hop 1
c) goto C
d) gopast C - where you keep doing a 'next' until C is successful
e) size - counts the number of characters

and in 'groupings'.

In retrospect, I occasionally wish groupings were not in the language. Instead of

A) define vowel 'aeiou'

one could have

B) define vowel as among('a' 'e' 'i' 'o' 'u')

(A) is implemented as a bitmap, and (B) as a fast table lookup, and (A) is faster than (B), but optimisation in the codegenerator could turn (B) into a bitmap as well. There are other differences however: (B) needs to be defined in a 'forward' or 'backward' context; non-vowel is a neat test that works with style (A) but not (B).

If groupings were NOT in the language, you could reduce the difference between utf-8 and single character working to the definition of a couple of macros PREV and NEXT (thinking of ANSI C codegeneration) that move the character cursor left or right by one place, and that only turn up in the definitions of (a) to (e) above.

---

I have rather mixed feelings about utf-8. It is of course in widespread use. It is especially convenient for languages using Roman letters with certain extensions (for example, your Turkish following the 1928 reforms of Ataturk).
But it seems to me to be singularly clumsy for languages based on other alphabets (Russian, Greek, Arabic).

Martin

At 21:31 31/12/2004 +0000, ayhan peker wrote:
>Martin hi,
>I have made some changes.
>It looks like if everything is in utf-8 you dont need to do string
>definitions at all.
>The algorithm works as it is except that size is wrong. As you said
>"Snowball thinks it is two characters".
>Is there a way round it?
>About turkish stemming in mtu. I knew they were working on it. I wish
>they put something up more concrete (the code). It might very well be
>all in theory.
>
>Ayhan
>btw. Happy Christmas and happy new year.
>
>The code:
>
>routines (
>    mark_regions
>    R1
>    common_suffix
>)
>externals ( stem )
>integers ( p1 p3)
>groupings ( v all )
>stringescapes {}
>
>/* special characters (in turkish) */
>
>stringdef u" hex 'FC' // u with diaeresis
>stringdef i^ hex 'FD' //
>stringdef o" hex 'F6' //
>stringdef s, hex 'FE' //
>stringdef c, hex 'E7' //
>stringdef g^ hex 'F0' //
>
>define v 'aeiouüöı'//{u"}{o"}{i^}'
>define all
>'aeiouüöışçğqwrtyplkjhgfdszxcvbnm1234567890!£$%^&*()-_=+[]@~;:/?><#.'
>define mark_regions as (
>    $p1 = limit
>
>    $p3=size
>    do (
>        ( gopast v gopast non-v) setmark p1
>    )
>)
>backwardmode (
>    define R1 as $p1 <= cursor
>
>    define common_suffix as (
>        [substring] among(
>            'ler' 'lar' 'diler' 'dular' 'dılar' 'düler'
>            'tiler' 'tular' 'tılar' 'tüler' 'dir' 'dır' 'miş' 'mış' 'müş'
>            'muş' 'mişler' 'mışlar' 'müşler' 'muşlar'
>            (R1 delete)
>        )
>    )
>)
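
The PREV and NEXT idea in the reply above can be sketched in C roughly as follows. This is only an illustration, not code from the Snowball runtime or its codegenerator: the names utf8_next, utf8_prev, utf8_hop and utf8_size are hypothetical, and the sketch assumes the buffer holds well-formed utf-8, so that every byte of the form 10xxxxxx is a continuation byte. The cursor is a byte offset, and a 'character' is a lead byte plus any continuation bytes that follow it.

```c
#include <stddef.h>

/* Assumption: input is well-formed utf-8. A byte of the form 10xxxxxx
   continues a multi-byte character; any other byte starts one. */
static int is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

/* NEXT (hypothetical helper): move cursor c right by one character
   within p[0..limit). Returns the new cursor, or -1 at the limit. */
int utf8_next(const unsigned char *p, int c, int limit) {
    if (c >= limit) return -1;
    c++;
    while (c < limit && is_continuation(p[c])) c++;
    return c;
}

/* PREV (hypothetical helper): move cursor c left by one character,
   not going below limit_backward. Returns the new cursor, or -1. */
int utf8_prev(const unsigned char *p, int c, int limit_backward) {
    if (c <= limit_backward) return -1;
    c--;
    while (c > limit_backward && is_continuation(p[c])) c--;
    return c;
}

/* 'hop N' expressed as N applications of NEXT; -1 if the hop fails. */
int utf8_hop(const unsigned char *p, int c, int limit, int n) {
    while (n-- > 0) {
        c = utf8_next(p, c, limit);
        if (c < 0) return -1;
    }
    return c;
}

/* 'size' counting characters rather than bytes. */
int utf8_size(const unsigned char *p, int len) {
    int n = 0, c = 0;
    while ((c = utf8_next(p, c, len)) != -1) n++;
    return n;
}
```

With helpers like these, the four bytes of 'gül' in utf-8 would count as three characters, which is the behaviour Ayhan was asking for from 'size'. Groupings are not addressed here; a bitmap indexed by a single byte would not carry over to multi-byte characters unchanged.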