Re: a simple algorithm problem: msg#00001 (search.snowball)
On Tue, Jan 04, 2005 at 10:35:13AM +0000, Martin Porter wrote:
> In retrospect, I occasionally wish groupings were not in
> the language. Instead of
>
>   A) define vowel 'aeiou'
>
> one could have
>
>   B) define vowel as among('a' 'e' 'i' 'o' 'u')
>
> (A) is implemented as a bitmap, and (B) as a fast table lookup, and (A) is
> faster than (B), but optimisation in the codegenerator could turn (B) into a
> bitmap as well.
>
> There are other differences however: (B) needs to be defined
> in a 'forward' or 'backward' context; non-vowel is a neat test that works
> with style (A) but not (B).

Could (A) simply be handled internally as a shorthand for defining form (B) in both forward and backward context, without changing the meaning of existing code? As you say, the code generator can optimise this to give the same generated code as at present (although it may not be worth the complications of using a bitmap for multi-byte utf-8 characters).

The manual defines "non-vowel" as the same as "(not vowel next)". Wouldn't that work for the (B) version too? In which case just extending where "non" can be used solves that issue.

> If groupings were NOT in the language, you could reduce the difference
> between utf-8 and single character working to the definition of a couple of
> macros PREV and NEXT (thinking of ANSI C codegeneration) that move the
> character cursor left or right by one place, and that only turn up in the
> definitions of (a) to (e) above.

It would be great if snowball could process utf-8 directly. Although characters are variable width, you can at least write a simple and efficient "PREV" macro for utf-8 (because the first byte of a character is always in a particular range which isn't used for subsequent bytes).

We want utf-8 stemming for Xapian, so I'm going to have to address this somehow...

[ Incidentally, I think there's an error in the manual where it talks about among. Look at http://snowball.tartarus.org/p/snowman.html which says:

  The effect of obeying substring when the preceding among is not obeyed is undefined. This would happen for example here,

    try($x != 617 substring)
    among(...) // 'substring' is bypassed in the exceptional case where x == 617

I think substring and among are switched in the first sentence, and that should be:

  "The effect of obeying *among* when the preceding *substring* is not obeyed is undefined." ]

Cheers,
Olly
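
The point above about a simple "PREV" for utf-8 can be sketched roughly as follows. This is only an illustration, not code from the snowball runtime: the names IS_CONT, utf8_prev and utf8_next are hypothetical, the helpers are written as functions rather than macros for readability, and the input is assumed to be well-formed utf-8. It relies on the fact that every byte after the first in a multi-byte character has the bit pattern 10xxxxxx, which a leading byte never has.

```c
#include <stddef.h>

/* A utf-8 continuation byte (any byte after the first byte of a
 * multi-byte character) always has the bit pattern 10xxxxxx. */
#define IS_CONT(b) (((unsigned char)(b) & 0xC0) == 0x80)

/* Hypothetical helper: move cursor c one character to the left in buffer p.
 * lb is the lower bound of the buffer.
 * Returns the new cursor, or -1 if c is already at the lower bound. */
static int utf8_prev(const unsigned char *p, int c, int lb)
{
    if (c <= lb) return -1;
    c--;
    /* Skip back over continuation bytes until we reach a leading byte. */
    while (c > lb && IS_CONT(p[c])) c--;
    return c;
}

/* Hypothetical helper: move cursor c one character to the right.
 * l is the limit (one past the last byte).
 * Returns the new cursor, or -1 if c is already at the limit. */
static int utf8_next(const unsigned char *p, int c, int l)
{
    if (c >= l) return -1;
    c++;
    /* Skip forward over the continuation bytes of the current character. */
    while (c < l && IS_CONT(p[c])) c++;
    return c;
}
```

Neither direction needs to decode the character or consult a length table, so stepping backward costs about the same as stepping forward.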