|
|
Choosing A Webhost: |
Re: a simple algorithm problem: msg#00006search.snowball
On Thu, Jan 06, 2005 at 09:12:34AM +0000, Martin Porter wrote: > Working with 64K characters, a bitmap might go up to 8K in size, which is > not an intolerable overhead. In practice they are much smaller, since the > codes we need in the stemmers do not have high Unicode values. I don't really see this as a big problem. You can always just dump out a C switch statement and let the C compiler's code generator decide on the best way to represent it. > So one idea is to declare 'utf8' in the Snowball script, allowing character > defs in the range 0-64K, as in the 2-byte character version. Characters > could be written with their Unicode values, and encoded in utf-8 form in > strings. I doubt most of the stemmers care about characters beyond a 2 byte utf-8 representation. So the answer is likely to be the same for any sequence with the same first 2 bytes, and you don't even need to decode the character fully. Look at the first byte to see if this is a one byte, two byte, or longer sequence, and have tables for the first two and a fixed answer for the third. > Looking again at the cursor movement issues: > > If I've got my thinking right: in goto and gopast, cursor movement can be > done one byte at a time, (cursor++; or cursor--;). If expression C is made > up of well-formed utf-8 strings. 'goto C' must either fail, or end on a > valid character boundary. That's true. > 'next' requires its own implementation, and as you say, backward movement is > not a problem. > > 'next' is implicit in character tests (vowel, non-vowel), and hop N (= > 'next' done N times). > > I'm not sure 'size' is used in the Snowball scripts: it could be defined to > give an approximate answer in the utf-8 case, or implemented exactly. Hmm, I don't like the idea that the encoding would affect the stemming algorithm. Thinking about how size might be used, it'll be in comparisons. The number of bytes is an upper bound on the number of characters. You can also get a lower bound by dividing it by 6 (IIRC). The code generator can probably use these two bounds to avoid actually walking the string completely in many cases. E.g. if you want to test if the stem is least 10 characters, and it's 8 bytes long, you don't need to walk the string. > Incidentally, do you have a view on the use of free-floating accents? > (Unicode 0300-036F) Ideally you want to normalise text before stemming. There's information on unicode normalisation here: http://www.unicode.org/unicode/reports/tr15/ I've not explored this in great depth, but for latinate languages, I suspect pragmatically you can just strip out combining accents for stemming purposes, and the results will be pretty much what you'd get for text that was written without access to accented forms - umlauts in German are the main exception I'm aware of ("ä" (a-umlaut) -> "ae", etc). But I believe that in Swedish, you'd just write "a" in this situation. Cheers, Olly
|
|
| <Prev in Thread] | Current Thread | [Next in Thread> |
|---|---|---|
| Previous by Date: | Re: a simple algorithm problem, James Aylett |
|---|---|
| Next by Date: | Hungarian stemmer, A. Tordai |
| Previous by Thread: | Re: a simple algorithm problem, James Aylett |
| Next by Thread: | Re: a simple algorithm problem, Martin Porter |
| Indexes: | [Date] [Thread] [Top] [All Lists] |
Free MagazinesCisco NewsReceive a free quarterly e-newsletter with exclusive articles on how Cisco IT uses its own products and solutions to enable the business. subscribe Systems Management News, the newspaper for IT systems administration and data center managers! Each issue of Systems Management News is chock-full of news and analysis to help you understand what's happening in your field. subscribe The Enterprise Newsweekly eWeek is the essential technology information source for builders of e-business. subscribe Oracle Magazine Oracle Magazine contains technology strategy articles, sample code, tips, Oracle and partner news, how to articles for developers and DBAs, and more. Oracle (NASDAQ: ORCL) is the world's largest enterprise software company. subscribe Total Telecom Total Telecom is "The Economist of the communications industry". subscribe |