
Re: a simple algorithm problem: msg#00001

search.snowball

Subject: Re: a simple algorithm problem

On Tue, Jan 04, 2005 at 10:35:13AM +0000, Martin Porter wrote:
> In retrospect, I occasionally wish groupings were not in
> the language. Instead of
>
> A) define vowel 'aeiou'
>
> one could have
>
> B) define vowel as among('a' 'e' 'i' 'o' 'u')
>
> (A) is implemented as a bitmap, and (B) as a fast table lookup, and (A) is
> faster than (B), but optimisation in the codegenerator could turn (B) into a
> bitmap as well.
>
> There are other differences however: (B) needs to be defined
> in a 'forward' or 'backward' context; non-vowel is a neat test that works
> with style (A) but not (B).

Could (A) simply be handled internally as a shorthand for defining form (B) in
both forward and backward context, without changing the meaning of existing
code? As you say, the code generator can optimise this to give the same
generated code as at present (although it may not be worth the complications of
using a bitmap for multi-byte utf-8 characters).

The manual defines "non-vowel" as the same as "(not vowel next)". Wouldn't
that work for the (B) version too? In which case just extending where "non"
can be used solves that issue.
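
To make the comparison concrete, here is a rough sketch in C of a grouping held as a bitmap, with "vowel" and "non-vowel" as cursor tests in a forward context. All names here (grouping, grouping_init, match_grouping, match_non_grouping) are invented for illustration and are not what the Snowball code generator emits. It assumes single-byte characters, and it reads "non-vowel" as "(not vowel next)", so it fails at the end of the string.

```c
#include <string.h>

/* Hypothetical sketch, not Snowball's generated code.
 * A grouping is a 256-bit bitmap: one bit per possible byte value. */
typedef struct { unsigned char bits[32]; } grouping;

/* Set the bit for every byte in the NUL-terminated member list. */
static void grouping_init(grouping *g, const char *members) {
    memset(g->bits, 0, sizeof g->bits);
    for (; *members != '\0'; members++) {
        unsigned char ch = (unsigned char)*members;
        g->bits[ch >> 3] |= (unsigned char)(1u << (ch & 7));
    }
}

/* Membership test: one shift and one mask. */
static int in_grouping(const grouping *g, unsigned char ch) {
    return (g->bits[ch >> 3] >> (ch & 7)) & 1;
}

/* "vowel": succeed and advance the cursor if the next character is in g. */
static int match_grouping(const grouping *g, const char *s, int len, int *cursor) {
    if (*cursor >= len || !in_grouping(g, (unsigned char)s[*cursor])) return 0;
    (*cursor)++;
    return 1;
}

/* "non-vowel", read as "(not vowel next)": succeed and advance the cursor
 * if there is a next character and it is not in g. */
static int match_non_grouping(const grouping *g, const char *s, int len, int *cursor) {
    if (*cursor >= len || in_grouping(g, (unsigned char)s[*cursor])) return 0;
    (*cursor)++;
    return 1;
}
```

With form (B) the membership test would be a table lookup instead, but the negated test could keep the same shape: fail if the lookup matches, otherwise step over one character.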

> If groupings were NOT in the language, you could reduce the difference
> between utf-8 and single character working to the definition of a couple of
> macros PREV and NEXT (thinking of ANSI C codegeneration) that move the
> character cursor left or right by one place, and that only turn up in the
> definitions of (a) to (e) above.

It would be great if snowball could process utf-8 directly. Although
characters are variable width, you can at least write a simple and efficient
"PREV" macro for utf-8 (because the first byte of a character is always in
a particular range which isn't used for subsequent bytes).
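
As a sketch of what I mean (written as functions rather than macros for readability; the names utf8_prev and utf8_next are made up, and the input is assumed to be valid utf-8): continuation bytes always have the bit pattern 10xxxxxx, so moving left is just stepping back over those until you reach a byte that starts a character.

```c
#include <stddef.h>

/* Hypothetical sketch; assumes s holds valid UTF-8. */

/* Continuation bytes are 10xxxxxx, i.e. 0x80..0xBF. */
static int is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

/* Move the cursor left by one character: step back one byte, then keep
 * stepping back while sitting on a continuation byte.
 * Returns the new byte offset; at offset 0 it stays at 0. */
static size_t utf8_prev(const unsigned char *s, size_t pos) {
    if (pos == 0) return 0;
    do {
        pos--;
    } while (pos > 0 && is_continuation(s[pos]));
    return pos;
}

/* Move the cursor right by one character: step forward one byte, then
 * skip any continuation bytes that follow.
 * Returns the new byte offset; at or past len it returns len. */
static size_t utf8_next(const unsigned char *s, size_t len, size_t pos) {
    if (pos >= len) return len;
    do {
        pos++;
    } while (pos < len && is_continuation(s[pos]));
    return pos;
}
```

Neither direction needs to decode the character, so the cost over the single-byte case is a short loop that usually runs once for mostly-ASCII text.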

We want utf-8 stemming for Xapian, so I'm going to have to address this
somehow...

[
Incidentally, I think there's an error in the manual where it talks about
among. Look at http://snowball.tartarus.org/p/snowman.html which says:

The effect of obeying substring when the preceding among is not obeyed is
undefined. This would happen for example here,

try($x != 617 substring)
among(...) // 'substring' is bypassed in the exceptional case where x == 617

I think substring and among are switched in the first sentence, and that it
should be: "The effect of obeying *among* when the preceding *substring* is not
obeyed is undefined."
]

Cheers,
Olly

