Re: a simple algorithm problem: msg#00002

search.snowball

Subject: Re: a simple algorithm problem


Olly,

I would answer 'yes' to all the points made in your last email. Thanks for
reporting the error in the Snowball manual. It will be fixed in the next CVS commit.

There are various ways to proceed ...

In the 2-byte character version of Snowball (standard for the Java
codegenerator) you can define characters as decimal or hex numbers in the
range 0 to 64K. These characters can go into character tables, which are
implemented as bitmaps. Of course, working with 256 characters, the bitmap
never exceeds 32 bytes -- and will frequently be smaller, since the bitmap is
truncated at both ends by removing runs of zeros.

Working with 64K characters, a bitmap might go up to 8K in size, which is
not an intolerable overhead. In practice the bitmaps are much smaller, since
the codes we need in the stemmers do not have high Unicode values.
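
The bitmap sizes above can be illustrated with a minimal sketch. This is not the Snowball runtime; the names build_bitmap and in_bitmap are hypothetical, and the layout (one bit per code, keeping only the bytes between the lowest and highest code present) is an assumption about what "truncated at both ends" means.

```python
def build_bitmap(codes):
    """Return (base, bitmap) for a set of character codes.

    One bit per code. Runs of zero bytes at both ends are dropped by
    keeping only the bytes between the lowest and highest code present.
    """
    if not codes:
        return 0, b""
    lo, hi = min(codes), max(codes)
    base = lo // 8                      # index of the first byte kept
    bitmap = bytearray(hi // 8 - base + 1)
    for c in codes:
        bitmap[c // 8 - base] |= 1 << (c % 8)
    return base, bytes(bitmap)


def in_bitmap(code, base, bitmap):
    """Test whether a character code is in a truncated bitmap."""
    i = code // 8 - base
    return 0 <= i < len(bitmap) and bool(bitmap[i] & (1 << (code % 8)))


# Untruncated sizes quoted in the text.
FULL_256 = 256 // 8        # 32 bytes
FULL_64K = 65536 // 8      # 8192 bytes (8K)

# A small vowel table needs far less than either.
vowels = {ord(ch) for ch in "aeiouy"}
base, bitmap = build_bitmap(vowels)
```

For the vowel table the codes run from 97 to 121, so only four bytes are stored, and any code outside that window is rejected by the range check before the bitmap is consulted.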

So one idea is to declare 'utf8' in the Snowball script, allowing character
defs in the range 0-64K, as in the 2-byte character version. Characters
could be written with their Unicode values, and encoded in utf-8 form in
strings.
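
As a sketch of what "encoded in utf-8 form" would involve for a character given by its Unicode value, the function below encodes one code in the 0-64K range by hand. The name encode_utf8 is hypothetical and this is not part of Snowball; surrogate codes (D800-DFFF) are not treated specially here.

```python
def encode_utf8(code):
    """Encode one code point in the range 0..0xFFFF as UTF-8 bytes."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("code point outside the 0-64K range")
    if code < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([code])
    if code < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (code >> 12),
                  0x80 | ((code >> 6) & 0x3F),
                  0x80 | (code & 0x3F)])


# hex 039F (Greek capital omicron) becomes the two bytes CE 9F.
omicron = encode_utf8(0x039F)
```

So a character in the 0-64K range occupies at most three bytes in a utf-8 string.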

Looking again at the cursor movement issues:

If I've got my thinking right: in goto and gopast, cursor movement can be
done one byte at a time (cursor++; or cursor--;). If expression C is made
up of well-formed utf-8 strings, 'goto C' must either fail, or end on a
valid character boundary.
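
A minimal sketch of that argument, assuming C is a single literal string (real Snowball commands are more general). The names goto_literal and is_boundary are hypothetical. A well-formed utf-8 string starts with a lead byte, and a lead byte can never equal a continuation byte (10xxxxxx), so a byte-wise search cannot match in the middle of a character.

```python
def goto_literal(data, cursor, target):
    """Byte-wise 'goto': advance until `target` matches at the cursor.

    Returns the new cursor, left at the start of the match,
    or None if the end of the data is reached without a match.
    """
    while cursor + len(target) <= len(data):
        if data.startswith(target, cursor):
            return cursor
        cursor += 1          # move one byte at a time
    return None


def is_boundary(data, pos):
    """True if `pos` does not point at a UTF-8 continuation byte."""
    return pos == len(data) or (data[pos] & 0xC0) != 0x80


data = "na\u00efve \u039f\u03bb\u03b1".encode("utf-8")
target = "\u039f".encode("utf-8")
pos = goto_literal(data, 0, target)
```

Here the cursor passes through the second byte of the i-diaeresis on its way, but it can only stop where the target matches, which is at a character boundary.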

'next' requires its own implementation, and as you say, backward movement is
not a problem.

'next' is implicit in character tests (vowel, non-vowel), and hop N (=
'next' done N times).
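
A sketch of how forward 'next' and 'hop N' might look over utf-8 bytes, assuming well-formed input. The names next_char and hop are hypothetical and this is not the Snowball runtime.

```python
def next_char(data, cursor):
    """Forward 'next': step over one whole UTF-8 character.

    Returns the new cursor, or None (failure) if already at the end.
    """
    if cursor >= len(data):
        return None
    cursor += 1
    # Skip any continuation bytes (10xxxxxx) of the same character.
    while cursor < len(data) and (data[cursor] & 0xC0) == 0x80:
        cursor += 1
    return cursor


def hop(data, cursor, n):
    """'hop N': 'next' done N times; fails if any step fails."""
    for _ in range(n):
        cursor = next_char(data, cursor)
        if cursor is None:
            return None
    return cursor


data = "a\u039fb".encode("utf-8")   # 1 + 2 + 1 bytes
```

With this data, 'next' from position 1 lands on position 3, having stepped over both bytes of the omicron.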

I'm not sure 'size' is used in the Snowball scripts: it could be defined to
give an approximate answer in the utf-8 case, or implemented exactly.
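
The two options for 'size' could be sketched as follows. Taking the byte count as the approximate answer is an assumption on my part; the function names are hypothetical.

```python
def size_exact(data):
    """Count characters: every byte that is not a continuation byte."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def size_approx(data):
    """Approximate answer (assumed here to be the byte count).

    Never smaller than the exact size, and equal to it for pure ASCII.
    """
    return len(data)


data = "a\u039fb".encode("utf-8")
```

The exact version costs a pass over the string, while the approximate one is free but overcounts by one for every continuation byte.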

Presumably a 'utf8' declaration would simply be ignored by the Java
codegenerator (Richard B to confirm).

Obviously we are working towards a standard header:

utf8
define GREEK_CAPITAL_LETTER_OMICRON hex 039F
. . . .

of Unicode characters, and it would be nice to use the Unicode names, were
they not (as this example shows) so very long.

- - -

Another idea I had was just to create modified versions of the existing
scripts so they will work with utf-8 encoded strings, even while Snowball
knows nothing about utf-8. That could be done with no further changes to
Snowball.

Incidentally, do you have a view on the use of free-floating accents?
(Unicode 0300-036F)

Martin

