Re: Spanish stemmer with accents stripped before stemming: msg#00005

search.snowball

Subject: Re: Spanish stemmer with accents stripped before stemming

Hi, Martin,

Thank you very much for your replies...

> Obviously to us, it's a bit easier to look at the problem
> from the snowball angle, rather than think about the generated java
> after it's been put inside lucene! As far as the snowball script is
> concerned, I believe you could strip out accents from the source,
> eliminate the duplicate strings in the amongs(..) that would result, and
> recompile, getting the effect you want.

OK... We just tried that, and it works very well so far! Thanks a bunch.
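
For anyone following along, the accent-stripping and de-duplication step described above can be sketched roughly as follows. This is only an illustration, not the script we actually ran: the helper names and the sample strings are made up for the example, and it assumes that only the acute accent and the diaeresis on vowels should be folded while ñ is left alone.

```python
# Assumption: fold only accented vowels and ü; keep ñ as a distinct letter.
ACCENT_MAP = str.maketrans("áéíóúüÁÉÍÓÚÜ", "aeiouuAEIOUU")


def strip_accents(text):
    """Return text with accented vowels replaced by their plain forms."""
    return text.translate(ACCENT_MAP)


def dedupe_stripped(strings):
    """Strip accents from each string and drop the duplicates that result,
    keeping the first occurrence of each stripped form."""
    seen = set()
    result = []
    for s in strings:
        stripped = strip_accents(s)
        if stripped not in seen:
            seen.add(stripped)
            result.append(stripped)
    return result


# Illustrative sample only, not the actual contents of the Spanish script.
suffixes = ["ando", "ándo", "ar", "ár", "ir", "ír", "iendo", "iéndo"]
print(dedupe_stripped(suffixes))  # ['ando', 'ar', 'ir', 'iendo']
```

The de-duplicated list is what would go back into the among(..) before recompiling.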

> (Incidentally, I have hit this problem with Spanish stemming before, but
> it was a long while ago -- before the development of snowball.)

Accents are the boogeyman of day-to-day written Spanish usage and it's
hard to imagine an effective search engine that obliges users to type
them correctly.

> Thinking about it further, this will not work, since the strings are
> placed in tables which would need to be fully reorganised if any of the
> characters in the strings were readjusted. (It is the way a snowball
> 'among' is implemented.)
>
> The only way to do this is to modify the stem.sbl file for Spanish,
> regenerate the java code with the snowball compiler (which you can
> download) and replace the old java with the new in your application.

Ah... So now we know why our previous attempt failed.
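
A toy illustration of the failure, as I understand it. It assumes the generated table is kept in sorted order and searched by binary search; the quoted message says only that the tables would need to be fully reorganised, so that detail is my guess. The three strings and the `contains` helper are hypothetical.

```python
import bisect

# Assumption: fold only accented vowels and ü.
ACCENT_MAP = str.maketrans("áéíóúü", "aeiouu")


def contains(table, s):
    """Binary-search lookup; only correct if table is sorted."""
    i = bisect.bisect_left(table, s)
    return i < len(table) and table[i] == s


# Hypothetical stand-in for a generated lookup table.
original = sorted(["ir", "ió", "os"])  # ['ir', 'ió', 'os']

# Editing the strings where they sit leaves the table out of order.
patched_in_place = [s.translate(ACCENT_MAP) for s in original]  # ['ir', 'io', 'os']

# Rebuilding the table from the stripped strings restores the order.
rebuilt = sorted(set(patched_in_place))  # ['io', 'ir', 'os']

print(contains(original, "ió"))          # True
print(contains(patched_in_place, "io"))  # False: the entry is there but unreachable
print(contains(rebuilt, "io"))           # True
```

Regenerating from the modified source amounts to the last case: the compiler rebuilds the table rather than patching it.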

It occurs to me that perhaps it would be a good idea to modify
Snowball's Spanish stemmer to accept both accented and accent-stripped
input.
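
One way to picture it, as a rough sketch only: fold accents on the way in, then match against accent-free suffixes, so both spellings reach the same stem. The `toy_stem` function and its suffix list below are hypothetical stand-ins and are not the Snowball Spanish algorithm.

```python
# Assumption: fold only accented vowels and ü; keep ñ as a distinct letter.
ACCENT_MAP = str.maketrans("áéíóúü", "aeiouu")


def fold(word):
    """Lowercase the word and replace accented vowels with plain ones."""
    return word.lower().translate(ACCENT_MAP)


# Toy accent-free suffix list, longest first. Purely illustrative.
TOY_SUFFIXES = ("aciones", "acion", "es", "s")


def toy_stem(word):
    """Fold accents, then remove the first matching suffix if enough
    of the word remains."""
    folded = fold(word)
    for suffix in TOY_SUFFIXES:
        if folded.endswith(suffix) and len(folded) > len(suffix) + 2:
            return folded[: -len(suffix)]
    return folded


# Accented and accent-stripped input end up with the same stem.
print(toy_stem("organización"))  # organiz
print(toy_stem("organizacion"))  # organiz
```

The trade-off is that words distinguished only by an accent would no longer be told apart.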

Greetings,
Andrew Green

P.S. Our little server was down over the weekend (sorry). It's back
online now, though a commit error in the repository has made the
relevant files difficult to access. They seem less relevant now,
I suppose.

