Re: Spanish stemmer with accents stripped before stemming: msg#00006

search.snowball

Subject: Re: Spanish stemmer with accents stripped before stemming


Andrew,

> It occurs to me that perhaps it would be a good idea to modify
> Snowball's Spanish stemmer to accept both accented and accent-stripped
> input.

I think that is a good point. As you say, "accents are the bogeyman of
day-to-day written Spanish usage and it's hard to imagine an effective
search engine that obliges users to type them correctly."
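As a rough illustration of what accepting both forms could mean on the search side, here is a minimal Python sketch that folds Spanish acute accents away before any stemming takes place. The names strip_acutes and same_search_key are hypothetical, and the choice to leave ñ and ü untouched is an assumption for this example, not something settled in this thread. It also assumes precomposed (NFC) input.

```python
# Map the five acute-accented Spanish vowels to their plain forms.
# Assumption: ñ and ü are kept, since they are distinct letters/sounds.
SPANISH_ACUTES = str.maketrans("áéíóú", "aeiou")


def strip_acutes(word: str) -> str:
    """Lowercase a word and remove acute accents from its vowels."""
    return word.lower().translate(SPANISH_ACUTES)


def same_search_key(a: str, b: str) -> bool:
    """True if two spellings differ only by case or acute accents."""
    return strip_acutes(a) == strip_acutes(b)
```

With this folding applied to both the indexed text and the query, "canción" and "cancion" would reach the stemmer as the same string.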

The occasion when I came across this problem before was news data in
Spanish where the placing of accents was very untrustworthy. There is a
variant of the Snowball German stemmer in which an umlaut is represented by
a following e, but there are no such variants for the Romance language
stemmers.
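The "following e" convention can be sketched in a few lines of Python. This is only an illustration of the idea, not the Snowball variant itself: the function name fold_e_umlauts is hypothetical, lowercase input is assumed, and leaving "ue" alone after "q" is an assumption made here so that words like "quelle" are not damaged.

```python
import re


def fold_e_umlauts(word: str) -> str:
    """Rewrite ae/oe/ue as ä/ö/ü in a lowercase word.

    Assumption: 'ue' directly after 'q' is left unchanged.
    """
    word = word.replace("ae", "ä").replace("oe", "ö")
    # Negative lookbehind skips the 'ue' in 'que...'.
    return re.sub(r"(?<!q)ue", "ü", word)
```

A blanket mapping like this can misfire on words where the e is not an umlaut marker (for example "neue" becomes "neü"), which is one reason it suits a separate stemmer variant rather than the default.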

I'm not sure what the deal is for Portuguese, but Spanish is as you
describe it. In French, accents are applied quite rigorously,
except that they can be omitted when the text is entirely in
upper case. (But is that stylistic feature less prevalent than it was a
century ago? I'm not sure ...) Anyway, keeping accents in place with
French does not seem to be problematic.

Italian presents an interesting case. They use acute and grave, but not
by any consistent rule. There are different schemes for how acute/grave
is applied, which vary (or used to vary) among publishing houses. This
is why the Italian stemmer begins with the strange operation of
replacing all acutes with graves. A critical ending is then -o+accent,
but even if the accent is absent, -o is a similar ending, and will be
removed by the same rule (compare porto`, he carried, with porto, I
carry). The result is that the Italian stemmer does not behave very
differently on texts with all accents stripped.
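The Italian case can be illustrated with a toy Python sketch. It shows only the two ideas described above, replacing acutes with graves and then removing a final -o with or without its accent. The function names are hypothetical, the single ending rule is a stand-in for the real stemmer's rules, and precomposed (NFC) lowercase input is assumed.

```python
# Replace every acute-accented vowel with its grave counterpart.
ACUTE_TO_GRAVE = str.maketrans("áéíóú", "àèìòù")


def normalise_accents(word: str) -> str:
    """Make acute/grave usage uniform by mapping acutes to graves."""
    return word.translate(ACUTE_TO_GRAVE)


def strip_final_o(word: str) -> str:
    """Toy rule: drop a final -ò or -o (not the real Italian stemmer)."""
    word = normalise_accents(word)
    for ending in ("ò", "o"):
        if word.endswith(ending):
            return word[: -len(ending)]
    return word
```

Under this toy rule the accented, wrongly accented, and unaccented spellings of the example word all reduce to the same stem, which is the behaviour described for accent-stripped text.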

We'll keep your suggestion in mind as a Snowball development.

Martin

