Re: Spanish stemmer with accents stripped before stemming (msg#00006, search.snowball)
Andrew,

> It occurs to me that perhaps it would be a good idea to modify
> Snowball's Spanish stemmer to accept both accented and accent-stripped
> input.

I think that is a good point. As you say, "accents are the bogeyman of day-to-day written Spanish usage and it's hard to imagine an effective search engine that obliges users to type them correctly." I came across this problem before with news data in Spanish, where the placing of accents was very untrustworthy.

There is a variant of the Snowball German stemmer in which an umlaut is represented by a following e, but there are no variants for the Romance language stemmers.

I'm not sure what the situation is for Portuguese, but Spanish is as you describe it. In French, accents are applied quite rigorously, except that they can be omitted when the text is entirely in upper case. (But is that stylistic feature less prevalent than it was a century ago? I'm not sure ...) Anyway, keeping accents in place with French does not seem to be problematic.

Italian presents an interesting case. Both acute and grave accents are used, but not by any consistent rule. There are different schemes for applying acute and grave, and they vary (or used to vary) among publishing houses. This is why the Italian stemmer begins with the strange operation of replacing all acutes with graves. A critical ending is then -o+accent, but even if the accent is absent, -o is a similar ending, and will be removed by the same rule (compare portò, he carried, with porto, I carry). The result is that the Italian stemmer does not behave very differently on texts with all accents stripped.

We'll keep your suggestion in mind as a Snowball development.

Martin
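For illustration, the two normalisations discussed above (folding acutes into graves for Italian, and stripping accents before stemming for Spanish) might be sketched in Python as follows. This is not Snowball code: the function names `acute_to_grave` and `strip_accents` are hypothetical, the input is assumed to be lower case already, and keeping ñ as a distinct letter is an assumption rather than something the message specifies.

```python
import unicodedata

# Acute-accented vowels mapped to their grave equivalents
# (lower case only; assumes the text has already been lowercased).
ACUTE_TO_GRAVE = str.maketrans("áéíóú", "àèìòù")


def acute_to_grave(text: str) -> str:
    """Replace every acute-accented vowel with the grave-accented one."""
    # NFC first, so that 'o' + combining acute becomes the single character 'ó'.
    return unicodedata.normalize("NFC", text).translate(ACUTE_TO_GRAVE)


def strip_accents(text: str, keep: str = "ñÑ") -> str:
    """Remove combining marks, leaving the characters in `keep` untouched.

    Keeping ñ is an assumption made for this sketch.
    """
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
            continue
        # Decompose the character and drop any combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)
```

A search front end could apply `strip_accents` to both the indexed text and the query before stemming, so that a user typing "cancion" still matches "canción". Whether the stemmer's suffix rules then behave acceptably on the stripped forms is the open question raised in this thread.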