Re: Bug report? (msg#00011, search.snowball)
Alexander,

Yes, I was aware of this, and should explain:

The Porter stemmer, as originally defined, reduces "s" to null, and is faithfully implemented in the stemmer at

http://snowball.tartarus.org/porter/stemmer.html

The version of the Porter stemmer which I distributed for many years stems "s" to "s", however. This is because it has a couple of improvements (points of DEPARTURE) from the published algorithm which everyone has come to accept. These improvements are in the slightly different version of the stemmer at

http://www.tartarus.org/~martin/PorterStemmer/

and are clearly marked DEPARTURE in the comments in the ANSI C version of the stemmer - as well as being described in the accompanying text.

I can't alter this now, bugs or not, because of the status of the Porter stemmer as a described algorithm, but the Snowball Porter2 stemmer fixes these problems and many others besides.

I would agree that it is not helpful to stem "s" to null, but would not agree that stemming to null is invariably bad (although none of the Snowball stemmers on current release do so). See the notes introducing the Russian stemmer.

I can't explain the problems you had with email, I'm afraid. I've certainly received executables, and files containing viruses, as unwanted attachments within the past few months.

Martin

> I found a phrase
>
> "In any case a string of length 1 will be unchanged if passed
> through the algorithm".
>
> Indeed, I always thought a stemmer should NOT produce empty stems, no? This
> is very inconvenient in practice since it changes file formats, word counts,
> etc.
>
> However, it seems the algorithm does strip "s" -> "". (This is the only rule
> producing empty strings.) In effect, the program at
> http://snowball.tartarus.org/porter/stemmer.html does it; I attach the
> corresponding files (I found no way to send the executable due to a paranoid
> antivirus software at Tartarus).
>
> Is this correct? Wouldn't you rather change the unconditional rule
>
>     S ->                 cats -> cat
>
> to
>
>     (*v or *c) S ->      cats -> cat
>
> Thank you!
> Alexander
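For readers following the thread, the difference between the two behaviours can be sketched in a few lines of Python. This is only an illustration of the single step being discussed (the plural-stripping rules SSES -> SS, IES -> I, SS -> SS, S -> null), not the full stemmer and not the code distributed at either URL above. The function names and the `require_stem` flag are hypothetical, and reading the proposed condition `(*v or *c)` as "at least one letter must remain before the S" is an assumption about what Alexander intended.

```python
def strip_plural(word: str, require_stem: bool = False) -> str:
    """Apply the plural-stripping rules discussed in this thread.

    require_stem=False mimics the unconditional rule S -> (null).
    require_stem=True is a hypothetical guard: the S rule fires only
    when at least one letter would remain (an assumed reading of the
    proposed "(*v or *c) S ->" condition).
    """
    if word.endswith("sses"):
        return word[:-2]          # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]          # ponies -> poni
    if word.endswith("ss"):
        return word               # caress -> caress
    if word.endswith("s"):
        stem = word[:-1]
        if require_stem and not stem:
            return word           # guarded: "s" stays "s"
        return stem               # cats -> cat; unguarded: "s" -> ""
    return word


if __name__ == "__main__":
    for w in ["cats", "caresses", "ponies", "caress", "s"]:
        print(repr(w), repr(strip_plural(w)), repr(strip_plural(w, require_stem=True)))
```

With the guard off, the one-letter word "s" comes back as an empty string; with it on, "s" is returned unchanged, while ordinary plurals such as "cats" are stemmed identically in both cases.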