
RE: Bug report?: msg#00016

search.snowball

Subject: RE: Bug report?


Alexander,

It is a good idea to copy stemming ideas to
snowball-discuss@xxxxxxxxxxxxxxxxxx, which is where all the lively
discussions take place!

I apologise for the confusion over the three forms of the English stemmer.
(It could be worse though.) The separation of the work into two different Web
address areas does not help. I will try to come up with some wording that
explains it more clearly.

In particular, the fact that "s" stems to null in the Porter stemmer with
its 1980 definition ought to be mentioned on its web page.

I'll add this when I'm less busy with other work.

I agree with you about not stemming to the null string. The point with
Russian is that many of the endings are also stopwords that one might wish
to eliminate from an indexing process anyway. The first Russian stemmer I
did (more than 10 years ago now) took that approach.

Martin



At 15:33 10/10/2003 +0900, Alexander Gelbukh wrote:
>Dear Martin,
>
>Thank you for your answer! The confusion arose because your page discusses
>an "original" versus an "improved" version, but seems not to indicate very
>clearly which one is which and where each one is. Perhaps you'd consider
>making it VERY clear on your page for dummies like me.
>
>As to the discussion of Russian empty stems, it did not convince me. I
>wonder what they meant specifically: what are examples of Russian words with
>an empty stem? I know only one very arguable group of such words ("vynut'",
>"perenyat'", "zanyat'", ...) with an arguably empty stem or the stem "-n-" (I
>guess historically there was a stem -n- (-im-?) followed by a suffix -n-,
>which then contracted together to one -n-). I cannot think of any other
>linguistically valid example.
>
>But even with this, I think it is a better choice not to allow empty stems
>by definition. Two arguments for this:
>
>- Technical: altering the file format (two columns --> one column in some
>rows) or the word count in a file can lead to subtle errors that are difficult
>to detect, as happened in my case.
>
>- Pragmatic: The very purpose of a stemmer is to map "the same" words into
>one symbol but "different" ones into different symbols. This is prone to
>both types of errors: false alarms and misses. Mapping words to an empty
>stem can hardly decrease the miss rate but will probably dramatically
>increase the false alarm rate. If this is indeed done for one group of
>words, perhaps it's wiser to map them into something else, say, into one of
>them: "vynut'", "perenyat'", "zanyat'", ... --> "vynut'". Or to leave them
>alone, as you don't stem "be, are, is, was, were" into a common (empty?!)
>stem but just leave them alone.
>
>Thank you again for your attention!
>
>Alexander
>
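
The technical argument in the quoted message (an empty stem turns a two-column row into a one-column row) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `toy_stem` is a hypothetical stand-in that only strips a trailing "s" (it is not the Porter or Snowball algorithm), and `safe_stem` is a hypothetical wrapper that leaves a word alone whenever its stem would be empty.

```python
from typing import Callable, Iterable, List


def toy_stem(word: str) -> str:
    """Hypothetical toy stemmer: strip one trailing 's'. NOT the Porter algorithm."""
    return word[:-1] if word.endswith("s") else word


def safe_stem(word: str, stem: Callable[[str], str]) -> str:
    """Return stem(word), or the word unchanged if the stem would be empty."""
    result = stem(word)
    return result if result else word


def vocabulary_rows(words: Iterable[str], stem: Callable[[str], str]) -> List[str]:
    """Build 'word stem' rows, as in a two-column vocabulary file."""
    return [f"{w} {stem(w)}" for w in words]


words = ["cats", "s", "is"]

# Without the guard, the row for "s" has a word but no stem.
raw = vocabulary_rows(words, toy_stem)

# With the guard, every row keeps both columns.
guarded = vocabulary_rows(words, lambda w: safe_stem(w, toy_stem))

# Count whitespace-separated columns per row, as a naive file reader would.
print([len(row.split()) for row in raw])
print([len(row.split()) for row in guarded])
```

A reader that splits each row on whitespace sees a row with a single column in the unguarded output, which is the kind of silent format change described above. The guard corresponds to the "leave them alone" option; mapping such words to a chosen representative would be an alternative policy.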

