Indexing multiple numbers in one field: msg#00042

lang.perl.modules.plucene

Subject: Indexing multiple numbers in one field

Before I learned all the details of how Plucene works, I devised an indexing scheme for a simple Web application, an application that allows multiple users to take, save, index and share notes. The notes live in a relational database, but I would use Plucene for text indexing. My text indexing schema looked roughly like this:

note / UnStored (search default)
title / Text
created / Text
updated / Text
user_ids / Text

...where user_ids consisted of multiple numbers separated by spaces. I need to track user_ids so that Sally doesn't see Jane's notes except in cases where Jane has set up a special sharing relationship. I used Plucene::Plugin::Analyzer::PorterAnalyzer as the analyzer so I would get stemming for the notes.

Of course, this does not work. First of all, Plucene::Analysis::LetterTokenizer defines a token regex of /[[:alpha:]]+/, which does not match numbers. Secondly, Lingua::Stem::En::stem deletes anything that is not [A-Za-z].
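To make the first problem concrete, here is a minimal sketch in Python (not Perl, and not Plucene code) of what a letters-only token pattern does to numeric input. The pattern `[^\W\d_]+` is only an approximation of `/[[:alpha:]]+/`, and the function name `letter_tokens` is hypothetical.

```python
import re

# Approximation of Perl's /[[:alpha:]]+/: runs of letters only
# (no digits, no underscore). This is an assumption, not Plucene's code.
ALPHA_RUN = re.compile(r"[^\W\d_]+")

def letter_tokens(text):
    """Return the letter-only tokens found in text."""
    return ALPHA_RUN.findall(text)

# A user_ids value made only of numbers produces no tokens at all,
# so nothing from that field would reach the index.
print(letter_tokens("12 57 103"))

# Mixed tokens lose their digits: "perl5" is cut down to "perl".
print(letter_tokens("perl5 notes"))
```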

I can define user_ids to be a Plucene::Document::Field of type Keyword, which works fine if there is only one user_id. But this fails to handle multiple user_ids.

I am curious for thoughts on the best course of action.

I can change user_ids to user_id (of type Keyword) and simply enter an instance of the document into the index for each associated user.
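As a rough sketch of this first option, in Python and independent of Plucene: each note is expanded into one entry per associated user, so every entry carries a single-valued user_id that can be matched exactly. The names `expand_per_user` and `note_id` are hypothetical.

```python
def expand_per_user(note_id, text, user_ids):
    """Build one index entry per user, each with a single user_id value."""
    return [
        {"note_id": note_id, "note": text, "user_id": str(uid)}
        for uid in user_ids
    ]

# A note shared by two users becomes two entries in the index.
entries = expand_per_user(7, "Buy milk", [12, 57])
for entry in entries:
    print(entry)
```

The cost is that the index grows with the number of (note, user) pairs, and every change to a sharing relationship means adding or deleting an entry.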

I can write my own analyzer, but this will entail replacing or editing not only LetterTokenizer but also Lingua::Stem::En. (Nevertheless, I'm leaning in this direction. I'd like people to be able to run a search like "Windows 3.0" or "perl5" and get a meaningful result.)
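A minimal sketch of the kind of token pattern such an analyzer might use, written in Python for illustration only. The pattern and the name `alnum_tokens` are assumptions, not part of Plucene; the pattern keeps letters and digits together and allows dotted version numbers such as "3.0".

```python
import re

# Letters and digits stay in one token; a dot is kept only when digits
# follow it, so "3.0" survives but a sentence-ending period is dropped.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:\.[0-9]+)*")

def alnum_tokens(text):
    """Return lowercased alphanumeric tokens from text."""
    return [t.lower() for t in TOKEN.findall(text)]

print(alnum_tokens("Windows 3.0 and perl5."))
```

A stemmer that deletes non-letters would still undo this, so tokens containing digits would presumably have to bypass stemming.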

I can write a new field class, called Keywords, which would be just like Keyword except tokenized based on whitespace or some other optional field separator.
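The intended behaviour of such a Keywords field can be sketched with a toy inverted index in Python. None of this is Plucene API; `keywords`, `add_note`, and `index` are hypothetical names. The value is split on a separator, and each piece is stored as an exact-match term with no stemming or other analysis.

```python
from collections import defaultdict

def keywords(value, sep=None):
    """Split a field value into exact-match terms (sep=None: any whitespace)."""
    return [t for t in value.split(sep) if t]

# Toy inverted index: (field, term) -> set of note ids.
index = defaultdict(set)

def add_note(note_id, user_ids_value):
    """Index every user id in the field value as its own term."""
    for term in keywords(user_ids_value):
        index[("user_ids", term)].add(note_id)

add_note(1, "12 57")
add_note(2, "57 103")

# Notes visible to user 57; a partial value such as "5" matches nothing.
print(sorted(index[("user_ids", "57")]))
print(sorted(index[("user_ids", "5")]))
```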

Or I could use a separate index for each user, but I have ruled out this approach because eventually some notes will be shared among quite a few users.

Thoughts?

Cheers
Ryan

