
Re: Searching: msg#00003

audio.freedb.devel

Subject: Re: Searching

Hello,

Monday, August 26, 2002, 1:34:13 AM, Yuri wrote:

> The last week or two have been rather hectic, so I haven't gotten around to
> slogging through the indexing code for the current system (besides which, I
> find Perl script amazingly annoying to decrypt :0 - the lack of any and all
> comments in the source doesn't help...) but from the HTML doc, the whole
> thing looks a little clunky (no offense to anyone/anything, that's
> just how it looks)... if someone can just tell me the index's table structure
> then I'd be quite grateful ;)

I guess the only person who can tell you is Gerhard Gonter, the
author. He is also on this list - let's hope he replies to your mail.

> On the upside, I've put together a near-optimal solution to this particular
> problem (I think ;) )

The solution you described on the board looks great to me - if you
can put it into code and it works as intended ;)

> About the issues raised about the current search:
> Firstly, the ASCII issue (also the alternate spellings issue):
> This is trivial for things like accents and missing/extra apostrophes:
> strip the character down to its base letter for the former, and discard the
> shorter string for the latter ( Michael's -> michael, L'Industrie ->
> industrie ). If this is done on the requested keyword as well as the index,
> there won't be a problem.

Sounds good. But what will we do about entries in UTF-8, once that's
implemented?
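
For illustration, here is a minimal Python sketch of the normalization described in the quote above. It is not code from the freedb server; the function names are made up for this example. It assumes the text is decoded to Unicode first, which is one possible way to treat UTF-8 entries: decode, then strip accents by Unicode decomposition.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Reduce accented characters to their base letters via NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks that the decomposition split off.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_keyword(word: str) -> str:
    """Lowercase, strip accents, and keep the longest apostrophe-separated part."""
    word = strip_accents(word).lower()
    # Treat the typographic apostrophe like the ASCII one.
    parts = [p for p in word.replace("\u2019", "'").split("'") if p]
    return max(parts, key=len) if parts else ""


def normalize_bytes(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode a stored entry (assumed to be UTF-8 here) and normalize it."""
    return normalize_keyword(raw.decode(encoding))


for sample in ("Michael's", "L'Industrie", "Café"):
    print(sample, "->", normalize_keyword(sample))
```

Applying the same function to both the search keyword and the indexed words keeps the two sides comparable. Letters that have no Unicode decomposition (for example "ø" or "ß") pass through unchanged, so they would need a separate mapping table.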

> About the duplicate elimination:
> I had kind of assumed that this would be one of the main administrative
> uses of the full text search, as it is a trivial task, especially with a
> hashed search as I've described on the dev board. You don't even need any
> misspelling correction: the hashed search as described is quick enough to
> check the similarity of the whole file, and more than 75% word matches are
> almost certainly the same thing (assuming the user didn't misspell more than
> 25% of the words, that is), so we do:
>   for each record in the database:
>     if there are matches with more than 50% relevance:
>       print a list of matches in order of relevance
>       let the user decide what to do
> and that's practically a Python program - Perl users take note :)
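
The quoted pseudocode can be turned into a short runnable Python sketch. The hashed search from the dev board is not reproduced here; as a stand-in, this uses a plain in-memory word index, and "relevance" is assumed to mean the fraction of a record's words that also occur in the other record. The record ids and texts are made-up sample data.

```python
from collections import defaultdict


def tokenize(text):
    """Split a record's text into a set of lowercase words."""
    return set(text.lower().split())


def build_index(records):
    """Map each word to the set of record ids that contain it."""
    index = defaultdict(set)
    for rec_id, text in records.items():
        for word in tokenize(text):
            index[word].add(rec_id)
    return index


def find_matches(rec_id, records, index, threshold=0.5):
    """Return (relevance, other_id) pairs above the threshold, best first."""
    words = tokenize(records[rec_id])
    if not words:
        return []
    shared = defaultdict(int)  # other record id -> number of shared words
    for word in words:
        for other in index.get(word, ()):
            if other != rec_id:
                shared[other] += 1
    matches = [(count / len(words), other)
               for other, count in shared.items()
               if count / len(words) > threshold]
    return sorted(matches, reverse=True)


# Made-up sample records: id -> artist, title and track names as one string.
records = {
    "a": "Pink Floyd The Wall disc one",
    "b": "pink floyd the wall disc 1",
    "c": "Miles Davis Kind of Blue",
}
index = build_index(records)
for rec_id in records:
    for relevance, other in find_matches(rec_id, records, index):
        print(f"{rec_id} ~ {other}: {relevance:.0%} of words match")
```

The threshold is a parameter because the quote mentions both 75% and 50%; the final decision is still left to a person looking at the printed list.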

The problem with duplicates is that even though something is a
duplicate regarding the titles etc., it is most likely not one regarding
the discid and track offsets. This may be because different
pressings of a CD are available, or because someone submitted info
for a CD which he burned from MP3s himself. We cannot delete such
duplicates if we want the original CD to be recognized as an exact
match. That's the problem. A good idea would be a way to link
several entries (but not with a hardlink in the filesystem, as the
current server software allows): the track offsets of each entry
should be preserved, but the titles should be the same for all "linked"
entries. I'd imagine that this could be solved quite well if we move
to a relational database.
As for real duplicates: we have a script that checks for them and we
run it from time to time. Currently we would be able to remove about
3500 duplicates from the database by running this script - but that's
just a fraction of the duplicates with different discids and track
offsets...
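
To make the idea of "linked" entries a bit more concrete, here is a rough sketch of how it might look in a relational database, using Python's sqlite3 module. This is only an assumption about a possible layout, not an existing freedb schema: the table names (album, pressing), the column names, and the sample discids and offsets are all made up.

```python
import sqlite3

# Titles live once in "album"; every pressing keeps its own discid and offsets.
SCHEMA = """
CREATE TABLE album (
    album_id INTEGER PRIMARY KEY,
    artist   TEXT NOT NULL,
    title    TEXT NOT NULL
);
CREATE TABLE pressing (
    discid        TEXT NOT NULL,
    track_offsets TEXT NOT NULL,
    album_id      INTEGER NOT NULL REFERENCES album(album_id),
    PRIMARY KEY (discid, track_offsets)
);
"""


def lookup(conn, discid):
    """Return (artist, title, track_offsets) rows for an exact discid match."""
    return conn.execute(
        "SELECT a.artist, a.title, p.track_offsets "
        "FROM pressing p JOIN album a ON a.album_id = p.album_id "
        "WHERE p.discid = ?",
        (discid,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# One album entry with a typo in the title, linked to two pressings.
cur = conn.execute(
    "INSERT INTO album (artist, title) VALUES (?, ?)",
    ("Example Artist", "Exmaple Album"),
)
album_id = cur.lastrowid
conn.executemany(
    "INSERT INTO pressing (discid, track_offsets, album_id) VALUES (?, ?, ?)",
    [
        ("aa11bb22", "150 20000 41000", album_id),
        ("cc33dd44", "182 20110 41230", album_id),
    ],
)

# Correcting the title once fixes it for every linked pressing.
conn.execute(
    "UPDATE album SET title = ? WHERE album_id = ?",
    ("Example Album", album_id),
)
print(lookup(conn, "aa11bb22"))
print(lookup(conn, "cc33dd44"))
```

Each pressing is still found as an exact match by its own discid and offsets, while the titles are stored and corrected in one place.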

btw: you didn't send the email from the address you registered with on
this list - therefore I had to approve it. Approving every post to a
mailing list manually is quite annoying, so I'd like to ask everyone to
make sure to use the right sender address.

Joerg

