Re: Searching: msg#00003
audio.freedb.devel
Hello,

Monday, August 26, 2002, 1:34:13 AM, Yuri wrote:

> The last week or two have been rather hectic, so I haven't gotten around to
> slogging through the indexing code for the current system (besides which, I
> find perl scrypt amazingly annoying to decrypt :0 - the lack of any and all
> comments in the source doesn't help...) but from the html doc, the whole
> thing looks a little clunky (no offense to anyone/anything, that's
> just how it looks)... if someone can just tell me the index's table structure
> then I'd be quite grateful ;)

I guess the only person who can tell you is Gerhard Gonter, the author. He is also on this list - let's hope he replies to your mail.

> On the upside, I've put together a near-optimal solution to this particular
> problem (I think ;) )

The solution you described on the board looks great to me - if you can put it into code and it works as intended ;)

> About the issues raised about the current search:
> Firstly, the ascii-issue (also the alternate spellings issue):
> This is trivial for things like accents and missing/extra apostrophes:
> strip the character down to its base letter for the former, and discard the
> shorter string for the latter ( Michael's -> michael, L'Industrie ->
> industrie ). If this is done on the requested keyword as well as the index,
> there won't be a problem.

Sounds good. But what will we do about entries in UTF-8, once that's implemented?

> About the duplicate elimination:
> I had kind of assumed that this would be one of the main administrative
> uses of the full text search, as it is a trivial task, especially with a
> hashed search as I've described on the dev board. You don't even need any
> misspelling correction: the hashed search as described is quick enough to
> check the similarity of the whole file, and > 75% word matches are almost
> certainly the same thing (assuming the user didn't misspell more than 25% of
> the words that is) so we do
>
> for each record in the database:
>     if there are matches with more than 50% relevance:
>         print a list of matches in order of relevance
>         let the user decide what to do
>
> and that's practically a python program - perl users take note :)

The problem with duplicates is that even though an entry may be a duplicate with regard to the titles etc., it is most likely not a duplicate with regard to the discid and track offsets. This may be because different pressings of a CD are available, or because someone submitted info for a CD which he burned from MP3s himself. We cannot delete such duplicates if we want the original CD to be recognized as an exact match. That's the problem.

A good idea would be a way to link several entries (but not with a hardlink in the filesystem, like the current server software allows): the track offsets of each entry should be preserved, but the titles should be the same for all "linked" entries. I'd imagine that this could be solved quite well if we move to a relational database.

As for real duplicates: we have a script that checks for them and we run it from time to time. Currently we would be able to remove about 3500 duplicates from the database by running this script - but that's just a fraction of the duplicates with different discids and track offsets...

btw: you didn't send the email from the address you registered with on this list - therefore I had to approve it. Approving every post to a mailing list manually is quite annoying, so I'd like to ask everyone to make sure that he uses the right sender address.

Joerg
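Editor's sketch: the keyword normalization and the "more than 50% relevance" duplicate check discussed in this message could look roughly like the Python below. It is an illustration only, not freedb's code and not the hashed search described on the dev board: it compares every pair of records by brute force. The function names, the record ids, and the definition of "relevance" (shared words divided by the size of the larger word set) are all assumptions made for this example. Because it works on Unicode strings, UTF-8 entries would simply need to be decoded before being passed in.

```python
import unicodedata


def normalize_word(word):
    """Reduce a word to a lowercase base form (hypothetical helper)."""
    # Decompose accented characters into base letter + combining mark,
    # then drop the combining marks so only the base letter remains.
    decomposed = unicodedata.normalize("NFKD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Split on apostrophes and keep the longest piece, i.e. discard the
    # shorter string: "Michael's" -> "michael", "L'Industrie" -> "industrie".
    pieces = base.replace("\u2019", "'").split("'")
    return max(pieces, key=len).lower()


def normalized_words(text):
    """Return the set of normalized words in a text, ignoring punctuation."""
    words = set()
    for token in text.split():
        word = "".join(ch for ch in normalize_word(token) if ch.isalnum())
        if word:
            words.add(word)
    return words


def word_match_ratio(a, b):
    """Assumed relevance measure: shared words / size of the larger word set."""
    words_a, words_b = normalized_words(a), normalized_words(b)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / max(len(words_a), len(words_b))


def likely_duplicates(records, threshold=0.5):
    """Map each record id to its matches above the threshold, best first.

    `records` is a dict of hypothetical record id -> record text.
    Brute-force pairwise comparison; a real index would avoid this.
    """
    result = {}
    for rid, text in records.items():
        matches = [
            (word_match_ratio(text, other), oid)
            for oid, other in records.items()
            if oid != rid
        ]
        matches = sorted((m for m in matches if m[0] > threshold), reverse=True)
        if matches:
            result[rid] = matches
    return result
```

The function only reports candidates; as noted in the message, entries with different discids or track offsets must not be deleted automatically, so the final decision stays with the user.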