Re: Searching: msg#00006 audio.freedb.devel
Hi,

Wednesday, August 28, 2002, 12:57:08 AM, vilm0001@xxxxxxxxxxxxxxxxxxxxxxx wrote:

> OK, I hope this has the correct "From" field...

Yes, was correct :)

> About UTF-8 : it doesn't affect the index, but it does affect the hashing
> function. Also, there has to be a decision as to whether to strip some UTF-8
> characters down to ascii for misspelling/ascii client reasons, or leave them
> as-is : eg a good part of the latin-1 supplement ( 0080 - 00FF) and most of
> extended latin A ( 0100 - 017F ) consist of "modified" letters which might
> often be replaced by their "base" letters in user input. IMHO these should be
> converted to their base characters in the fulltext search index, to improve
> the number of 'valid' matches. This doesn't mean we try to convert everything
> to ascii (that would be silly at best) just that eg. u with umlauts (00FC)
> ends up as a straight u. The diacritical marks (0300-036F I think) can just
> be dropped. I'm not sure what to do about things like the control pictures
> (2400-243F)- should these be 'applied' and the result 'read' or are they
> intended to be 'read' verbatim?

I don't know enough about UTF-8 to be able to comment on this. :(

> As regards the existing html search, I tried to give it a spin, but it dumps
> out with "Can't locate Net/freedb/file.pm" (which I assume is a freedb file
> parser from a freedb perl binding?), and I couldn't find any reference to
> this in CVS or on the freedb site, and it doesn't seem to be part of the
> CDDB::File or Net::freedb perl modules ... (help?)

Seems like you forgot to get the required stuff from the hyx-tools at
http://sourceforge.net/projects/hyx-tools
The p5-net-freedb package contains the necessary modules. lmd is also
needed - for generating the index.

> Linking files with different discids would probably be a good idea (eg. file
> could consist of one line: "LINK=<discid>" or maybe include track offsets too

Yes, we should _definitely_ keep the track offsets of the entries to be
linked.

> The "hard" ;) bit is what to do
> about it ... let the user confirm the match, link automatically if it is a
> good match, or set some sort of certainty threshold below which the user
> chooses? Anyway, this kind of basic database admin is a slightly different
> problem, it just happens to be a lot easier with fulltext searching...

I'd say link automatically if the match is "good enough" and the track
offsets are at least "fuzzy matching".

> For the time being, I'll just write the module in C for ascii as a standalone
> app for easy testing. I'll set it up so that integration into the server sw
> will consist merely of adding a couple of function calls, and including the
> module sources. You can probably expect a working prototype in a few days to
> a few weeks (I just started work experience, 9-5 5 days/week, 20 weeks, and
> I'm still getting used to it after the light hours at Uni) which I'll put up
> on CVS or whatever so you can have a tinker with it.

Great :)
If you want to use our CVS repository, please give me your Sourceforge
nick and I'll add you as a developer.

Well, it would be great to hear some comments from other people as well ;)

- Jörg
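
[Editor's note] The folding of "modified" letters to their "base" letters proposed in the quoted text can be sketched roughly as follows. This is only an illustration in Python using the standard `unicodedata` module, not the C module planned in the thread. The name `fold_for_index` is hypothetical, and applying the folding only to index keys (not to the stored entries) is an assumption.

```python
import unicodedata


def fold_for_index(text: str) -> str:
    """Return text with combining diacritical marks removed (hypothetical helper).

    Assumption: only the search-index key is folded; the stored entry is
    left untouched.
    """
    # NFD splits precomposed letters, e.g. U+00FC becomes "u" + U+0308.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every character with a non-zero combining class; this covers
    # the combining diacritical marks in U+0300-U+036F.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(fold_for_index("J\u00f6rg"))  # Jorg
```

Letters that have no canonical decomposition, such as ø (U+00F8) or ł (U+0142), pass through unchanged, so a real index would probably also need a small hand-written mapping for them. Characters such as the control pictures (2400-243F) are likewise left as they are.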