Searching: msg#00005 audio.freedb.devel
OK, I hope this has the correct "From" field...

About UTF-8: it doesn't affect the index, but it does affect the hashing function. Also, there has to be a decision as to whether to strip some UTF-8 characters down to ASCII for misspelling/ASCII-client reasons, or leave them as-is: e.g. a good part of the Latin-1 Supplement (0080-00FF) and most of Latin Extended-A (0100-017F) consist of "modified" letters which might often be replaced by their "base" letters in user input. IMHO these should be converted to their base characters in the fulltext search index, to increase the number of 'valid' matches. This doesn't mean we try to convert everything to ASCII (that would be silly at best), just that e.g. u with umlaut (00FC) ends up as a plain u. The combining diacritical marks (0300-036F, I think) can just be dropped. I'm not sure what to do about things like the control pictures (2400-243F) - should these be 'applied' and the result 'read', or are they intended to be 'read' verbatim?

Anyway, for the time being I'll just read 8-bit characters in the hashing, and I'll make the hashing function a separate module (which is almost obligatory anyway) so that it can be changed without fuss.

Joerg is quite right about database fulltext searches being slow... this is why fulltext search benchmarks are rare as hen's teeth, and indexed search speed is on the wishlist of MySQL. Speedwise, like I said, my method should be near-optimal :P - there certainly shouldn't be a speed decrease from the existing html search (if there is, well, :-X I'll just go stick my head in a bucket of water ... I've got one ready ... hang on a minute ...)

As regards the existing html search, I tried to give it a spin, but it dumps out with "Can't locate Net/freedb/file.pm" (which I assume is a freedb file parser from a freedb Perl binding?). I couldn't find any reference to this in CVS or on the freedb site, and it doesn't seem to be part of the CDDB::File or Net::freedb Perl modules ... (help?)
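To make the folding idea above concrete, here is a minimal sketch in C, not the actual module. It assumes the input is valid UTF-8, handles only the two-byte range, maps the letters of U+00C0-U+00FF to a base letter, and drops the combining marks U+0300-U+036F. The function name `fold_for_index` and the table `latin1_base` are made up for illustration, and the choice of which letters have a "base" (e.g. leaving ß and æ alone) is an assumption, not a decision from this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table for U+00C0..U+00FF: the base ASCII letter,
 * or 0 where there is no obvious single base letter (left as-is). */
static const char latin1_base[64] = {
    /* C0 */ 'A','A','A','A','A','A', 0 ,'C','E','E','E','E','I','I','I','I',
    /* D0 */  0 ,'N','O','O','O','O','O', 0 ,'O','U','U','U','U','Y', 0 , 0 ,
    /* E0 */ 'a','a','a','a','a','a', 0 ,'c','e','e','e','e','i','i','i','i',
    /* F0 */  0 ,'n','o','o','o','o','o', 0 ,'o','u','u','u','u','y', 0 ,'y'
};

/* Copy `in` to `out`, folding accented Latin-1 letters to their base
 * letter and dropping combining diacritical marks. Everything else is
 * copied byte for byte. Output is truncated if `out` is too small
 * (a 3- or 4-byte sequence may be cut at the end in that case).
 * Returns the number of bytes written, not counting the terminator. */
size_t fold_for_index(const char *in, char *out, size_t out_size)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    assert(in != NULL && out != NULL && out_size > 0);

    while (*p != '\0') {
        /* Two-byte UTF-8 sequence: 110xxxxx 10xxxxxx (U+0080..U+07FF). */
        if ((p[0] & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
            unsigned cp = ((unsigned)(p[0] & 0x1F) << 6) | (unsigned)(p[1] & 0x3F);
            char base = 0;

            if (cp >= 0x0300 && cp <= 0x036F) {   /* combining mark: drop */
                p += 2;
                continue;
            }
            if (cp >= 0x00C0 && cp <= 0x00FF)
                base = latin1_base[cp - 0x00C0];

            if (base != 0) {                      /* folded to one ASCII byte */
                if (n + 1 >= out_size) break;
                out[n++] = base;
            } else {                              /* unmapped: keep both bytes */
                if (n + 2 >= out_size) break;
                out[n++] = (char)p[0];
                out[n++] = (char)p[1];
            }
            p += 2;
            continue;
        }
        /* ASCII, or a byte of a longer sequence: copy unchanged. */
        if (n + 1 >= out_size) break;
        out[n++] = (char)*p++;
    }
    out[n] = '\0';
    return n;
}
```

Calling `fold_for_index("f\xC3\xBCr", buf, sizeof buf)` on the UTF-8 bytes for "für" leaves "fur" in `buf`. Latin Extended-A would need a second table of the same shape; it is left out here to keep the sketch short.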
Linking files with different discids would probably be a good idea (e.g. the file could consist of one line: "LINK=<discid>", or maybe include track offsets too ... also, two-way linking might be useful, i.e. the master containing a record of its 'copies'). Finding the duplicates is the easy bit once the user has supplied title, artist and track info. The "hard" ;) bit is what to do about it ... let the user confirm the match, link automatically if it is a good match, or set some sort of certainty threshold below which the user chooses? Anyway, this kind of basic database admin is a slightly different problem, it just happens to be a lot easier with fulltext searching...

For the time being, I'll just write the module in C for ASCII as a standalone app for easy testing. I'll set it up so that integration into the server software will consist merely of adding a couple of function calls and including the module sources. You can probably expect a working prototype in a few days to a few weeks (I just started work experience, 9-5, 5 days/week, 20 weeks, and I'm still getting used to it after the light hours at Uni), which I'll put up on CVS or whatever so you can have a tinker with it. Migration to C++ should be trivial - just a bit of preprocessor work - and I can do a Java version for the Java server, but that'll be much further down the track.

Well, keep the suggestions/comments coming...

G'night,
-Yuri
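As a rough illustration of the linking idea above, the sketch below parses a one-line "LINK=<discid>" file and picks one of the three options for a suspected duplicate. It is only a sketch under assumptions: the names `parse_link_line`, `decide_duplicate` and `enum dup_action` are hypothetical, the match score is assumed to be a number where higher means more certain, and both thresholds are placeholders that nobody in this thread has settled on. It relies on freedb disc IDs being 8 hex digits.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define DISCID_LEN 8   /* freedb disc IDs are 8 hex digits */

/* Parse a line of the form "LINK=<discid>" (optionally followed by a
 * line ending). On success, store the lowercased disc ID in `discid`
 * and return 1. Return 0 on any other input; `discid` may then hold
 * partial data and should be ignored. */
int parse_link_line(const char *line, char discid[DISCID_LEN + 1])
{
    static const char prefix[] = "LINK=";
    size_t i;

    assert(line != NULL && discid != NULL);

    if (strncmp(line, prefix, sizeof prefix - 1) != 0)
        return 0;
    line += sizeof prefix - 1;

    for (i = 0; i < DISCID_LEN; i++) {
        /* A '\0' fails isxdigit(), so this never reads past the end. */
        if (!isxdigit((unsigned char)line[i]))
            return 0;
        discid[i] = (char)tolower((unsigned char)line[i]);
    }
    discid[DISCID_LEN] = '\0';

    return line[DISCID_LEN] == '\0' ||
           line[DISCID_LEN] == '\n' ||
           line[DISCID_LEN] == '\r';
}

/* What to do with a suspected duplicate. */
enum dup_action { DUP_IGNORE, DUP_ASK_USER, DUP_AUTO_LINK };

/* Hypothetical policy: link automatically at or above `auto_threshold`,
 * ask the user at or above `ask_threshold`, otherwise do nothing. */
enum dup_action decide_duplicate(double score,
                                 double ask_threshold,
                                 double auto_threshold)
{
    if (score >= auto_threshold)
        return DUP_AUTO_LINK;
    if (score >= ask_threshold)
        return DUP_ASK_USER;
    return DUP_IGNORE;
}
```

Track offsets and two-way links (the master listing its copies) would need extra fields in the link file; they are deliberately left out here.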