Searching: msg#00005

audio.freedb.devel

Subject: Searching

OK, I hope this has the correct "From" field...

About UTF-8: it doesn't affect the index, but it does affect the hashing
function. Also, there has to be a decision as to whether to strip some UTF-8
characters down to ASCII for misspelling/ASCII-client reasons, or leave them
as-is: e.g. a good part of the Latin-1 Supplement (0080-00FF) and most of
Latin Extended-A (0100-017F) consist of "modified" letters which might
often be replaced by their "base" letters in user input. IMHO these should be
converted to their base characters in the fulltext search index, to improve
the number of 'valid' matches. This doesn't mean we try to convert everything
to ASCII (that would be silly at best), just that e.g. u with umlaut (00FC)
ends up as a straight u. The combining diacritical marks (0300-036F I think)
can just be dropped. I'm not sure what to do about things like the control
pictures (2400-243F): should these be 'applied' and the result 'read', or are they
intended to be 'read' verbatim?
Anyway, for the time being I'll just read 8-bit characters in the hashing,
and I'll make the hashing function a separate module (which is almost
obligatory anyway) so that it can be changed without fuss.
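
To make the folding idea above concrete, here is a rough sketch in C. It is only an illustration under stated assumptions, not the module described in this post: the name fold_utf8 and the lookup table are made up for the example, only the letters in 00C0-00FF are mapped (Latin Extended-A would need a second table), and letters with no obvious single base (AE, eth, thorn, sharp s) are copied through unchanged. Combining marks in 0300-036F are dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Base letters for U+00C0..U+00FF; 0 means "no single base letter, keep as-is".
 * Hypothetical table for this sketch only. */
static const char latin1_base[64] = {
    /* C0 */ 'A','A','A','A','A','A', 0 ,'C','E','E','E','E','I','I','I','I',
    /* D0 */  0 ,'N','O','O','O','O','O', 0 ,'O','U','U','U','U','Y', 0 , 0 ,
    /* E0 */ 'a','a','a','a','a','a', 0 ,'c','e','e','e','e','i','i','i','i',
    /* F0 */  0 ,'n','o','o','o','o','o', 0 ,'o','u','u','u','u','y', 0 ,'y'
};

/* Copy the UTF-8 string `in` to `out`, folding accented Latin-1 letters to
 * their base letters and dropping combining marks (U+0300..U+036F).
 * Stops early if `out` fills up. Returns the number of bytes written,
 * not counting the terminating NUL. */
size_t fold_utf8(const char *in, char *out, size_t out_size)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    assert(in != NULL && out != NULL && out_size > 0);

    while (*p != '\0') {
        /* Two-byte sequence 110xxxxx 10xxxxxx covers U+0080..U+07FF. */
        if (p[0] >= 0xC2 && p[0] <= 0xDF && (p[1] & 0xC0) == 0x80) {
            unsigned cp = ((unsigned)(p[0] & 0x1F) << 6) | (unsigned)(p[1] & 0x3F);
            char base = (cp >= 0xC0 && cp <= 0xFF) ? latin1_base[cp - 0xC0] : 0;

            if (cp >= 0x0300 && cp <= 0x036F) {
                p += 2;                      /* combining mark: drop it */
                continue;
            }
            if (base != 0) {
                if (n + 1 >= out_size) break;
                out[n++] = base;             /* accented letter -> base letter */
                p += 2;
                continue;
            }
            if (n + 2 >= out_size) break;
            out[n++] = (char)p[0];           /* anything else: copy unchanged */
            out[n++] = (char)p[1];
            p += 2;
            continue;
        }
        if (n + 1 >= out_size) break;
        out[n++] = (char)*p++;               /* ASCII and longer sequences pass through */
    }
    out[n] = '\0';
    return n;
}
```

With this, the UTF-8 bytes for "Müller" come out as "Muller", and an "e" followed by a combining acute (0301) comes out as a plain "e". The same function would be run over both the indexed text and the query so the two sides agree.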

Joerg is quite right about database fulltext searches being slow... this is
why fulltext search benchmarks are rare as hen's teeth, and indexed search
speed is on the wishlist of MySQL. Speedwise, like I said, my method should
be near-optimal :P - there certainly shouldn't be a speed decrease from the
existing html search (if there is, well, :-X I'll just go stick my head in a
bucket of water ... I've got one ready ... hang on a minute ...)

As regards the existing HTML search, I tried to give it a spin, but it dumps
out with "Can't locate Net/freedb/file.pm" (which I assume is a freedb file
parser from a freedb Perl binding?), and I couldn't find any reference to
this in CVS or on the freedb site, and it doesn't seem to be part of the
CDDB::File or Net::freedb Perl modules ... (help?)

Linking files with different discids would probably be a good idea (e.g. the
file could consist of one line: "LINK=<discid>", or maybe include track offsets
too ... also, two-way linking might be useful (i.e. the master containing a
record of its 'copies')) - finding the duplicates is the easy bit once the user
has supplied title, artist and track info. The "hard" ;) bit is what to do
about it ... let the user confirm the match, link automatically if it is a
good match, or set some sort of certainty threshold below which the user
chooses? Anyway, this kind of basic database admin is a slightly different
problem, it just happens to be a lot easier with fulltext searching...
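
The threshold option could look roughly like the sketch below. Everything here is an assumption for illustration: the score scale (0.0 to 1.0), the two cut-off parameters, and the names decide_link and format_link_line are invented, and the one-line "LINK=<discid>" format is just the idea floated above, written with the discid as 8 hex digits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* What to do with a candidate duplicate (hypothetical names). */
enum link_action { LINK_AUTO, LINK_ASK_USER, LINK_NONE };

/* `score` is a match quality in 0.0..1.0 from some scorer not shown here.
 * At or above auto_threshold we link without asking; between the two
 * thresholds the user confirms; below ask_threshold we do nothing. */
enum link_action decide_link(double score, double auto_threshold,
                             double ask_threshold)
{
    if (score >= auto_threshold)
        return LINK_AUTO;
    if (score >= ask_threshold)
        return LINK_ASK_USER;
    return LINK_NONE;
}

/* Write the one-line link record "LINK=<discid>" into `out`.
 * Returns what snprintf returns: the length needed, or a negative
 * value on error. */
int format_link_line(unsigned long discid, char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0);
    return snprintf(out, out_size, "LINK=%08lx", discid & 0xFFFFFFFFUL);
}
```

Keeping the decision in one small function means the cut-offs can be tuned, or the automatic case switched off entirely, without touching the search code.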

For the time being, I'll just write the module in C for ASCII as a standalone
app for easy testing. I'll set it up so that integration into the server software
will consist merely of adding a couple of function calls, and including the
module sources. You can probably expect a working prototype in a few days to
a few weeks (I just started work experience, 9-5 5 days/week, 20 weeks, and
I'm still getting used to it after the light hours at Uni) which I'll put up
on CVS or whatever so you can have a tinker with it. Migration to C++ should
be trivial - just a bit of preprocessor work - and I can do a Java version
for the Java server, but that'll be much further down the track.

Well, keep the suggestions/comments coming...

G'night,
-Yuri

