logo       
Google Custom Search
    AddThis Social Bookmark Button

Proposal: FULLTEXT overhaul in MySQL 5.0: msg#00407

Subject: Proposal: FULLTEXT overhaul in MySQL 5.0
I'd like to put forth a proposal on an overhaul of MySQL's fulltext search
(FTS) indexing system for MySQL version 5.0. The main goals of this proposal
are: a) to make the FTS more integrated into MySQL by reusing MySQL
capabilities, and my using these capabilites make FTS b) more flexible and
c) faster in certain cases while not degrading its speed from the present
implementation.

For implementation of FTS I propose using (hidden) separate tables to hold
the inverted index and other necessary information. To do this, I propose
for FTS to be implemented internally as a SQL rewriter. This SQL rewriter
would translate a query to another. For example, a FTS query with a MATCH
... AGAINST would be rewritten to use the other tables. 

Benefits: 
- The MySQL engine can use its optimizer. Currently, the FTS engine is so
separated that a boolean mode AND does a full table scan (MATCH (txt)
against ('+word1 +word2' IN BOOLEAN MODE)). 
- During debugging, one can use any MySQL tool to access the FTS tables and
look inside.
- Simpler development of Information Retrieval (IR) algorithms by outside
developers.

Development plan:
- I am currently building a IR/FTS system utilizing SQL tables for searching
Network54's 100GB of messages. Since we require the use of both relevance
and boolean searches either in the whole database or just inside certain
collections, we have six combinations of IR/FTS searches to test (and a lot
of data to test it on). I see this work as a sunk cost on our part, but may
well bear fruit by trying to reintegrate this work back inside MySQL.
- Someone would need to change the parser to rewrite SQL: MATCH queries to
use the new tables; index creation to create them; index deletion to delete
them, etc.

Additional thoughts:

o FTS in separate tables
        - can use MySQL built-in query optimizer
        - FTS tables can be externally hidden
                as well as use hidden foreign key
                system or triggers
        - ALTER table that is unrelated to FTS much 
                faster (as it skips the FTS rebuild
                since it is separate and not using
                internal pointers, instead primary key)
        - easier to develop/debug (reduce source 
                recompilations)
        - will work with InnoDB
        - somewhat less efficient in space as data
                file would contain the indexed word as
                well as the index (would clustered indexes 
                in InnoDB help here?)
        - optional delayed indexing
o FTS ranking system, etc in stored procedures
        - easier to develop/debug (reduce source 
                recompilations as algorithms are in
                stored procedures)
        - flexible (can let user override for a 
                database or table)
        - possible?
o Multi-lingual
        - stemming based on language (either default
                for database, table, or given for a row)


Sincerely,
Steven Roussey
http://Network54.com/ 




---------------------------------------------------------------------
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/           (the list archive)

To request this thread, e-mail internals-thread6688@xxxxxxxxxxxxxxx
To unsubscribe, e-mail <internals-unsubscribe@xxxxxxxxxxxxxxx>





Try Searching:
servers, voip, java, networking, microsoft ...
<Prev in Thread] Current Thread [Next in Thread>