I'd like to put forth a proposal on an overhaul of MySQL's fulltext search
(FTS) indexing system for MySQL version 5.0. The main goals of this proposal
are: a) to make the FTS more integrated into MySQL by reusing MySQL
capabilities, and my using these capabilites make FTS b) more flexible and
c) faster in certain cases while not degrading its speed from the present
implementation.
For implementation of FTS I propose using (hidden) separate tables to hold
the inverted index and other necessary information. To do this, I propose
for FTS to be implemented internally as a SQL rewriter. This SQL rewriter
would translate a query to another. For example, a FTS query with a MATCH
... AGAINST would be rewritten to use the other tables.
Benefits:
- The MySQL engine can use its optimizer. Currently, the FTS engine is so
separated that a boolean mode AND does a full table scan (MATCH (txt)
against ('+word1 +word2' IN BOOLEAN MODE)).
- During debugging, one can use any MySQL tool to access the FTS tables and
look inside.
- Simpler development of Information Retrieval (IR) algorithms by outside
developers.
Development plan:
- I am currently building a IR/FTS system utilizing SQL tables for searching
Network54's 100GB of messages. Since we require the use of both relevance
and boolean searches either in the whole database or just inside certain
collections, we have six combinations of IR/FTS searches to test (and a lot
of data to test it on). I see this work as a sunk cost on our part, but may
well bear fruit by trying to reintegrate this work back inside MySQL.
- Someone would need to change the parser to rewrite SQL: MATCH queries to
use the new tables; index creation to create them; index deletion to delete
them, etc.
Additional thoughts:
o FTS in separate tables
- can use MySQL built-in query optimizer
- FTS tables can be externally hidden
as well as use hidden foreign key
system or triggers
- ALTER table that is unrelated to FTS much
faster (as it skips the FTS rebuild
since it is separate and not using
internal pointers, instead primary key)
- easier to develop/debug (reduce source
recompilations)
- will work with InnoDB
- somewhat less efficient in space as data
file would contain the indexed word as
well as the index (would clustered indexes
in InnoDB help here?)
- optional delayed indexing
o FTS ranking system, etc in stored procedures
- easier to develop/debug (reduce source
recompilations as algorithms are in
stored procedures)
- flexible (can let user override for a
database or table)
- possible?
o Multi-lingual
- stemming based on language (either default
for database, table, or given for a row)
Sincerely,
Steven Roussey
http://Network54.com/
---------------------------------------------------------------------
Before posting, please check:
http://www.mysql.com/manual.php (the manual)
http://lists.mysql.com/ (the list archive)
To request this thread, e-mail internals-thread6688@xxxxxxxxxxxxxxx
To unsubscribe, e-mail <internals-unsubscribe@xxxxxxxxxxxxxxx>
|
Try Searching:
servers, voip, java, networking, microsoft ...
|
|
|
|