diff --git a/Docs/manual.texi b/Docs/manual.texi index b4efffc4f6c..e0b5039e04f 100644 --- a/Docs/manual.texi +++ b/Docs/manual.texi @@ -33990,8 +33990,8 @@ DELETE FROM t1,t2 USING t1,t2,t3 WHERE t1.id=t2.id AND t2.id=t3.id In the above case we delete matching rows just from tables @code{t1} and @code{t2}. -@code{ORDER BY} and using multiple tables in the @code{DELETE} is supported -in MySQL 4.0. +@code{ORDER BY} and using multiple tables in the @code{DELETE} statement +are supported in MySQL 4.0. If an @code{ORDER BY} clause is used, the rows will be deleted in that order. This is really only useful in conjunction with @code{LIMIT}. For example: @@ -35947,16 +35947,17 @@ You can set the default isolation level for @code{mysqld} with @cindex full-text search @cindex FULLTEXT -Since Version 3.23.23, MySQL has support for full-text indexing +As of Version 3.23.23, MySQL has support for full-text indexing and searching. Full-text indexes in MySQL are an index of type @code{FULLTEXT}. @code{FULLTEXT} indexes can be created from @code{VARCHAR} and @code{TEXT} columns at @code{CREATE TABLE} time or added later with -@code{ALTER TABLE} or @code{CREATE INDEX}. For large datasets, adding -@code{FULLTEXT} index with @code{ALTER TABLE} (or @code{CREATE INDEX}) -would be much faster than inserting rows into the empty table that has -a @code{FULLTEXT} index. +@code{ALTER TABLE} or @code{CREATE INDEX}. For large datasets, it will be +much faster to load your data into a table that has no @code{FULLTEXT} +index, then create the index with @code{ALTER TABLE} (or @code{CREATE +INDEX}). Loading data into a table that already has a @code{FULLTEXT} +index will be slower. -Full-text search is performed with the @code{MATCH} function. +Full-text searching is performed with the @code{MATCH()} function. 
@example mysql> CREATE TABLE articles ( @@ -35988,24 +35989,35 @@ mysql> SELECT * FROM articles 2 rows in set (0.00 sec) @end example -The function @code{MATCH} matches a natural language (or boolean, -see below) query in case-insensitive fashion @code{AGAINST} -a text collection (which is simply the set of columns covered by a -@code{FULLTEXT} index). For every row in a table it returns relevance - -a similarity measure between the text in that row (in the columns that are -part of the collection) and the query. When it is used in a @code{WHERE} -clause (see example above) the rows returned are automatically sorted with -relevance decreasing. Relevance is a non-negative floating-point number. -Zero relevance means no similarity. Relevance is computed based on the -number of words in the row, the number of unique words in that row, the -total number of words in the collection, and the number of documents (rows) -that contain a particular word. +The @code{MATCH()} function performs a natural language search for a string +against a text collection (a set of one or more columns included in +a @code{FULLTEXT} index). The search string is given as the argument to +@code{AGAINST()}. The search is performed in case-insensitive fashion. +For every row in the table, @code{MATCH()} returns a relevance value, +that is, a similarity measure between the search string and the text in +that row in the columns named in the @code{MATCH()} list. -The above is a basic example of using @code{MATCH} function. Rows are -returned with relevance decreasing. +When @code{MATCH()} is used in a @code{WHERE} clause (see example above) +the rows returned are automatically sorted with highest relevance first. +Relevance values are non-negative floating-point numbers. Zero relevance +means no similarity. 
Relevance is computed based on the number of words +in the row, the number of unique words in that row, the total number of +words in the collection, and the number of documents (rows) that contain +a particular word. + +It is also possible to perform a boolean mode search. This is explained +later in the section. + +The preceding example is a basic illustration showing how to use the +@code{MATCH()} function. Rows are returned in order of decreasing +relevance. + +The next example shows how to retrieve the relevance values explicitly. +As neither @code{WHERE} nor @code{ORDER BY} clauses are present, returned +rows are not ordered. @example -mysql> SELECT id,MATCH title,body AGAINST ('Tutorial') FROM articles; +mysql> SELECT id,MATCH (title,body) AGAINST ('Tutorial') FROM articles; +----+-----------------------------------------+ | id | MATCH (title,body) AGAINST ('Tutorial') | +----+-----------------------------------------+ @@ -36019,12 +36031,16 @@ mysql> SELECT id,MATCH title,body AGAINST ('Tutorial') FROM articles; 6 rows in set (0.00 sec) @end example -This example shows how to retrieve the relevances. As neither @code{WHERE} -nor @code{ORDER BY} clauses are present, returned rows are not ordered. +The following example is more complex. The query returns the relevance +and still sorts the rows in order of decreasing relevance. To achieve +this result, you should specify @code{MATCH()} twice. This will cause no +additional overhead, because the MySQL optimiser will notice that the +two @code{MATCH()} calls are identical and invoke the full-text search +code only once. 
@example -mysql> SELECT id, body, MATCH title,body AGAINST ( - -> 'Security implications of running MySQL as root') AS score +mysql> SELECT id, body, MATCH (title,body) AGAINST + -> ('Security implications of running MySQL as root') AS score -> FROM articles WHERE MATCH (title,body) AGAINST -> ('Security implications of running MySQL as root'); +----+-------------------------------------+-----------------+ @@ -36036,18 +36052,12 @@ mysql> SELECT id, body, MATCH title,body AGAINST ( 2 rows in set (0.00 sec) @end example -This is more complex example - the query returns the relevance and still -sorts the rows with relevance decreasing. To achieve it one should specify -@code{MATCH} twice. Note, that this will cause no additional overhead, as -MySQL optimiser will notice that these two @code{MATCH} calls are -identical and will call full-text search code only once. +MySQL uses a very simple parser to split text into words. A ``word'' +is any sequence of characters consisting of letters, numbers, @samp{'}, +and @samp{_}. Any ``word'' that is present in the stopword list or is just +too short (3 characters or less) is ignored. -MySQL uses a very simple parser to split text into words. A -``word'' is any sequence of letters, numbers, @samp{'}, and @samp{_}. Any -``word'' that is present in the stopword list or just too short (3 -characters or less) is ignored. - -Every correct word in the collection and in the query is weighted, +Every correct word in the collection and in the query is weighted according to its significance in the query or collection. This way, a word that is present in many documents will have lower weight (and may even have a zero weight), because it has lower semantic value in this @@ -36057,28 +36067,28 @@ relevance of the row. Such a technique works best with large collections (in fact, it was carefully tuned this way). 
For very small tables, word distribution -does not reflect adequately their semantical value, and this model -may sometimes produce bisarre results. +does not adequately reflect their semantic value, and this model +may sometimes produce bizarre results. @example mysql> SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('MySQL'); Empty set (0.00 sec) @end example -Search for the word @code{MySQL} produces no results in the above example. -Word @code{MySQL} is present in more than half of rows, and as such, is -effectively treated as a stopword (that is, with semantical value zero). -It is, really, the desired behavior - a natural language query should not -return every second row in 1GB table. +The search for the word @code{MySQL} produces no results in the above +example, because that word is present in more than half of the rows. As such, +it is effectively treated as a stopword (that is, a word with zero semantic +value). This is the most desirable behavior -- a natural language query +should not return every second row from a 1GB table. A word that matches half of rows in a table is less likely to locate relevant documents. In fact, it will most likely find plenty of irrelevant documents. We all know this happens far too often when we are trying to find something on the Internet with a search engine. It is with this reasoning that such rows -have been assigned a low semantical value in @strong{this particular dataset}. +have been assigned a low semantic value in @strong{this particular dataset}. -Since version 4.0.1 MySQL can also perform boolean fulltext searches using -@code{IN BOOLEAN MODE} modifier. +As of Version 4.0.1, MySQL can also perform boolean full-text searches using +the @code{IN BOOLEAN MODE} modifier. 
@example mysql> SELECT * FROM articles WHERE MATCH (title,body) @@ -36095,38 +36105,44 @@ mysql> SELECT * FROM articles WHERE MATCH (title,body) @end example This query retrieved all the rows that contain the word @code{MySQL} -(note: 50% threshold is gone), but does @strong{not} contain the word -@code{YourSQL}. Note, that it does not auto-magically sort rows in -decreasing relevance order (the last row has the highest relevance, -as it contains @code{MySQL} twice). Boolean fulltext search can also -work even without @code{FULLTEXT} index, but it would be @strong{slow}. +(note: the 50% threshold is not used), but that do @strong{not} contain +the word @code{YourSQL}. Note that a boolean mode search does not +automatically sort rows in order of decreasing relevance. You can +see this from the result of the preceding query, where the row with the +highest relevance (the one that contains @code{MySQL} twice) is listed +last, not first. A boolean full-text search can work even without +a @code{FULLTEXT} index, although it would be @strong{slow}. -Boolean fulltext search supports the following operators: +The boolean full-text search capability supports the following operators: @table @code @item + -A plus sign prepended to a word indicates that this word @strong{must be} +A leading plus sign indicates that this word @strong{must be} present in every row returned. @item - -A minus sign prepended to a word indicates that this word @strong{must not} -be present in the rows returned. +A leading minus sign indicates that this word @strong{must not be} +present in any row returned. @item -By default - without plus or minus - the word is optional, but the rows that -contain it will be rated higher. This mimicks the behaviour of -@code{MATCH ... AGAINST()} without @code{IN BOOLEAN MODE} modifier. +By default (when neither plus nor minus is specified) the word is optional, +but the rows that contain it will be rated higher. This mimics the +behaviour of @code{MATCH() ... 
AGAINST()} without the @code{IN BOOLEAN +MODE} modifier. @item < > -These two operators are used to increase and decrease word's contribution -to the relevance value, assigned to a row. See an example below. +These two operators are used to change a word's contribution to the +relevance value that is assigned to a row. The @code{<} operator +decreases the contribution and the @code{>} operator increases it. +See the example below. @item ( ) -Parentheses are used - as usual - to group words into subexpressions. +Parentheses are used to group words into subexpressions. @item ~ -This is negation operator. It makes word's contribution to the row -relevance negative. It's useful for marking noise words. A row that has -such a word will be rated lower than others, but will not be excluded -altogether, as with @code{-} operator. +A leading tilde acts as a negation operator, causing the word's +contribution to the row relevance to be negative. It's useful for marking +noise words. A row that contains such a word will be rated lower than +others, but will not be excluded altogether, as it would be with the +@code{-} operator. @item * -This is truncation operator. Unlike others it should be @strong{appended} -to the word, not prepended. +An asterisk is the truncation operator. Unlike the other operators, it +should be @strong{appended} to the word, not prepended. @end table And here are some examples: @@ -36148,25 +36164,25 @@ order), but rank ``apple pie'' higher than ``apple strudel''. 
@end table @menu -* Fulltext Restrictions:: Fulltext Restrictions +* Fulltext Restrictions:: Full-text Restrictions * Fulltext Fine-tuning:: Fine-tuning MySQL Full-text Search * Fulltext TODO:: Full-text Search TODO @end menu @node Fulltext Restrictions, Fulltext Fine-tuning, Fulltext Search, Fulltext Search -@subsection Fulltext Restrictions +@subsection Full-text Restrictions @itemize @bullet @item -All parameters to the @code{MATCH} function must be columns from the -same table that is part of the same fulltext index, unless this -@code{MATCH} is @code{IN BOOLEAN MODE}. +All parameters to the @code{MATCH()} function must be columns from the +same table that is part of the same @code{FULLTEXT} index, unless the +@code{MATCH()} is @code{IN BOOLEAN MODE}. @item -Column list between @code{MATCH} and @code{AGAINST} must match exactly -a column list in the @code{FULLTEXT} index definition, unless this -@code{MATCH} is @code{IN BOOLEAN MODE}. +The @code{MATCH()} column list must exactly match the column list in some +@code{FULLTEXT} index definition for the table, unless this @code{MATCH()} +is @code{IN BOOLEAN MODE}. @item -The argument to @code{AGAINST} must be a constant string. +The argument to @code{AGAINST()} must be a constant string. @end itemize @@ -36176,7 +36192,7 @@ The argument to @code{AGAINST} must be a constant string. Unfortunately, full-text search has few user-tunable parameters yet, although adding some is very high on the TODO. If you have a MySQL source distribution (@pxref{Installing source}), you can -more control on the full-text search behavior. +exert more control over full-text searching behavior. Note that full-text search was carefully tuned for the best searching effectiveness. Modifying the default behavior will, in most cases, @@ -36186,37 +36202,37 @@ unless you know what you are doing! 
@itemize @bullet @item -Minimal length of word to be indexed is defined by MySQL +The minimum length of words to be indexed is defined by the MySQL variable @code{ft_min_word_length}. @xref{SHOW VARIABLES}. Change it to the value you prefer, and rebuild your @code{FULLTEXT} indexes. @item The stopword list is defined in @file{myisam/ft_static.c} -Modify it to your taste, recompile MySQL and rebuild +Modify it to your taste, recompile MySQL, and rebuild your @code{FULLTEXT} indexes. @item -The 50% threshold is caused by the particular weighting scheme chosen. To -disable it, change the following line in @file{myisam/ftdefs.h}: +The 50% threshold is determined by the particular weighting scheme chosen. +To disable it, change the following line in @file{myisam/ftdefs.h}: @example #define GWS_IN_USE GWS_PROB @end example -to +To: @example #define GWS_IN_USE GWS_FREQ @end example -and recompile MySQL. +Then recompile MySQL. There is no need to rebuild the indexes in this case. -@strong{Note:} by doing this you @strong{severely} decrease MySQL ability -to provide adequate relevance values by @code{MATCH} function. -It means, that if you really need to search for such a common words, -then you should rather search @code{IN BOOLEAN MODE}, which does not -has 50% threshold. +@strong{Note:} By doing this you @strong{severely} decrease MySQL's ability +to provide adequate relevance values for the @code{MATCH()} function. +If you really need to search for such common words, it would be better to +search using @code{IN BOOLEAN MODE} instead, which does not observe the 50% +threshold. @item -Sometimes search engine maintaner would like to change operators used -for boolean fulltext search. They are defined by a +Sometimes the search engine maintainer would like to change the operators used +for boolean full-text searches. These are defined by the @code{ft_boolean_syntax} variable. @xref{SHOW VARIABLES}. 
Still, this variable is read-only, its value is set in @file{myisam/ft_static.c}. @@ -36237,7 +36253,7 @@ the user wants to treat as words, examples are "C++", "AS/400", "TCP/IP", etc. @item Support for multi-byte charsets. @item Make stopword list to depend of the language of the data. @item Stemming (dependent of the language of the data, of course). -@item Generic user-supplyable UDF (?) preparser. +@item Generic user-suppliable UDF (?) preparser. @item Make the model more flexible (by adding some adjustable parameters to @code{FULLTEXT} in @code{CREATE/ALTER TABLE}). @end itemize @@ -49697,7 +49713,7 @@ Fixed bug with @code{LOCK TABLE} and BDB tables. @itemize @bullet @item -Fixed a bug when using @code{MATCH} in @code{HAVING} clause. +Fixed a bug when using @code{MATCH()} in @code{HAVING} clause. @item Fixed a bug when using @code{HEAP} tables with @code{LIKE}. @item @@ -50266,7 +50282,7 @@ that caused @code{mysql_install_db} to core dump on some Linux machines. @item Changed @code{mi_create()} to use less stack space. @item -Fixed bug with optimiser trying to over-optimise @code{MATCH} when used +Fixed bug with optimiser trying to over-optimise @code{MATCH()} when used with @code{UNIQUE} key. @item Changed @code{crash-me} and the MySQL benchmarks to also work @@ -50722,7 +50738,7 @@ More variables in @code{SHOW SLAVE STATUS} and @code{SHOW MASTER STATUS}. @item @code{SLAVE STOP} now will not return until the slave thread actually exits. @item -Full text search via the @code{MATCH} function and @code{FULLTEXT} index type +Full text search via the @code{MATCH()} function and @code{FULLTEXT} index type (for MyISAM files). This makes @code{FULLTEXT} a reserved word. @end itemize