Add description of new features

Teodor Sigaev 2006-10-31 16:23:05 +00:00
parent 7e63445d59
commit bf028fa8a6
3 changed files with 502 additions and 89 deletions

View File

@ -427,9 +427,9 @@ concatenation also works with NULL fields.</strong></p>
<p>We need to create the index on the column idxFTI. Keep in mind
that the database will update the index when some action is taken.
In this case we _need_ the index (The whole point of Full Text
INDEXING ;-)), so don't worry about any indexing overhead. We will
create an index based on the gist or gin function. GiST stands for
Generalized Search Tree; GIN is an inverted index (see <a href="tsearch2-ref.html#indexes">The tsearch2 Reference: Indexes</a>).</p>
<pre>
CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
VACUUM FULL ANALYZE;
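-- Since PostgreSQL 8.2 a GIN index can be built instead (a sketch; same table
-- and column assumed, use one index type or the other):
CREATE INDEX idxFTI_idx ON tblMessages USING gin(idxFTI);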

View File

@ -1,7 +1,6 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<link type="text/css" rel="stylesheet" href="/~megera/postgres/gist/tsearch/tsearch.css">
<title>tsearch2 guide</title>
</head>
<body>
@ -9,16 +8,13 @@
<p align=center>
Brandon Craig Rhodes<br>30 June 2003
<br>Updated to 8.2 release by Oleg Bartunov, October 2006
<p>
This Guide introduces the reader to the PostgreSQL tsearch2 module,
version&nbsp;2.
More formal descriptions of the module's types and functions
are provided in the <a href="tsearch2-ref.html">tsearch2 Reference</a>,
which is a companion to this document.
<p>
First we will examine the <tt>tsvector</tt> and <tt>tsquery</tt> types
and how they are used to search documents;
@ -32,15 +28,40 @@ you should be able to run the examples here exactly as they are typed.
<hr>
<h2>Table of Contents</h2>
<blockquote>
<a href="#intro">Introduction to FTS with tsearch2</a><br>
<a href="#vectors_queries">Vectors and Queries</a><br>
<a href="#simple_search">A Simple Search Engine</a><br>
<a href="#weights">Ranking and Position Weights</a><br>
<a href="#casting">Casting Vectors and Queries</a><br>
<a href="#parsing_lexing">Parsing and Lexing</a><br>
<a href="#ref">Additional information</a>
</blockquote>
<hr>
<h2><a name="intro">Introduction to FTS with tsearch2</a></h2>
The purpose of FTS is to
find <b>documents</b> that satisfy a <b>query</b> and optionally return
them in some <b>order</b>.
The most common case: find documents containing all query terms and return them in order
of their similarity to the query. A document in a database can be
any text attribute, or a combination of text attributes from one or many tables
(using joins).
Text search operators have existed for years; in PostgreSQL they are
<tt><b>~, ~*, LIKE, ILIKE</b></tt>, but they lack linguistic support,
tend to be slow and have no relevance ranking. The idea behind tsearch2
is rather simple - preprocess the document at index time to save time at the search stage.
Preprocessing includes:
<ul>
<li>parsing the document into words
<li>linguistic processing - normalizing words to obtain lexemes
<li>storing the document in a way optimized for searching
</ul>
Tsearch2, in a nutshell, provides an FTS operator (contains) for two new data types,
which represent a document and a query - <tt>tsquery @@ tsvector</tt>.
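This "contains" check can be tried directly in SQL (a minimal sketch using the <i>default</i> configuration; no table required):
<pre>
=# SELECT to_tsvector('default', 'The quick brown fox') @@ to_tsquery('default', 'fox & quick');
 ?column?
----------
 t
</pre>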
<P>
<h2><a name=vectors_queries>Vectors and Queries</a></h2>
<blockquote>
@ -79,6 +100,8 @@ Preparing your document index involves two steps:
on the <tt>tsvector</tt> column of a table,
which implements a form of the Berkeley
<a href="http://gist.cs.berkeley.edu/"><i>Generalized Search Tree</i></a>.
Since PostgreSQL 8.2, tsearch2 supports the <a href="http://www.sigaev.ru/gin/">GIN</a> index,
an inverted index commonly used in search engines. It adds scalability to tsearch2.
</ul>
Once your documents are indexed,
performing a search involves:
@ -251,7 +274,7 @@ and give you an error to prevent this mistake:
<pre>
=# <b>SELECT to_tsquery('the')</b>
NOTICE: Query contains only stopword(s) or doesn't contain lexem(s), ignored
to_tsquery
------------
@ -483,8 +506,8 @@ The <tt>rank()</tt> function existed in older versions of OpenFTS,
and has the feature that you can assign different weights
to words from different sections of your document.
The <tt>rank_cd()</tt> uses a recent technique for weighting results
and also allows different weights to be given
to different sections of your document (since 8.2).
<p>
Both ranking functions allow you to specify,
as an optional last argument,
@ -511,9 +534,6 @@ for details
see the <a href="tsearch2-ref.html#ranking">section on ranking</a>
in the Reference.
<p>
Currently tsearch2 supports four different weight labels:
<tt>'D'</tt>, the default weight;
and <tt>'A'</tt>, <tt>'B'</tt>, and <tt>'C'</tt>.
@ -730,7 +750,7 @@ The main problem is that the apostrophe and backslash
are important <i>both</i> to PostgreSQL when it is interpreting a string,
<i>and</i> to the <tt>tsvector</tt> conversion function.
You may want to review section
<a href="http://www.postgresql.org/docs/current/static/sql-syntax.html#SQL-SYNTAX-STRINGS">
&ldquo;String Constants&rdquo;</a>
in the PostgreSQL documentation before proceeding.
<p>
@ -1051,6 +1071,14 @@ using the same scheme to determine the dictionary for each token,
with the difference that the query parser recognizes as special
the boolean operators that separate query words.
<h2><a name="ref">Additional information</a></h2>
More information about tsearch2 is available from the
<a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2">tsearch2</a> page.
Also, it's worth checking the
<a href="http://www.sai.msu.su/~megera/wiki/Tsearch2">tsearch2 wiki</a> pages.
</body>
</html>

View File

@ -1,53 +1,74 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head>
<title>tsearch2 reference</title></head>
<body>
<h1 align="center">The tsearch2 Reference</h1>
<p align="center">
Brandon Craig Rhodes<br>30 June 2003 (edited by Oleg Bartunov, 2 Aug 2003).
</p><p>
<br>Massive update for 8.2 release by Oleg Bartunov, October 2006
</p>
<p>
This Reference documents the user types and functions
of the tsearch2 module for PostgreSQL.
An introduction to the module is provided
by the <a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html">tsearch2 Guide</a>,
a companion document to this one.
</p>
<h2>Table of Contents</h2>
<blockquote>
<a href="#vq">Vectors and Queries</a><br>
<a href="#vqo">Vector Operations</a><br>
<a href="#qo">Query Operations</a><br>
<a href="#fts">Full Text Search Operator</a><br>
<a href="#configurations">Configurations</a><br>
<a href="#testing">Testing</a><br>
<a href="#parsers">Parsers</a><br>
<a href="#dictionaries">Dictionaries</a><br>
<a href="#ranking">Ranking</a><br>
<a href="#headlines">Headlines</a><br>
<a href="#indexes">Indexes</a><br>
<a href="#tz">Thesaurus dictionary</a><br>
</blockquote>
<h2><a name="vq">Vectors and Queries</a></h2>
Vectors and queries both store lexemes,
but for different purposes.
A <tt>tsvector</tt> stores the lexemes
of the words that are parsed out of a document,
and can also remember the position of each word.
A <tt>tsquery</tt> specifies a boolean condition among lexemes.
<p>
Any of the following functions with a <tt><i>configuration</i></tt> argument
can use either an integer <tt>id</tt> or textual <tt>ts_name</tt>
to select a configuration;
if the option is omitted, then the current configuration is used.
For more information on the current configuration,
read the next section on Configurations.
</p>
<h3><a name="vqo">Vector Operations</a></h3>
<dl><dt>
<tt>to_tsvector( <em>[</em><i>configuration</i>,<em>]</em>
<i>document</i> TEXT) RETURNS TSVECTOR</tt>
</dt><dd>
Parses a document into tokens,
reduces the tokens to lexemes,
and returns a <tt>tsvector</tt> which lists the lexemes
together with their positions in the document.
For the best description of this process,
see the section on <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#ps">Parsing and Stemming</a>
in the accompanying tsearch2 Guide.
</dd><dt>
<tt>strip(<i>vector</i> TSVECTOR) RETURNS TSVECTOR</tt>
</dt><dd>
Return a vector which lists the same lexemes
as the given <tt><i>vector</i></tt>,
@ -56,10 +77,10 @@ read the next section on Configurations.
While the returned vector is thus useless for relevance ranking,
it will usually be much smaller.
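For example (a sketch with a hand-written vector; the position information is discarded):
<pre>
=# select strip('fat:2,4 cat:3'::tsvector);
    strip
-------------
 'cat' 'fat'
</pre>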
</dd><dt>
<tt>setweight(<i>vector</i> TSVECTOR, <i>letter</i>) RETURNS TSVECTOR</tt>
</dt><dd>
This function returns a copy of the input vector
in which every location has been labeled
with either the <tt><i>letter</i></tt>
<tt>'A'</tt>, <tt>'B'</tt>, or <tt>'C'</tt>,
or the default label <tt>'D'</tt>
@ -68,11 +89,11 @@ read the next section on Configurations.
These labels are retained when vectors are concatenated,
allowing words from different parts of a document
to be weighted differently by ranking functions.
</dd>
<dt>
<tt><i>vector1</i> || <i>vector2</i></tt><BR>
<tt>concat(<i>vector1</i> TSVECTOR, <i>vector2</i> TSVECTOR)
RETURNS TSVECTOR</tt>
</dt><dd>
Returns a vector which combines the lexemes and position information
in the two vectors given as arguments.
@ -95,27 +116,81 @@ read the next section on Configurations.
to the <tt>rank()</tt> function
that assigns different weights to positions with different labels.
</dd><dt>
<tt>length(<i>vector</i> TSVECTOR) RETURNS INT4</tt>
</dt><dd>
Returns the number of lexemes stored in the vector.
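For example:
<pre>
=# select length('fat:2,4 cat:3 rat:5'::tsvector);
 length
--------
      3
</pre>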
</dd><dt>
<tt><i>text</i>::TSVECTOR RETURNS TSVECTOR</tt>
</dt><dd>
Directly casting text to a <tt>tsvector</tt>
allows you to directly inject lexemes into a vector,
with whatever positions and position weights you choose to specify.
The <tt><i>text</i></tt> should be formatted
like the vector would be printed by the output of a <tt>SELECT</tt>.
See the <a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#casting">Casting</a>
section in the Guide for details.
</dd><dt>
<tt>tsearch2(<i>vector_column_name</i>[, (<i>my_filter_name</i> | <i>text_column_name1</i>) [...] ], <i>text_column_nameN</i>)</tt>
</dt><dd>
The <tt>tsearch2()</tt> trigger is used to automatically update <i>vector_column_name</i>; <i>my_filter_name</i>
is the name of a function used to preprocess <i>text_column_name</i>. There can be many
functions and text columns specified in the <tt>tsearch2()</tt> trigger.
The following rule is used:
a function is applied to all subsequent text columns until the next function occurs.
For example, the function <tt>dropatsymbol</tt> replaces all occurrences of the <tt>@</tt>
sign with a space.
<pre>
CREATE FUNCTION dropatsymbol(text) RETURNS text
AS 'select replace($1, ''@'', '' '');'
LANGUAGE SQL;
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT
ON tblMessages FOR EACH ROW EXECUTE PROCEDURE
tsearch2(tsvector_column,dropatsymbol, strMessage);
</pre>
</dd>
<dt>
<tt>stat(<i>sqlquery</i> text [, <i>weight</i> text]) RETURNS SETOF statinfo</tt>
</dt><dd>
Here <tt>statinfo</tt> is a type, defined as
<tt>
CREATE TYPE statinfo as (<i>word</i> text, <i>ndoc</i> int4, <i>nentry</i> int4)
</tt> and <i>sqlquery</i> is a query returning a <tt>tsvector</tt> column.
<P>
This returns statistics (the number of documents <i>ndoc</i> and the total number <i>nentry</i> of occurrences of <i>word</i>
in the collection) about the <tt>tsvector</tt> column.
It is useful for checking how good your configuration is and
for finding stop-word candidates. For example, to find the top 10 most frequent words:
<pre>
=# select * from stat('select vector from apod') order by ndoc desc, nentry desc,word limit 10;
</pre>
Optionally, one can specify <i>weight</i> to obtain statistics about words with specific weight.
<pre>
=# select * from stat('select vector from apod','a') order by ndoc desc, nentry desc,word limit 10;
</pre>
</dd>
<dt>
<tt>TSVECTOR &lt; TSVECTOR</tt><BR>
<tt>TSVECTOR &lt;= TSVECTOR</tt><BR>
<tt>TSVECTOR = TSVECTOR</tt><BR>
<tt>TSVECTOR >= TSVECTOR</tt><BR>
<tt>TSVECTOR > TSVECTOR</tt>
</dt><dd>
All btree operations are defined for the <tt>tsvector</tt> type. <tt>tsvectors</tt> are compared
with each other using lexicographical order.
</dd>
</dl>
<h3><a name="qo">Query Operations</a></h3>
<dl>
<dt>
<tt>to_tsquery( <em>[</em><i>configuration</i>,<em>]</em>
<i>querytext</i> text) RETURNS TSQUERY</tt>
</dt>
<dd>
Parses a query,
which should be single words separated by the boolean operators
"<tt>&amp;</tt>"&nbsp;and,
@ -123,14 +198,27 @@ read the next section on Configurations.
and&nbsp;"<tt>!</tt>"&nbsp;not,
which can be grouped using parenthesis.
Each word is reduced to a lexeme using the current
or specified configuration.
Weight class can be assigned to each lexeme entry
to restrict search region
(see <tt>setweight</tt> for explanation), for example
"<tt>fat:a &amp; rats</tt>".
</dd><dt>
<tt>plainto_tsquery( <em>[</em><i>configuration</i>,<em>]</em>
<i>querytext</i> text) RETURNS TSQUERY</tt>
</dt>
<dd>
Transforms unformatted text to tsquery. It is the same as <tt>to_tsquery</tt>,
but assumes the "<tt>&amp;</tt>" boolean operator between words and doesn't
recognize weight classes.
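For example (a sketch using the <i>default</i> configuration; stop words are dropped and the remaining words are AND-ed):
<pre>
=# select plainto_tsquery('default', 'The Fat Rats');
 plainto_tsquery
-----------------
 'fat' & 'rat'
</pre>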
</dd><dt>
<tt>querytree(<i>query</i> TSQUERY) RETURNS text</tt>
</dt><dd>
This returns the part of the query which is actually used in searching in a GiST index.
</dd><dt>
<tt><i>text</i>::TSQUERY RETURNS TSQUERY</tt>
</dt><dd>
Directly casting text to a <tt>tsquery</tt>
allows you to directly inject lexemes into a query,
@ -139,7 +227,117 @@ read the next section on Configurations.
like the query would be printed by the output of a <tt>SELECT</tt>.
See the <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#casting">Casting</a>
section in the Guide for details.
</dd>
<dt>
<tt>numnode(<i>query</i> TSQUERY) RETURNS INTEGER</tt>
</dt><dd>
This returns the number of nodes in the query tree.
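For example (leaf lexemes and boolean operators are both counted as nodes):
<pre>
=# select numnode('(fat & rat) | cat'::tsquery);
 numnode
---------
       5
</pre>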
</dd><dt>
<tt>TSQUERY &amp;&amp; TSQUERY RETURNS TSQUERY</tt>
</dt><dd>
AND-ed TSQUERY
</dd><dt>
<tt>TSQUERY || TSQUERY RETURNS TSQUERY</tt>
</dt> <dd>
OR-ed TSQUERY
</dd><dt>
<tt>!! TSQUERY RETURNS TSQUERY</tt>
</dt> <dd>
negation of TSQUERY
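These operators can be used to combine queries programmatically; a small sketch:
<pre>
=# select 'fat | rat'::tsquery && 'cat'::tsquery;
=# select !! 'fat'::tsquery;
</pre>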
</dd>
<dt>
<tt>TSQUERY &lt; TSQUERY</tt><BR>
<tt>TSQUERY &lt;= TSQUERY</tt><BR>
<tt>TSQUERY = TSQUERY</tt><BR>
<tt>TSQUERY >= TSQUERY</tt><BR>
<tt>TSQUERY > TSQUERY</tt>
</dt><dd>
All btree operations are defined for the <tt>tsquery</tt> type. <tt>tsqueries</tt> are compared
with each other using lexicographical order.
</dd>
</dl>
<h3>Query rewriting</h3>
Query rewriting is a set of functions and operators for the tsquery type.
It allows you to control search at query time without reindexing (in contrast to the thesaurus dictionary); for example,
you can expand the search using synonyms (new york, big apple, nyc, gotham).
<P>
The <tt><b>rewrite()</b></tt> function changes the original <i>query</i> by replacing <i>target</i> with <i>sample</i>.
There are three ways to use the <tt>rewrite()</tt> function. Note that the arguments of the <tt>rewrite()</tt>
function can be column names of type <tt>tsquery</tt>.
<pre>
create table rw (q TSQUERY, t TSQUERY, s TSQUERY);
insert into rw values('a & b','a', 'c');
</pre>
<dl>
<dt> <tt>rewrite (<i>query</i> TSQUERY, <i>target</i> TSQUERY, <i>sample</i> TSQUERY) RETURNS TSQUERY</tt>
</dt>
<dd>
<pre>
=# select rewrite('a & b'::TSQUERY, 'a'::TSQUERY, 'c'::TSQUERY);
rewrite
-----------
'c' & 'b'
</pre>
</dd>
<dt> <tt>rewrite (ARRAY[<i>query</i> TSQUERY, <i>target</i> TSQUERY, <i>sample</i> TSQUERY]) RETURNS TSQUERY</tt>
</dt>
<dd>
<pre>
=# select rewrite(ARRAY['a & b'::TSQUERY, t,s]) from rw;
rewrite
-----------
'c' & 'b'
</pre>
</dd>
<dt> <tt>rewrite (<i>query</i> TSQUERY,'select <i>target</i> ,<i>sample</i> from test'::text) RETURNS TSQUERY</tt>
</dt>
<dd>
<pre>
=# select rewrite('a & b'::TSQUERY, 'select t,s from rw'::text);
rewrite
-----------
'c' & 'b'
</pre>
</dd>
</dl>
Two operators are defined for the <tt>tsquery</tt> type:
<dl>
<dt><tt>TSQUERY @ TSQUERY</tt></dt>
<dd>
Returns <tt>TRUE</tt> if the right argument might be contained in the left argument.
</dd>
<dt><tt>TSQUERY ~ TSQUERY</tt></dt>
<dd>
Returns <tt>TRUE</tt> if the left argument might be contained in the right argument.
</dd>
</dl>
To speed up these operators one can use a GiST index with the <tt>gist_tp_tsquery_ops</tt> opclass.
<pre>
create index qq on test_tsquery using gist (keyword gist_tp_tsquery_ops);
</pre>
<h2><a name="fts">Full Text Search operator</a></h2>
<dl><dt>
<tt>TSQUERY @@ TSVECTOR</tt><br>
<tt>TSVECTOR @@ TSQUERY</tt>
</dt>
<dd>
Returns <tt>TRUE</tt> if the <tt>TSQUERY</tt> is contained in the <tt>TSVECTOR</tt> and
<tt>FALSE</tt> otherwise.
<pre>
=# select 'cat & rat':: tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
?column?
----------
t
=# select 'fat & cow':: tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
?column?
----------
f
</pre>
</dd>
</dl>
<h2><a name="configurations">Configurations</a></h2>
@ -147,7 +345,7 @@ A configuration specifies all of the equipment necessary
to transform a document into a <tt>tsvector</tt>:
the parser that breaks its text into tokens,
and the dictionaries which then transform each token into a lexeme.
Every call to <tt>to_tsvector()</tt> or <tt>to_tsquery()</tt> (described above)
uses a configuration to perform its processing.
Three configurations come with tsearch2:
@ -157,7 +355,10 @@ Three configurations come with tsearch2:
and the <i>simple</i> dictionary for all others.
</li><li><b>default_russian</b> -- Indexes words and numbers,
using the <i>en_stem</i> English Snowball stemmer for Latin-alphabet words
and the <i>ru_stem</i> Russian Snowball dictionary for all others. It's the default
for the <tt>ru_RU.KOI8-R</tt> locale.
</li><li><b>utf8_russian</b> -- the same as <b>default_russian</b> but
for <tt>ru_RU.UTF-8</tt> locale.
</li><li><b>simple</b> -- Processes both words and numbers
with the <i>simple</i> dictionary,
which neither discards any stop words nor alters them.
@ -239,7 +440,8 @@ Here:
</li><li>description - human readable name of tok_type
</li><li>token - parser's token
</li><li>dict_name - dictionary used for the token
</li><li>tsvector - final result</li>
</ul>
<h2><a name="parsers">Parsers</a></h2>
@ -300,20 +502,40 @@ the current parser is used when this argument is omitted.
<h2><a name="dictionaries">Dictionaries</a></h2>
A dictionary is a program which accepts lexeme(s) on input,
usually those produced by a parser,
and returns:
<ul>
<li>an array of lexeme(s), if the input lexeme is known to the dictionary
<li>an empty array, if the dictionary knows the lexeme, but it is a stop word
<li>NULL, if the dictionary doesn't recognize the input lexeme
</ul>
Usually, dictionaries are used for normalization of words (ispell, stemmer dictionaries),
but see, for example, the <tt>intdict</tt> dictionary (available from the
<a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">Tsearch2</a> home page),
which controls indexing of integers.
<P>
Among the dictionaries which come installed with tsearch2 are:
<ul>
<li><b>simple</b> simply folds uppercase letters to lowercase
before returning the word.
</li>
<li><b>ispell_template</b> - template for ispell dictionaries.
</li>
<li><b>en_stem</b> runs an English Snowball stemmer on each word
that attempts to reduce the various forms of a verb or noun
to a single recognizable form.
</li><li><b>ru_stem_koi8</b>, <b>ru_stem_utf8</b> runs a Russian Snowball stemmer on each word.
</li>
<li><b>synonym</b> - simple lexeme-to-lexeme replacement
</li>
<li><b>thesaurus_template</b> - template for the <a href="#tz">thesaurus dictionary</a>. It performs
phrase-to-phrase replacement
</li>
</ul>
<P>
Each dictionary is defined by an entry in the <tt>pg_ts_dict</tt> table:
<pre>CREATE TABLE pg_ts_dict (
@ -332,6 +554,12 @@ it specifies a file from which stop words should be read.
The <tt>dict_comment</tt> is a human-readable description of the dictionary.
The other fields are internal function identifiers
useful only to developers trying to implement their own dictionaries.
<blockquote>
<b>WARNING:</b> Data files used by dictionaries should be in the <tt>server_encoding</tt> to
avoid possible problems!
</blockquote>
<p>
The argument named <tt><i>dictionary</i></tt>
in each of the following functions
@ -355,6 +583,27 @@ if omitted then the current dictionary is used.
from which an inflected form could arise.
</dd></dl>
<h3>Using dictionaries template</h3>
Templates are used to define new dictionaries; for example:
<pre>
INSERT INTO pg_ts_dict
(SELECT 'en_ispell', dict_init,
'DictFile="/usr/local/share/dicts/ispell/english.dict",'
'AffFile="/usr/local/share/dicts/ispell/english.aff",'
'StopFile="/usr/local/share/dicts/english.stop"',
dict_lexize
FROM pg_ts_dict
WHERE dict_name = 'ispell_template');
</pre>
<h3>Working with stop words</h3>
Ispell and Snowball stemmers treat stop words differently:
<ul>
<li>ispell - normalizes the word and then looks up the normalized form in the stop-word file
<li>Snowball stemmer - first looks up the word in the stop-word file and then does its job.
The reason is to minimize possible 'noise'.
</ul>
<h2><a name="ranking">Ranking</a></h2>
Ranking attempts to measure how relevant documents are to particular queries
@ -364,26 +613,18 @@ Note that this information is only available in unstripped vectors --
ranking functions will only return a useful result
for a <tt>tsvector</tt> which still has position information!
<p>
Note that the supplied ranking functions are just examples and
don't belong to the tsearch2 core; you can
write your own ranking function and/or combine additional
factors to fit your specific interest.
</p>
The two ranking functions currently available are:
<dl><dt>
<tt>CREATE FUNCTION rank(<br>
<em>[</em> <i>weights</i> float4[], <em>]</em>
<i>vector</i> tsvector, <i>query</i> tsquery,
<i>vector</i> TSVECTOR, <i>query</i> TSQUERY,
<em>[</em> <i>normalization</i> int4 <em>]</em><br>
) RETURNS float4</tt>
</dt><dd>
@ -399,8 +640,8 @@ The two ranking functions currently available are:
and make them more or less important than words in the document body.
</dd><dt>
<tt>CREATE FUNCTION rank_cd(<br>
<em>[</em> <i>weights</i> float4[], <em>]</em>
<i>vector</i> TSVECTOR, <i>query</i> TSQUERY,
<em>[</em> <i>normalization</i> int4 <em>]</em><br>
) RETURNS float4</tt>
</dt><dd>
@ -409,20 +650,51 @@ The two ranking functions currently available are:
as described in Clarke, Cormack, and Tudhope's
"<a href="http://citeseer.nj.nec.com/clarke00relevance.html">Relevance Ranking for One to Three Term Queries</a>"
in the 1999 <i>Information Processing and Management</i>.
</dd>
<dt>
<tt>CREATE FUNCTION get_covers(vector TSVECTOR, query TSQUERY) RETURNS text</tt>
</dt>
<dd>
Returns <tt>extents</tt>, which are the shortest non-nested sequences of words satisfying the query.
Extents (covers) are used in the <tt>rank_cd</tt> algorithm for fast calculation of proximity ranking.
In the example below there are two extents - <tt><b>{1</b>...<b>}1</b> and <b>{2</b> ...<b>}2</b></tt>.
<pre>
=# select get_covers('1:1,2,10 2:4'::tsvector,'1&amp; 2');
get_covers
----------------------
1 {1 1 {2 2 }1 1 }2
</pre>
</dd>
</dl>
<p>
Both of these ranking functions (<tt>rank()</tt>, <tt>rank_cd()</tt>)
take an integer <i>normalization</i> option
that specifies whether a document's length should impact its rank.
This is often desirable,
since a hundred-word document with five instances of a search word
is probably more relevant than a thousand-word document with five instances.
The option can take the following values, which can be combined using "|" (e.g., 2|4) to
take several factors into account:
</p>
<ul>
<li><tt>0</tt> (the default) ignores document length.</li>
<li><tt>1</tt> divides the rank by 1 + the logarithm of the length.</li>
<li><tt>2</tt> divides the rank by the length itself.</li>
<li><tt>4</tt> divides the rank by the mean harmonic distance between extents.</li>
<li><tt>8</tt> divides the rank by the number of unique words in the document.</li>
<li><tt>16</tt> divides the rank by 1 + the logarithm of the number of unique words in the document.
</li>
</ul>
<h2><a name="headlines">Headlines</a></h2>
<dl><dt>
<tt>CREATE FUNCTION headline(<br>
<em>[</em> <i>id</i> int4, <em>|</em> <i>ts_name</i> text, <em>]</em>
<i>document</i> text, <i>query</i> TSQUERY,
<em>[</em> <i>options</i> text <em>]</em><br>
) RETURNS text</tt>
</dt><dd>
@ -448,10 +720,123 @@ The two ranking functions currently available are:
with a word which has this many characters or less.
The default value of <tt>3</tt> should eliminate most English
conjunctions and articles.
</li><li><tt>HighlightAll</tt> --
a boolean flag; if TRUE, the whole document will be highlighted.
</li></ul>
Any unspecified options receive these defaults:
<pre>StartSel=&lt;b&gt;, StopSel=&lt;/b&gt;, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE
</pre>
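For example (a sketch; the options string and sample text are illustrative):
<pre>
=# select headline('default',
       'a fat cat sat on a mat and ate a fat rat',
       to_tsquery('default', 'fat & rat'),
       'StartSel=&lt;i&gt;, StopSel=&lt;/i&gt;, MaxWords=10, MinWords=5');
</pre>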
</dd></dl>
<h2><a name="indexes">Indexes</a></h2>
Tsearch2 supports indexed access to <tt>tsvector</tt> in order to further speed up FTS. Notice, indexes are not mandatory for FTS!
<ul>
<li> RD-Tree (Russian Doll Tree, matryoshka), based on GiST (Generalized Search Tree)
<pre>
=# create index fts_idx on apod using gist(fts);
</pre>
<li>GIN - Generalized Inverted Index
<pre>
=# create index fts_idx on apod using gin(fts);
</pre>
</ul>
The <b>GiST</b> index is very good for online updates, but is not as scalable as the <b>GIN</b> index,
which, in turn, isn't as good for updates. Both indexes support concurrency and recovery.
<h2><a name="tz">Thesaurus dictionary</a></h2>
<P>
Thesaurus is a collection of words with information about the relationships of words and phrases,
i.e., broader terms (BT), narrower terms (NT), preferred terms, non-preferred terms, related terms, etc.</p>
<p>Basically, a thesaurus dictionary replaces all non-preferred terms by one preferred term and, optionally,
preserves them for indexing. The thesaurus is used during indexing, so any change in the thesaurus requires reindexing.
Tsearch2's <tt>thesaurus</tt> dictionary (TZ) is an extension of the <tt>synonym</tt> dictionary
with <b>phrase</b> support. A thesaurus is a plain file of the following format:
<pre>
# this is a comment
sample word(s) : indexed word(s)
...............................
</pre>
<ul>
<li>The <strong>colon</strong> (:) symbol is used as a delimiter.</li>
<li>Use an asterisk (<b>*</b>) at the beginning of an <tt>indexed word</tt> to skip the subdictionary.
It is still required that the <tt>sample words</tt> be known.</li>
<li>The thesaurus dictionary looks for the longest match.</li></ul>
<P>
TZ uses a <strong>subdictionary</strong> (which should be defined in the tsearch2 configuration)
to normalize the thesaurus text. It is possible to define only <strong>one dictionary</strong>.
Notice that the subdictionary produces an error if it cannot recognize a word;
in that case, you should remove the definition line with this word or teach the subdictionary to recognize it.
</p>
<p>Stop words recognized by the subdictionary are replaced by a 'stop-word placeholder'; i.e.,
only their position is important.
To break possible ties the thesaurus applies the last definition. For example, consider
thesaurus rules (with a simple subdictionary) matching the pattern 'swsw'
('s' designates a stop word and 'w' a known word): </p>
<pre>
a one the two : swsw
the one a two : swsw2
</pre>
<p>Words 'a' and 'the' are stop words defined in the configuration of the subdictionary.
The thesaurus considers the texts 'the one the two' and 'that one then two' to be equal and will use the definition
'swsw2'.</p>
<p>Like a normal dictionary, TZ should be assigned to specific lexeme types.
Since TZ has the capability to recognize phrases, it must remember its state and interact with the parser.
TZ uses these assignments to check whether it should handle the next word or stop accumulation.
The compiler of a TZ configuration should take care to avoid confusion.
For example, if TZ is assigned to handle only the <tt>lword</tt> lexeme, then a TZ definition like
' one 1:11' will not work, since lexeme type <tt>digit</tt> isn't assigned to the TZ.</p>
<h3>Configuration</h3>
<p>tsearch2 comes with a thesaurus template, which can be used to define a new dictionary: </p>
<pre class="real">INSERT INTO pg_ts_dict
(SELECT 'tz_simple', dict_init,
'DictFile="/path/to/tz_simple.txt",'
'Dictionary="en_stem"',
dict_lexize
FROM pg_ts_dict
WHERE dict_name = 'thesaurus_template');
</pre>
<p>Here: </p>
<ul>
<li><tt>tz_simple</tt> - is the dictionary name</li>
<li><tt>DictFile="/path/to/tz_simple.txt"</tt> - is the location of thesaurus file</li>
<li><tt>Dictionary="en_stem"</tt> defines the dictionary (Snowball English stemmer) to use for thesaurus normalization. Note that the <em>en_stem</em> dictionary has its own configuration (stop words, for example).</li>
</ul>
<p>Now, it's possible to use <tt>tz_simple</tt> in pg_ts_cfgmap, for example: </p>
<pre>
update pg_ts_cfgmap set dict_name='{tz_simple,en_stem}' where ts_name = 'default_russian' and
tok_alias in ('lhword', 'lword', 'lpart_hword');
</pre>
<h3>Examples</h3>
<p>tz_simple: </p>
<pre>
one : 1
two : 2
one two : 12
the one : 1
one 1 : 11
</pre>
<p>To see how the thesaurus works, one can use the <tt>to_tsvector</tt>, <tt>to_tsquery</tt> or <tt>plainto_tsquery</tt> functions: </p><pre class="real">=# select plainto_tsquery('default_russian',' one day is oneday');
plainto_tsquery
------------------------
'1' &amp; 'day' &amp; 'oneday'
=# select plainto_tsquery('default_russian','one two day is oneday');
plainto_tsquery
-------------------------
'12' &amp; 'day' &amp; 'oneday'
=# select plainto_tsquery('default_russian','the one');
NOTICE: Thesaurus: word 'the' is recognized as stop-word, assign any stop-word (rule 3)
plainto_tsquery
-----------------
'1'
</pre>
Additional information about the thesaurus dictionary is available from the
<a href="http://www.sai.msu.su/~megera/wiki/Thesaurus_dictionary">Wiki</a> page.
</body></html>