mirror of
https://git.postgresql.org/git/postgresql.git
synced 2025-01-24 18:55:04 +08:00
3687 lines
129 KiB
Plaintext
<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.53 2009/08/14 14:53:20 teodor Exp $ -->
|
|
|
|
<chapter id="textsearch">
|
|
<title id="textsearch-title">Full Text Search</title>
|
|
|
|
<indexterm zone="textsearch">
|
|
<primary>full text search</primary>
|
|
</indexterm>
|
|
|
|
<indexterm zone="textsearch">
|
|
<primary>text search</primary>
|
|
</indexterm>
|
|
|
|
<sect1 id="textsearch-intro">
|
|
<title>Introduction</title>
|
|
|
|
<para>
|
|
Full Text Searching (or just <firstterm>text search</firstterm>) provides
|
|
the capability to identify natural-language <firstterm>documents</> that
|
|
satisfy a <firstterm>query</firstterm>, and optionally to sort them by
|
|
relevance to the query. The most common type of search
|
|
is to find all documents containing given <firstterm>query terms</firstterm>
|
|
and return them in order of their <firstterm>similarity</firstterm> to the
|
|
query. The notions of <varname>query</varname> and
|
|
<varname>similarity</varname> are very flexible and depend on the specific
|
|
application. The simplest search considers <varname>query</varname> as a
|
|
set of words and <varname>similarity</varname> as the frequency of query
|
|
words in the document.
|
|
</para>
|
|
|
|
<para>
|
|
Textual search operators have existed in databases for years.
|
|
<productname>PostgreSQL</productname> has
|
|
<literal>~</literal>, <literal>~*</literal>, <literal>LIKE</literal>, and
|
|
<literal>ILIKE</literal> operators for textual data types, but they lack
|
|
many essential properties required by modern information systems:
|
|
</para>
|
|
|
|
<itemizedlist spacing="compact" mark="bullet">
|
|
<listitem>
|
|
<para>
|
|
There is no linguistic support, even for English. Regular expressions
|
|
are not sufficient because they cannot easily handle derived words, e.g.,
|
|
<literal>satisfies</literal> and <literal>satisfy</literal>. You might
|
|
miss documents that contain <literal>satisfies</literal>, although you
|
|
probably would like to find them when searching for
|
|
<literal>satisfy</literal>. It is possible to use <literal>OR</literal>
|
|
to search for multiple derived forms, but this is tedious and error-prone
|
|
(some words can have several thousand derivatives).
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
They provide no ordering (ranking) of search results, which makes them
|
|
ineffective when thousands of matching documents are found.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
They tend to be slow because there is no index support, so they must
|
|
process all documents for every search.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
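<para>
As a rough illustration of the first limitation, the following sketch uses
constant strings rather than a real table. A substring pattern matches only
the literal form that was typed, so derived forms have to be enumerated by hand:

<programlisting>
-- 'satisfy' is not a substring of 'satisfies', so this comparison is false.
SELECT 'This plan satisfies the requirements' ILIKE '%satisfy%';

-- Covering derived forms means listing each one and combining them with OR.
SELECT doc ILIKE '%satisfy%'
    OR doc ILIKE '%satisfies%'
    OR doc ILIKE '%satisfied%'
FROM (VALUES ('This plan satisfies the requirements')) AS t(doc);
</programlisting>
</para>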
|
|
|
|
<para>
|
|
Full text indexing allows documents to be <emphasis>preprocessed</emphasis>
|
|
and an index saved for later rapid searching. Preprocessing includes:
|
|
</para>
|
|
|
|
<itemizedlist mark="none">
|
|
<listitem>
|
|
<para>
|
|
<emphasis>Parsing documents into <firstterm>tokens</></emphasis>. It is
|
|
useful to identify various classes of tokens, e.g., numbers, words,
|
|
complex words, email addresses, so that they can be processed
|
|
differently. In principle, token classes depend on the specific
|
|
application, but for most purposes it is adequate to use a predefined
|
|
set of classes.
|
|
<productname>PostgreSQL</productname> uses a <firstterm>parser</> to
|
|
perform this step. A standard parser is provided, and custom parsers
|
|
can be created for specific needs.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
<emphasis>Converting tokens into <firstterm>lexemes</></emphasis>.
|
|
A lexeme is a string, just like a token, but it has been
|
|
<firstterm>normalized</> so that different forms of the same word
|
|
are made alike. For example, normalization almost always includes
|
|
folding upper-case letters to lower-case, and often involves removal
|
|
of suffixes (such as <literal>s</> or <literal>es</> in English).
|
|
This allows searches to find variant forms of the
|
|
same word, without tediously entering all the possible variants.
|
|
Also, this step typically eliminates <firstterm>stop words</>, which
|
|
are words that are so common that they are useless for searching.
|
|
(In short, then, tokens are raw fragments of the document text, while
|
|
lexemes are words that are believed useful for indexing and searching.)
|
|
<productname>PostgreSQL</productname> uses <firstterm>dictionaries</> to
|
|
perform this step. Various standard dictionaries are provided, and
|
|
custom ones can be created for specific needs.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
<emphasis>Storing preprocessed documents optimized for
|
|
searching</emphasis>. For example, each document can be represented
|
|
as a sorted array of normalized lexemes. Along with the lexemes it is
|
|
often desirable to store positional information to use for
|
|
<firstterm>proximity ranking</firstterm>, so that a document that
|
|
contains a more <quote>dense</> region of query words is
|
|
assigned a higher rank than one with scattered query words.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
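<para>
The preprocessing steps listed above can be observed with the
<function>ts_debug</function> function. This is only a sketch: the input string
is arbitrary, and the exact lexemes depend on the configuration in use
(<literal>english</literal> is assumed here):

<programlisting>
-- Show each token, its class (alias), and the lexemes produced for it, if any.
SELECT alias, token, lexemes
FROM ts_debug('english', 'The Fat Rats');

-- Show the stored form: normalized lexemes together with their positions.
SELECT to_tsvector('english', 'The Fat Rats');
</programlisting>

With the <literal>english</literal> configuration one would expect
<literal>The</literal> to be dropped as a stop word and
<literal>Rats</literal> to be reduced to <literal>rat</literal>.
</para>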
|
|
|
|
<para>
|
|
Dictionaries allow fine-grained control over how tokens are normalized.
|
|
With appropriate dictionaries, you can:
|
|
</para>
|
|
|
|
<itemizedlist spacing="compact" mark="bullet">
|
|
<listitem>
|
|
<para>
|
|
Define stop words that should not be indexed.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Map synonyms to a single word using <application>Ispell</>.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Map phrases to a single word using a thesaurus.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Map different variations of a word to a canonical form using
|
|
an <application>Ispell</> dictionary.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Map different variations of a word to a canonical form using
|
|
<application>Snowball</> stemmer rules.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
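<para>
A single dictionary can be tried in isolation with
<function>ts_lexize</function>. The sketch below assumes the built-in
<literal>english_stem</literal> Snowball dictionary is available, as it is in a
standard installation:

<programlisting>
-- A word the dictionary recognizes is returned in its normalized form.
SELECT ts_lexize('english_stem', 'stars');

-- A stop word is recognized but yields an empty array, so it is not indexed.
SELECT ts_lexize('english_stem', 'a');
</programlisting>
</para>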
|
|
|
|
<para>
|
|
A data type <type>tsvector</type> is provided for storing preprocessed
|
|
documents, along with a type <type>tsquery</type> for representing processed
|
|
queries (<xref linkend="datatype-textsearch">). There are many
|
|
functions and operators available for these data types
|
|
(<xref linkend="functions-textsearch">), the most important of which is
|
|
the match operator <literal>@@</literal>, which we introduce in
|
|
<xref linkend="textsearch-matching">. Full text searches can be accelerated
|
|
using indexes (<xref linkend="textsearch-indexes">).
|
|
</para>
|
|
|
|
|
|
<sect2 id="textsearch-document">
|
|
<title>What Is a Document?</title>
|
|
|
|
<indexterm zone="textsearch-document">
|
|
<primary>document</primary>
|
|
<secondary>text search</secondary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
A <firstterm>document</> is the unit of searching in a full text search
|
|
system; for example, a magazine article or email message. The text search
|
|
engine must be able to parse documents and store associations of lexemes
|
|
(key words) with their parent document. Later, these associations are
|
|
used to search for documents that contain query words.
|
|
</para>
|
|
|
|
<para>
|
|
For searches within <productname>PostgreSQL</productname>,
|
|
a document is normally a textual field within a row of a database table,
|
|
or possibly a combination (concatenation) of such fields, perhaps stored
|
|
in several tables or obtained dynamically. In other words, a document can
|
|
be constructed from different parts for indexing and it might not be
|
|
stored anywhere as a whole. For example:
|
|
|
|
<programlisting>
|
|
SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document
|
|
FROM messages
|
|
WHERE mid = 12;
|
|
|
|
SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document
|
|
FROM messages m, docs d
|
|
WHERE m.mid = d.did AND m.mid = 12;
|
|
</programlisting>
|
|
</para>
|
|
|
|
<note>
|
|
<para>
|
|
Actually, in these example queries, <function>coalesce</function>
|
|
should be used to prevent a single <literal>NULL</literal> attribute from
|
|
causing a <literal>NULL</literal> result for the whole document.
|
|
</para>
|
|
</note>
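<para>
For instance, the first query could be written with
<function>coalesce</function> as follows. This is a sketch that assumes every
column is of a textual type, so that an empty string is a suitable substitute
for <literal>NULL</literal>:

<programlisting>
-- Replace each NULL attribute with '' so that the concatenation never yields NULL.
SELECT coalesce(title, '')    || ' ' ||
       coalesce(author, '')   || ' ' ||
       coalesce(abstract, '') || ' ' ||
       coalesce(body, '') AS document
FROM messages
WHERE mid = 12;
</programlisting>
</para>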
|
|
|
|
<para>
|
|
Another possibility is to store the documents as simple text files in the
|
|
file system. In this case, the database can be used to store the full text
|
|
index and to execute searches, and some unique identifier can be used to
|
|
retrieve the document from the file system. However, retrieving files
|
|
from outside the database requires superuser permissions or special
|
|
function support, so this is usually less convenient than keeping all
|
|
the data inside <productname>PostgreSQL</productname>. Also, keeping
|
|
everything inside the database allows easy access
|
|
to document metadata to assist in indexing and display.
|
|
</para>
|
|
|
|
<para>
|
|
For text search purposes, each document must be reduced to the
|
|
preprocessed <type>tsvector</> format. Searching and ranking
|
|
are performed entirely on the <type>tsvector</> representation
|
|
of a document — the original text need only be retrieved
|
|
when the document has been selected for display to a user.
|
|
We therefore often speak of the <type>tsvector</> as being the
|
|
document, but of course it is only a compact representation of
|
|
the full document.
|
|
</para>
|
|
</sect2>
|
|
|
|
<sect2 id="textsearch-matching">
|
|
<title>Basic Text Matching</title>
|
|
|
|
<para>
|
|
Full text searching in <productname>PostgreSQL</productname> is based on
|
|
the match operator <literal>@@</literal>, which returns
|
|
<literal>true</literal> if a <type>tsvector</type>
|
|
(document) matches a <type>tsquery</type> (query).
|
|
It doesn't matter which data type is written first:
|
|
|
|
<programlisting>
|
|
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery;
|
|
?column?
|
|
----------
|
|
t
|
|
|
|
SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
|
|
?column?
|
|
----------
|
|
f
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
As the above example suggests, a <type>tsquery</type> is not just raw
|
|
text, any more than a <type>tsvector</type> is. A <type>tsquery</type>
|
|
contains search terms, which must be already-normalized lexemes, and
|
|
may combine multiple terms using AND, OR, and NOT operators.
|
|
(For details see <xref linkend="datatype-textsearch">.) There are
|
|
functions <function>to_tsquery</> and <function>plainto_tsquery</>
|
|
that are helpful in converting user-written text into a proper
|
|
<type>tsquery</type>, for example by normalizing words appearing in
|
|
the text. Similarly, <function>to_tsvector</> is used to parse and
|
|
normalize a document string. So in practice a text search match would
|
|
look more like this:
|
|
|
|
<programlisting>
|
|
SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');
|
|
?column?
|
|
----------
|
|
t
|
|
</programlisting>
|
|
|
|
Observe that this match would not succeed if written as
|
|
|
|
<programlisting>
|
|
SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat & rat');
|
|
?column?
|
|
----------
|
|
f
|
|
</programlisting>
|
|
|
|
since here no normalization of the word <literal>rats</> will occur.
|
|
The elements of a <type>tsvector</> are lexemes, which are assumed
|
|
already normalized, so <literal>rats</> does not match <literal>rat</>.
|
|
</para>
|
|
|
|
<para>
|
|
The <literal>@@</literal> operator also
|
|
supports <type>text</type> input, allowing explicit conversion of a text
|
|
string to <type>tsvector</type> or <type>tsquery</> to be skipped
|
|
in simple cases. The variants available are:
|
|
|
|
<programlisting>
|
|
tsvector @@ tsquery
|
|
tsquery @@ tsvector
|
|
text @@ tsquery
|
|
text @@ text
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
The first two of these we saw already.
|
|
If <literal>x</literal> is the left operand and <literal>y</literal> is the
right operand, the form <type>text</type> <literal>@@</literal> <type>tsquery</type>
is equivalent to <literal>to_tsvector(x) @@ y</literal>, and
the form <type>text</type> <literal>@@</literal> <type>text</type>
is equivalent to <literal>to_tsvector(x) @@ plainto_tsquery(y)</literal>.
|
|
</para>
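<para>
The equivalences for the <type>text</type> forms can be checked side by side.
In this sketch the explicit casts to <type>text</type> are only there to make
the operand types obvious, and the results depend on the current
<varname>default_text_search_config</varname>:

<programlisting>
-- text @@ tsquery, followed by its spelled-out equivalent
SELECT 'fat cats ate fat rats'::text @@ to_tsquery('fat & rat');
SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');

-- text @@ text, followed by its spelled-out equivalent
SELECT 'fat cats ate fat rats'::text @@ 'fat rat'::text;
SELECT to_tsvector('fat cats ate fat rats') @@ plainto_tsquery('fat rat');
</programlisting>
</para>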
|
|
</sect2>
|
|
|
|
<sect2 id="textsearch-intro-configurations">
|
|
<title>Configurations</title>
|
|
|
|
<para>
|
|
The above are all simple text search examples. As mentioned before, full
|
|
text search functionality includes the ability to do many more things:
|
|
skip indexing certain words (stop words), process synonyms, and use
|
|
sophisticated parsing, e.g., parse based on more than just white space.
|
|
This functionality is controlled by <firstterm>text search
|
|
configurations</>. <productname>PostgreSQL</> comes with predefined
|
|
configurations for many languages, and you can easily create your own
|
|
configurations. (<application>psql</>'s <command>\dF</> command
|
|
shows all available configurations.)
|
|
</para>
|
|
|
|
<para>
|
|
During installation an appropriate configuration is selected and
|
|
<xref linkend="guc-default-text-search-config"> is set accordingly
|
|
in <filename>postgresql.conf</>. If you are using the same text search
|
|
configuration for the entire cluster you can use the value in
|
|
<filename>postgresql.conf</>. To use different configurations
|
|
throughout the cluster but the same configuration within any one database,
|
|
use <command>ALTER DATABASE ... SET</>. Otherwise, you can set
|
|
<varname>default_text_search_config</varname> in each session.
|
|
</para>
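<para>
The per-database and per-session settings might look like this.
<literal>mydb</literal> is a placeholder database name, and
<literal>pg_catalog.english</literal> is just one possible choice of
configuration:

<programlisting>
-- Default for every new session that connects to one database.
ALTER DATABASE mydb SET default_text_search_config = 'pg_catalog.english';

-- Override for the current session only, then check the value in effect.
SET default_text_search_config = 'pg_catalog.english';
SHOW default_text_search_config;
</programlisting>
</para>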
|
|
|
|
<para>
|
|
Each text search function that depends on a configuration has an optional
|
|
<type>regconfig</> argument, so that the configuration to use can be
|
|
specified explicitly. <varname>default_text_search_config</varname>
|
|
is used only when this argument is omitted.
|
|
</para>
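<para>
As a sketch of the difference, the same string can be processed with two
built-in configurations and with the session default. The
<literal>simple</literal> configuration only folds case, so it would be expected
to keep <literal>the</literal> and <literal>rats</literal> as they are, while
<literal>english</literal> removes stop words and stems:

<programlisting>
SELECT to_tsvector('english', 'The Fat Rats');  -- explicit configuration
SELECT to_tsvector('simple',  'The Fat Rats');  -- a different explicit configuration
SELECT to_tsvector('The Fat Rats');             -- uses default_text_search_config
</programlisting>
</para>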
|
|
|
|
<para>
|
|
To make it easier to build custom text search configurations, a
|
|
configuration is built up from simpler database objects.
|
|
<productname>PostgreSQL</>'s text search facility provides
|
|
four types of configuration-related database objects:
|
|
</para>
|
|
|
|
<itemizedlist spacing="compact" mark="bullet">
|
|
<listitem>
|
|
<para>
|
|
<firstterm>Text search parsers</> break documents into tokens
|
|
and classify each token (for example, as words or numbers).
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
<firstterm>Text search dictionaries</> convert tokens to normalized
|
|
form and reject stop words.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
<firstterm>Text search templates</> provide the functions underlying
|
|
dictionaries. (A dictionary simply specifies a template and a set
|
|
of parameters for the template.)
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
<firstterm>Text search configurations</> select a parser and a set
|
|
of dictionaries to use to normalize the tokens produced by the parser.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>
|
|
Text search parsers and templates are built from low-level C functions;
|
|
therefore it requires C programming ability to develop new ones, and
|
|
superuser privileges to install one into a database. (There are examples
|
|
of add-on parsers and templates in the <filename>contrib/</> area of the
|
|
<productname>PostgreSQL</> distribution.) Since dictionaries and
|
|
configurations just parameterize and connect together some underlying
|
|
parsers and templates, no special privilege is needed to create a new
|
|
dictionary or configuration. Examples of creating custom dictionaries and
|
|
configurations appear later in this chapter.
|
|
</para>
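<para>
As a brief preview, the following sketch builds a dictionary from an existing
template and attaches it to a copy of a predefined configuration. The names
<literal>public.simple_dict</literal> and <literal>public.my_english</literal>
are made up for illustration:

<programlisting>
-- A dictionary is a template plus parameters: here the built-in simple
-- template combined with the english stop word list shipped with PostgreSQL.
CREATE TEXT SEARCH DICTIONARY public.simple_dict (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english
);

-- Start a new configuration as a copy of an existing one.
CREATE TEXT SEARCH CONFIGURATION public.my_english ( COPY = pg_catalog.english );

-- Send plain ASCII words to the new dictionary instead of the stemmer.
ALTER TEXT SEARCH CONFIGURATION public.my_english
    ALTER MAPPING FOR asciiword WITH public.simple_dict;
</programlisting>

Note that this particular mapping turns off stemming for ASCII words in
<literal>my_english</literal>. It is meant only to show how the pieces connect.
</para>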
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="textsearch-tables">
|
|
<title>Tables and Indexes</title>
|
|
|
|
<para>
|
|
The examples in the previous section illustrated full text matching using
|
|
simple constant strings. This section shows how to search table data,
|
|
optionally using indexes.
|
|
</para>
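<para>
The examples in this section refer to a table named <literal>pgweb</literal>
whose definition is not shown. To try them, a minimal stand-in such as the
following could be used. The column types, the <literal>id</literal> column,
and the sample row are assumptions, not part of the original examples:

<programlisting>
-- Hypothetical definition containing only the columns the examples reference.
CREATE TABLE pgweb (
    id            serial PRIMARY KEY,
    title         text,
    body          text,
    last_mod_date timestamp
);

-- One sample row so that the queries have something to match.
INSERT INTO pgweb (title, body, last_mod_date)
VALUES ('Making friends', 'A friendly guide to CREATE TABLE', now());
</programlisting>
</para>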
|
|
|
|
<sect2 id="textsearch-tables-search">
|
|
<title>Searching a Table</title>
|
|
|
|
<para>
|
|
It is possible to do a full text search without an index. A simple query
|
|
to print the <structfield>title</> of each row that contains the word
|
|
<literal>friend</> in its <structfield>body</> field is:
|
|
|
|
<programlisting>
|
|
SELECT title
|
|
FROM pgweb
|
|
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');
|
|
</programlisting>
|
|
|
|
This will also find related words such as <literal>friends</>
|
|
and <literal>friendly</>, since all these are reduced to the same
|
|
normalized lexeme.
|
|
</para>
|
|
|
|
<para>
|
|
The query above specifies that the <literal>english</> configuration
|
|
is to be used to parse and normalize the strings. Alternatively we
|
|
could omit the configuration parameters:
|
|
|
|
<programlisting>
|
|
SELECT title
|
|
FROM pgweb
|
|
WHERE to_tsvector(body) @@ to_tsquery('friend');
|
|
</programlisting>
|
|
|
|
This query will use the configuration set by <xref
|
|
linkend="guc-default-text-search-config">.
|
|
</para>
|
|
|
|
<para>
|
|
A more complex example is to
|
|
select the ten most recent documents that contain <literal>create</> and
|
|
<literal>table</> in the <structfield>title</> or <structfield>body</>:
|
|
|
|
<programlisting>
|
|
SELECT title
|
|
FROM pgweb
|
|
WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table')
|
|
ORDER BY last_mod_date DESC
|
|
LIMIT 10;
|
|
</programlisting>
|
|
|
|
For clarity we omitted the <function>coalesce</function> function calls
|
|
which would be needed to find rows that contain <literal>NULL</literal>
|
|
in one of the two fields.
|
|
</para>
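<para>
With the <function>coalesce</function> calls written out, the same query could
look like the following sketch, which assumes both columns are of type
<type>text</type>:

<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector(coalesce(title, '') || ' ' || coalesce(body, ''))
      @@ to_tsquery('create & table')
ORDER BY last_mod_date DESC
LIMIT 10;
</programlisting>
</para>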
|
|
|
|
<para>
|
|
Although these queries will work without an index, most applications
|
|
will find this approach too slow, except perhaps for occasional ad-hoc
|
|
searches. Practical use of text searching usually requires creating
|
|
an index.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="textsearch-tables-index">
|
|
<title>Creating Indexes</title>
|
|
|
|
<para>
|
|
We can create a <acronym>GIN</acronym> index (<xref
|
|
linkend="textsearch-indexes">) to speed up text searches:
|
|
|
|
<programlisting>
|
|
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', body));
|
|
</programlisting>
|
|
|
|
Notice that the two-argument version of <function>to_tsvector</function> is
|
|
used. Only text search functions that specify a configuration name can
|
|
be used in expression indexes (<xref linkend="indexes-expressional">).
|
|
This is because the index contents must be unaffected by <xref
|
|
linkend="guc-default-text-search-config">. If they were affected, the
|
|
index contents might be inconsistent because different entries could
|
|
contain <type>tsvector</>s that were created with different text search
|
|
configurations, and there would be no way to guess which was which. It
|
|
would be impossible to dump and restore such an index correctly.
|
|
</para>
|
|
|
|
<para>
|
|
Because the two-argument version of <function>to_tsvector</function> was
|
|
used in the index above, only a query reference that uses the two-argument
|
|
version of <function>to_tsvector</function> with the same configuration
|
|
name will use that index. That is, <literal>WHERE
|
|
to_tsvector('english', body) @@ 'a & b'</> can use the index,
|
|
but <literal>WHERE to_tsvector(body) @@ 'a & b'</> cannot.
|
|
This ensures that an index will be used only with the same configuration
|
|
used to create the index entries.
|
|
</para>
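<para>
One way to check this is to compare the plans of the two forms with
<command>EXPLAIN</command>. This is a sketch only: no plan output is shown,
because the planner's choice also depends on table size and statistics, and it
may prefer a sequential scan on a small table even when the index is usable:

<programlisting>
-- Same expression as in the index definition: pgweb_idx is a candidate.
EXPLAIN SELECT title FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');

-- No configuration argument: the expression differs, so pgweb_idx cannot be used.
EXPLAIN SELECT title FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('friend');
</programlisting>
</para>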
|
|
|
|
<para>
|
|
It is possible to set up more complex expression indexes wherein the
|
|
configuration name is specified by another column, e.g.:
|
|
|
|
<programlisting>
|
|
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_name, body));
|
|
</programlisting>
|
|
|
|
where <literal>config_name</> is a column in the <literal>pgweb</>
|
|
table. This allows mixed configurations in the same index while
|
|
recording which configuration was used for each index entry. This
|
|
would be useful, for example, if the document collection contained
|
|
documents in different languages. Again,
|
|
queries that are meant to use the index must be phrased to match, e.g.,
|
|
<literal>WHERE to_tsvector(config_name, body) @@ 'a & b'</>.
|
|
</para>
|
|
|
|
<para>
|
|
Indexes can even concatenate columns:
|
|
|
|
<programlisting>
|
|
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', title || ' ' || body));
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
Another approach is to create a separate <type>tsvector</> column
|
|
to hold the output of <function>to_tsvector</>. This example is a
|
|
concatenation of <literal>title</literal> and <literal>body</literal>,
|
|
using <function>coalesce</> to ensure that one field will still be
|
|
indexed when the other is <literal>NULL</>:
|
|
|
|
<programlisting>
|
|
ALTER TABLE pgweb ADD COLUMN textsearchable_index_col tsvector;
|
|
UPDATE pgweb SET textsearchable_index_col =
|
|
to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''));
|
|
</programlisting>
|
|
|
|
Then we create a <acronym>GIN</acronym> index to speed up the search:
|
|
|
|
<programlisting>
|
|
CREATE INDEX textsearch_idx ON pgweb USING gin(textsearchable_index_col);
|
|
</programlisting>
|
|
|
|
Now we are ready to perform a fast full text search:
|
|
|
|
<programlisting>
|
|
SELECT title
|
|
FROM pgweb
|
|
WHERE textsearchable_index_col @@ to_tsquery('create & table')
|
|
ORDER BY last_mod_date DESC
|
|
LIMIT 10;
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
When using a separate column to store the <type>tsvector</>
|
|
representation,
|
|
it is necessary to create a trigger to keep the <type>tsvector</>
|
|
column current whenever <literal>title</> or <literal>body</> changes.
|
|
<xref linkend="textsearch-update-triggers"> explains how to do that.
|
|
</para>
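<para>
As a short preview, such a trigger could be declared with the built-in
<function>tsvector_update_trigger</function> function. The trigger name
<literal>tsvectorupdate</literal> is arbitrary, and the configuration name is
schema-qualified so that the trigger does not depend on the session's
<varname>search_path</varname>:

<programlisting>
-- Recompute textsearchable_index_col from title and body on every write.
CREATE TRIGGER tsvectorupdate
BEFORE INSERT OR UPDATE ON pgweb
FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(textsearchable_index_col, 'pg_catalog.english', title, body);
</programlisting>
</para>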
|
|
|
|
<para>
|
|
One advantage of the separate-column approach over an expression index
|
|
is that it is not necessary to explicitly specify the text search
|
|
configuration in queries in order to make use of the index. As shown
|
|
in the example above, the query can depend on
|
|
<varname>default_text_search_config</>. Another advantage is that
|
|
searches will be faster, since it will not be necessary to redo the
|
|
<function>to_tsvector</> calls to verify index matches. (This is more
|
|
important when using a GiST index than a GIN index; see <xref
|
|
linkend="textsearch-indexes">.) The expression-index approach is
|
|
simpler to set up, however, and it requires less disk space since the
|
|
<type>tsvector</> representation is not stored explicitly.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="textsearch-controls">
|
|
<title>Controlling Text Search</title>
|
|
|
|
<para>
|
|
To implement full text searching there must be a function to create a
|
|
<type>tsvector</type> from a document and a <type>tsquery</type> from a
|
|
user query. Also, we need to return results in a useful order, so we need
|
|
a function that compares documents with respect to their relevance to
|
|
the query. It's also important to be able to display the results nicely.
|
|
<productname>PostgreSQL</productname> provides support for all of these
|
|
functions.
|
|
</para>
|
|
|
|
<sect2 id="textsearch-parsing-documents">
|
|
<title>Parsing Documents</title>
|
|
|
|
<para>
|
|
<productname>PostgreSQL</productname> provides the
|
|
function <function>to_tsvector</function> for converting a document to
|
|
the <type>tsvector</type> data type.
|
|
</para>
|
|
|
|
<indexterm>
|
|
<primary>to_tsvector</primary>
|
|
</indexterm>
|
|
|
|
<synopsis>
|
|
to_tsvector(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">document</replaceable> <type>text</>) returns <type>tsvector</>
|
|
</synopsis>
|
|
|
|
<para>
|
|
<function>to_tsvector</function> parses a textual document into tokens,
|
|
reduces the tokens to lexemes, and returns a <type>tsvector</type> which
|
|
lists the lexemes together with their positions in the document.
|
|
The document is processed according to the specified or default
|
|
text search configuration.
|
|
Here is a simple example:
|
|
|
|
<programlisting>
|
|
SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
|
|
to_tsvector
|
|
-----------------------------------------------------
|
|
'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
In the example above we see that the resulting <type>tsvector</type> does not
|
|
contain the words <literal>a</literal>, <literal>on</literal>, or
|
|
<literal>it</literal>, the word <literal>rats</literal> became
|
|
<literal>rat</literal>, and the punctuation sign <literal>-</literal> was
|
|
ignored.
|
|
</para>
|
|
|
|
<para>
|
|
The <function>to_tsvector</function> function internally calls a parser
|
|
which breaks the document text into tokens and assigns a type to
|
|
each token. For each token, a list of
|
|
dictionaries (<xref linkend="textsearch-dictionaries">) is consulted,
|
|
where the list can vary depending on the token type. The first dictionary
|
|
that <firstterm>recognizes</> the token emits one or more normalized
|
|
<firstterm>lexemes</firstterm> to represent the token. For example,
|
|
<literal>rats</literal> became <literal>rat</literal> because one of the
|
|
dictionaries recognized that the word <literal>rats</literal> is a plural
|
|
form of <literal>rat</literal>. Some words are recognized as
|
|
<firstterm>stop words</> (<xref linkend="textsearch-stopwords">), which
|
|
causes them to be ignored since they occur too frequently to be useful in
|
|
searching. In our example these are
|
|
<literal>a</literal>, <literal>on</literal>, and <literal>it</literal>.
|
|
If no dictionary in the list recognizes the token then it is also ignored.
|
|
In this example that happened to the punctuation sign <literal>-</literal>
|
|
because there are in fact no dictionaries assigned for its token type
|
|
(<literal>Space symbols</literal>), meaning space tokens will never be
|
|
indexed. The choices of parser, dictionaries and which types of tokens to
|
|
index are determined by the selected text search configuration (<xref
|
|
linkend="textsearch-configuration">). It is possible to have
|
|
many different configurations in the same database, and predefined
|
|
configurations are available for various languages. In our example
|
|
we used the default configuration <literal>english</literal> for the
|
|
English language.
|
|
</para>
|
|
|
|
<para>
|
|
The function <function>setweight</function> can be used to label the
|
|
entries of a <type>tsvector</type> with a given <firstterm>weight</>,
|
|
where a weight is one of the letters <literal>A</>, <literal>B</>,
|
|
<literal>C</>, or <literal>D</>.
|
|
This is typically used to mark entries coming from
|
|
different parts of a document, such as title versus body. Later, this
|
|
information can be used for ranking of search results.
|
|
</para>
|
|
|
|
<para>
|
|
Because <function>to_tsvector</function>(<literal>NULL</literal>) will
|
|
return <literal>NULL</literal>, it is recommended to use
|
|
<function>coalesce</function> whenever a field might be null.
|
|
Here is the recommended method for creating
|
|
a <type>tsvector</type> from a structured document:
|
|
|
|
<programlisting>
|
|
UPDATE tt SET ti =
|
|
setweight(to_tsvector(coalesce(title,'')), 'A') ||
|
|
setweight(to_tsvector(coalesce(keyword,'')), 'B') ||
|
|
setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
|
|
setweight(to_tsvector(coalesce(body,'')), 'D');
|
|
</programlisting>
|
|
|
|
Here we have used <function>setweight</function> to label the source
|
|
of each lexeme in the finished <type>tsvector</type>, and then merged
|
|
the labeled <type>tsvector</type> values using the <type>tsvector</>
|
|
concatenation operator <literal>||</>. (<xref
|
|
linkend="textsearch-manipulate-tsvector"> gives details about these
|
|
operations.)
|
|
</para>
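<para>
The effect of labeling and merging can be inspected with constant strings
before running it against a table. In this sketch the two strings stand in for
a title and a body, and the <literal>english</literal> configuration is assumed:

<programlisting>
-- Lexemes from the first string carry weight A, those from the second weight D.
-- Positions in the second vector are shifted past the end of the first.
SELECT setweight(to_tsvector('english', 'Fat Rats'), 'A') ||
       setweight(to_tsvector('english', 'A long story about cats'), 'D');
</programlisting>
</para>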
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="textsearch-parsing-queries">
|
|
<title>Parsing Queries</title>
|
|
|
|
<para>
|
|
<productname>PostgreSQL</productname> provides the
|
|
functions <function>to_tsquery</function> and
|
|
<function>plainto_tsquery</function> for converting a query to
|
|
the <type>tsquery</type> data type. <function>to_tsquery</function>
|
|
offers access to more features than <function>plainto_tsquery</function>,
|
|
but is less forgiving about its input.
|
|
</para>
|
|
|
|
<indexterm>
|
|
<primary>to_tsquery</primary>
|
|
</indexterm>
|
|
|
|
<synopsis>
|
|
to_tsquery(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">querytext</replaceable> <type>text</>) returns <type>tsquery</>
|
|
</synopsis>
|
|
|
|
<para>
|
|
<function>to_tsquery</function> creates a <type>tsquery</> value from
|
|
<replaceable>querytext</replaceable>, which must consist of single tokens
|
|
separated by the Boolean operators <literal>&</literal> (AND),
|
|
<literal>|</literal> (OR) and <literal>!</literal> (NOT). These operators
|
|
can be grouped using parentheses. In other words, the input to
|
|
<function>to_tsquery</function> must already follow the general rules for
|
|
<type>tsquery</> input, as described in <xref
|
|
linkend="datatype-textsearch">. The difference is that while basic
|
|
<type>tsquery</> input takes the tokens at face value,
|
|
<function>to_tsquery</function> normalizes each token to a lexeme using
|
|
the specified or default configuration, and discards any tokens that are
|
|
stop words according to the configuration. For example:
|
|
|
|
<programlisting>
|
|
SELECT to_tsquery('english', 'The & Fat & Rats');
|
|
to_tsquery
|
|
---------------
|
|
'fat' & 'rat'
|
|
</programlisting>
|
|
|
|
As in basic <type>tsquery</> input, weight(s) can be attached to each
|
|
lexeme to restrict it to match only <type>tsvector</> lexemes of those
|
|
weight(s). For example:
|
|
|
|
<programlisting>
|
|
SELECT to_tsquery('english', 'Fat | Rats:AB');
|
|
to_tsquery
|
|
------------------
|
|
'fat' | 'rat':AB
|
|
</programlisting>
|
|
|
|
Also, <literal>*</> can be attached to a lexeme to specify prefix matching:
|
|
|
|
<programlisting>
|
|
SELECT to_tsquery('supern:*A & star:A*B');
|
|
to_tsquery
|
|
--------------------------
|
|
'supern':*A & 'star':*AB
|
|
</programlisting>
|
|
|
|
Such a lexeme will match any word in a <type>tsvector</> that begins
|
|
with the given string.
|
|
</para>
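<para>
A minimal check of prefix matching can be written with literal
<type>tsvector</type> and <type>tsquery</type> values, so that no
configuration or normalization is involved. The lexemes are chosen only for
illustration:

<programlisting>
-- True: 'supernovae' begins with 'supern'.
SELECT 'supernovae:1 star:2'::tsvector @@ 'supern:*'::tsquery;

-- False: without :* the lexeme must match exactly.
SELECT 'supernovae:1 star:2'::tsvector @@ 'supern'::tsquery;
</programlisting>
</para>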
|
|
|
|
<para>
|
|
<function>to_tsquery</function> can also accept single-quoted
|
|
phrases. This is primarily useful when the configuration includes a
|
|
thesaurus dictionary that may trigger on such phrases.
|
|
In the example below, a thesaurus contains the rule <literal>supernovae
|
|
stars : sn</literal>:
|
|
|
|
<programlisting>
|
|
SELECT to_tsquery('''supernovae stars'' & !crab');
|
|
to_tsquery
|
|
---------------
|
|
'sn' & !'crab'
|
|
</programlisting>
|
|
|
|
Without quotes, <function>to_tsquery</function> will generate a syntax
|
|
error for tokens that are not separated by an AND or OR operator.
|
|
</para>
|
|
|
|
<indexterm>
|
|
<primary>plainto_tsquery</primary>
|
|
</indexterm>
|
|
|
|
<synopsis>
|
|
plainto_tsquery(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">querytext</replaceable> <type>text</>) returns <type>tsquery</>
|
|
</synopsis>
|
|
|
|
<para>
|
|
<function>plainto_tsquery</> transforms unformatted text
|
|
<replaceable>querytext</replaceable> to <type>tsquery</type>.
|
|
The text is parsed and normalized much as for <function>to_tsvector</>,
|
|
then the <literal>&</literal> (AND) Boolean operator is inserted
|
|
between surviving words.
|
|
</para>
|
|
|
|
<para>
|
|
Example:
|
|
|
|
<programlisting>
|
|
SELECT plainto_tsquery('english', 'The Fat Rats');
|
|
plainto_tsquery
|
|
-----------------
|
|
'fat' & 'rat'
|
|
</programlisting>
|
|
|
|
Note that <function>plainto_tsquery</> cannot
|
|
recognize Boolean operators, weight labels, or prefix-match labels
|
|
in its input:
|
|
|
|
<programlisting>
|
|
SELECT plainto_tsquery('english', 'The Fat & Rats:C');
|
|
plainto_tsquery
|
|
---------------------
|
|
'fat' & 'rat' & 'c'
|
|
</programlisting>
|
|
|
|
Here, all the input punctuation was discarded as being space symbols.
|
|
</para>
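<para>
Because it never raises a syntax error on operators or labels,
<function>plainto_tsquery</function> can be a convenient choice for text
typed directly by end users. The following is a minimal sketch, assuming a
hypothetical table <structname>apod</structname> with a
<type>tsvector</type> column named <structfield>textsearch</structfield>:

<programlisting>
-- apod and textsearch are assumed names, not defined in this section.
SELECT title
FROM apod
WHERE textsearch @@ plainto_tsquery('english', 'dark matter');
</programlisting>
</para>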
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="textsearch-ranking">
|
|
<title>Ranking Search Results</title>
|
|
|
|
<para>
|
|
Ranking attempts to measure how relevant documents are to a particular
|
|
query, so that when there are many matches the most relevant ones can be
|
|
shown first. <productname>PostgreSQL</productname> provides two
|
|
predefined ranking functions, which take into account lexical, proximity,
|
|
and structural information; that is, they consider how often the query
|
|
terms appear in the document, how close together the terms are in the
|
|
document, and how important the part of the document where they occur is.
|
|
However, the concept of relevancy is vague and very application-specific.
|
|
Different applications might require additional information for ranking,
|
|
e.g., document modification time. The built-in ranking functions are only
|
|
examples. You can write your own ranking functions and/or combine their
|
|
results with additional factors to fit your specific needs.
|
|
</para>
|
|
|
|
<para>
|
|
The two ranking functions currently available are:
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
|
|
<indexterm>
|
|
<primary>ts_rank</primary>
|
|
</indexterm>
|
|
|
|
<term>
|
|
<synopsis>
|
|
ts_rank(<optional> <replaceable class="PARAMETER">weights</replaceable> <type>float4[]</>, </optional> <replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>,
|
|
<replaceable class="PARAMETER">query</replaceable> <type>tsquery</> <optional>, <replaceable class="PARAMETER">normalization</replaceable> <type>integer</> </optional>) returns <type>float4</>
|
|
</synopsis>
|
|
</term>
|
|
|
|
<listitem>
|
|
<para>
|
|
Ranks vectors based on the frequency of their matching lexemes.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
|
|
<indexterm>
|
|
<primary>ts_rank_cd</primary>
|
|
</indexterm>
|
|
|
|
<term>
|
|
<synopsis>
|
|
ts_rank_cd(<optional> <replaceable class="PARAMETER">weights</replaceable> <type>float4[]</>, </optional> <replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>,
|
|
<replaceable class="PARAMETER">query</replaceable> <type>tsquery</> <optional>, <replaceable class="PARAMETER">normalization</replaceable> <type>integer</> </optional>) returns <type>float4</>
|
|
</synopsis>
|
|
</term>
|
|
|
|
<listitem>
|
|
<para>
|
|
This function computes the <firstterm>cover density</firstterm>
|
|
ranking for the given document vector and query, as described in
|
|
Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three
|
|
Term Queries" in the journal "Information Processing and Management",
|
|
1999.
|
|
</para>
|
|
|
|
<para>
|
|
This function requires positional information in its input.
|
|
Therefore it will not work on <quote>stripped</> <type>tsvector</>
|
|
values — it will always return zero.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
|
|
</para>
|
|
|
|
<para>
|
|
For both these functions,
|
|
the optional <replaceable class="PARAMETER">weights</replaceable>
|
|
argument offers the ability to weigh word instances more or less
|
|
heavily depending on how they are labeled. The weight array specifies
how heavily to weigh each category of word, in the order:
|
|
|
|
<programlisting>
|
|
{D-weight, C-weight, B-weight, A-weight}
|
|
</programlisting>
|
|
|
|
If no <replaceable class="PARAMETER">weights</replaceable> are provided,
|
|
then these defaults are used:
|
|
|
|
<programlisting>
|
|
{0.1, 0.2, 0.4, 1.0}
|
|
</programlisting>
|
|
|
|
Typically weights are used to mark words from special areas of the
|
|
document, like the title or an initial abstract, so they can be
|
|
treated with more or less importance than words in the document body.
|
|
</para>
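<para>
As a sketch of how a non-default weight array might be passed, the query
below gives extra emphasis to <literal>A</literal>- and
<literal>B</literal>-labeled positions. The table
<structname>apod</structname>, its <structfield>textsearch</structfield>
column, and the particular weight values are assumptions made for
illustration:

<programlisting>
-- Array order is {D-weight, C-weight, B-weight, A-weight}.
SELECT title,
       ts_rank('{0.1, 0.1, 0.5, 1.0}'::float4[], textsearch, query) AS rank
FROM apod, to_tsquery('neutrino') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
</programlisting>

The explicit cast to <type>float4[]</type> is not strictly required, but it
makes the intended function signature obvious to the reader.
</para>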
|
|
|
|
<para>
|
|
Since a longer document has a greater chance of containing a query term,
|
|
it is reasonable to take into account document size, e.g., a hundred-word
|
|
document with five instances of a search word is probably more relevant
|
|
than a thousand-word document with five instances. Both ranking functions
|
|
take an integer <replaceable>normalization</replaceable> option that
|
|
specifies whether and how a document's length should impact its rank.
|
|
The integer option controls several behaviors, so it is a bit mask:
|
|
you can specify one or more behaviors using
|
|
<literal>|</literal> (for example, <literal>2|4</literal>).
|
|
|
|
<itemizedlist spacing="compact" mark="bullet">
|
|
<listitem>
|
|
<para>
|
|
0 (the default) ignores the document length
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
1 divides the rank by 1 + the logarithm of the document length
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
2 divides the rank by the document length
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
4 divides the rank by the mean harmonic distance between extents
|
|
(this is implemented only by <function>ts_rank_cd</>)
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
8 divides the rank by the number of unique words in the document
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
16 divides the rank by 1 + the logarithm of the number
|
|
of unique words in the document
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
32 divides the rank by itself + 1
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
If more than one flag bit is specified, the transformations are
|
|
applied in the order listed.
|
|
</para>
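<para>
As a sketch of combining flags, the following query asks
<function>ts_rank_cd</function> to divide the rank by the document length
(2) and then by the mean harmonic distance between extents (4). The table
<structname>apod</structname> and its
<structfield>textsearch</structfield> column are assumed names, and the
choice of flags is only illustrative:

<programlisting>
-- 2|4 is ordinary integer bitwise OR, so it is the same as passing 6.
SELECT title, ts_rank_cd(textsearch, query, 2|4) AS rank
FROM apod, to_tsquery('neutrino') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
</programlisting>
</para>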
|
|
|
|
<para>
|
|
It is important to note that the ranking functions do not use any global
|
|
information, so it is impossible to produce a fair normalization to 1% or
|
|
100% as sometimes desired. Normalization option 32
|
|
(<literal>rank/(rank+1)</literal>) can be applied to scale all ranks
|
|
into the range zero to one, but of course this is just a cosmetic change;
|
|
it will not affect the ordering of the search results.
|
|
</para>
|
|
|
|
<para>
|
|
Here is an example that selects only the ten highest-ranked matches:
|
|
|
|
<programlisting>
|
|
SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |   rank
-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218
|
|
</programlisting>
|
|
|
|
This is the same example using normalized ranking:
|
|
|
|
<programlisting>
|
|
SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |        rank
-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
 The Sudbury Neutrino Detector                 | 0.705882361190954
 A MACHO View of Galactic Dark Matter          | 0.668123210574724
 Hot Gas and Dark Matter                       |  0.65655958650282
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 Rafting for Solar Neutrinos                   | 0.655172410958162
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 Hot Gas and Dark Matter                       | 0.617195790024749
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 Weak Lensing Distorts the Universe            | 0.450010798361481
|
|
</programlisting>
|
|
</para>
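<para>
Since normalization option 32 maps each rank <literal>r</literal> to
<literal>r/(r+1)</literal>, the normalized values can be checked by hand
against the unnormalized listing. For the first row, the raw rank of 3.1
becomes roughly 0.756, as this small calculation sketches:

<programlisting>
-- Plain arithmetic; no text search functions are involved.
SELECT 3.1 / (3.1 + 1) AS normalized_rank;
</programlisting>

The mapping is strictly increasing for non-negative ranks, which is why
the row order is the same in both listings.
</para>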
|
|
|
|
<para>
|
|
Ranking can be expensive since it requires consulting the
|
|
<type>tsvector</type> of each matching document, which can be I/O bound and
|
|
therefore slow. Unfortunately, it is almost impossible to avoid since
|
|
practical queries often result in large numbers of matches.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="textsearch-headline">
|
|
<title>Highlighting Results</title>
|
|
|
|
<para>
|
|
To present search results it is ideal to show a part of each document and
|
|
how it is related to the query. Usually, search engines show fragments of
|
|
the document with marked search terms. <productname>PostgreSQL</>
|
|
provides a function <function>ts_headline</function> that
|
|
implements this functionality.
|
|
</para>
|
|
|
|
<indexterm>
|
|
<primary>ts_headline</primary>
|
|
</indexterm>
|
|
|
|
<synopsis>
|
|
ts_headline(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">document</replaceable> <type>text</>, <replaceable class="PARAMETER">query</replaceable> <type>tsquery</> <optional>, <replaceable class="PARAMETER">options</replaceable> <type>text</> </optional>) returns <type>text</>
|
|
</synopsis>
|
|
|
|
<para>
|
|
<function>ts_headline</function> accepts a document along
|
|
with a query, and returns an excerpt from
|
|
the document in which terms from the query are highlighted. The
|
|
configuration to be used to parse the document can be specified by
|
|
<replaceable>config</replaceable>; if <replaceable>config</replaceable>
|
|
is omitted, the
|
|
<varname>default_text_search_config</varname> configuration is used.
|
|
</para>
|
|
|
|
<para>
|
|
If an <replaceable>options</replaceable> string is specified, it must
|
|
consist of a comma-separated list of one or more
|
|
<replaceable>option</><literal>=</><replaceable>value</> pairs.
|
|
The available options are:
|
|
|
|
<itemizedlist spacing="compact" mark="bullet">
|
|
<listitem>
|
|
<para>
|
|
<literal>StartSel</>, <literal>StopSel</literal>: the strings with
|
|
which to delimit query words appearing in the document, to distinguish
|
|
them from other excerpted words. You must double-quote these strings
|
|
if they contain spaces or commas.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<literal>MaxWords</>, <literal>MinWords</literal>: these numbers
|
|
determine the longest and shortest headlines to output.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<literal>ShortWord</literal>: words of this length or less will be
|
|
dropped at the start and end of a headline. The default
|
|
value of three eliminates common English articles.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<literal>HighlightAll</literal>: Boolean flag; if
|
|
<literal>true</literal> the whole document will be used as the
|
|
headline, ignoring the preceding three parameters.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<literal>MaxFragments</literal>: maximum number of text excerpts
|
|
or fragments to display. The default value of zero selects a
|
|
non-fragment-oriented headline generation method. A value greater than
|
|
zero selects fragment-based headline generation. This method
|
|
finds text fragments with as many query words as possible and
|
|
stretches those fragments around the query words. As a result
|
|
query words are close to the middle of each fragment and have words on
|
|
each side. Each fragment will be of at most <literal>MaxWords</> words, and
|
|
words of length <literal>ShortWord</> or less are dropped at the start
|
|
and end of each fragment. If not all query words are found in the
|
|
document, then a single fragment of the first <literal>MinWords</>
|
|
words in the document will be displayed.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<literal>FragmentDelimiter</literal>: When more than one fragment is
|
|
displayed, the fragments will be separated by this string.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
Any unspecified options receive these defaults:
|
|
|
|
<programlisting>
|
|
StartSel=<b>, StopSel=</b>,
|
|
MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE,
|
|
MaxFragments=0, FragmentDelimiter=" ... "
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
For example:
|
|
|
|
<programlisting>
|
|
SELECT ts_headline('english',
|
|
'The most common type of search
|
|
is to find all documents containing given query terms
|
|
and return them in order of their similarity to the
|
|
query.',
|
|
to_tsquery('query & similarity'));
|
|
ts_headline
|
|
------------------------------------------------------------
|
|
containing given <b>query</b> terms
|
|
and return them in order of their <b>similarity</b> to the
|
|
<b>query</b>.
|
|
|
|
SELECT ts_headline('english',
|
|
'The most common type of search
|
|
is to find all documents containing given query terms
|
|
and return them in order of their similarity to the
|
|
query.',
|
|
to_tsquery('query & similarity'),
|
|
'StartSel = <, StopSel = >');
|
|
ts_headline
|
|
-------------------------------------------------------
|
|
containing given <query> terms
|
|
and return them in order of their <similarity> to the
|
|
<query>.
|
|
</programlisting>
|
|
</para>
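<para>
Fragment-based headline generation can be requested in the same way. The
following sketch reuses the sample document from the examples above. The
particular option values are assumptions chosen only to keep the fragments
short, and the exact excerpt produced may vary:

<programlisting>
SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('english', 'query & similarity'),
  -- MaxFragments greater than zero selects the fragment-based method
  'MaxFragments=2, MaxWords=10, MinWords=5, FragmentDelimiter=" ... "');
</programlisting>
</para>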
|
|
|
|
<para>
|
|
<function>ts_headline</> uses the original document, not a
|
|
<type>tsvector</type> summary, so it can be slow and should be used with
|
|
care. A typical mistake is to call <function>ts_headline</function> for
|
|
<emphasis>every</emphasis> matching document when only ten documents are
|
|
to be shown. <acronym>SQL</acronym> subqueries can help; here is an
|
|
example:
|
|
|
|
<programlisting>
|
|
SELECT id, ts_headline(body, q), rank
|
|
FROM (SELECT id, body, q, ts_rank_cd(ti, q) AS rank
|
|
FROM apod, to_tsquery('stars') q
|
|
WHERE ti @@ q
|
|
ORDER BY rank DESC
|
|
LIMIT 10) AS foo;
|
|
</programlisting>
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="textsearch-features">
|
|
<title>Additional Features</title>
|
|
|
|
<para>
|
|
This section describes additional functions and operators that are
|
|
useful in connection with text search.
|
|
</para>
|
|
|
|
<sect2 id="textsearch-manipulate-tsvector">
|
|
<title>Manipulating Documents</title>
|
|
|
|
<para>
|
|
<xref linkend="textsearch-parsing-documents"> showed how raw textual
|
|
documents can be converted into <type>tsvector</> values.
|
|
<productname>PostgreSQL</productname> also provides functions and
|
|
operators that can be used to manipulate documents that are already
|
|
in <type>tsvector</> form.
|
|
</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
|
|
<indexterm>
|
|
<primary>tsvector concatenation</primary>
|
|
</indexterm>
|
|
|
|
<term>
|
|
<synopsis>
|
|
<type>tsvector</> || <type>tsvector</>
|
|
</synopsis>
|
|
</term>
|
|
|
|
<listitem>
|
|
<para>
|
|
The <type>tsvector</> concatenation operator
|
|
returns a vector which combines the lexemes and positional information
|
|
of the two vectors given as arguments. Positions and weight labels
|
|
are retained during the concatenation.
|
|
Positions appearing in the right-hand vector are offset by the largest
|
|
position mentioned in the left-hand vector, so that the result is
|
|
nearly equivalent to the result of performing <function>to_tsvector</>
|
|
on the concatenation of the two original document strings. (The
|
|
equivalence is not exact, because any stop-words removed from the
|
|
end of the left-hand argument will not affect the result, whereas
|
|
they would have affected the positions of the lexemes in the
|
|
right-hand argument if textual concatenation were used.)
|
|
</para>
|
|
|
|
<para>
|
|
One advantage of using concatenation in the vector form, rather than
|
|
concatenating text before applying <function>to_tsvector</>, is that
|
|
you can use different configurations to parse different sections
|
|
of the document. Also, because the <function>setweight</> function
|
|
marks all lexemes of the given vector the same way, it is necessary
|
|
to parse the text and do <function>setweight</> before concatenating
|
|
if you want to label different parts of the document with different
|
|
weights.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
|
|
<indexterm>
|
|
<primary>setweight</primary>
|
|
</indexterm>
|
|
|
|
<term>
|
|
<synopsis>
|
|
setweight(<replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>, <replaceable class="PARAMETER">weight</replaceable> <type>"char"</>) returns <type>tsvector</>
|
|
</synopsis>
|
|
</term>
|
|
|
|
<listitem>
|
|
<para>
|
|
<function>setweight</> returns a copy of the input vector in which every
|
|
position has been labeled with the given <replaceable>weight</>, either
|
|
<literal>A</literal>, <literal>B</literal>, <literal>C</literal>, or
|
|
<literal>D</literal>. (<literal>D</literal> is the default for new
|
|
vectors and as such is not displayed on output.) These labels are
|
|
retained when vectors are concatenated, allowing words from different
|
|
parts of a document to be weighted differently by ranking functions.
|
|
</para>
|
|
|
|
<para>
|
|
Note that weight labels apply to <emphasis>positions</>, not
|
|
<emphasis>lexemes</>. If the input vector has been stripped of
|
|
positions then <function>setweight</> does nothing.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<indexterm>
|
|
<primary>length(tsvector)</primary>
|
|
</indexterm>
|
|
|
|
<term>
|
|
<synopsis>
|
|
length(<replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>) returns <type>integer</>
|
|
</synopsis>
|
|
</term>
|
|
|
|
<listitem>
|
|
<para>
|
|
Returns the number of lexemes stored in the vector.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
|
|
<indexterm>
|
|
<primary>strip</primary>
|
|
</indexterm>
|
|
|
|
<term>
|
|
<synopsis>
|
|
strip(<replaceable class="PARAMETER">vector</replaceable> <type>tsvector</>) returns <type>tsvector</>
|
|
</synopsis>
|
|
</term>
|
|
|
|
<listitem>
|
|
<para>
|
|
Returns a vector which lists the same lexemes as the given vector, but
|
|
which lacks any position or weight information. While the returned
|
|
vector is much less useful than an unstripped vector for relevance
|
|
ranking, it will usually be much smaller.
|
|
</para>
|
|
</listitem>
|
|
|
|
</varlistentry>
|
|
|
|
</variablelist>
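<para>
The operations described above can be tried on literal
<type>tsvector</type> values. The inputs below are arbitrary sample
vectors, not taken from any real document:

<programlisting>
-- Concatenation: right-hand positions are offset by 2, the largest
-- position in the left-hand vector.
SELECT 'a:1 b:2'::tsvector || 'c:1 d:2 b:3'::tsvector;
         ?column?
---------------------------
 'a':1 'b':2,5 'c':3 'd':4

-- setweight labels every position with the given weight.
SELECT setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A');
           setweight
-------------------------------
 'cat':3A 'fat':2A,4A 'rat':5A

-- length counts distinct lexemes, not positions.
SELECT length('fat:2,4 cat:3 rat:5A'::tsvector);
 length
--------
      3

-- strip removes positions and weights.
SELECT strip('fat:2,4 cat:3 rat:5A'::tsvector);
       strip
-------------------
 'cat' 'fat' 'rat'
</programlisting>
</para>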
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="textsearch-manipulate-tsquery">
|
|
<title>Manipulating Queries</title>
|
|
|
|
<para>
|
|
<xref linkend="textsearch-parsing-queries"> showed how raw textual
|
|
queries can be converted into <type>tsquery</> values.
|
|
<productname>PostgreSQL</productname> also provides functions and
|
|
operators that can be used to manipulate queries that are already
|
|
in <type>tsquery</> form.
|
|
</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
|
|
<term>
|
|
<synopsis>
|
|
<type>tsquery</> && <type>tsquery</>
|
|
</synopsis>
|
|
</term>
|
|
|
|
<listitem>
|
|
<para>
|
|
Returns the AND-combination of the two given queries.
|
|
</para>
|
|
</listitem>
|
|
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
|
|
<term>
|
|
<synopsis>
|
|
<type>tsquery</> || <type>tsquery</>
|
|
</synopsis>
|
|
</term>
|
|
|
|
<listitem>
|
|
<para>
|
|
Returns the OR-combination of the two given queries.
|
|
</para>
|
|
</listitem>
|
|
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
|
|
<term>
|
|
<synopsis>
|
|
!! <type>tsquery</>
|
|
</synopsis>
|
|
</term>
|
|
|
|
<listitem>
|
|
<para>
|
|
Returns the negation (NOT) of the given query.
|
|
</para>
|
|
</listitem>
|
|
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
|
|
<indexterm>
|
|
<primary>numnode</primary>
|
|
</indexterm>
|
|
|
|
<term>
|
|
<synopsis>
|
|
numnode(<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>) returns <type>integer</>
|
|
</synopsis>
|
|
</term>
|
|
|
|
<listitem>
|
|
<para>
|
|
Returns the number of nodes (lexemes plus operators) in a
|
|
<type>tsquery</>. This function is useful
|
|
to determine if the <replaceable>query</replaceable> is meaningful
|
|
(returns > 0), or contains only stop words (returns 0).
|
|
Examples:
|
|
|
|
<programlisting>
|
|
SELECT numnode(plainto_tsquery('the any'));
|
|
NOTICE: query contains only stopword(s) or doesn't contain lexeme(s), ignored
|
|
numnode
|
|
---------
|
|
0
|
|
|
|
SELECT numnode('foo & bar'::tsquery);
|
|
numnode
|
|
---------
|
|
3
|
|
</programlisting>
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
|
|
<indexterm>
|
|
<primary>querytree</primary>
|
|
</indexterm>
|
|
|
|
<term>
|
|
<synopsis>
|
|
querytree(<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>) returns <type>text</>
|
|
</synopsis>
|
|
</term>
|
|
|
|
<listitem>
|
|
<para>
|
|
Returns the portion of a <type>tsquery</> that can be used for
|
|
searching an index. This function is useful for detecting
|
|
unindexable queries, for example those containing only stop words
|
|
or only negated terms. For example:
|
|
|
|
<programlisting>
|
|
SELECT querytree(to_tsquery('!defined'));
|
|
querytree
|
|
-----------
|
|
|
|
</programlisting>
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
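<para>
As a short sketch of these operators and of <function>numnode</function>
used as a guard, consider the following statements. The query strings are
arbitrary samples, and the output is omitted because the textual form of
the result can differ between releases:

<programlisting>
SELECT 'fat | rat'::tsquery && 'cat'::tsquery;   -- AND-combination
SELECT 'fat | rat'::tsquery || 'cat'::tsquery;   -- OR-combination
SELECT !! 'cat'::tsquery;                        -- negation

-- True when the user's text reduces to nothing searchable.
SELECT numnode(plainto_tsquery('english', 'the any')) = 0 AS only_stop_words;
</programlisting>
</para>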
|
|
|
|
<sect3 id="textsearch-query-rewriting">
|
|
<title>Query Rewriting</title>
|
|
|
|
<indexterm zone="textsearch-query-rewriting">
|
|
<primary>ts_rewrite</primary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
The <function>ts_rewrite</function> family of functions searches a
given <type>tsquery</> for occurrences of a target
subquery, and replaces each occurrence with a
|
|
substitute subquery. In essence this operation is a
|
|
<type>tsquery</>-specific version of substring replacement.
|
|
A target and substitute combination can be
|
|
thought of as a <firstterm>query rewrite rule</>. A collection
|
|
of such rewrite rules can be a powerful search aid.
|
|
For example, you can expand the search using synonyms
|
|
(e.g., <literal>new york</>, <literal>big apple</>, <literal>nyc</>,
|
|
<literal>gotham</>) or narrow the search to direct the user to some hot
|
|
topic. There is some overlap in functionality between this feature
|
|
and thesaurus dictionaries (<xref linkend="textsearch-thesaurus">).
|
|
However, you can modify a set of rewrite rules on-the-fly without
|
|
reindexing, whereas updating a thesaurus requires reindexing to be
|
|
effective.
|
|
</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
|
|
<term>
|
|
<synopsis>
|
|
ts_rewrite (<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">target</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">substitute</replaceable> <type>tsquery</>) returns <type>tsquery</>
|
|
</synopsis>
|
|
</term>
|
|
|
|
<listitem>
|
|
<para>
|
|
This form of <function>ts_rewrite</> simply applies a single
|
|
rewrite rule: <replaceable class="PARAMETER">target</replaceable>
|
|
is replaced by <replaceable class="PARAMETER">substitute</replaceable>
|
|
wherever it appears in <replaceable
|
|
class="PARAMETER">query</replaceable>. For example:
|
|
|
|
<programlisting>
|
|
SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery);
|
|
ts_rewrite
|
|
------------
|
|
'b' & 'c'
|
|
</programlisting>
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
|
|
<term>
|
|
<synopsis>
|
|
ts_rewrite (<replaceable class="PARAMETER">query</> <type>tsquery</>, <replaceable class="PARAMETER">select</> <type>text</>) returns <type>tsquery</>
|
|
</synopsis>
|
|
</term>
|
|
|
|
<listitem>
|
|
<para>
|
|
This form of <function>ts_rewrite</> accepts a starting
|
|
<replaceable>query</> and a SQL <replaceable>select</> command, which
|
|
is given as a text string. The <replaceable>select</> must yield two
|
|
columns of <type>tsquery</> type. For each row of the
|
|
<replaceable>select</> result, occurrences of the first column value
|
|
(the target) are replaced by the second column value (the substitute)
|
|
within the current <replaceable>query</> value. For example:
|
|
|
|
<programlisting>
|
|
CREATE TABLE aliases (t tsquery PRIMARY KEY, s tsquery);
|
|
INSERT INTO aliases VALUES('a', 'c');
|
|
|
|
SELECT ts_rewrite('a & b'::tsquery, 'SELECT t,s FROM aliases');
|
|
ts_rewrite
|
|
------------
|
|
'b' & 'c'
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
Note that when multiple rewrite rules are applied in this way,
|
|
the order of application can be important; so in practice you will
|
|
want the source query to <literal>ORDER BY</> some ordering key.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
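<para>
One way to make the order of rule application explicit is to store an
ordering key alongside each rule. The sketch below uses a hypothetical
table <structname>rewrite_rules</structname> with an assumed
<structfield>priority</structfield> column. Neither name is part of
<productname>PostgreSQL</productname>:

<programlisting>
CREATE TABLE rewrite_rules (
    priority integer,   -- hypothetical ordering key
    t tsquery,          -- target
    s tsquery           -- substitute
);
INSERT INTO rewrite_rules VALUES (1, 'a', 'c'), (2, 'c', 'd');

-- Only the two tsquery columns are selected; priority is used for ordering.
SELECT ts_rewrite('a & b'::tsquery,
                  'SELECT t, s FROM rewrite_rules ORDER BY priority');
</programlisting>

With this ordering the second rule is applied to the output of the first,
so reversing the priorities could give a different final query.
</para>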
|
|
|
|
<para>
|
|
Let's consider a real-life astronomical example. We'll expand the query
|
|
<literal>supernovae</literal> using table-driven rewriting rules:
|
|
|
|
<programlisting>
|
|
CREATE TABLE aliases (t tsquery primary key, s tsquery);
|
|
INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn'));
|
|
|
|
SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases');
|
|
ts_rewrite
|
|
---------------------------------
|
|
'crab' & ( 'supernova' | 'sn' )
|
|
</programlisting>
|
|
|
|
We can change the rewriting rules just by updating the table:
|
|
|
|
<programlisting>
|
|
UPDATE aliases
|
|
SET s = to_tsquery('supernovae|sn & !nebulae')
|
|
WHERE t = to_tsquery('supernovae');
|
|
|
|
SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases');
|
|
ts_rewrite
|
|
---------------------------------------------
|
|
'crab' & ( 'supernova' | 'sn' & !'nebula' )
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
Rewriting can be slow when there are many rewriting rules, since it
|
|
checks every rule for a possible match. To filter out obvious non-candidate
|
|
rules we can use the containment operators for the <type>tsquery</type>
|
|
type. In the example below, we select only those rules which might match
|
|
the original query:
|
|
|
|
<programlisting>
|
|
SELECT ts_rewrite('a & b'::tsquery,
|
|
'SELECT t,s FROM aliases WHERE ''a & b''::tsquery @> t');
|
|
ts_rewrite
|
|
------------
|
|
'b' & 'c'
|
|
</programlisting>
|
|
</para>
|
|
|
|
</sect3>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="textsearch-update-triggers">
|
|
<title>Triggers for Automatic Updates</title>
|
|
|
|
<indexterm>
|
|
<primary>trigger</primary>
|
|
<secondary>for updating a derived tsvector column</secondary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
When using a separate column to store the <type>tsvector</> representation
|
|
of your documents, it is necessary to create a trigger to update the
|
|
<type>tsvector</> column when the document content columns change.
|
|
Two built-in trigger functions are available for this, or you can write
|
|
your own.
|
|
</para>
|
|
|
|
<synopsis>
|
|
tsvector_update_trigger(<replaceable class="PARAMETER">tsvector_column_name</replaceable>, <replaceable class="PARAMETER">config_name</replaceable>, <replaceable class="PARAMETER">text_column_name</replaceable> <optional>, ... </optional>)
|
|
tsvector_update_trigger_column(<replaceable class="PARAMETER">tsvector_column_name</replaceable>, <replaceable class="PARAMETER">config_column_name</replaceable>, <replaceable class="PARAMETER">text_column_name</replaceable> <optional>, ... </optional>)
|
|
</synopsis>
|
|
|
|
<para>
|
|
These trigger functions automatically compute a <type>tsvector</>
|
|
column from one or more textual columns, under the control of
|
|
parameters specified in the <command>CREATE TRIGGER</> command.
|
|
An example of their use is:
|
|
|
|
<programlisting>
|
|
CREATE TABLE messages (
    title       text,
    body        text,
    tsv         tsvector
);

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON messages FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);

INSERT INTO messages VALUES('title here', 'the body text is here');

SELECT * FROM messages;
   title    |         body          |            tsv
------------+-----------------------+----------------------------
 title here | the body text is here | 'bodi':4 'text':5 'titl':1

SELECT title, body FROM messages WHERE tsv @@ to_tsquery('title & body');
   title    |         body
------------+-----------------------
 title here | the body text is here
|
|
</programlisting>
|
|
|
|
Once this trigger has been created, any change in <structfield>title</> or
<structfield>body</> will automatically be reflected in
|
|
<structfield>tsv</>, without the application having to worry about it.
|
|
</para>
|
|
|
|
<para>
|
|
The first trigger argument must be the name of the <type>tsvector</>
|
|
column to be updated. The second argument specifies the text search
|
|
configuration to be used to perform the conversion. For
|
|
<function>tsvector_update_trigger</>, the configuration name is simply
|
|
given as the second trigger argument. It must be schema-qualified as
|
|
shown above, so that the trigger behavior will not change with changes
|
|
in <varname>search_path</>. For
|
|
<function>tsvector_update_trigger_column</>, the second trigger argument
|
|
is the name of another table column, which must be of type
|
|
<type>regconfig</>. This allows a per-row selection of configuration
|
|
to be made. The remaining argument(s) are the names of textual columns
|
|
(of type <type>text</>, <type>varchar</>, or <type>char</>). These
|
|
will be included in the document in the order given. NULL values will
|
|
be skipped (but the other columns will still be indexed).
|
|
</para>
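<para>
A per-row configuration might be set up along the following lines. This
is a minimal sketch: the table name
<structname>messages_multilang</structname>, the trigger name, and the
<structfield>config</structfield> column are assumptions introduced for
this example:

<programlisting>
CREATE TABLE messages_multilang (
    title       text,
    body        text,
    config      regconfig,   -- configuration to use for this row
    tsv         tsvector
);

CREATE TRIGGER tsvectorupdate_multilang BEFORE INSERT OR UPDATE
ON messages_multilang FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger_column(tsv, config, title, body);

INSERT INTO messages_multilang (title, body, config)
VALUES ('title here', 'the body text is here', 'pg_catalog.english');
</programlisting>
</para>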
|
|
|
|
<para>
|
|
A limitation of these built-in triggers is that they treat all the
|
|
input columns alike. To process columns differently — for
|
|
example, to weight title differently from body — it is necessary
|
|
to write a custom trigger. Here is an example using
|
|
<application>PL/pgSQL</application> as the trigger language:
|
|
|
|
<programlisting>
|
|
CREATE FUNCTION messages_trigger() RETURNS trigger AS $$
|
|
begin
|
|
new.tsv :=
|
|
setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') ||
|
|
setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D');
|
|
return new;
|
|
end
|
|
$$ LANGUAGE plpgsql;
|
|
|
|
CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
|
|
ON messages FOR EACH ROW EXECUTE PROCEDURE messages_trigger();
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
Keep in mind that it is important to specify the configuration name
|
|
explicitly when creating <type>tsvector</> values inside triggers,
|
|
so that the column's contents will not be affected by changes to
|
|
<varname>default_text_search_config</>. Failure to do this is likely to
|
|
lead to problems such as search results changing after a dump and reload.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="textsearch-statistics">
|
|
<title>Gathering Document Statistics</title>
|
|
|
|
<indexterm>
|
|
<primary>ts_stat</primary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
The function <function>ts_stat</> is useful for checking your
|
|
configuration and for finding stop-word candidates.
|
|
</para>
|
|
|
|
<synopsis>
|
|
ts_stat(<replaceable class="PARAMETER">sqlquery</replaceable> <type>text</>, <optional> <replaceable class="PARAMETER">weights</replaceable> <type>text</>, </optional>
|
|
OUT <replaceable class="PARAMETER">word</replaceable> <type>text</>, OUT <replaceable class="PARAMETER">ndoc</replaceable> <type>integer</>,
|
|
OUT <replaceable class="PARAMETER">nentry</replaceable> <type>integer</>) returns <type>setof record</>
|
|
</synopsis>
|
|
|
|
<para>
|
|
<replaceable>sqlquery</replaceable> is a text value containing an SQL
|
|
query which must return a single <type>tsvector</type> column.
|
|
<function>ts_stat</> executes the query and returns statistics about
|
|
each distinct lexeme (word) contained in the <type>tsvector</type>
|
|
data. The columns returned are
|
|
|
|
<itemizedlist spacing="compact" mark="bullet">
|
|
<listitem>
|
|
<para>
|
|
<replaceable>word</> <type>text</> — the value of a lexeme
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<replaceable>ndoc</> <type>integer</> — number of documents
|
|
(<type>tsvector</>s) the word occurred in
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<replaceable>nentry</> <type>integer</> — total number of
|
|
occurrences of the word
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
If <replaceable>weights</replaceable> is supplied, only occurrences
|
|
having one of those weights are counted.
|
|
</para>
|
|
|
|
<para>
|
|
For example, to find the ten most frequent words in a document collection:
|
|
|
|
<programlisting>
|
|
SELECT * FROM ts_stat('SELECT vector FROM apod')
|
|
ORDER BY nentry DESC, ndoc DESC, word
|
|
LIMIT 10;
|
|
</programlisting>
|
|
|
|
The same, but counting only word occurrences with weight <literal>A</>
|
|
or <literal>B</>:
|
|
|
|
<programlisting>
|
|
SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
|
|
ORDER BY nentry DESC, ndoc DESC, word
|
|
LIMIT 10;
|
|
</programlisting>
|
|
</para>
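<para>
Since <function>ts_stat</function> reports the number of documents each word
occurs in, one plausible way to look for stop-word candidates is to list the
words present in more than half of the collection. This is only a sketch: it
reuses the <literal>apod</literal> table and <literal>vector</literal> column
from the examples above, and the one-half threshold is an arbitrary choice to
be tuned for the data at hand:

<programlisting>
-- Words found in more than half of all documents, most widespread first.
SELECT word, ndoc, nentry
FROM ts_stat('SELECT vector FROM apod')
WHERE ndoc > (SELECT count(*) FROM apod) / 2
ORDER BY ndoc DESC, word;
</programlisting>
</para>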
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="textsearch-parsers">
|
|
<title>Parsers</title>
|
|
|
|
<para>
|
|
Text search parsers are responsible for splitting raw document text
|
|
into <firstterm>tokens</> and identifying each token's type, where
|
|
the set of possible types is defined by the parser itself.
|
|
Note that a parser does not modify the text at all — it simply
|
|
identifies plausible word boundaries. Because of this limited scope,
|
|
there is less need for application-specific custom parsers than there is
|
|
for custom dictionaries. At present <productname>PostgreSQL</productname>
|
|
provides just one built-in parser, which has been found to be useful for a
|
|
wide range of applications.
|
|
</para>
|
|
|
|
<para>
|
|
The built-in parser is named <literal>pg_catalog.default</>.
|
|
It recognizes 23 token types:
|
|
</para>
|
|
|
|
<table id="textsearch-default-parser">
|
|
<title>Default Parser's Token Types</title>
|
|
<tgroup cols="3">
|
|
<thead>
|
|
<row>
|
|
<entry>Alias</entry>
|
|
<entry>Description</entry>
|
|
<entry>Example</entry>
|
|
</row>
|
|
</thead>
|
|
<tbody>
|
|
<row>
|
|
<entry><literal>asciiword</></entry>
|
|
<entry>Word, all ASCII letters</entry>
|
|
<entry><literal>elephant</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>word</></entry>
|
|
<entry>Word, all letters</entry>
|
|
<entry><literal>mañana</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>numword</></entry>
|
|
<entry>Word, letters and digits</entry>
|
|
<entry><literal>beta1</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>asciihword</></entry>
|
|
<entry>Hyphenated word, all ASCII</entry>
|
|
<entry><literal>up-to-date</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>hword</></entry>
|
|
<entry>Hyphenated word, all letters</entry>
|
|
<entry><literal>lógico-matemática</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>numhword</></entry>
|
|
<entry>Hyphenated word, letters and digits</entry>
|
|
<entry><literal>postgresql-beta1</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>hword_asciipart</></entry>
|
|
<entry>Hyphenated word part, all ASCII</entry>
|
|
<entry><literal>postgresql</literal> in the context <literal>postgresql-beta1</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>hword_part</></entry>
|
|
<entry>Hyphenated word part, all letters</entry>
|
|
<entry><literal>lógico</literal> or <literal>matemática</literal>
|
|
in the context <literal>lógico-matemática</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>hword_numpart</></entry>
|
|
<entry>Hyphenated word part, letters and digits</entry>
|
|
<entry><literal>beta1</literal> in the context
|
|
<literal>postgresql-beta1</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>email</></entry>
|
|
<entry>Email address</entry>
|
|
<entry><literal>foo@example.com</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>protocol</></entry>
|
|
<entry>Protocol head</entry>
|
|
<entry><literal>http://</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>url</></entry>
|
|
<entry>URL</entry>
|
|
<entry><literal>example.com/stuff/index.html</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>host</></entry>
|
|
<entry>Host</entry>
|
|
<entry><literal>example.com</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>url_path</></entry>
|
|
<entry>URL path</entry>
|
|
<entry><literal>/stuff/index.html</literal>, in the context of a URL</entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>file</></entry>
|
|
<entry>File or path name</entry>
|
|
<entry><literal>/usr/local/foo.txt</literal>, if not within a URL</entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>sfloat</></entry>
|
|
<entry>Scientific notation</entry>
|
|
<entry><literal>-1.234e56</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>float</></entry>
|
|
<entry>Decimal notation</entry>
|
|
<entry><literal>-1.234</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>int</></entry>
|
|
<entry>Signed integer</entry>
|
|
<entry><literal>-1234</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>uint</></entry>
|
|
<entry>Unsigned integer</entry>
|
|
<entry><literal>1234</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>version</></entry>
|
|
<entry>Version number</entry>
|
|
<entry><literal>8.3.0</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>tag</></entry>
|
|
<entry>XML tag</entry>
|
|
<entry><literal>&lt;a href="dictionaries.html"&gt;</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>entity</></entry>
|
|
<entry>XML entity</entry>
|
|
<entry><literal>&amp;amp;</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>blank</></entry>
|
|
<entry>Space symbols</entry>
|
|
<entry>(any whitespace or punctuation not otherwise recognized)</entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
|
|
<note>
|
|
<para>
|
|
The parser's notion of a <quote>letter</> is determined by the database's
|
|
locale setting, specifically <varname>lc_ctype</>. Words containing
|
|
only the basic ASCII letters are reported as a separate token type,
|
|
since it is sometimes useful to distinguish them. In most European
|
|
languages, token types <literal>word</> and <literal>asciiword</>
|
|
should be treated alike.
|
|
</para>
|
|
</note>
|
|
|
|
<para>
|
|
It is possible for the parser to produce overlapping tokens from the same
|
|
piece of text. As an example, a hyphenated word will be reported both
|
|
as the entire word and as each component:
|
|
|
|
<programlisting>
|
|
SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
|
|
alias | description | token
|
|
-----------------+------------------------------------------+---------------
|
|
numhword | Hyphenated word, letters and digits | foo-bar-beta1
|
|
hword_asciipart | Hyphenated word part, all ASCII | foo
|
|
blank | Space symbols | -
|
|
hword_asciipart | Hyphenated word part, all ASCII | bar
|
|
blank | Space symbols | -
|
|
hword_numpart | Hyphenated word part, letters and digits | beta1
|
|
</programlisting>
|
|
|
|
This behavior is desirable since it allows searches to work for both
|
|
the whole compound word and its components. Here is another
|
|
instructive example:
|
|
|
|
<programlisting>
|
|
SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
|
|
alias | description | token
|
|
----------+---------------+------------------------------
|
|
protocol | Protocol head | http://
|
|
url | URL | example.com/stuff/index.html
|
|
host | Host | example.com
|
|
url_path | URL path | /stuff/index.html
|
|
</programlisting>
|
|
</para>
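<para>
To see only the parser's output, without involving any dictionaries, the
functions <function>ts_parse</function> and <function>ts_token_type</function>
can be combined. The following sketch joins them on the token type number to
label each token produced for the hyphenated example string used above:

<programlisting>
-- ts_parse returns (tokid, token); ts_token_type maps tokid to an alias.
SELECT tt.alias, p.token
FROM ts_parse('default', 'foo-bar-beta1') AS p
     JOIN ts_token_type('default') AS tt ON tt.tokid = p.tokid;
</programlisting>
</para>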
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="textsearch-dictionaries">
|
|
<title>Dictionaries</title>
|
|
|
|
<para>
|
|
Dictionaries are used to eliminate words that should not be considered in a
|
|
search (<firstterm>stop words</>), and to <firstterm>normalize</> words so
|
|
that different derived forms of the same word will match. A successfully
|
|
normalized word is called a <firstterm>lexeme</>. Aside from
|
|
improving search quality, normalization and removal of stop words reduce the
|
|
size of the <type>tsvector</type> representation of a document, thereby
|
|
improving performance. Normalization does not always have linguistic meaning
|
|
and usually depends on application semantics.
|
|
</para>
|
|
|
|
<para>
|
|
Some examples of normalization:
|
|
|
|
<itemizedlist spacing="compact" mark="bullet">
|
|
|
|
<listitem>
|
|
<para>
|
|
Linguistic: Ispell dictionaries try to reduce input words to a
|
|
normalized form; stemmer dictionaries remove word endings
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<acronym>URL</acronym> locations can be canonicalized to make
|
|
equivalent URLs match:
|
|
|
|
<itemizedlist spacing="compact" mark="bullet">
|
|
<listitem>
|
|
<para>
|
|
http://www.pgsql.ru/db/mw/index.html
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
http://www.pgsql.ru/db/mw/
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
http://www.pgsql.ru/db/../db/mw/index.html
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
Color names can be replaced by their hexadecimal values, e.g.,
|
|
<literal>red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF</literal>
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
If indexing numbers, we can
|
|
remove some fractional digits to reduce the range of possible
|
|
numbers, so for example <emphasis>3.14</emphasis>159265359,
|
|
<emphasis>3.14</emphasis>15926, <emphasis>3.14</emphasis> will be the same
|
|
after normalization if only two digits are kept after the decimal point.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
</para>
|
|
|
|
<para>
|
|
A dictionary is a program that accepts a token as
|
|
input and returns:
|
|
<itemizedlist spacing="compact" mark="bullet">
|
|
<listitem>
|
|
<para>
|
|
an array of lexemes if the input token is known to the dictionary
|
|
(notice that one token can produce more than one lexeme)
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
an empty array if the dictionary knows the token, but it is a stop word
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<literal>NULL</literal> if the dictionary does not recognize the input token
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
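<para>
These cases can be observed directly with <function>ts_lexize</function>,
which passes a single token to a single dictionary. The sketch below uses the
built-in <literal>english_stem</literal> dictionary; the first call is
expected to yield a one-element array holding the stem, and the second an
empty array, since <literal>a</literal> is in the English stop-word list:

<programlisting>
-- A known word: returns an array of lexemes.
SELECT ts_lexize('english_stem', 'stars');

-- A stop word: returns an empty array.
SELECT ts_lexize('english_stem', 'a');
</programlisting>

The third case does not arise with <literal>english_stem</literal>, because a
<application>Snowball</application> stemmer recognizes every token.
</para>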
|
|
|
|
<para>
|
|
<productname>PostgreSQL</productname> provides predefined dictionaries for
|
|
many languages. There are also several predefined templates that can be
|
|
used to create new dictionaries with custom parameters. Each predefined
|
|
dictionary template is described below. If no existing
|
|
template is suitable, it is possible to create new ones; see the
|
|
<filename>contrib/</> area of the <productname>PostgreSQL</> distribution
|
|
for examples.
|
|
</para>
|
|
|
|
<para>
|
|
A text search configuration binds a parser together with a set of
|
|
dictionaries to process the parser's output tokens. For each token
|
|
type that the parser can return, a separate list of dictionaries is
|
|
specified by the configuration. When a token of that type is found
|
|
by the parser, each dictionary in the list is consulted in turn,
|
|
until some dictionary recognizes it as a known word. If it is identified
|
|
as a stop word, or if no dictionary recognizes the token, it will be
|
|
discarded and not indexed or searched for.
|
|
The general rule for configuring a list of dictionaries
|
|
is to place first the narrowest, most specific dictionary, then the more
|
|
general dictionaries, finishing with a very general dictionary, like
|
|
a <application>Snowball</> stemmer or <literal>simple</>, which
|
|
recognizes everything. For example, for an astronomy-specific search
|
|
(<literal>astro_en</literal> configuration) one could bind token type
|
|
<type>asciiword</type> (ASCII word) to a synonym dictionary of astronomical
|
|
terms, a general English dictionary and a <application>Snowball</> English
|
|
stemmer:
|
|
|
|
<programlisting>
|
|
ALTER TEXT SEARCH CONFIGURATION astro_en
|
|
ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
|
|
</programlisting>
|
|
</para>
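<para>
The <command>ADD MAPPING</command> command above presupposes that
<literal>astro_en</literal> exists and has no mapping yet for
<type>asciiword</type>. One way to arrive at that state, sketched here as an
assumption rather than a required procedure, is to create the configuration
from the default parser alone, so that it starts with no mappings at all:

<programlisting>
-- An empty configuration: a parser, but no token-type mappings yet.
CREATE TEXT SEARCH CONFIGURATION astro_en (
    PARSER = pg_catalog."default"
);
</programlisting>

A configuration created with <literal>COPY</literal> from an existing one
would already map <type>asciiword</type>, in which case
<command>ALTER MAPPING</command> would be used in place of
<command>ADD MAPPING</command>.
</para>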
|
|
|
|
<sect2 id="textsearch-stopwords">
|
|
<title>Stop Words</title>
|
|
|
|
<para>
|
|
Stop words are words that are very common, appear in almost every
|
|
document, and have no discrimination value. Therefore, they can be ignored
|
|
in the context of full text searching. For example, every English text
|
|
contains words like <literal>a</literal> and <literal>the</>, so it is
|
|
useless to store them in an index. However, stop words do affect the
|
|
positions in <type>tsvector</type>, which in turn affect ranking:
|
|
|
|
<programlisting>
|
|
SELECT to_tsvector('english','in the list of stop words');
|
|
to_tsvector
|
|
----------------------------
|
|
'list':3 'stop':5 'word':6
|
|
</programlisting>
|
|
|
|
Positions 1, 2, and 4 are missing because they held stop words. Ranks
|
|
calculated for documents with and without stop words are quite different:
|
|
|
|
<programlisting>
|
|
SELECT ts_rank_cd (to_tsvector('english','in the list of stop words'), to_tsquery('list &amp; stop'));
|
|
ts_rank_cd
|
|
------------
|
|
0.05
|
|
|
|
SELECT ts_rank_cd (to_tsvector('english','list stop words'), to_tsquery('list &amp; stop'));
|
|
ts_rank_cd
|
|
------------
|
|
0.1
|
|
</programlisting>
|
|
|
|
</para>
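<para>
For comparison, the built-in <literal>simple</literal> configuration applies
no stop-word list, so running the same sentence through it is a quick way to
see which tokens the <literal>english</literal> configuration dropped:

<programlisting>
-- With 'simple', every word is kept and positions 1 through 6 are all used.
SELECT to_tsvector('simple', 'in the list of stop words');
</programlisting>
</para>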
|
|
|
|
<para>
|
|
It is up to the specific dictionary how it treats stop words. For example,
|
|
<literal>ispell</literal> dictionaries first normalize words and then
|
|
look at the list of stop words, while <literal>Snowball</literal> stemmers
|
|
first check the list of stop words. The reason for the different
|
|
behavior is an attempt to decrease noise.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="textsearch-simple-dictionary">
|
|
<title>Simple Dictionary</title>
|
|
|
|
<para>
|
|
The <literal>simple</> dictionary template operates by converting the
|
|
input token to lower case and checking it against a file of stop words.
|
|
If it is found in the file then an empty array is returned, causing
|
|
the token to be discarded. If not, the lower-cased form of the word
|
|
is returned as the normalized lexeme. Alternatively, the dictionary
|
|
can be configured to report non-stop-words as unrecognized, allowing
|
|
them to be passed on to the next dictionary in the list.
|
|
</para>
|
|
|
|
<para>
|
|
Here is an example of a dictionary definition using the <literal>simple</>
|
|
template:
|
|
|
|
<programlisting>
|
|
CREATE TEXT SEARCH DICTIONARY public.simple_dict (
|
|
TEMPLATE = pg_catalog.simple,
|
|
STOPWORDS = english
|
|
);
|
|
</programlisting>
|
|
|
|
Here, <literal>english</literal> is the base name of a file of stop words.
|
|
The file's full name will be
|
|
<filename>$SHAREDIR/tsearch_data/english.stop</>,
|
|
where <literal>$SHAREDIR</> means the
|
|
<productname>PostgreSQL</productname> installation's shared-data directory,
|
|
often <filename>/usr/local/share/postgresql</> (use <command>pg_config
|
|
--sharedir</> to determine it if you're not sure).
|
|
The file format is simply a list
|
|
of words, one per line. Blank lines and trailing spaces are ignored,
|
|
and upper case is folded to lower case, but no other processing is done
|
|
on the file contents.
|
|
</para>
|
|
|
|
<para>
|
|
Now we can test our dictionary:
|
|
|
|
<programlisting>
|
|
SELECT ts_lexize('public.simple_dict','YeS');
|
|
ts_lexize
|
|
-----------
|
|
{yes}
|
|
|
|
SELECT ts_lexize('public.simple_dict','The');
|
|
ts_lexize
|
|
-----------
|
|
{}
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
We can also choose to return <literal>NULL</>, instead of the lower-cased
|
|
word, if it is not found in the stop words file. This behavior is
|
|
selected by setting the dictionary's <literal>Accept</> parameter to
|
|
<literal>false</>. Continuing the example:
|
|
|
|
<programlisting>
|
|
ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );
|
|
|
|
SELECT ts_lexize('public.simple_dict','YeS');
|
|
ts_lexize
|
|
-----------
|
|
|
|
|
|
SELECT ts_lexize('public.simple_dict','The');
|
|
ts_lexize
|
|
-----------
|
|
{}
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
With the default setting of <literal>Accept</> = <literal>true</>,
|
|
it is only useful to place a <literal>simple</> dictionary at the end
|
|
of a list of dictionaries, since it will never pass on any token to
|
|
a following dictionary. Conversely, <literal>Accept</> = <literal>false</>
|
|
is only useful when there is at least one following dictionary.
|
|
</para>
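<para>
As a sketch of the second arrangement, the mapping below places
<literal>public.simple_dict</literal> (with <literal>Accept</literal> =
<literal>false</literal>) ahead of the <literal>english_stem</literal>
dictionary, so that it acts purely as a stop-word filter. The configuration
name <literal>my_config</literal> is hypothetical and is assumed to have been
copied from <literal>english</literal>:

<programlisting>
-- simple_dict discards its stop words and reports everything else as
-- unrecognized, so those tokens fall through to english_stem.
ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR asciiword
    WITH public.simple_dict, english_stem;
</programlisting>
</para>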
|
|
|
|
<caution>
|
|
<para>
|
|
Most types of dictionaries rely on configuration files, such as files of
|
|
stop words. These files <emphasis>must</> be stored in UTF-8 encoding.
|
|
They will be translated to the actual database encoding, if that is
|
|
different, when they are read into the server.
|
|
</para>
|
|
</caution>
|
|
|
|
<caution>
|
|
<para>
|
|
Normally, a database session will read a dictionary configuration file
|
|
only once, when it is first used within the session. If you modify a
|
|
configuration file and want to force existing sessions to pick up the
|
|
new contents, issue an <command>ALTER TEXT SEARCH DICTIONARY</> command
|
|
on the dictionary. This can be a <quote>dummy</> update that doesn't
|
|
actually change any parameter values.
|
|
</para>
|
|
</caution>
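<para>
One way to issue such a dummy update, shown here for the
<literal>public.simple_dict</literal> dictionary defined earlier, is to set a
parameter to the value it already has:

<programlisting>
-- Re-specify the existing stop-word file; nothing changes, but sessions
-- will re-read the file the next time they use the dictionary.
ALTER TEXT SEARCH DICTIONARY public.simple_dict ( StopWords = english );
</programlisting>
</para>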
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="textsearch-synonym-dictionary">
|
|
<title>Synonym Dictionary</title>
|
|
|
|
<para>
|
|
This dictionary template is used to create dictionaries that replace a
|
|
word with a synonym. Phrases are not supported (use the thesaurus
|
|
template (<xref linkend="textsearch-thesaurus">) for that). A synonym
|
|
dictionary can be used to overcome linguistic problems, for example, to
|
|
prevent an English stemmer dictionary from reducing the word 'Paris' to
|
|
'pari'. It is enough to have a <literal>Paris paris</literal> line in the
|
|
synonym dictionary and put it before the <literal>english_stem</>
|
|
dictionary. For example:
|
|
|
|
<programlisting>
|
|
SELECT * FROM ts_debug('english', 'Paris');
|
|
alias | description | token | dictionaries | dictionary | lexemes
|
|
-----------+-----------------+-------+----------------+--------------+---------
|
|
asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari}
|
|
|
|
CREATE TEXT SEARCH DICTIONARY my_synonym (
|
|
TEMPLATE = synonym,
|
|
SYNONYMS = my_synonyms
|
|
);
|
|
|
|
ALTER TEXT SEARCH CONFIGURATION english
|
|
ALTER MAPPING FOR asciiword
|
|
WITH my_synonym, english_stem;
|
|
|
|
SELECT * FROM ts_debug('english', 'Paris');
|
|
alias | description | token | dictionaries | dictionary | lexemes
|
|
-----------+-----------------+-------+---------------------------+------------+---------
|
|
asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
An asterisk (<literal>*</literal>) placed at the end of a synonym in the
configuration file marks that synonym as a prefix. The asterisk is honored
when the entry is used in <function>to_tsquery()</function>, which turns the
result into a prefix-matching query item (see
<xref linkend="textsearch-parsing-queries">), and is ignored when the entry
is used in <function>to_tsvector()</function>.
|
|
</para>
|
|
|
|
<para>
|
|
Contents of <filename>$SHAREDIR/tsearch_data/synonym_sample.syn</>:
|
|
</para>
|
|
<programlisting>
|
|
postgres pgsql
|
|
postgresql pgsql
|
|
postgre pgsql
|
|
gogle googl
|
|
indices index*
|
|
</programlisting>
|
|
|
|
<para>
|
|
Results:
|
|
</para>
|
|
<programlisting>
|
|
=# create text search dictionary syn( template=synonym,synonyms='synonym_sample');
|
|
=# select ts_lexize('syn','indices');
|
|
ts_lexize
|
|
-----------
|
|
{index}
|
|
(1 row)
|
|
|
|
=# create text search configuration tst ( copy=simple);
|
|
=# alter text search configuration tst alter mapping for asciiword with syn;
|
|
=# select to_tsquery('tst','indices');
|
|
to_tsquery
|
|
------------
|
|
'index':*
|
|
(1 row)
|
|
|
|
=# select 'indexes are very useful'::tsvector;
|
|
tsvector
|
|
---------------------------------
|
|
'are' 'indexes' 'useful' 'very'
|
|
(1 row)
|
|
|
|
=# select 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
|
|
?column?
|
|
----------
|
|
t
|
|
(1 row)
|
|
|
|
=# select to_tsvector('tst','indices');
|
|
to_tsvector
|
|
-------------
|
|
'index':1
|
|
(1 row)
|
|
</programlisting>
|
|
|
|
<para>
|
|
The only parameter required by the <literal>synonym</> template is
|
|
<literal>SYNONYMS</>, which is the base name of its configuration file
|
|
— <literal>my_synonyms</> in the above example.
|
|
The file's full name will be
|
|
<filename>$SHAREDIR/tsearch_data/my_synonyms.syn</>
|
|
(where <literal>$SHAREDIR</> means the
|
|
<productname>PostgreSQL</> installation's shared-data directory).
|
|
The file format is just one line
|
|
per word to be substituted, with the word followed by its synonym,
|
|
separated by white space. Blank lines and trailing spaces are ignored.
|
|
</para>
|
|
|
|
<para>
|
|
The <literal>synonym</> template also has an optional parameter
|
|
<literal>CaseSensitive</>, which defaults to <literal>false</>. When
|
|
<literal>CaseSensitive</> is <literal>false</>, words in the synonym file
|
|
are folded to lower case, as are input tokens. When it is
|
|
<literal>true</>, words and tokens are not folded to lower case,
|
|
but are compared as-is.
|
|
</para>
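<para>
A case-sensitive variant of the earlier dictionary could be declared as
follows. The name <literal>my_synonym_cs</literal> is made up for
illustration, and the definition assumes the same
<filename>my_synonyms.syn</filename> file:

<programlisting>
CREATE TEXT SEARCH DICTIONARY my_synonym_cs (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms,
    CaseSensitive = true    -- compare file entries and tokens as-is
);
</programlisting>
</para>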
|
|
</sect2>
|
|
|
|
<sect2 id="textsearch-thesaurus">
|
|
<title>Thesaurus Dictionary</title>
|
|
|
|
<para>
|
|
A thesaurus dictionary (sometimes abbreviated as <acronym>TZ</acronym>) is
|
|
a collection of words that includes information about the relationships
|
|
of words and phrases, i.e., broader terms (<acronym>BT</acronym>), narrower
|
|
terms (<acronym>NT</acronym>), preferred terms, non-preferred terms, related
|
|
terms, etc.
|
|
</para>
|
|
|
|
<para>
|
|
Basically a thesaurus dictionary replaces all non-preferred terms by one
|
|
preferred term and, optionally, preserves the original terms for indexing
|
|
as well. <productname>PostgreSQL</>'s current implementation of the
|
|
thesaurus dictionary is an extension of the synonym dictionary with added
|
|
<firstterm>phrase</firstterm> support. A thesaurus dictionary requires
|
|
a configuration file of the following format:
|
|
|
|
<programlisting>
|
|
# this is a comment
|
|
sample word(s) : indexed word(s)
|
|
more sample word(s) : more indexed word(s)
|
|
...
|
|
</programlisting>
|
|
|
|
where the colon (<symbol>:</symbol>) symbol acts as a delimiter between a
|
|
phrase and its replacement.
|
|
</para>
|
|
|
|
<para>
|
|
A thesaurus dictionary uses a <firstterm>subdictionary</firstterm> (which
|
|
is specified in the dictionary's configuration) to normalize the input
|
|
text before checking for phrase matches. It is only possible to select one
|
|
subdictionary. An error is reported if the subdictionary fails to
|
|
recognize a word. In that case, you should remove the use of the word or
|
|
teach the subdictionary about it. You can place an asterisk
|
|
(<symbol>*</symbol>) at the beginning of an indexed word to skip applying
|
|
the subdictionary to it, but all sample words <emphasis>must</> be known
|
|
to the subdictionary.
|
|
</para>
|
|
|
|
<para>
|
|
The thesaurus dictionary chooses the longest match if there are multiple
|
|
phrases matching the input, and ties are broken by using the last
|
|
definition.
|
|
</para>
|
|
|
|
<para>
|
|
Specific stop words recognized by the subdictionary cannot be
|
|
specified; instead use <literal>?</> to mark the location where any
|
|
stop word can appear. For example, assuming that <literal>a</> and
|
|
<literal>the</> are stop words according to the subdictionary:
|
|
|
|
<programlisting>
|
|
? one ? two : swsw
|
|
</programlisting>
|
|
|
|
matches <literal>a one the two</> and <literal>the one a two</>;
|
|
both would be replaced by <literal>swsw</>.
|
|
</para>
|
|
|
|
<para>
|
|
Since a thesaurus dictionary can recognize phrases, it must remember its
state and interact with the parser. It uses the token-type assignments in
the text search configuration to decide whether it should handle the next
word or stop accumulating. The thesaurus dictionary must be configured
|
|
carefully. For example, if the thesaurus dictionary is assigned to handle
|
|
only the <literal>asciiword</literal> token, then a thesaurus dictionary
|
|
definition like <literal>one 7</> will not work since token type
|
|
<literal>uint</literal> is not assigned to the thesaurus dictionary.
|
|
</para>
|
|
|
|
<caution>
|
|
<para>
|
|
Thesauruses are used during indexing, so any change in the thesaurus
dictionary's parameters <emphasis>requires</emphasis> reindexing.
For most other dictionary types, small changes such as adding or
removing stop words do not force reindexing.
|
|
</para>
|
|
</caution>
|
|
|
|
<sect3 id="textsearch-thesaurus-config">
|
|
<title>Thesaurus Configuration</title>
|
|
|
|
<para>
|
|
To define a new thesaurus dictionary, use the <literal>thesaurus</>
|
|
template. For example:
|
|
|
|
<programlisting>
|
|
CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
|
|
TEMPLATE = thesaurus,
|
|
DictFile = mythesaurus,
|
|
Dictionary = pg_catalog.english_stem
|
|
);
|
|
</programlisting>
|
|
|
|
Here:
|
|
<itemizedlist spacing="compact" mark="bullet">
|
|
<listitem>
|
|
<para>
|
|
<literal>thesaurus_simple</literal> is the new dictionary's name
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<literal>mythesaurus</literal> is the base name of the thesaurus
|
|
configuration file.
|
|
(Its full name will be <filename>$SHAREDIR/tsearch_data/mythesaurus.ths</>,
|
|
where <literal>$SHAREDIR</> means the installation shared-data
|
|
directory.)
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<literal>pg_catalog.english_stem</literal> is the subdictionary (here,
|
|
a Snowball English stemmer) to use for thesaurus normalization.
|
|
Notice that the subdictionary will have its own
|
|
configuration (for example, stop words), which is not shown here.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
Now it is possible to bind the thesaurus dictionary <literal>thesaurus_simple</literal>
|
|
to the desired token types in a configuration, for example:
|
|
|
|
<programlisting>
|
|
ALTER TEXT SEARCH CONFIGURATION russian
|
|
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
|
|
WITH thesaurus_simple;
|
|
</programlisting>
|
|
</para>
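<para>
If the thesaurus file contains entries that mix token types, such as the
<literal>one 7</literal> definition mentioned earlier, the dictionary has to
be bound to every token type involved. The sketch below assumes a
hypothetical configuration <literal>my_config</literal> copied from
<literal>english</literal>, and keeps a fallback dictionary after the
thesaurus for tokens that match no phrase:

<programlisting>
-- uint is included so that the "7" in "one 7" reaches the thesaurus.
ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR asciiword, uint
    WITH thesaurus_simple, english_stem;
</programlisting>
</para>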
|
|
|
|
</sect3>
|
|
|
|
<sect3 id="textsearch-thesaurus-examples">
|
|
<title>Thesaurus Example</title>
|
|
|
|
<para>
|
|
Consider a simple astronomical thesaurus <literal>thesaurus_astro</literal>,
|
|
which contains some astronomical word combinations:
|
|
|
|
<programlisting>
|
|
supernovae stars : sn
|
|
crab nebulae : crab
|
|
</programlisting>
|
|
|
|
Below we create a dictionary and bind some token types to
|
|
an astronomical thesaurus and English stemmer:
|
|
|
|
<programlisting>
|
|
CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
|
|
TEMPLATE = thesaurus,
|
|
DictFile = thesaurus_astro,
|
|
Dictionary = english_stem
|
|
);
|
|
|
|
ALTER TEXT SEARCH CONFIGURATION russian
|
|
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
|
|
WITH thesaurus_astro, english_stem;
|
|
</programlisting>
|
|
|
|
Now we can see how it works.
|
|
<function>ts_lexize</function> is not very useful for testing a thesaurus,
|
|
because it treats its input as a single token. Instead we can use
|
|
<function>plainto_tsquery</function> and <function>to_tsvector</function>
|
|
which will break their input strings into multiple tokens:
|
|
|
|
<programlisting>
|
|
SELECT plainto_tsquery('supernova star');
|
|
plainto_tsquery
|
|
-----------------
|
|
'sn'
|
|
|
|
SELECT to_tsvector('supernova star');
|
|
to_tsvector
|
|
-------------
|
|
'sn':1
|
|
</programlisting>
|
|
|
|
In principle, you can use <function>to_tsquery</function> if you quote
|
|
the argument:
|
|
|
|
<programlisting>
|
|
SELECT to_tsquery('''supernova star''');
|
|
to_tsquery
|
|
------------
|
|
'sn'
|
|
</programlisting>
|
|
|
|
Notice that <literal>supernova star</literal> matches <literal>supernovae
|
|
stars</literal> in <literal>thesaurus_astro</literal> because we specified
|
|
the <literal>english_stem</literal> stemmer in the thesaurus definition.
|
|
The stemmer removed the <literal>e</> and <literal>s</>.
|
|
</para>
|
|
|
|
<para>
|
|
To index the original phrase as well as the substitute, just include it
|
|
in the right-hand part of the definition:
|
|
|
|
<programlisting>
|
|
supernovae stars : sn supernovae stars
|
|
|
|
SELECT plainto_tsquery('supernova star');
|
|
plainto_tsquery
|
|
-----------------------------
|
|
'sn' &amp; 'supernova' &amp; 'star'
|
|
</programlisting>
|
|
</para>
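<para>
Putting the pieces together, a document and a query that use different forms
of the phrase should both reduce to the same lexeme and therefore match. The
sketch below names the <literal>russian</literal> configuration explicitly
(the examples above rely on it being the default) and assumes the original
one-word replacement <literal>supernovae stars : sn</literal> is in effect;
the sample sentence is made up:

<programlisting>
-- Both sides are normalized through thesaurus_astro before matching.
SELECT to_tsvector('russian', 'Astronomers observed supernovae stars')
       @@ plainto_tsquery('russian', 'supernova star');
</programlisting>
</para>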
|
|
|
|
</sect3>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="textsearch-ispell-dictionary">
|
|
<title><application>Ispell</> Dictionary</title>
|
|
|
|
<para>
|
|
The <application>Ispell</> dictionary template supports
|
|
<firstterm>morphological dictionaries</>, which can normalize many
|
|
different linguistic forms of a word into the same lexeme. For example,
|
|
an English <application>Ispell</> dictionary can match all declensions and
|
|
conjugations of the search term <literal>bank</literal>, e.g.,
|
|
<literal>banking</>, <literal>banked</>, <literal>banks</>,
|
|
<literal>banks'</>, and <literal>bank's</>.
|
|
</para>
|
|
|
|
<para>
|
|
The standard <productname>PostgreSQL</productname> distribution does
|
|
not include any <application>Ispell</> configuration files.
|
|
Dictionaries for a large number of languages are available from <ulink
|
|
url="http://ficus-www.cs.ucla.edu/geoff/ispell.html">Ispell</ulink>.
|
|
Also, some more modern dictionary file formats are supported — <ulink
|
|
url="http://en.wikipedia.org/wiki/MySpell">MySpell</ulink> (OO &lt; 2.0.1)
|
|
and <ulink url="http://sourceforge.net/projects/hunspell">Hunspell</ulink>
|
|
(OO >= 2.0.2). A large list of dictionaries is available on the <ulink
|
|
url="http://wiki.services.openoffice.org/wiki/Dictionaries">OpenOffice
|
|
Wiki</ulink>.
|
|
</para>
|
|
|
|
<para>
|
|
To create an <application>Ispell</> dictionary, use the built-in
|
|
<literal>ispell</literal> template and specify several parameters:
|
|
</para>
|
|
|
|
<programlisting>
|
|
CREATE TEXT SEARCH DICTIONARY english_ispell (
|
|
TEMPLATE = ispell,
|
|
DictFile = english,
|
|
AffFile = english,
|
|
StopWords = english
|
|
);
|
|
</programlisting>
|
|
|
|
<para>
|
|
Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
|
|
specify the base names of the dictionary, affixes, and stop-words files.
|
|
The stop-words file has the same format explained above for the
|
|
<literal>simple</> dictionary type. The format of the other files is
|
|
not specified here but is available from the above-mentioned web sites.
|
|
</para>
|
|
|
|
<para>
|
|
Ispell dictionaries usually recognize a limited set of words, so they
|
|
should be followed by another broader dictionary; for
|
|
example, a Snowball dictionary, which recognizes everything.
|
|
</para>
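<para>
Such a chain might be set up as follows. The configuration name
<literal>my_config</literal> is hypothetical (assumed to be a copy of
<literal>english</literal>); <literal>english_ispell</literal> is the
dictionary created above:

<programlisting>
-- Try Ispell first; words it does not know fall through to the stemmer.
ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH english_ispell, english_stem;
</programlisting>
</para>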
|
|
|
|
<para>
|
|
Ispell dictionaries support splitting compound words,
|
|
a useful feature.
|
|
Notice that the affix file should specify a special flag using the
|
|
<literal>compoundwords controlled</literal> statement that marks dictionary
|
|
words that can participate in compound formation:
|
|
|
|
<programlisting>
|
|
compoundwords controlled z
|
|
</programlisting>
|
|
|
|
Here are some examples for the Norwegian language:
|
|
|
|
<programlisting>
|
|
SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
|
|
{over,buljong,terning,pakk,mester,assistent}
|
|
SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
|
|
{sjokoladefabrikk,sjokolade,fabrikk}
|
|
</programlisting>
|
|
</para>
|
|
|
|
<note>
|
|
<para>
|
|
<application>MySpell</> does not support compound words.
|
|
<application>Hunspell</> has sophisticated support for compound words. At
|
|
present, <productname>PostgreSQL</productname> implements only the basic
|
|
compound word operations of Hunspell.
|
|
</para>
|
|
</note>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="textsearch-snowball-dictionary">
|
|
<title><application>Snowball</> Dictionary</title>
|
|
|
|
<para>
|
|
The <application>Snowball</> dictionary template is based on a project
|
|
by Martin Porter, inventor of the popular Porter's stemming algorithm
|
|
for the English language. Snowball now provides stemming algorithms for
|
|
many languages (see the <ulink url="http://snowball.tartarus.org">Snowball
|
|
site</ulink> for more information). Each algorithm understands how to
|
|
reduce common variant forms of words to a base, or stem, spelling within
|
|
its language. A Snowball dictionary requires a <literal>language</>
|
|
parameter to identify which stemmer to use, and optionally can specify a
|
|
<literal>stopword</> file name that gives a list of words to eliminate.
|
|
(<productname>PostgreSQL</productname>'s standard stopword lists are also
|
|
provided by the Snowball project.)
|
|
For example, there is a built-in definition equivalent to
|
|
|
|
<programlisting>
|
|
CREATE TEXT SEARCH DICTIONARY english_stem (
|
|
TEMPLATE = snowball,
|
|
Language = english,
|
|
StopWords = english
|
|
);
|
|
</programlisting>
|
|
|
|
The stopword file format is the same as already explained.
|
|
</para>
|
|
|
|
<para>
|
|
A <application>Snowball</> dictionary recognizes everything, whether
|
|
or not it is able to simplify the word, so it should be placed
|
|
at the end of the dictionary list. It is useless to have it
|
|
before any other dictionary because a token will never pass through it to
|
|
the next dictionary.
|
|
</para>
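
   <para>
    Because the stop word list is optional, a Snowball dictionary that keeps
    every word can be sketched by leaving out <literal>StopWords</literal>.
    The dictionary name <literal>english_stem_nostop</literal> is an
    assumption made for this example:

<programlisting>
-- english_stem_nostop is a hypothetical name
CREATE TEXT SEARCH DICTIONARY english_stem_nostop (
    TEMPLATE = snowball,
    Language = english
);

-- With no stop word list, a common word such as 'the' is expected to be
-- returned as a lexeme rather than as an empty array
SELECT ts_lexize('english_stem_nostop', 'the');
</programlisting>

    Such a dictionary can be useful when stop words must remain searchable,
    at the cost of a larger index.
   </para>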
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="textsearch-configuration">
|
|
<title>Configuration Example</title>
|
|
|
|
<para>
|
|
A text search configuration specifies all options necessary to transform a
|
|
document into a <type>tsvector</type>: the parser to use to break text
|
|
into tokens, and the dictionaries to use to transform each token into a
|
|
lexeme. Every call of
|
|
<function>to_tsvector</function> or <function>to_tsquery</function>
|
|
needs a text search configuration to perform its processing.
|
|
The configuration parameter
|
|
<xref linkend="guc-default-text-search-config">
|
|
specifies the name of the default configuration, which is the
|
|
one used by text search functions if an explicit configuration
|
|
parameter is omitted.
|
|
It can be set in <filename>postgresql.conf</filename>, or set for an
|
|
individual session using the <command>SET</> command.
|
|
</para>
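
  <para>
   The difference between relying on the default and naming a configuration
   explicitly can be sketched as follows.  The database name
   <literal>mydb</literal> is a placeholder:

<programlisting>
-- Set the default for the current session only
SET default_text_search_config = 'pg_catalog.english';

-- Uses whatever default_text_search_config is currently in effect
SELECT to_tsvector('a fat cat sat on a mat');

-- Names the configuration explicitly, independent of the setting
SELECT to_tsvector('english', 'a fat cat sat on a mat');

-- Make the setting persistent for one database (mydb is a placeholder)
ALTER DATABASE mydb SET default_text_search_config = 'pg_catalog.english';
</programlisting>
  </para>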
|
|
|
|
<para>
|
|
Several predefined text search configurations are available, and
|
|
you can create custom configurations easily. To facilitate management
|
|
of text search objects, a set of <acronym>SQL</acronym> commands
|
|
is available, and there are several <application>psql</application> commands that display information
|
|
about text search objects (<xref linkend="textsearch-psql">).
|
|
</para>
|
|
|
|
<para>
|
|
As an example we will create a configuration
|
|
<literal>pg</literal>, starting by duplicating the built-in
|
|
<literal>english</> configuration:
|
|
|
|
<programlisting>
|
|
CREATE TEXT SEARCH CONFIGURATION public.pg ( COPY = pg_catalog.english );
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
We will use a PostgreSQL-specific synonym list
|
|
and store it in <filename>$SHAREDIR/tsearch_data/pg_dict.syn</filename>.
|
|
The file contents look like:
|
|
|
|
<programlisting>
|
|
postgres pg
|
|
pgsql pg
|
|
postgresql pg
|
|
</programlisting>
|
|
|
|
We define the synonym dictionary like this:
|
|
|
|
<programlisting>
|
|
CREATE TEXT SEARCH DICTIONARY pg_dict (
|
|
TEMPLATE = synonym,
|
|
SYNONYMS = pg_dict
|
|
);
|
|
</programlisting>
|
|
|
|
   Next we register the <application>Ispell</> dictionary
|
|
<literal>english_ispell</literal>, which has its own configuration files:
|
|
|
|
<programlisting>
|
|
CREATE TEXT SEARCH DICTIONARY english_ispell (
|
|
TEMPLATE = ispell,
|
|
DictFile = english,
|
|
AffFile = english,
|
|
StopWords = english
|
|
);
|
|
</programlisting>
|
|
|
|
Now we can set up the mappings for words in configuration
|
|
<literal>pg</>:
|
|
|
|
<programlisting>
|
|
ALTER TEXT SEARCH CONFIGURATION pg
|
|
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
|
|
word, hword, hword_part
|
|
WITH pg_dict, english_ispell, english_stem;
|
|
</programlisting>
|
|
|
|
We choose not to index or search some token types that the built-in
|
|
configuration does handle:
|
|
|
|
<programlisting>
|
|
ALTER TEXT SEARCH CONFIGURATION pg
|
|
DROP MAPPING FOR email, url, url_path, sfloat, float;
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
Now we can test our configuration:
|
|
|
|
<programlisting>
|
|
SELECT * FROM ts_debug('public.pg', '
|
|
PostgreSQL, the highly scalable, SQL compliant, open source object-relational
|
|
database management system, is now undergoing beta testing of the next
|
|
version of our software.
|
|
');
|
|
</programlisting>
|
|
</para>
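
  <para>
   Once the <function>ts_debug</function> output looks right, the same
   configuration can be passed to <function>to_tsvector</function> and
   <function>to_tsquery</function>.  The following is only a sketch; it
   assumes the <literal>pg_dict</literal> synonym file shown above is
   installed, in which case both sides should normalize to the lexeme
   <literal>pg</literal>:

<programlisting>
-- Document and query are processed with the same configuration,
-- so 'PostgreSQL' and 'pgsql' should both pass through pg_dict
SELECT to_tsvector('public.pg', 'PostgreSQL is undergoing beta testing')
       @@ to_tsquery('public.pg', 'pgsql') AS matches;
</programlisting>
  </para>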
|
|
|
|
<para>
|
|
The next step is to set the session to use the new configuration, which was
|
|
created in the <literal>public</> schema:
|
|
|
|
<programlisting>
|
|
=> \dF
|
|
List of text search configurations
|
|
Schema | Name | Description
|
|
---------+------+-------------
|
|
public | pg |
|
|
|
|
SET default_text_search_config = 'public.pg';
|
|
SET
|
|
|
|
SHOW default_text_search_config;
|
|
default_text_search_config
|
|
----------------------------
|
|
public.pg
|
|
</programlisting>
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="textsearch-debugging">
|
|
<title>Testing and Debugging Text Search</title>
|
|
|
|
<para>
|
|
The behavior of a custom text search configuration can easily become
|
|
confusing. The functions described
|
|
in this section are useful for testing text search objects. You can
|
|
test a complete configuration, or test parsers and dictionaries separately.
|
|
</para>
|
|
|
|
<sect2 id="textsearch-configuration-testing">
|
|
<title>Configuration Testing</title>
|
|
|
|
<para>
|
|
The function <function>ts_debug</function> allows easy testing of a
|
|
text search configuration.
|
|
</para>
|
|
|
|
<indexterm>
|
|
<primary>ts_debug</primary>
|
|
</indexterm>
|
|
|
|
<synopsis>
|
|
ts_debug(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">document</replaceable> <type>text</>,
|
|
OUT <replaceable class="PARAMETER">alias</> <type>text</>,
|
|
OUT <replaceable class="PARAMETER">description</> <type>text</>,
|
|
OUT <replaceable class="PARAMETER">token</> <type>text</>,
|
|
OUT <replaceable class="PARAMETER">dictionaries</> <type>regdictionary[]</>,
|
|
OUT <replaceable class="PARAMETER">dictionary</> <type>regdictionary</>,
|
|
OUT <replaceable class="PARAMETER">lexemes</> <type>text[]</>)
|
|
returns setof record
|
|
</synopsis>
|
|
|
|
<para>
|
|
<function>ts_debug</> displays information about every token of
|
|
<replaceable class="PARAMETER">document</replaceable> as produced by the
|
|
parser and processed by the configured dictionaries. It uses the
|
|
configuration specified by <replaceable
|
|
class="PARAMETER">config</replaceable>,
|
|
or <varname>default_text_search_config</varname> if that argument is
|
|
omitted.
|
|
</para>
|
|
|
|
<para>
|
|
<function>ts_debug</> returns one row for each token identified in the text
|
|
by the parser. The columns returned are
|
|
|
|
<itemizedlist spacing="compact" mark="bullet">
|
|
<listitem>
|
|
<para>
|
|
<replaceable>alias</> <type>text</> — short name of the token type
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<replaceable>description</> <type>text</> — description of the
|
|
token type
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<replaceable>token</> <type>text</> — text of the token
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<replaceable>dictionaries</> <type>regdictionary[]</> — the
|
|
dictionaries selected by the configuration for this token type
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<replaceable>dictionary</> <type>regdictionary</> — the dictionary
|
|
that recognized the token, or <literal>NULL</> if none did
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<replaceable>lexemes</> <type>text[]</> — the lexeme(s) produced
|
|
by the dictionary that recognized the token, or <literal>NULL</> if
|
|
none did; an empty array (<literal>{}</>) means it was recognized as a
|
|
stop word
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>
|
|
Here is a simple example:
|
|
|
|
<programlisting>
|
|
SELECT * FROM ts_debug('english','a fat cat sat on a mat - it ate a fat rats');
|
|
alias | description | token | dictionaries | dictionary | lexemes
|
|
-----------+-----------------+-------+----------------+--------------+---------
|
|
asciiword | Word, all ASCII | a | {english_stem} | english_stem | {}
|
|
blank | Space symbols | | {} | |
|
|
asciiword | Word, all ASCII | fat | {english_stem} | english_stem | {fat}
|
|
blank | Space symbols | | {} | |
|
|
asciiword | Word, all ASCII | cat | {english_stem} | english_stem | {cat}
|
|
blank | Space symbols | | {} | |
|
|
asciiword | Word, all ASCII | sat | {english_stem} | english_stem | {sat}
|
|
blank | Space symbols | | {} | |
|
|
asciiword | Word, all ASCII | on | {english_stem} | english_stem | {}
|
|
blank | Space symbols | | {} | |
|
|
asciiword | Word, all ASCII | a | {english_stem} | english_stem | {}
|
|
blank | Space symbols | | {} | |
|
|
asciiword | Word, all ASCII | mat | {english_stem} | english_stem | {mat}
|
|
blank | Space symbols | | {} | |
|
|
blank | Space symbols | - | {} | |
|
|
asciiword | Word, all ASCII | it | {english_stem} | english_stem | {}
|
|
blank | Space symbols | | {} | |
|
|
asciiword | Word, all ASCII | ate | {english_stem} | english_stem | {ate}
|
|
blank | Space symbols | | {} | |
|
|
asciiword | Word, all ASCII | a | {english_stem} | english_stem | {}
|
|
blank | Space symbols | | {} | |
|
|
asciiword | Word, all ASCII | fat | {english_stem} | english_stem | {fat}
|
|
blank | Space symbols | | {} | |
|
|
asciiword | Word, all ASCII | rats | {english_stem} | english_stem | {rat}
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
For a more extensive demonstration, we
|
|
first create a <literal>public.english</literal> configuration and
|
|
Ispell dictionary for the English language:
|
|
</para>
|
|
|
|
<programlisting>
|
|
CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );
|
|
|
|
CREATE TEXT SEARCH DICTIONARY english_ispell (
|
|
TEMPLATE = ispell,
|
|
DictFile = english,
|
|
AffFile = english,
|
|
StopWords = english
|
|
);
|
|
|
|
ALTER TEXT SEARCH CONFIGURATION public.english
|
|
ALTER MAPPING FOR asciiword WITH english_ispell, english_stem;
|
|
</programlisting>
|
|
|
|
<programlisting>
|
|
SELECT * FROM ts_debug('public.english','The Brightest supernovaes');
|
|
alias | description | token | dictionaries | dictionary | lexemes
|
|
-----------+-----------------+-------------+-------------------------------+----------------+-------------
|
|
asciiword | Word, all ASCII | The | {english_ispell,english_stem} | english_ispell | {}
|
|
blank | Space symbols | | {} | |
|
|
asciiword | Word, all ASCII | Brightest | {english_ispell,english_stem} | english_ispell | {bright}
|
|
blank | Space symbols | | {} | |
|
|
asciiword | Word, all ASCII | supernovaes | {english_ispell,english_stem} | english_stem | {supernova}
|
|
</programlisting>
|
|
|
|
<para>
|
|
In this example, the word <literal>Brightest</> was recognized by the
|
|
parser as an <literal>ASCII word</literal> (alias <literal>asciiword</literal>).
|
|
For this token type the dictionary list is
|
|
<literal>english_ispell</> and
|
|
<literal>english_stem</literal>. The word was recognized by
|
|
   <literal>english_ispell</literal>, which reduced it to the base form
|
|
<literal>bright</literal>. The word <literal>supernovaes</literal> is
|
|
unknown to the <literal>english_ispell</literal> dictionary so it
|
|
was passed to the next dictionary, and, fortunately, was recognized (in
|
|
fact, <literal>english_stem</literal> is a Snowball dictionary which
|
|
recognizes everything; that is why it was placed at the end of the
|
|
dictionary list).
|
|
</para>
|
|
|
|
<para>
|
|
The word <literal>The</literal> was recognized by the
|
|
<literal>english_ispell</literal> dictionary as a stop word (<xref
|
|
linkend="textsearch-stopwords">) and will not be indexed.
|
|
The spaces are discarded too, since the configuration provides no
|
|
dictionaries at all for them.
|
|
</para>
|
|
|
|
<para>
|
|
You can reduce the width of the output by explicitly specifying which columns
|
|
you want to see:
|
|
|
|
<programlisting>
|
|
SELECT alias, token, dictionary, lexemes
|
|
FROM ts_debug('public.english','The Brightest supernovaes');
|
|
alias | token | dictionary | lexemes
|
|
-----------+-------------+----------------+-------------
|
|
asciiword | The | english_ispell | {}
|
|
blank | | |
|
|
asciiword | Brightest | english_ispell | {bright}
|
|
blank | | |
|
|
asciiword | supernovaes | english_stem | {supernova}
|
|
</programlisting>
|
|
</para>
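
  <para>
   Since <function>ts_debug</function> is an ordinary set-returning function,
   its output can also be filtered.  As a sketch, the query below lists the
   non-blank tokens that no dictionary recognized, which are the tokens that
   would be dropped from the <type>tsvector</type>:

<programlisting>
-- dictionary is NULL when no configured dictionary recognized the token
-- (or when the token type has no dictionaries at all)
SELECT alias, token
FROM ts_debug('public.english', 'The Brightest supernovaes')
WHERE alias != 'blank'
  AND dictionary IS NULL;
</programlisting>
  </para>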
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="textsearch-parser-testing">
|
|
<title>Parser Testing</title>
|
|
|
|
<para>
|
|
The following functions allow direct testing of a text search parser.
|
|
</para>
|
|
|
|
<indexterm>
|
|
<primary>ts_parse</primary>
|
|
</indexterm>
|
|
|
|
<synopsis>
|
|
ts_parse(<replaceable class="PARAMETER">parser_name</replaceable> <type>text</>, <replaceable class="PARAMETER">document</replaceable> <type>text</>,
|
|
OUT <replaceable class="PARAMETER">tokid</> <type>integer</>, OUT <replaceable class="PARAMETER">token</> <type>text</>) returns <type>setof record</>
|
|
ts_parse(<replaceable class="PARAMETER">parser_oid</replaceable> <type>oid</>, <replaceable class="PARAMETER">document</replaceable> <type>text</>,
|
|
OUT <replaceable class="PARAMETER">tokid</> <type>integer</>, OUT <replaceable class="PARAMETER">token</> <type>text</>) returns <type>setof record</>
|
|
</synopsis>
|
|
|
|
<para>
|
|
<function>ts_parse</> parses the given <replaceable>document</replaceable>
|
|
and returns a series of records, one for each token produced by
|
|
parsing. Each record includes a <varname>tokid</varname> showing the
|
|
assigned token type and a <varname>token</varname> which is the text of the
|
|
token. For example:
|
|
|
|
<programlisting>
|
|
SELECT * FROM ts_parse('default', '123 - a number');
|
|
tokid | token
|
|
-------+--------
|
|
22 | 123
|
|
12 |
|
|
12 | -
|
|
1 | a
|
|
12 |
|
|
1 | number
|
|
</programlisting>
|
|
</para>
|
|
|
|
<indexterm>
|
|
<primary>ts_token_type</primary>
|
|
</indexterm>
|
|
|
|
<synopsis>
|
|
ts_token_type(<replaceable class="PARAMETER">parser_name</> <type>text</>, OUT <replaceable class="PARAMETER">tokid</> <type>integer</>,
|
|
OUT <replaceable class="PARAMETER">alias</> <type>text</>, OUT <replaceable class="PARAMETER">description</> <type>text</>) returns <type>setof record</>
|
|
ts_token_type(<replaceable class="PARAMETER">parser_oid</> <type>oid</>, OUT <replaceable class="PARAMETER">tokid</> <type>integer</>,
|
|
OUT <replaceable class="PARAMETER">alias</> <type>text</>, OUT <replaceable class="PARAMETER">description</> <type>text</>) returns <type>setof record</>
|
|
</synopsis>
|
|
|
|
<para>
|
|
<function>ts_token_type</> returns a table which describes each type of
|
|
token the specified parser can recognize. For each token type, the table
|
|
gives the integer <varname>tokid</varname> that the parser uses to label a
|
|
token of that type, the <varname>alias</varname> that names the token type
|
|
in configuration commands, and a short <varname>description</varname>. For
|
|
example:
|
|
|
|
<programlisting>
|
|
SELECT * FROM ts_token_type('default');
|
|
tokid | alias | description
|
|
-------+-----------------+------------------------------------------
|
|
1 | asciiword | Word, all ASCII
|
|
2 | word | Word, all letters
|
|
3 | numword | Word, letters and digits
|
|
4 | email | Email address
|
|
5 | url | URL
|
|
6 | host | Host
|
|
7 | sfloat | Scientific notation
|
|
8 | version | Version number
|
|
9 | hword_numpart | Hyphenated word part, letters and digits
|
|
10 | hword_part | Hyphenated word part, all letters
|
|
11 | hword_asciipart | Hyphenated word part, all ASCII
|
|
12 | blank | Space symbols
|
|
13 | tag | XML tag
|
|
14 | protocol | Protocol head
|
|
15 | numhword | Hyphenated word, letters and digits
|
|
16 | asciihword | Hyphenated word, all ASCII
|
|
17 | hword | Hyphenated word, all letters
|
|
18 | url_path | URL path
|
|
19 | file | File or path name
|
|
20 | float | Decimal notation
|
|
21 | int | Signed integer
|
|
22 | uint | Unsigned integer
|
|
23 | entity | XML entity
|
|
</programlisting>
|
|
</para>
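
  <para>
   The two functions can be combined to show token type names next to the
   parser output instead of bare numbers.  A minimal sketch, joining on the
   shared <varname>tokid</varname> column:

<programlisting>
-- Label each parsed token with the alias of its token type
SELECT p.tokid, t.alias, p.token
FROM ts_parse('default', '123 - a number') AS p
     JOIN ts_token_type('default') AS t USING (tokid);
</programlisting>
  </para>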
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="textsearch-dictionary-testing">
|
|
<title>Dictionary Testing</title>
|
|
|
|
<para>
|
|
The <function>ts_lexize</> function facilitates dictionary testing.
|
|
</para>
|
|
|
|
<indexterm>
|
|
<primary>ts_lexize</primary>
|
|
</indexterm>
|
|
|
|
<synopsis>
|
|
ts_lexize(<replaceable class="PARAMETER">dict</replaceable> <type>regdictionary</>, <replaceable class="PARAMETER">token</replaceable> <type>text</>) returns <type>text[]</>
|
|
</synopsis>
|
|
|
|
<para>
|
|
<function>ts_lexize</> returns an array of lexemes if the input
|
|
<replaceable>token</replaceable> is known to the dictionary,
|
|
or an empty array if the token
|
|
   is known to the dictionary but is a stop word, or
|
|
<literal>NULL</literal> if it is an unknown word.
|
|
</para>
|
|
|
|
<para>
|
|
Examples:
|
|
|
|
<programlisting>
|
|
SELECT ts_lexize('english_stem', 'stars');
|
|
ts_lexize
|
|
-----------
|
|
{star}
|
|
|
|
SELECT ts_lexize('english_stem', 'a');
|
|
ts_lexize
|
|
-----------
|
|
{}
|
|
</programlisting>
|
|
</para>
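
  <para>
   The three possible outcomes can be told apart in a single query.  The
   following sketch classifies a few sample words; the labels in the
   <literal>CASE</literal> expression are arbitrary:

<programlisting>
SELECT word,
       CASE
           WHEN ts_lexize('english_stem', word) IS NULL THEN 'unknown'
           WHEN ts_lexize('english_stem', word) = '{}'::text[] THEN 'stop word'
           ELSE 'recognized'
       END AS outcome
FROM (VALUES ('stars'), ('a')) AS w(word);
</programlisting>

   Note that a Snowball dictionary such as <literal>english_stem</literal>
   recognizes everything, so the <literal>unknown</literal> branch is only
   reachable with other dictionary types.
  </para>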
|
|
|
|
<note>
|
|
<para>
|
|
The <function>ts_lexize</function> function expects a single
|
|
<emphasis>token</emphasis>, not text. Here is a case
|
|
where this can be confusing:
|
|
|
|
<programlisting>
|
|
SELECT ts_lexize('thesaurus_astro','supernovae stars') is null;
|
|
?column?
|
|
----------
|
|
t
|
|
</programlisting>
|
|
|
|
The thesaurus dictionary <literal>thesaurus_astro</literal> does know the
|
|
phrase <literal>supernovae stars</literal>, but <function>ts_lexize</>
|
|
fails since it does not parse the input text but treats it as a single
|
|
token. Use <function>plainto_tsquery</> or <function>to_tsvector</> to
|
|
test thesaurus dictionaries, for example:
|
|
|
|
<programlisting>
|
|
SELECT plainto_tsquery('supernovae stars');
|
|
plainto_tsquery
|
|
-----------------
|
|
'sn'
|
|
</programlisting>
|
|
</para>
|
|
</note>
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="textsearch-indexes">
|
|
<title>GiST and GIN Index Types</title>
|
|
|
|
<indexterm zone="textsearch-indexes">
|
|
<primary>text search</primary>
|
|
<secondary>indexes</secondary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
There are two kinds of indexes that can be used to speed up full text
|
|
searches.
|
|
Note that indexes are not mandatory for full text searching, but in
|
|
cases where a column is searched on a regular basis, an index is
|
|
usually desirable.
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
|
|
<indexterm zone="textsearch-indexes">
|
|
<primary>index</primary>
|
|
<secondary>GiST</secondary>
|
|
<tertiary>text search</tertiary>
|
|
</indexterm>
|
|
|
|
<term>
|
|
<synopsis>
|
|
CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING gist(<replaceable>column</replaceable>);
|
|
</synopsis>
|
|
</term>
|
|
|
|
<listitem>
|
|
<para>
|
|
Creates a GiST (Generalized Search Tree)-based index.
|
|
The <replaceable>column</replaceable> can be of <type>tsvector</> or
|
|
<type>tsquery</> type.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
|
|
<indexterm zone="textsearch-indexes">
|
|
<primary>index</primary>
|
|
<secondary>GIN</secondary>
|
|
<tertiary>text search</tertiary>
|
|
</indexterm>
|
|
|
|
<term>
|
|
<synopsis>
|
|
CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING gin(<replaceable>column</replaceable>);
|
|
</synopsis>
|
|
</term>
|
|
|
|
<listitem>
|
|
<para>
|
|
Creates a GIN (Generalized Inverted Index)-based index.
|
|
The <replaceable>column</replaceable> must be of <type>tsvector</> type.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
</para>
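
  <para>
   As a sketch of both forms in practice, the commands below index a stored
   <type>tsvector</type> column and, alternatively, an expression computed
   from a text column.  The table <literal>docs</literal> and its columns
   are hypothetical:

<programlisting>
-- docs, body, and tsv are illustrative names
CREATE TABLE docs (
    id   serial PRIMARY KEY,
    body text,
    tsv  tsvector
);

-- Index a stored tsvector column
CREATE INDEX docs_tsv_idx ON docs USING gin(tsv);

-- Or index an expression; the two-argument form of to_tsvector
-- must be used so that the result does not depend on a setting
CREATE INDEX docs_body_idx ON docs USING gin(to_tsvector('english', body));

-- A query has to repeat the indexed expression to be able to use it
SELECT id
FROM docs
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'cat &amp; rat');
</programlisting>
  </para>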
|
|
|
|
<para>
|
|
There are substantial performance differences between the two index types,
|
|
so it is important to understand their characteristics.
|
|
</para>
|
|
|
|
<para>
|
|
A GiST index is <firstterm>lossy</firstterm>, meaning that the index
|
|
may produce false matches, and it is necessary
|
|
to check the actual table row to eliminate such false matches.
|
|
(<productname>PostgreSQL</productname> does this automatically when needed.)
|
|
GiST indexes are lossy because each document is represented in the
|
|
index by a fixed-length signature. The signature is generated by hashing
|
|
   each word into a single bit in an n-bit string, with all these bits OR-ed
|
|
together to produce an n-bit document signature. When two words hash to
|
|
the same bit position there will be a false match. If all words in
|
|
the query have matches (real or false) then the table row must be
|
|
retrieved to see if the match is correct.
|
|
</para>
|
|
|
|
<para>
|
|
Lossiness causes performance degradation due to unnecessary fetches of table
|
|
records that turn out to be false matches. Since random access to table
|
|
records is slow, this limits the usefulness of GiST indexes. The
|
|
likelihood of false matches depends on several factors, in particular the
|
|
number of unique words, so using dictionaries to reduce this number is
|
|
recommended.
|
|
</para>
|
|
|
|
<para>
|
|
GIN indexes are not lossy for standard queries, but their performance
|
|
depends logarithmically on the number of unique words.
|
|
(However, GIN indexes store only the words (lexemes) of <type>tsvector</>
|
|
values, and not their weight labels. Thus a table row recheck is needed
|
|
when using a query that involves weights.)
|
|
</para>
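
  <para>
   A query that involves weights might look like the following sketch.  The
   table <literal>articles</literal> and its columns are assumptions made for
   the example:

<programlisting>
-- articles, title, body, and tsv are illustrative names
CREATE TABLE articles (
    id    serial PRIMARY KEY,
    title text,
    body  text,
    tsv   tsvector
);

-- Label title lexemes with weight A and body lexemes with weight D
UPDATE articles
SET tsv = setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
          setweight(to_tsvector('english', coalesce(body, '')), 'D');

CREATE INDEX articles_tsv_idx ON articles USING gin(tsv);

-- Restricting the match to weight A requires a recheck of the table row,
-- because the GIN index does not store weight labels
SELECT id
FROM articles
WHERE tsv @@ to_tsquery('english', 'supernova:A');
</programlisting>
  </para>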
|
|
|
|
<para>
|
|
In choosing which index type to use, GiST or GIN, consider these
|
|
performance differences:
|
|
|
|
<itemizedlist spacing="compact" mark="bullet">
|
|
<listitem>
|
|
<para>
|
|
GIN index lookups are about three times faster than GiST
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
GIN indexes take about three times longer to build than GiST
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
GIN indexes are moderately slower to update than GiST indexes, but
|
|
     about 10 times slower if fast-update support is disabled
|
|
(see <xref linkend="gin-fast-update"> for details)
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
GIN indexes are two-to-three times larger than GiST indexes
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
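
  <para>
   Fast-update support is controlled per index through the
   <literal>fastupdate</literal> storage parameter.  A minimal sketch,
   using a hypothetical table <literal>notes</literal>:

<programlisting>
-- notes and tsv are illustrative names
CREATE TABLE notes (id serial PRIMARY KEY, tsv tsvector);

-- Build a GIN index with fast update disabled
CREATE INDEX notes_tsv_idx ON notes USING gin(tsv) WITH (fastupdate = off);

-- Enable it again later without rebuilding the index
ALTER INDEX notes_tsv_idx SET (fastupdate = on);
</programlisting>
  </para>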
|
|
|
|
<para>
|
|
As a rule of thumb, <acronym>GIN</acronym> indexes are best for static data
|
|
because lookups are faster. For dynamic data, GiST indexes are
|
|
faster to update. Specifically, <acronym>GiST</acronym> indexes are very
|
|
good for dynamic data and fast if the number of unique words (lexemes) is
|
|
under 100,000, while <acronym>GIN</acronym> indexes will handle 100,000+
|
|
lexemes better but are slower to update.
|
|
</para>
|
|
|
|
<para>
|
|
Note that <acronym>GIN</acronym> index build time can often be improved
|
|
by increasing <xref linkend="guc-maintenance-work-mem">, while
|
|
<acronym>GiST</acronym> index build time is not sensitive to that
|
|
parameter.
|
|
</para>
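
  <para>
   For example, the setting can be raised for the current session just for
   the duration of the build.  The value shown is arbitrary, and the table
   <literal>archive</literal> is hypothetical:

<programlisting>
-- archive and tsv are illustrative names
CREATE TABLE archive (id serial PRIMARY KEY, tsv tsvector);

-- Give the index build more memory for this session only
SET maintenance_work_mem = '256MB';

CREATE INDEX archive_tsv_idx ON archive USING gin(tsv);

-- Return to the configured value
RESET maintenance_work_mem;
</programlisting>
  </para>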
|
|
|
|
<para>
|
|
Partitioning of big collections and the proper use of GiST and GIN indexes
|
|
   allow the implementation of very fast searches with online update.
|
|
Partitioning can be done at the database level using table inheritance,
|
|
or by distributing documents over
|
|
servers and collecting search results using the <filename>contrib/dblink</>
|
|
extension module. The latter is possible because ranking functions use
|
|
only local information.
|
|
</para>
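
  <para>
   A partitioning scheme based on table inheritance could be sketched as
   follows, with a GIN index on the static partition and a GiST index on
   the partition that is still being updated.  All table, column, and index
   names are hypothetical, and the split by year is only one possible
   choice:

<programlisting>
CREATE TABLE messages (
    id     integer,
    posted date,
    body   text,
    tsv    tsvector
);

-- Older, static data
CREATE TABLE messages_2008 (
    CHECK (posted BETWEEN DATE '2008-01-01' AND DATE '2008-12-31')
) INHERITS (messages);

-- Current data, updated frequently
CREATE TABLE messages_2009 (
    CHECK (posted BETWEEN DATE '2009-01-01' AND DATE '2009-12-31')
) INHERITS (messages);

CREATE INDEX messages_2008_tsv_idx ON messages_2008 USING gin(tsv);
CREATE INDEX messages_2009_tsv_idx ON messages_2009 USING gist(tsv);

-- Querying the parent table searches all partitions
SELECT id, ts_rank(tsv, q) AS rank
FROM messages, to_tsquery('english', 'index &amp; update') AS q
WHERE tsv @@ q
ORDER BY rank DESC
LIMIT 10;
</programlisting>
  </para>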
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="textsearch-psql">
|
|
<title><application>psql</> Support</title>
|
|
|
|
<para>
|
|
Information about text search configuration objects can be obtained
|
|
in <application>psql</application> using a set of commands:
|
|
<synopsis>
|
|
\dF{d,p,t}<optional>+</optional> <optional>PATTERN</optional>
|
|
</synopsis>
|
|
An optional <literal>+</literal> produces more details.
|
|
</para>
|
|
|
|
<para>
|
|
The optional parameter <literal>PATTERN</literal> can be the name of
|
|
a text search object, optionally schema-qualified. If
|
|
<literal>PATTERN</literal> is omitted then information about all
|
|
visible objects will be displayed. <literal>PATTERN</literal> can be a
|
|
regular expression and can provide <emphasis>separate</emphasis> patterns
|
|
for the schema and object names. The following examples illustrate this:
|
|
|
|
<programlisting>
|
|
=> \dF *fulltext*
|
|
List of text search configurations
|
|
Schema | Name | Description
|
|
--------+--------------+-------------
|
|
public | fulltext_cfg |
|
|
</programlisting>
|
|
|
|
<programlisting>
|
|
=> \dF *.fulltext*
|
|
List of text search configurations
|
|
Schema | Name | Description
|
|
----------+--------------+-------------
|
|
fulltext | fulltext_cfg |
|
|
public | fulltext_cfg |
|
|
</programlisting>
|
|
|
|
The available commands are:
|
|
</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term><synopsis>\dF<optional>+</optional> <optional>PATTERN</optional></synopsis></term>
|
|
|
|
<listitem>
|
|
<para>
|
|
List text search configurations (add <literal>+</> for more detail).
|
|
</para>
|
|
|
|
<para>
|
|
|
|
<programlisting>
|
|
=> \dF russian
|
|
List of text search configurations
|
|
Schema | Name | Description
|
|
------------+---------+------------------------------------
|
|
pg_catalog | russian | configuration for russian language
|
|
|
|
=> \dF+ russian
|
|
Text search configuration "pg_catalog.russian"
|
|
Parser: "pg_catalog.default"
|
|
Token | Dictionaries
|
|
-----------------+--------------
|
|
asciihword | english_stem
|
|
asciiword | english_stem
|
|
email | simple
|
|
file | simple
|
|
float | simple
|
|
host | simple
|
|
hword | russian_stem
|
|
hword_asciipart | english_stem
|
|
hword_numpart | simple
|
|
hword_part | russian_stem
|
|
int | simple
|
|
numhword | simple
|
|
numword | simple
|
|
sfloat | simple
|
|
uint | simple
|
|
url | simple
|
|
url_path | simple
|
|
version | simple
|
|
word | russian_stem
|
|
</programlisting>
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><synopsis>\dFd<optional>+</optional> <optional>PATTERN</optional></synopsis></term>
|
|
<listitem>
|
|
<para>
|
|
List text search dictionaries (add <literal>+</> for more detail).
|
|
</para>
|
|
|
|
<para>
|
|
<programlisting>
|
|
=> \dFd
|
|
List of text search dictionaries
|
|
Schema | Name | Description
|
|
------------+-----------------+-----------------------------------------------------------
|
|
pg_catalog | danish_stem | snowball stemmer for danish language
|
|
pg_catalog | dutch_stem | snowball stemmer for dutch language
|
|
pg_catalog | english_stem | snowball stemmer for english language
|
|
pg_catalog | finnish_stem | snowball stemmer for finnish language
|
|
pg_catalog | french_stem | snowball stemmer for french language
|
|
pg_catalog | german_stem | snowball stemmer for german language
|
|
pg_catalog | hungarian_stem | snowball stemmer for hungarian language
|
|
pg_catalog | italian_stem | snowball stemmer for italian language
|
|
pg_catalog | norwegian_stem | snowball stemmer for norwegian language
|
|
pg_catalog | portuguese_stem | snowball stemmer for portuguese language
|
|
pg_catalog | romanian_stem | snowball stemmer for romanian language
|
|
pg_catalog | russian_stem | snowball stemmer for russian language
|
|
pg_catalog | simple | simple dictionary: just lower case and check for stopword
|
|
pg_catalog | spanish_stem | snowball stemmer for spanish language
|
|
pg_catalog | swedish_stem | snowball stemmer for swedish language
|
|
pg_catalog | turkish_stem | snowball stemmer for turkish language
|
|
</programlisting>
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
|
|
<term><synopsis>\dFp<optional>+</optional> <optional>PATTERN</optional></synopsis></term>
|
|
<listitem>
|
|
<para>
|
|
List text search parsers (add <literal>+</> for more detail).
|
|
</para>
|
|
|
|
<para>
|
|
<programlisting>
|
|
=> \dFp
|
|
List of text search parsers
|
|
Schema | Name | Description
|
|
------------+---------+---------------------
|
|
pg_catalog | default | default word parser
|
|
=> \dFp+
|
|
Text search parser "pg_catalog.default"
|
|
Method | Function | Description
|
|
-----------------+----------------+-------------
|
|
Start parse | prsd_start |
|
|
Get next token | prsd_nexttoken |
|
|
End parse | prsd_end |
|
|
Get headline | prsd_headline |
|
|
Get token types | prsd_lextype |
|
|
|
|
Token types for parser "pg_catalog.default"
|
|
Token name | Description
|
|
-----------------+------------------------------------------
|
|
asciihword | Hyphenated word, all ASCII
|
|
asciiword | Word, all ASCII
|
|
blank | Space symbols
|
|
email | Email address
|
|
entity | XML entity
|
|
file | File or path name
|
|
float | Decimal notation
|
|
host | Host
|
|
hword | Hyphenated word, all letters
|
|
hword_asciipart | Hyphenated word part, all ASCII
|
|
hword_numpart | Hyphenated word part, letters and digits
|
|
hword_part | Hyphenated word part, all letters
|
|
int | Signed integer
|
|
numhword | Hyphenated word, letters and digits
|
|
numword | Word, letters and digits
|
|
protocol | Protocol head
|
|
sfloat | Scientific notation
|
|
tag | XML tag
|
|
uint | Unsigned integer
|
|
url | URL
|
|
url_path | URL path
|
|
version | Version number
|
|
word | Word, all letters
|
|
(23 rows)
|
|
</programlisting>
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
|
|
<term><synopsis>\dFt<optional>+</optional> <optional>PATTERN</optional></synopsis></term>
|
|
<listitem>
|
|
<para>
|
|
List text search templates (add <literal>+</> for more detail).
|
|
</para>
|
|
|
|
<para>
|
|
<programlisting>
|
|
=> \dFt
|
|
List of text search templates
|
|
Schema | Name | Description
|
|
------------+-----------+-----------------------------------------------------------
|
|
pg_catalog | ispell | ispell dictionary
|
|
pg_catalog | simple | simple dictionary: just lower case and check for stopword
|
|
pg_catalog | snowball | snowball stemmer
|
|
pg_catalog | synonym | synonym dictionary: replace word by its synonym
|
|
pg_catalog | thesaurus | thesaurus dictionary: phrase by phrase substitution
|
|
</programlisting>
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
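
  <para>
   Outside of <application>psql</application>, similar information can be
   read from the system catalogs.  As a sketch, the query below lists text
   search configurations with their schemas; it is an approximation of what
   <literal>\dF</literal> shows, not the exact query
   <application>psql</application> runs, and it does not filter by
   visibility:

<programlisting>
SELECT n.nspname AS schema, c.cfgname AS name
FROM pg_catalog.pg_ts_config AS c
     JOIN pg_catalog.pg_namespace AS n ON n.oid = c.cfgnamespace
ORDER BY 1, 2;
</programlisting>
  </para>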
|
|
|
|
</sect1>

<sect1 id="textsearch-limitations">
<title>Limitations</title>

<para>
The current limitations of <productname>PostgreSQL</productname>'s
text search features are:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>The length of each lexeme must be less than 2K bytes</para>
</listitem>
<listitem>
<para>The length of a <type>tsvector</type> (lexemes + positions) must be
less than 1 megabyte</para>
</listitem>
<listitem>
<!-- TODO: number of lexemes in what? This is unclear -->
<para>The number of lexemes must be less than
2<superscript>64</superscript></para>
</listitem>
<listitem>
<para>Position values in <type>tsvector</> must be greater than 0 and
no more than 16,383</para>
</listitem>
<listitem>
<para>No more than 256 positions per lexeme</para>
</listitem>
<listitem>
<para>The number of nodes (lexemes + operators) in a <type>tsquery</type>
must be less than 32,768</para>
</listitem>
</itemizedlist>
</para>
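
<para>
Several of these limits can be checked against actual data with built-in
functions. The following is a minimal sketch; the <structname>docs</>
table and its <structfield>id</> and <structfield>body_tsv</> columns are
hypothetical names used only for illustration:
</para>

<programlisting>
-- Number of lexemes in a tsvector (3 for this value)
SELECT length('fat:2,4 cat:3 rat:5A'::tsvector);

-- Number of nodes (lexemes + operators) in a tsquery
-- (3 lexemes + 2 operators = 5 for this value)
SELECT numnode('fat | (rat | cat)'::tsquery);

-- A position above the limit; it is expected to be stored as 16383
SELECT 'a:16384'::tsvector;

-- Largest tsvector values in a hypothetical table, by stored size in
-- bytes (the stored size may reflect compression)
SELECT id, pg_column_size(body_tsv) AS tsvector_bytes
FROM docs
ORDER BY tsvector_bytes DESC
LIMIT 10;
</programlisting>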

<para>
For comparison, the <productname>PostgreSQL</productname> 8.1 documentation
contained 335,420 words in total, of which 10,441 were unique, and the most
frequent word, <quote>postgresql</>, was mentioned 6,127 times in 655
documents.
</para>

<!-- TODO we need to put a date on these numbers? -->
<para>
As another example, the <productname>PostgreSQL</productname> mailing
list archives contained 910,989 unique words with 57,491,343 lexemes in
461,020 messages.
</para>
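
<para>
Figures of this kind can be gathered for one's own document collection
with the <function>ts_stat</> function. The following is a minimal sketch;
the <structname>docs</> table and its <structfield>body_tsv</> column are
hypothetical:
</para>

<programlisting>
-- Ten most frequent lexemes: ndoc is the number of documents containing
-- the lexeme, nentry is its total number of occurrences
SELECT word, ndoc, nentry
FROM ts_stat('SELECT body_tsv FROM docs')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;

-- Number of unique lexemes and total number of lexeme occurrences
SELECT count(*) AS unique_lexemes, sum(nentry) AS total_occurrences
FROM ts_stat('SELECT body_tsv FROM docs');
</programlisting>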

</sect1>

<sect1 id="textsearch-migration">
<title>Migration from Pre-8.3 Text Search</title>

<para>
Applications that used the <filename>contrib/tsearch2</> add-on module
for text searching will need some adjustments to work with the
built-in features:
</para>

<itemizedlist>
<listitem>
<para>
Some functions have been renamed or had small adjustments in their
argument lists, and all of them are now in the <literal>pg_catalog</>
schema, whereas in a previous installation they would have been in
<literal>public</> or another non-system schema. There is a new
version of <filename>contrib/tsearch2</> (see <xref linkend="tsearch2">)
that provides a compatibility layer to solve most problems in this
area.
</para>
</listitem>

<listitem>
<para>
The old <filename>contrib/tsearch2</> functions and other objects
<emphasis>must</> be suppressed when loading <application>pg_dump</>
output from a pre-8.3 database. While many of them won't load anyway,
a few will and then cause problems. One simple way to deal with this
is to load the new <filename>contrib/tsearch2</> module before restoring
the dump; then it will block the old objects from being loaded.
</para>
</listitem>

<listitem>
<para>
Text search configuration setup is completely different now.
Instead of manually inserting rows into configuration tables,
search is configured through the specialized SQL commands shown
earlier in this chapter. There is no automated
support for converting an existing custom configuration to 8.3;
it must be recreated manually with those commands.
</para>
</listitem>

<listitem>
<para>
Most types of dictionaries rely on some outside-the-database
configuration files. These are largely compatible with pre-8.3
usage, but note the following differences:

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
Configuration files must now be placed in a single specified
directory (<filename>$SHAREDIR/tsearch_data</>), and must have
a specific extension depending on the type of file, as noted
previously in the descriptions of the various dictionary types.
This restriction was added to forestall security problems.
</para>
</listitem>

<listitem>
<para>
Configuration files must be encoded in UTF-8,
regardless of what database encoding is used.
</para>
</listitem>

<listitem>
<para>
In thesaurus configuration files, stop words must be marked with
<literal>?</>.
</para>
</listitem>
</itemizedlist>
</para>
</listitem>

</itemizedlist>
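
<para>
The steps above can be sketched as a <application>psql</> script. This is
an illustration under stated assumptions rather than a definitive
procedure: the path to the module script, the dump file name, and the
names <literal>my_syn</>, <literal>my_synonyms</>, and
<literal>my_cfg</> are all hypothetical and must be adapted to the
installation.
</para>

<programlisting>
-- 1. Load the new contrib/tsearch2 compatibility module first, so that
--    it blocks the old objects (the path is an assumption)
\i /usr/local/pgsql/share/contrib/tsearch2.sql

-- 2. Restore the plain-text pg_dump output from the pre-8.3 database
--    (the file name is an assumption)
\i old_database.sql

-- 3. Recreate a custom dictionary; the file my_synonyms.syn is assumed
--    to exist in $SHAREDIR/tsearch_data and to be encoded in UTF-8
CREATE TEXT SEARCH DICTIONARY my_syn (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms
);

-- 4. Recreate a custom configuration with SQL commands instead of
--    inserting rows into configuration tables
CREATE TEXT SEARCH CONFIGURATION my_cfg ( COPY = pg_catalog.english );

ALTER TEXT SEARCH CONFIGURATION my_cfg
    ALTER MAPPING FOR asciiword, word
    WITH my_syn, english_stem;
</programlisting>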

</sect1>

</chapter>