<chapter id="textsearch">

<title>Full Text Search</title>

<sect1 id="textsearch-intro">

<title>Introduction</title>

<para>
Full Text Searching (or just <firstterm>text search</firstterm>) allows
identifying documents that satisfy a <firstterm>query</firstterm>, and
optionally sorting them by relevance to the query. The most common search
is to find all documents containing given <firstterm>query terms</firstterm>
and return them in order of their <firstterm>similarity</firstterm> to the
<varname>query</varname>. Notions of <varname>query</varname> and
<varname>similarity</varname> are very flexible and depend on the specific
application. The simplest search considers <varname>query</varname> as a
set of words and <varname>similarity</varname> as the frequency of query
words in the document. Full text indexing can be done inside the
database or outside. Doing indexing inside the database allows easy access
to document metadata to assist in indexing and display.
</para>

<para>
Textual search operators have existed in databases for years.
<productname>PostgreSQL</productname> has the
<literal>~</literal>, <literal>~*</literal>, <literal>LIKE</literal>, and
<literal>ILIKE</literal> operators for textual datatypes, but they lack
many essential properties required by modern information systems:
</para>

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
There is no linguistic support, even for English. Regular expressions are
not sufficient because they cannot easily handle derived words,
e.g., <literal>satisfies</literal> and <literal>satisfy</literal>. You might
miss documents which contain <literal>satisfies</literal>, although you
probably would like to find them when searching for
<literal>satisfy</literal>. It is possible to use <literal>OR</literal>
to search for <emphasis>any</emphasis> of them, but it is tedious and
error-prone (some words can have several thousand derivatives).
</para>
</listitem>

<listitem>
<para>
They provide no ordering (ranking) of search results, which makes them
ineffective when thousands of matching documents are found.
</para>
</listitem>

<listitem>
<para>
They tend to be slow because they process all documents for every search and
there is no index support.
</para>
</listitem>
</itemizedlist>

<para>
Full text indexing allows documents to be <emphasis>preprocessed</emphasis>
and an index saved for later rapid searching. Preprocessing includes:
</para>

<itemizedlist mark="none">
<listitem>
<para>
<emphasis>Parsing documents into <firstterm>lexemes</></emphasis>. It is
useful to identify various lexemes, e.g. digits, words, complex words,
email addresses, so they can be processed differently. In principle
lexemes depend on the specific application, but for an ordinary search it
is useful to have a predefined list of lexemes. <!-- add list of lexemes -->
</para>
</listitem>

<listitem>
<para>
<emphasis>Dictionaries</emphasis> allow the conversion of lexemes into
a <emphasis>normalized form</emphasis> so it is not necessary to enter
search words in a specific form.
</para>
</listitem>

<listitem>
<para>
<emphasis>Storing</emphasis> preprocessed documents optimized for
searching. For example, each document can be represented as a sorted array
of lexemes. Along with lexemes it is desirable to store positional
information to use for <varname>proximity ranking</varname>, so that
a document which contains a more "dense" region of query words is
assigned a higher rank than one with scattered query words.
</para>
</listitem>
</itemizedlist>

<para>
Dictionaries allow fine-grained control over how lexemes are created. With
dictionaries you can:
</para>

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
Define "stop words" that should not be indexed.
</para>
</listitem>

<listitem>
<para>
Map synonyms to a single word using <application>ispell</>.
</para>
</listitem>

<listitem>
<para>
Map phrases to a single word using a thesaurus.
</para>
</listitem>

<listitem>
<para>
Map different variations of a word to a canonical form using
an <application>ispell</> dictionary.
</para>
</listitem>

<listitem>
<para>
Map different variations of a word to a canonical form using
<application>snowball</> stemmer rules.
</para>
</listitem>
</itemizedlist>

<para>
A data type <type>tsvector</type> (<xref linkend="datatype-textsearch">)
is provided for storing preprocessed documents,
along with a type <type>tsquery</type> for representing textual
queries. Also, a full text search operator <literal>@@</literal> is defined
for these data types (<xref linkend="textsearch-searches">). Full text
searches can be accelerated using indexes (<xref
linkend="textsearch-indexes">).
</para>

<sect2 id="textsearch-document">

<title>What Is a <firstterm>Document</firstterm>?</title>

<indexterm zone="textsearch-document">
<primary>document</primary>
</indexterm>

<para>
A document can be a simple text file stored in the file system. The full
text indexing engine can parse text files and store associations of lexemes
(words) with their parent document. Later, these associations are used to
search for documents which contain query words. In this case, the database
can be used to store the full text index and for executing searches, and
some unique identifier can be used to retrieve the document from the file
system.
</para>

<para>
A document can also be any textual database attribute or a combination
(concatenation), which in turn can be stored in various tables or obtained
dynamically. In other words, a document can be constructed from different
parts for indexing and it might not exist as a whole. For example:

<programlisting>
SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document
FROM messages
WHERE mid = 12;

SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document
FROM messages m, docs d
WHERE mid = did AND mid = 12;
</programlisting>
</para>

<note>
<para>
Actually, in the previous example queries, <literal>COALESCE</literal>
<!-- TODO make this a link? -->
should be used to prevent a <literal>NULL</literal> attribute from causing
a <literal>NULL</literal> result.
</para>
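
<para>
For example, a safer version of the first query above (a sketch assuming
any of the four attributes might be <literal>NULL</literal>) is:
</para>

<programlisting>
SELECT coalesce(title,'') || ' ' || coalesce(author,'') || ' ' ||
       coalesce(abstract,'') || ' ' || coalesce(body,'') AS document
FROM messages
WHERE mid = 12;
</programlisting>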
</note>

</sect2>

<sect2 id="textsearch-searches">
|
|
|
|
<title>Performing Searches</title>
|
|
|
|
|
|
|
|
<para>
|
|
|
|
Full text searching in <productname>PostgreSQL</productname> is based on
|
|
|
|
the operator <literal>@@</literal>, which tests whether a <type>tsvector</type>
|
|
|
|
(document) matches a <type>tsquery</type> (query). Also, this operator
|
|
|
|
supports <type>text</type> input, allowing explicit conversion of a text
|
|
|
|
string to <type>tsvector</type> to be skipped. The variants available
|
|
|
|
are:
|
2007-08-22 05:08:47 +08:00
|
|
|
|
|
|
|
<programlisting>
|
2007-08-30 04:37:14 +08:00
|
|
|
tsvector @@ tsquery
|
|
|
|
tsquery @@ tsvector
|
|
|
|
text @@ tsquery
|
|
|
|
text @@ text
|
2007-08-22 05:08:47 +08:00
|
|
|
</programlisting>
|
2007-08-30 04:37:14 +08:00
|
|
|
</para>
<para>
The match operator <literal>@@</literal> returns <literal>true</literal> if
the <type>tsvector</type> matches the <type>tsquery</type>. It doesn't
matter which data type is written first:

<programlisting>
SELECT 'cat & rat'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 ?column?
----------
 t

SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 ?column?
----------
 f
</programlisting>
</para>

<para>
The form <type>text</type> <literal>@@</literal> <type>tsquery</type>
is equivalent to <literal>to_tsvector(x) @@ y</literal>.
The form <type>text</type> <literal>@@</literal> <type>text</type>
is equivalent to <literal>to_tsvector(x) @@ plainto_tsquery(y)</literal>.
<xref linkend="functions-textsearch"> contains a complete list of full text
search functions and operators.
</para>
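
<para>
For example, assuming the default text search configuration is
<literal>english</literal>, the earlier match can be written with plain
<type>text</type> operands (a sketch reusing the words from the example
above):

<programlisting>
SELECT 'a fat cat sat on a mat and ate a fat rat'::text @@ 'cat rat'::text;
 ?column?
----------
 t
</programlisting>
</para>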

</sect2>

<sect2 id="textsearch-configurations">
|
|
|
|
<title>Configurations</title>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-30 04:37:14 +08:00
|
|
|
<indexterm zone="textsearch-configurations">
|
|
|
|
<primary>configurations</primary>
|
|
|
|
</indexterm>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-30 04:37:14 +08:00
|
|
|
<para>
|
|
|
|
The above are all simple text search examples. As mentioned before, full
|
|
|
|
text search functionality includes the ability to do many more things:
|
|
|
|
skip indexing certain words (stop words), process synonyms, and use
|
|
|
|
sophisticated parsing, e.g. parse based on more than just white space.
|
|
|
|
This functionality is controlled by <emphasis>configurations</>.
|
|
|
|
Fortunately, <productname>PostgreSQL</> comes with predefined
|
|
|
|
configurations for many languages. (<application>psql</>'s <command>\dF</>
|
2007-08-30 05:51:45 +08:00
|
|
|
shows all predefined configurations.)
|
|
|
|
</para>
<para>
During installation an appropriate configuration was selected and
<xref linkend="guc-default-text-search-config"> was set accordingly
in <filename>postgresql.conf</>. If you are using the same text search
configuration for the entire cluster you can use the value in
<filename>postgresql.conf</>. If different databases require different
configurations, but each database uses a single text search configuration
throughout, use <command>ALTER DATABASE ... SET</>. If not, you must set
<xref linkend="guc-default-text-search-config"> in each session. Many
functions also take an optional configuration name.
</para>
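
<para>
For example, the configuration can be set per session, or per database
(here <literal>mydb</literal> is a hypothetical database name):

<programlisting>
SET default_text_search_config = 'pg_catalog.english';

ALTER DATABASE mydb SET default_text_search_config = 'pg_catalog.english';
</programlisting>
</para>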

</sect2>

</sect1>

<sect1 id="textsearch-tables">

<title>Tables and Indexes</title>

<para>
The previous section described how to perform full text searches using
constant strings. This section shows how to search table data, optionally
using indexes.
</para>

<sect2 id="textsearch-tables-search">

<title>Searching a Table</title>

<para>
It is possible to do a full text table search with no index. A simple query
to print the <literal>title</> of each row whose <literal>body</> contains
the word <literal>friend</> is:

<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('friend');
</programlisting>
</para>

<para>
The query above specifies the <literal>english</> configuration explicitly;
when the configuration argument is omitted, the one set by <xref
linkend="guc-default-text-search-config"> is used. A more complex query is to
select the ten most recent documents which contain <literal>create</> and
<literal>table</> in the <literal>title</> or <literal>body</>:

<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector('english', title || ' ' || body) @@ to_tsquery('create & table')
ORDER BY dlm DESC LIMIT 10;
</programlisting>

<literal>dlm</> is the last-modified date, so we
used <command>ORDER BY dlm DESC LIMIT 10</> to get the ten most recent
matches. (The <literal>' '</> in the concatenation keeps the last word of
the title from running into the first word of the body.) For clarity we
omitted the <function>coalesce</function> function
which prevents the unwanted effect of <literal>NULL</literal>
concatenation.
</para>

</sect2>

<sect2 id="textsearch-tables-index">

<title>Creating Indexes</title>

<para>
We can create a <acronym>GIN</acronym> (<xref
linkend="textsearch-indexes">) index to speed up the search:

<programlisting>
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', body));
</programlisting>

Notice that the two-argument version of <function>to_tsvector</function> is
used. Only text search functions which specify a configuration name can
be used in expression indexes (<xref linkend="indexes-expressional">).
This is because the index contents must be unaffected by <xref
linkend="guc-default-text-search-config">. If they were affected, the
index contents might be inconsistent because different entries could
contain <type>tsvector</>s that were created with different text search
configurations, and there would be no way to guess which was which. It
would be impossible to dump and restore such an index correctly.
</para>

<para>
Because the two-argument version of <function>to_tsvector</function> was
used in the index above, only a query that uses the two-argument
version of <function>to_tsvector</function> with the same configuration
name will use that index, i.e. <literal>WHERE 'a & b' @@
to_tsvector('english', body)</> will use the index, but <literal>WHERE
'a & b' @@ to_tsvector(body)</> and <literal>WHERE 'a & b' @@
body::tsvector</> will not. This guarantees that an index will be used
only with the same configuration used to create the index rows.
</para>

<para>
It is possible to set up more complex expression indexes where the
configuration name is specified by another column, e.g.:

<programlisting>
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_name, body));
</programlisting>

where <literal>config_name</> is a column in the <literal>pgweb</>
table. This allows mixed configurations in the same index while
recording which configuration was used for each index row.
</para>

<para>
Indexes can even concatenate columns:

<programlisting>
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', title || ' ' || body));
</programlisting>
</para>

<para>
A more complex case is to create a separate <type>tsvector</> column
to hold the output of <function>to_tsvector()</>. This example is a
concatenation of <literal>title</literal> and <literal>body</literal>,
with ranking information. We assign different labels to them to encode
information about the origin of each word:

<programlisting>
ALTER TABLE pgweb ADD COLUMN textsearch_index tsvector;
UPDATE pgweb SET textsearch_index =
    setweight(to_tsvector('english', coalesce(title,'')), 'A') ||
    setweight(to_tsvector('english', coalesce(body,'')), 'D');
</programlisting>

Then we create a <acronym>GIN</acronym> index to speed up the search:

<programlisting>
CREATE INDEX textsearch_idx ON pgweb USING gin(textsearch_index);
</programlisting>

After vacuuming, we are ready to perform a fast full text search:

<programlisting>
SELECT ts_rank_cd(textsearch_index, q) AS rank, title
FROM pgweb, to_tsquery('create & table') q
WHERE q @@ textsearch_index
ORDER BY rank DESC LIMIT 10;
</programlisting>

It is necessary to create a trigger to keep the new <type>tsvector</>
column current anytime <literal>title</> or <literal>body</> changes.
Keep in mind that, just like with expression indexes, it is important to
specify the configuration name when creating <type>tsvector</> values
inside triggers so the column's contents are not affected by changes to
<varname>default_text_search_config</>.
</para>
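
<para>
A minimal sketch of such a trigger, mirroring the <command>UPDATE</>
above (the function and trigger names are chosen purely for illustration):

<programlisting>
CREATE FUNCTION pgweb_update_trigger() RETURNS trigger AS $$
BEGIN
    -- rebuild the weighted tsvector from the new row values
    NEW.textsearch_index :=
        setweight(to_tsvector('english', coalesce(NEW.title,'')), 'A') ||
        setweight(to_tsvector('english', coalesce(NEW.body,'')), 'D');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
    ON pgweb FOR EACH ROW EXECUTE PROCEDURE pgweb_update_trigger();
</programlisting>
</para>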

</sect2>

</sect1>

<sect1 id="textsearch-controls">

<title>Additional Controls</title>

<para>
To implement full text searching there must be a function to create a
<type>tsvector</type> from a document and a <type>tsquery</type> from a
user query. Also, we need to return results in some order, i.e., we need
a function which compares documents with respect to their relevance to
the <type>tsquery</type>. Full text searching in
<productname>PostgreSQL</productname> provides all of these
functions.
</para>

<sect2 id="textsearch-parser">

<title>Parsing</title>

<para>
Full text searching in <productname>PostgreSQL</productname> provides the
function <function>to_tsvector</function>, which converts a document to
the <type>tsvector</type> data type. More details are available in <xref
linkend="functions-textsearch-tsvector">, but for now consider a simple example:

<programlisting>
SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
                  to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
</programlisting>
</para>

<para>
In the example above we see that the resulting <type>tsvector</type> does not
contain the words <literal>a</literal>, <literal>on</literal>, or
<literal>it</literal>, the word <literal>rats</literal> became
<literal>rat</literal>, and the punctuation sign <literal>-</literal> was
ignored.
</para>

<para>
The <function>to_tsvector</function> function internally calls a parser
which breaks the document (<literal>a fat cat sat on a mat - it ate a
fat rats</literal>) into words and corresponding types. The default parser
recognizes 23 types. Each word, depending on its type, passes through a
group of dictionaries (<xref linkend="textsearch-dictionaries">). At the
end of this step we obtain <emphasis>lexemes</emphasis>. For example,
<literal>rats</literal> became <literal>rat</literal> because one of the
dictionaries recognized that the word <literal>rats</literal> is a plural
form of <literal>rat</literal>. Some words are treated as "stop words"
(<xref linkend="textsearch-stopwords">) and ignored since they occur too
frequently and have little informational value. In our example these are
<literal>a</literal>, <literal>on</literal>, and <literal>it</literal>.
The punctuation sign <literal>-</literal> was also ignored because its
type (<literal>Space symbols</literal>) is not indexed. All information
about the parser, dictionaries and what types of lexemes to index is
documented in the full text configuration section (<xref
linkend="textsearch-tables-configuration">). It is possible to have
several different configurations in the same database, and many predefined
system configurations are available for different languages. In our example
we used the default configuration <literal>english</literal> for the
English language.
</para>

<para>
As another example, below is the output from the <function>ts_debug</function>
function (<xref linkend="textsearch-debugging">), which shows all details
of the full text machinery:

<programlisting>
SELECT * FROM ts_debug('english','a fat cat sat on a mat - it ate a fat rats');
 Alias |  Description  | Token | Dictionaries | Lexized token
-------+---------------+-------+--------------+----------------
 lword | Latin word    | a     | {english}    | english: {}
 blank | Space symbols |       |              |
 lword | Latin word    | fat   | {english}    | english: {fat}
 blank | Space symbols |       |              |
 lword | Latin word    | cat   | {english}    | english: {cat}
 blank | Space symbols |       |              |
 lword | Latin word    | sat   | {english}    | english: {sat}
 blank | Space symbols |       |              |
 lword | Latin word    | on    | {english}    | english: {}
 blank | Space symbols |       |              |
 lword | Latin word    | a     | {english}    | english: {}
 blank | Space symbols |       |              |
 lword | Latin word    | mat   | {english}    | english: {mat}
 blank | Space symbols |       |              |
 blank | Space symbols | -     |              |
 lword | Latin word    | it    | {english}    | english: {}
 blank | Space symbols |       |              |
 lword | Latin word    | ate   | {english}    | english: {ate}
 blank | Space symbols |       |              |
 lword | Latin word    | a     | {english}    | english: {}
 blank | Space symbols |       |              |
 lword | Latin word    | fat   | {english}    | english: {fat}
 blank | Space symbols |       |              |
 lword | Latin word    | rats  | {english}    | english: {rat}
(24 rows)
</programlisting>
</para>

<para>
Function <function>setweight()</function> is used to label the entries of
a <type>tsvector</type>. The typical usage is to mark entries coming from
different parts of a document, perhaps by importance. Later, this can be
used for ranking of search results in addition to positional information
(distance between query terms). If no ranking is required, positional
information can be removed from a <type>tsvector</type> using the
<function>strip()</function> function to save space.
</para>
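
<para>
For example (a sketch; both functions are applied to constant documents):

<programlisting>
SELECT setweight(to_tsvector('english', 'a fat cat'), 'A');
     setweight
-------------------
 'cat':3A 'fat':2A

SELECT strip(to_tsvector('english', 'a fat cat sat on a mat'));
          strip
-------------------------
 'cat' 'fat' 'mat' 'sat'
</programlisting>
</para>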

<para>
Because <function>to_tsvector</function>(<literal>NULL</literal>) can
return <literal>NULL</literal>, it is recommended to use
<function>coalesce</function>. Here is the safe method for creating a
<type>tsvector</type> from a structured document:

<programlisting>
UPDATE tt SET ti =
    setweight(to_tsvector(coalesce(title,'')), 'A')    ||
    setweight(to_tsvector(coalesce(keyword,'')), 'B')  ||
    setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
    setweight(to_tsvector(coalesce(body,'')), 'D');
</programlisting>
</para>

<para>
The following functions allow manual parsing control:

<variablelist>

<varlistentry>

<indexterm zone="textsearch-parser">
<primary>ts_parse</primary>
</indexterm>

<term>
<synopsis>
ts_parse(<replaceable class="PARAMETER">parser</replaceable> text, <replaceable class="PARAMETER">document</replaceable> text) returns SETOF <type>tokenout</type>
</synopsis>
</term>

<listitem>
<para>
Parses the given <replaceable>document</replaceable> and returns a series
of records, one for each token produced by parsing. Each record includes
a <varname>tokid</varname> giving its type and a <varname>token</varname>
which gives its content:

<programlisting>
SELECT * FROM ts_parse('default','123 - a number');
 tokid | token
-------+--------
    22 | 123
    12 |
    12 | -
     1 | a
    12 |
     1 | number
</programlisting>
</para>
</listitem>
</varlistentry>

<varlistentry>

<indexterm zone="textsearch-parser">
<primary>ts_token_type</primary>
</indexterm>

<term>
<synopsis>
ts_token_type(<replaceable class="PARAMETER">parser</replaceable>) returns SETOF <type>tokentype</type>
</synopsis>
</term>

<listitem>
<para>
Returns a table which describes each kind of token the
<replaceable>parser</replaceable> might produce as output. For each token
type the table gives the <varname>tokid</varname> which the
<replaceable>parser</replaceable> uses to label each
<varname>token</varname> of that type, the <varname>alias</varname> which
names the token type, and a short <varname>description</varname>:

<programlisting>
SELECT * FROM ts_token_type('default');
 tokid |    alias     |            description
-------+--------------+-----------------------------------
     1 | lword        | Latin word
     2 | nlword       | Non-latin word
     3 | word         | Word
     4 | email        | Email
     5 | url          | URL
     6 | host         | Host
     7 | sfloat       | Scientific notation
     8 | version      | VERSION
     9 | part_hword   | Part of hyphenated word
    10 | nlpart_hword | Non-latin part of hyphenated word
    11 | lpart_hword  | Latin part of hyphenated word
    12 | blank        | Space symbols
    13 | tag          | HTML Tag
    14 | protocol     | Protocol head
    15 | hword        | Hyphenated word
    16 | lhword       | Latin hyphenated word
    17 | nlhword      | Non-latin hyphenated word
    18 | uri          | URI
    19 | file         | File or path name
    20 | float        | Decimal notation
    21 | int          | Signed integer
    22 | uint         | Unsigned integer
    23 | entity       | HTML Entity
</programlisting>
</para>
</listitem>
</varlistentry>

</variablelist>
</para>

</sect2>

<sect2 id="textsearch-ranking">

<title>Ranking Search Results</title>

<para>
Ranking attempts to measure how relevant documents are to a particular
query by inspecting the number of times each search word appears in the
document, and whether different search terms occur near each other. Full
text searching provides two predefined ranking functions which attempt to
produce a measure of how relevant a document is to the query. However,
the concept of relevancy is vague and very application-specific.
These functions try to take into account lexical, proximity, and structural
information. Different applications might require additional information
for ranking, e.g. document modification time.
</para>

<para>
The lexical part of ranking reflects how often the query terms appear in
the document, how close together the query terms are in the document, and
in what part of the document they occur. Note that ranking functions that
use positional information will only work on unstripped tsvectors because
stripped tsvectors lack positional information.
</para>

<para>
The two ranking functions currently available are:

<variablelist>

<varlistentry>

<indexterm zone="textsearch-ranking">
<primary>ts_rank</primary>
</indexterm>

<term>
<synopsis>
ts_rank(<optional> <replaceable class="PARAMETER">weights</replaceable> float4[], </optional> <replaceable class="PARAMETER">vector</replaceable> TSVECTOR, <replaceable class="PARAMETER">query</replaceable> TSQUERY, <optional> <replaceable class="PARAMETER">normalization</replaceable> int4 </optional>) returns float4
</synopsis>
</term>

<listitem>
<para>
This ranking function offers the ability to weigh word instances more
heavily depending on how you have classified them. The weights specify
how heavily to weigh each category of word:

<programlisting>
{D-weight, C-weight, B-weight, A-weight}
</programlisting>

If no weights are provided,
then these defaults are used:

<programlisting>
{0.1, 0.2, 0.4, 1.0}
</programlisting>

Often weights are used to mark words from special areas of the document,
like the title or an initial abstract, and make them more or less important
than words in the document body.
</para>
</listitem>
</varlistentry>

<varlistentry>

<indexterm zone="textsearch-ranking">
<primary>ts_rank_cd</primary>
</indexterm>

<term>
<synopsis>
ts_rank_cd(<optional> <replaceable class="PARAMETER">weights</replaceable> float4[], </optional> <replaceable class="PARAMETER">vector</replaceable> TSVECTOR, <replaceable class="PARAMETER">query</replaceable> TSQUERY, <optional> <replaceable class="PARAMETER">normalization</replaceable> int4 </optional>) returns float4
</synopsis>
</term>

<listitem>
<para>
This function computes the <emphasis>cover density</emphasis> ranking for
the given document vector and query, as described in Clarke, Cormack, and
Tudhope's "Relevance Ranking for One to Three Term Queries", in the journal
"Information Processing and Management", 1999.
</para>
</listitem>
</varlistentry>

</variablelist>

</para>

<para>
Since a longer document has a greater chance of containing a query term
it is reasonable to take into account document size, i.e. a hundred-word
document with five instances of a search word is probably more relevant
than a thousand-word document with five instances. Both ranking functions
take an integer <replaceable>normalization</replaceable> option that
specifies whether a document's length should impact its rank. The integer
option is a bit mask, so several behaviors can be selected at once using
<literal>|</literal> (for example, <literal>2|4</literal>):

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
0 (the default) ignores the document length
</para>
</listitem>
<listitem>
<para>
1 divides the rank by 1 + the logarithm of the document length
</para>
</listitem>
<listitem>
<para>
2 divides the rank by the document length itself
</para>
</listitem>
<listitem>
<para>
<!-- what is mean harmonic distance -->
4 divides the rank by the mean harmonic distance between extents
</para>
</listitem>
<listitem>
<para>
8 divides the rank by the number of unique words in the document
</para>
</listitem>
<listitem>
<para>
16 divides the rank by 1 + the logarithm of the number of unique words in
the document
</para>
</listitem>
</itemizedlist>

</para>
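
<para>
For example, to divide the rank both by the document length and by the
mean harmonic distance between extents (options 2 and 4 combined), one
could write (a sketch using the <literal>apod</literal> table from the
examples below):

<programlisting>
SELECT title, ts_rank_cd(textsearch, query, 2|4) AS rank
FROM apod, to_tsquery('neutrino') query
WHERE query @@ textsearch
ORDER BY rank DESC LIMIT 10;
</programlisting>
</para>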

<para>
It is important to note that ranking functions do not use any global
information so it is impossible to produce a fair normalization to 1% or
100%, as sometimes required. However, a simple technique like
<literal>rank/(rank+1)</literal> can be applied. Of course, this is just
a cosmetic change, i.e., the ordering of the search results will not change.
</para>

<para>
Several examples are shown below; note that the second example uses
normalized ranking:

<programlisting>
SELECT title, ts_rank_cd('{0.1, 0.2, 0.4, 1.0}',textsearch, query) AS rnk
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rnk DESC LIMIT 10;
                     title                     |   rnk
-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218

SELECT title, ts_rank_cd('{0.1, 0.2, 0.4, 1.0}',textsearch, query)/
(ts_rank_cd('{0.1, 0.2, 0.4, 1.0}',textsearch, query) + 1) AS rnk
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rnk DESC LIMIT 10;
                     title                     |        rnk
-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
 The Sudbury Neutrino Detector                 | 0.705882361190954
 A MACHO View of Galactic Dark Matter          | 0.668123210574724
 Hot Gas and Dark Matter                       |  0.65655958650282
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 Rafting for Solar Neutrinos                   | 0.655172410958162
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 Hot Gas and Dark Matter                       | 0.617195790024749
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 Weak Lensing Distorts the Universe            | 0.450010798361481
</programlisting>
</para>

<para>
The first argument in <function>ts_rank_cd</function> (<literal>'{0.1, 0.2,
0.4, 1.0}'</literal>) is an optional parameter which specifies the
weights for labels <literal>D</literal>, <literal>C</literal>,
<literal>B</literal>, and <literal>A</literal> used in function
<function>setweight</function>. These default values show that lexemes
labeled as <literal>A</literal> are ten times more important than ones
that are labeled with <literal>D</literal>.
</para>

<para>
Ranking can be expensive since it requires consulting the
<type>tsvector</type> of every matching document, which can be I/O bound and
therefore slow. Unfortunately, this is almost impossible to avoid: full
text searching must work even without indexes, and ranking needs
information that only the documents themselves contain. Moreover, an index
can be lossy (a <acronym>GiST</acronym>
index, for example) so it must recheck documents to avoid false hits.
</para>

<para>
Note that the ranking functions above are only examples. You can write
your own ranking functions and/or combine additional factors to fit your
specific needs.
</para>

</sect2>

<sect2 id="textsearch-headline">

<title>Highlighting Results</title>

<indexterm zone="textsearch-headline">
<primary>headline</primary>
</indexterm>

<para>
To present search results it is ideal to show a part of each document and
how it is related to the query. Usually, search engines show fragments of
the document with marked search terms. <productname>PostgreSQL</> full
text searching provides the function <function>ts_headline</function> that
implements such functionality.
</para>

<variablelist>

<varlistentry>

<term>
<synopsis>
ts_headline(<optional> <replaceable class="PARAMETER">config_name</replaceable> text, </optional> <replaceable class="PARAMETER">document</replaceable> text, <replaceable class="PARAMETER">query</replaceable> TSQUERY, <optional> <replaceable class="PARAMETER">options</replaceable> text </optional>) returns text
</synopsis>
</term>

<listitem>
<para>
The <function>ts_headline</function> function accepts a document along with
a query, and returns one or more ellipsis-separated excerpts from the
document in which terms from the query are highlighted. The configuration
used to parse the document can be specified by its
<replaceable>config_name</replaceable>; if none is specified, the current
configuration is used.
</para>
</listitem>
</varlistentry>
</variablelist>

<para>
If an <replaceable>options</replaceable> string is specified it should
consist of a comma-separated list of one or more 'option=value' pairs.
The available options are:

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
<literal>StartSel</>, <literal>StopSel</literal>: the strings with which
query words appearing in the document should be delimited to distinguish
them from other excerpted words.
</para>
</listitem>
<listitem>
<para>
<literal>MaxWords</>, <literal>MinWords</literal>: limit the longest and
shortest headlines to output
</para>
</listitem>
<listitem>
<para>
<literal>ShortWord</literal>: this prevents your headline from beginning
or ending with a word which has this many characters or less. The default
value of three eliminates the English articles.
</para>
</listitem>
<listitem>
<para>
<literal>HighlightAll</literal>: boolean flag; if
<literal>true</literal> the whole document will be highlighted
</para>
</listitem>
</itemizedlist>

Any unspecified options receive these defaults:

<programlisting>
StartSel=<b>, StopSel=</b>, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE
</programlisting>
</para>

<para>
For example:

<programlisting>
SELECT ts_headline('a b c', 'c'::tsquery);
 ts_headline
--------------
 a b <b>c</b>

SELECT ts_headline('a b c', 'c'::tsquery, 'StartSel=<,StopSel=>');
 ts_headline
-------------
 a b <c>
</programlisting>
</para>

<para>
<function>ts_headline</> uses the original document, not the
<type>tsvector</type>, so it can be slow and should be used with care.
A typical mistake is to call <function>ts_headline()</function> for
<emphasis>every</emphasis> matching document when only ten documents are
shown. <acronym>SQL</acronym> subselects can help here; below is an
example:

<programlisting>
SELECT id, ts_headline(body, q), rank
FROM (SELECT id, body, q, ts_rank_cd(ti, q) AS rank
      FROM apod, to_tsquery('stars') q
      WHERE ti @@ q
      ORDER BY rank DESC LIMIT 10) AS foo;
</programlisting>
</para>

<para>
Note that a cascade drop of the <function>parser</function> function also
drops the headline support used by the full text search configuration
<replaceable>config_name</replaceable>, so <function>ts_headline</function>
can no longer be used with that configuration.
</para>

</sect2>

</sect1>

<sect1 id="textsearch-dictionaries">

<title>Dictionaries</title>

<para>
Dictionaries are used to eliminate words that should not be considered in a
search (<firstterm>stop words</>), and to <firstterm>normalize</> words so
that different derived forms of the same word will match. Aside from
improving search quality, normalization and removal of stop words reduce the
size of the <type>tsvector</type> representation of a document, thereby
improving performance. Normalization does not always have linguistic meaning
and usually depends on application semantics.
</para>

<para>
Some examples of normalization:

<itemizedlist spacing="compact" mark="bullet">

<listitem>
<para>
Linguistic - ispell dictionaries try to reduce input words to a
normalized form; stemmer dictionaries remove word endings
</para>
</listitem>
<listitem>
<para>
Identical <acronym>URL</acronym> locations are identified and canonicalized:

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
http://www.pgsql.ru/db/mw/index.html
</para>
</listitem>
<listitem>
<para>
http://www.pgsql.ru/db/mw/
</para>
</listitem>
<listitem>
<para>
http://www.pgsql.ru/db/../db/mw/index.html
</para>
</listitem>
</itemizedlist>
</para>
</listitem>
<listitem>
<para>
Colour names are substituted by their hexadecimal values, e.g.,
<literal>red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF</literal>
</para>
</listitem>
<listitem>
<para>
Remove some numeric fractional digits to reduce the range of possible
numbers, so <emphasis>3.14</emphasis>159265359,
<emphasis>3.14</emphasis>15926, <emphasis>3.14</emphasis> will be the same
after normalization if only two digits are kept after the decimal point.
</para>
</listitem>
</itemizedlist>

</para>

<para>
A dictionary is a <emphasis>program</emphasis> which accepts lexemes as
input and returns:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
an array of lexemes if the input lexeme is known to the dictionary
</para>
</listitem>
<listitem>
<para>
an empty array if the dictionary knows the lexeme, but it is a stop word
</para>
</listitem>
<listitem>
<para>
<literal>NULL</literal> if the dictionary does not recognize the input lexeme
</para>
</listitem>
</itemizedlist>
</para>

<para>
Full text searching provides predefined dictionaries for many languages,
and <acronym>SQL</acronym> commands to manipulate them. There are also
several predefined template dictionaries that can be used to create new
dictionaries by overriding their default parameters. Besides this, it is
possible to develop custom dictionaries using an <acronym>API</acronym>;
see the dictionary for integers (<xref
linkend="textsearch-rule-dictionary-example">) as an example.
</para>

<para>
The <literal>ALTER TEXT SEARCH CONFIGURATION ADD
MAPPING</literal> command binds specific types of lexemes and a set of
dictionaries to process them. (Mappings can also be specified as part of
configuration creation.) Lexemes are processed by a stack of dictionaries
until some dictionary identifies them as known words or they turn out to
be stop words. If no dictionary recognizes a lexeme, it will be discarded
and not indexed. A general rule for configuring a stack of dictionaries
is to place first the most narrow, most specific dictionary, then the more
general dictionaries, and to finish with a very general dictionary, like
the <application>snowball</> stemmer or <literal>simple</>, which
recognizes everything. For example, for an astronomy-specific search
(<literal>astro_en</literal> configuration) one could bind
<type>lword</type> (latin word) with a synonym dictionary of astronomical
terms, a general English dictionary and a <application>snowball</> English
stemmer:

<programlisting>
ALTER TEXT SEARCH CONFIGURATION astro_en
    ADD MAPPING FOR lword WITH astrosyn, english_ispell, english_stem;
</programlisting>
</para>

<para>
Function <function>ts_lexize</function> can be used to test dictionaries,
for example:

<programlisting>
SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}
(1 row)
</programlisting>

Also, the <function>ts_debug</function> function (<xref linkend="textsearch-debugging">)
can be used for this.
</para>

<sect2 id="textsearch-stopwords">

<title>Stop Words</title>

<para>
Stop words are words which are very common, appear in almost
every document, and have no discrimination value. Therefore, they can be ignored
in the context of full text searching. For example, every English text contains
words like <literal>a</literal>, although it is useless to store them in an index.
However, stop words do affect the positions in a <type>tsvector</type>,
which in turn affect ranking:

<programlisting>
SELECT to_tsvector('english','in the list of stop words');
        to_tsvector
----------------------------
 'list':3 'stop':5 'word':6
</programlisting>

The gaps between positions 1-3 and 3-5 are because of stop words, so ranks
calculated for documents with and without stop words are quite different:

<programlisting>
SELECT ts_rank_cd ('{1,1,1,1}', to_tsvector('english','in the list of stop words'), to_tsquery('list & stop'));
 ts_rank_cd
------------
        0.5

SELECT ts_rank_cd ('{1,1,1,1}', to_tsvector('english','list stop words'), to_tsquery('list & stop'));
 ts_rank_cd
------------
          1
</programlisting>

</para>

<para>
It is up to the specific dictionary how it treats stop words. For example,
<literal>ispell</literal> dictionaries first normalize words and then
look at the list of stop words, while <literal>stemmers</literal>
first check the list of stop words. The reason for the different
behaviour is an attempt to decrease possible noise.
</para>

<para>
Here is an example of a dictionary that returns the input word as lowercase
or <literal>NULL</literal> if it is a stop word; it also specifies the name
of a file of stop words. It uses the <literal>simple</> dictionary as
a template:

<programlisting>
CREATE TEXT SEARCH DICTIONARY public.simple_dict (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english
);
</programlisting>

Now we can test our dictionary:

<programlisting>
SELECT ts_lexize('public.simple_dict','YeS');
 ts_lexize
-----------
 {yes}

SELECT ts_lexize('public.simple_dict','The');
 ts_lexize
-----------
 {}
</programlisting>
</para>

<caution>
<para>
Most types of dictionaries rely on configuration files, such as files of stop
words. These files <emphasis>must</> be stored in UTF-8 encoding. They will
be translated to the actual database encoding, if that is different, when they
are read into the server.
</para>
</caution>

</sect2>

<sect2 id="textsearch-synonym-dictionary">

<title>Synonym Dictionary</title>

<para>
This dictionary template is used to create dictionaries which replace a
word with a synonym. Phrases are not supported (use the thesaurus
dictionary (<xref linkend="textsearch-thesaurus">) for that). A synonym
dictionary can be used to overcome linguistic problems, for example, to
prevent an English stemmer dictionary from reducing the word 'Paris' to
'pari'. It is enough to have a <literal>Paris paris</literal> line in the
synonym dictionary and put it before the <literal>english_stem</> dictionary:

<programlisting>
SELECT * FROM ts_debug('english','Paris');
 Alias | Description | Token |  Dictionaries  |    Lexized token
-------+-------------+-------+----------------+----------------------
 lword | Latin word  | Paris | {english_stem} | english_stem: {pari}
(1 row)

CREATE TEXT SEARCH DICTIONARY synonym
    (TEMPLATE = synonym, SYNONYMS = my_synonyms);

ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR lword WITH synonym, english_stem;

SELECT * FROM ts_debug('english','Paris');
 Alias | Description | Token |      Dictionaries      |  Lexized token
-------+-------------+-------+------------------------+------------------
 lword | Latin word  | Paris | {synonym,english_stem} | synonym: {paris}
(1 row)
</programlisting>
</para>

</sect2>

<sect2 id="textsearch-thesaurus">

<title>Thesaurus Dictionary</title>

<para>
A thesaurus dictionary (sometimes abbreviated as <acronym>TZ</acronym>) is
a collection of words which includes information about the relationships
of words and phrases, i.e., broader terms (<acronym>BT</acronym>), narrower
terms (<acronym>NT</acronym>), preferred terms, non-preferred terms, related
terms, etc.
</para>

|
|
|
|
|
|
|
|
<para>
|
|
|
|
Basically a thesaurus dictionary replaces all non-preferred terms by one
|
|
|
|
preferred term and, optionally, preserves them for indexing. Thesauruses
|
|
|
|
are used during indexing so any change in the thesaurus <emphasis>requires</emphasis>
|
|
|
|
reindexing. The current implementation of the thesaurus
|
|
|
|
dictionary is an extension of the synonym dictionary with added
|
|
|
|
<emphasis>phrase</emphasis> support. A thesaurus dictionary requires
|
|
|
|
a configuration file of the following format:
|
|
|
|
|
2007-08-22 05:08:47 +08:00
|
|
|
<programlisting>
|
|
|
|
# this is a comment
|
|
|
|
sample word(s) : indexed word(s)
|
2007-08-25 14:26:57 +08:00
|
|
|
more sample word(s) : more indexed word(s)
|
|
|
|
...
|
2007-08-22 05:08:47 +08:00
|
|
|
</programlisting>
|
2007-08-29 10:37:04 +08:00
|
|
|
|
|
|
|
where the colon (<symbol>:</symbol>) symbol acts as a delimiter between a
|
|
|
|
a phrase and its replacement.
|
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
A thesaurus dictionary uses a <emphasis>subdictionary</emphasis> (which
is defined in the dictionary's configuration) to normalize the input text
before checking for phrase matches. It is only possible to select one
subdictionary. An error is reported if the subdictionary fails to
recognize a word. In that case, you should remove the use of the word or teach
the subdictionary about it. Use an asterisk (<symbol>*</symbol>) at the
beginning of an indexed word to skip the subdictionary. It is still required
that sample words are known.
</para>

<para>
The thesaurus dictionary looks for the longest match.
</para>
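
<para>
For instance (a hedged sketch; these entries are hypothetical), given a
thesaurus file containing both a one-word and a two-word sample phrase:

<programlisting>
supernovae : sn
supernovae stars : sn-star
</programlisting>

the input <literal>supernovae stars</literal> matches the two-word entry,
because the longer sample phrase wins.
</para>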

<para>
Stop words recognized by the subdictionary are replaced by a 'stop word
placeholder' to record their position. To break possible ties the thesaurus
uses the last definition. To illustrate this, consider a thesaurus (with
a <parameter>simple</parameter> subdictionary) with pattern
<replaceable>swsw</>, where <replaceable>s</> designates any stop word and
<replaceable>w</>, any known word:

<programlisting>
a one the two : swsw
the one a two : swsw2
</programlisting>

Words <literal>a</> and <literal>the</> are stop words defined in the
configuration of a subdictionary. The thesaurus considers <literal>the
one the two</literal> and <literal>that one then two</literal> as equal
and will use definition <replaceable>swsw2</>.
</para>

<para>
Like any normal dictionary, a thesaurus can be assigned to specific lexeme types.
Since a thesaurus dictionary has the capability to recognize phrases it
must remember its state and interact with the parser. A thesaurus dictionary
uses these assignments to check if it should handle the next word or stop
accumulation. The thesaurus dictionary must be configured
carefully. For example, if the thesaurus dictionary is assigned to handle
only the <token>lword</token> lexeme, then a thesaurus dictionary
definition like 'one 7' will not work since lexeme type
<token>digit</token> is not assigned to the thesaurus dictionary.
</para>

</sect2>

<sect2 id="textsearch-thesaurus-config">
|
2007-08-30 04:37:14 +08:00
|
|
|
<title>Thesaurus Configuration</title>
|
2007-08-29 10:37:04 +08:00
|
|
|
|
|
|
|
<para>
|
|
|
|
To define a new thesaurus dictionary one can use the thesaurus template.
|
|
|
|
For example:
|
2007-08-22 05:08:47 +08:00
|
|
|
|
|
|
|
<programlisting>
|
2007-08-22 12:45:20 +08:00
|
|
|
CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
|
|
|
|
TEMPLATE = thesaurus,
|
|
|
|
DictFile = mythesaurus,
|
2007-08-25 14:26:57 +08:00
|
|
|
Dictionary = pg_catalog.english_stem
|
2007-08-22 12:45:20 +08:00
|
|
|
);
|
2007-08-22 05:08:47 +08:00
|
|
|
</programlisting>
|
2007-08-29 10:37:04 +08:00
|
|
|
|
|
|
|
Here:
|
|
|
|
<itemizedlist spacing="compact" mark="bullet">
|
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
<literal>thesaurus_simple</literal> is the thesaurus dictionary name
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
<literal>mythesaurus</literal> is the base name of the thesaurus file
|
|
|
|
(its full name will be <filename>$SHAREDIR/tsearch_data/mythesaurus.ths</>,
|
|
|
|
where <literal>$SHAREDIR</> means the installation shared-data directory,
|
|
|
|
often <filename>/usr/local/share</>).
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
<literal>pg_catalog.english_stem</literal> is the dictionary (Snowball
|
|
|
|
English stemmer) to use for thesaurus normalization. Notice that the
|
|
|
|
<literal>english_stem</> dictionary has its own configuration (for example,
|
|
|
|
stop words), which is not shown here.
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
|
|
|
|
|
|
Now it is possible to bind the thesaurus dictionary <literal>thesaurus_simple</literal>
|
|
|
|
and selected <literal>tokens</literal>, for example:
|
2007-08-22 05:08:47 +08:00
|
|
|
|
|
|
|
<programlisting>
|
2007-08-22 12:45:20 +08:00
|
|
|
ALTER TEXT SEARCH CONFIGURATION russian
|
|
|
|
ADD MAPPING FOR lword, lhword, lpart_hword WITH thesaurus_simple;
|
2007-08-22 05:08:47 +08:00
|
|
|
</programlisting>
|
2007-08-29 10:37:04 +08:00
|
|
|
</para>
|
|
|
|
|
|
|
|
</sect2>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<sect2 id="textsearch-thesaurus-examples">
|
2007-08-30 04:37:14 +08:00
|
|
|
<title>Thesaurus Example</title>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<para>
|
|
|
|
Consider a simple astronomical thesaurus <literal>thesaurus_astro</literal>,
|
|
|
|
which contains some astronomical word combinations:
|
2007-08-22 05:08:47 +08:00
|
|
|
|
|
|
|
<programlisting>
|
|
|
|
supernovae stars : sn
|
|
|
|
crab nebulae : crab
|
|
|
|
</programlisting>
|
2007-08-29 10:37:04 +08:00
|
|
|
|
|
|
|
Below we create a dictionary and bind some token types with
|
|
|
|
an astronomical thesaurus and english stemmer:
|
|
|
|
|
2007-08-22 05:08:47 +08:00
|
|
|
<programlisting>
|
2007-08-22 12:45:20 +08:00
|
|
|
CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
|
|
|
|
TEMPLATE = thesaurus,
|
|
|
|
DictFile = thesaurus_astro,
|
2007-08-25 14:26:57 +08:00
|
|
|
Dictionary = english_stem
|
2007-08-22 12:45:20 +08:00
|
|
|
);
|
2007-08-29 10:37:04 +08:00
|
|
|
|
2007-08-22 12:45:20 +08:00
|
|
|
ALTER TEXT SEARCH CONFIGURATION russian
|
2007-08-25 14:26:57 +08:00
|
|
|
ADD MAPPING FOR lword, lhword, lpart_hword WITH thesaurus_astro, english_stem;
|
2007-08-22 05:08:47 +08:00
|
|
|
</programlisting>
|
2007-08-29 10:37:04 +08:00
|
|
|
|
|
|
|
Now we can see how it works. Note that <function>ts_lexize</function> cannot
|
|
|
|
be used for testing the thesaurus (see description of
|
|
|
|
<function>ts_lexize</function>), but we can use
|
|
|
|
<function>plainto_tsquery</function> and <function>to_tsvector</function>
|
|
|
|
which accept <literal>text</literal> arguments, not lexemes:
|
2007-08-22 05:08:47 +08:00
|
|
|
|
|
|
|
<programlisting>
|
|
|
|
SELECT plainto_tsquery('supernova star');
|
|
|
|
plainto_tsquery
|
|
|
|
-----------------
|
|
|
|
'sn'
|
2007-08-29 10:37:04 +08:00
|
|
|
|
2007-08-22 05:08:47 +08:00
|
|
|
SELECT to_tsvector('supernova star');
|
|
|
|
to_tsvector
|
|
|
|
-------------
|
|
|
|
'sn':1
|
|
|
|
</programlisting>
|
2007-08-29 10:37:04 +08:00
|
|
|
|
|
|
|
In principle, one can use <function>to_tsquery</function> if you quote
|
|
|
|
the argument:
|
|
|
|
|
2007-08-22 05:08:47 +08:00
|
|
|
<programlisting>
|
|
|
|
SELECT to_tsquery('''supernova star''');
|
|
|
|
to_tsquery
|
|
|
|
------------
|
|
|
|
'sn'
|
|
|
|
</programlisting>
|
2007-08-29 10:37:04 +08:00
|
|
|
|
|
|
|
Notice that <literal>supernova star</literal> matches <literal>supernovae
|
|
|
|
stars</literal> in <literal>thesaurus_astro</literal> because we specified the
|
|
|
|
<literal>english_stem</literal> stemmer in the thesaurus definition.
|
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
|
|
|
To keep an original phrase in full text indexing just add it to the right part
|
|
|
|
of the definition:
|
|
|
|
|
2007-08-22 05:08:47 +08:00
|
|
|
<programlisting>
|
|
|
|
supernovae stars : sn supernovae stars
|
|
|
|
|
|
|
|
SELECT plainto_tsquery('supernova star');
|
|
|
|
plainto_tsquery
|
|
|
|
-----------------------------
|
|
|
|
'sn' & 'supernova' & 'star'
|
|
|
|
</programlisting>
|
2007-08-29 10:37:04 +08:00
|
|
|
</para>
|
|
|
|
|
|
|
|
</sect2>
|
|
|
|
|
|
|
|
<sect2 id="textsearch-ispell-dictionary">
|
2007-08-30 04:37:14 +08:00
|
|
|
<title>Ispell Dictionary</title>
|
2007-08-29 10:37:04 +08:00
|
|
|
|
|
|
|
<para>
|
|
|
|
The <application>Ispell</> template dictionary for full text allows the
|
|
|
|
creation of morphological dictionaries based on <ulink
|
|
|
|
url="http://ficus-www.cs.ucla.edu/geoff/ispell.html">Ispell</ulink>, which
|
|
|
|
supports a large number of languages. This dictionary tries to change an
|
|
|
|
input word to its normalized form. Also, more modern spelling dictionaries
|
|
|
|
are supported - <ulink
|
|
|
|
url="http://en.wikipedia.org/wiki/MySpell">MySpell</ulink> (OO < 2.0.1)
|
|
|
|
and <ulink url="http://sourceforge.net/projects/hunspell">Hunspell</ulink>
|
|
|
|
(OO >= 2.0.2). A large list of dictionaries is available on the <ulink
|
|
|
|
url="http://wiki.services.openoffice.org/wiki/Dictionaries">OpenOffice
|
|
|
|
Wiki</ulink>.
|
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
|
|
|
The <application>Ispell</> dictionary allows searches without bothering
|
|
|
|
about different linguistic forms of a word. For example, a search on
|
|
|
|
<literal>bank</literal> would return hits of all declensions and
|
|
|
|
conjugations of the search term <literal>bank</literal>, e.g.
|
|
|
|
<literal>banking</>, <literal>banked</>, <literal>banks</>,
|
|
|
|
<literal>banks'</>, and <literal>bank's</>.
|
|
|
|
|
2007-08-22 05:08:47 +08:00
|
|
|
<programlisting>
|
2007-08-25 14:26:57 +08:00
|
|
|
SELECT ts_lexize('english_ispell','banking');
|
2007-08-22 12:45:20 +08:00
|
|
|
ts_lexize
|
|
|
|
-----------
|
2007-08-22 05:08:47 +08:00
|
|
|
{bank}
|
2007-08-29 10:37:04 +08:00
|
|
|
|
2007-08-25 14:26:57 +08:00
|
|
|
SELECT ts_lexize('english_ispell','bank''s');
|
2007-08-22 12:45:20 +08:00
|
|
|
ts_lexize
|
|
|
|
-----------
|
2007-08-22 05:08:47 +08:00
|
|
|
{bank}
|
2007-08-29 10:37:04 +08:00
|
|
|
|
2007-08-25 14:26:57 +08:00
|
|
|
SELECT ts_lexize('english_ispell','banked');
|
2007-08-22 12:45:20 +08:00
|
|
|
ts_lexize
|
|
|
|
-----------
|
2007-08-22 05:08:47 +08:00
|
|
|
{bank}
|
|
|
|
</programlisting>
|
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
To create an ispell dictionary one should use the built-in
<literal>ispell</literal> template and specify several
parameters:
</para>

<programlisting>
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,
    AffFile = english,
    StopWords = english
);
</programlisting>

<para>
Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
specify the names of the dictionary, affixes, and stop-words files.
</para>

<para>
Ispell dictionaries usually recognize a restricted set of words so they
should be used in conjunction with another broader dictionary; for
example, a stemming dictionary, which recognizes everything.
</para>

<para>
Ispell dictionaries support splitting compound words based on an
ispell dictionary, and full text searching in
<productname>PostgreSQL</productname> supports this feature.
Notice that the affix file should specify a special flag using the
<literal>compoundwords controlled</literal> statement that marks dictionary
words that can participate in compound formation:

<programlisting>
compoundwords controlled z
</programlisting>

Several examples for the Norwegian language:

<programlisting>
SELECT ts_lexize('norwegian_ispell','overbuljongterningpakkmesterassistent');
 {over,buljong,terning,pakk,mester,assistent}
SELECT ts_lexize('norwegian_ispell','sjokoladefabrikk');
 {sjokoladefabrikk,sjokolade,fabrikk}
</programlisting>
</para>

<note>
<para>
<application>MySpell</> does not support compound words.
<application>Hunspell</> has sophisticated support for compound words. At
present, full text searching implements only the basic compound word
operations of Hunspell.
</para>
</note>

</sect2>

<sect2 id="textsearch-stemming-dictionary">
|
2007-08-30 04:37:14 +08:00
|
|
|
<title><application>Snowball</> Stemming Dictionary</title>
|
2007-08-29 10:37:04 +08:00
|
|
|
|
|
|
|
<para>
|
|
|
|
The <application>Snowball</> dictionary template is based on the project
|
|
|
|
of Martin Porter, inventor of the popular Porter's stemming algorithm
|
|
|
|
for the English language and now supported in many languages (see the <ulink
|
|
|
|
url="http://snowball.tartarus.org">Snowball site</ulink> for more
|
|
|
|
information). The Snowball project supplies a large number of stemmers for
|
|
|
|
many languages. A Snowball dictionary requires a language parameter to
|
|
|
|
identify which stemmer to use, and optionally can specify a stopword file name.
|
|
|
|
For example, there is a built-in definition equivalent to
|
|
|
|
|
2007-08-22 05:08:47 +08:00
|
|
|
<programlisting>
|
2007-08-25 14:26:57 +08:00
|
|
|
CREATE TEXT SEARCH DICTIONARY english_stem (
|
|
|
|
TEMPLATE = snowball, Language = english, StopWords = english
|
2007-08-22 12:45:20 +08:00
|
|
|
);
|
2007-08-22 05:08:47 +08:00
|
|
|
</programlisting>
|
2007-08-29 10:37:04 +08:00
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
The <application>Snowball</> dictionary recognizes everything, so it is best
to place it at the end of the dictionary stack. It is useless to have it
before any other dictionary because a lexeme will never pass through it to
the next dictionary.
</para>
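
<para>
For instance (a minimal sketch; the mapping shown mirrors a setup used
elsewhere in this chapter), the stemmer would normally come last in any
mapping:

<programlisting>
ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR lword WITH english_ispell, english_stem;
</programlisting>

Here <literal>english_ispell</literal> gets the first chance to recognize
each word, and <literal>english_stem</literal> catches whatever remains.
</para>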

</sect2>

<sect2 id="textsearch-dictionary-testing">
|
2007-08-30 04:37:14 +08:00
|
|
|
<title>Dictionary Testing</title>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<para>
|
|
|
|
The <function>ts_lexize</> function facilitates dictionary testing:
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<variablelist>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<varlistentry>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<indexterm zone="textsearch-dictionaries">
|
|
|
|
<primary>ts_lexize</primary>
|
|
|
|
</indexterm>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<term>
|
|
|
|
<synopsis>
|
|
|
|
ts_lexize(<replaceable class="PARAMETER">dict_name</replaceable> text, <replaceable class="PARAMETER">lexeme</replaceable> text) returns text[]
|
|
|
|
</synopsis>
|
|
|
|
</term>
|
|
|
|
|
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
Returns an array of lexemes if the input <replaceable>lexeme</replaceable>
|
|
|
|
is known to the dictionary <replaceable>dictname</replaceable>, or a void
|
|
|
|
array if the lexeme is known to the dictionary but it is a stop word, or
|
|
|
|
<literal>NULL</literal> if it is an unknown word.
|
|
|
|
</para>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
|
|
|
<programlisting>
SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}

SELECT ts_lexize('english_stem', 'a');
 ts_lexize
-----------
 {}
</programlisting>
</listitem>
</varlistentry>

</variablelist>
</para>

<note>
<para>
The <function>ts_lexize</function> function expects a
<replaceable>lexeme</replaceable>, not text. Below is an example:

<programlisting>
SELECT ts_lexize('thesaurus_astro','supernovae stars') is null;
 ?column?
----------
 t
</programlisting>

The thesaurus dictionary <literal>thesaurus_astro</literal> does know
<literal>supernovae stars</literal>, but <function>ts_lexize</> fails since it
does not parse the input text and considers it a single lexeme. Use
<function>plainto_tsquery</> and <function>to_tsvector</> to test thesaurus
dictionaries:

<programlisting>
SELECT plainto_tsquery('supernovae stars');
 plainto_tsquery
-----------------
 'sn'
</programlisting>
</para>
</note>

</sect2>

<sect2 id="textsearch-tables-configuration">
|
2007-08-30 04:37:14 +08:00
|
|
|
<title>Configuration Example</title>
|
2007-08-29 10:37:04 +08:00
|
|
|
|
|
|
|
<para>
|
|
|
|
A full text configuration specifies all options necessary to transform a
|
|
|
|
document into a <type>tsvector</type>: the parser breaks text into tokens,
|
|
|
|
and the dictionaries transform each token into a lexeme. Every call to
|
|
|
|
<function>to_tsvector()</function> and <function>to_tsquery()</function>
|
|
|
|
needs a configuration to perform its processing. To facilitate management
|
|
|
|
of full text searching objects, a set of <acronym>SQL</acronym> commands
|
|
|
|
is available, and there are several psql commands which display information
|
|
|
|
about full text searching objects (<xref linkend="textsearch-psql">).
|
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
|
|
|
The configuration parameter
|
|
|
|
<xref linkend="guc-default-text-search-config">
|
|
|
|
specifies the name of the current default configuration, which is the
|
|
|
|
one used by text search functions when an explicit configuration
|
|
|
|
parameter is omitted.
|
|
|
|
It can be set in <filename>postgresql.conf</filename>, or set for an
|
|
|
|
individual session using the <command>SET</> command.
|
|
|
|
</para>
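
<para>
For example (a minimal illustration; <literal>pg_catalog.english</literal>
is one of the built-in configurations), a session can switch its default
configuration like this:

<programlisting>
SET default_text_search_config = 'pg_catalog.english';
</programlisting>
</para>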

<para>
Several predefined text searching configurations are available in the
<literal>pg_catalog</literal> schema. If you need a custom configuration
you can create a new text searching configuration and modify it using SQL
commands.
</para>

<para>
New text searching objects are created in the current schema by default
(usually the <literal>public</literal> schema), but a schema-qualified
name can be used to create objects in the specified schema.
</para>

<para>
As an example, we will create a configuration
<literal>pg</literal> which starts as a duplicate of the
<literal>english</> configuration. To be safe, we do this in a transaction:

<programlisting>
BEGIN;

CREATE TEXT SEARCH CONFIGURATION public.pg ( COPY = english );
</programlisting>
</para>

<para>
We will use a PostgreSQL-specific synonym list
and store it in <filename>share/tsearch_data/pg_dict.syn</filename>.
The file contents look like:

<programlisting>
postgres    pg
pgsql       pg
postgresql  pg
</programlisting>

We define the dictionary like this:

<programlisting>
CREATE TEXT SEARCH DICTIONARY pg_dict (
    TEMPLATE = synonym,
    SYNONYMS = pg_dict
);
</programlisting>
</para>
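
<para>
As a quick sanity check (a sketch; it assumes the
<filename>pg_dict.syn</filename> file above is installed), the new
dictionary can be exercised directly:

<programlisting>
SELECT ts_lexize('pg_dict', 'pgsql');
 ts_lexize
-----------
 {pg}
</programlisting>
</para>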

<para>
Then register the <productname>ispell</> dictionary
<literal>english_ispell</literal> using the <literal>ispell</literal> template:

<programlisting>
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,
    AffFile = english,
    StopWords = english
);
</programlisting>
</para>

<para>
Now modify mappings for Latin words for configuration <literal>pg</>:

<programlisting>
ALTER TEXT SEARCH CONFIGURATION pg
    ALTER MAPPING FOR lword, lhword, lpart_hword
    WITH pg_dict, english_ispell, english_stem;
</programlisting>
</para>

<para>
We choose not to index or search some token types:

<programlisting>
ALTER TEXT SEARCH CONFIGURATION pg
    DROP MAPPING FOR email, url, sfloat, uri, float;
</programlisting>
</para>

<para>
Now, we can test our configuration:

<programlisting>
SELECT * FROM ts_debug('public.pg', '
PostgreSQL, the highly scalable, SQL compliant, open source object-relational
database management system, is now undergoing beta testing of the next
version of our software: PostgreSQL 8.3.
');

COMMIT;
</programlisting>
</para>

<para>
With the dictionaries and mappings set up, suppose we have a table
<literal>pgweb</literal> which contains 11239 documents from the
<productname>PostgreSQL</productname> web site. Only relevant columns
are shown:

<programlisting>
=> \d pgweb
     Table "public.pgweb"
 Column |       Type        | Modifiers
--------+-------------------+-----------
 tid    | integer           | not null
 path   | character varying | not null
 body   | character varying |
 title  | character varying |
 dlm    | date              |
</programlisting>
</para>
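
<para>
As a hedged illustration (this query is ours, not part of the original
example), a search over this table could look like the following, relying
on the default configuration that is set just below:

<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('create &amp; table')
ORDER BY dlm DESC LIMIT 10;
</programlisting>
</para>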

<para>
The next step is to set the session to use the new configuration, which was
created in the <literal>public</> schema:

<programlisting>
=> \dF
List of fulltext configurations
 Schema | Name | Description
--------+------+-------------
 public | pg   |

SET default_text_search_config = 'public.pg';
SET

SHOW default_text_search_config;
 default_text_search_config
----------------------------
 public.pg
</programlisting>
</para>

</sect2>

</sect1>

<sect1 id="textsearch-indexes">
|
2007-08-30 04:37:14 +08:00
|
|
|
<title>GiST and GIN Index Types</title>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
|
|
|
<indexterm zone="textsearch-indexes">
|
|
|
|
<primary>index</primary>
|
|
|
|
<secondary>full text</secondary>
|
|
|
|
</indexterm>
|
|
|
|
|
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<para>
|
|
|
|
There are two kinds of indexes which can be used to speed up full text
|
|
|
|
operators (<xref linkend="textsearch-searches">).
|
|
|
|
Note that indexes are not mandatory for full text searching.
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<variablelist>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<varlistentry>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<indexterm zone="textsearch-indexes">
|
|
|
|
<primary>index</primary>
|
|
|
|
<secondary>GIST, for text searching</secondary>
|
|
|
|
</indexterm>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<term>
|
|
|
|
<synopsis>
|
|
|
|
CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING gist(<replaceable>column</replaceable>);
|
|
|
|
</synopsis>
|
|
|
|
</term>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
Creates a GiST (Generalized Search Tree)-based index.
|
|
|
|
The <replaceable>column</replaceable> can be of <type>tsvector</> or
|
|
|
|
<type>tsquery</> type.
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
</varlistentry>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<varlistentry>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<indexterm zone="textsearch-indexes">
|
|
|
|
<primary>index</primary>
|
|
|
|
<secondary>GIN</secondary>
|
|
|
|
</indexterm>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<term>
|
|
|
|
<synopsis>
|
|
|
|
CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING gin(<replaceable>column</replaceable>);
|
|
|
|
</synopsis>
|
|
|
|
</term>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
Creates a GIN (Generalized Inverted Index)-based index.
|
|
|
|
The <replaceable>column</replaceable> must be of <type>tsvector</> type.
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
</varlistentry>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
</variablelist>
|
|
|
|
</para>
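
<para>
As a minimal sketch (the index names are ours, chosen to match the query
plans shown below; <literal>apod</literal> is the sample table used there,
with a <type>tsvector</> column <literal>textsearch</literal>):

<programlisting>
CREATE INDEX textsearch_gidx ON apod USING gist(textsearch);

CREATE INDEX textsearch_idx ON apod USING gin(textsearch);
</programlisting>
</para>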

<para>
A GiST index is <firstterm>lossy</firstterm>, meaning it is necessary
to check the actual table row to eliminate false matches.
<productname>PostgreSQL</productname> does this automatically; for
example, in the query plan below, the <literal>Filter:</literal>
line indicates the index output will be rechecked:

<programlisting>
EXPLAIN SELECT * FROM apod WHERE textsearch @@ to_tsquery('supernovae');
                               QUERY PLAN
-------------------------------------------------------------------------
 Index Scan using textsearch_gidx on apod  (cost=0.00..12.29 rows=2 width=1469)
   Index Cond: (textsearch @@ '''supernova'''::tsquery)
   Filter: (textsearch @@ '''supernova'''::tsquery)
</programlisting>

GiST index lossiness happens because each document is represented by a
fixed-length signature. The signature is generated by hashing (crc32) each
word into a random bit in an n-bit string and all words combine to produce
an n-bit document signature. Because of hashing there is a chance that
some words hash to the same position and could result in a false hit.
Signatures calculated for each document in a collection are stored in an
<literal>RD-tree</literal> (Russian Doll tree), invented by Hellerstein,
which is an adaptation of <literal>R-tree</literal> for sets. In our case
the transitive containment relation is realized by
superimposed coding (Knuth, 1973) of signatures, i.e., a parent is the
result of 'OR'-ing the bit-strings of all children. This is a second
factor of lossiness. Parents tend to degenerate into strings full of
<literal>1</>s and become quite useless because of their
limited selectivity. Searching is performed as a bit comparison of a
signature representing the query and an <literal>RD-tree</literal> entry.
If all <literal>1</>s of both signatures are in the same position we
say that this branch probably matches the query, but if there is even one
discrepancy we can definitely reject this branch.
</para>

<para>
Lossiness causes serious performance degradation since random access of
<literal>heap</literal> records is slow and limits the usefulness of GiST
indexes. The likelihood of false hits depends on several factors, like
the number of unique words, so using dictionaries to reduce this number
is recommended.
</para>

<para>
Actually, this is not the whole story. GiST indexes have an optimization
for storing small tsvectors (&lt; <literal>TOAST_INDEX_TARGET</literal>
bytes, 512 bytes). On leaf pages small tsvectors are stored unchanged,
while longer ones are represented by their signatures, which introduces
some lossiness. Unfortunately, the existing index API does not allow for
a return value to say whether it found an exact value (tsvector) or whether
the result needs to be checked. This is why the GiST index is
currently marked as lossy. We hope to improve this in the future.
</para>

<para>
GIN indexes are not lossy but their performance depends logarithmically on
the number of unique words.
</para>

<para>
There is one side-effect of the non-lossiness of a GIN index when using
query labels/weights, like <literal>'supernovae:a'</literal>. A GIN index
has all the information necessary to determine a match, so the heap is
not accessed. However, label information is not stored in the index,
so if the query involves label weights it must access
the heap. Therefore, a special full text search operator <literal>@@@</literal>
was created which forces the use of the heap to get information about
labels. GiST indexes are lossy so they always read the heap and there is
no need for a special operator. In the example below,
<literal>textsearch_idx</literal> is a GIN index:<!-- why isn't this
automatic -->

<programlisting>
EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a');
                               QUERY PLAN
------------------------------------------------------------------------
 Index Scan using textsearch_idx on apod  (cost=0.00..12.30 rows=2 width=1469)
   Index Cond: (textsearch @@@ '''supernova'':A'::tsquery)
   Filter: (textsearch @@@ '''supernova'':A'::tsquery)
</programlisting>
</para>

<para>
In choosing which index type to use, GiST or GIN, consider these differences:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
GIN index lookups are three times faster than GiST
</para>
</listitem>
<listitem>
<para>
GIN indexes take three times longer to build than GiST
</para>
</listitem>
<listitem>
<para>
GIN is about ten times slower to update than GiST
</para>
</listitem>
<listitem>
<para>
GIN indexes are two-to-three times larger than GiST
</para>
</listitem>
</itemizedlist>
</para>

<para>
In summary, <acronym>GIN</acronym> indexes are best for static data because
lookups are faster. For dynamic data, GiST indexes are
faster to update. Specifically, <acronym>GiST</acronym> indexes are very
good for dynamic data and fast if the number of unique words (lexemes) is
under 100,000, while <acronym>GIN</acronym> handles 100,000+ lexemes better
but is slower to update.
</para>

<para>
Partitioning of big collections and the proper use of GiST and GIN indexes
allows the implementation of very fast searches with online update.
Partitioning can be done at the database level using table inheritance
and <varname>constraint_exclusion</>, or by distributing documents over
servers and collecting search results using the <filename>contrib/dblink</>
extension module. The latter is possible because ranking functions use
only local information.
</para>
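
<para>
As an illustrative sketch only (the table, column, and constraint names
are hypothetical), date-based partitioning with table inheritance could
look like this:

<programlisting>
CREATE TABLE docs (body text, textsearch tsvector, dlm date);

CREATE TABLE docs_2007 (
    CHECK (dlm >= '2007-01-01' AND dlm &lt; '2008-01-01')
) INHERITS (docs);

CREATE INDEX docs_2007_searchidx ON docs_2007 USING gin(textsearch);

SET constraint_exclusion = on;

SELECT * FROM docs
WHERE textsearch @@ to_tsquery('supernovae') AND dlm >= '2007-01-01';
</programlisting>

With <varname>constraint_exclusion</> enabled, the planner skips partitions
whose <literal>CHECK</> constraints contradict the <literal>WHERE</> clause.
</para>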

</sect1>

<sect1 id="textsearch-limitations">
<title>Limitations</title>

<para>
The current limitations of Full Text Searching are:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>The length of each lexeme must be less than 2K bytes</para>
</listitem>
<listitem>
<para>The length of a <type>tsvector</type> (lexemes + positions) must be less than 1 megabyte</para>
</listitem>
<listitem>
<para>The number of lexemes must be less than 2<superscript>64</superscript></para>
</listitem>
<listitem>
<para>Positional information must be non-negative and less than 16,383</para>
</listitem>
<listitem>
<para>No more than 256 positions per lexeme</para>
</listitem>
<listitem>
<para>The number of nodes (lexemes + operations) in a tsquery must be less than 32,768</para>
</listitem>
</itemizedlist>
</para>

<para>
For comparison, the <productname>PostgreSQL</productname> 8.1 documentation
contained 10,441 unique words, a total of 335,420 words, and the most frequent
word <quote>postgresql</> was mentioned 6,127 times in 655 documents.
</para>

<!-- TODO we need to put a date on these numbers? -->
<para>
Another example: the <productname>PostgreSQL</productname> mailing list
archives contained 910,989 unique words with 57,491,343 lexemes in 461,020
messages.
</para>

</sect1>

<sect1 id="textsearch-psql">
|
2007-08-30 04:37:14 +08:00
|
|
|
<title><application>psql</> Support</title>
|
2007-08-29 10:37:04 +08:00
|
|
|
|
|
|
|
<para>
|
|
|
|
Information about full text searching objects can be obtained
|
|
|
|
in <literal>psql</literal> using a set of commands:
|
|
|
|
<synopsis>
|
|
|
|
\dF{,d,p}<optional>+</optional> <optional>PATTERN</optional>
|
|
|
|
</synopsis>
|
|
|
|
An optional <literal>+</literal> produces more details.
|
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
|
|
|
The optional parameter <literal>PATTERN</literal> should be the name of
|
|
|
|
a full text searching object, optionally schema-qualified. If
|
|
|
|
<literal>PATTERN</literal> is not specified then information about all
|
|
|
|
visible objects will be displayed. <literal>PATTERN</literal> can be a
|
|
|
|
regular expression and can apply <emphasis>separately</emphasis> to schema
|
|
|
|
names and object names. The following examples illustrate this:
|
|
|
|
|
2007-08-22 05:08:47 +08:00
|
|
|
<programlisting>
|
|
|
|
=> \dF *fulltext*
|
|
|
|
List of fulltext configurations
|
|
|
|
Schema | Name | Description
|
|
|
|
--------+--------------+-------------
|
|
|
|
public | fulltext_cfg |
|
|
|
|
</programlisting>
|
|
|
|
|
|
|
|
<programlisting>
|
|
|
|
=> \dF *.fulltext*
|
|
|
|
List of fulltext configurations
|
|
|
|
Schema | Name | Description
|
|
|
|
----------+----------------------------
|
|
|
|
fulltext | fulltext_cfg |
|
|
|
|
public | fulltext_cfg |
|
|
|
|
</programlisting>
|
2007-08-29 10:37:04 +08:00
|
|
|
</para>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<variablelist>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
|
|
|
<varlistentry>
|
2007-08-29 10:37:04 +08:00
|
|
|
<term>\dF[+] [PATTERN]</term>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
|
|
|
<listitem>
|
|
|
|
<para>
|
2007-08-29 10:37:04 +08:00
|
|
|
List full text searching configurations (add "+" for more detail)
|
2007-08-22 05:08:47 +08:00
|
|
|
</para>
|
|
|
|
<para>
|
|
|
|
By default (without <literal>PATTERN</literal>), information about
|
|
|
|
all <emphasis>visible</emphasis> full text configurations will be
|
|
|
|
displayed.
|
|
|
|
</para>
|
2007-08-29 10:37:04 +08:00
|
|
|
<para>
|
|
|
|
|
2007-08-22 05:08:47 +08:00
|
|
|
<programlisting>
|
|
|
|
=> \dF russian
|
|
|
|
List of fulltext configurations
|
|
|
|
Schema | Name | Description
|
|
|
|
------------+---------+-----------------------------------
|
|
|
|
pg_catalog | russian | default configuration for Russian
|
|
|
|
|
|
|
|
=> \dF+ russian
|
2007-08-29 10:37:04 +08:00
|
|
|
Configuration "pg_catalog.russian"
|
|
|
|
Parser name: "pg_catalog.default"
|
2007-08-22 05:08:47 +08:00
|
|
|
Token | Dictionaries
|
|
|
|
--------------+-------------------------
|
|
|
|
email | pg_catalog.simple
|
|
|
|
file | pg_catalog.simple
|
|
|
|
float | pg_catalog.simple
|
|
|
|
host | pg_catalog.simple
|
2007-08-25 14:26:57 +08:00
|
|
|
hword | pg_catalog.russian_stem
|
2007-08-22 05:08:47 +08:00
|
|
|
int | pg_catalog.simple
|
|
|
|
lhword | public.tz_simple
|
|
|
|
lpart_hword | public.tz_simple
|
|
|
|
lword | public.tz_simple
|
2007-08-25 14:26:57 +08:00
|
|
|
nlhword | pg_catalog.russian_stem
|
|
|
|
nlpart_hword | pg_catalog.russian_stem
|
|
|
|
nlword | pg_catalog.russian_stem
|
2007-08-22 05:08:47 +08:00
|
|
|
part_hword | pg_catalog.simple
|
|
|
|
sfloat | pg_catalog.simple
|
|
|
|
uint | pg_catalog.simple
|
|
|
|
uri | pg_catalog.simple
|
|
|
|
url | pg_catalog.simple
|
|
|
|
version | pg_catalog.simple
|
2007-08-25 14:26:57 +08:00
|
|
|
word | pg_catalog.russian_stem
|
2007-08-22 05:08:47 +08:00
|
|
|
</programlisting>
|
2007-08-29 10:37:04 +08:00
|
|
|
</para>
|
2007-08-22 05:08:47 +08:00
|
|
|
</listitem>
|
|
|
|
</varlistentry>
|
|
|
|
|
|
|
|
<varlistentry>
|
2007-08-29 10:37:04 +08:00
|
|
|
<term>\dFd[+] [PATTERN]</term>
|
2007-08-22 05:08:47 +08:00
|
|
|
<listitem>
|
|
|
|
<para>
|
2007-08-29 10:37:04 +08:00
|
|
|
List full text dictionaries (add "+" for more detail).
|
2007-08-22 05:08:47 +08:00
|
|
|
</para>
|
|
|
|
<para>
|
|
|
|
By default (without <literal>PATTERN</literal>), information about
|
|
|
|
all <emphasis>visible</emphasis> dictionaries will be displayed.
|
|
|
|
</para>
|
2007-08-29 10:37:04 +08:00
|
|
|
|
|
|
|
<para>
|
2007-08-22 05:08:47 +08:00
|
|
|
<programlisting>
|
|
|
|
=> \dFd
|
|
|
|
List of fulltext dictionaries
|
|
|
|
Schema | Name | Description
|
|
|
|
------------+------------+-----------------------------------------------------------
|
|
|
|
pg_catalog | danish | Snowball stemmer for danish language
|
|
|
|
pg_catalog | dutch | Snowball stemmer for dutch language
|
|
|
|
pg_catalog | english | Snowball stemmer for english language
|
|
|
|
pg_catalog | finnish | Snowball stemmer for finnish language
|
|
|
|
pg_catalog | french | Snowball stemmer for french language
|
|
|
|
pg_catalog | german | Snowball stemmer for german language
|
|
|
|
pg_catalog | hungarian | Snowball stemmer for hungarian language
|
|
|
|
pg_catalog | italian | Snowball stemmer for italian language
|
|
|
|
pg_catalog | norwegian | Snowball stemmer for norwegian language
|
|
|
|
pg_catalog | portuguese | Snowball stemmer for portuguese language
|
|
|
|
pg_catalog | romanian | Snowball stemmer for romanian language
|
|
|
|
pg_catalog | russian | Snowball stemmer for russian language
|
|
|
|
pg_catalog | simple | simple dictionary: just lower case and check for stopword
|
|
|
|
pg_catalog | spanish | Snowball stemmer for spanish language
|
|
|
|
pg_catalog | swedish | Snowball stemmer for swedish language
|
|
|
|
pg_catalog | turkish | Snowball stemmer for turkish language
|
|
|
|
</programlisting>
|
2007-08-29 10:37:04 +08:00
|
|
|
</para>
|
2007-08-22 05:08:47 +08:00
|
|
|
</listitem>
|
|
|
|
</varlistentry>
|
|
|
|
|
|
|
|
<varlistentry>
|
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<term>\dFp[+] [PATTERN]</term>
|
2007-08-22 05:08:47 +08:00
|
|
|
<listitem>
|
|
|
|
<para>
|
2007-08-29 10:37:04 +08:00
|
|
|
List full text parsers (add "+" for more detail)
|
2007-08-22 05:08:47 +08:00
|
|
|
</para>
|
|
|
|
<para>
|
|
|
|
By default (without <literal>PATTERN</literal>), information about
|
|
|
|
all <emphasis>visible</emphasis> full text parsers will be displayed.
|
|
|
|
</para>
|
2007-08-29 10:37:04 +08:00
|
|
|
<para>
|
2007-08-22 05:08:47 +08:00
|
|
|
<programlisting>
|
2007-08-29 10:37:04 +08:00
|
|
|
=> \dFp
|
2007-08-22 05:08:47 +08:00
|
|
|
List of fulltext parsers
|
|
|
|
Schema | Name | Description
|
|
|
|
------------+---------+---------------------
|
|
|
|
pg_catalog | default | default word parser
|
2007-08-29 10:37:04 +08:00
|
|
|
(1 row)
|
2007-08-22 05:08:47 +08:00
|
|
|
=> \dFp+
|
|
|
|
Fulltext parser "pg_catalog.default"
|
|
|
|
Method | Function | Description
|
|
|
|
-------------------+---------------------------+-------------
|
|
|
|
Start parse | pg_catalog.prsd_start |
|
|
|
|
Get next token | pg_catalog.prsd_nexttoken |
|
|
|
|
End parse | pg_catalog.prsd_end |
|
|
|
|
Get headline | pg_catalog.prsd_headline |
|
|
|
|
Get lexeme's type | pg_catalog.prsd_lextype |
|
|
|
|
|
|
|
|
Token's types for parser "pg_catalog.default"
|
|
|
|
Token name | Description
|
|
|
|
--------------+-----------------------------------
|
|
|
|
blank | Space symbols
|
|
|
|
email | Email
|
|
|
|
entity | HTML Entity
|
|
|
|
file | File or path name
|
|
|
|
float | Decimal notation
|
|
|
|
host | Host
|
|
|
|
hword | Hyphenated word
|
|
|
|
int | Signed integer
|
|
|
|
lhword | Latin hyphenated word
|
|
|
|
lpart_hword | Latin part of hyphenated word
|
|
|
|
lword | Latin word
|
|
|
|
nlhword | Non-latin hyphenated word
|
|
|
|
nlpart_hword | Non-latin part of hyphenated word
|
|
|
|
nlword | Non-latin word
|
|
|
|
part_hword | Part of hyphenated word
|
|
|
|
protocol | Protocol head
|
|
|
|
sfloat | Scientific notation
|
|
|
|
tag | HTML Tag
|
|
|
|
uint | Unsigned integer
|
|
|
|
uri | URI
|
|
|
|
url | URL
|
|
|
|
version | VERSION
|
|
|
|
word | Word
|
|
|
|
(23 rows)
|
|
|
|
</programlisting>
|
2007-08-29 10:37:04 +08:00
|
|
|
</para>
|
2007-08-22 05:08:47 +08:00
|
|
|
</listitem>
|
|
|
|
</varlistentry>
|
|
|
|
|
|
|
|
</variablelist>
|
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
</sect1>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<sect1 id="textsearch-debugging">
|
2007-08-30 04:37:14 +08:00
|
|
|
<title>Debugging</title>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<para>
|
|
|
|
Function <function>ts_debug</function> allows easy testing of your full text searching
|
|
|
|
configuration.
|
|
|
|
</para>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
2007-08-29 10:37:04 +08:00
|
|
|
<synopsis>
|
|
|
|
ts_debug(<optional><replaceable class="PARAMETER">config_name</replaceable></optional>, <replaceable class="PARAMETER">document</replaceable> TEXT) returns SETOF ts_debug
|
|
|
|
</synopsis>
|
|
|
|
|
|
|
|
<para>
|
|
|
|
<function>ts_debug</> displays information about every token of
|
|
|
|
<replaceable class="PARAMETER">document</replaceable> as produced by the
|
|
|
|
parser and processed by the configured dictionaries using the configuration
|
|
|
|
specified by <replaceable class="PARAMETER">config_name</replaceable>.
|
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
|
|
|
<replaceable class="PARAMETER">ts_debug</replaceable> type defined as:
|
2007-08-22 05:08:47 +08:00
|
|
|
|
|
|
|
<programlisting>
|
2007-08-22 12:45:20 +08:00
|
|
|
CREATE TYPE ts_debug AS (
|
|
|
|
"Alias" text,
|
|
|
|
"Description" text,
|
|
|
|
"Token" text,
|
|
|
|
"Dictionaries" regdictionary[],
|
|
|
|
"Lexized token" text
|
|
|
|
);
|
2007-08-22 05:08:47 +08:00
|
|
|
</programlisting>
|
2007-08-29 10:37:04 +08:00
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
|
|
|
For a demonstration of how function <function>ts_debug</function> works we
|
|
|
|
first create a <literal>public.english</literal> configuration and
|
|
|
|
ispell dictionary for the English language. You can skip the test step and
|
|
|
|
play with the standard <literal>english</literal> configuration.
|
|
|
|
</para>
|
2007-08-22 05:08:47 +08:00
|
|
|
|
|
|
|
<programlisting>
|
2007-08-22 12:45:20 +08:00
|
|
|
CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );
|
|
|
|
|
2007-08-25 14:26:57 +08:00
|
|
|
CREATE TEXT SEARCH DICTIONARY english_ispell (
|
2007-08-22 12:45:20 +08:00
|
|
|
TEMPLATE = ispell,
|
2007-08-25 14:26:57 +08:00
|
|
|
DictFile = english,
|
|
|
|
AffFile = english,
|
2007-08-22 12:45:20 +08:00
|
|
|
StopWords = english
|
|
|
|
);
|
|
|
|
|
|
|
|
ALTER TEXT SEARCH CONFIGURATION public.english
|
2007-08-29 10:37:04 +08:00
|
|
|
ALTER MAPPING FOR lword WITH english_ispell, english_stem;
|
2007-08-22 05:08:47 +08:00
|
|
|
</programlisting>
|
|
|
|
|
|
|
|
<programlisting>
|
|
|
|
SELECT * FROM ts_debug('public.english','The Brightest supernovaes');
|
|
|
|
Alias | Description | Token | Dicts list | Lexized token
|
|
|
|
-------+---------------+-------------+---------------------------------------+---------------------------------
|
2007-08-25 14:26:57 +08:00
|
|
|
lword | Latin word | The | {public.english_ispell,pg_catalog.english_stem} | public.english_ispell: {}
|
2007-08-22 05:08:47 +08:00
|
|
|
blank | Space symbols | | |
|
2007-08-25 14:26:57 +08:00
|
|
|
lword | Latin word | Brightest | {public.english_ispell,pg_catalog.english_stem} | public.english_ispell: {bright}
|
2007-08-22 05:08:47 +08:00
|
|
|
blank | Space symbols | | |
|
2007-08-25 14:26:57 +08:00
|
|
|
lword | Latin word | supernovaes | {public.english_ispell,pg_catalog.english_stem} | pg_catalog.english_stem: {supernova}
|
2007-08-22 05:08:47 +08:00
|
|
|
(5 rows)
|
|
|
|
</programlisting>
|
2007-08-29 10:37:04 +08:00
|
|
|
|
|
|
|
<para>
|
|
|
|
In this example, the word <literal>Brightest</> was recognized by a
|
|
|
|
parser as a <literal>Latin word</literal> (alias <literal>lword</literal>)
|
|
|
|
and came through the dictionaries <literal>public.english_ispell</> and
|
|
|
|
<literal>pg_catalog.english_stem</literal>. It was recognized by
|
|
|
|
<literal>public.english_ispell</literal>, which reduced it to the noun
|
|
|
|
<literal>bright</literal>. The word <literal>supernovaes</literal> is unknown
|
|
|
|
by the <literal>public.english_ispell</literal> dictionary so it was passed to
|
|
|
|
the next dictionary, and, fortunately, was recognized (in fact,
|
|
|
|
<literal>public.english_stem</literal> is a stemming dictionary and recognizes
|
|
|
|
everything; that is why it was placed at the end of the dictionary stack).
|
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
|
|
|
The word <literal>The</literal> was recognized by <literal>public.english_ispell</literal>
|
|
|
|
dictionary as a stop word (<xref linkend="textsearch-stopwords">) and will not be indexed.
|
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
|
|
|
You can always explicitly specify which columns you want to see:
|
|
|
|
|
2007-08-22 05:08:47 +08:00
|
|
|
<programlisting>
|
|
|
|
SELECT "Alias", "Token", "Lexized token"
|
|
|
|
FROM ts_debug('public.english','The Brightest supernovaes');
|
|
|
|
Alias | Token | Lexized token
|
|
|
|
-------+-------------+---------------------------------
|
2007-08-25 14:26:57 +08:00
|
|
|
lword | The | public.english_ispell: {}
|
2007-08-22 05:08:47 +08:00
|
|
|
blank | |
|
2007-08-25 14:26:57 +08:00
|
|
|
lword | Brightest | public.english_ispell: {bright}
|
2007-08-22 05:08:47 +08:00
|
|
|
blank | |
|
2007-08-25 14:26:57 +08:00
|
|
|
lword | supernovaes | pg_catalog.english_stem: {supernova}
|
2007-08-22 05:08:47 +08:00
|
|
|
(5 rows)
|
|
|
|
</programlisting>
|
2007-08-29 10:37:04 +08:00
|
|
|
</para>
|
|
|
|
|
|
|
|
</sect1>
|
|
|
|
|
|
|
|
<sect1 id="textsearch-rule-dictionary-example">
|
2007-08-30 04:37:14 +08:00
|
|
|
<title>Example of Creating a Rule-Based Dictionary</title>
|
2007-08-29 10:37:04 +08:00
|
|
|
|
|
|
|
<para>
|
|
|
|
The motivation for this example dictionary is to control the indexing of
|
|
|
|
integers (signed and unsigned), and, consequently, to minimize the number
|
|
|
|
of unique words which greatly affects to performance of searching.
|
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
|
|
|
The dictionary accepts two options:
|
|
|
|
<itemizedlist spacing="compact" mark="bullet">
|
|
|
|
|
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
The <LITERAL>MAXLEN</literal> parameter specifies the maximum length of the
|
|
|
|
number considered as a 'good' integer. The default value is 6.
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
|
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
The <LITERAL>REJECTLONG</LITERAL> parameter specifies if a 'long' integer
|
|
|
|
should be indexed or treated as a stop word. If
|
|
|
|
<literal>REJECTLONG</literal>=<LITERAL>FALSE</LITERAL> (default),
|
|
|
|
the dictionary returns the prefixed part of the integer with length
|
|
|
|
<LITERAL>MAXLEN</literal>. If
|
|
|
|
<LITERAL>REJECTLONG</LITERAL>=<LITERAL>TRUE</LITERAL>, the dictionary
|
|
|
|
considers a long integer as a stop word.
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
|
|
|
|
</itemizedlist>
|
|
|
|
|
|
|
|
</para>

<para>
A similar idea can be applied to the indexing of decimal numbers, for
example, in the <literal>DecDict</literal> dictionary. The dictionary
accepts two options: the <literal>MAXLENFRAC</literal> parameter specifies
the maximum length of the fractional part considered as a 'good' decimal.
The default value is 3. The <literal>REJECTLONG</literal> parameter
controls whether a decimal number with a 'long' fractional part should be indexed
or treated as a stop word. If
<literal>REJECTLONG</literal>=<literal>FALSE</literal> (default),
the dictionary returns the decimal number with the length of its fractional part
truncated to <literal>MAXLENFRAC</literal>. If
<literal>REJECTLONG</literal>=<literal>TRUE</literal>, the dictionary
considers the number as a stop word. Notice that
<literal>REJECTLONG</literal>=<literal>FALSE</literal> allows the indexing
of 'shortened' numbers and search results will contain documents with
shortened numbers.
</para>

<para>
Examples:

<programlisting>
SELECT ts_lexize('intdict', 11234567890);
 ts_lexize
-----------
 {112345}
</programlisting>
</para>

<para>
Now, we want to ignore long integers:

<programlisting>
ALTER TEXT SEARCH DICTIONARY intdict (
    MAXLEN = 6, REJECTLONG = TRUE
);

SELECT ts_lexize('intdict', 11234567890);
 ts_lexize
-----------
 {}
</programlisting>
</para>

<para>
Create a <filename>contrib/dict_intdict</> directory with the files
<filename>dict_tmpl.c</>, <filename>Makefile</>, and <filename>dict_intdict.sql.in</>,
then build and install the module:

<programlisting>
$ make &amp;&amp; make install
$ psql DBNAME &lt; dict_intdict.sql
</programlisting>
</para>

<para>
This is the <filename>dict_tmpl.c</> file:
</para>

<programlisting>
#include "postgres.h"
#include "utils/builtins.h"
#include "fmgr.h"

#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif

#include "utils/ts_locale.h"
#include "utils/ts_public.h"
#include "utils/ts_utils.h"

typedef struct {
    int     maxlen;
    bool    rejectlong;
} DictInt;


PG_FUNCTION_INFO_V1(dinit_intdict);
Datum dinit_intdict(PG_FUNCTION_ARGS);

Datum
dinit_intdict(PG_FUNCTION_ARGS) {
    DictInt *d = (DictInt*)malloc( sizeof(DictInt) );
    Map *cfg, *pcfg;
    text *in;

    if (!d)
        elog(ERROR, "No memory");
    memset(d, 0, sizeof(DictInt));

    /* Your INIT code */
    /* defaults */
    d->maxlen = 6;
    d->rejectlong = false;

    if (PG_ARGISNULL(0) || PG_GETARG_POINTER(0) == NULL)    /* no options */
        PG_RETURN_POINTER(d);

    in = PG_GETARG_TEXT_P(0);
    parse_keyvalpairs(in, &amp;cfg);
    PG_FREE_IF_COPY(in, 0);
    pcfg = cfg;

    while (pcfg->key)
    {
        if (strcasecmp("MAXLEN", pcfg->key) == 0)
            d->maxlen = atoi(pcfg->value);
        else if (strcasecmp("REJECTLONG", pcfg->key) == 0)
        {
            if (strcasecmp("true", pcfg->value) == 0)
                d->rejectlong = true;
            else if (strcasecmp("false", pcfg->value) == 0)
                d->rejectlong = false;
            else
                elog(ERROR, "Unknown value: %s => %s", pcfg->key, pcfg->value);
        }
        else
            elog(ERROR, "Unknown option: %s => %s", pcfg->key, pcfg->value);

        pfree(pcfg->key);
        pfree(pcfg->value);
        pcfg++;
    }
    pfree(cfg);

    PG_RETURN_POINTER(d);
}

PG_FUNCTION_INFO_V1(dlexize_intdict);
Datum dlexize_intdict(PG_FUNCTION_ARGS);
Datum
dlexize_intdict(PG_FUNCTION_ARGS)
{
    DictInt    *d = (DictInt*)PG_GETARG_POINTER(0);
    char       *in = (char*)PG_GETARG_POINTER(1);
    char       *txt = pnstrdup(in, PG_GETARG_INT32(2));
    TSLexeme   *res = palloc(sizeof(TSLexeme) * 2);

    /* Your LEXIZE dictionary code */
    res[1].lexeme = NULL;

    if (PG_GETARG_INT32(2) > d->maxlen)
    {
        if (d->rejectlong)
        {   /* stop word: return void array */
            pfree(txt);
            res[0].lexeme = NULL;
        }
        else
        {   /* cut integer */
            txt[d->maxlen] = '\0';
            res[0].lexeme = txt;
        }
    }
    else
        res[0].lexeme = txt;

    PG_RETURN_POINTER(res);
}
</programlisting>

<para>
This is the <filename>Makefile</filename>:

<programlisting>
subdir = contrib/dict_intdict
top_builddir = ../..
include $(top_builddir)/src/Makefile.global

MODULE_big = dict_intdict
OBJS = dict_tmpl.o
DATA_built = dict_intdict.sql
DOCS =

include $(top_srcdir)/contrib/contrib-global.mk
</programlisting>
</para>
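
<para>
If you would rather build the module outside the PostgreSQL source tree, a
PGXS-style <filename>Makefile</filename> can be used instead.  This is only a
sketch; it assumes the <command>pg_config</command> of the target server is in
your <envar>PATH</envar>:

<programlisting>
MODULE_big = dict_intdict
OBJS = dict_tmpl.o
DATA_built = dict_intdict.sql

# PGXS locates the installed server's build infrastructure via pg_config
PGXS := $(shell pg_config --pgxs)
include $(PGXS)
</programlisting>
</para>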

<para>
This is the <filename>dict_intdict.sql.in</filename> file:

<programlisting>
SET default_text_search_config = 'english';

BEGIN;

CREATE OR REPLACE FUNCTION dinit_intdict(internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE 'C';

CREATE OR REPLACE FUNCTION dlexize_intdict(internal, internal, internal, internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE 'C'
WITH (isstrict);

CREATE TEXT SEARCH TEMPLATE intdict_template (
    LEXIZE = dlexize_intdict,
    INIT = dinit_intdict
);

CREATE TEXT SEARCH DICTIONARY intdict (
    TEMPLATE = intdict_template,
    MAXLEN = 6,
    REJECTLONG = false
);

COMMENT ON TEXT SEARCH DICTIONARY intdict IS 'Dictionary for Integers';

END;
</programlisting>
</para>
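
<para>
Once the dictionary is installed, its behavior can be checked with
<function>ts_lexize</function>.  A minimal sanity check; the output shown
assumes the <literal>MAXLEN = 6</literal> and <literal>REJECTLONG =
false</literal> settings used above:

<programlisting>
SELECT ts_lexize('intdict', '12345678');
 ts_lexize
-----------
 {123456}
</programlisting>

To actually use the dictionary, map it to the integer token types of some
configuration, for example:

<programlisting>
ALTER TEXT SEARCH CONFIGURATION english
    ADD MAPPING FOR int, uint WITH intdict;
</programlisting>
</para>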

</sect1>

<sect1 id="textsearch-parser-example">
<title>Example of Creating a Parser</title>

<para>
The <acronym>SQL</acronym> command <literal>CREATE TEXT SEARCH PARSER</literal>
creates a parser for full text searching.  In our example we will implement
a simple parser that recognizes space-delimited words and produces only two
token types (3, word, Word; 12, blank, Space symbols).  These identifiers
were chosen for compatibility with the default <function>headline()</function>
function, since we do not implement a version of our own.
</para>

<para>
To implement a parser one needs to create a minimum of four functions.
</para>

<variablelist>

<varlistentry>
<term>
<synopsis>
START = <replaceable class="PARAMETER">start_function</replaceable>
</synopsis>
</term>
<listitem>
<para>
Initialize the parser.  Arguments are a pointer to the parsed text and its
length.
</para>
<para>
Returns a pointer to the internal structure of the parser.  Note that it
should be <function>malloc</>ed or <function>palloc</>ed in
<literal>TopMemoryContext</>.  We name it <literal>ParserState</>.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term>
<synopsis>
GETTOKEN = <replaceable class="PARAMETER">gettoken_function</replaceable>
</synopsis>
</term>
<listitem>
<para>
Returns the next token.
Arguments are <literal>ParserState *, char **, int *</literal>.
</para>
<para>
This procedure will be called repeatedly until it returns token type zero,
which signals that parsing is complete.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term>
<synopsis>
END = <replaceable class="PARAMETER">end_function</replaceable>
</synopsis>
</term>
<listitem>
<para>
This void function will be called after parsing is finished to free the
resources allocated by the start function (the <literal>ParserState</>
structure).  The argument is <literal>ParserState *</literal>.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term>
<synopsis>
LEXTYPES = <replaceable class="PARAMETER">lextypes_function</replaceable>
</synopsis>
</term>
<listitem>
<para>
Returns an array containing the id, alias, and description of each token
type the parser can produce.  See <structname>LexDescr</structname> in
<filename>src/include/utils/ts_public.h</>.
</para>
</listitem>
</varlistentry>

</variablelist>

<para>
Below is the source code of our test parser, organized as a <filename>contrib</> module.
</para>

<para>
Testing:

<programlisting>
SELECT * FROM ts_parse('testparser','That''s my first own parser');
 tokid | token
-------+--------
     3 | That's
    12 |
     3 | my
    12 |
     3 | first
    12 |
     3 | own
    12 |
     3 | parser

SELECT to_tsvector('testcfg','That''s my first own parser');
                   to_tsvector
-------------------------------------------------
 'my':2 'own':4 'first':3 'parser':5 'that''s':1

SELECT ts_headline('testcfg','Supernovae stars are the brightest phenomena in galaxies', to_tsquery('testcfg', 'star'));
                            headline
-----------------------------------------------------------------
 Supernovae <b>stars</b> are the brightest phenomena in galaxies
</programlisting>
</para>
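
<para>
The token types reported by the parser's <literal>LEXTYPES</literal> function
can also be inspected directly.  A minimal check, assuming the module has been
installed as described below:

<programlisting>
SELECT * FROM ts_token_type('testparser');
 tokid | alias |  description
-------+-------+---------------
     3 | word  | Word
    12 | blank | Space symbols
</programlisting>
</para>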

<para>
This test parser is an example adapted from a tutorial by Valli, <ulink
url="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/HOWTO-parser-tsearch2.html">parser
HOWTO</ulink>.
</para>

<para>
To compile and install the example, just do:

<programlisting>
$ make
$ make install
$ psql regression < test_parser.sql
</programlisting>
</para>
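
<para>
Since the <filename>Makefile</filename> shown below also supports
<literal>USE_PGXS</literal>, the module can alternatively be built against an
already installed server instead of inside the source tree.  A sketch,
assuming <command>pg_config</command> is in your <envar>PATH</envar>:

<programlisting>
$ make USE_PGXS=1
$ make USE_PGXS=1 install
</programlisting>
</para>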

<para>
This is the <filename>test_parser.c</> file:

<programlisting>
#include "postgres.h"
#include "fmgr.h"

#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif

/*
 * types
 */

/* self-defined type */
typedef struct {
    char       *buffer;     /* text to parse */
    int         len;        /* length of the text in buffer */
    int         pos;        /* position of the parser */
} ParserState;

/* copy-paste from wparser.h of tsearch2 */
typedef struct {
    int         lexid;
    char       *alias;
    char       *descr;
} LexDescr;

/*
 * prototypes
 */
PG_FUNCTION_INFO_V1(testprs_start);
Datum testprs_start(PG_FUNCTION_ARGS);

PG_FUNCTION_INFO_V1(testprs_getlexeme);
Datum testprs_getlexeme(PG_FUNCTION_ARGS);

PG_FUNCTION_INFO_V1(testprs_end);
Datum testprs_end(PG_FUNCTION_ARGS);

PG_FUNCTION_INFO_V1(testprs_lextype);
Datum testprs_lextype(PG_FUNCTION_ARGS);

/*
 * functions
 */
Datum testprs_start(PG_FUNCTION_ARGS)
{
    ParserState *pst = (ParserState *) palloc(sizeof(ParserState));

    pst->buffer = (char *) PG_GETARG_POINTER(0);
    pst->len = PG_GETARG_INT32(1);
    pst->pos = 0;

    PG_RETURN_POINTER(pst);
}

Datum testprs_getlexeme(PG_FUNCTION_ARGS)
{
    ParserState *pst = (ParserState *) PG_GETARG_POINTER(0);
    char       **t = (char **) PG_GETARG_POINTER(1);
    int         *tlen = (int *) PG_GETARG_POINTER(2);
    int          type;

    *tlen = pst->pos;
    *t = pst->buffer + pst->pos;

    /* check the position before touching the buffer, to avoid reading past the end */
    if (pst->pos < pst->len && (pst->buffer)[pst->pos] == ' ')
    {
        /* blank type */
        type = 12;
        /* go to the next non-white-space character */
        while (pst->pos < pst->len &&
               (pst->buffer)[pst->pos] == ' ')
            (pst->pos)++;
    }
    else
    {
        /* word type */
        type = 3;
        /* go to the next white-space character */
        while (pst->pos < pst->len &&
               (pst->buffer)[pst->pos] != ' ')
            (pst->pos)++;
    }

    *tlen = pst->pos - *tlen;

    /* we are finished if (*tlen == 0) */
    if (*tlen == 0)
        type = 0;

    PG_RETURN_INT32(type);
}

Datum testprs_end(PG_FUNCTION_ARGS)
{
    ParserState *pst = (ParserState *) PG_GETARG_POINTER(0);

    pfree(pst);
    PG_RETURN_VOID();
}

Datum testprs_lextype(PG_FUNCTION_ARGS)
{
    /*
     * Remarks:
     * - we have to return the blanks for headline reasons
     * - we use the same lexids as the default word parser;
     *   this way we can reuse its headline function
     */
    LexDescr   *descr = (LexDescr *) palloc(sizeof(LexDescr) * (2 + 1));

    /* there are only two types in this parser */
    descr[0].lexid = 3;
    descr[0].alias = pstrdup("word");
    descr[0].descr = pstrdup("Word");
    descr[1].lexid = 12;
    descr[1].alias = pstrdup("blank");
    descr[1].descr = pstrdup("Space symbols");
    descr[2].lexid = 0;

    PG_RETURN_POINTER(descr);
}
</programlisting>

This is the <filename>Makefile</filename>:

<programlisting>
override CPPFLAGS := -I. $(CPPFLAGS)

MODULE_big = test_parser
OBJS = test_parser.o

DATA_built = test_parser.sql
DATA =
DOCS = README.test_parser
REGRESS = test_parser

ifdef USE_PGXS
PGXS := $(shell pg_config --pgxs)
include $(PGXS)
else
subdir = contrib/test_parser
top_builddir = ../..
include $(top_builddir)/src/Makefile.global
include $(top_srcdir)/contrib/contrib-global.mk
endif
</programlisting>

This is the <filename>test_parser.sql.in</filename> file:

<programlisting>
SET default_text_search_config = 'english';

BEGIN;

CREATE FUNCTION testprs_start(internal, int4)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE 'C' WITH (isstrict);

CREATE FUNCTION testprs_getlexeme(internal, internal, internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE 'C' WITH (isstrict);

CREATE FUNCTION testprs_end(internal)
RETURNS void
AS 'MODULE_PATHNAME'
LANGUAGE 'C' WITH (isstrict);

CREATE FUNCTION testprs_lextype(internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE 'C' WITH (isstrict);

CREATE TEXT SEARCH PARSER testparser (
    START = testprs_start,
    GETTOKEN = testprs_getlexeme,
    END = testprs_end,
    LEXTYPES = testprs_lextype
);

CREATE TEXT SEARCH CONFIGURATION testcfg (PARSER = testparser);
ALTER TEXT SEARCH CONFIGURATION testcfg ADD MAPPING FOR word WITH simple;

END;
</programlisting>
</para>

</sect1>

</chapter>