<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<link type="text/css" rel="stylesheet" href="/~megera/postgres/gist/tsearch/tsearch.css">
<title>tsearch2 guide</title>
</head>
<body>
<h1 align=center>The tsearch2 Guide</h1>

<p align=center>
Brandon Craig Rhodes<br>30 June 2003
<p>
This Guide introduces the reader to the PostgreSQL tsearch2 module,
version 2.
More formal descriptions of the module's types and functions
are provided in the <a href="tsearch2-ref.html">tsearch2 Reference</a>,
which is a companion to this document.
You can retrieve a beta copy of the tsearch2 module from the
<a href="http://www.sai.msu.su/~megera/postgres/gist/">GiST for PostgreSQL</a>
page — look under the section entitled <i>Development History</i>
for the current version.
<p>
First we will examine the <tt>tsvector</tt> and <tt>tsquery</tt> types
and how they are used to search documents;
next, we will use them to build a simple search engine in SQL;
and finally, we will study the internals of document conversion
and how you might tune the internals to accommodate various searching needs.
<p>
Once you have tsearch2 working with PostgreSQL,
you should be able to run the examples here exactly as they are typed.
<p>
<hr>
<h2>Table of Contents</h2>
<blockquote>
<a href="#vectors_queries">Vectors and Queries</a><br>
<a href="#simple_search">A Simple Search Engine</a><br>
<a href="#weights">Ranking and Position Weights</a><br>
<a href="#casting">Casting Vectors and Queries</a><br>
<a href="#parsing_lexing">Parsing and Lexing</a><br>
</blockquote>

<hr>

<h2><a name=vectors_queries>Vectors and Queries</a></h2>

<blockquote>
<i>This section introduces
the two data types upon which tsearch2 search engines are based,
and illustrates their interaction using the simplest possible case.
The complex examples we present later on
are merely variations and elaborations of this basic mechanism.</i>
</blockquote>
<p>
The tsearch2 module allows you to index documents by the words they contain,
and then perform very efficient searches
for documents that contain a given combination of words.
Preparing your document index involves two steps:
<ul>
<li><b>Making a list of the words each document contains.</b>
You must reduce each document to a <tt>tsvector</tt>
which lists each word that appears in the document.
This process offers many options,
because there is no requirement
that you copy words into the vector
exactly as they appear in the document.
For example,
many developers omit frequent and content-free <b>stop words</b>
like <i>the</i> to reduce the size of their index;
others reduce different forms of the same word
(<i>forked</i>, <i>forking</i>, <i>forks</i>)
to a common form (<i>fork</i>)
to make search results independent of tense and case.
Because words are very often stored in a modified form,
we use the special term <b>lexemes</b>
for the word forms we actually store in the vector.
<li><b>Creating an index of the documents by lexeme.</b>
This is managed automatically by tsearch2
when you create a <tt>gist()</tt> index
on the <tt>tsvector</tt> column of a table;
such an index implements a form of the Berkeley
<a href="http://gist.cs.berkeley.edu/"><i>Generalized Search Tree</i></a>.
</ul>
Once your documents are indexed,
performing a search involves:
<ul>
<li><b>Reducing the search terms to lexemes.</b>
You must express each search you want to perform
as a <tt>tsquery</tt> specifying a boolean combination of lexemes.
Note that tsearch2 only finds <i>exact</i> matches
between the lexemes in your query and the ones in each vector —
even capitalization counts as a difference
(which is why all lexemes are usually kept lowercase).
So you must process search words the same way you processed document words;
if <i>forking</i> became <i>fork</i> in the document's <tt>tsvector</tt>,
then the search term <i>forking</i> must also become <i>fork</i>
or the search will not find the document.
<li><b>Retrieving the documents that match the query.</b>
Running a <tt>SELECT</tt> ... <tt>WHERE</tt>
<tt><i>query</i></tt> <tt>@@</tt> <tt><i>vector</i></tt>
on the table with the <tt><i>vector</i></tt> column
will return the documents that match your query.
<li><b>Presenting your results.</b>
This final stage offers as many options
as turning documents into vectors.
You can order documents by how well they matched the search terms;
create a headline for each document
showing some of the phrases in which it uses the search terms;
and restrict the number of results retrieved.
You will of course want some way to identify each document,
so the user can ask for the full text of the ones he wants to read.
</ul>
And beyond deciding upon rules for turning documents into vectors
and for presenting search results to users,
you have to decide <i>where</i> to perform these operations —
whether one database server
will parse documents, perform searches, and prepare search results,
or whether to spread the load of these operations across several machines.
These are complicated design issues
which we will explore later;
in this section and the next,
we will illustrate what can be accomplished
using a single database server.
<p>
The <tt>default</tt> tsearch2 configuration,
which we will learn more about later,
provides a good example of a process for reducing documents to vectors:

<pre>
=# <b>SELECT set_curcfg('default')</b>
=# <b>SELECT to_tsvector('The air smells of sea water.')</b>
             to_tsvector
-------------------------------------
 'air':2 'sea':5 'smell':3 'water':6
(1 row)
</pre>

Note the complex relationship between this document and its vector.
The vector lists only words from the document —
spaces and punctuation have disappeared.
Common words like <i>the</i> and <i>of</i> have been eliminated.
The <i>-s</i> that makes <i>smells</i> a plural has been removed,
leaving a lexeme that represents the word in its simplest form.
And finally,
though the vector remembers the positions in which each word appeared,
it does not store the lexemes in that order.
<p>
Keeping word positions in your vectors is optional, by the way.
The positions are necessary for the tsearch2 ranking functions,
which you can use to prioritize documents
based on how often each document uses the search terms
and whether they appear in close proximity.
But if you do not perform ranking,
or use your own process that ignores the word positions stored in the vector,
then you can save space by stripping them from your vectors:

<pre>
=# <b>SELECT strip(to_tsvector('The air smells of sea water.'))</b>
            strip
-----------------------------
 'air' 'sea' 'smell' 'water'
(1 row)
</pre>

Now that we have a procedure for creating vectors,
we can build an indexed table of vectors very simply:

<pre>
=# <b>CREATE TABLE vectors ( vector tsvector )</b>
=# <b>CREATE INDEX vector_index ON vectors USING gist(vector)</b>
=# <b>INSERT INTO vectors VALUES (to_tsvector('The path forks here'))</b>
=# <b>INSERT INTO vectors VALUES (to_tsvector('A crawl leads west'))</b>
=# <b>INSERT INTO vectors VALUES (to_tsvector('The left fork leads northeast'))</b>
=# <b>SELECT * FROM vectors</b>
                  vector
------------------------------------------
 'fork':3 'path':2
 'lead':3 'west':4 'crawl':2
 'fork':3 'lead':4 'left':2 'northeast':5
(3 rows)
</pre>

Now we can search this collection of document vectors
using the <tt>@@</tt> operator and a <tt>tsquery</tt>
that specifies the combination of lexemes we are looking for.
Note that while vectors simply list lexemes,
queries always combine them with the operators
‘<tt>&</tt>’ (and),
‘<tt>|</tt>’ (or),
and ‘<tt>!</tt>’ (not),
plus parentheses for grouping.
Some examples of the query syntax:
<table align=center>
<tr>
<td>‘find documents with the word <i>forks</i> in them’<br>
<td><tt>'forks'</tt>
<tr>
<td>‘... with both <i>forks</i> and <i>leads</i>’<br>
<td><tt>'forks & leads'</tt>
<tr>
<td>‘... with either <i>forks</i> or <i>leads</i>’<br>
<td><tt>'forks | leads'</tt>
<tr>
<td>‘... with either <i>forks</i> or <i>leads</i>,
but without <i>crawl</i>’<br>
<td><tt>'(forks|leads) & !crawl'</tt>
</table>
The tsearch2 module
provides a <tt>to_tsquery()</tt> function for creating queries
that uses the same process as <tt>to_tsvector()</tt> uses
to reduce words to lexemes.
For instance,
it will remove the <i>-s</i> from the plurals in the last example above:

<pre>
=# <b>SELECT to_tsquery('(leads|forks) & !crawl')</b>
           to_tsquery
--------------------------------
 ( 'lead' | 'fork' ) & !'crawl'
(1 row)
</pre>

Again,
this is critically important because the search operator <tt>@@</tt>
only finds <i>exact</i> matches
between the words in a query and the words in a vector;
if the document vector lists the lexeme <i>fork</i>
but the query looks for the plural form <i>forks</i>,
the query would not match that document.
Thanks to the symmetry between our process
for producing vectors and queries, however,
the above searches return correct results:

<pre>
=# <b>SELECT * FROM vectors WHERE vector @@ to_tsquery('(leads|forks) & !crawl')</b>
                  vector
------------------------------------------
 'fork':3 'path':2
 'fork':3 'lead':4 'left':2 'northeast':5
(2 rows)
</pre>

You may want to try the other queries shown above,
and perhaps invent some of your own.
<p>
You should not include stop words in a query,
since you cannot search for words you have discarded.
If you throw out the word <i>the</i> when building vectors, for example,
your index will obviously not know which documents included it.
The <tt>to_tsquery()</tt> function will automatically detect this
and give you a notice to prevent this mistake:

<pre>
=# <b>SELECT to_tsquery('the')</b>
NOTICE: Query contains only stopword(s) or doesn't contain lexem(s), ignored
 to_tsquery
------------

(1 row)
</pre>
But if you ever build vectors and queries using your own routines,
a possibility we will discuss later,
then you will need to enforce this rule yourself.
<blockquote><i>
Now that you understand how vectors and queries work together,
you are prepared to tackle many additional topics:
how to distribute searching across many servers;
how to customize the process
by which tsearch2 turns documents and queries into lexemes,
or use a process of your own;
and how to sort and display search results to your users.
But before discussing these detailed questions,
we will build a simple search engine
to see how easily its basic features work together.
</i></blockquote>

<h2><a name=simple_search>A Simple Search Engine</a></h2>

<blockquote><i>
In this section we build a simple search engine out of SQL functions
that use the vector and query types described in the previous section.
While this example is simpler
than a search engine that has to interface with the outside world,
it will illustrate the basic principles of building a search engine,
and better prepare you for developing your own.
</i></blockquote>
Building a search engine involves only a few improvements
upon the rudimentary vector searches described in the last section.
<ul>
<li>Because the user wants to read documents, not vectors,
you must provide some way
for the full text of each document to be accessed —
either by storing the entire text of each document in the database,
or storing an identifier
like a URL, file name, or document routing number
that lets you fetch the document from other storage.
<li>You can make it easier for user interface code to refer to each document
by providing a unique identifier for each document,
perhaps with a <tt>SERIAL</tt> column.
<li>Search results should be ordered by relevance.
If you leave word positions in your vectors,
you can either have PostgreSQL <tt>ORDER</tt> your results
<tt>BY</tt> a ranking function,
or you can fetch the vectors yourself and perform your own sort.
If you choose to ignore word positions or strip them from your vectors,
you will have to determine relevance yourself,
using either the full text of the document
or other information about each document you may possess.
<li>For each document returned by a search,
you will usually want to display a summary called a <i>headline</i>
that shows short excerpts
illustrating how the document uses the query words.
Headlines are usually generated from the full text of the document,
not from position information in the <tt>tsvector</tt>,
since excerpts lacking stop words, punctuation, and suffixes
would not be comprehensible.
If you store the full text of each document in the database,
headlines can be generated very simply by a tsearch2 function.
If you store your documents elsewhere,
then you will either have to transmit each document to the database
every time you want to run the headline function on it,
or use your own headline code outside of the database.
</ul>
<p>
We can easily construct a simple search engine
that accomplishes these goals.
First we build a table that, for each document,
stores a unique identifier, the full text of the document,
and its <tt>tsvector</tt>:

<pre>
=# <b>CREATE TABLE docs ( id SERIAL, doc TEXT, vector tsvector )</b>
=# <b>CREATE INDEX docs_index ON docs USING gist(vector);</b>
</pre>

Note that although searches will still work
on tables where you have neglected
to create a <tt>gist()</tt> index over your vectors,
they will run much more slowly
since they will have to compare the query
against every document vector in the table.
<p>
Because the table we have created
stores each document in two different ways —
both as text and as a vector —
our <tt>INSERT</tt> statements must provide the document in both forms.
While more advanced PostgreSQL programmers
might accomplish this with a database trigger or rule,
for this simple example we will use a small SQL function:

<pre>
=# <b>CREATE FUNCTION insdoc(text) RETURNS void LANGUAGE sql AS
     'INSERT INTO docs (doc, vector) VALUES ($1, to_tsvector($1));'</b>
</pre>
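
<p>
(If you do want the trigger approach mentioned above:
the tsearch2 module ships with a <tt>tsearch2()</tt> trigger function
that can keep a vector column up to date automatically.
A sketch of how it might be attached to our table;
adapt the trigger and column names to your own schema:)

<pre>
=# <b>CREATE TRIGGER docs_vector_update BEFORE INSERT OR UPDATE ON docs
     FOR EACH ROW EXECUTE PROCEDURE tsearch2(vector, doc)</b>
</pre>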
Now, by calling <tt>insdoc()</tt> several times,
we can populate our table with documents:

<pre>
=# <b>SELECT insdoc('A low crawl over cobbles leads inward to the west.')</b>
=# <b>SELECT insdoc('The canyon runs into a mass of boulders -- dead end.')</b>
=# <b>SELECT insdoc('You are crawling over cobbles in a low passage.')</b>
=# <b>SELECT insdoc('Cavernous passages lead east, north, and south.')</b>
=# <b>SELECT insdoc('To the east a low wide crawl slants up.')</b>
=# <b>SELECT insdoc('You are in the south side chamber.')</b>
=# <b>SELECT insdoc('The passage here is blocked by a recent cave-in.')</b>
=# <b>SELECT insdoc('You are in a splendid chamber thirty feet high.')</b>
</pre>

Now we can build a search function.
Its <tt>SELECT</tt> statement is based upon
the same <tt>@@</tt> operation illustrated in the previous section.
But instead of returning matching vectors,
we return for each document
its <tt>SERIAL</tt> identifier, so the user can retrieve it later;
a headline that illustrates its use of the search terms;
and a ranking with which we also order the results.
Our search operation can be coded as a single <tt>SELECT</tt> statement
returning its own kind of table row,
which we call a <tt>finddoc_t</tt>:

<pre>
=# <b>CREATE TYPE finddoc_t AS (id INTEGER, headline TEXT, rank REAL)</b>
=# <b>CREATE FUNCTION finddoc(text) RETURNS SETOF finddoc_t LANGUAGE sql AS '
     SELECT id, headline(doc, q), rank(vector, q)
     FROM docs, to_tsquery($1) AS q
     WHERE vector @@ q ORDER BY rank(vector, q) DESC'</b>
</pre>

This function is a rather satisfactory search engine.
Here is one example search,
after which the user fetches the top-ranking document itself;
with similar commands you can try queries of your own:

<pre>
=# <b>SELECT * FROM finddoc('passage|crawl')</b>
 id |                       headline                        | rank
----+-------------------------------------------------------+------
  3 | <b>crawling</b> over cobbles in a low <b>passage</b>. | 0.19
  1 | <b>crawl</b> over cobbles leads inward to the west.   | 0.1
  4 | <b>passages</b> lead east, north, and south.          | 0.1
  5 | <b>crawl</b> slants up.                               | 0.1
  7 | <b>passage</b> here is blocked by a recent cave-in.   | 0.1
(5 rows)
=# <b>SELECT doc FROM docs WHERE id = 3</b>
                       doc
-------------------------------------------------
 You are crawling over cobbles in a low passage.
(1 row)
</pre>
While by default the <tt>headline()</tt> function
surrounds matching words with <tt>&lt;b&gt;</tt> and <tt>&lt;/b&gt;</tt>
in order to distinguish them from the surrounding text,
you can provide options that change its behavior;
consult the tsearch2 Reference for more details about
<a href="tsearch2-ref.html#headlines">Headline Functions</a>.
<p>
Though a search may match hundreds or thousands of documents,
you will usually present only ten or twenty results to the user at a time.
This can be most easily accomplished
by limiting your query with a <tt>LIMIT</tt>
and an <tt>OFFSET</tt> clause —
to display results ten at a time, for example,
you would generate your first page of results
with <tt>LIMIT</tt> <tt>10</tt> <tt>OFFSET</tt> <tt>0</tt>,
your second page
with <tt>LIMIT</tt> <tt>10</tt> <tt>OFFSET</tt> <tt>10</tt>,
your third page
with <tt>LIMIT</tt> <tt>10</tt> <tt>OFFSET</tt> <tt>20</tt>,
and so forth.
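For example, with the <tt>finddoc()</tt> function defined above,
the second page of a ten-per-page display would be fetched like this
(our eight sample documents yield only one page of matches,
so this particular query would return no rows):

<pre>
=# <b>SELECT * FROM finddoc('passage|crawl') LIMIT 10 OFFSET 10</b>
</pre>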
There are two problems with this approach, however.
<p>
The first problem is the strain of running the query over again
for every page of results the user views.
For small document collections or lightly loaded servers,
this may not be a problem;
but the impact can be high
when a search must repeatedly rank and sort
the same ten thousand results
on an already busy server.
So instead of selecting only one page of results,
you will probably use <tt>LIMIT</tt> and <tt>OFFSET</tt>
to return a few dozen or a few hundred results,
which you can cache and display to the user one page at a time.
Whether a result cache rewards your effort
will depend principally on the behavior of your users —
how often they even view the second page of results, for instance.
<p>
The second issue solved by caching involves consistency.
If the database is changing while the user browses their results,
then documents might appear and disappear as they page through them.
In some cases the user might even miss a particular result —
perhaps the one they were looking for —
if, say, its rank improves from 31st to 30th
after they load results 21–30 but before they view results 31–40.
While many databases are static or infrequently updated,
and will not present this problem,
users searching very dynamic document collections
might benefit from the stable results that caches yield.
<blockquote><i>
Having seen the features of a search engine
implemented entirely within the database,
we will learn about some specific tsearch2 features.
First we will look in more detail at document ranking.
</i></blockquote>

<h2><a name=weights>Ranking and Position Weights</a></h2>

<blockquote><i>
When we built our simple search engine,
we used the </i><tt>rank()</tt><i> function to order our results.
Here we describe tsearch2 ranking in more detail.
</i></blockquote>

There are two functions with which tsearch2 can rank search results.
They both use the lexeme positions listed in the <tt>tsvector</tt>,
so you cannot rank vectors
from which these have been removed with <tt>strip()</tt>.
The <tt>rank()</tt> function existed in older versions of OpenFTS,
and has the feature that you can assign different weights
to words from different sections of your document.
The <tt>rank_cd()</tt> function uses a recent technique for weighting results,
but does not allow different weights to be given
to different sections of your document.
<p>
Both ranking functions allow you to specify,
as an optional last argument,
whether you want their results <i>normalized</i> —
whether the rank returned should be adjusted for document length.
Specifying a last argument of <tt>0</tt> (zero) makes no adjustment;
<tt>1</tt> (one) divides the document rank
by the logarithm of the document length;
and <tt>2</tt> divides it by the plain length.
In all of these examples we omit this optional argument,
which is the same as specifying zero —
we are making no adjustment for document length.
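If you did want, say, logarithmic normalization,
the call would look like this (shown only for illustration;
it is not used in the examples that follow):

<pre>
=# <b>SELECT id, rank(vector, q, 1) FROM docs, to_tsquery('passage|crawl') AS q
     WHERE vector @@ q</b>
</pre>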
<p>
The <tt>rank_cd()</tt> function uses an experimental measurement
called <i>cover density ranking</i> that rewards documents
for using the search terms frequently
and close together in the document.
You can read about the algorithm in more detail
in Clarke et al.,
“<a href="http://citeseer.nj.nec.com/clarke00relevance.html"
>Relevance Ranking for One to Three Term Queries</a>.”
An optional first argument allows you to tune their formula;
for details
see the <a href="tsearch2-ref.html#ranking">section on ranking</a>
in the Reference.
<p>
The <tt>rank()</tt> function offers more flexibility
because it pays attention to the <i>weights</i>
with which you have labelled lexeme positions.
Currently tsearch2 supports four different weight labels:
<tt>'D'</tt>, the default weight;
and <tt>'A'</tt>, <tt>'B'</tt>, and <tt>'C'</tt>.
All vectors created with <tt>to_tsvector()</tt>
assign the weight <tt>'D'</tt> to each position,
which as the default is not displayed when you print a vector out.
<p>
If you want positions with weights other than <tt>'D'</tt>,
you have two options:
either you can author a vector directly through the <tt>::tsvector</tt>
casting operation,
as described in the following section,
which lets you give each position whichever weight you want;
or you can pass a vector through the <tt>setweight()</tt> function
which sets all of its position weights to a single value.
An example of the latter:

<pre>
=# <b>SELECT vector FROM docs WHERE id = 3</b>
                 vector
----------------------------------------
 'low':8 'cobbl':5 'crawl':3 'passag':9
(1 row)
=# <b>SELECT setweight(vector, 'A') FROM docs WHERE id = 3</b>
                 setweight
--------------------------------------------
 'low':8A 'cobbl':5A 'crawl':3A 'passag':9A
(1 row)
</pre>
Merely changing all of the weights in a vector is not very useful,
of course,
since all words still end up with the same weight.
But if we parse different parts of a document separately,
giving each section its own weight,
and then concatenate the vectors of each part into a single vector,
the result can be very useful.
We can construct a simple example
in which document titles are given greater weight
than text in the body of the document:
<pre>
=# <b>CREATE TABLE tdocs ( id SERIAL, title TEXT, doc TEXT, vector tsvector )</b>
=# <b>CREATE INDEX tdocs_index ON tdocs USING gist(vector);</b>
=# <b>CREATE FUNCTION instdoc(text, text) RETURNS void LANGUAGE sql AS
     'INSERT INTO tdocs (title, doc, vector)
      VALUES ($1, $2, setweight(to_tsvector($1), ''A'') || to_tsvector($2));'</b>
</pre>

Now words from a document title will be weighted differently
than those in the main text
if we provide the title and body as separate arguments:

<pre>
=# <b>SELECT instdoc('Spendid Chamber',
     'The walls are frozen rivers of orange stone.')</b>
 instdoc
---------

(1 row)
=# <b>SELECT vector FROM tdocs</b>
                                    vector
------------------------------------------------------------------------------
 'wall':4 'orang':9 'river':7 'stone':10 'frozen':6 'chamber':2A 'spendid':1A
(1 row)
</pre>
Note that, although the need rarely arises,
you can constrain search terms
to only match words from certain sections
by following them with a colon
and a list of the sections in which the word can occur;
by default this list is <tt>'ABCD'</tt>
so that search terms match words from all sections.
For example,
here we search for a word both generally,
and then looking only for specific weights:

<pre>
=# <b>SELECT title, doc FROM tdocs WHERE vector @@ to_tsquery('spendid')</b>
      title      |                     doc
-----------------+----------------------------------------------
 Spendid Chamber | The walls are frozen rivers of orange stone.
(1 row)
=# <b>SELECT title, doc FROM tdocs WHERE vector @@ to_tsquery('spendid:A')</b>
      title      |                     doc
-----------------+----------------------------------------------
 Spendid Chamber | The walls are frozen rivers of orange stone.
(1 row)
=# <b>SELECT title, doc FROM tdocs WHERE vector @@ to_tsquery('spendid:D')</b>
 title | doc
-------+-----
(0 rows)
</pre>

<blockquote><i>
Our examples so far use tsearch2 to parse our documents into vectors.
When your application needs absolute control over vector content,
you will want to use direct type casting,
which is described in the next section.
</i></blockquote>

<h2><a name=casting>Casting Vectors and Queries</a></h2>

<blockquote><i>
While tsearch2 has powerful and flexible ways
to process documents and turn them into document vectors,
you will sometimes want to parse documents on your own
and place the results directly in vectors.
Here we show you how.
</i></blockquote>

In the preceding examples,
we used the <tt>to_tsvector()</tt> function
when we needed a document's text reduced to a document vector.
We saw that the function stripped whitespace and punctuation,
eliminated common words,
and altered suffixes to reduce words to a common form.
While these operations are often desirable,
and while in the sections below
we will gain precise control over this process,
there are occasions on which
you want to avoid the changes that <tt>to_tsvector()</tt> makes to text
and specify explicitly the words that you want in your vectors.
Or you may want to create queries directly
rather than through <tt>to_tsquery()</tt>.
<p>
For example,
you may have already developed your own routine
for reducing your documents to searchable lexemes,
and do not want your carefully generated terms altered
by passing them through <tt>to_tsvector()</tt>.
Or you might be developing and debugging parsing routines of your own
that you are not ready to load into the database.
In either case,
you will find that direct insertion is easily accomplished
if you follow a few simple rules.
<p>
Vectors are created directly
when you cast a string of whitespace-separated lexemes
to the <tt>tsvector</tt> type:

<pre>
=# <b>select 'the only exit is the way you came in'::tsvector</b>
                     tsvector
--------------------------------------------------
 'in' 'is' 'the' 'way' 'you' 'came' 'exit' 'only'
(1 row)
</pre>

Notice that the conversion interpreted the string
simply as a list of lexemes to be included in the vector.
Their order was lost,
as was the number of times each lexeme appeared.
You must keep in mind that directly creating vectors with casting
is <i>not</i> an alternate means of parsing;
it is a way of directly entering lexemes into a vector <i>without</i> parsing.
<p>
Queries can also be created through casting,
if you separate lexemes with boolean operators
rather than with whitespace.
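For example, here is a query built directly from three lexemes,
with no parsing or stemming performed
(a sketch; the printed form follows the conventions
of the <tt>to_tsquery()</tt> output shown earlier):

<pre>
=# <b>select 'fork & lead & !crawl'::tsquery</b>
          tsquery
----------------------------
 'fork' & 'lead' & !'crawl'
(1 row)
</pre>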
When creating your own vectors and queries,
remember that the search operator <tt>@@</tt>
finds only <i>exact</i> matches between query lexemes and vector lexemes
—
if they are not exactly the same string,
they will not be considered a match.
<p>
To include lexeme positions in your vector,
write the positions exactly the way tsearch2 displays them
when it prints vectors:
by following each lexeme with a colon
and a comma-separated list of integer positions.
If you list a lexeme more than once,
then all the positions listed for it are combined into a single list.
For example,
here are two ways of writing the same vector,
depending on whether you mention ‘<tt>the</tt>’ twice
or combine its positions into a list yourself:

<pre>
=# <b>select 'the:1 only:2 exit:3 is:4 the:5 way:6 you:7 came:8 in:9'::tsvector</b>
                              tsvector
--------------------------------------------------------------------
 'in':9 'is':4 'the':1,5 'way':6 'you':7 'came':8 'exit':3 'only':2
(1 row)
=# <b>select 'the:1,5 only:2 exit:3 is:4 way:6 you:7 came:8 in:9'::tsvector</b>
                              tsvector
--------------------------------------------------------------------
 'in':9 'is':4 'the':1,5 'way':6 'you':7 'came':8 'exit':3 'only':2
(1 row)
</pre>

Things can get slightly tricky
if you want to include apostrophes, backslashes, or spaces
inside your lexemes
(wanting to include the latter two would be unusual,
but they can be included if you follow the rules).
The main problem is that the apostrophe and backslash
are important <i>both</i> to PostgreSQL when it is interpreting a string,
<i>and</i> to the <tt>tsvector</tt> conversion function.
You may want to review section
<a href="http://www.postgresql.org/docs/view.php?version=7.3&idoc=0&file=sql-syntax.html#SQL-SYNTAX-STRINGS">1.1.2.1,
“String Constants”</a>
in the PostgreSQL documentation before proceeding.
<p>
When you cast strings directly into vectors:
<ul>
<li>The string is interpreted as a whitespace-separated list of lexemes,
any of which can be suffixed with a colon and a list of positions.
<li>A lexeme can be quoted by preceding it with an apostrophe,
in which case it runs until the next apostrophe;
otherwise a lexeme ends with the first whitespace or colon encountered.
<li>Any character preceded by a backslash,
including whitespace, the apostrophe, the colon, and the backslash itself,
loses its normal meaning and is treated as a letter.
Backslashes are effective
both inside and outside of apostrophe-quoted lexemes.
<li>A lexeme can be suffixed with a list of positions
by appending a colon and a comma-separated list of integers,
each of which can itself be followed by a letter
to designate a position weight
(position weights are <a href="#weights">described above</a>).
</ul>

Here are some example strings,
showing the lexeme you want to insert
together with the string that the <tt>::tsvector</tt> operator
needs to see,
and how you would type that string at the PostgreSQL prompt:

<table align=center>
<tr>
<td><i>For the lexeme...</i>
<td><i>you need the string...</i>
<td><i>which you can type as:</i>
<tr>
<td><tt>nugget</tt>
<td><tt>nugget</tt>
<td><tt>'nugget'</tt>
<tr>
<td><tt>won't</tt>
<td><tt>won't</tt>
<td><tt>'won''t'</tt>
<tr>
<td><tt>pinin'</tt>
<td><tt>pinin'</tt>
<td><tt>'pinin'''</tt>
<tr>
<td><tt>'bout</tt>
<td><tt>\'bout</tt>
<td><tt>'\\''bout'</tt>
<tr>
<td><tt>white mist</tt>
<td><tt>white\ mist</tt>
<td><tt>'white\\ mist'</tt>
<tr>
<td align=right><tt><i>or:</i></tt>
<td><tt>'white mist'</tt>
<td><tt>'''white mist'''</tt>
<tr>
<td><tt>won't budge</tt>
<td><tt>won\'t\ budge</tt>
<td><tt>'won\\''t\\ budge'</tt>
<tr>
<td align=right><tt><i>or:</i></tt>
<td><tt>'won\'t budge'</tt>
<td><tt>'''won\\''t budge'''</tt>
<tr>
<td><tt>back\slashed</tt>
<td><tt>back\\slashed</tt>
<td><tt>'back\\\\slashed'</tt>
</table>
Remember to use the extra layer of quoting shown at the right
only when typing in strings as part of a PostgreSQL query.
If you are providing strings through a library
that automatically quotes them
or provides them in binary form to PostgreSQL,
then you can use the strings in the middle instead —
suitably quoted in the language you are using, of course.
<p>
Position weights are <a href="#weights">described above</a>
and can be written exactly as they will be displayed
when you select a weighted vector:

<pre>
=# <b>select 'weighty:1,3A trivial:2B,4'::tsvector</b>
           tsvector
-------------------------------
 'trivial':2B,4 'weighty':1,3A
(1 row)
</pre>

<p>
Note that if you are composing SQL queries
in a scripting language like Perl or Python,
that itself considers quotes and backslashes special,
then you may have another quoting layer to deal with
on top of the two layers already shown above.
In such cases you may want to write a function
that performs the necessary quoting for you.

<blockquote><i>
Having seen how to create vectors of your own,
it is time to learn how the native tsearch2 parser
reduces documents to vectors.
</i></blockquote>

<h2><a name=parsing_lexing>Parsing and Lexing</a></h2>

<blockquote><i>
The previous section
described how you can bypass the parser provided by tsearch2
and populate your table of documents
with vectors of your own devising.
But for those interested in the native tsearch2 facilities,
we present here an overview of how it goes about
reducing documents to vectors.
</i></blockquote>

The <tt>to_tsvector()</tt> function reduces documents to vectors
in two stages.
First, a <i>parser</i> breaks the input document
into short sequences of text called <i>tokens</i>.
Each token is usually a word, space, or piece of punctuation,
though some parsers return larger and more exotic items
like HTML tags as single tokens.
Each token returned by the parser
is either discarded
or passed to a <i>dictionary</i> that converts it into a lexeme.
The resulting lexemes are collected into a vector and returned.
<p>
The choice of which parser and dictionaries <tt>to_tsvector()</tt> should use
is controlled by your choice of <i>configuration</i>.
The tsearch2 module comes with several configurations,
and you can define more of your own;
in fact the creation of a new configuration is illustrated below,
in the section on position weights.
<p>
To learn about parsing in more detail,
we will study this example:

<pre>
=# <b>select to_tsvector('default',
     'The walls extend upward for well over 100 feet.')</b>
                        to_tsvector
----------------------------------------------------------
 '100':8 'feet':9 'wall':2 'well':6 'extend':3 'upward':4
(1 row)
</pre>

Unlike the <tt>to_tsvector()</tt> calls used in the above examples,
this one specifies the <tt>'default'</tt> configuration explicitly.
When we called <tt>to_tsvector()</tt> in earlier examples
with only one argument,
it used the <i>current</i> configuration,
which is chosen automatically based on your <tt>LOCALE</tt>
if that locale is mentioned in the <tt>pg_ts_cfg</tt> table
(which is shown under the first bullet in the description below).
If your locale is not listed in the table,
your attempts to use the current configuration will return:

<pre>
ERROR: Can't find tsearch2 config by locale
</pre>

You can always change the current configuration manually
by calling the <tt>set_curcfg()</tt> function
described in the section on
<a href="tsearch2-ref.html#configurations">Configurations</a>
in the Reference.
<p>
Each configuration serves as an index into two different tables:
in <tt>pg_ts_cfg</tt> it determines
which parser will break our text into tokens,
and in <tt>pg_ts_cfgmap</tt>
it directs each token to a dictionary for processing.
The steps in detail are:
<ul>
<li class=big>
<p>First, our text is parsed,
using the parser listed for our configuration in the <tt>pg_ts_cfg</tt> table.
We are using the <tt>'default'</tt> configuration,
so the table tells us to use the <tt>'default'</tt> parser:

<pre>
=# <b>SELECT * FROM pg_ts_cfg WHERE ts_name = 'default'</b>
 ts_name | prs_name | locale
---------+----------+--------
 default | default  | C
(1 row)
</pre>

So our text will be parsed as though we had called:

<pre>
=# <b>select * from parse('default',
     'The walls extend upward for well over 100 feet.')</b>
</pre>

This breaks the text into a list of tokens
which are each labelled with an integer type:
<p align=center>
The<sub>1</sub>♦<sub>12</sub
>walls<sub>1</sub>♦<sub>12</sub
>extend<sub>1</sub>♦<sub>12</sub
>upward<sub>1</sub>♦<sub>12</sub
>for<sub>1</sub>♦<sub>12</sub
>well<sub>1</sub>♦<sub>12</sub
>over<sub>1</sub>♦<sub>12</sub
>100<sub>22</sub>♦<sub>12</sub
>feet<sub>1</sub>.<sub>12</sub>
<p>
Each word has been assigned type 1;
each space (represented here by a diamond) and the period, type 12;
and the number one hundred, type 22.
We can retrieve the alias for each type
through the <tt>token_type</tt> function:

<pre>
=# <b>select * from token_type('default')
     where tokid = 1 or tokid = 12 or tokid = 22</b>
 tokid | alias |      descr
-------+-------+------------------
     1 | lword | Latin word
    12 | blank | Space symbols
    22 | uint  | Unsigned integer
(3 rows)
</pre>

<li class=big>
Next, the tokens are assigned to dictionaries
by looking up their type aliases in <tt>pg_ts_cfgmap</tt>
to determine which dictionary should process each token.
Since we are using the <tt>'default'</tt> configuration:

<pre>
=# <b>select * from pg_ts_cfgmap where ts_name = 'default' and
     (tok_alias = 'lword' or tok_alias = 'blank' or tok_alias = 'uint')</b>
 ts_name | tok_alias | dict_name
---------+-----------+-----------
 default | lword     | {en_stem}
 default | uint      | {simple}
(2 rows)
</pre>

Since this map provides no dictionary for <tt>blank</tt> tokens,
the spaces and period are simply discarded,
leaving nine tokens,
which are then numbered by their position:
<p align=center>
The<sup>1</sup>
walls<sup>2</sup>
extend<sup>3</sup>
upward<sup>4</sup>
for<sup>5</sup>
well<sup>6</sup>
over<sup>7</sup>
100<sup>8</sup>
feet<sup>9</sup>

<li class=big>
Finally, the words are reduced to lexemes by their respective dictionaries.
The <tt>100</tt> is submitted to the <tt>simple</tt> dictionary,
which returns tokens unaltered except for making them lowercase:

<pre>
=# <b>select lexize('simple', '100')</b>
 lexize
--------
 {100}
(1 row)
</pre>

The other words are submitted to <tt>en_stem</tt>,
which reduces each English word to a linguistic stem
and then discards stems which belong to its list of stop words;
you can see the list of stop words
in the file whose path is in the <tt>dict_initoption</tt> field
of the <tt>pg_ts_dict</tt> table entry for <tt>en_stem</tt>.
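You can look that path up with a query like this one
(its output is an installation-specific file name, so none is shown here):

<pre>
=# <b>SELECT dict_initoption FROM pg_ts_dict WHERE dict_name = 'en_stem'</b>
</pre>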
The first three words of our text illustrate respectively
an <tt>en_stem</tt> stop word,
a word which <tt>en_stem</tt> alters by stemming,
and a word which <tt>en_stem</tt> leaves alone:

<pre>
=# <b>select lexize('en_stem', 'The')</b>
 lexize
--------
 {}
(1 row)
=# <b>select lexize('en_stem', 'walls')</b>
 lexize
--------
 {wall}
(1 row)
=# <b>select lexize('en_stem', 'extend')</b>
  lexize
----------
 {extend}
(1 row)
</pre>

Once <tt>en_stem</tt> is done discarding stop words and stemming the rest,
we are left with:
<p align=center>
wall<sup>2</sup>
extend<sup>3</sup>
upward<sup>4</sup>
well<sup>6</sup>
100<sup>8</sup>
feet<sup>9</sup>
<p>
Which is precisely the result of the example that began this section.
</ul>
Query words are stemmed by the <tt>to_tsquery()</tt> function
using the same scheme to determine the dictionary for each token,
with the difference that the query parser recognizes as special
the boolean operators that separate query words.
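For example, querying for two of the words from our sample text
yields the same stems that <tt>to_tsvector()</tt> produced for them:

<pre>
=# <b>select to_tsquery('walls & extend')</b>
    to_tsquery
-------------------
 'wall' & 'extend'
(1 row)
</pre>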
</body>
</html>