Brandon Craig Rhodes
30 June 2003
Updated to 8.2 release by Oleg Bartunov, October 2006
This Guide introduces the reader to the PostgreSQL tsearch2 module, version 2. More formal descriptions of the module's types and functions are provided in the tsearch2 Reference, which is a companion to this document.
First we will examine the tsvector and tsquery types and how they are used to search documents; next, we will use them to build a simple search engine in SQL; and finally, we will study the internals of document conversion and how you might tune the internals to accommodate various searching needs.
Once you have tsearch2 working with PostgreSQL, you should be able to run the examples here exactly as they are typed.
Introduction to FTS with tsearch2
Vectors and Queries
A Simple Search Engine
Ranking and Position Weights
Casting Vectors and Queries
Parsing and Lexing
Additional information
This section introduces the two data types upon which tsearch2 search engines are based, and illustrates their interaction using the simplest possible case. The complex examples we present later on are merely variations and elaborations of this basic mechanism.
The tsearch2 module allows you to index documents by the words they contain, and then perform very efficient searches for documents that contain a given combination of words. Preparing your document index involves two steps: first, reduce each document to a compact tsvector listing the lexemes it contains; and second, create an index over those vectors so that searches need not scan every row of the table.
The default tsearch2 configuration, which we will learn more about later, provides a good example of a process for reducing documents to vectors:
=# SELECT set_curcfg('default');
=# SELECT to_tsvector('The air smells of sea water.');
             to_tsvector
-------------------------------------
 'air':2 'sea':5 'smell':3 'water':6
(1 row)

Note the complex relationship between this document and its vector. The vector lists only words from the document — spaces and punctuation have disappeared. Common words like the and of have been eliminated. The -s that makes smells a plural has been removed, leaving a lexeme that represents the word in its simplest form. And finally, though the vector remembers the positions in which each word appeared, it does not store the lexemes in that order.
Keeping word positions in your vectors is optional, by the way. The positions are necessary for the tsearch2 ranking functions, which you can use to prioritize documents based on how often each document uses the search terms and whether they appear in close proximity. But if you do not perform ranking, or use your own process that ignores the word positions stored in the vector, then you can save space by stripping them from your vectors:
=# SELECT strip(to_tsvector('The air smells of sea water.'));
            strip
-----------------------------
 'air' 'sea' 'smell' 'water'
(1 row)

Now that we have a procedure for creating vectors, we can build an indexed table of vectors very simply:
=# CREATE TABLE vectors ( vector tsvector );
=# CREATE INDEX vector_index ON vectors USING gist(vector);
=# INSERT INTO vectors VALUES (to_tsvector('The path forks here'));
=# INSERT INTO vectors VALUES (to_tsvector('A crawl leads west'));
=# INSERT INTO vectors VALUES (to_tsvector('The left fork leads northeast'));
=# SELECT * FROM vectors;
                  vector
------------------------------------------
 'fork':3 'path':2
 'lead':3 'west':4 'crawl':2
 'fork':3 'lead':4 'left':2 'northeast':5
(3 rows)

Now we can search this collection of document vectors using the @@ operator and a tsquery that specifies the combination of lexemes we are looking for. Note that while vectors simply list lexemes, queries always combine them with the operators ‘&’ and, ‘|’ or, and ‘!’ not, plus parentheses for grouping. Some examples of the query syntax:
‘find documents with the word forks in them’         | 'forks'
‘... with both forks and leads’                      | 'forks & leads'
‘... with either forks or leads’                     | 'forks | leads'
‘... with either forks or leads, but without crawl’  | '(forks|leads) & !crawl'
Just as documents must be reduced to vectors before they can be indexed, search words must be normalized in the same way before they can be matched against those vectors; the to_tsquery() function performs this reduction for queries:

=# SELECT to_tsquery('(leads|forks) & !crawl');
           to_tsquery
--------------------------------
 ( 'lead' | 'fork' ) & !'crawl'
(1 row)

This is critically important because the search operator @@ finds only exact matches between the lexemes in a query and the lexemes in a vector; if the document vector lists the lexeme fork but the query looks for the plural form forks, the query will not match that document. Thanks to the symmetry between our process for producing vectors and queries, however, the above searches return correct results:
=# SELECT * FROM vectors WHERE vector @@ to_tsquery('(leads|forks) & !crawl');
                  vector
------------------------------------------
 'fork':3 'path':2
 'fork':3 'lead':4 'left':2 'northeast':5
(2 rows)

You may want to try the other queries shown above, and perhaps invent some of your own.
You should not include stop words in a query, since you cannot search for words you have discarded. If you throw out the word the when building vectors, for example, your index will obviously not know which documents included it. The to_tsquery() function automatically detects this and issues a notice to prevent the mistake:
=# SELECT to_tsquery('the');
NOTICE:  Query contains only stopword(s) or doesn't contain lexem(s), ignored
 to_tsquery
------------

(1 row)

But if you ever build vectors and queries using your own routines, a possibility we will discuss later, then you will need to enforce this rule yourself.
Now that you understand how vectors and queries work together, you are prepared to tackle many additional topics: how to distribute searching across many servers; how to customize the process by which tsearch2 turns documents and queries into lexemes, or use a process of your own; and how to sort and display search results to your users. But before discussing these detailed questions, we will build a simple search engine to see how easily its basic features work together.
In this section we build a simple search engine out of SQL functions that use the vector and query types described in the previous section. While this example is simpler than a search engine that has to interface with the outside world, it will illustrate the basic principles of building a search engine, and better prepare you for developing your own.

Building a search engine involves only a few improvements upon the rudimentary vector searches described in the last section: each document needs an identifier with which it can be retrieved once a search has matched it; results should be ranked by relevance; and each match should be presented with a headline that shows the search terms in context.
We can easily construct a simple search engine that accomplishes these goals. First we build a table that, for each document, stores a unique identifier, the full text of the document, and its tsvector:
=# CREATE TABLE docs ( id SERIAL, doc TEXT, vector tsvector );
=# CREATE INDEX docs_index ON docs USING gist(vector);

Note that although searches will still work on tables where you have neglected to create a gist() index over your vectors, they will run much more slowly, since they will have to compare the query against every document vector in the table.
Because the table we have created stores each document in two different ways — both as text and as a vector — our INSERT statements must provide the document in both forms. While more advanced PostgreSQL programmers might accomplish this with a database trigger or rule, for this simple example we will use a small SQL function:
=# CREATE FUNCTION insdoc(text) RETURNS void LANGUAGE sql
   AS 'INSERT INTO docs (doc, vector) VALUES ($1, to_tsvector($1));';
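As an aside, here is a sketch of the trigger approach mentioned above, using the tsearch2() trigger function that the module supplies for keeping a vector column current (the trigger name here is our own invention):

-- The tsearch2() trigger function recomputes the vector column
-- from the listed text column(s) on every INSERT or UPDATE.
=# CREATE TRIGGER docs_vector_update
       BEFORE INSERT OR UPDATE ON docs
       FOR EACH ROW EXECUTE PROCEDURE tsearch2(vector, doc);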
Now, by calling insdoc() several times, we can populate our table with documents:

=# SELECT insdoc('A low crawl over cobbles leads inward to the west.');
=# SELECT insdoc('The canyon runs into a mass of boulders -- dead end.');
=# SELECT insdoc('You are crawling over cobbles in a low passage.');
=# SELECT insdoc('Cavernous passages lead east, north, and south.');
=# SELECT insdoc('To the east a low wide crawl slants up.');
=# SELECT insdoc('You are in the south side chamber.');
=# SELECT insdoc('The passage here is blocked by a recent cave-in.');
=# SELECT insdoc('You are in a splendid chamber thirty feet high.');

Now we can build a search function. Its SELECT statement is based upon the same @@ operation illustrated in the previous section. But instead of returning matching vectors, we return for each document its SERIAL identifier, so the user can retrieve it later; a headline that illustrates its use of the search terms; and a ranking with which we also order the results. Our search operation can be coded as a single SELECT statement returning its own kind of table row, which we call a finddoc_t:
=# CREATE TYPE finddoc_t AS (id INTEGER, headline TEXT, rank REAL);
=# CREATE FUNCTION finddoc(text) RETURNS SETOF finddoc_t LANGUAGE sql AS '
     SELECT id, headline(doc, q), rank(vector, q)
       FROM docs, to_tsquery($1) AS q
      WHERE vector @@ q
      ORDER BY rank(vector, q) DESC';

This function is a rather satisfactory search engine. Here is one example search, after which the user fetches the top-ranking document itself; with similar commands you can try queries of your own:
=# SELECT * FROM finddoc('passage|crawl');
 id |                        headline                        | rank
----+--------------------------------------------------------+------
  3 | <b>crawling</b> over cobbles in a low <b>passage</b>.  | 0.19
  1 | <b>crawl</b> over cobbles leads inward to the west.    |  0.1
  4 | <b>passages</b> lead east, north, and south.           |  0.1
  5 | <b>crawl</b> slants up.                                |  0.1
  7 | <b>passage</b> here is blocked by a recent cave-in.    |  0.1
(5 rows)
=# SELECT doc FROM docs WHERE id = 3;
                       doc
-------------------------------------------------
 You are crawling over cobbles in a low passage.
(1 row)

While by default the headline() function surrounds matching words with <b> and </b> in order to distinguish them from the surrounding text, you can provide options that change its behavior; consult the tsearch2 Reference for more details about Headline Functions.
Though a search may match hundreds or thousands of documents, you will usually present only ten or twenty results to the user at a time. This is most easily accomplished by limiting your query with LIMIT and OFFSET clauses — to display results ten at a time, for example, you would generate your first page of results with LIMIT 10 OFFSET 0, your second page with LIMIT 10 OFFSET 10, your third page with LIMIT 10 OFFSET 20, and so forth.
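For instance, a minimal sketch using the finddoc() function defined above to fetch a second page of ten results:

-- Page two of the ranked results: rows eleven through twenty.
=# SELECT * FROM finddoc('passage|crawl') LIMIT 10 OFFSET 10;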
There are two problems with this approach, however. The first is the strain of running the query over again for every page of results the user views. For small document collections or lightly loaded servers this may not matter; but the impact can be high when a search must repeatedly rank and sort the same ten thousand results on an already busy server. So instead of selecting only one page of results, you will probably use LIMIT and OFFSET to return a few dozen or a few hundred results, which you can cache and display to the user one page at a time. Whether a result cache rewards your effort will depend principally on the behavior of your users — how often they even view the second page of results, for instance.
The second issue solved by caching involves consistency. If the database is changing while the user browses their results, then documents might appear and disappear as they page through them. In some cases the user might even miss a particular result — perhaps the very one they were looking for — if, say, its rank improves from 31st to 30th after they load results 21–30 but before they view results 31–40. While many databases are static or infrequently updated, and will not present this problem, users searching very dynamic document collections might benefit from the stable results that caches yield.
Having seen the features of a search engine implemented entirely within the database, we will learn about some specific tsearch2 features. First we will look in more detail at document ranking.
When we built our simple search engine, we used the rank() function to order our results. Here we describe tsearch2 ranking in more detail.

There are two functions with which tsearch2 can rank search results. Both use the lexeme positions listed in the tsvector, so you cannot rank vectors from which the positions have been removed with strip(). The rank() function existed in older versions of OpenFTS, and has the feature that you can assign different weights to words from different sections of your document. The rank_cd() function uses a more recent technique for weighting results, and since release 8.2 it too allows different weights to be given to words from different sections of your document.
Both ranking functions allow you to specify, as an optional last argument, whether you want their results normalized — that is, whether the rank returned should be adjusted for document length. A last argument of 0 (zero) makes no adjustment; 1 (one) divides the document's rank by the logarithm of its length; and 2 divides it by its plain length. In all of these examples we omit this optional argument, which is the same as specifying zero: we make no adjustment for document length.
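As a brief illustration (a sketch, assuming the docs table built in the search engine above), the same query can be ranked with each normalization in turn:

-- The optional last argument selects normalization:
-- 0 = none, 1 = divide by log(length), 2 = divide by length.
=# SELECT id,
          rank(vector, q, 0) AS raw,
          rank(vector, q, 1) AS log_norm,
          rank(vector, q, 2) AS len_norm
     FROM docs, to_tsquery('crawl') AS q
    WHERE vector @@ q;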
The rank_cd() function uses an experimental measurement called cover density ranking, which rewards documents that use the search terms frequently and keep them close together. You can read about the algorithm in more detail in Clarke et al., “Relevance Ranking for One to Three Term Queries.” An optional first argument allows you to tune the formula; for details see the section on ranking in the Reference.
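For example, here is a sketch (again assuming the docs table from above) of ordering results by cover density instead of rank():

-- Cover density rewards matching terms that appear close together.
=# SELECT id, rank_cd(vector, q) AS rank
     FROM docs, to_tsquery('passage|crawl') AS q
    WHERE vector @@ q
    ORDER BY rank DESC;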
The sections of a document are distinguished by attaching weight labels to the positions in its vector. Currently tsearch2 supports four different weight labels: 'D', the default weight; and 'A', 'B', and 'C'. All vectors created with to_tsvector() assign the weight 'D' to each position, which, as the default, is not displayed when you print a vector out.
If you want positions with weights other than 'D', you have two options: either you can author a vector directly through the ::tsvector casting operation, as described in the following section, which lets you give each position whichever weight you want; or you can pass a vector through the setweight() function which sets all of its position weights to a single value. An example of the latter:
=# SELECT vector FROM docs WHERE id = 3;
                 vector
----------------------------------------
 'low':8 'cobbl':5 'crawl':3 'passag':9
(1 row)
=# SELECT setweight(vector, 'A') FROM docs WHERE id = 3;
                 setweight
--------------------------------------------
 'low':8A 'cobbl':5A 'crawl':3A 'passag':9A
(1 row)

Merely changing all of the weights in a vector is not very useful, of course, since the result still has every word weighted the same. But if we parse different parts of a document separately, giving each section its own weight, and then concatenate the vectors of each part into a single vector, the result can be very useful. We can construct a simple example in which document titles are given greater weight than text in the body of the document:
=# CREATE TABLE tdocs ( id SERIAL, title TEXT, doc TEXT, vector tsvector );
=# CREATE INDEX tdocs_index ON tdocs USING gist(vector);
=# CREATE FUNCTION instdoc(text, text) RETURNS void LANGUAGE sql
   AS 'INSERT INTO tdocs (title, doc, vector)
       VALUES ($1, $2, setweight(to_tsvector($1), ''A'') || to_tsvector($2));';

Now words from a document title will be weighted differently than those in the main text if we provide the title and body as separate arguments:
=# SELECT instdoc('Spendid Chamber', 'The walls are frozen rivers of orange stone.');
 instdoc
---------

(1 row)
=# SELECT vector FROM tdocs;
                                     vector
------------------------------------------------------------------------------
 'wall':4 'orang':9 'river':7 'stone':10 'frozen':6 'chamber':2A 'spendid':1A
(1 row)

Note that although the necessity is unusual, you can constrain search terms to match only words from certain sections by following them with a colon and a list of the sections in which the word may occur; by default this list is 'ABCD', so that search terms match words from all sections. For example, here we first search for a word generally, and then look only for specific weights:
=# SELECT title, doc FROM tdocs WHERE vector @@ to_tsquery('spendid');
      title      |                     doc
-----------------+----------------------------------------------
 Spendid Chamber | The walls are frozen rivers of orange stone.
(1 row)
=# SELECT title, doc FROM tdocs WHERE vector @@ to_tsquery('spendid:A');
      title      |                     doc
-----------------+----------------------------------------------
 Spendid Chamber | The walls are frozen rivers of orange stone.
(1 row)
=# SELECT title, doc FROM tdocs WHERE vector @@ to_tsquery('spendid:D');
 title | doc
-------+-----
(0 rows)
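Both ranking functions also accept an optional first argument: an array of four weights to assign to the labels 'D', 'C', 'B', and 'A', in that order. As a sketch, assuming the tdocs table above, here is how title words could be made to dominate the ranking:

-- The array supplies weights for the labels {D, C, B, A};
-- 'A'-labelled (title) lexemes count ten times as much as body text.
=# SELECT title, rank('{0.1, 0.2, 0.4, 1.0}', vector, q) AS rank
     FROM tdocs, to_tsquery('chamber') AS q
    WHERE vector @@ q;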
Our examples so far use tsearch2 to parse our documents into vectors. When your application needs absolute control over vector content, you will want to use direct type casting, which is described in the next section.
While tsearch2 has powerful and flexible ways to process documents and turn them into document vectors, you will sometimes want to parse documents on your own and place the results directly in vectors. Here we show you how.

In the preceding examples, we used the to_tsvector() function when we needed a document's text reduced to a document vector. We saw that the function stripped whitespace and punctuation, eliminated common words, and altered suffixes to reduce words to a common form. While these operations are often desirable, and while in the sections below we will gain precise control over this process, there are occasions on which you want to avoid the changes that to_tsvector() makes to text and specify explicitly the words that you want in your vectors. Or you may want to create queries directly rather than through to_tsquery().
For example, you may have already developed your own routine for reducing your documents to searchable lexemes, and do not want your carefully generated terms altered by passing them through to_tsvector(). Or you might be developing and debugging parsing routines of your own that you are not ready to load into the database. In either case, you will find that direct insertion is easily accomplished if you follow a few simple rules.
Vectors are created directly when you cast a string of whitespace separated lexemes to the tsvector type:
=# select 'the only exit is the way you came in'::tsvector;
                     tsvector
--------------------------------------------------
 'in' 'is' 'the' 'way' 'you' 'came' 'exit' 'only'
(1 row)

Notice that the conversion interpreted the string simply as a list of lexemes to be included in the vector. Their order was lost, as was the number of times each lexeme appeared. You must keep in mind that directly creating vectors with casting is not an alternate means of parsing; it is a way of directly entering lexemes into a vector without parsing.
Queries can also be created through casting, if you separate lexemes with boolean operators rather than with whitespace. When creating your own vectors and queries, remember that the search operator @@ finds only exact matches between query lexemes and vector lexemes — if they are not exactly the same string, they will not be considered a match.
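For instance, a minimal sketch of casting a query directly; note that we must write the simple lexeme forms ourselves, since no dictionary will process them for us:

-- Casting builds the query verbatim from the lexemes we supply.
=# select 'exit & way'::tsquery;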
To include lexeme positions in your vector, write the positions exactly the way tsearch2 displays them when it prints vectors: by following each lexeme with a colon and a comma-separated list of integer positions. If you list a lexeme more than once, then all the positions listed for it are combined into a single list. For example, here are two ways of writing the same vector, depending on whether you mention ‘the’ twice or combine its positions into a list yourself:
=# select 'the:1 only:2 exit:3 is:4 the:5 way:6 you:7 came:8 in:9'::tsvector;
                              tsvector
--------------------------------------------------------------------
 'in':9 'is':4 'the':1,5 'way':6 'you':7 'came':8 'exit':3 'only':2
(1 row)
=# select 'the:1,5 only:2 exit:3 is:4 way:6 you:7 came:8 in:9'::tsvector;
                              tsvector
--------------------------------------------------------------------
 'in':9 'is':4 'the':1,5 'way':6 'you':7 'came':8 'exit':3 'only':2
(1 row)

Things can get slightly tricky if you want to include apostrophes, backslashes, or spaces inside your lexemes (wanting to include the latter two would be unusual, but they can be included if you follow the rules). The main problem is that the apostrophe and backslash are important both to PostgreSQL when it is interpreting a string, and to the tsvector conversion function. You may want to review the section “String Constants” in the PostgreSQL documentation before proceeding.
When you cast strings directly into vectors:
For the lexeme... | you need the string... | which you can type as:
------------------+------------------------+------------------------
nugget            | nugget                 | 'nugget'
won't             | won't                  | 'won''t'
pinin'            | pinin'                 | 'pinin'''
'bout             | \'bout                 | '\\''bout'
white mist        | white\ mist            | 'white\\ mist'
              or: | 'white mist'           | '''white mist'''
won't budge       | won\'t\ budge          | 'won\\''t\\ budge'
              or: | 'won\'t budge'         | '''won\\''t budge'''
back\slashed      | back\\slashed          | 'back\\\\slashed'
Position weights, which were described above, can be written exactly as they are displayed when you select a weighted vector:
=# select 'weighty:1,3A trivial:2B,4'::tsvector;
            tsvector
--------------------------------
 'trivial':2B,4 'weighty':1,3A
(1 row)
Note that if you are composing SQL queries in a scripting language like Perl or Python, which itself treats quotes and backslashes as special, then you may have another quoting layer to deal with on top of the two layers already shown above. In such cases you may want to write a function that performs the necessary quoting for you.
Having seen how to create vectors of your own, it is time to learn how the native tsearch2 parser reduces documents to vectors.
The previous section described how you can bypass the parser provided by tsearch2 and populate your table of documents with vectors of your own devising. But for those interested in the native tsearch2 facilities, we present here an overview of how it goes about reducing documents to vectors.

The to_tsvector() function reduces documents to vectors in two stages. First, a parser breaks the input document into short sequences of text called tokens. Each token is usually a word, space, or piece of punctuation, though some parsers return larger and more exotic items like HTML tags as single tokens. Each token returned by the parser is either discarded or passed to a dictionary that converts it into a lexeme. The resulting lexemes are collected into a vector and returned.
The choice of which parser and dictionaries to_tsvector() should use is controlled by your choice of configuration. The tsearch2 module comes with several configurations, and you can define more of your own; in fact the creation of a new configuration is illustrated below, in the section on position weights.
To learn about parsing in more detail, we will study this example:
=# select to_tsvector('default', 'The walls extend upward for well over 100 feet.');
                        to_tsvector
----------------------------------------------------------
 '100':8 'feet':9 'wall':2 'well':6 'extend':3 'upward':4
(1 row)

Unlike the to_tsvector() calls used in the examples above, this one specifies the 'default' configuration explicitly. When we called to_tsvector() in earlier examples with only one argument, it used the current configuration, which is chosen automatically based on your LOCALE if that locale is mentioned in the pg_ts_cfg table (shown under the first step of the description below). If your locale is not listed in the table, your attempts to use the current configuration will return:
ERROR:  Can't find tsearch2 config by locale

You can always change the current configuration manually by calling the set_curcfg() function described in the section on Configurations in the Reference.
Each configuration serves as an index into two different tables: in pg_ts_cfg it determines which parser will break our text into tokens, and in pg_ts_cfgmap it directs each token to a dictionary for processing. The steps in detail are:
First, our text is parsed, using the parser listed for our configuration in the pg_ts_cfg table. We are using the 'default' configuration, so the table tells us to use the 'default' parser:
=# SELECT * FROM pg_ts_cfg WHERE ts_name = 'default';
 ts_name | prs_name | locale
---------+----------+--------
 default | default  | C
(1 row)

So our text will be parsed as though we had called:
=# select * from parse('default', 'The walls extend upward for well over 100 feet.');

This breaks the text into a list of tokens, each of which is labelled with an integer type:
The₁ ♦₁₂ walls₁ ♦₁₂ extend₁ ♦₁₂ upward₁ ♦₁₂ for₁ ♦₁₂ well₁ ♦₁₂ over₁ ♦₁₂ 100₂₂ ♦₁₂ feet₁ .₁₂
Each word has been assigned type 1; each space (represented here by a diamond) and the period, type 12; and the number one hundred, type 22. We can retrieve the alias for each type through the token_type function:
=# select * from token_type('default') where tokid = 1 or tokid = 12 or tokid = 22;
 tokid | alias |      descr
-------+-------+------------------
     1 | lword | Latin word
    12 | blank | Space symbols
    22 | uint  | Unsigned integer
(3 rows)
Next, the pg_ts_cfgmap table determines which dictionary, if any, will process each type of token our configuration produces:

=# select * from pg_ts_cfgmap where ts_name = 'default'
   and (tok_alias = 'lword' or tok_alias = 'blank' or tok_alias = 'uint');
 ts_name | tok_alias | dict_name
---------+-----------+-----------
 default | lword     | {en_stem}
 default | uint      | {simple}
(2 rows)

Since this map provides no dictionary for blank tokens, the spaces and the period are simply discarded, leaving nine tokens, which are then numbered by their position:
The₁ walls₂ extend₃ upward₄ for₅ well₆ over₇ 100₈ feet₉
Each remaining token is now handed to its dictionary. The number 100 is given to the simple dictionary, which returns it unchanged:

=# select lexize('simple', '100');
 lexize
--------
 {100}
(1 row)

The other words are submitted to en_stem, which reduces each English word to a linguistic stem and then discards stems that belong to its list of stop words; you can see the list of stop words in the file whose path is given in the dict_initoption field of the pg_ts_dict entry for en_stem. The first three words of our text illustrate, respectively, an en_stem stop word, a word which en_stem alters by stemming, and a word which en_stem leaves alone:
=# select lexize('en_stem', 'The');
 lexize
--------
 {}
(1 row)
=# select lexize('en_stem', 'walls');
 lexize
--------
 {wall}
(1 row)
=# select lexize('en_stem', 'extend');
  lexize
----------
 {extend}
(1 row)

Once en_stem is done discarding stop words and stemming the rest, we are left with:
wall₂ extend₃ upward₄ well₆ 100₈ feet₉
This is precisely the vector returned by the example with which this section began.