mirror of
https://git.postgresql.org/git/postgresql.git
synced 2025-01-24 18:55:04 +08:00
1761990e38
I've documented space usage and using CLUSTER command Oleg Bartunov
416 lines
13 KiB
Plaintext
416 lines
13 KiB
Plaintext
Tsearch contrib module contains implementation of new data type txtidx -
|
|
a searchable data type (textual) with indexed access.
|
|
|
|
All work was done by Teodor Sigaev (teodor@stack.net) and Oleg Bartunov
|
|
(oleg@sai.msu.su).
|
|
|
|
CHANGES:
|
|
|
|
August 29, 2002
|
|
Space usage and using CLUSTER command documented
|
|
August 22, 2002
|
|
Fix works with 'bad' queries
|
|
August 13, 2002
|
|
Use parser of OpenFTS v0.33.
|
|
|
|
IMPORTANT NOTICE:
|
|
|
|
This is a first step of our work on integration of OpenFTS
|
|
full text search engine (http://openfts.sourceforge.net/) into
|
|
PostgreSQL. It's based on our recent development of GiST
|
|
(Generalized Search Tree) for PostgreSQL 7.2 (see our GiST page
|
|
at http://www.sai.msu.su/~megera/postgres/gist/ for info about GiST)
|
|
and will works only for PostgreSQL version 7.2 and later.
|
|
|
|
We didn't try to implement a full-featured search engine with
|
|
stable interfaces but rather experiment with various approaches.
|
|
There are many issues remains (most of them just not documented or
|
|
implemented) but we'd like to present a working prototype
|
|
of full text search engine fully integrated into PostgreSQL to
|
|
collect user's feedback and recommendations.
|
|
|
|
INSTALLATION:
|
|
|
|
cd contrib/tsearch
|
|
gmake
|
|
gmake install
|
|
|
|
REGRESSION TEST:
|
|
|
|
gmake installcheck
|
|
|
|
USAGE:
|
|
|
|
psql DATABASE < tsearch.sql (from contrib/tsearch)
|
|
|
|
|
|
INTRODUCTION:
|
|
|
|
This module provides an implementation of a new data type 'txtidx' which is
|
|
a string of a space separated "words". "Words" with spaces
|
|
should be enclosed in apostrophes and apostrophes inside a "word" should be
|
|
escaped by backslash.
|
|
|
|
This is quite different from OpenFTS approach which uses array
|
|
of integers (ID of lexems) and requires storing of lexem-id pairs in database.
|
|
One of the prominent benefit of this new approach is that it's possible now
|
|
to perform full text search in a 'natural' way.
|
|
|
|
Some examples:
|
|
|
|
create table foo (
|
|
titleidx txtidx
|
|
);
|
|
|
|
2 regular words:
|
|
insert into foo values ( 'the are' );
|
|
Word with space:
|
|
insert into foo values ( 'the\\ are' );
|
|
Words with apostrophe:
|
|
insert into foo values ( 'value\'s this' );
|
|
Complex word with apostrophe:
|
|
insert into foo values ( 'value\'s this we \'PostgreSQL site\'' );
|
|
|
|
select * from foo where titleidx @@ '\'PostgreSQL site\' | this';
|
|
select * from foo where titleidx @@ 'value\'s | this';
|
|
select * from foo where titleidx @@ '(the|this)&!we';
|
|
|
|
test=# select 'two words'::txtidx;
|
|
txtidx
|
|
---------------
|
|
'two' 'words'
|
|
(1 row)
|
|
|
|
test=# select 'single\\ word'::txtidx;
|
|
txtidx
|
|
---------------
|
|
'single word'
|
|
(1 row)
|
|
|
|
|
|
FULL TEXT SEARCH:
|
|
|
|
The basic idea of this data type is to use it for full text search inside
|
|
database. If you have a 'text' column title and corresponding column
|
|
titleidx of type 'txtidx', which contains the same information from
|
|
text column, then search on title could be replaced by
|
|
searching on titleidx which would be fast because of indexed access.
|
|
|
|
As a real life example consider database with table 'titles' containing
|
|
titles of mailing list postings in column 'title':
|
|
|
|
create table titles (
|
|
title text
|
|
|
|
);
|
|
|
|
Suppose, you already have a lot of titles and want to do full text search
|
|
on them.
|
|
|
|
First, you need to install contrib/tsearch module (see INSTALLATION and USAGE).
|
|
Add column 'titleidx' of type txtidx, containing space separated words from
|
|
title. It's possible to use function txt2txtidx(title) to fill 'titleidx'
|
|
column (see notice 1):
|
|
|
|
-- add titleidx column of type txtidx
|
|
alter table titles add titleidx txtidx;
|
|
update titles set titleidx=txt2txtidx(title);
|
|
|
|
Create index on titleidx:
|
|
create index t_idx on titles using gist(titleidx);
|
|
|
|
and now you can search all titles with words 'patch' and 'gist':
|
|
select title from titles where titleidx ## 'patch&gist';
|
|
|
|
Here, ## is a new operation defined for type 'txtidx' which could use index
|
|
(if exists) built on titleidx. This operator uses morphology to
|
|
expand query, i.e.
|
|
## 'patches&gist' will find titles with 'patch' and 'gist' also.
|
|
If you want to provide query as is, use operator @@ instead:
|
|
select title from titles where titleidx @@ 'patch&gist';
|
|
but remember, that function txt2txtidx does uses morphology, so you need
|
|
to fill column 'titleidx' using some another way. We hope in future releases
|
|
provide more consistent and convenient interfaces.
|
|
|
|
Query could contains boolean operators &,|,!,() with their usual meaning,
|
|
for example: 'patch&gist&!cvs', 'patch|cvs'.
|
|
Each operation ( ##, @@ ) requires appropriate query type -
|
|
txtidx ## mquery_txt
|
|
txtidx @@ query_txt
|
|
|
|
To see what query actually will be used :
|
|
|
|
test=# select 'patches&gist'::mquery_txt;
|
|
mquery_txt
|
|
------------------
|
|
'patch' & 'gist'
|
|
(1 row)
|
|
|
|
test=# select 'patches&gist'::query_txt;
|
|
query_txt
|
|
--------------------
|
|
'patches' & 'gist'
|
|
(1 row)
|
|
|
|
Notice the difference !
|
|
|
|
You could use trigger to be sure column 'titleidx' is consistent
|
|
with any changes in column 'title':
|
|
|
|
create trigger txtidxupdate before update or insert on titles
|
|
for each row execute procedure tsearch(titleidx, title);
|
|
|
|
This trigger uses the same parser, dictionaries as function
|
|
txt2txtidx (see notice 1).
|
|
Current syntax allows creating trigger for several columns
|
|
you want to be searchable:
|
|
|
|
create trigger txtidxupdate before update or insert on titles
|
|
for each row execute procedure tsearch(titleidx, title1, title2,... );
|
|
|
|
Use function txtidxsize(titleidx) to get the number of "words" in column
|
|
titleidx. To get total number of words in table titles:
|
|
|
|
test=# select sum(txtidxsize(titleidx)) from titles;
|
|
sum
|
|
---------
|
|
1917182
|
|
(1 row)
|
|
|
|
NOTICES:
|
|
|
|
1.
|
|
function txt2txtidx and trigger use parser, dictionaries coming with
|
|
this contrib module on default. Parser is mostly the same as in OpenFTS and
|
|
dictionaries are simple stemmers (sort of Lovin's stemmer which uses a
|
|
longest match algorithm.) for english and russian languages. There is a perl
|
|
script makedict/makedict.pl, which could be used to create specific
|
|
dictionaries from files with endings and stop-words.
|
|
Example files for english and russian languages are available
|
|
from http://www.sai.msu.su/~megera/postgres/gist/tsearch/.
|
|
Run script without parameters to see information about arguments and options.
|
|
|
|
Example:
|
|
cd makedict
|
|
./makedict.pl -l LOCALNAME -e FILEENDINGS -s FILESTOPWORD \
|
|
-o ../dict/YOURDICT.dct
|
|
|
|
Another options of makedict.pl:
|
|
-f do not execute tolower for any char
|
|
-a function of checking stopword will be work after lemmatize,
|
|
default is before
|
|
|
|
You need to edit dict.h to use your dictionary and, probably,
|
|
morph.c to change mapdict array.
|
|
|
|
Don't forget to do
|
|
make clean; make; make install
|
|
|
|
2.
|
|
txtidx doesn't preserve words ordering (this is not critical for searching)
|
|
for performance reason, for example:
|
|
|
|
test=# select 'page two'::txtidx;
|
|
txtidx
|
|
--------------
|
|
'two' 'page'
|
|
(1 row)
|
|
|
|
3.
|
|
Indexed access provided by txtidx data type isn't always good
|
|
because of internal data structure we use (RD-Tree). Particularly,
|
|
queries like '!gist' will be slower than just a sequential scan,
|
|
because for such queries RD-Tree doesn't provides selectivity on internal
|
|
nodes and all checks should be processed at leaf nodes, i.t. scan of
|
|
full index. You may play with function query_tree to see how effective
|
|
will be index usage:
|
|
|
|
test=# select querytree( 'patch&gist'::query_txt );
|
|
querytree
|
|
------------------
|
|
'patch' & 'gist'
|
|
(1 row)
|
|
|
|
This is an example of "good" query - index will effective for both words.
|
|
|
|
test=# select querytree( 'patch&!gist'::query_txt );
|
|
querytree
|
|
-----------
|
|
'patch'
|
|
(1 row)
|
|
|
|
This means that index is effective only to search word 'patch' and resulted
|
|
rows will be checked against '!gist'.
|
|
|
|
test=# select querytree( 'patch|!gist'::query_txt );
|
|
querytree
|
|
-----------
|
|
T
|
|
(1 row)
|
|
|
|
test=# select querytree( '!gist'::query_txt );
|
|
querytree
|
|
-----------
|
|
T
|
|
(1 row)
|
|
|
|
These two queries will be processed by scanning of full index !
|
|
Very slow !
|
|
|
|
4.
|
|
Following selects produce the same result
|
|
|
|
select title from titles where titleidx @@ 'patch&gist';
|
|
select title from titles where titleidx @@ 'patch' and titleidx @@ 'gist';
|
|
|
|
but the former will be more effective, because of internal optimization
|
|
of query executor.
|
|
|
|
|
|
TODO:
|
|
|
|
Better configurability (as in OpenFTS)
|
|
User's interfaces to parser, dictionaries ...
|
|
Write documentation
|
|
|
|
|
|
BENCHMARKS:
|
|
|
|
We use test collection in our experiments which contains 377905
|
|
titles from various mailing lists stored in our mailware
|
|
project.
|
|
|
|
All runs were performed on IBM ThinkPad T21 notebook with
|
|
PIII 733 Mhz, 256 RAM, 20 Gb HDD, Linux 2.2.19, postgresql 7.2.dev
|
|
We didn't do extensive benchmarking and all
|
|
numbers provide for illustration. Actual performance
|
|
is strongly depends on many factors (query, collection, dictionaries
|
|
and hardware).
|
|
|
|
Collection is available for download from
|
|
http://www.sai.msu.su/~megera/postgres/gist/tsearch/mw_titles.gz
|
|
(377905 titles from postgresql mailing lists, about 3Mb).
|
|
|
|
0. install contrib/tsearch module
|
|
1. createdb test
|
|
2. psql test < tsearch.sql (from contrib/tsearch)
|
|
3. zcat mw_titles.gz | psql test
|
|
(it will creates table, copy test data and creates index)
|
|
|
|
Database contains one table:
|
|
|
|
test=# \d titles
|
|
Table "titles"
|
|
Column | Type | Modifiers
|
|
----------+------------------------+-----------
|
|
title | character varying(256) |
|
|
titleidx | txtidx |
|
|
Indexes: t_idx
|
|
|
|
Index was created as:
|
|
create index t_idx on titles using gist(titleidx);
|
|
(notice: this operation takes about 14 minutes on my notebook)
|
|
|
|
Typical select looks like:
|
|
select title from titles where titleidx @@ 'patch&gist';
|
|
|
|
Total number of lexems in collection : 1917182
|
|
|
|
1. We trust index - we consider index is exact and no
|
|
checking against tuples is necessary.
|
|
|
|
update pg_amop set amopreqcheck = false where amopclaid =
|
|
(select oid from pg_opclass where opcname = 'gist_txtidx_ops');
|
|
|
|
using gist indices
|
|
1: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.054s 0.00%
|
|
2: titleidx @@ 'patch&gist' 0.020u 0.000s 0m0.045s 44.82%
|
|
3: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.044s 0.00%
|
|
using gist indices (morph)
|
|
1: titleidx ## 'patch&gist' 0.000u 0.010s 0m0.046s 21.62%
|
|
2: titleidx ## 'patch&gist' 0.010u 0.010s 0m0.046s 43.47%
|
|
3: titleidx ## 'patch&gist' 0.000u 0.000s 0m0.046s 0.00%
|
|
disable gist index
|
|
1: titleidx @@ 'patch&gist' 0.000u 0.010s 0m1.601s 0.62%
|
|
2: titleidx @@ 'patch&gist' 0.000u 0.000s 0m1.607s 0.00%
|
|
3: titleidx @@ 'patch&gist' 0.010u 0.000s 0m1.607s 0.62%
|
|
traditional like
|
|
1: title ~* 'gist' and title ~* 'patch' 0.010u 0.000s 0m9.206s 0.10%
|
|
2: title ~* 'gist' and title ~* 'patch' 0.000u 0.010s 0m9.205s 0.10%
|
|
3: title ~* 'gist' and title ~* 'patch' 0.010u 0.000s 0m9.208s 0.10%
|
|
|
|
2. Need to check results against tuples to avoid possible hash collision.
|
|
|
|
update pg_amop set amopreqcheck = true where amopclaid =
|
|
(select oid from pg_opclass where opcname = 'gist_txtidx_ops');
|
|
|
|
using gist indices
|
|
1: titleidx @@ 'patch&gist' 0.010u 0.000s 0m0.052s 19.26%
|
|
2: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.045s 0.00%
|
|
3: titleidx @@ 'patch&gist' 0.010u 0.000s 0m0.045s 22.39%
|
|
using gist indices (morph)
|
|
1: titleidx ## 'patch&gist' 0.000u 0.000s 0m0.046s 0.00%
|
|
2: titleidx ## 'patch&gist' 0.000u 0.010s 0m0.046s 21.75%
|
|
3: titleidx ## 'patch&gist' 0.020u 0.000s 0m0.047s 42.13%
|
|
|
|
There are no visible difference between these 2 cases but your
|
|
mileage may vary.
|
|
|
|
|
|
NOTES:
|
|
|
|
1. The size of txtidx column should be lesser than size of corresponding column.
|
|
Below some real numbers from test database (link above).
|
|
|
|
a) After loading data
|
|
|
|
-rw------- 1 postgres users 23191552 Aug 29 14:08 53016937
|
|
-rw------- 1 postgres users 81059840 Aug 29 14:08 52639027
|
|
|
|
Table titles (52639027) occupies 80Mb, index on txtidx column (53016937)
|
|
occupies 22Mb. Use contrib/oid2name to get mappings from oid to names.
|
|
After doing
|
|
|
|
test=# select title into titles_tmp from titles;
|
|
SELECT
|
|
|
|
I got size of table 'titles' without txtidx field
|
|
|
|
-rw------- 1 postgres users 30105600 Aug 29 14:14 53016938
|
|
|
|
So, txtidx column itself occupies about 50Mb.
|
|
|
|
b) after running 'vacuum full analyze' I got:
|
|
|
|
-rw------- 1 postgres users 30105600 Aug 29 14:26 53016938
|
|
-rw------- 1 postgres users 36880384 Aug 29 14:26 53016937
|
|
-rw------- 1 postgres users 51494912 Aug 29 14:26 52639027
|
|
|
|
53016938 = titles_tmp
|
|
|
|
So, actual size of 'txtidx' field is 20 Mb ! "quod erat demonstrandum"
|
|
|
|
2. CLUSTER command is highly recommended if you need fast searching.
|
|
For example:
|
|
|
|
test=# cluster t_idx on titles;
|
|
|
|
BUT ! In 7.2 CLUSTER command forgets about other indices and permissions,
|
|
so you need be carefull and rebuild these indices and restore permissions
|
|
after clustering. Also, clustering isn't dynamic, so you'd need to
|
|
use CLUSTER from time to time. In 7.3 CLUSTER command should works
|
|
fine.
|
|
|
|
after clustering:
|
|
|
|
-rw------- 1 postgres users 23404544 Aug 29 14:59 53394850
|
|
-rw------- 1 postgres users 30105600 Aug 29 14:26 53016938
|
|
-rw------- 1 postgres users 50995200 Aug 29 14:45 53394845
|
|
pg@zen:/usr/local/pgsql/data/base/52638986$ oid2name -d test
|
|
All tables from database "test":
|
|
---------------------------------
|
|
53394850 = t_idx
|
|
53394845 = titles
|
|
53016938 = titles_tmp
|
|
|