mirror of https://git.postgresql.org/git/postgresql.git (synced 2025-02-11 19:20:40 +08:00)
Document filtering dictionaries in textsearch.sgml.
While at it, copy-edit the description of prefix-match marker support in synonym dictionaries, and clarify the description of the default unaccent dictionary a bit more.
parent b5cf172d2c
commit 6a167730e3
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.56.2.1 2010/07/29 19:34:37 petere Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.56.2.2 2010/08/25 21:43:01 tgl Exp $ -->
 
 <chapter id="textsearch">
 <title>Full Text Search</title>
@@ -112,7 +112,7 @@
 as a sorted array of normalized lexemes. Along with the lexemes it is
 often desirable to store positional information to use for
 <firstterm>proximity ranking</firstterm>, so that a document that
-contains a more <quote>dense</> region of query words is
+contains a more <quote>dense</> region of query words is
 assigned a higher rank than one with scattered query words.
 </para>
 </listitem>
@@ -1151,13 +1151,13 @@ MaxFragments=0, FragmentDelimiter=" ... "
 <screen>
 SELECT ts_headline('english',
 'The most common type of search
-is to find all documents containing given query terms
+is to find all documents containing given query terms
 and return them in order of their similarity to the
 query.',
 to_tsquery('query & similarity'));
 ts_headline
 ------------------------------------------------------------
-containing given <b>query</b> terms
+containing given <b>query</b> terms
 and return them in order of their <b>similarity</b> to the
 <b>query</b>.
 
@@ -1166,7 +1166,7 @@ SELECT ts_headline('english',
 is to find all documents containing given query terms
 and return them in order of their similarity to the
 query.',
-to_tsquery('query & similarity'),
+to_tsquery('query & similarity'),
 'StartSel = <, StopSel = >');
 ts_headline
 -------------------------------------------------------
@@ -2064,6 +2064,14 @@ SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.h
 (notice that one token can produce more than one lexeme)
 </para>
 </listitem>
+<listitem>
+<para>
+a single lexeme with the <literal>TSL_FILTER</> flag set, to replace
+the original token with a new token to be passed to subsequent
+dictionaries (a dictionary that does this is called a
+<firstterm>filtering dictionary</>)
+</para>
+</listitem>
 <listitem>
 <para>
 an empty array if the dictionary knows the token, but it is a stop word
@@ -2096,6 +2104,13 @@ SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.h
 until some dictionary recognizes it as a known word. If it is identified
 as a stop word, or if no dictionary recognizes the token, it will be
 discarded and not indexed or searched for.
+Normally, the first dictionary that returns a non-<literal>NULL</>
+output determines the result, and any remaining dictionaries are not
+consulted; but a filtering dictionary can replace the given word
+with a modified word, which is then passed to subsequent dictionaries.
+</para>
+
+<para>
 The general rule for configuring a list of dictionaries
 is to place first the most narrow, most specific dictionary, then the more
 general dictionaries, finishing with a very general dictionary, like
@@ -2112,6 +2127,16 @@ ALTER TEXT SEARCH CONFIGURATION astro_en
 </programlisting>
 </para>
 
+<para>
+A filtering dictionary can be placed anywhere in the list, except at the
+end where it'd be useless. Filtering dictionaries are useful to partially
+normalize words to simplify the task of later dictionaries. For example,
+a filtering dictionary could be used to remove accents from accented
+letters, as is done by the
+<link linkend="unaccent"><filename>contrib/unaccent</></link>
+extension module.
+</para>
+
 <sect2 id="textsearch-stopwords">
 <title>Stop Words</title>
 
@@ -2184,7 +2209,7 @@ CREATE TEXT SEARCH DICTIONARY public.simple_dict (
 Here, <literal>english</literal> is the base name of a file of stop words.
 The file's full name will be
 <filename>$SHAREDIR/tsearch_data/english.stop</>,
-where <literal>$SHAREDIR</> means the
+where <literal>$SHAREDIR</> means the
 <productname>PostgreSQL</productname> installation's shared-data directory,
 often <filename>/usr/local/share/postgresql</> (use <command>pg_config
 --sharedir</> to determine it if you're not sure).
@@ -2295,63 +2320,6 @@ SELECT * FROM ts_debug('english', 'Paris');
 asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}
 </screen>
 </para>
-
-<para>
-An asterisk (<literal>*</literal>) at the end of definition word indicates
-that definition word is a prefix, and <function>to_tsquery()</function>
-function will transform that definition to the prefix search format (see
-<xref linkend="textsearch-parsing-queries">).
-Notice that it is ignored in <function>to_tsvector()</function>.
-</para>
-
-<para>
-Contents of <filename>$SHAREDIR/tsearch_data/synonym_sample.syn</>:
-<programlisting>
-postgres pgsql
-postgresql pgsql
-postgre pgsql
-gogle googl
-indices index*
-</programlisting>
-</para>
-
-<para>
-Results:
-<screen>
-=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
-=# SELECT ts_lexize('syn','indices');
-ts_lexize
------------
-{index}
-(1 row)
-
-=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
-=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
-=# SELECT to_tsquery('tst','indices');
-to_tsquery
-------------
-'index':*
-(1 row)
-
-=# SELECT 'indexes are very useful'::tsvector;
-tsvector
----------------------------------
-'are' 'indexes' 'useful' 'very'
-(1 row)
-
-=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
-?column?
-----------
-t
-(1 row)
-
-=# SELECT to_tsvector('tst','indices');
-to_tsvector
--------------
-'index':1
-(1 row)
-</screen>
-</para>
 
 <para>
 The only parameter required by the <literal>synonym</> template is
@@ -2374,6 +2342,60 @@ indices index*
 <literal>true</>, words and tokens are not folded to lower case,
 but are compared as-is.
 </para>
+
+<para>
+An asterisk (<literal>*</literal>) can be placed at the end of a synonym
+in the configuration file. This indicates that the synonym is a prefix.
+The asterisk is ignored when the entry is used in
+<function>to_tsvector()</function>, but when it is used in
+<function>to_tsquery()</function>, the result will be a query item with
+the prefix match marker (see
+<xref linkend="textsearch-parsing-queries">).
+For example, suppose we have these entries in
+<filename>$SHAREDIR/tsearch_data/synonym_sample.syn</>:
+<programlisting>
+postgres pgsql
+postgresql pgsql
+postgre pgsql
+gogle googl
+indices index*
+</programlisting>
+Then we will get these results:
+<screen>
+mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
+mydb=# SELECT ts_lexize('syn','indices');
+ts_lexize
+-----------
+{index}
+(1 row)
+
+mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
+mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
+mydb=# SELECT to_tsvector('tst','indices');
+to_tsvector
+-------------
+'index':1
+(1 row)
+
+mydb=# SELECT to_tsquery('tst','indices');
+to_tsquery
+------------
+'index':*
+(1 row)
+
+mydb=# SELECT 'indexes are very useful'::tsvector;
+tsvector
+---------------------------------
+'are' 'indexes' 'useful' 'very'
+(1 row)
+
+mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
+?column?
+----------
+t
+(1 row)
+</screen>
+</para>
 </sect2>
 
 <sect2 id="textsearch-thesaurus">
@@ -3270,7 +3292,7 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
 (<productname>PostgreSQL</productname> does this automatically when needed.)
 GiST indexes are lossy because each document is represented in the
 index by a fixed-length signature. The signature is generated by hashing
-each word into a random bit in an n-bit string, with all these bits OR-ed
+each word into a single bit in an n-bit string, with all these bits OR-ed
 together to produce an n-bit document signature. When two words hash to
 the same bit position there will be a false match. If all words in
 the query have matches (real or false) then the table row must be
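The signature scheme described in the hunk above can be illustrated with a small Python sketch. This is not PostgreSQL's implementation: the 64-bit signature width, the use of `zlib.crc32` as the hash, and the function names are assumptions chosen only for illustration.

```python
import zlib

SIGNATURE_BITS = 64  # assumed value of n; not taken from PostgreSQL


def word_bit(word, n=SIGNATURE_BITS):
    """Hash a word to a single bit position in an n-bit string."""
    # crc32 is a stand-in hash, chosen because it is deterministic.
    return 1 << (zlib.crc32(word.encode("utf-8")) % n)


def signature(words, n=SIGNATURE_BITS):
    """OR the bits of all words together into one document signature."""
    sig = 0
    for word in words:
        sig |= word_bit(word, n)
    return sig


def may_match(doc_signature, query_words, n=SIGNATURE_BITS):
    """Return True if every query word's bit is set in the signature.

    True can be a false match (two words sharing a bit), so the row
    would still have to be rechecked. False is always definite.
    """
    return all(doc_signature & word_bit(w, n) for w in query_words)


doc = signature(["indexes", "are", "very", "useful"])
print(may_match(doc, ["indexes", "useful"]))
```

Because distinct words can land on the same bit, a positive answer from `may_match` only means the row is a candidate, which is the lossiness the documentation text refers to.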
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/unaccent.sgml,v 1.3.6.3 2010/08/25 02:12:11 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/unaccent.sgml,v 1.3.6.4 2010/08/25 21:43:01 tgl Exp $ -->
 
 <sect1 id="unaccent">
 <title>unaccent</title>
@@ -75,8 +75,10 @@
 <para>
 Running the installation script <filename>unaccent.sql</> creates a text
 search template <literal>unaccent</> and a dictionary <literal>unaccent</>
-based on it, with default parameters. You can alter the
-parameters, for example
+based on it. The <literal>unaccent</> dictionary has the default
+parameter setting <literal>RULES='unaccent'</>, which makes it immediately
+usable with the standard <filename>unaccent.rules</> file.
+If you wish, you can alter the parameter, for example
 
 <programlisting>
 mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
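The dictionary-chain behavior that this commit documents (the first non-NULL result wins, unless a filtering dictionary replaces the token and passes it on) can be sketched in Python. This is a simplified model, not PostgreSQL code: the `Filtered` wrapper stands in for the `TSL_FILTER` flag, and the three dictionary functions and their word lists are hypothetical.

```python
import unicodedata


class Filtered:
    """Wraps a replacement token; plays the role of the TSL_FILTER flag."""

    def __init__(self, token):
        self.token = token


def unaccent_filter(token):
    # Hypothetical filtering dictionary: strip combining accent marks.
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # None means "token not recognized", so the next dictionary is consulted.
    return Filtered(stripped) if stripped != token else None


def stop_words(token):
    # Hypothetical dictionary: an empty list marks a known stop word.
    return [] if token.lower() in {"the", "a", "of"} else None


def simple(token):
    # Hypothetical catch-all dictionary: recognizes everything.
    return [token.lower()]


def lexize(token, dictionaries):
    """Consult dictionaries in order and return lexemes, [] or None."""
    for dictionary in dictionaries:
        result = dictionary(token)
        if result is None:
            continue                # unknown here: try the next dictionary
        if isinstance(result, Filtered):
            token = result.token    # replace the token and keep going
            continue
        return result               # first real answer ends the search
    return None                     # no dictionary recognized the token


chain = [unaccent_filter, stop_words, simple]
print(lexize("H\u00f4tel", chain))  # ['hotel']
```

In this sketch a filtering dictionary placed last only replaces the token and then runs out of dictionaries, so `lexize` returns `None`, which mirrors the remark that such a placement would be useless.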