postgresql/contrib/tsearch2
Tom Lane 3e23b68dac Support varlena fields with single-byte headers and unaligned storage.
This commit breaks any code that assumes that the mere act of forming a tuple
(without writing it to disk) does not "toast" any fields.  While all available
regression tests pass, I'm not totally sure that we've fixed every nook and
cranny, especially in contrib.

Greg Stark with some help from Tom Lane
2007-04-06 04:21:44 +00:00
data
docs Add description of new features 2006-10-31 16:23:05 +00:00
expected
gendict
ispell Fix core dump of ispell for case of non-successfull initialization. 2006-12-04 09:26:57 +00:00
my2ispell Fix convertion for 'PFX flag N num' 2006-12-21 17:35:28 +00:00
snowball pgindent run for 8.2. 2006-10-04 00:30:14 +00:00
sql
stopword
wordparser Fix parser bug on Windows with UTF8 encoding and C locale, the reason was 2007-03-22 15:58:24 +00:00
common.c Replace direct assignments to VARATT_SIZEP(x) with SET_VARSIZE(x, len). 2007-02-27 23:48:10 +00:00
common.h pgindent run for 8.2. 2006-10-04 00:30:14 +00:00
crc32.c
crc32.h
dict_ex.c Fix bug http://archives.postgresql.org/pgsql-bugs/2006-10/msg00258.php. 2006-11-20 14:03:30 +00:00
dict_ispell.c
dict_snowball.c Fix bug http://archives.postgresql.org/pgsql-bugs/2006-10/msg00258.php. 2006-11-20 14:03:30 +00:00
dict_syn.c Fix uninitialized-variable bug. 2007-03-28 01:28:34 +00:00
dict_thesaurus.c Wording cleanup for error messages. Also change can't -> cannot. 2007-02-01 19:10:30 +00:00
dict.c pgindent run for 8.2. 2006-10-04 00:30:14 +00:00
dict.h pgindent run for 8.2. 2006-10-04 00:30:14 +00:00
ginidx.c Replace direct assignments to VARATT_SIZEP(x) with SET_VARSIZE(x, len). 2007-02-27 23:48:10 +00:00
gistidx.c Fix up several contrib modules that were using varlena datatypes in not-so-obvious 2007-02-28 22:44:38 +00:00
gistidx.h Fix up several contrib modules that were using varlena datatypes in not-so-obvious 2007-02-28 22:44:38 +00:00
Makefile Replace useless uses of := by = in makefiles. 2007-02-09 15:56:00 +00:00
prs_dcfg.c pgindent run for 8.2. 2006-10-04 00:30:14 +00:00
query_cleanup.c
query_cleanup.h pgindent run for 8.2. 2006-10-04 00:30:14 +00:00
query_gist.c pgindent run for 8.2. 2006-10-04 00:30:14 +00:00
query_rewrite.c Fix up several contrib modules that were using varlena datatypes in not-so-obvious 2007-02-28 22:44:38 +00:00
query_support.c Fix up several contrib modules that were using varlena datatypes in not-so-obvious 2007-02-28 22:44:38 +00:00
query_util.c Fix up several contrib modules that were using varlena datatypes in not-so-obvious 2007-02-28 22:44:38 +00:00
query_util.h
query.c Fix up several contrib modules that were using varlena datatypes in not-so-obvious 2007-02-28 22:44:38 +00:00
query.h Fix up several contrib modules that were using varlena datatypes in not-so-obvious 2007-02-28 22:44:38 +00:00
rank.c Replace direct assignments to VARATT_SIZEP(x) with SET_VARSIZE(x, len). 2007-02-27 23:48:10 +00:00
README.tsearch2 Fix two typos. 2006-11-08 19:06:15 +00:00
snmap.c
snmap.h
stopword.c Add checking of end of line in parsing stopword list. Thanks to sharp eyes of Tom lane 2007-03-26 13:57:07 +00:00
thesaurus
ts_cfg.c Support varlena fields with single-byte headers and unaligned storage. 2007-04-06 04:21:44 +00:00
ts_cfg.h
ts_lexize.c pgindent run for 8.2. 2006-10-04 00:30:14 +00:00
ts_locale.c Fix localization support for multibyte encoding and C locale. 2007-01-15 15:16:28 +00:00
ts_locale.h Fix localization support for multibyte encoding and C locale. 2007-01-15 15:16:28 +00:00
ts_stat.c Replace direct assignments to VARATT_SIZEP(x) with SET_VARSIZE(x, len). 2007-02-27 23:48:10 +00:00
ts_stat.h Replace direct assignments to VARATT_SIZEP(x) with SET_VARSIZE(x, len). 2007-02-27 23:48:10 +00:00
tsearch.sql.in
tsvector_op.c Replace direct assignments to VARATT_SIZEP(x) with SET_VARSIZE(x, len). 2007-02-27 23:48:10 +00:00
tsvector.c Replace direct assignments to VARATT_SIZEP(x) with SET_VARSIZE(x, len). 2007-02-27 23:48:10 +00:00
tsvector.h Replace direct assignments to VARATT_SIZEP(x) with SET_VARSIZE(x, len). 2007-02-27 23:48:10 +00:00
untsearch.sql.in
wparser_def.c
wparser.c Fix caching of unsuccessful initialization of parser or configuration. 2007-04-02 11:42:04 +00:00
wparser.h

Tsearch2 - full text search extension for PostgreSQL

   An [1]online version of this document is available.

   Tsearch2 is a full text search engine, fully integrated into the
   PostgreSQL RDBMS.

Main features

     * Full online update
     * Support for multiple table-driven configurations
     * Flexible and rich linguistic support (dictionaries, stop words),
       thesaurus
     * Full multibyte (UTF-8) support
     * Sophisticated ranking functions with support for proximity and
       structure information (rank, rank_cd)
     * Index support (GiST and Gin) with concurrency and recovery support
     * Rich query language with query rewriting support
     * Headline support (text fragments with highlighted search terms)
     * Ability to plug in custom dictionaries and parsers
     * Template generator for tsearch2 dictionaries with [2]snowball
       stemmer support
     * Maturity (5 years of development)

   Tsearch2, in a nutshell, provides an FTS operator (contains) for two
   new data types, representing a document (tsvector) and a query
   (tsquery). Table-driven configuration allows the creation of custom
   searches using standard SQL commands.

   tsvector is a searchable data type representing a document. It is a
   set of unique words along with their positional information in the
   document, organized in a special structure optimized for fast access
   and lookup. Each entry can be labelled to reflect its importance in
   the document.
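
   As a rough illustration of the idea (not the actual on-disk format
   or the tsearch2 API), the sketch below models a tsvector in Python
   as a sorted mapping from unique words to their positions. The
   function name make_tsvector, the label values and the default label
   "D" are assumptions made for this example only.

```python
from collections import defaultdict


def make_tsvector(words, labels=None):
    """Build a toy tsvector: unique words mapped to their positions.

    `labels` optionally maps a word to an importance label; the label
    values and the default "D" are assumptions of this sketch.
    """
    labels = labels or {}
    positions = defaultdict(list)
    for position, word in enumerate(words, start=1):  # positions start at 1
        positions[word].append(position)
    # Keep entries sorted by word so a lookup could use binary search.
    return {
        word: {"positions": pos, "label": labels.get(word, "D")}
        for word, pos in sorted(positions.items())
    }


document = "a fat cat sat on a mat".split()
vector = make_tsvector(document, labels={"cat": "A"})
print(vector["a"])    # {'positions': [1, 6], 'label': 'D'}
print(vector["cat"])  # {'positions': [3], 'label': 'A'}
```

   Each word is stored once, however often it occurs, and only its
   list of positions grows.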

   tsquery is a data type for textual queries with support for boolean
   operators. It consists of lexemes (optionally labelled) with boolean
   operators between them.
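
   To make the structure concrete, here is a minimal Python sketch of
   a boolean query tree evaluated against a set of lexemes. The tuple
   encoding and the function name matches are hypothetical; the sketch
   ignores labels and is not how tsearch2 stores or executes a
   tsquery.

```python
def matches(query, lexemes):
    """Evaluate a toy boolean query tree against a set of lexemes.

    A query is a nested tuple: ("lexeme", word), ("not", q),
    ("and", q1, q2) or ("or", q1, q2).
    """
    op = query[0]
    if op == "lexeme":
        return query[1] in lexemes
    if op == "not":
        return not matches(query[1], lexemes)
    if op == "and":
        return matches(query[1], lexemes) and matches(query[2], lexemes)
    if op == "or":
        return matches(query[1], lexemes) or matches(query[2], lexemes)
    raise ValueError(f"unknown operator: {op!r}")


document = {"fat", "cat", "sat", "mat"}
query = ("and", ("lexeme", "cat"), ("not", ("lexeme", "dog")))
print(matches(query, document))  # True
```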

   Table-driven configuration allows you to specify:
     * the parser used to break a document into lexemes
     * which lexemes to index and how they are processed
     * the dictionaries to be used, along with stop word recognition.
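
   The three choices listed above can be pictured as a small pipeline.
   The Python sketch below is only a model: the parser, the token type
   names, the stop word list and the CONFIG mapping are all
   hypothetical, whereas in tsearch2 the configuration lives in
   database tables.

```python
import re

STOP_WORDS = {"a", "the", "on"}  # hypothetical stop word list


def parse(document):
    """Hypothetical parser: yield (token_type, token) pairs."""
    for token in re.findall(r"[A-Za-z]+|\d+", document):
        yield ("word" if token.isalpha() else "number"), token


def simple_dictionary(token):
    """Hypothetical dictionary: lowercase; None marks a stop word."""
    lexeme = token.lower()
    return None if lexeme in STOP_WORDS else lexeme


# Which token types are indexed, and which dictionary handles each.
# Token types missing from this mapping are not indexed at all.
CONFIG = {"word": simple_dictionary}


def to_lexemes(document, config=CONFIG):
    lexemes = []
    for token_type, token in parse(document):
        dictionary = config.get(token_type)
        if dictionary is None:
            continue  # this token type is not indexed
        lexeme = dictionary(token)
        if lexeme is not None:  # drop stop words
            lexemes.append(lexeme)
    return lexemes


print(to_lexemes("The cat sat on 2 mats"))  # ['cat', 'sat', 'mats']
```

   Changing what gets indexed then means editing the configuration
   data rather than the code.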

OpenFTS vs Tsearch2

   [3]OpenFTS is middleware between an application and a database.
   OpenFTS uses tsearch2 for storage and the database engine as the
   query executor (searching). Everything else, i.e. parsing of
   documents, query processing and linguistics, is carried out on the
   client side. That is why OpenFTS has its own configuration table
   (fts_conf) and works with its own set of dictionaries. OpenFTS is
   more flexible, because it can be used in a multi-server architecture
   with separate machines for the document repository (documents can be
   stored in the filesystem), the database and the query engine.

   See [4]Documentation Roadmap for links to documentation.

Authors

     * Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
     * Teodor Sigaev <teodor@sigaev.ru>, Moscow, Moscow University, Russia

Contributors

     * Robert John Shepherd and Andrew J. Kopciuch submitted
       "Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch
       v2)
     * Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2
       Reference" and proposed the new naming convention for tsearch V2
Sponsors

     * ABC Startsiden - compound words support
     * University of Mannheim for UTF-8 support (in 8.2)
     * jfg:networks ([5]http://www.jfg-networks.com/) for Gin - Generalized
       Inverted index (in 8.2)
     * Georgia  Public  Library  Service  and LibLime, Inc. for Thesaurus
       dictionary
     * PostGIS community - GiST Concurrency and Recovery

   The  authors are grateful to the Russian Foundation for Basic Research
   and Delta-Soft Ltd., Moscow, Russia for support.

Limitations

     * Length of a lexeme < 2 KB
     * Length of a tsvector (lexemes + positions) < 1 MB
     * Number of lexemes < 4^32 (= 2^64)
     * 0 < positional information < 16383
     * No more than 256 positions per lexeme
     * Number of nodes (lexemes + operations) in a tsquery < 32768

References

     * GiST development site -
       [6]http://www.sai.msu.su/~megera/postgres/gist
     * GiN development - [7]http://www.sigaev.ru/gin/
     * OpenFTS home page - [8]http://openfts.sourceforge.net/
     * Mailing list -
       [9]http://sourceforge.net/mailarchive/forum.php?forum=openfts-general

Documentation Roadmap

     * Several documents are available in the docs/ subdirectory
          + "Tsearch V2 Introduction" by Andrew Kopciuch
          + "Tsearch2 Guide" by Brandon Rhodes
          + "Tsearch2 Reference" by Brandon Rhodes
     * Readme.gendict in the gendict/ subdirectory
          + Also, check the [10]Gendict tutorial
     * Check the [11]tsearch2 Wiki pages for various documentation

Support

   The authors strongly recommend using the [12]openfts-general or
   [13]pgsql-general mailing lists for questions and discussions.

Development History

   Latest news

   For the PostgreSQL 8.2 release we added:
     * multibyte (UTF-8) support
     * Thesaurus dictionary
     * Query rewriting
     * rank_cd relevance function now supports different weights of
       lexemes
     * GiN support improves the scalability of tsearch2

   Pre-tsearch era
          Development of OpenFTS began in 2000, after we realized that we
          needed a search engine optimized for online updates with access
          to metadata from the database. This is essential for online
          news agencies, web portals, digital libraries, etc. Most
          available search engines use an inverted index, which is very
          fast for searching but very slow for online updates.
          Incremental update of an inverted index is a complex
          engineering task, while we needed something light, free and
          able to access metadata from the database. The last requirement
          was very important because in a real-life application a search
          engine should always consult metadata (topic, permissions, date
          range, version, etc.). We use PostgreSQL extensively as a
          database backend and have no intention of moving away from it,
          so the problem was to find a data structure and a fast way to
          access it. PostgreSQL has a rather unique data type for storing
          sets (think of words), namely arrays, but lacks index access to
          them. During our research we found a paper by Joseph
          Hellerstein, who introduced an interesting data structure
          suitable for sets, the RD-tree (Russian Doll tree). Further
          research led us to the idea of using GiST to implement the
          RD-tree, but at that time the GiST code had been untouched for
          a long time and contained several bugs. After the work on
          improving GiST for version 7.0.3 of PostgreSQL was done, we
          were able to implement the RD-tree and use it for index access
          to arrays of integers. This implementation was ideally suited
          to small arrays and eliminated complex joins, but was
          practically useless for indexing large arrays. The next
          improvement came from the idea of representing a document by a
          single bit signature, a so-called superimposed signature (see
          "Index Structures for Databases Containing Data Items with
          Set-valued Attributes", 1997, Sven Helmer, for details). We
          developed the contrib/intarray module and used it for full text
          indexing.
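
          The superimposed signature idea can be sketched in a few lines
          of Python. Each word sets one bit of a fixed-length
          signature, and the document signature is the bitwise OR of
          its word bits. The signature length, the use of zlib.crc32 as
          the hash and the function names are assumptions made for
          this example; they are not taken from the module's source.

```python
import zlib

SIGNATURE_BITS = 64  # hypothetical signature length


def word_bit(word):
    """Map a word to a single bit of the signature."""
    return 1 << (zlib.crc32(word.encode("utf-8")) % SIGNATURE_BITS)


def signature(words):
    """Superimpose (bitwise OR) the bits of all words."""
    sig = 0
    for word in words:
        sig |= word_bit(word)
    return sig


def may_contain(doc_signature, query_words):
    """Return False only if some query word is certainly absent.

    True means "possibly present": different words can share a bit,
    so a positive answer still has to be rechecked against the
    document itself.
    """
    query_sig = signature(query_words)
    return (doc_signature & query_sig) == query_sig


doc_sig = signature(["fat", "cat", "sat"])
print(may_contain(doc_sig, ["cat", "fat"]))  # True
```

          A fixed-size signature keeps index entries small however
          large the document is, at the price of false positives.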

   tsearch v1
          It was inconvenient to use integer IDs instead of words, so we
          introduced a new data type called 'txtidx', a searchable
          (textual) data type with indexed access. This was the first
          step of our work on implementing a built-in PostgreSQL full
          text search engine. Even though tsearch v1 had many features of
          a search engine, it lacked configuration support and relevance
          ranking. People were encouraged to use OpenFTS, which provided
          relevance ranking based on positional information and flexible
          configuration. OpenFTS v.0.34 is the last version based on
          tsearch v1.

   tsearch V2
          People recognized tsearch as a powerful tool for full text
          searching and insisted on adding ranking support, better
          configurability, etc. We had already thought about moving most
          of the features of OpenFTS to tsearch, and in early 2003 we
          decided to work on a new version of tsearch. We abandoned the
          auxiliary index tables which OpenFTS used to store positional
          information and modified the txtidx type to store it
          internally. We added table-driven configuration, support for
          ispell dictionaries and snowball stemmers, and the ability to
          specify which types of lexemes to index. It is now possible to
          generate headlines of documents with highlighted search terms.
          These changes make tsearch more user-friendly and turn it into
          a really powerful full text search engine. Brandon Rhodes
          proposed renaming the tsearch functions for consistency, and we
          renamed the txtidx type to tsvector, and other things as well.
          To give users of tsearch v1 a smooth upgrade, we named the
          module tsearch2. Since version 0.35, OpenFTS uses tsearch2.

References

   1. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html
   2. http://snowball.tartarus.org/
   3. http://openfts.sourceforge.net/
   4. file://localhost/u/megera/WWW/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme82.html#dm
   5. http://www.jfg-networks.com/
   6. http://www.sai.msu.su/~megera/postgres/gist
   7. http://www.sigaev.ru/gin/
   8. http://openfts.sourceforge.net/
   9. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
  10. http://www.sai.msu.su/~megera/wiki/Gendict
  11. http://www.sai.msu.su/~megera/wiki/Tsearch2
  12. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
  13. http://archives.postgresql.org/pgsql-general/