postgresql/contrib/tsearch2/README.tsearch2

Tsearch2 - full text search extension for PostgreSQL

   [1]Online version of this document is available

   Tsearch2 is a full text search engine, fully integrated into the
   PostgreSQL RDBMS.

Main features

     * Full online update
     * Supports multiple table-driven configurations
     * Flexible and rich linguistic support (dictionaries, stop words),
       thesaurus
     * Full multibyte (UTF-8) support
     * Sophisticated ranking functions with support for proximity and
       structure information (rank, rank_cd)
     * Index support (GiST and GiN) with concurrency and recovery support
     * Rich query language with query rewriting support
     * Headline support (text fragments with highlighted search terms)
     * Ability to plug in custom dictionaries and parsers
     * Template generator for tsearch2 dictionaries with [2]snowball
       stemmer support
     * It is mature (5 years of development)

   In a nutshell, Tsearch2 provides an FTS operator (contains) for two
   new data types that represent a document (tsvector) and a query
   (tsquery). Table-driven configuration allows custom searches to be
   created using standard SQL commands.

   tsvector is a searchable data type representing a document. It is a
   set of unique words along with their positional information in the
   document, organized in a special structure optimized for fast access
   and lookup. Each entry can be labelled to reflect its importance in
   the document.
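
   As a rough illustration of the idea (not the actual on-disk format),
   a document can be reduced to a sorted mapping from each unique word
   to its positions. The function name make_vector, the whitespace
   tokenizer and the optional label argument are assumptions made for
   this sketch only.

```python
def make_vector(text, label=None):
    """Map each unique word to a list of (position, label) entries.

    Positions are 1-based word offsets in the document. The label is a
    hypothetical importance marker attached to every entry.
    """
    vector = {}
    for position, word in enumerate(text.lower().split(), start=1):
        vector.setdefault(word, []).append((position, label))
    # Keeping the words sorted stands in for a lookup-friendly layout.
    return dict(sorted(vector.items()))


doc = make_vector("a fat cat sat on a mat")
for word, entries in doc.items():
    print(word, [position for position, _ in entries])
```

   The word "a" occurs twice, so it appears once in the mapping with two
   positions; every other word has a single position.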

   tsquery is a data type for textual queries with support for boolean
   operators. It consists of lexemes (optionally labelled) with boolean
   operators between them.
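
   A minimal sketch of how such a query could be evaluated against the
   set of words in a document. The nested-tuple representation and the
   function name matches are assumptions chosen for illustration; they
   do not reflect the internal tsquery layout, and labels are ignored.

```python
def matches(node, words):
    """Evaluate a boolean query tree against a set of document words.

    A node is one of:
      ("lex", word)          true if the word is in the document
      ("not", child)
      ("and", left, right)
      ("or",  left, right)
    """
    kind = node[0]
    if kind == "lex":
        return node[1] in words
    if kind == "not":
        return not matches(node[1], words)
    if kind == "and":
        return matches(node[1], words) and matches(node[2], words)
    if kind == "or":
        return matches(node[1], words) or matches(node[2], words)
    raise ValueError(f"unknown node type: {kind!r}")


words = {"a", "fat", "cat", "sat", "on", "mat"}
# fat AND (rat OR cat)
query = ("and", ("lex", "fat"), ("or", ("lex", "rat"), ("lex", "cat")))
print(matches(query, words))
```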

   Table-driven configuration allows you to specify:
     * the parser used to break a document into lexemes
     * which lexemes to index and how they are processed
     * the dictionaries to be used, along with stop word recognition.
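
   The three points above can be sketched as a small pipeline. This is
   only an illustration under stated assumptions: the token types, the
   dictionary functions, the stop word list and the CONFIG mapping are
   all hypothetical, and the rule that the first dictionary to
   recognise a token wins is assumed for the sketch.

```python
import re

STOP_WORDS = {"a", "the", "on"}  # hypothetical stop word list


def parse(text):
    """Hypothetical parser: yield (token_type, token) pairs."""
    for token in re.findall(r"[A-Za-z]+|\d+", text):
        yield ("number" if token.isdigit() else "word", token)


def stop_word_dict(token):
    """Return [] to drop a stop word, or None if the token is unknown."""
    return [] if token.lower() in STOP_WORDS else None


def simple_dict(token):
    """Accept any token and normalise it to lower case."""
    return [token.lower()]


# Stand-in for the configuration table: token type -> ordered list of
# dictionaries. Token types missing from the mapping are not indexed.
CONFIG = {"word": [stop_word_dict, simple_dict]}


def to_lexemes(text, config=CONFIG):
    lexemes = []
    for token_type, token in parse(text):
        for dictionary in config.get(token_type, []):
            result = dictionary(token)
            if result is not None:  # this dictionary recognised the token
                lexemes.extend(result)
                break
    return lexemes


print(to_lexemes("The 2 fat cats sat on a mat"))
```

   Here numbers are skipped because no dictionaries are configured for
   them, and "The", "on" and "a" are dropped as stop words.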

OpenFTS vs Tsearch2

   [3]OpenFTS is middleware between an application and the database.
   OpenFTS uses tsearch2 for storage and the database engine as a query
   executor (searching). Everything else, i.e. parsing of documents,
   query processing and linguistics, is carried out on the client side.
   That is why OpenFTS has its own configuration table (fts_conf) and
   works with its own set of dictionaries. OpenFTS is more flexible
   because it can be used in a multi-server architecture with separate
   machines for the document repository (documents can be stored in the
   filesystem), the database and the query engine.

   See [4]Documentation Roadmap for links to documentation.

Authors

     * Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
     * Teodor Sigaev <teodor@sigaev.ru>, Moscow, Moscow University, Russia

Contributors

     * Robert   John   Shepherd   and   Andrew   J.   Kopciuch  submitted
       "Introduction  to  tsearch" (Robert - tsearch v1, Andrew - tsearch
       v2)
     * Brandon   Craig   Rhodes  wrote  "Tsearch2  Guide"  and  "Tsearch2
       Reference" and proposed a new naming convention for tsearch V2

Sponsors

     * ABC Startsiden for compound words support
     * University of Mannheim for UTF-8 support (in 8.2)
     * jfg:networks ([5]http://www.jfg-networks.com/) for GiN - Generalized
       Inverted index (in 8.2)
     * Georgia Public Library Service and LibLime, Inc. for Thesaurus
       dictionary
     * PostGIS community for GiST concurrency and recovery

   The  authors are grateful to the Russian Foundation for Basic Research
   and Delta-Soft Ltd., Moscow, Russia for support.

Limitations

     * Length of lexeme < 2 KB
     * Length of tsvector (lexemes + positions) < 1 MB
     * The number of lexemes < 4^32
     * 0 < Positional information < 16383
     * No more than 256 positions per lexeme
     * The number of nodes (lexemes + operations) in tsquery < 32768

References

     * GiST development site -
       [6]http://www.sai.msu.su/~megera/postgres/gist
     * GiN development - [7]http://www.sigaev.ru/gin/
     * OpenFTS home page - [8]http://openfts.sourceforge.net/
     * Mailing list -
       [9]http://sourceforge.net/mailarchive/forum.php?forum=openfts-general

Documentation Roadmap

     * Several docs are available in the docs/ subdirectory
          + "Tsearch V2 Introduction" by Andrew Kopciuch
          + "Tsearch2 Guide" by Brandon Rhodes
          + "Tsearch2 Reference" by Brandon Rhodes
     * Readme.gendict in gendict/ subdirectory
          + Also, check [10]Gendict tutorial
     * Check [11]tsearch2 Wiki pages for various documentation

Support

   The authors strongly recommend using the [12]openfts-general or
   [13]pgsql-general mailing lists for questions and discussions.

Development History

   Latest news

   For the PostgreSQL 8.2 release we added:
     * multibyte (UTF-8) support
     * Thesaurus dictionary
     * Query rewriting
     * the rank_cd relevance function now supports different weights of
       lexemes
     * GiN support, which improves the scalability of tsearch2

   Pre-tsearch era
          Development of OpenFTS began in 2000, after we realized that we
          needed a search engine optimized for online updates with access
          to metadata from the database. This is essential for online
          news agencies, web portals, digital libraries, etc. Most
          available search engines use an inverted index, which is very
          fast for searching but very slow for online updates.
          Incremental updating of an inverted index is a complex
          engineering task, while we needed something light, free and
          able to access metadata from the database. The last requirement
          was very important because in a real-life application the
          search engine should always consult metadata (topic,
          permissions, date range, version, etc.). We use PostgreSQL
          extensively as a database backend and have no intention of
          moving away from it, so the problem was to find a data
          structure and a fast way to access it. PostgreSQL has a rather
          unique data type for storing sets (think about words), the
          array, but lacks index access to arrays. During our research we
          found a paper by Joseph Hellerstein, who introduced an
          interesting data structure suitable for sets: the RD-tree
          (Russian Doll tree). Further research led us to the idea of
          using GiST to implement the RD-tree, but at that time the GiST
          code had been untouched for a long time and contained several
          bugs. After the work on improving GiST for version 7.0.3 of
          PostgreSQL was done, we were able to implement the RD-tree and
          use it for index access to arrays of integers. This
          implementation was ideally suited for small arrays and
          eliminated complex joins, but was practically useless for
          indexing large arrays. The next improvement came from the idea
          of representing a document by a single bit-signature, a
          so-called superimposed signature (see "Index Structures for
          Databases Containing Data Items with Set-valued Attributes",
          1997, Sven Helmer for details). We developed the
          contrib/intarray module and used it for full text indexing.
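
          The superimposed signature idea can be sketched as follows.
          This is a generic illustration of the technique, not the code
          of contrib/intarray or tsearch2: the 64-bit signature width,
          the use of CRC32 as the hash and all function names are
          assumptions made for the example.

```python
import zlib

SIGNATURE_BITS = 64  # illustrative width, chosen for this sketch only


def word_bit(word):
    """Hash a word to a single bit position in the signature."""
    return 1 << (zlib.crc32(word.encode("utf-8")) % SIGNATURE_BITS)


def signature(words):
    """Superimpose (bitwise OR) the bits of all words into one integer."""
    sig = 0
    for word in words:
        sig |= word_bit(word)
    return sig


def may_contain(doc_sig, query_words):
    """True if every query bit is set in the document signature.

    A True result can be a false positive, because different words may
    hash to the same bit, so matches must be rechecked against the
    document. A False result is always definite.
    """
    query_sig = signature(query_words)
    return (doc_sig & query_sig) == query_sig


doc_sig = signature(["fat", "cat", "sat", "mat"])
print(may_contain(doc_sig, ["cat", "mat"]))
```

          The fixed-size signature is what makes large sets cheap to
          compare, at the cost of having to recheck candidate matches.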

   tsearch v1
          It was inconvenient to use integer IDs instead of words, so we
          introduced a new data type called 'txtidx', a searchable
          (textual) data type with indexed access. This was the first
          step in our work on a built-in PostgreSQL full text search
          engine. Even though tsearch v1 had many features of a search
          engine, it lacked configuration support and relevance ranking.
          People were encouraged to use OpenFTS, which provided relevance
          ranking based on positional information and flexible
          configuration. OpenFTS v.0.34 is the last version based on
          tsearch v1.

   tsearch V2
          People recognized tsearch as a powerful tool for full text
          searching and insisted on adding ranking support, better
          configurability, etc. We had already thought about moving most
          of the features of OpenFTS to tsearch, and in early 2003 we
          decided to work on a new version of tsearch. We abandoned the
          auxiliary index tables that OpenFTS used to store positional
          information and modified the txtidx type to store it
          internally. We added table-driven configuration, support for
          ispell dictionaries and snowball stemmers, and the ability to
          specify which types of lexemes to index. It is now possible to
          generate headlines of documents with highlighted search terms.
          These changes make tsearch more user friendly and turn it into
          a really powerful full text search engine. Brandon Rhodes
          proposed renaming the tsearch functions for consistency, and we
          renamed the txtidx type to tsvector, along with other things.
          To give users of tsearch v1 a smooth upgrade, we named the
          module tsearch2. Since version 0.35, OpenFTS uses tsearch2.

References

   1. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html
   2. http://snowball.tartarus.org/
   3. http://openfts.sourceforge.net/
   4. file://localhost/u/megera/WWW/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme82.html#dm
   5. http://www.jfg-networks.com/
   6. http://www.sai.msu.su/~megera/postgres/gist
   7. http://www.sigaev.ru/gin/
   8. http://openfts.sourceforge.net/
   9. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
  10. http://www.sai.msu.su/~megera/wiki/Gendict
  11. http://www.sai.msu.su/~megera/wiki/Tsearch2
  12. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
  13. http://archives.postgresql.org/pgsql-general/