postgresql/contrib/pg_trgm/README.pg_trgm

trgm - Trigram matching for PostgreSQL
--------------------------------------

Introduction

	This module is sponsored by Delta-Soft Ltd., Moscow, Russia.

	The pg_trgm contrib module provides functions and index classes
	for determining the similarity of text based on trigram
	matching.

Definitions

	Trigram (or Trigraph)

	A trigram is a set of three consecutive characters taken
	from a string.  A string is considered to have two spaces
	prefixed and one space suffixed when determining the set
	of trigrams that comprise the string.

	eg. The set of trigrams in the word "cat" is "  c", " ca", 
	"at " and "cat".

Public Functions

	real similarity(text, text)

	Returns a number that indicates how closely matches the two
	arguments are.  A zero result indicates that the two words
	are completely dissimilar, and a result of one indicates that
	the two words are identical.

	real show_limit()

	Returns the current similarity threshold used by the '%'
	operator.  This in effect sets the minimum similarity between
	two words in order that they be considered similar enough to
	be misspellings of each other, for example.

	real set_limit(real)

	Sets the current similarity threshold that is used by the '%'
	operator, and is returned by the show_limit() function.

	text[] show_trgm(text)

	Returns an array of all the trigrams of the supplied text
	parameter.

Public Operators

	text % text (returns boolean)

	The '%' operator returns TRUE if its two arguments have a similarity
	that is greater than the similarity threshold set by set_limit(). It
	will return FALSE if the similarity is less than the current
	threshold.

Public Index Operator Classes

	gist_trgm_ops

	The pg_trgm module comes with an index operator class that allows a
	developer to create an index over a text column for the purpose
	of very fast similarity searches.

	To use this index, the '%' operator must be used and an appropriate
	similarity threshold for the application must be set.

	eg.

	CREATE TABLE test_trgm (t text);
	CREATE INDEX trgm_idx ON test_trgm USING gist (t gist_trgm_ops);
	
	At this point, you will have an index on the t text column that you
	can use for similarity searching.

	eg.

	SELECT
		t,
		similarity(t, 'word') AS sml
	FROM
		test_trgm
	WHERE
		t % 'word'
	ORDER BY
		sml DESC, t;

	This will return all values in the text column that are sufficiently
	similar to 'word', sorted from best match to worst.  The index will
	be used to make this a fast operation over very large data sets.

Tsearch2 Integration

	Trigram matching is a very useful tool when used in conjunction
	with a text index created by the Tsearch2 contrib module. (See
	contrib/tsearch2)

	The first step is to generate an auxiliary table containing all
	the unique words in the Tsearch2 index:

	CREATE TABLE words AS SELECT word FROM
		stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');

	Where 'documents' is a table that has a text field 'bodytext'
	that TSearch2 is used to search.  The use of the 'simple' dictionary
	with the to_tsvector function, instead of just using the already
	existing vector is to avoid creating a list of already stemmed
	words.  This way, only the original, unstemmed words are added
	to the word list.

	Next, create a trigram index on the word column:

	CREATE INDEX words_idx ON words USING gist(word gist_trgm_ops);
	or
	CREATE INDEX words_idx ON words USING gin(word gist_trgm_ops);

	Now, a SELECT query similar to the example above can be used to
	suggest spellings for misspelled words in user search terms. A
	useful extra clause is to ensure that the similar words are also
	of similar length to the misspelled word.

	Note: Since the 'words' table has been generated as a separate,
	static table, it will need to be periodically regenerated so that
	it remains up to date with the word list in the Tsearch2 index.

Authors

	Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
	Teodor Sigaev <teodor@sigaev.ru>, Moscow, Delta-Soft Ltd.,Russia
       
Contributors

	Christopher Kings-Lynne wrote this README file

References

	Tsearch2 Development Site
	http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
	
	GiST Development Site
	http://www.sai.msu.su/~megera/postgres/gist/
trgm - Trigram matching for PostgreSQL -------------------------------------- The pg_trgm contrib module provides functions and index classes for determining the similarity of text based on trigram matching. 2004-06-01 01:18:12 +08:00			`trgm - Trigram matching for PostgreSQL`
			`--------------------------------------`

			`Introduction`

			`This module is sponsored by Delta-Soft Ltd., Moscow, Russia.`

			`The pg_trgm contrib module provides functions and index classes`
			`for determining the similarity of text based on trigram`
			`matching.`

			`Definitions`

			`Trigram (or Trigraph)`

			`A trigram is a set of three consecutive characters taken`
			`from a string. A string is considered to have two spaces`
			`prefixed and one space suffixed when determining the set`
			`of trigrams that comprise the string.`

			`eg. The set of trigrams in the word "cat" is " c", " ca",`
			`"at " and "cat".`

			`Public Functions`

			`real similarity(text, text)`

			`Returns a number that indicates how closely matches the two`
			`arguments are. A zero result indicates that the two words`
			`are completely dissimilar, and a result of one indicates that`
			`the two words are identical.`

			`real show_limit()`

			`Returns the current similarity threshold used by the '%'`
			`operator. This in effect sets the minimum similarity between`
			`two words in order that they be considered similar enough to`
			`be misspellings of each other, for example.`

			`real set_limit(real)`

			`Sets the current similarity threshold that is used by the '%'`
			`operator, and is returned by the show_limit() function.`

			`text[] show_trgm(text)`

			`Returns an array of all the trigrams of the supplied text`
			`parameter.`

			`Public Operators`

			`text % text (returns boolean)`

			`The '%' operator returns TRUE if its two arguments have a similarity`
			`that is greater than the similarity threshold set by set_limit(). It`
			`will return FALSE if the similarity is less than the current`
			`threshold.`

			`Public Index Operator Classes`

			`gist_trgm_ops`

			`The pg_trgm module comes with an index operator class that allows a`
			`developer to create an index over a text column for the purpose`
			`of very fast similarity searches.`

			`To use this index, the '%' operator must be used and an appropriate`
			`similarity threshold for the application must be set.`

			`eg.`

			`CREATE TABLE test_trgm (t text);`
			`CREATE INDEX trgm_idx ON test_trgm USING gist (t gist_trgm_ops);`

			`At this point, you will have an index on the t text column that you`
			`can use for similarity searching.`

			`eg.`

			`SELECT`
			`t,`
			`similarity(t, 'word') AS sml`
			`FROM`
			`test_trgm`
			`WHERE`
			`t % 'word'`
			`ORDER BY`
			`sml DESC, t;`

			`This will return all values in the text column that are sufficiently`
			`similar to 'word', sorted from best match to worst. The index will`
			`be used to make this a fast operation over very large data sets.`

			`Tsearch2 Integration`

			`Trigram matching is a very useful tool when used in conjunction`
			`with a text index created by the Tsearch2 contrib module. (See`
			`contrib/tsearch2)`

			`The first step is to generate an auxiliary table containing all`
			`the unique words in the Tsearch2 index:`

This adds mention of my latest tweak to the tsearch2/pg_trgm integration. It is much better to create a word list of unstemmed words than stemmed ones. Chris K-L 2004-11-27 08:01:02 +08:00			`CREATE TABLE words AS SELECT word FROM`
			`stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');`

			`Where 'documents' is a table that has a text field 'bodytext'`
			`that TSearch2 is used to search. The use of the 'simple' dictionary`
			`with the to_tsvector function, instead of just using the already`
			`existing vector is to avoid creating a list of already stemmed`
			`words. This way, only the original, unstemmed words are added`
			`to the word list.`
trgm - Trigram matching for PostgreSQL -------------------------------------- The pg_trgm contrib module provides functions and index classes for determining the similarity of text based on trigram matching. 2004-06-01 01:18:12 +08:00
			`Next, create a trigram index on the word column:`

			`CREATE INDEX words_idx ON words USING gist(word gist_trgm_ops);`
Add GIN support for pg_trgm. From Guillaume Smet <guillaume.smet@gmail.com> with minor editorization by me. 2007-03-14 22:15:40 +08:00			`or`
			`CREATE INDEX words_idx ON words USING gin(word gist_trgm_ops);`
trgm - Trigram matching for PostgreSQL -------------------------------------- The pg_trgm contrib module provides functions and index classes for determining the similarity of text based on trigram matching. 2004-06-01 01:18:12 +08:00
			`Now, a SELECT query similar to the example above can be used to`
			`suggest spellings for misspelled words in user search terms. A`
			`useful extra clause is to ensure that the similar words are also`
			`of similar length to the misspelled word.`

			`Note: Since the 'words' table has been generated as a separate,`
			`static table, it will need to be periodically regenerated so that`
			`it remains up to date with the word list in the Tsearch2 index.`

			`Authors`

			`Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia`
			`Teodor Sigaev <teodor@sigaev.ru>, Moscow, Delta-Soft Ltd.,Russia`

			`Contributors`

			`Christopher Kings-Lynne wrote this README file`

			`References`

			`Tsearch2 Development Site`
			`http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/`

			`GiST Development Site`
			`http://www.sai.msu.su/~megera/postgres/gist/`