Get rid of USE_WIDE_UPPER_LOWER dependency in trigram construction.

contrib/pg_trgm's make_trigrams() was coded to ignore multibyte character boundaries and just make trigrams from bytes if USE_WIDE_UPPER_LOWER wasn't defined. This is a bit odd, since there's no obvious reason why trigram compaction rules should depend on the presence of towlower() and friends. What's more, there was an Assert() that would fail if that code path was fed any multibyte characters.

We need to do something about this since the pending regex-indexing patch has an assumption that you get just one "trgm" from any three characters. The best solution seems to be to remove the USE_WIDE_UPPER_LOWER dependency, which shouldn't really have been there in the first place. The second loop in make_trigrams() is now just a fast path and not a potentially incompatible algorithm.

If there is anybody still using Postgres on machines without wcstombs() or towlower(), and they have non-ASCII data indexed by pg_trgm, they'll need to REINDEX those indexes after pg_upgrade to 9.3, else searches may fail incorrectly. It seems likely that there are no such installations, though.

In passing, rename cnt_trigram to compact_trigram, which seems to better describe its functionality, and improve make_trigrams' test for whether it has to use the slow path or not (per a suggestion from Alexander Korotkov).
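
For readers who want to see the "one trgm per three characters" rule in isolation, here is a minimal standalone sketch, not the pg_trgm code itself. It assumes UTF-8 input and a NUL-terminated string. The names utf8_mblen(), fnv1a(), sketch_compact_trigram() and sketch_make_trigrams() are hypothetical: utf8_mblen() stands in for pg_mblen(), and the FNV-1a hash is only a placeholder for the CRC that the real compact_trigram() computes, so the hashed bytes will not match what pg_trgm stores.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for pg_mblen(): UTF-8 character length from the lead byte. */
static int
utf8_mblen(const char *s)
{
	unsigned char c = (unsigned char) *s;

	if (c < 0x80)
		return 1;
	if ((c & 0xE0) == 0xC0)
		return 2;
	if ((c & 0xF0) == 0xE0)
		return 3;
	if ((c & 0xF8) == 0xF0)
		return 4;
	return 1;					/* invalid lead byte: treat as one byte */
}

/* Placeholder hash (FNV-1a); the real code uses a CRC instead. */
static uint32_t
fnv1a(const char *s, int len)
{
	uint32_t	h = 2166136261u;
	int			i;

	for (i = 0; i < len; i++)
	{
		h ^= (unsigned char) s[i];
		h *= 16777619u;
	}
	return h;
}

/* Reduce three characters (bytelen bytes in total) to exactly three bytes. */
static void
sketch_compact_trigram(unsigned char *out, const char *str, int bytelen)
{
	if (bytelen == 3)
		memcpy(out, str, 3);	/* three single-byte characters: use as-is */
	else
	{
		uint32_t	h = fnv1a(str, bytelen);

		out[0] = h & 0xFF;
		out[1] = (h >> 8) & 0xFF;
		out[2] = (h >> 16) & 0xFF;
	}
}

/* Writes one 3-byte trgm per character triple; returns how many were written. */
static int
sketch_make_trigrams(unsigned char (*out)[3], const char *str,
					 int bytelen, int charlen)
{
	const char *ptr = str;
	int			n = 0;

	if (charlen < 3)
		return 0;

	if (bytelen > charlen)
	{
		/* Slow path: at least one multibyte character is present */
		int			lenfirst = utf8_mblen(str),
					lenmiddle = utf8_mblen(str + lenfirst),
					lenlast = utf8_mblen(str + lenfirst + lenmiddle);

		while ((ptr - str) + lenfirst + lenmiddle + lenlast <= bytelen)
		{
			sketch_compact_trigram(out[n++], ptr,
								   lenfirst + lenmiddle + lenlast);
			ptr += lenfirst;
			lenfirst = lenmiddle;
			lenmiddle = lenlast;
			/* May inspect the terminating NUL after the final trigram */
			lenlast = utf8_mblen(ptr + lenfirst + lenmiddle);
		}
	}
	else
	{
		/* Fast path: every character is a single byte */
		assert(bytelen == charlen);
		while (ptr - str < bytelen - 2)
		{
			memcpy(out[n++], ptr, 3);
			ptr++;
		}
	}
	return n;
}
```

In this sketch both paths emit charlen - 2 trgms, and a triple of single-byte characters yields the same three bytes whichever path handles it, which is the property that makes the second loop a pure fast path.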
2024-12-27 08:39:28 +08:00 · 2013-04-07 14:45:33 -04:00
commit 7844608e54
parent faf4726c9f
1 changed file with 10 additions and 7 deletions
--- a/contrib/pg_trgm/trgm_op.c
+++ b/contrib/pg_trgm/trgm_op.c
@@ -109,9 +109,13 @@ find_word(char *str, int lenstr, char **endword, int *charlen)
 	return beginword;
 }

-#ifdef USE_WIDE_UPPER_LOWER
+/*
+ * Reduce a trigram (three possibly multi-byte characters) to a trgm,
+ * which is always exactly three bytes.  If we have three single-byte
+ * characters, we just use them as-is; otherwise we form a hash value.
+ */
 static void
-cnt_trigram(trgm *tptr, char *str, int bytelen)
+compact_trigram(trgm *tptr, char *str, int bytelen)
 {
 	if (bytelen == 3)
 	{
@@ -131,7 +135,6 @@ cnt_trigram(trgm *tptr, char *str, int bytelen)
 		CPTRGM(tptr, &crc);
 	}
 }
-#endif

 /*
 * Adds trigrams from words (already padded).
@@ -144,16 +147,16 @@ make_trigrams(trgm *tptr, char *str, int bytelen, int charlen)
 	if (charlen < 3)
 		return tptr;

-#ifdef USE_WIDE_UPPER_LOWER
-	if (pg_database_encoding_max_length() > 1)
+	if (bytelen > charlen)
 	{
+		/* Find multibyte character boundaries and apply compact_trigram */
 		int			lenfirst = pg_mblen(str),
 					lenmiddle = pg_mblen(str + lenfirst),
 					lenlast = pg_mblen(str + lenfirst + lenmiddle);

 		while ((ptr - str) + lenfirst + lenmiddle + lenlast <= bytelen)
 		{
-			cnt_trigram(tptr, ptr, lenfirst + lenmiddle + lenlast);
+			compact_trigram(tptr, ptr, lenfirst + lenmiddle + lenlast);

 			ptr += lenfirst;
 			tptr++;
@@ -164,8 +167,8 @@ make_trigrams(trgm *tptr, char *str, int bytelen, int charlen)
 		}
 	}
 	else
-#endif
 	{
+		/* Fast path when there are no multibyte characters */
 		Assert(bytelen == charlen);

 		while (ptr - str < bytelen - 2 /* number of trigrams = strlen - 2 */ )