mirror of
https://git.postgresql.org/git/postgresql.git
synced 2024-12-15 08:20:16 +08:00
Make citext's equality and hashing functions collation-insensitive.
This is an ugly hack to get around the fact that significant parts of the core backend assume they don't need to worry about passing collation to equality and hashing functions. That's true for the core string datatypes, but citext should ideally have equality behavior that depends on the specified collation's LC_CTYPE. However, there's no chance of fixing the core before 9.2, so we'll have to live with this compromise arrangement for now. Per bug #6053 from Regina Obe. The code changes in this commit should be reverted in full once the core code is up to speed, but be careful about reverting the docs changes: I fixed a number of obsolete statements while at it.
This commit is contained in:
parent
1bcdd66315
commit
3ebc061c18
@ -4,6 +4,7 @@
|
||||
#include "postgres.h"
|
||||
|
||||
#include "access/hash.h"
|
||||
#include "catalog/pg_collation.h"
|
||||
#include "fmgr.h"
|
||||
#include "utils/builtins.h"
|
||||
#include "utils/formatting.h"
|
||||
@ -48,8 +49,16 @@ citextcmp(text *left, text *right, Oid collid)
|
||||
*rcstr;
|
||||
int32 result;
|
||||
|
||||
lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left), collid);
|
||||
rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right), collid);
|
||||
/*
|
||||
* We must do our str_tolower calls with DEFAULT_COLLATION_OID, not the
|
||||
* input collation as you might expect. This is so that the behavior of
|
||||
* citext's equality and hashing functions is not collation-dependent. We
|
||||
* should change this once the core infrastructure is able to cope with
|
||||
* collation-dependent equality and hashing functions.
|
||||
*/
|
||||
|
||||
lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left), DEFAULT_COLLATION_OID);
|
||||
rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right), DEFAULT_COLLATION_OID);
|
||||
|
||||
result = varstr_cmp(lcstr, strlen(lcstr),
|
||||
rcstr, strlen(rcstr),
|
||||
@ -93,7 +102,7 @@ citext_hash(PG_FUNCTION_ARGS)
|
||||
char *str;
|
||||
Datum result;
|
||||
|
||||
str = str_tolower(VARDATA_ANY(txt), VARSIZE_ANY_EXHDR(txt), PG_GET_COLLATION());
|
||||
str = str_tolower(VARDATA_ANY(txt), VARSIZE_ANY_EXHDR(txt), DEFAULT_COLLATION_OID);
|
||||
result = hash_any((unsigned char *) str, strlen(str));
|
||||
pfree(str);
|
||||
|
||||
@ -122,8 +131,8 @@ citext_eq(PG_FUNCTION_ARGS)
|
||||
|
||||
/* We can't compare lengths in advance of downcasing ... */
|
||||
|
||||
lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left), PG_GET_COLLATION());
|
||||
rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right), PG_GET_COLLATION());
|
||||
lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left), DEFAULT_COLLATION_OID);
|
||||
rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right), DEFAULT_COLLATION_OID);
|
||||
|
||||
/*
|
||||
* Since we only care about equality or not-equality, we can avoid all the
|
||||
@ -152,8 +161,8 @@ citext_ne(PG_FUNCTION_ARGS)
|
||||
|
||||
/* We can't compare lengths in advance of downcasing ... */
|
||||
|
||||
lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left), PG_GET_COLLATION());
|
||||
rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right), PG_GET_COLLATION());
|
||||
lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left), DEFAULT_COLLATION_OID);
|
||||
rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right), DEFAULT_COLLATION_OID);
|
||||
|
||||
/*
|
||||
* Since we only care about equality or not-equality, we can avoid all the
|
||||
|
@ -58,9 +58,9 @@ SELECT * FROM tab WHERE lower(col) = LOWER(?);
|
||||
The <type>citext</> data type allows you to eliminate calls
|
||||
to <function>lower</> in SQL queries, and allows a primary key to
|
||||
be case-insensitive. <type>citext</> is locale-aware, just
|
||||
like <type>text</>, which means that the comparison of upper case and
|
||||
like <type>text</>, which means that the matching of upper case and
|
||||
lower case characters is dependent on the rules of
|
||||
the <literal>LC_CTYPE</> locale setting. Again, this behavior is
|
||||
the database's <literal>LC_CTYPE</> setting. Again, this behavior is
|
||||
identical to the use of <function>lower</> in queries. But because it's
|
||||
done transparently by the data type, you don't have to remember to do
|
||||
anything special in your queries.
|
||||
@ -97,17 +97,25 @@ SELECT * FROM users WHERE nick = 'Larry';
|
||||
|
||||
<sect2>
|
||||
<title>String Comparison Behavior</title>
|
||||
|
||||
<para>
|
||||
<type>citext</> performs comparisons by converting each string to lower
|
||||
case (as though <function>lower</> were called) and then comparing the
|
||||
results normally. Thus, for example, two strings are considered equal
|
||||
if <function>lower</> would produce identical results for them.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
In order to emulate a case-insensitive collation as closely as possible,
|
||||
there are <type>citext</>-specific versions of a number of the comparison
|
||||
there are <type>citext</>-specific versions of a number of string-processing
|
||||
operators and functions. So, for example, the regular expression
|
||||
operators <literal>~</> and <literal>~*</> exhibit the same behavior when
|
||||
applied to <type>citext</>: they both compare case-insensitively.
|
||||
applied to <type>citext</>: they both match case-insensitively.
|
||||
The same is true
|
||||
for <literal>!~</> and <literal>!~*</>, as well as for the
|
||||
<literal>LIKE</> operators <literal>~~</> and <literal>~~*</>, and
|
||||
<literal>!~~</> and <literal>!~~*</>. If you'd like to match
|
||||
case-sensitively, you can always cast to <type>text</> before comparing.
|
||||
case-sensitively, you can cast the operator's arguments to <type>text</>.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
@ -168,10 +176,10 @@ SELECT * FROM users WHERE nick = 'Larry';
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
<para>
|
||||
<type>citext</>'s behavior depends on
|
||||
<type>citext</>'s case-folding behavior depends on
|
||||
the <literal>LC_CTYPE</> setting of your database. How it compares
|
||||
values is therefore determined when
|
||||
<application>initdb</> is run to create the cluster. It is not truly
|
||||
values is therefore determined when the database is created.
|
||||
It is not truly
|
||||
case-insensitive in the terms defined by the Unicode standard.
|
||||
Effectively, what this means is that, as long as you're happy with your
|
||||
collation, you should be happy with <type>citext</>'s comparisons. But
|
||||
@ -181,6 +189,20 @@ SELECT * FROM users WHERE nick = 'Larry';
|
||||
</para>
|
||||
</listitem>
|
||||
|
||||
<listitem>
|
||||
<para>
|
||||
As of <productname>PostgreSQL</> 9.1, you can attach a
|
||||
<literal>COLLATE</> specification to <type>citext</> columns or data
|
||||
values. Currently, <type>citext</> operators will honor a non-default
|
||||
<literal>COLLATE</> specification while comparing case-folded strings,
|
||||
but the initial folding to lower case is always done according to the
|
||||
database's <literal>LC_CTYPE</> setting (that is, as though
|
||||
<literal>COLLATE "default"</> were given). This may be changed in a
|
||||
future release so that both steps follow the input <literal>COLLATE</>
|
||||
specification.
|
||||
</para>
|
||||
</listitem>
|
||||
|
||||
<listitem>
|
||||
<para>
|
||||
<type>citext</> is not as efficient as <type>text</> because the
|
||||
@ -198,11 +220,11 @@ SELECT * FROM users WHERE nick = 'Larry';
|
||||
contexts. The standard answer is to use the <type>text</> type and
|
||||
manually use the <function>lower</> function when you need to compare
|
||||
case-insensitively; this works all right if case-insensitive comparison
|
||||
is needed only infrequently. If you need case-insensitive most of
|
||||
the time and case-sensitive infrequently, consider storing the data
|
||||
is needed only infrequently. If you need case-insensitive behavior most
|
||||
of the time and case-sensitive infrequently, consider storing the data
|
||||
as <type>citext</> and explicitly casting the column to <type>text</>
|
||||
when you want case-sensitive comparison. In either situation, you
|
||||
will need two indexes if you want both types of searches to be fast.
|
||||
when you want case-sensitive comparison. In either situation, you will
|
||||
need two indexes if you want both types of searches to be fast.
|
||||
</para>
|
||||
</listitem>
|
||||
|
||||
@ -210,8 +232,8 @@ SELECT * FROM users WHERE nick = 'Larry';
|
||||
<para>
|
||||
The schema containing the <type>citext</> operators must be
|
||||
in the current <varname>search_path</> (typically <literal>public</>);
|
||||
if it is not, a normal case-sensitive <type>text</> comparison
|
||||
is performed.
|
||||
if it is not, the normal case-sensitive <type>text</> operators
|
||||
will be invoked instead.
|
||||
</para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
|
Loading…
Reference in New Issue
Block a user