PostgreSQL 6.5.1 multi-byte (MB) support README          July 11 1999

                                                Tatsuo Ishii
                                                t-ishii@sra.co.jp
                  http://www.sra.co.jp/people/t-ishii/PostgreSQL/

0. Introduction

The MB support is intended to allow PostgreSQL to handle multi-byte
character sets such as EUC (Extended Unix Code), Unicode, and Mule
internal code. With MB enabled you can use multi-byte character sets
in regexp, LIKE, and some functions. The default encoding system is
chosen when you initialize your PostgreSQL installation with
initdb(1). Note that this can be overridden when you create a
database using createdb(1) or the CREATE DATABASE SQL command, so you
can have multiple databases with different encoding systems.

MB also fixes some problems concerning 8-bit single-byte character
sets, including ISO 8859. (I would not say that all problems have been
fixed. I have only confirmed that the regression tests ran fine and
that a few French characters could be used with the patch. Please let
me know if you find any problem while using 8-bit characters.)

1. How to use

Run configure with the mb option:

    % configure --with-mb=encoding_system

where encoding_system is one of:

    SQL_ASCII       ASCII
    EUC_JP          Japanese EUC
    EUC_CN          Chinese EUC
    EUC_KR          Korean EUC
    EUC_TW          Taiwan EUC
    UNICODE         Unicode (UTF-8)
    MULE_INTERNAL   Mule internal
    LATIN1          ISO 8859-1 English and some European languages
    LATIN2          ISO 8859-2 English and some European languages
    LATIN3          ISO 8859-3 English and some European languages
    LATIN4          ISO 8859-4 English and some European languages
    LATIN5          ISO 8859-5 English and some European languages
    KOI8            KOI8-R
    WIN             Windows CP1251
    ALT             Windows CP866

Example:

    % configure --with-mb=EUC_JP

If MB is disabled, nothing changes except for better support of 8-bit
single-byte character sets.

2. How to set encoding

The initdb command defines the default encoding for a PostgreSQL
installation. For example:

    % initdb -e EUC_JP

sets the default encoding to EUC_JP (Extended Unix Code for Japanese).
Note that you can use "-pgencoding" instead of "-e" if you prefer a
longer option string :-) If neither -e nor -pgencoding is given, the
encoding specified at compile time is used.

You can create a database with a different encoding. For example:

    % createdb -E EUC_KR korean

creates a database named "korean" with the EUC_KR encoding. Another
way to accomplish this is to use an SQL command:

    CREATE DATABASE korean WITH ENCODING = 'EUC_KR';

The encoding of a database is recorded in the "encoding" column of
the pg_database system catalog.

    datname      |datdba|encoding|datpath
    -------------+------+--------+-------------
    template1    |  1739|       1|template1
    postgres     |  1739|       0|postgres
    euc_jp       |  1739|       1|euc_jp
    euc_kr       |  1739|       3|euc_kr
    euc_cn       |  1739|       2|euc_cn
    unicode      |  1739|       5|unicode
    mule_internal|  1739|       6|mule_internal

The number in the encoding column is the "encoding id", which can be
translated to the encoding name using the pg_encoding command:

    $ pg_encoding 1
    EUC_JP

If the argument to pg_encoding is not a number, it is regarded as an
encoding name, and pg_encoding returns the corresponding encoding id:

    $ pg_encoding EUC_JP
    1

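The two-way lookup done by pg_encoding can be pictured with the small
Python sketch below. It is only an illustration, not the actual
implementation: the table holds just the ids that can be read off the
pg_database listing above, assuming each example database is named
after its encoding, and the names ENCODING_NAMES, ENCODING_IDS and
lookup are made up for this example.

```python
# Illustrative only: ids are taken from the pg_database listing above,
# assuming each example database is named after its encoding.
ENCODING_NAMES = {
    1: "EUC_JP",
    2: "EUC_CN",
    3: "EUC_KR",
    5: "UNICODE",
    6: "MULE_INTERNAL",
}

# Reverse table for name -> id lookups.
ENCODING_IDS = {name: enc_id for enc_id, name in ENCODING_NAMES.items()}


def lookup(arg: str) -> str:
    """Mimic pg_encoding: a numeric argument gives the name,
    anything else is treated as a name and gives the id."""
    if arg.isdigit():
        return ENCODING_NAMES[int(arg)]
    return str(ENCODING_IDS[arg])


if __name__ == "__main__":
    print(lookup("1"))
    print(lookup("EUC_JP"))
```

An unknown id or name raises KeyError in this sketch; how the real
command reports such input is not covered here.
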
3. PGCLIENTENCODING

If the environment variable PGCLIENTENCODING is defined on the
frontend, automatic encoding translation is done by the backend. For
example, if the backend has been compiled with MB=EUC_JP and
PGCLIENTENCODING=SJIS (Shift JIS, yet another Japanese encoding
system), then any SJIS strings coming from the frontend are
translated to EUC_JP before going into the parser. Output from the
backend is, of course, translated to SJIS.

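To get a feeling for what this translation does to the bytes, here is
a rough Python sketch of the SJIS/EUC_JP round trip described above.
It uses Python's own "shift_jis" and "euc_jp" codecs, not the
backend's conversion routines, and the function name translate is
made up for this example.

```python
def translate(data: bytes, source: str, target: str) -> bytes:
    """Re-encode a byte string from one character encoding to another."""
    return data.decode(source).encode(target)


# The Japanese word "nihongo" (U+65E5 U+672C U+8A9E), written with
# escapes so that this file stays plain ASCII, as an SJIS client
# would send it.
sjis_from_client = "\u65e5\u672c\u8a9e".encode("shift_jis")

# Frontend -> backend: SJIS becomes EUC_JP before reaching the parser.
euc_for_backend = translate(sjis_from_client, "shift_jis", "euc_jp")

# Backend -> frontend: output is converted back to SJIS.
sjis_for_client = translate(euc_for_backend, "euc_jp", "shift_jis")

if __name__ == "__main__":
    print("SJIS bytes:  ", sjis_from_client.hex())
    print("EUC_JP bytes:", euc_for_backend.hex())
```

The same characters have different byte sequences in the two
encodings, which is why the translation step is needed at all.
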
Supported encodings for PGCLIENTENCODING are:

    SQL_ASCII       ASCII
    EUC_JP          Japanese EUC
    SJIS            Yet another Japanese encoding
    EUC_CN          Chinese EUC
    EUC_KR          Korean EUC
    EUC_TW          Taiwan EUC
    BIG5            Traditional Chinese
    MULE_INTERNAL   Mule internal
    LATIN1          ISO 8859-1 English and some European languages
    LATIN2          ISO 8859-2 English and some European languages
    LATIN3          ISO 8859-3 English and some European languages
    LATIN4          ISO 8859-4 English and some European languages
    LATIN5          ISO 8859-5 English and some European languages
    KOI8            KOI8-R
    WIN             Windows CP1251
    ALT             Windows CP866
    WIN1250         Windows CP1250 (Czech)

Note that UNICODE is not supported (yet). Also note that translation
is not always possible. Suppose you choose EUC_JP for the backend and
LATIN1 for the frontend; then some Japanese characters cannot be
translated into Latin. In this case, a character that cannot be
represented in the Latin character set is transformed as:

    (HEXADECIMAL)

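The fallback can be sketched in Python roughly as follows. This is an
illustration under assumptions, not the backend code: it prints the
character's EUC_JP bytes in hexadecimal between parentheses, whereas
the bytes the backend actually prints may differ (for example, they
may come from its internal representation). The function name
to_latin1_with_fallback is made up for this example.

```python
def to_latin1_with_fallback(text: str, backend_encoding: str = "euc_jp") -> str:
    """Keep characters LATIN1 can hold; show the rest as "(hex bytes)"."""
    pieces = []
    for ch in text:
        try:
            ch.encode("latin-1")
            pieces.append(ch)
        except UnicodeEncodeError:
            # Assumption: show the character's bytes in the backend
            # encoding; the real backend may print different bytes.
            pieces.append("(" + ch.encode(backend_encoding).hex() + ")")
    return "".join(pieces)


if __name__ == "__main__":
    # U+65E5 is a Japanese character that LATIN1 cannot represent.
    print(to_latin1_with_fallback("abc\u65e5"))
```
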
4. SET CLIENT_ENCODING TO command

The frontend-side encoding is actually set with a new command:

    SET CLIENT_ENCODING TO 'encoding';

where encoding is one of the encodings that can be set in
PGCLIENTENCODING. You can also use the SQL92 syntax "SET NAMES" for
this purpose:

    SET NAMES 'encoding';

To query the current frontend encoding:

    SHOW CLIENT_ENCODING;

To return to the default encoding:

    RESET CLIENT_ENCODING;

This resets the frontend encoding to the same as the backend
encoding, so no encoding translation is performed.

5. References

These are good sources for starting to learn about the various kinds
of encoding systems.

    ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
        Detailed explanations of EUC_JP, EUC_CN, EUC_KR, EUC_TW
        appear in section 3.2.

    Unicode: http://www.unicode.org/
        The homepage of UNICODE.

    RFC 2044
        UTF-8 is defined here.

6. History

July 11, 1999
    * Add support for WIN1250 (Windows Czech) as a client encoding
      (contributed by Pavel Behal)
    * fix some compiler warnings (contributed by Tomoaki Nishiyama)

Mar 23, 1999
    * Add support for KOI8 (KOI8-R), WIN (CP1251), ALT (CP866)
      (thanks to Oleg Broytmann for testing)
    * Fix problem with MB and locale

Jan 26, 1999
    * Add support for Big5 as a frontend encoding
      (you need to create a database with EUC_TW to use Big5)
    * Add regression test case for EUC_TW
      (contributed by Jonah Kuo <jonahkuo@mail.ttn.com.tw>)

Dec 15, 1998
    * Bugs related to SQL_ASCII support fixed

Nov 5, 1998
    * 6.4 release. In this version, pg_database has an "encoding"
      column that represents the database encoding

Jul 22, 1998
    * determine encoding at initdb/createdb rather than compile time
    * support for PGCLIENTENCODING when issuing COPY command
    * support for SQL92 syntax "SET NAMES"
    * support for LATIN2-5
    * add UNICODE regression test case
    * new test suite for MB
    * clean up source files

Jun 5, 1998
    * add support for the encoding translation between the backend
      and the frontend
    * new command SET CLIENT_ENCODING etc. added
    * add support for LATIN1 character set
    * enhance 8-bit cleanness

April 21, 1998  some enhancements/fixes
    * character_length(), position(), substring() are now aware of
      multi-byte characters
    * add octet_length()
    * add --with-mb option to configure
    * new regression tests for EUC_KR
      (contributed by "Soonmyung. Hong" <hong@lunaris.hanmesoft.co.kr>)
    * add some test cases to the EUC_JP regression test
    * fix problem in regress/regress.sh in case of System V
    * fix toupper(), tolower() to handle 8bit chars

Mar 25, 1998    MB PL2 is incorporated into PostgreSQL 6.3.1

Mar 10, 1998    PL2 released
    * add regression test for EUC_JP, EUC_CN and MULE_INTERNAL
    * add an English document (this file)
    * fix problems concerning 8-bit single byte characters

Mar 1, 1998     PL1 released

Appendix:

[Here is a good document from Pavel Behal explaining how to use
WIN1250 on Windows/ODBC. Please note that Installation step 1) is not
necessary in 6.5.1 -- Tatsuo]

Version:    0.91 for PgSQL 6.5
Author:     Pavel Behal
Revised by: Tatsuo Ishii
Email:      behal@opf.slu.cz
Licence:    The same as PostgreSQL

Sorry for my English and C code, I'm not native :-)

!!!!!!!!!!!!!!!!!!!!!!!!! NO WARRANTY !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Installation:
-------------
1) Change three affected files in source directories
   (I don't have time to create proper patch diffs, I don't know how)
2) Compile with locale enabled and multibyte set to LATIN2
3) Set up your installation properly; do not forget to set the locale
   variables in your profile (environment). Ex. (may not be exactly true):
        LC_ALL=cs_CZ.ISO8859-2
        LC_COLLATE=cs_CZ.ISO8859-2
        LC_CTYPE=cs_CZ.ISO8859-2
        LC_MONETARY=cs_CZ.ISO8859-2
        LC_NUMERIC=cs_CZ.ISO8859-2
        LC_TIME=cs_CZ.ISO8859-2
4) You have to start the postmaster with locales set!
5) Try it with the Czech language; it has to sort correctly
6) Install the ODBC driver for PgSQL on your M$ Windows
7) Set up your data source properly. Include this line in the
   "Connect Settings:" field of your ODBC configuration dialog:
        SET CLIENT_ENCODING = 'WIN1250';
8) Now try it again, but in Windows with ODBC.

Description:
------------
- Depends on proper system locales; tested with RH6.0 and Slackware 3.6,
  with the cs_CZ.iso8859-2 locale
- Never try to set the server multibyte database encoding to WIN1250;
  always use LATIN2 instead. There is no WIN1250 locale in Unix
- WIN1250 encoding is usable only for M$W ODBC clients. The characters are
  recoded on the fly, to be displayed and stored back properly

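The on-the-fly recoding between a WIN1250 client and a LATIN2 database
can be illustrated with the Python sketch below. It relies on Python's
"cp1250" and "iso8859_2" codecs rather than on the conversion code in
the backend, and the function name recode is made up for this example.

```python
def recode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode a byte string from one single-byte encoding to another."""
    return data.decode(source).encode(target)


# A Czech word with several accented letters, written with escapes so
# that this file stays plain ASCII, as a WIN1250 client would send it.
czech_word = "\u017elu\u0165ou\u010dk\u00fd"
win1250_from_client = czech_word.encode("cp1250")

# Client -> server: stored in the LATIN2 database.
latin2_in_database = recode(win1250_from_client, "cp1250", "iso8859_2")

# Server -> client: converted back for display in Windows.
win1250_to_client = recode(latin2_in_database, "iso8859_2", "cp1250")

if __name__ == "__main__":
    print("WIN1250:", win1250_from_client.hex())
    print("LATIN2: ", latin2_in_database.hex())
```

Some accented letters sit at different byte values in the two
encodings, so storing WIN1250 bytes unchanged in a LATIN2 database
would break both display and sorting.
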
Important:
----------
- it reorders your sort order depending on your LC_... setting, so don't be
  confused by the regression tests; they don't use locale
- "ch" is correctly sorted only in some newer locales (Ex. RH6.0)
- you have to insert money as '162,50' (with comma, in apostrophes!)
- not tested properly