glibc/localedata/charmaps
Jules Bertholet 25c9c3789e localedata: Fix several issues with the set of characters considered 0-width [BZ #31370]
= `Default_Ignorable_Code_Point`s should have width 0 =

Unicode specifies (https://www.unicode.org/faq/unsup_char.html#3) that characters
with the `Default_Ignorable_Code_Point` property

> should be rendered as completely invisible (and non advancing, i.e. “zero width”),
> if not explicitly supported in rendering.

Hence, `wcwidth()` should give them all a width of 0, with two exceptions:

- the soft hyphen (U+00AD SOFT HYPHEN) is assigned width 1 by longstanding precedent
- U+115F HANGUL CHOSEONG FILLER needs a carveout
  due to the unique behavior of the conjoining Korean jamo characters.
  One composed Hangul "syllable block" like 퓛
  is made up of two to three individual component characters, or "jamo".
  These are all assigned an `East_Asian_Width` of `Wide`
  by Unicode, which would normally mean they would all be assigned
  width 2 by glibc; a combination of (leading choseong jamo) +
  (medial jungseong jamo) + (trailing jongseong jamo) would then have width 2 + 2 + 2 = 6.
  However, glibc (and other wcwidth implementations) special-cases jungseong and jongseong,
  assigning them all width 0,
  to ensure that the complete block has width 2 + 0 + 0 = 2 as it should.
  U+115F is meant for use in syllable blocks
  that are intentionally missing a leading jamo;
  it must be assigned a width of 2 even though it has no visible display
  to ensure that the complete block has width 2.
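
The width arithmetic above can be sketched in a few lines of Python.
This is an illustration only, not glibc's implementation: the names
`jamo_width` and `block_width` are made up for this example, and only the
main Hangul Jamo block (U+1100..U+11FF) is covered, which is an assumption
made to keep the sketch short.

```python
import unicodedata

# Conjoining jamo ranges from the main Hangul Jamo block (U+1100..U+11FF).
# The extended jamo blocks are deliberately left out of this sketch.
CHOSEONG = range(0x1100, 0x1160)   # leading jamo, ends with the U+115F filler
JUNGSEONG = range(0x1160, 0x11A8)  # medial jamo
JONGSEONG = range(0x11A8, 0x1200)  # trailing jamo


def jamo_width(cp: int) -> int:
    """Width of one conjoining jamo under the special-casing described above."""
    if cp in JUNGSEONG or cp in JONGSEONG:
        return 0  # medial and trailing jamo add nothing to the block
    if cp in CHOSEONG:
        return 2  # Wide; U+115F keeps this width although it is invisible
    raise ValueError(f"U+{cp:04X} is outside the Hangul Jamo block")


def block_width(jamo: str) -> int:
    """Sum the per-character widths of a decomposed syllable block."""
    return sum(jamo_width(ord(ch)) for ch in jamo)


# NFD splits the precomposed syllable U+D4DB into its three jamo.
decomposed = unicodedata.normalize("NFD", "\uD4DB")
print(len(decomposed), block_width(decomposed))   # 3 jamo, width 2 + 0 + 0

# A block with no real leading jamo starts with U+115F instead.
print(block_width("\u115F\u1161\u11A8"))          # still 2 + 0 + 0
```

If U+115F were given width 0 like the other ignorable characters, the second
block would come out at width 0, which is the reason for the carveout.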

However, `wcwidth()` currently (before this patch)
incorrectly assigns non-zero width to
U+3164 HANGUL FILLER and U+FFA0 HALFWIDTH HANGUL FILLER;
this commit fixes that.
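
As a rough sketch of the rule described in this section (not the code in
glibc's generator scripts; `default_ignorable_width` and `EXCEPTIONS` are
hypothetical names), the width of a character already known to have the
`Default_Ignorable_Code_Point` property could be expressed as:

```python
SOFT_HYPHEN = 0x00AD
HANGUL_CHOSEONG_FILLER = 0x115F

# The two carveouts listed above; every other default-ignorable is 0-width.
EXCEPTIONS = {
    SOFT_HYPHEN: 1,             # longstanding precedent
    HANGUL_CHOSEONG_FILLER: 2,  # stands in for a leading jamo
}


def default_ignorable_width(cp: int) -> int:
    """Width for a code point that is known to be Default_Ignorable.

    The caller is assumed to have checked the property already; this
    function only applies the exceptions.
    """
    return EXCEPTIONS.get(cp, 0)


for cp in (0x00AD, 0x115F, 0x3164, 0xFFA0):
    print(f"U+{cp:04X} -> {default_ignorable_width(cp)}")
```

Under this rule U+3164 and U+FFA0 both get width 0, which is the behavior
this commit switches to.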

Unicode spec references:
- Hangul: §3.12 https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G24646 and
  §18.6 https://www.unicode.org/versions/Unicode15.0.0/ch18.pdf#G31028
- `Default_Ignorable_Code_Point`: §5.21 https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G40095.

= Non-`Default_Ignorable_Code_Point` format controls should be visible =

The Unicode Standard, §5.21 - Characters Ignored for Display
(https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G40095)
says the following:

> A small number of format characters (General_Category = Cf )
> are also not given the Default_Ignorable_Code_Point property.
> This may surprise implementers, who often assume
> that all format characters are generally ignored in fallback display.
> The exact list of these exceptional format characters
> can be found in the Unicode Character Database.
> There are, however, three important sets of such format characters to note:
>
> - prepended concatenation marks
> - interlinear annotation characters
> - Egyptian hieroglyph format controls
>
> The prepended concatenation marks always have a visible display.
> See “Prepended Concatenation Marks” in [*Section 23.2, Layout Controls*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.35858.HeadingBreak.132.Layout.Controls)
> for more discussion of the use and display of these signs.
>
> The other two notable sets of format characters that exceptionally are not ignored
> in fallback display consist of the interlinear annotation characters,
> U+FFF9 INTERLINEAR ANNOTATION ANCHOR through
> U+FFFB INTERLINEAR ANNOTATION TERMINATOR,
> and the Egyptian hieroglyph format controls,
> U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER through
> U+1343F EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE.
> These characters should have a visible glyph display for fallback rendering,
> because if they are not displayed,
> it is too easy to misread the resulting displayed text.
> See “Annotation Characters” in [*Section 23.8, Specials*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.21335.Heading.133.Specials),
> as well as [*Section 11.4, Egyptian Hieroglyphs*](https://www.unicode.org/versions/Unicode15.1.0/ch11.pdf#M9.73291.Heading.1418.Egyptian.Hieroglyphs)
> for more discussion of the use and display of these characters.

glibc currently correctly assigns non-zero width to the prepended concatenation marks,
but it incorrectly gives zero width to the interlinear annotation characters
(which a generic terminal cannot interpret)
and the Egyptian hieroglyph format controls
(which are not widely supported in rendering implementations at present).
This commit fixes both these issues as well.
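
A minimal sketch of the distinction, using only the two ranges spelled out
in the quotation above. The function name is hypothetical, and the prepended
concatenation marks are omitted because the quoted text does not list their
code points.

```python
# Format controls (General_Category = Cf) that are NOT default-ignorable,
# limited to the two ranges named explicitly in the Unicode text above.
INTERLINEAR_ANNOTATION = range(0xFFF9, 0xFFFC)       # U+FFF9..U+FFFB
EGYPTIAN_FORMAT_CONTROLS = range(0x13430, 0x13440)   # U+13430..U+1343F


def is_visible_format_control(cp: int) -> bool:
    """True if cp is in one of the two ranges that need a visible fallback."""
    return cp in INTERLINEAR_ANNOTATION or cp in EGYPTIAN_FORMAT_CONTROLS


for cp in (0xFFF9, 0xFFFB, 0x13430, 0x1343F, 0x200B):
    print(f"U+{cp:04X} visible={is_visible_format_control(cp)}")
```

U+200B ZERO WIDTH SPACE is included as a contrast: it is also a format
character, but it is default-ignorable and so stays at width 0.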

= Derive Hangul syllable type from Unicode data =

Previously, the jungseong and jongseong jamo ranges
were hard-coded into the script. With this commit, they are instead parsed
from the HangulSyllableType.txt data file published by Unicode.
This does not affect the end result.
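
For illustration, parsing a file in the HangulSyllableType.txt layout might
look like the sketch below. This is not the code added by the commit: the
function name is invented, and `SAMPLE` is a hand-written excerpt in the
file's `range ; type # comment` format rather than the real data file.

```python
# Hand-written excerpt in the layout of HangulSyllableType.txt.
SAMPLE = """\
# comment lines and blank lines are skipped

1100..115F    ; L # leading jamo
1160..11A7    ; V # medial jamo
11A8..11FF    ; T # trailing jamo
AC00          ; LV # a single precomposed syllable
"""


def parse_syllable_types(text: str) -> dict[str, list[tuple[int, int]]]:
    """Map each syllable type to a list of inclusive (first, last) ranges."""
    ranges: dict[str, list[tuple[int, int]]] = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()   # drop trailing comment
        if not data:
            continue
        code_points, syllable_type = (f.strip() for f in data.split(";"))
        first, _, last = code_points.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point entry
        ranges.setdefault(syllable_type, []).append((start, end))
    return ranges


types = parse_syllable_types(SAMPLE)
# Medial (V) and trailing (T) jamo are the ones given width 0.
zero_width_ranges = types["V"] + types["T"]
print([(hex(a), hex(b)) for a, b in zero_width_ranges])
```

Reading the ranges from the data file means they follow future Unicode
updates without edits to the script.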

Signed-off-by: Jules Bertholet <julesbertholet@quoi.xyz>
2024-05-15 14:31:06 +02:00
ANSI_X3.4-1968 locale: Remove obsolete repertoire map references 2015-07-21 03:54:02 -04:00
ANSI_X3.110-1983 locale: Remove obsolete repertoire map references 2015-07-21 03:54:02 -04:00
ARMSCII-8 Update. 2002-04-19 07:49:16 +00:00
ASMO_449 Update. 2000-06-24 07:08:49 +00:00
BIG5 Remove trailing whitespace from localedata. 2013-06-07 14:56:03 +00:00
BIG5-HKSCS Update BIG5-HKSCS charmap to HKSCS-2008 2013-06-11 17:02:59 +02:00
BRF * iconvdata/brf.c: New file. 2007-01-20 00:23:01 +00:00
BS_4730 locale: Remove obsolete repertoire map references 2015-07-21 03:54:02 -04:00
BS_VIEWDATA locale: Remove obsolete repertoire map references 2015-07-21 03:54:02 -04:00
CP737 Update. 2000-07-31 23:14:10 +00:00
CP770 Add support for CP770, CP771, CP772, CP773, and CP774 2011-05-09 23:15:39 -04:00
CP771 Add support for CP770, CP771, CP772, CP773, and CP774 2011-05-09 23:15:39 -04:00
CP772 Add support for CP770, CP771, CP772, CP773, and CP774 2011-05-09 23:15:39 -04:00
CP773 Add support for CP770, CP771, CP772, CP773, and CP774 2011-05-09 23:15:39 -04:00
CP774 Add support for CP770, CP771, CP772, CP773, and CP774 2011-05-09 23:15:39 -04:00
CP775 Update. 2000-07-31 23:14:10 +00:00
CP949 * sysdeps/unix/bsd/bsd4.4/freebsd/bits/typesizes.h: New file. 2002-11-03 01:21:26 +00:00
CP1125 2002-12-03 Roland McGrath <roland@redhat.com> 2002-12-03 20:56:06 +00:00
CP1250 locale: Remove obsolete repertoire map references 2015-07-21 03:54:02 -04:00
CP1251 locale: Remove obsolete repertoire map references 2015-07-21 03:54:02 -04:00
CP1252 locale: Remove obsolete repertoire map references 2015-07-21 03:54:02 -04:00
CP1253 locale: Remove obsolete repertoire map references 2015-07-21 03:54:02 -04:00
CP1254 Bug 21399: Fix CP1254 comment for U+00EC 2017-04-19 08:10:35 -04:00
CP1255 locale: Remove obsolete repertoire map references 2015-07-21 03:54:02 -04:00
CP1256 locale: Remove obsolete repertoire map references 2015-07-21 03:54:02 -04:00
CP1257 Update. 2000-06-24 07:08:49 +00:00
CP1258 Update. 2000-06-24 07:08:49 +00:00
CP10007 localedata: change M$ to Microsoft 2016-08-10 00:49:14 +08:00
CSA_Z243.4-1985-1 Update. 2000-06-24 07:08:49 +00:00
CSA_Z243.4-1985-2 Update. 2000-06-24 07:08:49 +00:00
CSA_Z243.4-1985-GR Update. 2000-06-24 07:08:49 +00:00
CSN_369103 locale: Remove obsolete repertoire map references 2015-07-21 03:54:02 -04:00
CWI Update. 2000-06-24 07:08:49 +00:00
DEC-MCS Update. 2000-06-24 07:08:49 +00:00
DIN_66003 Update. 2000-06-24 07:08:49 +00:00
DS_2089 Update. 2000-06-24 07:08:49 +00:00
EBCDIC-AT-DE Update. 2000-06-24 07:08:49 +00:00
EBCDIC-AT-DE-A Update. 2000-06-24 07:08:49 +00:00
EBCDIC-CA-FR Update. 2000-06-24 07:08:49 +00:00
EBCDIC-DK-NO locale: Remove obsolete repertoire map references 2015-07-21 03:54:02 -04:00
EBCDIC-DK-NO-A Update. 2000-06-24 07:08:49 +00:00
EBCDIC-ES Update. 2000-06-24 07:08:49 +00:00
EBCDIC-ES-A Update. 2000-06-24 07:08:49 +00:00
EBCDIC-ES-S Update. 2000-06-24 07:08:49 +00:00
EBCDIC-FI-SE Update. 2000-06-24 07:08:49 +00:00
EBCDIC-FI-SE-A Update. 2000-06-24 07:08:49 +00:00
EBCDIC-FR Update. 2000-06-24 07:08:49 +00:00
EBCDIC-IS-FRISS Update. 2000-06-24 07:08:49 +00:00
EBCDIC-IT Update. 2000-06-24 07:08:49 +00:00
EBCDIC-PT Update. 2000-06-24 07:08:49 +00:00
EBCDIC-UK Update. 2000-06-24 07:08:49 +00:00
EBCDIC-US Update. 2000-06-24 07:08:49 +00:00
ECMA-CYRILLIC Update. 2000-06-24 07:08:49 +00:00
ES Update. 2000-06-24 07:08:49 +00:00
ES2 Update. 2000-06-24 07:08:49 +00:00
EUC-JISX0213 Update. 2004-08-02 22:33:57 +00:00
EUC-JP * sysdeps/unix/bsd/bsd4.4/freebsd/bits/typesizes.h: New file. 2002-11-03 01:21:26 +00:00
EUC-JP-MS Update. 2003-08-14 21:48:29 +00:00
EUC-KR [BZ #3954] 2007-02-16 01:25:28 +00:00
EUC-TW Update. 2002-04-20 07:58:48 +00:00
GB2312 Update. 2000-09-14 20:43:43 +00:00
GB18030 add GB18030-2022 charmap and test the entire GB18030 charmap [BZ #30243] 2023-08-29 19:02:30 +02:00
GB_1988-80 locale: Remove obsolete repertoire map references 2015-07-21 03:54:02 -04:00
GBK localedata: GBK: add mapping for 0x80->Euro sign [BZ #20864] 2016-11-26 17:20:22 -05:00
GEORGIAN-ACADEMY Update. 2001-08-04 08:05:56 +00:00
GEORGIAN-PS Update. 2001-08-04 08:05:56 +00:00
GOST_19768-74 Update. 2000-06-24 07:08:49 +00:00
GREEK7 locale: Remove obsolete repertoire map references 2015-07-21 03:54:02 -04:00
GREEK7-OLD Update. 2000-06-24 07:08:49 +00:00
GREEK-CCITT locale: Remove obsolete repertoire map references 2015-07-21 03:54:02 -04:00
HP-GREEK8 [BZ #5465] 2007-12-12 08:06:15 +00:00
HP-ROMAN8 Update. 2000-06-24 07:08:49 +00:00
HP-ROMAN9 [BZ #5427] 2007-12-08 01:38:23 +00:00
HP-THAI8 [BZ #5465] 2007-12-12 08:06:15 +00:00
HP-TURKISH8 [BZ #5465] 2007-12-12 08:06:15 +00:00
IBM037 Update. 2000-06-24 07:08:49 +00:00
IBM038 Update. 2000-06-24 07:08:49 +00:00
IBM256 localedata: Use U+00AF MACRON in more EBCDIC charsets [BZ #27882] 2021-05-18 07:21:45 +02:00
IBM273 localedata: Make IBM273 compatible with ISO-8859-1 [BZ #23290] 2018-06-14 22:34:10 +02:00
IBM274 Update. 2000-06-24 07:08:49 +00:00
IBM275 Update. 2000-06-24 07:08:49 +00:00
IBM277 localedata: Use U+00AF MACRON in more EBCDIC charsets [BZ #27882] 2021-05-18 07:21:45 +02:00
IBM278 localedata: Use U+00AF MACRON in more EBCDIC charsets [BZ #27882] 2021-05-18 07:21:45 +02:00
IBM280 localedata: Use U+00AF MACRON in more EBCDIC charsets [BZ #27882] 2021-05-18 07:21:45 +02:00
IBM281 Update. 2000-06-24 07:08:49 +00:00
IBM284 localedata: Use U+00AF MACRON in more EBCDIC charsets [BZ #27882] 2021-05-18 07:21:45 +02:00
IBM285 Update. 2000-06-24 07:08:49 +00:00
IBM290 Update. 2000-06-24 07:08:49 +00:00
IBM297 localedata: Use U+00AF MACRON in more EBCDIC charsets [BZ #27882] 2021-05-18 07:21:45 +02:00
IBM420 Update. 2000-06-24 07:08:49 +00:00
IBM423 Update. 2000-06-24 07:08:49 +00:00
IBM424 localedata: Use U+00AF MACRON in more EBCDIC charsets [BZ #27882] 2021-05-18 07:21:45 +02:00
IBM437 Update. 2000-06-24 07:08:49 +00:00
IBM500 Update. 2000-06-24 07:08:49 +00:00
IBM850 Update. 2000-06-24 07:08:49 +00:00
IBM851 Update. 2000-06-24 07:08:49 +00:00
IBM852 Update. 2000-06-24 07:08:49 +00:00
IBM855 Update. 2000-06-24 07:08:49 +00:00
IBM856 Update. 2002-04-19 07:49:16 +00:00
IBM857 Update. 2000-06-24 07:08:49 +00:00
IBM858 Add new codepage charmaps/IBM858 [BZ #21084] 2017-09-14 15:50:57 +02:00
IBM860 Update. 2000-06-24 07:08:49 +00:00
IBM861 Update. 2000-06-24 07:08:49 +00:00
IBM862 Update. 2000-06-24 07:08:49 +00:00
IBM863 Update. 2000-06-24 07:08:49 +00:00
IBM864 Update. 2000-06-24 07:08:49 +00:00
IBM865 Update. 2000-06-24 07:08:49 +00:00
IBM866 Update. 2000-06-24 07:08:49 +00:00
IBM866NAV Remove trailing whitespace from localedata. 2013-06-07 14:56:03 +00:00
IBM868 Update. 2000-06-24 07:08:49 +00:00
IBM869 Update. 2000-06-24 07:08:49 +00:00
IBM870 Add correct mapping for <U017E>, <U017A>, <U017D> and <U0179>. Add missing mappings. 2002-08-29 09:45:29 +00:00
IBM871 Update. 2000-06-24 07:08:49 +00:00
IBM874 Update. 2004-07-20 07:39:10 +00:00
IBM875 charmaps: IBM875: fix mapping of iota/upsilon variants [BZ #18453] 2016-05-07 19:55:55 -04:00
IBM880 Update. 2000-06-24 07:08:49 +00:00
IBM891 Update. 2000-06-24 07:08:49 +00:00
IBM903 Update. 2000-06-24 07:08:49 +00:00
IBM904 Update. 2000-06-24 07:08:49 +00:00
IBM905 Update. 2000-06-24 07:08:49 +00:00
IBM918 Update. 2000-06-24 07:08:49 +00:00
IBM922 Update. 2002-04-19 07:49:16 +00:00
IBM1004 Update. 2000-06-24 07:08:49 +00:00
IBM1026 Update. 2000-06-24 07:08:49 +00:00
IBM1047 Update. 2003-06-16 17:18:06 +00:00
IBM1124 Update. 2002-04-19 07:49:16 +00:00
IBM1129 Update. 2002-04-19 07:49:16 +00:00
IBM1132 Update. 2002-04-19 07:49:16 +00:00
IBM1133 Update. 2002-04-19 07:49:16 +00:00
IBM1160 Update. 2002-04-19 07:49:16 +00:00
IBM1161 Update. 2002-04-19 07:49:16 +00:00
IBM1162 Update. 2002-04-19 07:49:16 +00:00
IBM1163 Update. 2002-04-19 07:49:16 +00:00
IBM1164 Update. 2002-04-19 07:49:16 +00:00
IEC_P27-1 Update. 2000-06-24 07:08:49 +00:00
INIS Update. 2000-06-24 07:08:49 +00:00
INIS-8 locale: Remove obsolete repertoire map references 2015-07-21 03:54:02 -04:00
INIS-CYRILLIC Update. 2000-06-24 07:08:49 +00:00
INVARIANT Update. 2000-06-24 07:08:49 +00:00
ISIRI-3342 locale: Remove obsolete repertoire map references 2015-07-21 03:54:02 -04:00
ISO_646.BASIC Update. 2000-06-24 07:08:49 +00:00
ISO_646.IRV Update. 2000-06-24 07:08:49 +00:00
ISO_2033-1983 locale: Remove obsolete repertoire map references 2015-07-21 03:54:02 -04:00
ISO_5427 Update. 2000-06-24 07:08:49 +00:00
ISO_5427-EXT locale: Remove obsolete repertoire map references 2015-07-21 03:54:02 -04:00
ISO_5428 2002-10-18 Bruno Haible <bruno@clisp.org> 2002-11-04 19:22:53 +00:00
ISO_6937 Update. 2000-06-24 07:08:49 +00:00
ISO_6937-2-25 Update. 2000-06-24 07:08:49 +00:00
ISO_6937-2-ADD Update. 2000-06-24 07:08:49 +00:00
ISO_8859-1,GL Remove trailing whitespace from localedata. 2013-06-07 14:56:03 +00:00
ISO_8859-SUPP Update. 2000-06-24 07:08:49 +00:00
ISO_10367-BOX Update. 2000-06-24 07:08:49 +00:00
ISO_10646 Update. 1999-08-15 21:14:40 +00:00
ISO_11548-1 * iconvdata/Makefile (modules): Add ISO_11548-1. 2005-11-16 08:16:43 +00:00
ISO-8859-1 Update. 2000-06-24 18:53:49 +00:00
ISO-8859-2 Update. 2000-06-24 07:08:49 +00:00
ISO-8859-3 Update. 2000-06-24 07:08:49 +00:00
ISO-8859-4 Update. 2000-06-24 07:08:49 +00:00
ISO-8859-5 Update. 2000-06-24 07:08:49 +00:00
ISO-8859-6 Update. 2000-10-01 19:54:15 +00:00
ISO-8859-7 Update. 2004-08-02 22:33:57 +00:00
ISO-8859-8 Update. 2000-10-01 19:54:15 +00:00
ISO-8859-9 Update. 2000-06-24 07:08:49 +00:00
ISO-8859-9E * iconvdata/Makefile (modules): Add ISO8859-9E. 2007-09-30 04:00:26 +00:00
ISO-8859-10 Update. 2000-06-24 07:08:49 +00:00
ISO-8859-11 Update. 2004-08-02 22:33:57 +00:00
ISO-8859-13 Update. 2000-06-24 07:08:49 +00:00
ISO-8859-14 Update. 2000-06-24 07:08:49 +00:00
ISO-8859-15 Update. 2003-06-16 17:18:06 +00:00
ISO-8859-16 Update. 2003-06-16 17:18:06 +00:00
ISO-IR-90 Update. 2000-06-24 07:08:49 +00:00
ISO-IR-197 Update. 2000-10-01 18:59:42 +00:00
ISO-IR-209 Fix the Unicode values for LETTER H WITH CARON. 2002-04-24 20:14:49 +00:00
IT Update. 2000-06-24 07:08:49 +00:00
JIS_C6220-1969-JP
JIS_C6220-1969-RO Update. 2000-06-24 07:08:49 +00:00
JIS_C6229-1984-A
JIS_C6229-1984-B Update. 2000-06-24 07:08:49 +00:00
JIS_C6229-1984-B-ADD
JIS_C6229-1984-HAND
JIS_C6229-1984-HAND-ADD
JIS_C6229-1984-KANA
JIS_X0201 Update. 2000-06-24 07:08:49 +00:00
JOHAB [BZ #3954] 2007-02-16 01:25:28 +00:00
JUS_I.B1.002 Update. 2000-06-24 07:08:49 +00:00
JUS_I.B1.003-MAC Update. 2000-06-24 07:08:49 +00:00
JUS_I.B1.003-SERB Update. 2000-06-24 07:08:49 +00:00
KOI8-R Update. 2000-06-24 07:08:49 +00:00
KOI8-RU * iconvdata/Makefile (modules): Add KOI8-RU. 2007-09-29 04:35:54 +00:00
KOI8-T * sysdeps/unix/bsd/bsd4.4/freebsd/bits/typesizes.h: New file. 2002-11-03 01:21:26 +00:00
KOI8-U * sysdeps/unix/bsd/bsd4.4/freebsd/bits/typesizes.h: New file. 2002-11-03 01:21:26 +00:00
KOI-8 Update. 2000-06-24 07:08:49 +00:00
KSC5636 Update. 2000-06-24 07:08:49 +00:00
LATIN-GREEK Update. 2000-06-24 07:08:49 +00:00
LATIN-GREEK-1 Update. 2000-06-24 07:08:49 +00:00
MAC-CENTRALEUROPE [BZ #4972] 2007-09-23 17:21:22 +00:00
MAC-CYRILLIC Update. 2001-05-24 08:50:10 +00:00
MAC-IS Update. 2000-06-24 07:08:49 +00:00
MAC-SAMI * sysdeps/unix/bsd/bsd4.4/freebsd/bits/typesizes.h: New file. 2002-11-03 01:21:26 +00:00
MAC-UK Update. 2000-06-24 07:08:49 +00:00
MACINTOSH Update. 2002-01-18 06:26:02 +00:00
MIK [BZ #2625] 2006-05-01 14:11:50 +00:00
MSZ_7795.3 Update. 2000-06-24 07:08:49 +00:00
NATS-DANO Update. 2000-06-24 07:08:49 +00:00
NATS-DANO-ADD
NATS-SEFI Update. 2000-06-24 07:08:49 +00:00
NATS-SEFI-ADD
NC_NC00-10 Update. 2000-06-24 07:08:49 +00:00
NEXTSTEP Update. 2000-06-24 07:08:49 +00:00
NF_Z_62-010 Update. 2000-06-24 07:08:49 +00:00
NF_Z_62-010_1973 Update. 2000-06-24 07:08:49 +00:00
NS_4551-1 Update. 2000-06-24 07:08:49 +00:00
NS_4551-2 Update. 2000-06-24 07:08:49 +00:00
PT Update. 2000-06-24 07:08:49 +00:00
PT2 Update. 2000-06-24 07:08:49 +00:00
PT154 Update. 2004-04-17 22:34:18 +00:00
RK1048 Update. 2004-04-17 22:34:18 +00:00
SAMI Update. 2000-06-24 07:08:49 +00:00
SAMI-WS2 Update. 2001-10-01 00:14:14 +00:00
SEN_850200_B Update. 2000-06-24 07:08:49 +00:00
SEN_850200_C Update. 2000-06-24 07:08:49 +00:00
SHIFT_JIS Update. 2001-06-17 22:44:18 +00:00
SHIFT_JISX0213 Update. 2004-08-02 22:33:57 +00:00
T.61-7BIT Update. 2000-06-24 07:08:49 +00:00
T.61-8BIT Update. 2000-06-24 07:08:49 +00:00
T.101-G2 Update. 2000-06-24 07:08:49 +00:00
TCVN5712-1 Update. 2001-12-10 01:37:56 +00:00
TIS-620 * sysdeps/unix/bsd/bsd4.4/freebsd/bits/typesizes.h: New file. 2002-11-03 01:21:26 +00:00
TSCII Update. 2002-09-24 04:19:03 +00:00
UTF-8 localedata: Fix several issues with the set of characters considered 0-width [BZ #31370] 2024-05-15 14:31:06 +02:00
VIDEOTEX-SUPPL Update. 2000-06-24 07:08:49 +00:00
VISCII Update. 2000-10-27 21:07:08 +00:00
WINDOWS-31J * sunrpc/xdr_stdio.c (xdrstdio_getlong, xdrstdio_putlong): 2005-03-22 09:14:36 +00:00