Handle non-ASCII identifiers in Ada
Ada allows non-ASCII identifiers, and GNAT supports several such
encodings. This patch adds the corresponding support to gdb.
GNAT encodes non-ASCII characters using special symbol names.
For character sets like Latin-1, where all characters are a single
byte, it uses a "U" followed by the hex for the character. So, for
example, thorn would be encoded as "Ufe" (0xFE being lower case
thorn).
For wider characters, despite what the manual says (it claims
Shift-JIS and EUC can be used), in practice recent versions only
support Unicode. Here, characters in the base plane are represented
using "Wxxxx" and characters outside the base plane using
"WWxxxxxxxx".
GNAT has some further quirks here. Ada is case-insensitive, and GNAT
emits symbols that have been case-folded. For characters in ASCII,
and for all characters in non-Unicode character sets, lower case is
used. For Unicode, however, characters that fit in a single byte are
converted to lower case, but all others are converted to upper case.
Furthermore, there is a bug in GNAT where two symbols that differ only
in the case of "Y WITH DIAERESIS" (and potentially others, I did not
check exhaustively) can be used in one program. I chose to omit
handling this case from gdb, on the theory that it is hard to figure
out the logic, and anyway if the bug is ever fixed, we'll regret
having a heuristic.
This patch introduces a new "ada source-charset" setting. It defaults
to Latin-1, as that is GNAT's default. This setting controls how "U"
characters are decoded -- W/WW are always handled as UTF-32.
The ada_tag_name_from_tsd change is needed because this function will
read memory from the inferior and interpret it -- and this caused an
encoding failure on PPC when running a test that tries to read
uninitialized memory.
This patch implements its own UTF-32-based case folder. This avoids
host platform quirks, and is relatively simple. A short Python
program to generate the case-folding table is included. It simply
relies on whatever version of Unicode is used by the host Python,
which seems basically acceptable.
Test cases for UTF-8, Latin-1, and Latin-3 are included. This
exercises most of the new code paths, aside from Y WITH DIAERESIS as
noted above.
2022-02-04 01:42:07 +08:00
|
|
|
#!/usr/bin/env python3
|
|
|
|
|
|
|
|
# Generate Unicode case-folding table for Ada.
|
|
|
|
|
2024-01-12 23:30:44 +08:00
|
|
|
# Copyright (C) 2022-2024 Free Software Foundation, Inc.
|
Handle non-ASCII identifiers in Ada
Ada allows non-ASCII identifiers, and GNAT supports several such
encodings. This patch adds the corresponding support to gdb.
GNAT encodes non-ASCII characters using special symbol names.
For character sets like Latin-1, where all characters are a single
byte, it uses a "U" followed by the hex for the character. So, for
example, thorn would be encoded as "Ufe" (0xFE being lower case
thorn).
For wider characters, despite what the manual says (it claims
Shift-JIS and EUC can be used), in practice recent versions only
support Unicode. Here, characters in the base plane are represented
using "Wxxxx" and characters outside the base plane using
"WWxxxxxxxx".
GNAT has some further quirks here. Ada is case-insensitive, and GNAT
emits symbols that have been case-folded. For characters in ASCII,
and for all characters in non-Unicode character sets, lower case is
used. For Unicode, however, characters that fit in a single byte are
converted to lower case, but all others are converted to upper case.
Furthermore, there is a bug in GNAT where two symbols that differ only
in the case of "Y WITH DIAERESIS" (and potentially others, I did not
check exhaustively) can be used in one program. I chose to omit
handling this case from gdb, on the theory that it is hard to figure
out the logic, and anyway if the bug is ever fixed, we'll regret
having a heuristic.
This patch introduces a new "ada source-charset" setting. It defaults
to Latin-1, as that is GNAT's default. This setting controls how "U"
characters are decoded -- W/WW are always handled as UTF-32.
The ada_tag_name_from_tsd change is needed because this function will
read memory from the inferior and interpret it -- and this caused an
encoding failure on PPC when running a test that tries to read
uninitialized memory.
This patch implements its own UTF-32-based case folder. This avoids
host platform quirks, and is relatively simple. A short Python
program to generate the case-folding table is included. It simply
relies on whatever version of Unicode is used by the host Python,
which seems basically acceptable.
Test cases for UTF-8, Latin-1, and Latin-3 are included. This
exercises most of the new code paths, aside from Y WITH DIAERESIS as
noted above.
2022-02-04 01:42:07 +08:00
|
|
|
|
|
|
|
# This file is part of GDB.
|
|
|
|
|
|
|
|
# This program is free software; you can redistribute it and/or modify
|
|
|
|
# it under the terms of the GNU General Public License as published by
|
|
|
|
# the Free Software Foundation; either version 3 of the License, or
|
|
|
|
# (at your option) any later version.
|
|
|
|
|
|
|
|
# This program is distributed in the hope that it will be useful,
|
|
|
|
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
|
|
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
|
|
|
# GNU General Public License for more details.
|
|
|
|
|
|
|
|
# You should have received a copy of the GNU General Public License
|
|
|
|
# along with this program. If not, see <http://www.gnu.org/licenses/>.
|
|
|
|
|
|
|
|
# This generates the ada-casefold.h header.
|
|
|
|
# Usage:
|
|
|
|
# python ada-unicode.py
|
|
|
|
|
|
|
|
import gdbcopyright
|
|
|
|
|
|
|
|
# The start of the current range of case-conversions we are
|
|
|
|
# processing. If RANGE_START is None, then we're outside of a range.
|
|
|
|
range_start = None
|
|
|
|
# End of the current range.
|
|
|
|
range_end = None
|
|
|
|
# The delta between RANGE_START and the upper-case variant of that
|
|
|
|
# character.
|
|
|
|
upper_delta = None
|
|
|
|
# The delta between RANGE_START and the lower-case variant of that
|
|
|
|
# character.
|
|
|
|
lower_delta = None
|
|
|
|
|
|
|
|
# All the ranges found and completed so far.
|
|
|
|
# Each entry is a tuple of the form (START, END, UPPER_DELTA, LOWER_DELTA).
|
|
|
|
all_ranges = []
|
|
|
|
|
|
|
|
|
|
|
|
def finish_range():
|
|
|
|
global range_start
|
|
|
|
global range_end
|
|
|
|
global upper_delta
|
|
|
|
global lower_delta
|
|
|
|
if range_start is not None:
|
|
|
|
all_ranges.append((range_start, range_end, upper_delta, lower_delta))
|
|
|
|
range_start = None
|
|
|
|
range_end = None
|
|
|
|
upper_delta = None
|
|
|
|
lower_delta = None
|
|
|
|
|
|
|
|
|
|
|
|
def process_codepoint(val):
|
|
|
|
global range_start
|
|
|
|
global range_end
|
|
|
|
global upper_delta
|
|
|
|
global lower_delta
|
|
|
|
c = chr(val)
|
|
|
|
low = c.lower()
|
|
|
|
up = c.upper()
|
|
|
|
# U+00DF ("LATIN SMALL LETTER SHARP S", aka eszsett) traditionally
|
|
|
|
# upper-cases to the two-character string "SS" (the capital form
|
|
|
|
# is a relatively recent addition -- 2017). Our simple scheme
|
|
|
|
# can't handle this, so we skip it. Also, because our approach
|
|
|
|
# just represents runs of characters with identical folding
|
|
|
|
# deltas, this change must terminate the current run.
|
|
|
|
if (c == low and c == up) or len(low) != 1 or len(up) != 1:
|
|
|
|
finish_range()
|
|
|
|
return
|
|
|
|
updelta = ord(up) - val
|
|
|
|
lowdelta = ord(low) - val
|
|
|
|
if range_start is not None and (updelta != upper_delta or lowdelta != lower_delta):
|
|
|
|
finish_range()
|
|
|
|
if range_start is None:
|
|
|
|
range_start = val
|
|
|
|
upper_delta = updelta
|
|
|
|
lower_delta = lowdelta
|
|
|
|
range_end = val
|
|
|
|
|
|
|
|
|
|
|
|
for c in range(0, 0x10FFFF):
|
|
|
|
process_codepoint(c)
|
|
|
|
|
|
|
|
with open("ada-casefold.h", "w") as f:
|
|
|
|
print(
|
|
|
|
gdbcopyright.copyright("ada-unicode.py", "UTF-32 case-folding for GDB"),
|
|
|
|
file=f,
|
|
|
|
)
|
2023-07-28 01:32:38 +08:00
|
|
|
print("", file=f)
|
Handle non-ASCII identifiers in Ada
Ada allows non-ASCII identifiers, and GNAT supports several such
encodings. This patch adds the corresponding support to gdb.
GNAT encodes non-ASCII characters using special symbol names.
For character sets like Latin-1, where all characters are a single
byte, it uses a "U" followed by the hex for the character. So, for
example, thorn would be encoded as "Ufe" (0xFE being lower case
thorn).
For wider characters, despite what the manual says (it claims
Shift-JIS and EUC can be used), in practice recent versions only
support Unicode. Here, characters in the base plane are represented
using "Wxxxx" and characters outside the base plane using
"WWxxxxxxxx".
GNAT has some further quirks here. Ada is case-insensitive, and GNAT
emits symbols that have been case-folded. For characters in ASCII,
and for all characters in non-Unicode character sets, lower case is
used. For Unicode, however, characters that fit in a single byte are
converted to lower case, but all others are converted to upper case.
Furthermore, there is a bug in GNAT where two symbols that differ only
in the case of "Y WITH DIAERESIS" (and potentially others, I did not
check exhaustively) can be used in one program. I chose to omit
handling this case from gdb, on the theory that it is hard to figure
out the logic, and anyway if the bug is ever fixed, we'll regret
having a heuristic.
This patch introduces a new "ada source-charset" setting. It defaults
to Latin-1, as that is GNAT's default. This setting controls how "U"
characters are decoded -- W/WW are always handled as UTF-32.
The ada_tag_name_from_tsd change is needed because this function will
read memory from the inferior and interpret it -- and this caused an
encoding failure on PPC when running a test that tries to read
uninitialized memory.
This patch implements its own UTF-32-based case folder. This avoids
host platform quirks, and is relatively simple. A short Python
program to generate the case-folding table is included. It simply
relies on whatever version of Unicode is used by the host Python,
which seems basically acceptable.
Test cases for UTF-8, Latin-1, and Latin-3 are included. This
exercises most of the new code paths, aside from Y WITH DIAERESIS as
noted above.
2022-02-04 01:42:07 +08:00
|
|
|
for r in all_ranges:
|
|
|
|
print(f" {{{r[0]}, {r[1]}, {r[2]}, {r[3]}}},", file=f)
|