Previously, the initialization code reused the xsave_state_full_size
member of struct cpu_features for the TLSDESC state size. However,
the tunable processing code assumes that this member holds the
original XSAVE (non-compact) state size, so that it can use its
value if XSAVEC is disabled via a tunable.
This change uses a separate variable and not a struct member because
the value is only needed in ld.so and the static libc, but not in
libc.so. As a result, struct cpu_features layout does not change,
helping a future backport of this change.
Fixes commit 9b7091415af47082664717210ac49d51551456ab ("x86-64:
Update _dl_tlsdesc_dynamic to preserve AMX registers").
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 145097dff170507fe73190e8e41194f5b5f7e6bf)
If we have to use XSAVE or XSAVEC trampolines, do not adjust the size
information they need. Technically, it is an operator error to try to
run with -XSAVE,-XSAVEC on such builds, but this change disables some
unnecessary code at higher ISA levels and simplifies testing.
Related to commit befe2d3c4dec8be2cdd01a47132e47bdb7020922
("x86-64: Don't use SSE resolvers for ISA level 3 or above").
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 59585ddaa2d44f22af04bb4b8bd4ad1e302c4c02)
On SPR, it improves atanh bench performance by:
                       Before    After     Improvement
reciprocal-throughput  15.1715   14.8628   2%
latency                57.1941   56.1883   2%
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c7c4a5906f326f1290b1c2413a83c530564ec4b8)
On SPR, it improves sinh bench performance by:
                       Before    After     Improvement
reciprocal-throughput  14.2017   11.815    17%
latency                36.4917   35.2114   4%
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit dded0d20f67ba1925ccbcb9cf28f0c75febe0dbe)
On Skylake, it improves tanh bench performance by:
      Before    After     Improvement
max   110.89    95.826    14%
min   20.966    20.157    4%
mean  30.9601   29.8431   4%
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c6352111c72a20b3588ae304dd99b63e25dd6d85)
The SIGCANCEL signal handler should not issue __syscall_do_cancel,
which calls __do_cancel and __pthread_unwind, if the cancellation
is already in process (libgcc's unwinder is not reentrant). Any
cancellation signal received afterwards is ignored.
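A minimal standalone sketch of the guard (names are stand-ins for
glibc internals such as EXITING_BITMASK, not the actual code):

  #include <stdatomic.h>
  #include <stdbool.h>

  #define EXITING_BIT 1u  /* Stand-in for glibc's EXITING_BITMASK.  */

  static _Atomic unsigned int cancelhandling;

  /* Return false when cancellation is already in progress, so the
     signal handler ignores further SIGCANCEL deliveries instead of
     re-entering the (non-reentrant) unwinder.  */
  static bool
  cancel_signal_should_proceed (void)
  {
    unsigned int ch = atomic_load_explicit (&cancelhandling,
                                            memory_order_relaxed);
    return (ch & EXITING_BIT) == 0;
  }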
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Tested-by: Aurelien Jarno <aurelien@aurel32.net>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 360cce0b066f34e85e473c04cdc16e6fa426021b)
The new initializer and struct layout do not initialize the
__g_signals field in the old struct layout before the change in
commit c36fc50781995e6758cae2b6927839d0157f213c ("nptl: Remove
g_refs from condition variables"). Bring back fields at the end
of struct __pthread_cond_s, so that they are again zero-initialized.
Reviewed-by: Sam James <sam@gentoo.org>
(cherry picked from commit dbc5a50d12eff4cb3f782129029d04b8a76f58e7)
Due to the extensible nature of the rseq area we can't explicitly
initialize fields that are not part of the ABI yet. It was agreed with
upstream that all new fields will be documented as zero initialized by
userspace. Future kernels configured with CONFIG_DEBUG_RSEQ will
validate the content of all fields during registration.
Replace the explicit field initialization with a memset of the whole
rseq area, which will cover new fields as they are added in future
kernels.
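A minimal sketch of the approach (the struct is a stand-in for the
kernel's struct rseq, abbreviated; the actual registration code lives
in glibc's rseq internals):

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  struct rseq_area
  {
    uint32_t cpu_id_start;
    uint32_t cpu_id;
    uint64_t rseq_cs;
    uint32_t flags;
    /* Future kernels may append fields here; userspace must leave
       them zero-initialized.  */
  };

  #define RSEQ_CPU_ID_UNINITIALIZED ((uint32_t) -1)

  static void
  init_rseq_area (struct rseq_area *r, size_t alloc_size)
  {
    /* Zero the whole allocation, covering fields glibc does not yet
       know about, then set only the ABI-defined initial value.  */
    memset (r, 0, alloc_size);
    r->cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
  }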
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 689a62a4217fae78b9ce0db781dc2a421f2b1ab4)
Use prefer_sve_ifuncs for SVE memset just like memcpy.
Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
(cherry picked from commit 0f044be1dae5169d0e57f8d487b427863aeadab4)
Add an SVE memset based on the generic memset, with a predicated store
for sizes < 16. Unaligned memsets of 128-1024 bytes are improved by
~20% on average by using aligned stores for the last 64 bytes.
Performance of the random memset benchmark improves by ~2% on
Neoverse V1.
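For illustration, a hedged ACLE sketch of the small-size idea (the
real memset is hand-written assembly; this requires <arm_sve.h> and
compiling with SVE enabled):

  #include <arm_sve.h>
  #include <stddef.h>
  #include <stdint.h>

  /* A while-predicate covers exactly n bytes, so one predicated store
     handles any size below the vector length without a byte loop.  */
  static void
  memset_small_sve (unsigned char *dst, int c, size_t n)
  {
    svbool_t pg = svwhilelt_b8 ((uint64_t) 0, (uint64_t) n);
    svst1_u8 (pg, dst, svdup_n_u8 ((uint8_t) c));
  }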
Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
(cherry picked from commit 163b1bbb76caba4d9673c07940c5930a1afa7548)
GCC aligns global data to 16 bytes if its size is >= 16 bytes. This
patch changes the exp_data struct slightly so that the fields are
better aligned and without gaps. As a result, on targets that support
them, more load-pair instructions are used in exp. Exp10 is improved
by moving invlog10_2N later so that neglog10_2hiN and neglog10_2loN
can be loaded using load-pair.
The exp benchmark improves by 2.5%, "144bits" by 7.2%, and "768bits"
by 12.7% on Neoverse V2. Exp10 improves by 1.5%.
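An illustrative sketch of the layout idea (field names follow the
description above; this is not the verbatim glibc struct):

  struct exp_data_sketch
  {
    double invln2N;
    double negln2hiN;     /* Adjacent hi/lo pair: a single ldp     */
    double negln2loN;     /* (load-pair) can fetch both halves.    */
    double neglog10_2hiN; /* exp10 pair kept adjacent...           */
    double neglog10_2loN;
    double invlog10_2N;   /* ...by moving invlog10_2N after it.    */
    double shift;
  };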
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 5afaf99edb326fd9f36eb306a828d129a3a1d7f7)
Test that when we return from a function that enabled GCS at runtime
we get SIGSEGV. Also test that the ucontext contains a GCS block with
the GCS pointer.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
These tests validate that the GCS tunable works as expected depending
on the GCS markings in the test binaries.
The tests cover both statically and dynamically linked binaries.
These new tests are AArch64 specific. Moreover, they are included only
if the linker supports the "-z gcs=<value>" option. If built, these
tests will run on systems with and without HWCAP_GCS; in the latter
case the tests will be reported as UNSUPPORTED.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reduce the number of MOV/MOVPRFXs and use unpredicated FMUL.
Replace MUL with LSL. Speedup on Neoverse V1: 6%.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
(cherry picked from commit f5ff34cb3c75ec1061c75bb9188b3c1176426947)
Use unpredicated muls and improve memory access.
7%, 3% and 1% improvements in the throughput microbenchmark on
Neoverse V1 for exp, exp2 and cosh respectively.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
(cherry picked from commit c0ff447edf19bd4630fe79adf5e8b896405b059f)
Use unpredicated muls, use lanewise MLAs, and improve memory access.
1% regression in the throughput microbenchmark on Neoverse V1.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
(cherry picked from commit 8f0e7fe61e0a2ad5ed777933703ce09053810ec4)
In some cases, an IFUNC resolver may need the gp pointer to access
global variables. Such an object may have l_relocated == 0 at this
time, in which case the IFUNC resolver will fail to access a global
variable and cause a SIGSEGV.
This patch fixes the issue by relaxing the check of l_relocated in
elf_machine_runtime_setup, but adds a check for the SHARED case to
avoid using this code in statically linked executables. Such objects
have already set up the gp pointer in the load_gp function, and
l->l_scope will be NULL for a PIE object. So if we used this code to
set up the gp pointer again for static-pie, it would cause a SIGSEGV
in glibc, as in the original bug in BZ #31317.
I have also reproduced and checked BZ #31317 using the mold commit
bed5b1731b ("illumos: Treat absolute symbols specially") and confirmed
that this patch fixes the issue.
Also, we previously used the wrong gp pointer, because ref->st_value
is not the relocated address but just the offset from the base address
of the ELF object. An edge case can occur if we reference the gp
pointer in an IFUNC resolver in a PIE object; this will not happen in
compiler-generated code, since -pie disables relaxation to gp. In that
case, gp would be initialized incorrectly, since ref->st_value is not
the address after relocation. This patch fixes the issue by adding
l->l_addr to ref->st_value to get the relocated address for the gp
pointer. We don't use the SYMBOL_ADDRESS macro here because
__global_pointer$ is a special symbol of SHN_ABS type, yet the load_gp
function loads it with PC-relative addressing using lla.
Closes: BZ #32269
Fixes: 96d1b9ac23 ("RISC-V: Fix the static-PIE non-relocated object check")
Co-authored-by: Vivian Wang <dramforever@live.com>
Signed-off-by: Yangyu Chen <cyy@cyyself.name>
(cherry picked from commit 3fd2ff7685e3ee85c8cd2896f28ad62f67d7c483)
A number of fma tests started to fail on hppa when gcc was changed to
use Ranger rather than EVRP. Eventually I found that the value of
a1 + u.d in this block of code was being computed in FE_TOWARDZERO
mode and not the original rounding mode:
  if (TININESS_AFTER_ROUNDING)
    {
      w.d = a1 + u.d;
      if (w.ieee.exponent == 109)
        return w.d * 0x1p-108;
    }
This caused the exponent value to be wrong and the wrong return path
to be used.
Here we add an optimization barrier after the rounding mode is reset
to ensure that the previous value of a1 + u.d is not reused.
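For context, a barrier of this kind can be written as an empty asm
that claims to modify its operand, so the compiler must treat the
value as changed and cannot reuse a result computed under the old
rounding mode. A minimal sketch in the spirit of glibc's generic
math_opt_barrier macro:

  /* The "+m" constraint tells the compiler that __x may be read and
     modified here, forcing any expression that feeds it to be
     recomputed at this point.  */
  #define math_opt_barrier(x) \
    ({ __typeof (x) __x = (x); __asm ("" : "+m" (__x)); __x; })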
Signed-off-by: John David Anglin <dave.anglin@bell.net>
When compiling a program that includes <bits/floatn.h> with a clang
version between 3.4 and 3.8.1 (inclusive), clang fails with `unknown
type name '__float128'; did you mean '__cfloat128'?`. This change
fixes the clang prerequisite macro call in floatn.h to check for clang
3.9 instead of 3.4, since support for __float128 was actually enabled
in 3.9 by:
commit 50f29e06a1b6a38f0bba9360cbff72c82d46cdd4
Author: Nemanja Ivanovic <nemanja.i.ibm@gmail.com>
Date:   Wed Apr 13 09:49:45 2016 +0000

    Enable support for __float128 in Clang
This fixes bug 32694.
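The guard now has roughly this shape (a simplified sketch of the
bits/floatn.h condition, which also varies by target and GCC
version):

  #if __GNUC_PREREQ (4, 3) || __glibc_clang_prereq (3, 9)
  # define __HAVE_FLOAT128 1
  #else
  # define __HAVE_FLOAT128 0
  #endif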
Signed-off-by: koraynilay <koray.fra@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 29803ed3ce420f01e7c567c97fc8945d5e5e5992)
The Intel compiler always defines __INTEL_LLVM_COMPILER. When SYCL is
enabled by -fsycl, it also defines SYCL_LANGUAGE_VERSION. Since the
Intel SYCL compiler doesn't support _Float128:
https://github.com/intel/llvm/issues/16903
define __HAVE_FLOAT128 to 0 for the Intel SYCL compiler.
This fixes BZ #32723.
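A simplified sketch of the added condition (the actual bits/floatn.h
change folds this into the existing __HAVE_FLOAT128 guard):

  #if defined __INTEL_LLVM_COMPILER && defined SYCL_LANGUAGE_VERSION
  # undef __HAVE_FLOAT128
  # define __HAVE_FLOAT128 0
  #endif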
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
(cherry picked from commit 5a4573be6f96ff49111bb6cae767676b5aafa7a8)
The pkey_alloc syscall can return ENOSPC to indicate either that all
keys are in use or that the system runs in a mode in which memory
protection keys are disabled. In such a case the test should not fail
and should just return unsupported.
This matches the behaviour of the generic tst-pkey.
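A hedged sketch of the test-side handling (simplified; the helper
name is illustrative):

  #include <errno.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  /* Treat ENOSPC like the other "pkeys unavailable" errors so the
     caller reports UNSUPPORTED instead of FAIL.  */
  static long
  try_pkey_alloc (void)
  {
    long key = syscall (SYS_pkey_alloc, 0UL, 0UL);
    if (key < 0 && (errno == ENOSYS || errno == EINVAL
                    || errno == ENOSPC))
      return -1;
    return key;
  }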
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 60f2d6be657aa8c663ee14bd266d343ae0f35afb)
Include the space needed to store the length of the message itself, in
addition to the message string. This resolves BZ #32582.
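A sketch of the sizing fix, assuming a length-prefixed message struct
like glibc's struct abort_msg_s (names and the ALIGN_UP macro are
simplified):

  #include <stddef.h>
  #include <string.h>

  struct abort_msg_s
  {
    unsigned int size;  /* Length of the message, stored in the same
                           mapping as the text itself.  */
    char msg[];
  };

  #define ALIGN_UP(n, a) (((n) + (a) - 1) & ~((a) - 1))

  /* The mapping must hold the header *and* the string; the bug was
     omitting sizeof (struct abort_msg_s) from the total.  */
  static size_t
  abort_msg_alloc_size (const char *str, size_t pagesize)
  {
    return ALIGN_UP (sizeof (struct abort_msg_s) + strlen (str) + 1,
                     pagesize);
  }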
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Linux 6.13 was released with a change that overwrites those bytes.
This means that the check_unused subtest fails.
Update the manual accordingly.
Tested-by: Xi Ruoyao <xry111@xry111.site>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
- Add GCS marking to some of the tests when the target supports GCS
- Fix the tst-ro-dynamic-mod.map linker script to avoid removing
  GNU properties
- Add a header with macros for GNU properties
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Allocate the GCS based on the stack size; this can be used for
coroutines (makecontext) and thread creation (if the kernel allows
user-allocated GCS).
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Unlike for BTI, the kernel does not process GCS properties, so update
GL(dl_aarch64_gcs) before the GCS status is set.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
check_gcs is called for each dependency of a DSO, but the GNU property
of ld.so is not processed, so ldso->l_mach.gcs may not be correct.
Just assume ld.so is GCS compatible, independently of its ELF marking.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
- Handle GCS marking
- Use l_searchlist.r_list for gcs (allows using the
  same function for a static exe)
Co-authored-by: Yury Khrustalev <yury.khrustalev@arm.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Use the dynamic linker start code to enable GCS in the dynamically
linked case, after _dl_start returns and before _dl_start_user, which
marks the point after which user code may run.
As in the statically linked case, this ensures that GCS is enabled on
a top-level stack frame.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Use the ARCH_SETUP_TLS hook to enable GCS in the statically linked
case. The system call must be inlined, and then GCS is enabled on a
top-level stack frame that does not return and has no exception
handlers above it.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This tunable controls the Guarded Control Stack (GCS) for the process.
0 = disabled: do not enable GCS
1 = enforced: check markings and fail if any binary is not marked
2 = optional: check markings but keep GCS off if a binary is unmarked
3 = override: enable GCS, markings are ignored
By default it is 0, so GCS is disabled; a value of 1 enables GCS.
The status is stored into GL(dl_aarch64_gcs) early and only applied
later, since enabling GCS is tricky: it must happen on a top-level
stack frame. GL is used instead of GLRO because it may need updates
depending on libraries loaded after read-only protection is applied;
however, library-marking-based GCS setting is not yet implemented.
Describe new tunable in the manual.
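For example, assuming the tunable is exposed as glibc.cpu.aarch64_gcs
(the name is taken as an assumption here; the manual text added by
this patch is authoritative), it can be set through the usual tunables
environment variable:

  GLIBC_TUNABLES=glibc.cpu.aarch64_gcs=1 ./app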
Co-authored-by: Yury Khrustalev <yury.khrustalev@arm.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Changed the makecontext logic: previously, the first setcontext
jumped straight to the user callback function and the return address
was set to __startcontext. This does not work when GCS is enabled, as
the integrity of the return address is protected; instead, the context
is set up such that setcontext jumps to __startcontext, which calls
the user callback (passed in x20).
The map_shadow_stack syscall is used to allocate a suitably sized GCS
(which includes some reserved area to account for altstack signal
handlers and otherwise supports the maximum number of 16-byte-aligned
stack frames on the given stack); however, the GCS is never freed, as
the lifetime of the ucontext and the related stack is user-managed.
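A hedged sketch of the allocation via the map_shadow_stack syscall
(the sizing heuristic and fallback constants are illustrative, not
glibc's exact formula; arm64 GCS support is in Linux 6.13+):

  #include <stddef.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  #ifndef __NR_map_shadow_stack
  # define __NR_map_shadow_stack 453          /* asm-generic number.  */
  #endif
  #ifndef SHADOW_STACK_SET_TOKEN
  # define SHADOW_STACK_SET_TOKEN (1UL << 0)  /* Cap token at the top.  */
  #endif
  #ifndef SHADOW_STACK_SET_MARKER
  # define SHADOW_STACK_SET_MARKER (1UL << 1) /* arm64 end marker.  */
  #endif

  static void *
  alloc_gcs_sketch (size_t stack_size)
  {
    /* Each 16-byte stack frame consumes one 8-byte GCS entry, so half
       the stack size bounds the GCS; add slack for altstack signal
       handlers.  Returns (void *) -1 on failure, like mmap.  */
    size_t gcs_size = stack_size / 2 + 8 * 1024;
    return (void *) syscall (__NR_map_shadow_stack, 0UL, gcs_size,
                             SHADOW_STACK_SET_MARKER
                             | SHADOW_STACK_SET_TOKEN);
  }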
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The userspace ucontext needs to store GCSPR; it does not have to be
compatible with the kernel ucontext. For now we use the Linux
struct gcs_context layout, but only use the gcspr field from it.
The implementation is similar to the longjmp code: it supports
switching GCS if the target GCS is capped, and unwinding a continuous
GCS to a previous state.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
This implementation ensures that longjmp across different stacks
works: it scans for the GCS cap token and switches GCS if necessary,
then the target GCSPR is restored with a GCSPOPM loop once the current
GCSPR is on the same GCS.
This makes longjmp linear time in the number of jumped-over stack
frames when GCS is enabled.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The target-specific internal __longjmp is called with a __jmp_buf
argument whose size is exposed in the ABI. On aarch64 this has
no space left, so GCSPR cannot be restored in longjmp in the usual
way, which is needed for the Guarded Control Stack (GCS) extension.
setjmp is implemented via __sigsetjmp, which has a jmp_buf argument;
however, it is also called with a __pthread_unwind_buf_t argument cast
to jmp_buf (in cancellation cleanup code built with -fno-exceptions).
The two types, jmp_buf and __pthread_unwind_buf_t, have common bits
beyond the __jmp_buf field, and there is unused space there which we
can use for saving GCSPR.
For this to work some bits of those two generic types have to be
reserved for target specific use and the generic code in glibc has
to ensure that __longjmp is always called with a __jmp_buf that is
embedded into one of those two types. Morally __longjmp should be
changed to take jmp_buf as argument, but that is an intrusive change
across targets.
Note: longjmp is never called with __pthread_unwind_buf_t from user
code; only the internal __libc_longjmp is called with that type, and
thus the two types could have separate longjmp implementations on a
target. We don't rely on this now (but might in the future, given that
cancellation unwind does not need to restore GCSPR).
Given the above, this patch finds an unused slot for GCSPR. This
placement is not exposed in the ABI so it may change in the future.
This is also very target ABI specific so the generic types cannot
be easily changed to clearly mark the reserved fields.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>