Wilco Dijkstra
03e0cad3a0 math: Improve layout of exp/exp10 data
GCC aligns global data to 16 bytes if its size is >= 16 bytes.  This patch
changes the exp_data struct slightly so that the fields are better aligned
and without gaps.  As a result, on targets that support them, more load-pair
instructions are used in exp.

The exp benchmark improves by 2.5%, "144bits" by 7.2%, and "768bits" by 12.7% on
Neoverse V2.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 5afaf99edb326fd9f36eb306a828d129a3a1d7f7)
(cherry picked from commit 5a08d049dc5037e89eb95bb1506652f0043fa39e)
2025-02-28 15:04:26 +00:00
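
A rough illustration of the layout idea above (field names are made up and
this is not glibc's actual exp_data): keeping 8-byte fields adjacent and
without gaps, and aligning the whole table to 16 bytes, lets targets with
load-pair instructions fetch two fields per load.

  /* Illustrative only -- not the real exp_data definition.  */
  struct exp_table_example
  {
    double invln2N;       /* offset 0                         */
    double shift;         /* offset 8, pairs with invln2N     */
    double negln2hiN;     /* offset 16                        */
    double negln2loN;     /* offset 24, pairs with negln2hiN  */
    double poly[4];       /* offsets 32..63, no padding gaps  */
  } __attribute__ ((aligned (16)));
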
Wilco Dijkstra
4f3b4e7b52 AArch64: Use prefer_sve_ifuncs for SVE memset
Use prefer_sve_ifuncs for SVE memset just like memcpy.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
(cherry picked from commit 0f044be1dae5169d0e57f8d487b427863aeadab4)
2025-02-28 15:04:26 +00:00
Wilco Dijkstra
c0aebc61b0 AArch64: Add SVE memset
Add an SVE memset based on the generic memset, using a predicated load for
sizes < 16.  Unaligned memsets of 128-1024 bytes improve by ~20% on average by
using aligned stores for the last 64 bytes.  Performance of the random memset
benchmark improves by ~2% on Neoverse V1.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
(cherry picked from commit 163b1bbb76caba4d9673c07940c5930a1afa7548)
2025-02-28 15:04:24 +00:00
Wilco Dijkstra
eb2d69db2d math: Improve layout of expf data
GCC aligns global data to 16 bytes if its size is >= 16 bytes.  This patch
changes the exp2f_data struct slightly so that the fields are better aligned.
As a result, on targets that support them, load-pair instructions accessing
poly_scaled and invln2_scaled are now 16-byte aligned.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 44fa9c1080fe6a9539f0d2345b9d2ae37b8ee57a)
2025-02-28 15:03:30 +00:00
Wilco Dijkstra
2c9bfcf3ed AArch64: Remove zva_128 from memset
Remove ZVA 128 support from memset - the new memset no longer
guarantees count >= 256, which can result in underflow and a
crash if ZVA size is 128 ([1]).  Since only one CPU uses a ZVA
size of 128 and its memcpy implementation was removed in commit
e162ab2bf1b82c40f29e1925986582fa07568ce8, remove this special
case too.

[1] https://sourceware.org/pipermail/libc-alpha/2024-November/161626.html

Reviewed-by: Andrew Pinski <quic_apinski@quicinc.com>
(cherry picked from commit a08d9a52f967531a77e1824c23b5368c6434a72d)
2025-02-28 15:03:30 +00:00
Wilco Dijkstra
37243ba10a AArch64: Optimize memset
Improve small memsets by avoiding branches and using overlapping stores.
Use DC ZVA for copies over 128 bytes.  Remove unnecessary code for ZVA sizes
other than 64 and 128.  Performance of the random memset benchmark improves
by 24% on Neoverse N1.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit cec3aef32412779e207f825db0d057ebb4628ae8)
2025-02-28 15:03:28 +00:00
Wilco Dijkstra
4909fb4d96 AArch64: Improve generic strlen
Improve performance by handling another 16 bytes before entering the loop.
Use ADDHN in the loop to avoid SHRN+FMOV when it terminates.  Change final
size computation to avoid increasing latency.  On Neoverse V1 performance
of the random strlen benchmark improves by 4.6%.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 3dc426b642dcafdbc11a99f2767e081d086f5fc7)
2025-02-28 15:03:04 +00:00
Florian Weimer
7971add7ee Fix underallocation of abort_msg_s struct (CVE-2025-0395)
Include the space needed to store the length of the message itself, in
addition to the message string.  This resolves BZ #32582.

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 68ee0f704cb81e9ad0a78c644a83e1e9cd2ee578)

Conflict in sysdeps/posix/libc_fatal.c due to missing cleanup after
backtrace removal.
2025-01-22 18:30:43 +01:00
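
A hedged sketch of the allocation pattern the fix describes, with
illustrative names (struct msg_buf and alloc_msg are not glibc internals):
space for the length field stored in front of the text must be reserved in
addition to the message string itself.

  #include <stdlib.h>
  #include <string.h>

  struct msg_buf
  {
    unsigned int size;          /* length field stored before the text */
    char msg[];
  };

  static struct msg_buf *
  alloc_msg (const char *str)
  {
    size_t len = strlen (str) + 1;
    /* The underallocation bug is forgetting sizeof (struct msg_buf) here.  */
    struct msg_buf *buf = malloc (sizeof (struct msg_buf) + len);
    if (buf == NULL)
      return NULL;
    buf->size = sizeof (struct msg_buf) + len;
    memcpy (buf->msg, str, len);
    return buf;
  }
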
Florian Weimer
53b3a3a26f elf: Support recursive use of dynamic TLS in interposed malloc
It turns out that quite a few applications use bundled mallocs that
have been built to use global-dynamic TLS (instead of the recommended
initial-exec TLS).  The previous workaround from
commit afe42e935b3ee97bac9a7064157587777259c60e ("elf: Avoid some
free (NULL) calls in _dl_update_slotinfo") unfortunately does not fix
all encountered cases.

This change avoids the TLS generation update for recursive use
of TLS from a malloc that was called during a TLS update.  This
is possible because an interposed malloc has a fixed module ID and
TLS slot.  (It cannot be unloaded.)  If an initially-loaded module ID
is encountered in __tls_get_addr and the dynamic linker is already
in the middle of a TLS update, use the outdated DTV, thus avoiding
another call into malloc.  It's still necessary to update the
DTV to the most recent generation, to get out of the slow path,
which is why the check for recursion is needed.

The bookkeeping is done using a global counter instead of a per-thread
flag because TLS access in the dynamic linker is tricky.

All this will go away once the dynamic linker stops using malloc
for TLS, likely as part of a change that pre-allocates all TLS
during pthread_create/dlopen.

Fixes commit d2123d68275acc0f061e73d5f86ca504e0d5a344 ("elf: Fix slow
tls access after dlopen [BZ #19924]").

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 018f0fc3b818d4d1460a4e2384c24802504b1d20)
2025-01-11 08:19:41 -08:00
H.J. Lu
f739705944 sysdeps/x86/Makefile: Split and sort tests
Put each test on a separate line and sort tests.

(cherry picked from commit 7e03e0de7e7c2de975b5c5e18f5a4b0c75816674)
2025-01-10 08:49:54 -08:00
Noah Goldstein
27296daa25 x86: Only align destination to 1x VEC_SIZE in memset 4x loop
Current code aligns to 2x VEC_SIZE.  Aligning to 2x has no effect on
performance other than potentially resulting in an additional
iteration of the loop.
1x maintains aligned stores (the only reason to align in this case)
and doesn't incur any unnecessary loop iterations.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>

(cherry picked from commit 9469261cf1924d350feeec64d2c80cafbbdcdd4d)
2025-01-10 08:49:42 -08:00
Szabolcs Nagy
147a830307 elf: Fix slow tls access after dlopen [BZ #19924]
In short: __tls_get_addr checks the global generation counter and, if
the current dtv is older, _dl_update_slotinfo updates the dtv up to the
generation of the accessed module.  So if the global generation is newer
than the generation of the module, __tls_get_addr keeps hitting the
slow dtv update path.  The dtv update path includes a number of checks
to see if any update is needed, and this already causes a measurable tls
access slowdown after dlopen.

It may be possible to detect up-to-date dtv faster.  But if there are
many modules loaded (> TLS_SLOTINFO_SURPLUS) then this requires at
least walking the slotinfo list.

This patch tries to update the dtv to the global generation instead, so
after a dlopen the tls access slow path is only hit once.  The modules
with larger generation than the accessed one were not necessarily
synchronized before, so additional synchronization is needed.

This patch uses acquire/release synchronization when accessing the
generation counter.

Note: in the x86_64 version of dl-tls.c the generation is only loaded
once, since relaxed mo is not faster than acquire mo load.

I have not benchmarked this.  Tested by Adhemerval Zanella on aarch64,
powerpc, sparc, and x86, who reported that it fixes the performance issue
of bug 19924.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit d2123d68275acc0f061e73d5f86ca504e0d5a344)
2025-01-10 08:49:31 -08:00
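
A hedged sketch of the synchronization scheme described above, using C11
atomics and illustrative names (this is not glibc's internal API): the
writer publishes a new generation with a release store, and the reader
acquire-loads it and updates the DTV all the way to that generation.

  #include <stdatomic.h>
  #include <stddef.h>

  static atomic_size_t global_generation;   /* written on dlopen/dlclose */
  static size_t dtv_generation;             /* per-thread in the real code */

  static void
  update_dtv_to (size_t gen)
  {
    /* Placeholder for walking the slotinfo list and refreshing the DTV.  */
    dtv_generation = gen;
  }

  /* dlopen path (writer): the release store makes the slotinfo updates
     done before it visible to readers that observe the new generation.  */
  static void
  publish_generation (size_t new_gen)
  {
    atomic_store_explicit (&global_generation, new_gen, memory_order_release);
  }

  /* __tls_get_addr path (reader): acquire load, then update up to the
     global generation so the slow path is hit only once per dlopen.  */
  static void
  maybe_update_dtv (void)
  {
    size_t gen = atomic_load_explicit (&global_generation,
                                       memory_order_acquire);
    if (dtv_generation < gen)
      update_dtv_to (gen);
  }
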
H.J. Lu
f1c33665e5 x86: Check the lower byte of EAX of CPUID leaf 2 [BZ #30643]
The old Intel software developer manual specified that the low byte of
EAX of CPUID leaf 2 returned 1, which indicated the number of rounds of
CPUID leaf 2 needed to retrieve the complete cache information.  The
newer Intel manual has been changed to say that it should always return 1
and be ignored.  If the lower byte isn't 1, CPUID leaf 2 can't be used.
In this case, we ignore CPUID leaf 2 and use CPUID leaf 4 instead.  If
CPUID leaf 4 doesn't contain the cache information, cache information
isn't available at all.  This addresses BZ #30643.

(cherry picked from commit 1493622f4f9048ffede3fbedb64695efa49d662a)
2025-01-10 08:49:19 -08:00
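
A hedged sketch of the described check using GCC/Clang's <cpuid.h> helper
(illustrative structure only, not the actual glibc code):

  #include <cpuid.h>
  #include <stdbool.h>

  /* Return true if the cache descriptors from CPUID leaf 2 may be used,
     i.e. the low byte of EAX is 1.  Otherwise the caller falls back to
     enumerating caches via CPUID leaf 4.  */
  static bool
  cpuid_leaf2_usable (void)
  {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid (2, &eax, &ebx, &ecx, &edx))
      return false;
    return (eax & 0xff) == 1;
  }
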
H.J. Lu
78a9a50bf2 x86_64: Add log1p with FMA
On Skylake, it changes log1p bench performance by:

        Before       After     Improvement
max     63.349       58.347       8%
min     4.448        5.651        -30%
mean    12.0674      10.336       14%

The minimum code path is

 if (hx < 0x3FDA827A)                          /* x < 0.41422  */
    {
      if (__glibc_unlikely (ax >= 0x3ff00000))           /* x <= -1.0 */
        {
	   ...
        }
      if (__glibc_unlikely (ax < 0x3e200000))           /* |x| < 2**-29 */
        {
          math_force_eval (two54 + x);          /* raise inexact */
          if (ax < 0x3c900000)                  /* |x| < 2**-54 */
            {
	      ...
            }
          else
            return x - x * x * 0.5;

FMA and non-FMA code sequences look similar.  Non-FMA version is slightly
faster.  Since log1p is called by asinh and atanh, it improves asinh
performance by:

        Before       After     Improvement
max     75.645       63.135       16%
min     10.074       10.071       0%
mean    15.9483      14.9089      6%

and improves atanh performance by:

        Before       After     Improvement
max     91.768       75.081       18%
min     15.548       13.883       10%
mean    18.3713      16.8011      8%

(cherry picked from commit a8ecb126d4c26c52f4ad828c566afe4043a28155)
2025-01-10 08:49:05 -08:00
H.J. Lu
f3a9a9facc x86_64: Add expm1 with FMA
On Skylake, it improves expm1 bench performance by:

        Before       After     Improvement
max     70.204       68.054       3%
min     20.709       16.2         22%
mean    22.1221      16.7367      24%

NB: Add

extern long double __expm1l (long double);
extern long double __expm1f128 (long double);

for __typeof (__expm1l) and __typeof (__expm1f128) when __expm1 is
defined, since __expm1 may be expanded in their declarations, which
would cause a build failure.

(cherry picked from commit 1b214630ce6f7e0099b8b6f87246246739b079cf)
2025-01-10 08:48:56 -08:00
H.J. Lu
d3bc49260c x86_64: Add log2 with FMA
On Skylake, it improves log2 bench performance by:

        Before       After     Improvement
max     208.779      63.827       69%
min     9.977        6.55         34%
mean    10.366       6.8191       34%

(cherry picked from commit f6b10ed8e9a00de49d0951e760cc2b5288862b47)
2025-01-10 08:48:46 -08:00
H.J. Lu
abfa5410a0 x86_64: Sort fpu/multiarch/Makefile
Sort Makefile variables using scripts/sort-makefile-lines.py.

No code generation changes observed in libm.  No regressions on x86_64.

(cherry picked from commit 881546979d0219c18337e1b4f4d00cfacab13c40)
2025-01-10 08:48:30 -08:00
Florian Weimer
fbcde5ea46 x86: Avoid integer truncation with large cache sizes (bug 32470)
Some hypervisors report 1 TiB L3 cache size.  This results
in some variables incorrectly getting zeroed, causing crashes
in memcpy/memmove because invariants are violated.

(cherry picked from commit 61c3450db96dce96ad2b24b4f0b548e6a46d68e5)
2024-12-17 19:07:38 +01:00
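
A small self-contained example of the truncation involved (not glibc code):
a 1 TiB size does not fit in 32 bits, so storing it in a 32-bit variable
silently becomes 0, which then breaks size-based invariants downstream.

  #include <stdio.h>

  int
  main (void)
  {
    unsigned long long l3_size = 1ULL << 40;           /* 1 TiB */
    unsigned int truncated = (unsigned int) l3_size;   /* becomes 0 */
    printf ("%llu -> %u\n", l3_size, truncated);
    return 0;
  }
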
Michael Jeanson
314419978f nptl: initialize cpu_id_start prior to rseq registration
When adding explicit initialization of rseq fields prior to
registration, I glossed over the fact that 'cpu_id_start' is also
documented as initialized by user-space.

While current kernels don't validate the content of this field on
registration, future ones could.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
(cherry picked from commit d9f40387d3305d97e30a8cf8724218c42a63680a)
2024-12-06 16:08:55 +00:00
Michael Jeanson
b1f62861bf nptl: initialize rseq area prior to registration
Per the rseq syscall documentation, 3 fields are required to be
initialized by userspace prior to registration: 'cpu_id', 'rseq_cs',
and 'flags'.  Since we have no guarantee that 'struct pthread'
is cleared on all architectures, explicitly set those 3 fields prior to
registration.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 97f60abd25628425971f07e9b0e7f8eec0741235)
2024-12-06 16:08:55 +00:00
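
A hedged, stand-alone sketch of the initialization these two rseq commits
describe, assuming the Linux UAPI <linux/rseq.h> header (struct rseq and
RSEQ_CPU_ID_UNINITIALIZED) and an available SYS_rseq number; the signature
constant shown is the x86 value, and glibc performs the equivalent
internally during thread setup.

  #include <linux/rseq.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static struct rseq rseq_area __attribute__ ((aligned (32)));

  static long
  register_rseq (void)
  {
    /* Set the user-space-owned fields explicitly instead of relying on
       the memory being zeroed: cpu_id_start = 0, rseq_cs = 0, flags = 0...  */
    memset (&rseq_area, 0, sizeof rseq_area);
    /* ...and cpu_id = RSEQ_CPU_ID_UNINITIALIZED (-1).  */
    rseq_area.cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
    return syscall (SYS_rseq, &rseq_area, sizeof rseq_area, 0, 0x53053053);
  }
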
Maciej W. Rozycki
c3a3240b64 nptl: Use <support/check.h> facilities in tst-setuid3
Remove the local FAIL macro in favor of FAIL_EXIT1 from <support/check.h>,
which provides equivalent reporting, with the name of the file and the
line number of the failure site additionally included.  Remove
FAIL_ERR altogether and include ": %m" explicitly with the format string
supplied to FAIL_EXIT1, as there seems little value in having a separate
macro just for this.

Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit 8c98195af6e6f1ce21743fc26c723e0f7e45bcf2)
2024-08-30 15:28:59 -04:00
Florian Weimer
b41034cebf Add mremap tests
Add tests for MREMAP_MAYMOVE and MREMAP_FIXED.  On Linux, also test
MREMAP_DONTUNMAP.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit ff0320bec2810192d453c579623482fab87bfa01)
2024-08-01 17:15:36 +02:00
H.J. Lu
ab482a557d linux: Update the mremap C implementation [BZ #31968]
Update the mremap C implementation to support the optional argument for
MREMAP_DONTUNMAP added in Linux 5.7, since it may not always be correct
to implement a variadic function as a non-variadic function on all Linux
targets.  Return MAP_FAILED and set errno to EINVAL for unknown flag bits.
This fixes BZ #31968.

Note: A test must be added when a new flag bit is introduced.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 6c40cb0e9f893d49dc7caee580a055de53562206)
2024-08-01 16:59:32 +02:00
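
A hedged sketch of the variadic handling the commit describes, under an
illustrative name (my_mremap is not the glibc symbol): the optional fifth
argument is read only when a flag that needs it is present, and unknown
flag bits fail with EINVAL.

  #define _GNU_SOURCE
  #include <errno.h>
  #include <stdarg.h>
  #include <stddef.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static void *
  my_mremap (void *old_addr, size_t old_size, size_t new_size, int flags, ...)
  {
    void *new_addr = NULL;

    /* Reject unknown flag bits.  */
    if (flags & ~(MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_DONTUNMAP))
      {
        errno = EINVAL;
        return MAP_FAILED;
      }

    /* The fifth argument exists only for MREMAP_FIXED and, since
       Linux 5.7, MREMAP_DONTUNMAP.  */
    if (flags & (MREMAP_FIXED | MREMAP_DONTUNMAP))
      {
        va_list ap;
        va_start (ap, flags);
        new_addr = va_arg (ap, void *);
        va_end (ap);
      }

    /* syscall returns -1 on error, which equals MAP_FAILED when cast.  */
    return (void *) syscall (SYS_mremap, old_addr, old_size, new_size, flags,
                             new_addr);
  }
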
Stafford Horne
f908fa07c6 misc: Add support for Linux uio.h RWF_NOAPPEND flag
Linux 6.9 adds a new flag that allows per-I/O operations to
disable append mode even if a file was opened with the flag O_APPEND.
This is done with the new RWF_NOAPPEND flag.

This caused two test failures, as these tests expected the flag 0x00000020
to be unused.  Adding the flag definition now fixes these tests on Linux
6.9 (v6.9-rc1).

  FAIL: misc/tst-preadvwritev2
  FAIL: misc/tst-preadvwritev64v2

This patch adds the flag, adjusts the test, and adds details to the
documentation.

Link: https://lore.kernel.org/all/20200831153207.GO3265@brightrain.aerifal.cx/
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 3db9d208dd5f30b12900989c6d2214782b8e2011)
2024-07-20 19:30:25 +02:00
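
A hedged usage sketch, assuming headers that already define RWF_NOAPPEND
(the flag this patch adds, value 0x00000020): pwritev2 with it writes at
the given offset even if the descriptor was opened with O_APPEND.

  #define _GNU_SOURCE
  #include <sys/types.h>
  #include <sys/uio.h>

  /* Write LEN bytes from BUF at OFFSET, disabling append mode for this
     one operation.  Returns bytes written or -1 (kernels before 6.9
     reject the flag).  */
  static ssize_t
  write_at_ignoring_append (int fd, const void *buf, size_t len, off_t offset)
  {
    struct iovec iov = { .iov_base = (void *) buf, .iov_len = len };
    return pwritev2 (fd, &iov, 1, offset, RWF_NOAPPEND);
  }
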
Florian Weimer
b0471d1605 Linux: Make __rseq_size useful for feature detection (bug 31965)
The __rseq_size value is now the active area of struct rseq
(so 20 initially), not the full struct size including padding
at the end (32 initially).

Update misc/tst-rseq to print some additional diagnostics.

Reviewed-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
(cherry picked from commit 2e456ccf0c34a056e3ccafac4a0c7effef14d918)
2024-07-16 17:59:45 +02:00
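
A hedged sketch of the feature-detection pattern this enables, assuming
<sys/rseq.h> declares __rseq_size and a struct rseq that already has the
extensible-ABI fields (node_id is used here as the example): a field is
usable only if registration succeeded and the active area covers it.

  #include <stdbool.h>
  #include <stddef.h>
  #include <sys/rseq.h>

  /* True if the registered rseq area covers the field ending at
     OFFSET + SIZE.  __rseq_size == 0 means rseq is not registered.  */
  static bool
  rseq_area_covers (size_t offset, size_t size)
  {
    return __rseq_size >= offset + size;
  }

  static bool
  rseq_has_node_id (void)
  {
    return rseq_area_covers (offsetof (struct rseq, node_id),
                             sizeof (((struct rseq *) 0)->node_id));
  }
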
Adhemerval Zanella
bce1d80721 elf: Make dl-rseq-symbols Linux only
And avoid a Hurd build failure.

Checked on x86_64-linux-gnu.

(cherry picked from commit 9fc639f654dc004736836613be703e6bed0c36a8)
2024-07-16 17:25:28 +02:00
Michael Jeanson
72f4c0533c nptl: fix potential merge of __rseq_* relro symbols
While working on a patch to add support for the extensible rseq ABI, we
came across an issue where a new 'const' variable would be merged with
the existing '__rseq_size' variable.  We tracked this down to the use of
'-fmerge-all-constants', which allows the compiler to merge identical
constant variables.  This means that all 'const' variables in a compilation
unit that are of the same size and are initialized to the same value can
be merged.

In this specific case, on 32-bit systems 'unsigned int' and 'ptrdiff_t'
are both 4 bytes and initialized to 0, which should trigger the merge.
However, for reasons we haven't delved into, when the attribute 'section
(".data.rel.ro")' is added to the mix, only variables of the exact same
type are merged.  As far as we know this behavior is not specified
anywhere and could change with a new compiler version, hence this patch.

Move the definitions of these variables into an assembler file and add
hidden writable aliases for internal use. This has the added bonus of
removing the asm workaround to set the values on rseq registration.

Tested on Debian 12 with GCC 12.2.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 2b92982e2369d292560793bee8e730f695f48ff3)
2024-07-16 17:25:28 +02:00
Stefan Liebler
a70c55a91b s390x: Fix segfault in wcsncmp [BZ #31934]
The z13/vector-optimized wcsncmp implementation segfaults if n=1
and there is only one character (equal on both strings) before
the page end.  Then it loads and compares one character and fails
to check n again.  The following load fails.

This patch removes the extra load and compare of the first character
and just starts with the loop, which uses vector-load-to-block-boundary.
This code path also checks n.

With this patch both tests are passing:
- the simplified one mentioned in the bugzilla 31934
- the full one in Florian Weimer's patch:
"manual: Document a GNU extension for strncmp/wcsncmp"
(https://patchwork.sourceware.org/project/glibc/patch/874j9eml6y.fsf@oldenburg.str.redhat.com/):
On s390x-linux-gnu (z16), the new wcsncmp test fails due to bug 31934.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>

(cherry picked from commit 9b7651410375ec8848a1944992d663d514db4ba7)
2024-07-16 10:30:45 +02:00
Florian Weimer
85242ed63a login: structs utmp, utmpx, lastlog _TIME_BITS independence (bug 30701)
These structs describe file formats under /var/log, and should not
depend on the definition of _TIME_BITS.  This is achieved by
defining __WORDSIZE_TIME64_COMPAT32 to 1 on 32-bit ports that
support 32-bit time_t values (where __time_t is 32 bits).

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 9abdae94c7454c45e02e97e4ed1eb1b1915d13d8)
2024-05-02 19:30:48 +02:00
Florian Weimer
eeeaf0fe2d login: Check default sizes of structs utmp, utmpx, lastlog
The default <utmp-size.h> is for ports with a 64-bit time_t.
Ports with a 32-bit time_t or with __WORDSIZE_TIME64_COMPAT32=1
need to override it.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 4d4da5aab936504b2d3eca3146e109630d9093c4)
2024-05-02 19:30:48 +02:00
Adhemerval Zanella
5b48737556 sparc: Remove 64 bit check on sparc32 wordsize (BZ 27574)
The sparc32 is always 32 bits.

Checked on sparcv9-linux-gnu.

(cherry picked from commit dd57f5e7b652772499cb220d78157c1038d24f06)
2024-05-02 19:30:48 +02:00
Florian Weimer
78ad1b1b0d powerpc: Fix ld.so address determination for PCREL mode (bug 31640)
This seems to have stopped working with some GCC 14 versions,
which clobber r2.  With other compilers, the kernel-provided
r2 value is still available at this point.

Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
(cherry picked from commit 14e56bd4ce15ac2d1cc43f762eb2e6b83fec1afe)
2024-04-14 10:34:10 +02:00
Wilco Dijkstra
24de733967 AArch64: Check kernel version for SVE ifuncs
Old Linux kernels disable SVE after every system call.  Calling the
SVE-optimized memcpy afterwards will then cause a trap to reenable SVE.
As a result, applications with a high use of syscalls may run slower with
the SVE memcpy.  This is true for kernels from 4.15.0 up to but not
including 6.2.0, except for 5.14.0, which was patched.  Avoid this by
checking the kernel version and selecting the SVE ifunc on modern kernels.

Parse the kernel version reported by uname() into a 24-bit kernel.major.minor
value without calling any library functions.  If uname() is not supported or
if the version format is not recognized, assume the kernel is modern.

Tested-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 2e94e2f5d2bf2de124c8ad7da85463355e54ccb2)
2024-04-08 17:08:03 +01:00
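
A hedged sketch of the version parsing described, with illustrative names
(not the actual glibc code, which avoids library calls): pack major, minor,
and patch from the uname() release string into one comparable value, and
treat failures as "modern kernel".

  #include <sys/utsname.h>

  /* Returns (major << 16) | (minor << 8) | patch, or -1 if the release
     string cannot be parsed (callers then assume a modern kernel).  */
  static int
  kernel_version (void)
  {
    struct utsname u;
    if (uname (&u) != 0 || u.release[0] < '0' || u.release[0] > '9')
      return -1;

    int parts[3] = { 0, 0, 0 };
    const char *s = u.release;
    for (int i = 0; i < 3 && *s >= '0' && *s <= '9'; i++)
      {
        while (*s >= '0' && *s <= '9')
          parts[i] = parts[i] * 10 + (*s++ - '0');
        if (*s == '.')
          s++;
      }
    return (parts[0] << 16) | (parts[1] << 8) | parts[2];
  }

An ifunc selector could then, for example, prefer the SVE variant when the
value is -1 (unknown, assume modern) or at least 0x060200 (6.2.0), matching
the ranges described above.
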
Szabolcs Nagy
b77e357bf9 aarch64: fix check for SVE support in assembler
Due to GCC bug 110901, -mcpu can override the -march setting when compiling
asm code, and thus a compiler targeting a specific CPU can fail the
configure check even when binutils gas supports SVE.

The workaround is that an explicit .arch directive overrides both -mcpu
and -march, and since that's what the actual SVE memcpy uses, the
configure check should use that too, even if the GCC issue is fixed
independently.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 73c26018ed0ecd9c807bb363cc2c2ab4aca66a82)
2024-04-08 16:53:05 +01:00
Andreas Schwab
88e96e7b5f aarch64: correct CFI in rawmemchr (bug 31113)
The .cfi_return_column directive changes the return column for the whole
FDE range.  But the actual intent is to tell the unwinder that the value
in x30 (lr) now resides in x15 after the move, and that is expressed by
the .cfi_register directive.

(cherry picked from commit 3f798427884fa57770e8e2291cf58d5918254bb5)
2024-04-08 16:52:55 +01:00
Wilco Dijkstra
5a5211b438 AArch64: Remove Falkor memcpy
The latest implementations of memcpy are actually faster than the Falkor
implementations [1], so remove the falkor/phecda ifuncs for memcpy and
the now unused IS_FALKOR/IS_PHECDA defines.

[1] https://sourceware.org/pipermail/libc-alpha/2022-December/144227.html

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 2f5524cc5381eb75fef55f7901bb907bd5628333)
2024-04-08 16:52:39 +01:00
Wilco Dijkstra
1da0176152 AArch64: Add memset_zva64
Add a specialized memset for the common ZVA size of 64 to avoid the
overhead of reading the ZVA size.  Since the code is identical to
__memset_falkor, remove the latter.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 3d7090f14b13312320e425b27dcf0fe72de026fd)
2024-04-08 16:51:34 +01:00
Wilco Dijkstra
f9ae26cbbe AArch64: Cleanup emag memset
Clean up the emag memset: merge in the memset_base64.S file and remove
the unused ZVA code (since it is disabled on emag).

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 9627ab99b50d250c6dd3001a3355aa03692f7fe5)
2024-04-08 16:50:46 +01:00
Wilco Dijkstra
6bb4121873 AArch64: Cleanup ifuncs
Clean up ifuncs.  Remove uses of libc_hidden_builtin_def, use ENTRY rather than
ENTRY_ALIGN, and remove unnecessary defines and conditional compilation.  Rename
strlen_mte to strlen_generic.  Remove rtld-memset.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 9fd3409842b3e2d31cff5dbd6f96066c430f0aa2)
2024-04-08 16:49:14 +01:00
Wilco Dijkstra
3bac018a98 AArch64: Add support for MOPS memcpy/memmove/memset
Add support for MOPS in cpu_features and INIT_ARCH.  Add ifuncs using MOPS for
memcpy, memmove and memset (use .inst for now so it works with all binutils
versions without needing complex configure and conditional compilation).

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 2bd00179885928fd95fcabfafc50e7b5c6e660d2)
2024-04-08 16:47:51 +01:00
Joseph Myers
aaa0fc324d Add HWCAP2_MOPS from Linux 6.5 to AArch64 bits/hwcap.h
Linux 6.5 adds a new AArch64 HWCAP2 value, HWCAP2_MOPS.  Add it to
glibc's bits/hwcap.h.

Tested with build-many-glibcs.py for aarch64-linux-gnu.

(cherry picked from commit ff5d2abd18629e0efac41e31699cdff3be0e08fa)
2024-04-08 16:47:23 +01:00
Wilco Dijkstra
f45608f6d7 AArch64: Improve SVE memcpy and memmove
Improve SVE memcpy by copying 2 vectors if the size is small enough.
This improves performance of random memcpy by ~9% on Neoverse V1, and
33-64 byte copies are ~16% faster.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit d2d3f3720ce627a4fe154d8dd14db716a32bcc6e)
2024-04-08 16:46:36 +01:00
Wilco Dijkstra
600098c58a AArch64: Improve strrchr
Use shrn for narrowing the mask which simplifies code and speeds up small
strings.  Unroll the first search loop to improve performance on large
strings.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 55599d480437dcf129b41b95be32b48f2a9e5da9)
2024-04-08 16:46:23 +01:00
Wilco Dijkstra
7cbcc95927 AArch64: Optimize strnlen
Optimize strnlen using the shrn instruction and improve the main loop.
Small strings are around 10% faster, large strings are 40% faster on
modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit ad098893ba3c3344a5f2f6ab1627c47204afdb47)
2024-04-08 16:46:11 +01:00
Wilco Dijkstra
daa13ed0ac AArch64: Optimize strlen
Optimize strlen by unrolling the main loop.  Large strings are 64% faster on
modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 03c8ce5000198947a4dd7b2c14e5131738fda62b)
2024-04-08 16:45:59 +01:00
Wilco Dijkstra
90d9b8897b AArch64: Optimize strcpy
Unroll the main loop.  Large strings are around 20% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 349e48c01e85bd96006860084e76d322e6ca02f1)
2024-04-08 16:45:50 +01:00
Wilco Dijkstra
8ee69bd2df AArch64: Improve strchrnul
Unroll the main loop, which improves performance slightly.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 09ebd8549b2ce5a3a6c0c7c5f3e62227faf50a99)
2024-04-08 16:45:31 +01:00
Wilco Dijkstra
2e4cdcbd35 AArch64: Optimize strchr
Simplify calculation of the mask using shrn.  Unroll the main loop.
Small strings are 20% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 51541a229740801882490177fa178e49264b13fb)
2024-04-08 16:45:20 +01:00
Wilco Dijkstra
ef62fc2223 AArch64: Improve strlen_asimd
Use shrn for the mask, merge tst+bne into cbnz, and tweak code alignment.
Performance improves slightly as a result.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 1bbb1a2022e126f21810d3d0ebe0a975d5243e43)
2024-04-08 16:42:24 +01:00
Wilco Dijkstra
1296aa9b0b AArch64: Optimize memrchr
Optimize the main loop - large strings are 43% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 00776241776e67fc666b896c1e85770f4f3ec1e1)
2024-04-08 16:42:14 +01:00