GCC aligns global data to 16 bytes if its size is >= 16 bytes. This patch
changes the exp_data struct slightly so that the fields are better aligned
and without gaps. As a result, on targets that support them, more load-pair
instructions are used in exp.
The exp benchmark improves by 2.5%, "144bits" by 7.2% and "768bits" by 12.7%
on Neoverse V2.
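As an illustration of the idea (a hypothetical sketch with made-up field
names, not the actual exp_data layout):

  #include <stdint.h>

  /* Hypothetical sketch: keep 8-byte fields that are read together
     adjacent and at 16-byte aligned offsets, with no padding gaps, so
     one load-pair (ldp) can fetch both.  */
  struct exp_data_sketch
  {
    double inv_ln2;     /* offset 0  */
    double shift;       /* offset 8, loaded together with inv_ln2  */
    double ln2_hi;      /* offset 16  */
    double ln2_lo;      /* offset 24, loaded together with ln2_hi  */
    double poly[4];     /* offsets 32..56, each pair 16-byte aligned  */
    uint64_t tab[256];  /* table follows without a gap  */
  };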
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 5afaf99edb326fd9f36eb306a828d129a3a1d7f7)
(cherry picked from commit 5a08d049dc5037e89eb95bb1506652f0043fa39e)
Use prefer_sve_ifuncs for SVE memset just like memcpy.
Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
(cherry picked from commit 0f044be1dae5169d0e57f8d487b427863aeadab4)
Add SVE memset based on the generic memset with a predicated store for
sizes < 16. Unaligned memsets of 128-1024 bytes are improved by ~20% on
average by using aligned stores for the last 64 bytes. Performance of the
random memset benchmark improves by ~2% on Neoverse V1.
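For illustration only, a C sketch of the small-size case using ACLE
intrinsics (the real implementation is assembly; compile with SVE enabled,
e.g. -march=armv8.2-a+sve):

  #include <arm_sve.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Illustrative only: set n < 16 bytes with a single predicated store,
     no branches or byte loop.  */
  static inline void
  memset_small_sve (void *s, int c, size_t n)
  {
    svbool_t pg = svwhilelt_b8_u64 (0, n);   /* lanes 0..n-1 are active  */
    svst1_u8 (pg, (uint8_t *) s, svdup_n_u8 ((uint8_t) c));
  }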
Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
(cherry picked from commit 163b1bbb76caba4d9673c07940c5930a1afa7548)
GCC aligns global data to 16 bytes if its size is >= 16 bytes. This patch
changes the exp2f_data struct slightly so that the fields are better aligned.
As a result on targets that support them, load-pair instructions accessing
poly_scaled and invln2_scaled are now 16-byte aligned.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 44fa9c1080fe6a9539f0d2345b9d2ae37b8ee57a)
Remove ZVA 128 support from memset - the new memset no longer
guarantees count >= 256, which can result in underflow and a
crash if ZVA size is 128 ([1]). Since only one CPU uses a ZVA
size of 128 and its memcpy implementation was removed in commit
e162ab2bf1b82c40f29e1925986582fa07568ce8, remove this special
case too.
[1] https://sourceware.org/pipermail/libc-alpha/2024-November/161626.html
Reviewed-by: Andrew Pinski <quic_apinski@quicinc.com>
(cherry picked from commit a08d9a52f967531a77e1824c23b5368c6434a72d)
Improve small memsets by avoiding branches and using overlapping stores.
Use DC ZVA for copies over 128 bytes. Remove unnecessary code for ZVA sizes
other than 64 and 128. Performance of random memset benchmark improves by 24%
on Neoverse N1.
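A rough C sketch of the overlapping-stores idea (illustrative only, not the
actual assembly): a length in [8,16] needs exactly two 8-byte stores whose
ranges may overlap, so no byte loop or extra branch is required.

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Rough sketch: handle 8 <= n <= 16 with two possibly overlapping
     8-byte stores instead of a branchy byte loop.  */
  static void
  memset_8_to_16 (char *s, int c, size_t n)
  {
    uint64_t v = 0x0101010101010101ULL * (uint8_t) c;  /* splat the byte  */
    memcpy (s, &v, 8);          /* first 8 bytes  */
    memcpy (s + n - 8, &v, 8);  /* last 8 bytes, may overlap the first  */
  }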
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit cec3aef32412779e207f825db0d057ebb4628ae8)
Improve performance by handling another 16 bytes before entering the loop.
Use ADDHN in the loop to avoid SHRN+FMOV when it terminates. Change final
size computation to avoid increasing latency. On Neoverse V1 performance
of the random strlen benchmark improves by 4.6%.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 3dc426b642dcafdbc11a99f2767e081d086f5fc7)
Include the space needed to store the length of the message itself, in
addition to the message string. This resolves BZ #32582.
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 68ee0f704cb81e9ad0a78c644a83e1e9cd2ee578)
Conflict in sysdeps/posix/libc_fatal.c due to missing cleanup after
backtrace removal.
It turns out that quite a few applications use bundled mallocs that
have been built to use global-dynamic TLS (instead of the recommended
initial-exec TLS). The previous workaround from
commit afe42e935b3ee97bac9a7064157587777259c60e ("elf: Avoid some
free (NULL) calls in _dl_update_slotinfo") does not fix all
encountered cases, unfortunately.
This change avoids the TLS generation update for recursive use
of TLS from a malloc that was called during a TLS update. This
is possible because an interposed malloc has a fixed module ID and
TLS slot. (It cannot be unloaded.) If an initially-loaded module ID
is encountered in __tls_get_addr and the dynamic linker is already
in the middle of a TLS update, use the outdated DTV, thus avoiding
another call into malloc. It's still necessary to update the
DTV to the most recent generation, to get out of the slow path,
which is why the check for recursion is needed.
The bookkeeping is done using a global counter instead of a per-thread
flag because TLS access in the dynamic linker is tricky.
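A rough sketch of the scheme with made-up helper names (not the actual
glibc code):

  #include <stddef.h>

  /* Hypothetical sketch; all helper names are made up.  In reality the
     counter is maintained under the dynamic linker's own locking.  */
  typedef struct { size_t ti_module; size_t ti_offset; } tls_index;

  extern int is_initially_loaded (size_t module_id);
  extern void *lookup_with_outdated_dtv (tls_index *);
  extern void *lookup_with_current_dtv (tls_index *);
  extern void update_dtv_to_current_generation (void);  /* may call malloc  */

  /* Global counter rather than a per-thread flag.  */
  static unsigned int tls_update_in_progress;

  void *
  tls_get_addr_slow (tls_index *ti)
  {
    if (tls_update_in_progress > 0 && is_initially_loaded (ti->ti_module))
      /* Recursive use from an interposed malloc during a TLS update:
         use the outdated DTV instead of calling back into malloc.  */
      return lookup_with_outdated_dtv (ti);

    ++tls_update_in_progress;
    update_dtv_to_current_generation ();  /* may call malloc  */
    --tls_update_in_progress;
    return lookup_with_current_dtv (ti);
  }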
All this will go away once the dynamic linker stops using malloc
for TLS, likely as part of a change that pre-allocates all TLS
during pthread_create/dlopen.
Fixes commit d2123d68275acc0f061e73d5f86ca504e0d5a344 ("elf: Fix slow
tls access after dlopen [BZ #19924]").
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 018f0fc3b818d4d1460a4e2384c24802504b1d20)
Current code aligns to 2x VEC_SIZE. Aligning to 2x has no effect on
performance other than potentially resulting in an additional
iteration of the loop.
1x maintains aligned stores (the only reason to align in this case)
and doesn't incur any unnecessary loop iterations.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit 9469261cf1924d350feeec64d2c80cafbbdcdd4d)
In short: __tls_get_addr checks the global generation counter, and if the
current dtv is older, _dl_update_slotinfo updates the dtv up to the
generation of the accessed module. So if the global generation is newer
than the generation of the module, __tls_get_addr keeps hitting the
slow dtv update path. The dtv update path includes a number of checks
to see if any update is needed, and this already causes a measurable TLS
access slowdown after dlopen.
It may be possible to detect an up-to-date dtv faster. But if there are
many modules loaded (> TLS_SLOTINFO_SURPLUS) then this requires at
least walking the slotinfo list.
This patch tries to update the dtv to the global generation instead, so
after a dlopen the TLS access slow path is only hit once. The modules
with a larger generation than the accessed one were not necessarily
synchronized before, so additional synchronization is needed.
This patch uses acquire/release synchronization when accessing the
generation counter.
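A reduced C11-atomics sketch of the scheme (hypothetical names, not the
actual glibc code): the dlopen path publishes slotinfo and then bumps the
generation with a release store; the reader acquires it and catches the
DTV up to that generation.

  #include <stdatomic.h>
  #include <stddef.h>

  /* Hypothetical sketch, not the actual glibc code.  */
  static _Atomic size_t global_generation;

  /* dlopen path: publish the new slotinfo entries, then bump the global
     generation with a release store.  */
  void
  publish_new_module (void)
  {
    /* ... fill in slotinfo entries ... */
    size_t gen = atomic_load_explicit (&global_generation,
                                       memory_order_relaxed);
    atomic_store_explicit (&global_generation, gen + 1,
                           memory_order_release);
  }

  /* __tls_get_addr path: acquire-load the generation and catch the DTV
     up to it, so later accesses skip the slow path entirely.  */
  void
  maybe_update_dtv (size_t *dtv_gen)
  {
    size_t gen = atomic_load_explicit (&global_generation,
                                       memory_order_acquire);
    if (*dtv_gen < gen)
      {
        /* ... walk the slotinfo list and update the DTV ... */
        *dtv_gen = gen;  /* record the global generation, not just the
                            accessed module's generation  */
      }
  }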
Note: in the x86_64 version of dl-tls.c the generation is only loaded
once, since relaxed mo is not faster than acquire mo load.
I have not benchmarked this. Tested by Adhemerval Zanella on aarch64,
powerpc, sparc and x86, who reported that it fixes the performance issue
of bug 19924.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit d2123d68275acc0f061e73d5f86ca504e0d5a344)
The old Intel software developer manual specified that the low byte of
EAX of CPUID leaf 2 returned 1, which indicated the number of times
CPUID leaf 2 had to be executed to retrieve the complete cache
information. The newer Intel manual has been changed to say that it
should always return 1 and be ignored. If the low byte isn't 1, CPUID
leaf 2 can't be used. In this case, we ignore CPUID leaf 2 and use CPUID
leaf 4 instead. If CPUID leaf 4 doesn't contain the cache information,
cache information isn't available at all. This addresses BZ #30643.
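A sketch of the described logic using GCC's <cpuid.h> (not the actual
glibc code):

  #include <cpuid.h>

  /* Sketch only: decide whether CPUID leaf 2 may be used for cache
     information.  */
  static int
  can_use_cpuid_leaf2 (void)
  {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid (2, &eax, &ebx, &ecx, &edx))
      return 0;
    /* Newer manuals say the low byte of EAX is always 1 and should be
       ignored; if it is not 1, fall back to enumerating leaf 4 subleaves
       with __get_cpuid_count (4, i, ...).  */
    return (eax & 0xff) == 1;
  }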
(cherry picked from commit 1493622f4f9048ffede3fbedb64695efa49d662a)
On Skylake, it changes log1p bench performance by:
Before After Improvement
max 63.349 58.347 8%
min 4.448 5.651 -30%
mean 12.0674 10.336 14%
The minimum code path is:

  if (hx < 0x3FDA827A)                          /* x < 0.41422  */
    {
      if (__glibc_unlikely (ax >= 0x3ff00000))  /* x <= -1.0  */
        {
          ...
        }
      if (__glibc_unlikely (ax < 0x3e200000))   /* |x| < 2**-29  */
        {
          math_force_eval (two54 + x);          /* raise inexact  */
          if (ax < 0x3c900000)                  /* |x| < 2**-54  */
            {
              ...
            }
          else
            return x - x * x * 0.5;
FMA and non-FMA code sequences look similar, and the non-FMA version is
slightly faster. Since log1p is called by asinh and atanh, this also
improves asinh performance by:
Before After Improvement
max 75.645 63.135 16%
min 10.074 10.071 0%
mean 15.9483 14.9089 6%
and improves atanh performance by:
Before After Improvement
max 91.768 75.081 18%
min 15.548 13.883 10%
mean 18.3713 16.8011 8%
(cherry picked from commit a8ecb126d4c26c52f4ad828c566afe4043a28155)
On Skylake, it improves expm1 bench performance by:
Before After Improvement
max 70.204 68.054 3%
min 20.709 16.2 22%
mean 22.1221 16.7367 24%
NB: Add

  extern long double __expm1l (long double);
  extern long double __expm1f128 (long double);

for __typeof (__expm1l) and __typeof (__expm1f128) when __expm1 is
defined, since __expm1 may be expanded in their declarations, which
causes a build failure.
(cherry picked from commit 1b214630ce6f7e0099b8b6f87246246739b079cf)
On Skylake, it improves log2 bench performance by:
Before After Improvement
max 208.779 63.827 69%
min 9.977 6.55 34%
mean 10.366 6.8191 34%
(cherry picked from commit f6b10ed8e9a00de49d0951e760cc2b5288862b47)
Sort Makefile variables using scripts/sort-makefile-lines.py.
No code generation changes observed in libm. No regressions on x86_64.
(cherry picked from commit 881546979d0219c18337e1b4f4d00cfacab13c40)
Some hypervisors report 1 TiB L3 cache size. This results
in some variables incorrectly getting zeroed, causing crashes
in memcpy/memmove because invariants are violated.
(cherry picked from commit 61c3450db96dce96ad2b24b4f0b548e6a46d68e5)
When adding explicit initialization of rseq fields prior to
registration, I glossed over the fact that 'cpu_id_start' is also
documented as initialized by user-space.
While current kernels don't validate the content of this field on
registration, future ones could.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
(cherry picked from commit d9f40387d3305d97e30a8cf8724218c42a63680a)
Per the rseq syscall documentation, three fields are required to be
initialized by userspace prior to registration: 'cpu_id', 'rseq_cs' and
'flags'. Since we have no guarantee that 'struct pthread' is cleared on
all architectures, explicitly set those three fields prior to
registration.
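For illustration, a standalone sketch of the documented initialization,
assuming kernel headers that provide <linux/rseq.h> and __NR_rseq (the
signature value below is just an example; glibc uses a per-architecture
RSEQ_SIG):

  #include <linux/rseq.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #define EXAMPLE_RSEQ_SIG 0x53053053  /* example only; glibc defines a
                                          per-architecture RSEQ_SIG  */

  /* Sketch: initialize the documented fields before registering.  The
     struct must have the alignment required by its UAPI definition.  */
  static long
  register_rseq (struct rseq *r)
  {
    r->cpu_id_start = 0;
    r->cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
    memset (&r->rseq_cs, 0, sizeof r->rseq_cs);
    r->flags = 0;
    return syscall (__NR_rseq, r, sizeof *r, 0, EXAMPLE_RSEQ_SIG);
  }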
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 97f60abd25628425971f07e9b0e7f8eec0741235)
Remove the local FAIL macro in favor of FAIL_EXIT1 from <support/check.h>,
which provides equivalent reporting and additionally includes the file name
and line number of the failure site. Remove FAIL_ERR altogether and include
": %m" explicitly in the format string supplied to FAIL_EXIT1, as there
seems little value in having a separate macro just for this.
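For example, a failure check in a test now looks roughly like this
(sketch, not an excerpt from the patch):

  #include <fcntl.h>
  #include <support/check.h>

  static int
  open_checked (const char *path)
  {
    int fd = open (path, O_RDONLY);
    if (fd < 0)
      /* FAIL_EXIT1 reports the file and line of the failure site; the
         ": %m" in the format replaces the old FAIL_ERR macro.  */
      FAIL_EXIT1 ("open (\"%s\") failed: %m", path);
    return fd;
  }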
Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit 8c98195af6e6f1ce21743fc26c723e0f7e45bcf2)
Add tests for MREMAP_MAYMOVE and MREMAP_FIXED. On Linux, also test
MREMAP_DONTUNMAP.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit ff0320bec2810192d453c579623482fab87bfa01)
Update the mremap C implementation to support the optional argument for
MREMAP_DONTUNMAP added in Linux 5.7 since it may not always be correct
to implement a variadic function as a non-variadic function on all Linux
targets. Return MAP_FAILED and set errno to EINVAL for unknown flag bits.
This fixes BZ #31968.
Note: A test must be added when a new flag bit is introduced.
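A simplified sketch of the variadic handling described above (not the
actual glibc implementation):

  #define _GNU_SOURCE
  #include <errno.h>
  #include <stdarg.h>
  #include <stddef.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #ifndef MREMAP_DONTUNMAP
  # define MREMAP_DONTUNMAP 4  /* Linux 5.7; may be missing from old headers  */
  #endif

  /* Simplified sketch: reject unknown flag bits with EINVAL and fetch
     the optional argument only when a flag that takes one is present.  */
  void *
  mremap_sketch (void *old_addr, size_t old_size, size_t new_size,
                 int flags, ...)
  {
    void *new_addr = NULL;

    if (flags & ~(MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_DONTUNMAP))
      {
        errno = EINVAL;
        return MAP_FAILED;
      }
    if (flags & (MREMAP_FIXED | MREMAP_DONTUNMAP))
      {
        va_list ap;
        va_start (ap, flags);
        new_addr = va_arg (ap, void *);
        va_end (ap);
      }
    return (void *) syscall (SYS_mremap, old_addr, old_size, new_size,
                             flags, new_addr);
  }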
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 6c40cb0e9f893d49dc7caee580a055de53562206)
Linux 6.9 adds a new flag, RWF_NOAPPEND, which allows per-IO operations
to disable append mode even if a file was opened with the flag O_APPEND.
This caused two test failures, as these tests expected the flag 0x00000020
to be unused. Adding the flag definition now fixes these tests on Linux
6.9 (v6.9-rc1):
FAIL: misc/tst-preadvwritev2
FAIL: misc/tst-preadvwritev64v2
This patch adds the flag, adjusts the test and adds details to the
documentation.
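As an illustration of how the new flag is used (a sketch, assuming
headers that already define RWF_NOAPPEND):

  #define _GNU_SOURCE
  #include <sys/uio.h>

  /* Sketch: write at the given offset even though the descriptor was
     opened with O_APPEND, using the per-IO RWF_NOAPPEND flag.  */
  static ssize_t
  write_at_offset (int fd, const void *buf, size_t len, off_t off)
  {
    struct iovec iov = { .iov_base = (void *) buf, .iov_len = len };
    return pwritev2 (fd, &iov, 1, off, RWF_NOAPPEND);
  }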
Link: https://lore.kernel.org/all/20200831153207.GO3265@brightrain.aerifal.cx/
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 3db9d208dd5f30b12900989c6d2214782b8e2011)
The __rseq_size value is now the active area of struct rseq
(so 20 initially), not the full struct size including padding
at the end (32 initially).
Update misc/tst-rseq to print some additional diagnostics.
Reviewed-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
(cherry picked from commit 2e456ccf0c34a056e3ccafac4a0c7effef14d918)
While working on a patch to add support for the extensible rseq ABI, we
came across an issue where a new 'const' variable would be merged with
the existing '__rseq_size' variable. We tracked this to the use of
'-fmerge-all-constants', which allows the compiler to merge identical
constant variables. This means that all 'const' variables in a
compilation unit that are of the same size and are initialized to the
same value can be merged.
In this specific case, on 32-bit systems 'unsigned int' and 'ptrdiff_t'
are both 4 bytes and initialized to 0, which should trigger the merge.
However, for reasons we haven't delved into, when the attribute 'section
(".data.rel.ro")' is added to the mix, only variables of the exact same
type are merged. As far as we know this behavior is not specified
anywhere and could change with a new compiler version, hence this patch.
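A reduced example of the kind of objects involved (illustrative only; the
names are made up and whether merging happens depends on the compiler and
options):

  #include <stddef.h>

  /* Illustrative only.  With -fmerge-all-constants both objects below
     are 4-byte, zero-initialized constants on 32-bit targets and are
     therefore candidates for merging; the observed (but unspecified)
     effect of the section attribute was that only objects of the exact
     same type were merged.  */
  const unsigned int size_like
    __attribute__ ((section (".data.rel.ro"))) = 0;
  const ptrdiff_t offset_like
    __attribute__ ((section (".data.rel.ro"))) = 0;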
Move the definitions of these variables into an assembler file and add
hidden writable aliases for internal use. This has the added bonus of
removing the asm workaround to set the values on rseq registration.
Tested on Debian 12 with GCC 12.2.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 2b92982e2369d292560793bee8e730f695f48ff3)
The z13/vector-optimized wcsncmp implementation segfaults if n=1
and there is only one character (equal in both strings) before
the page end. It loads and compares one character but fails
to check n again, and the following load fails.
This patch removes the extra load and compare of the first character
and just starts with the loop which uses vector-load-to-block-boundary.
This code path also checks n.
With this patch both tests are passing:
- the simplified one mentioned in bugzilla 31934
- the full one in Florian Weimer's patch:
"manual: Document a GNU extension for strncmp/wcsncmp"
(https://patchwork.sourceware.org/project/glibc/patch/874j9eml6y.fsf@oldenburg.str.redhat.com/):
On s390x-linux-gnu (z16), the new wcsncmp test fails due to bug 31934.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 9b7651410375ec8848a1944992d663d514db4ba7)
These structs describe file formats under /var/log, and should not
depend on the definition of _TIME_BITS. This is achieved by
defining __WORDSIZE_TIME64_COMPAT32 to 1 on 32-bit ports that
support 32-bit time_t values (where __time_t is 32 bits).
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 9abdae94c7454c45e02e97e4ed1eb1b1915d13d8)
The default <utmp-size.h> is for ports with a 64-bit time_t.
Ports with a 32-bit time_t or with __WORDSIZE_TIME64_COMPAT32=1
need to override it.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 4d4da5aab936504b2d3eca3146e109630d9093c4)
This seems to have stopped working with some GCC 14 versions,
which clobber r2. With other compilers, the kernel-provided
r2 value is still available at this point.
Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
(cherry picked from commit 14e56bd4ce15ac2d1cc43f762eb2e6b83fec1afe)
Old Linux kernels disable SVE after every system call. Calling the
SVE-optimized memcpy afterwards will then cause a trap to reenable SVE.
As a result, applications with a high use of syscalls may run slower with
the SVE memcpy. This is true for kernels from 4.15.0 up to (but not
including) 6.2.0, except for 5.14.0, which was patched. Avoid this by
checking the kernel version and selecting the SVE ifunc on modern kernels.
Parse the kernel version reported by uname() into a 24-bit kernel.major.minor
value without calling any library functions. If uname() is not supported or
if the version format is not recognized, assume the kernel is modern.
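A C sketch of such parsing (a hypothetical helper, not the exact glibc
code):

  #include <sys/utsname.h>

  /* Hypothetical helper: return (major << 16) | (minor << 8), or -1 to
     mean "assume a modern kernel" when uname() fails or the release
     string is not in the expected major.minor form.  Parsing is done by
     hand, without calling any library functions.  */
  static int
  kernel_version_code (void)
  {
    struct utsname buf;
    if (uname (&buf) != 0)
      return -1;

    int parts[2] = { 0, 0 };
    const char *p = buf.release;
    for (int i = 0; i < 2; i++)
      {
        if (*p < '0' || *p > '9')
          return -1;
        while (*p >= '0' && *p <= '9')
          parts[i] = parts[i] * 10 + (*p++ - '0');
        if (i == 0 && *p++ != '.')
          return -1;
      }
    return (parts[0] << 16) | (parts[1] << 8);
  }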
Tested-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 2e94e2f5d2bf2de124c8ad7da85463355e54ccb2)
Due to GCC bug 110901, -mcpu can override the -march setting when compiling
asm code, and thus a compiler targeting a specific CPU can fail the
configure check even when binutils gas supports SVE.
The workaround is that an explicit .arch directive overrides both -mcpu
and -march, and since that's what the actual SVE memcpy uses, the
configure check should use it too, even if the GCC issue is fixed
independently.
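For illustration, a minimal probe in the spirit of the configure check (a
sketch, not the exact test):

  /* Sketch of the probe: if this compiles and assembles, gas supports
     SVE regardless of the -mcpu/-march options the compiler passes on
     (see GCC bug 110901), because the .arch directive overrides both.  */
  int
  main (void)
  {
    asm (".arch armv8.2-a+sve\n"
         "ptrue p0.b\n");
    return 0;
  }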
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 73c26018ed0ecd9c807bb363cc2c2ab4aca66a82)
The .cfi_return_column directive changes the return column for the whole
FDE range. But the actual intent is to tell the unwinder that the value
in x30 (lr) now resides in x15 after the move, and that is expressed by
the .cfi_register directive.
(cherry picked from commit 3f798427884fa57770e8e2291cf58d5918254bb5)
The latest implementations of memcpy are actually faster than the Falkor
implementations [1], so remove the falkor/phecda ifuncs for memcpy and
the now unused IS_FALKOR/IS_PHECDA defines.
[1] https://sourceware.org/pipermail/libc-alpha/2022-December/144227.html
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 2f5524cc5381eb75fef55f7901bb907bd5628333)
Add a specialized memset for the common ZVA size of 64 to avoid the
overhead of reading the ZVA size. Since the code is identical to
__memset_falkor, remove the latter.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 3d7090f14b13312320e425b27dcf0fe72de026fd)
Clean up emag memset - merge the memset_base64.S file, remove
the unused ZVA code (since it is disabled on emag).
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 9627ab99b50d250c6dd3001a3355aa03692f7fe5)
Clean up ifuncs. Remove uses of libc_hidden_builtin_def, use ENTRY rather than
ENTRY_ALIGN, remove unnecessary defines and conditional compilation. Rename
strlen_mte to strlen_generic. Remove rtld-memset.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 9fd3409842b3e2d31cff5dbd6f96066c430f0aa2)
Add support for MOPS in cpu_features and INIT_ARCH. Add ifuncs using MOPS for
memcpy, memmove and memset (use .inst for now so it works with all binutils
versions without needing complex configure and conditional compilation).
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 2bd00179885928fd95fcabfafc50e7b5c6e660d2)
Linux 6.5 adds a new AArch64 HWCAP2 value, HWCAP2_MOPS. Add it to
glibc's bits/hwcap.h.
Tested with build-many-glibcs.py for aarch64-linux-gnu.
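For example, user code can test for the feature at run time (a small
sketch; it assumes glibc headers that include this change):

  #include <sys/auxv.h>

  /* Small sketch: check for the MOPS instructions at run time.  The
     HWCAP2_MOPS macro is only available with updated glibc headers.  */
  static int
  have_mops (void)
  {
  #ifdef HWCAP2_MOPS
    return (getauxval (AT_HWCAP2) & HWCAP2_MOPS) != 0;
  #else
    return 0;  /* headers too old to know about the bit  */
  #endif
  }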
(cherry picked from commit ff5d2abd18629e0efac41e31699cdff3be0e08fa)
Improve SVE memcpy by copying 2 vectors if the size is small enough.
This improves performance of random memcpy by ~9% on Neoverse V1, and
33-64 byte copies are ~16% faster.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit d2d3f3720ce627a4fe154d8dd14db716a32bcc6e)
Use shrn for narrowing the mask which simplifies code and speeds up small
strings. Unroll the first search loop to improve performance on large
strings.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 55599d480437dcf129b41b95be32b48f2a9e5da9)
Optimize strnlen using the shrn instruction and improve the main loop.
Small strings are around 10% faster, large strings are 40% faster on
modern CPUs.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit ad098893ba3c3344a5f2f6ab1627c47204afdb47)
Optimize strlen by unrolling the main loop. Large strings are 64% faster on
modern CPUs.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 03c8ce5000198947a4dd7b2c14e5131738fda62b)
Unroll the main loop. Large strings are around 20% faster on modern CPUs.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 349e48c01e85bd96006860084e76d322e6ca02f1)
Unroll the main loop, which improves performance slightly.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 09ebd8549b2ce5a3a6c0c7c5f3e62227faf50a99)
Simplify calculation of the mask using shrn. Unroll the main loop.
Small strings are 20% faster on modern CPUs.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 51541a229740801882490177fa178e49264b13fb)
Use shrn for the mask, merge tst+bne into cbnz, and tweak code alignment.
Performance improves slightly as a result.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 1bbb1a2022e126f21810d3d0ebe0a975d5243e43)
Optimize the main loop - large strings are 43% faster on modern CPUs.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 00776241776e67fc666b896c1e85770f4f3ec1e1)