glibc

mirror of git://sourceware.org/git/glibc.git synced 2024-12-03 04:01:43 +08:00

Author	SHA1	Message	Date
Noah Goldstein	aadd0a1c7c	x86: Put wcs{n}len-sse4.1 in the sse4.1 text section Previously was missing but the two implementations shouldn't get in the sse2 (generic) text section. (cherry picked from commit `afc6e4328f`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	d201c59177	x86: Align entry for memrchr to 64-bytes. The function was tuned around 64-byte entry alignment and performs better for all sizes with it. As well different code boths where explicitly written to touch the minimum number of cache line i.e sizes <= 32 touch only the entry cache line. (cherry picked from commit `227afaa672`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	c51d8d383c	x86: Add BMI1/BMI2 checks for ISA_V3 check BMI1/BMI2 are part of the ISA V3 requirements: https://en.wikipedia.org/wiki/X86-64 And defined by GCC when building with `-march=x86-64-v3` (cherry picked from commit `8da9f346cb`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	ba1c3f23d9	x86: Cleanup bounds checking in large memcpy case 1. Fix incorrect lower-bound threshold in L(large_memcpy_2x). Previously was using `__x86_rep_movsb_threshold` and should have been using `__x86_shared_non_temporal_threshold`. 2. Avoid reloading __x86_shared_non_temporal_threshold before the L(large_memcpy_4x) bounds check. 3. Document the second bounds check for L(large_memcpy_4x) more clearly. (cherry picked from commit `89a25c6f64`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	94b0dc9419	x86: Add bounds `x86_non_temporal_threshold` The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed by memmove-vec-unaligned-erms. The lower-bound is needed because memmove-vec-unaligned-erms unrolls the loop aggressively in the L(large_memset_4x) case. The upper-bound is needed because memmove-vec-unaligned-erms right-shifts the value of `x86_non_temporal_threshold` by LOG_4X_MEMCPY_THRESH (4) which without a bound may overflow. The lack of lower-bound can be a correctness issue. The lack of upper-bound cannot. (cherry picked from commit `b446822b6a`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	9d50e162ee	x86: Add sse42 implementation to strcmp's ifunc This has been missing since the the ifuncs where added. The performance of SSE4.2 is preferable to to SSE2. Measured on Tigerlake with N = 20 runs. Geometric Mean of all benchmarks SSE4.2 / SSE2: 0.906 (cherry picked from commit `ff439c4717`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	6e008c884d	x86: Fix misordered logic for setting `rep_movsb_stop_threshold` Move the setting of `rep_movsb_stop_threshold` to after the tunables have been collected so that the `rep_movsb_stop_threshold` (which is used to redirect control flow to the non_temporal case) will use any user value for `non_temporal_threshold` (set using glibc.cpu.x86_non_temporal_threshold) (cherry picked from commit `0355915514`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	fc54e1fae8	x86: Align varshift table to 32-bytes This ensures the load will never split a cache line. (cherry picked from commit `0f91811333`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	820504e3ed	x86: ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST expect no transactions Give fall-through path to `vzeroupper` and taken-path to `vzeroall`. Generally even on machines with RTM the expectation is the string-library functions will not be called in transactions. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `c28db9cb29`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	3c87383a20	x86: Shrink code size of memchr-evex.S This is not meant as a performance optimization. The previous code was far to liberal in aligning targets and wasted code size unnecissarily. The total code size saving is: 64 bytes There are no non-negligible changes in the benchmarks. Geometric Mean of all benchmarks New / Old: 1.000 Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `56da3fe1dd`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	a910d7e164	x86: Shrink code size of memchr-avx2.S This is not meant as a performance optimization. The previous code was far to liberal in aligning targets and wasted code size unnecissarily. The total code size saving is: 59 bytes There are no major changes in the benchmarks. Geometric Mean of all benchmarks New / Old: 0.967 Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `6dcbb7d95d`) x86: Fix page cross case in rawmemchr-avx2 [BZ #29234] commit `6dcbb7d95d` Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Mon Jun 6 21:11:33 2022 -0700 x86: Shrink code size of memchr-avx2.S Changed how the page cross case aligned string (rdi) in rawmemchr. This was incompatible with how `L(cross_page_continue)` expected the pointer to be aligned and would cause rawmemchr to read data start started before the beginning of the string. What it would read was in valid memory but could count CHAR matches resulting in an incorrect return value. This commit fixes that issue by essentially reverting the changes to the L(page_cross) case as they didn't really matter. Test cases added and all pass with the new code (and where confirmed to fail with the old code). Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `2c9af8421d`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	b05bd59823	x86: Optimize memrchr-avx2.S The new code: 1. prioritizes smaller user-arg lengths more. 2. optimizes target placement more carefully 3. reuses logic more 4. fixes up various inefficiencies in the logic. The biggest case here is the `lzcnt` logic for checking returns which saves either a branch or multiple instructions. The total code size saving is: 306 bytes Geometric Mean of all benchmarks New / Old: 0.760 Regressions: There are some regressions. Particularly where the length (user arg length) is large but the position of the match char is near the beginning of the string (in first VEC). This case has roughly a 10-20% regression. This is because the new logic gives the hot path for immediate matches to shorter lengths (the more common input). This case has roughly a 15-45% speedup. Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `af5306a735`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	83a986e9fb	x86: Optimize memrchr-evex.S The new code: 1. prioritizes smaller user-arg lengths more. 2. optimizes target placement more carefully 3. reuses logic more 4. fixes up various inefficiencies in the logic. The biggest case here is the `lzcnt` logic for checking returns which saves either a branch or multiple instructions. The total code size saving is: 263 bytes Geometric Mean of all benchmarks New / Old: 0.755 Regressions: There are some regressions. Particularly where the length (user arg length) is large but the position of the match char is near the beginning of the string (in first VEC). This case has roughly a 20% regression. This is because the new logic gives the hot path for immediate matches to shorter lengths (the more common input). This case has roughly a 35% speedup. Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `b4209615a0`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	4901009dad	x86: Optimize memrchr-sse2.S The new code: 1. prioritizes smaller lengths more. 2. optimizes target placement more carefully. 3. reuses logic more. 4. fixes up various inefficiencies in the logic. The total code size saving is: 394 bytes Geometric Mean of all benchmarks New / Old: 0.874 Regressions: 1. The page cross case is now colder, especially re-entry from the page cross case if a match is not found in the first VEC (roughly 50%). My general opinion with this patch is this is acceptable given the "coldness" of this case (less than 4%) and generally performance improvement in the other far more common cases. 2. There are some regressions 5-15% for medium/large user-arg lengths that have a match in the first VEC. This is because the logic was rewritten to optimize finds in the first VEC if the user-arg length is shorter (where we see roughly 20-50% performance improvements). It is not always the case this is a regression. My intuition is some frontend quirk is partially explaining the data although I haven't been able to find the root cause. Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `731feee386`)	2022-07-18 20:45:21 -07:00
Noah Goldstein	e805606193	x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret` The RTM vzeroupper mitigation has no way of replacing inline vzeroupper not before a return. This can be useful when hoisting a vzeroupper to save code size for example: ``` L(foo): cmpl %eax, %edx jz L(bar) tzcntl %eax, %eax addq %rdi, %rax VZEROUPPER_RETURN L(bar): xorl %eax, %eax VZEROUPPER_RETURN ``` Can become: ``` L(foo): COND_VZEROUPPER cmpl %eax, %edx jz L(bar) tzcntl %eax, %eax addq %rdi, %rax ret L(bar): xorl %eax, %eax ret ``` This code does not change any existing functionality. There is no difference in the objdump of libc.so before and after this patch. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `dd5c483b25`)	2022-07-18 20:45:20 -07:00
Noah Goldstein	70be93d1c5	x86: Create header for VEC classes in x86 strings library This patch does not touch any existing code and is only meant to be a tool for future patches so that simple source files can more easily be maintained to target multiple VEC classes. There is no difference in the objdump of libc.so before and after this patch. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `8a780a6b91`)	2022-07-18 20:45:20 -07:00
Raghuveer Devulapalli	82a707aeb7	x86_64: Add strstr function with 512-bit EVEX Adding a 512-bit EVEX version of strstr. The algorithm works as follows: (1) We spend a few cycles at the begining to peek into the needle. We locate an edge in the needle (first occurance of 2 consequent distinct characters) and also store the first 64-bytes into a zmm register. (2) We search for the edge in the haystack by looking into one cache line of the haystack at a time. This avoids having to read past a page boundary which can cause a seg fault. (3) If an edge is found in the haystack we first compare the first 64-bytes of the needle (already stored in a zmm register) before we proceed with a full string compare performed byte by byte. Benchmarking results: (old = strstr_sse2_unaligned, new = strstr_avx512) Geometric mean of all benchmarks: new / old = 0.66 Difficult skiptable(0) : new / old = 0.02 Difficult skiptable(1) : new / old = 0.01 Difficult 2-way : new / old = 0.25 Difficult testing first 2 : new / old = 1.26 Difficult skiptable(0) : new / old = 0.05 Difficult skiptable(1) : new / old = 0.06 Difficult 2-way : new / old = 0.26 Difficult testing first 2 : new / old = 1.05 Difficult skiptable(0) : new / old = 0.42 Difficult skiptable(1) : new / old = 0.24 Difficult 2-way : new / old = 0.21 Difficult testing first 2 : new / old = 1.04 Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `5082a287d5`) x86: Remove __mmask intrinsics in strstr-avx512.c The intrinsics are not available before GCC7 and using standard operators generates code of equivalent or better quality. Removed: _cvtmask64_u64 _kshiftri_mask64 _kand_mask64 Geometric Mean of 5 Runs of Full Benchmark Suite New / Old: 0.958 (cherry picked from commit `f2698954ff`)	2022-07-18 20:45:20 -07:00
H.J. Lu	f6bc52f080	x86-64: Ignore r_addend for R_X86_64_GLOB_DAT/R_X86_64_JUMP_SLOT According to x86-64 psABI, r_addend should be ignored for R_X86_64_GLOB_DAT and R_X86_64_JUMP_SLOT. Since linkers always set their r_addends to 0, we can ignore their r_addends. Reviewed-by: Fangrui Song <maskray@google.com> (cherry picked from commit `f8587a6189`)	2022-07-18 20:45:20 -07:00
Sunil K Pandey	8ab861d295	x86_64: Implement evex512 version of strlen, strnlen, wcslen and wcsnlen This patch implements following evex512 version of string functions. Perf gain for evex512 version is up to 50% as compared to evex, depending on length and alignment. Placeholder function, not used by any processor at the moment. - String length function using 512 bit vectors. - String N length using 512 bit vectors. - Wide string length using 512 bit vectors. - Wide string N length using 512 bit vectors. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit `9c66efb86f`)	2022-07-18 20:45:20 -07:00
Adhemerval Zanella	eb9aa96fac	x86_64: Remove bzero optimization Both symbols are marked as legacy in POSIX.1-2001 and removed on POSIX.1-2008, although the prototypes are defined for _GNU_SOURCE or _DEFAULT_SOURCE. GCC also replaces bcopy with a memmove and bzero with memset on default configuration (to actually get a bzero libc call the code requires to omit string.h inclusion and built with -fno-builtin), so it is highly unlikely programs are actually calling libc bzero symbol. On a recent Linux distro (Ubuntu 22.04), there is no bzero calls by the installed binaries. $ cat count_bstring.sh #!/bin/bash files=`IFS=':';for i in $PATH; do test -d "$i" && find "$i" -maxdepth 1 -executable -type f; done` total=0 for file in $files; do symbols=`objdump -R $file 2>&1` if [ $? -eq 0 ]; then ncalls=`echo $symbols \| grep -w $1 \| wc -l` ((total=total+ncalls)) if [ $ncalls -gt 0 ]; then echo "$file: $ncalls" fi fi done echo "TOTAL=$total" $ ./count_bstring.sh bzero TOTAL=0 Checked on x86_64-linux-gnu. (cherry picked from commit `9403b71ae9`)	2022-07-18 20:45:20 -07:00
Sunil K Pandey	8d324019e6	x86_64: Remove end of line trailing spaces This commit remove trailing space introduced by following commit. commit `a775a7a3eb` Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Wed Jun 23 01:56:29 2021 -0400 x86: Fix overflow bug in wcsnlen-sse4_1 and wcsnlen-avx2 [BZ #27974]	2022-07-18 20:45:09 -07:00
Adhemerval Zanella	4b246b2bbd	linux: Fix mq_timereceive check for 32 bit fallback code (BZ 29304) On success, mq_receive() and mq_timedreceive() return the number of bytes in the received message, so it requires to check if the value is larger than 0. Checked on i686-linux-gnu. (cherry picked from commit `71d87d85bf`)	2022-06-30 10:46:39 -03:00
Adhemerval Zanella	bb4148283f	nios2: Remove _dl_skip_args usage (BZ# 29187) Since `ad43cac44a` the generic code already shuffles the argv/envp/auxv on the stack to remove the ld.so own arguments and thus _dl_skip_args is always 0. So there is no need to adjust the argc or argv. Checked with qemu-user that arguments are correctly passed on both constructors and main program. Reviewed-by: Carlos O'Donell <carlos@redhat.com> (cherry picked from commit `4868ba5d25`)	2022-06-10 09:15:00 -03:00
Adhemerval Zanella	96944f0f81	hppa: Remove _dl_skip_args usage (BZ# 29165) Different than other architectures, hppa creates an unrelated stack frame where ld.so argc/argv adjustments done by `ad43cac44a` is not done on the argc/argv saved/restore by _dl_start_user. Instead load _dl_argc and _dl_argv directlty instead of adjust them using _dl_skip_args value. Checked on hppa-linux-gnu. Reviewed-by: Carlos O'Donell <carlos@redhat.com> (cherry picked from commit `6242602273`)	2022-06-10 09:13:54 -03:00
Adhemerval Zanella	a7ec6363a3	nptl: Fix __libc_cleanup_pop_restore asynchronous restore (BZ#29214) This was due a wrong revert done on `404656009b`. Checked on x86_64-linux-gnu. (cherry picked from commit `c7d36dcecc`)	2022-06-08 17:15:08 -03:00
Matheus Castanho	4c92a10412	powerpc: Fix VSX register number on __strncpy_power9 [BZ #29197 ] __strncpy_power9 initializes VR 18 with zeroes to be used throughout the code, including when zero-padding the destination string. However, the v18 reference was mistakenly being used for stxv and stxvl, which take a VSX vector as operand. The code ended up using the uninitialized VSR 18 register by mistake. Both occurrences have been changed to use the proper VSX number for VR 18 (i.e. VSR 50). Tested on powerpc, powerpc64 and powerpc64le. Signed-off-by: Kewen Lin <linkw@gcc.gnu.org> (cherry picked from commit `0218463dd8`)	2022-06-07 15:34:20 -03:00
Adhemerval Zanella	61fd3e0e74	socket: Use 64 bit stat for isfdtype (BZ# 29209) This is a missing spot initially from `52a5fe70a2`. Checked on i686-linux-gnu. (cherry picked from commit `87f1ec12e7`)	2022-06-01 13:34:51 -03:00
Adhemerval Zanella	f9c3e57ac2	posix: Use 64 bit stat for fpathconf (_PC_ASYNC_IO) (BZ# 29208) This is a missing spot initially from `52a5fe70a2`. Checked on i686-linux-gnu. (cherry picked from commit `6e7137f28c`)	2022-06-01 13:34:51 -03:00
Adhemerval Zanella	9db6a597ef	posix: Use 64 bit stat for posix_fallocate fallback (BZ# 29207) This is a missing spot initially from `52a5fe70a2`. Checked on i686-linux-gnu. (cherry picked from commit `574ba60fc8`)	2022-06-01 13:34:51 -03:00
Arjun Shankar	ff450cdbde	Fix deadlock when pthread_atfork handler calls pthread_atfork or dlclose In multi-threaded programs, registering via pthread_atfork, de-registering implicitly via dlclose, or running pthread_atfork handlers during fork was protected by an internal lock. This meant that a pthread_atfork handler attempting to register another handler or dlclose a dynamically loaded library would lead to a deadlock. This commit fixes the deadlock in the following way: During the execution of handlers at fork time, the atfork lock is released prior to the execution of each handler and taken again upon its return. Any handler registrations or de-registrations that occurred during the execution of the handler are accounted for before proceeding with further handler execution. If a handler that hasn't been executed yet gets de-registered by another handler during fork, it will not be executed. If a handler gets registered by another handler during fork, it will not be executed during that particular fork. The possibility that handlers may now be registered or deregistered during handler execution means that identifying the next handler to be run after a given handler may register/de-register others requires some bookkeeping. The fork_handler struct has an additional field, 'id', which is assigned sequentially during registration. Thus, handlers are executed in ascending order of 'id' during 'prepare', and descending order of 'id' during parent/child handler execution after the fork. Two tests are included: * tst-atfork3: Adhemerval Zanella <adhemerval.zanella@linaro.org> This test exercises calling dlclose from prepare, parent, and child handlers. * tst-atfork4: This test exercises calling pthread_atfork and dlclose from the prepare handler. [BZ #24595, BZ #27054] Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `52a103e237`)	2022-05-30 12:38:32 +02:00
Noah Goldstein	83ae8287c1	x86: Fallback {str\|wcs}cmp RTM in the ncmp overflow case [BZ #29127 ] Re-cherry-pick commit `c627209832` for strcmp-avx2.S change which was omitted in intial cherry pick because at the time this bug was not present on release branch. Fixes BZ #29127. In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would call strcmp-avx2 and wcscmp-avx2 respectively. This would have not checks around vzeroupper and would trigger spurious aborts. This commit fixes that. test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on AVX2 machines with and without RTM. Co-authored-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `c627209832`)	2022-05-25 15:05:05 -07:00
Szabolcs Nagy	b2585cae28	linux: Add a getauxval test [BZ #23293 ] This is for bug 23293 and it relies on the glibc test system running tests via explicit ld.so invokation by default. Reviewed-by: Florian Weimer <fweimer@redhat.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `9faf5262c7`)	2022-05-19 16:48:47 +01:00
Szabolcs Nagy	c73c79af7d	rtld: Use generic argv adjustment in ld.so [BZ #23293 ] When an executable is invoked as ./ld.so [ld.so-args] ./exe [exe-args] then the argv is adujusted in ld.so before calling the entry point of the executable so ld.so args are not visible to it. On most targets this requires moving argv, env and auxv on the stack to ensure correct stack alignment at the entry point. This had several issues: - The code for this adjustment on the stack is written in asm as part of the target specific ld.so _start code which is hard to maintain. - The adjustment is done after _dl_start returns, where it's too late to update GLRO(dl_auxv), as it is already readonly, so it points to memory that was clobbered by the adjustment. This is bug 23293. - _environ is also wrong in ld.so after the adjustment, but it is likely not used after _dl_start returns so this is not user visible. - _dl_argv was updated, but for this it was moved out of relro, which changes security properties across targets unnecessarily. This patch introduces a generic _dl_start_args_adjust function that handles the argument adjustments after ld.so processed its own args and before relro protection is applied. The same algorithm is used on all targets, _dl_skip_args is now 0, so existing target specific adjustment code is no longer used. The bug affects aarch64, alpha, arc, arm, csky, ia64, nios2, s390-32 and sparc, other targets don't need the change in principle, only for consistency. The GNU Hurd start code relied on _dl_skip_args after dl_main returned, now it checks directly if args were adjusted and fixes the Hurd startup data accordingly. Follow up patches can remove _dl_skip_args and DL_ARGV_NOT_RELRO. Tested on aarch64-linux-gnu and cross tested on i686-gnu. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `ad43cac44a`)	2022-05-19 16:48:47 +01:00
Stefan Liebler	89b638f48a	S390: Enable static PIE This commit enables static PIE on 64bit. On 31bit, static PIE is not supported. A new configure check in sysdeps/s390/s390-64/configure.ac also performs a minimal test for requirements in ld: Ensure you also have those patches for: - binutils (ld) - "[PR ld/22263] s390: Avoid dynamic TLS relocs in PIE" https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=26b1426577b5dcb32d149c64cca3e603b81948a9 (Tested by configure check above) Otherwise there will be a R_390_TLS_TPOFF relocation, which fails to be processed in _dl_relocate_static_pie() as static TLS map is not setup. - "s390: Add DT_JMPREL pointing to .rela.[i]plt with static-pie" https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=d942d8db12adf4c9e5c7d9ed6496a779ece7149e (We can't test it in configure as we are not able to link a static PIE executable if the system glibc lacks static PIE support) Otherwise there won't be DT_JMPREL, DT_PLTRELA, DT_PLTRELASZ entries and the IFUNC symbols are not processed, which leads to crashes. - kernel (the mentioned links to the commits belong to 5.19 merge window): - "s390/mmap: increase stack/mmap gap to 128MB" https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features&id=f2f47d0ef72c30622e62471903ea19446ea79ee2 - "s390/vdso: move vdso mapping to its own function" https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features&id=57761da4dc5cd60bed2c81ba0edb7495c3c740b8 - "s390/vdso: map vdso above stack" https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features&id=9e37a2e8546f9e48ea76c839116fa5174d14e033 - "s390/vdso: add vdso randomization" https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features&id=41cd81abafdc4e58a93fcb677712a76885e3ca25 (We can't test the kernel of the target system) Otherwise if /proc/sys/kernel/randomize_va_space is turned off (0), static PIE executables like ldconfig will crash. While startup sbrk is used to enlarge the HEAP. Unfortunately the underlying brk syscall fails as there is not enough space after the HEAP. Then the address of the TLS image is invalid and the following memcpy in __libc_setup_tls() leads to a segfault. If /proc/sys/kernel/randomize_va_space is activated (default: 2), there is enough space after HEAP. - glibc - "Linux: Define MMAP_CALL_INTERNAL" https://sourceware.org/git/?p=glibc.git;a=commit;h=c1b68685d438373efe64e5f076f4215723004dfb - "i386: Remove OPTIMIZE_FOR_GCC_5 from Linux libc-do-syscall.S" https://sourceware.org/git/?p=glibc.git;a=commit;h=6e5c7a1e262961adb52443ab91bd2c9b72316402 - "i386: Honor I386_USE_SYSENTER for 6-argument Linux system calls" https://sourceware.org/git/?p=glibc.git;a=commit;h=60f0f2130d30cfd008ca39743027f1e200592dff - "ia64: Always define IA64_USE_NEW_STUB as a flag macro" https://sourceware.org/git/?p=glibc.git;a=commit;h=18bd9c3d3b1b6a9182698c85354578d1d58e9d64 - "Linux: Implement a useful version of _startup_fatal" https://sourceware.org/git/?p=glibc.git;a=commit;h=a2a6bce7d7e52c1c34369a7da62c501cc350bc31 - "Linux: Introduce __brk_call for invoking the brk system call" https://sourceware.org/git/?p=glibc.git;a=commit;h=b57ab258c1140bc45464b4b9908713e3e0ee35aa - "csu: Implement and use _dl_early_allocate during static startup" https://sourceware.org/git/?p=glibc.git;a=commit;h=f787e138aa0bf677bf74fa2a08595c446292f3d7 The mentioned patch series by Florian Weimer avoids the mentioned failing sbrk syscall by falling back to mmap. This commit also adjusts startup code in start.S to be ready for static PIE. We have to add a wrapper function for main as we are not allowed to use GOT relocations before __libc_start_main is called. (Compare also to: - commit `14d886edbd` "aarch64: fix start code for static pie" - commit `3d1d79283e` "aarch64: fix static pie enabled libc when main is in a shared library" ) (cherry picked from commit `728894dba4`)	2022-05-19 17:15:57 +02:00
Florian Weimer	ede8d94d15	csu: Implement and use _dl_early_allocate during static startup This implements mmap fallback for a brk failure during TLS allocation. scripts/tls-elf-edit.py is updated to support the new patching method. The script no longer requires that in the input object is of ET_DYN type. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `f787e138aa`)	2022-05-19 12:13:53 +02:00
Florian Weimer	43d77ef9b8	Linux: Introduce __brk_call for invoking the brk system call Alpha and sparc can now use the generic implementation. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `b57ab258c1`)	2022-05-19 12:13:53 +02:00
Florian Weimer	e7ca2a475c	Linux: Implement a useful version of _startup_fatal On i386 and ia64, the TCB is not available at this point. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `a2a6bce7d7`)	2022-05-19 12:13:53 +02:00
Florian Weimer	b2387bea84	ia64: Always define IA64_USE_NEW_STUB as a flag macro And keep the previous definition if it exists. This allows disabling IA64_USE_NEW_STUB while keeping USE_DL_SYSINFO defined. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `18bd9c3d3b`)	2022-05-19 12:13:53 +02:00
Florian Weimer	b38c9cdb58	Linux: Define MMAP_CALL_INTERNAL Unlike MMAP_CALL, this avoids a TCB dependency for an errno update on failure. <mmap_internal.h> cannot be included as is on several architectures due to the definition of page_unit, so introduce a separate header file for the definition of MMAP_CALL and MMAP_CALL_INTERNAL, <mmap_call.h>. Reviewed-by: Stefan Liebler <stli@linux.ibm.com> (cherry picked from commit `c1b68685d4`)	2022-05-19 12:13:53 +02:00
Florian Weimer	1a5b9d1a23	i386: Honor I386_USE_SYSENTER for 6-argument Linux system calls Introduce an int-80h-based version of __libc_do_syscall and use it if I386_USE_SYSENTER is defined as 0. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `60f0f2130d`)	2022-05-19 12:13:53 +02:00
Florian Weimer	1e7b011f87	i386: Remove OPTIMIZE_FOR_GCC_5 from Linux libc-do-syscall.S After commit `a78e6a10d0` ("i386: Remove broken CAN_USE_REGISTER_ASM_EBP (bug 28771)"), it is never defined. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `6e5c7a1e26`)	2022-05-19 12:13:53 +02:00
Fangrui Song	be9240c84c	elf: Remove __libc_init_secure After `73fc4e28b9`, __libc_enable_secure_decided is always 0 and a statically linked executable may overwrite __libc_enable_secure without considering AT_SECURE. The __libc_enable_secure has been correctly initialized in _dl_aux_init, so just remove __libc_enable_secure_decided and __libc_init_secure. This allows us to remove some startup_get*id functions from `22b79ed7f4`. Reviewed-by: Florian Weimer <fweimer@redhat.com> (cherry picked from commit `3e9acce8c5`)	2022-05-19 12:13:52 +02:00
Florian Weimer	ff900fad89	Linux: Consolidate auxiliary vector parsing (redo) And optimize it slightly. This is commit `8c8510ab27` revised. In _dl_aux_init in elf/dl-support.c, use an explicit loop and -fno-tree-loop-distribute-patterns to avoid memset. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com> (cherry picked from commit `73fc4e28b9`)	2022-05-19 12:13:51 +02:00
Florian Weimer	28bdb03b1b	Linux: Include <dl-auxv.h> in dl-sysdep.c only for SHARED Otherwise, <dl-auxv.h> on POWER ends up being included twice, once in dl-sysdep.c, once in dl-support.c. That leads to a linker failure due to multiple definitions of _dl_cache_line_size. Fixes commit `d96d2995c1` ("Revert "Linux: Consolidate auxiliary vector parsing"). (cherry picked from commit `098c795e85`)	2022-05-19 11:49:06 +02:00
Florian Weimer	1cc4ddfeeb	Revert "Linux: Consolidate auxiliary vector parsing" This reverts commit `8c8510ab27`. The revert is not perfect because the commit included a bug fix for _dl_sysdep_start with an empty argv, introduced in commit `2d47fa6862` ("Linux: Remove DL_FIND_ARG_COMPONENTS"), and this bug fix is kept. The revert is necessary because the reverted commit introduced an early memset call on aarch64, which leads to crash due to lack of TCB initialization. (cherry picked from commit `d96d2995c1`)	2022-05-19 11:49:06 +02:00
Florian Weimer	4b9cd5465d	Linux: Consolidate auxiliary vector parsing And optimize it slightly. The large switch statement in _dl_sysdep_start can be replaced with a large array. This reduces source code and binary size. On i686-linux-gnu: Before: text data bss dec hex filename 7791 12 0 7803 1e7b elf/dl-sysdep.os After: text data bss dec hex filename 7135 12 0 7147 1beb elf/dl-sysdep.os Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `8c8510ab27`)	2022-05-19 11:49:06 +02:00
Florian Weimer	08728256fa	Linux: Assume that NEED_DL_SYSINFO_DSO is always defined The definition itself is still needed for generic code. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `f19fc997a5`)	2022-05-19 11:49:06 +02:00
Florian Weimer	458733fffe	Linux: Remove DL_FIND_ARG_COMPONENTS The generic definition is always used since the Native Client port has been removed. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `2d47fa6862`)	2022-05-19 11:49:06 +02:00
Florian Weimer	2139b1848e	Linux: Remove HAVE_AUX_SECURE, HAVE_AUX_XID, HAVE_AUX_PAGESIZE They are always defined. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `b9c3d3382f`)	2022-05-19 11:49:06 +02:00
Florian Weimer	b0bd6a1323	elf: Merge dl-sysdep.c into the Linux version The generic version is the de-facto Linux implementation. It requires an auxiliary vector, so Hurd does not use it. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `91c0a47ffb`)	2022-05-19 11:43:53 +02:00
Noah Goldstein	596c9a32cc	x86: Optimize {str\|wcs}rchr-evex The new code unrolls the main loop slightly without adding too much overhead and minimizes the comparisons for the search CHAR. Geometric Mean of all benchmarks New / Old: 0.755 See email for all results. Full xcheck passes on x86_64 with and without multiarch enabled. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `c966099cdc`)	2022-05-16 18:55:54 -07:00
Noah Goldstein	00f09a14d2	x86: Optimize {str\|wcs}rchr-avx2 The new code unrolls the main loop slightly without adding too much overhead and minimizes the comparisons for the search CHAR. Geometric Mean of all benchmarks New / Old: 0.832 See email for all results. Full xcheck passes on x86_64 with and without multiarch enabled. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `df7e295d18`)	2022-05-16 18:55:45 -07:00
Noah Goldstein	0a11305416	x86: Optimize {str\|wcs}rchr-sse2 The new code unrolls the main loop slightly without adding too much overhead and minimizes the comparisons for the search CHAR. Geometric Mean of all benchmarks New / Old: 0.741 See email for all results. Full xcheck passes on x86_64 with and without multiarch enabled. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `5307aa9c18`)	2022-05-16 18:55:37 -07:00
Noah Goldstein	df5de87260	x86: Cleanup page cross code in memcmp-avx2-movbe.S Old code was both inefficient and wasted code size. New code (-62 bytes) and comparable or better performance in the page cross case. geometric_mean(N=20) of page cross cases New / Original: 0.960 size, align0, align1, ret, New Time/Old Time 1, 4095, 0, 0, 1.001 1, 4095, 0, 1, 0.999 1, 4095, 0, -1, 1.0 2, 4094, 0, 0, 1.0 2, 4094, 0, 1, 1.0 2, 4094, 0, -1, 1.0 3, 4093, 0, 0, 1.0 3, 4093, 0, 1, 1.0 3, 4093, 0, -1, 1.0 4, 4092, 0, 0, 0.987 4, 4092, 0, 1, 1.0 4, 4092, 0, -1, 1.0 5, 4091, 0, 0, 0.984 5, 4091, 0, 1, 1.002 5, 4091, 0, -1, 1.005 6, 4090, 0, 0, 0.993 6, 4090, 0, 1, 1.001 6, 4090, 0, -1, 1.003 7, 4089, 0, 0, 0.991 7, 4089, 0, 1, 1.0 7, 4089, 0, -1, 1.001 8, 4088, 0, 0, 0.875 8, 4088, 0, 1, 0.881 8, 4088, 0, -1, 0.888 9, 4087, 0, 0, 0.872 9, 4087, 0, 1, 0.879 9, 4087, 0, -1, 0.883 10, 4086, 0, 0, 0.878 10, 4086, 0, 1, 0.886 10, 4086, 0, -1, 0.873 11, 4085, 0, 0, 0.878 11, 4085, 0, 1, 0.881 11, 4085, 0, -1, 0.879 12, 4084, 0, 0, 0.873 12, 4084, 0, 1, 0.889 12, 4084, 0, -1, 0.875 13, 4083, 0, 0, 0.873 13, 4083, 0, 1, 0.863 13, 4083, 0, -1, 0.863 14, 4082, 0, 0, 0.838 14, 4082, 0, 1, 0.869 14, 4082, 0, -1, 0.877 15, 4081, 0, 0, 0.841 15, 4081, 0, 1, 0.869 15, 4081, 0, -1, 0.876 16, 4080, 0, 0, 0.988 16, 4080, 0, 1, 0.99 16, 4080, 0, -1, 0.989 17, 4079, 0, 0, 0.978 17, 4079, 0, 1, 0.981 17, 4079, 0, -1, 0.98 18, 4078, 0, 0, 0.981 18, 4078, 0, 1, 0.98 18, 4078, 0, -1, 0.985 19, 4077, 0, 0, 0.977 19, 4077, 0, 1, 0.979 19, 4077, 0, -1, 0.986 20, 4076, 0, 0, 0.977 20, 4076, 0, 1, 0.986 20, 4076, 0, -1, 0.984 21, 4075, 0, 0, 0.977 21, 4075, 0, 1, 0.983 21, 4075, 0, -1, 0.988 22, 4074, 0, 0, 0.983 22, 4074, 0, 1, 0.994 22, 4074, 0, -1, 0.993 23, 4073, 0, 0, 0.98 23, 4073, 0, 1, 0.992 23, 4073, 0, -1, 0.995 24, 4072, 0, 0, 0.989 24, 4072, 0, 1, 0.989 24, 4072, 0, -1, 0.991 25, 4071, 0, 0, 0.99 25, 4071, 0, 1, 0.999 25, 4071, 0, -1, 0.996 26, 4070, 0, 0, 0.993 26, 4070, 0, 1, 0.995 26, 4070, 0, -1, 0.998 27, 4069, 0, 0, 0.993 27, 4069, 0, 1, 0.999 27, 4069, 0, -1, 1.0 28, 4068, 0, 0, 0.997 28, 4068, 0, 1, 1.0 28, 4068, 0, -1, 0.999 29, 4067, 0, 0, 0.996 29, 4067, 0, 1, 0.999 29, 4067, 0, -1, 0.999 30, 4066, 0, 0, 0.991 30, 4066, 0, 1, 1.001 30, 4066, 0, -1, 0.999 31, 4065, 0, 0, 0.988 31, 4065, 0, 1, 0.998 31, 4065, 0, -1, 0.998 Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `23102686ec`)	2022-05-16 18:55:24 -07:00
Noah Goldstein	ffe75982cc	x86: Remove memcmp-sse4.S Code didn't actually use any sse4 instructions since `ptest` was removed in: commit `2f9062d717` Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Wed Nov 10 16:18:56 2021 -0600 x86: Shrink memcmp-sse4.S code size The new memcmp-sse2 implementation is also faster. geometric_mean(N=20) of page cross cases SSE2 / SSE4: 0.905 Note there are two regressions preferring SSE2 for Size = 1 and Size = 65. Size = 1: size, align0, align1, ret, New Time/Old Time 1, 1, 1, 0, 1.2 1, 1, 1, 1, 1.197 1, 1, 1, -1, 1.2 This is intentional. Size == 1 is significantly less hot based on profiles of GCC11 and Python3 than sizes [4, 8] (which is made hotter). Python3 Size = 1 -> 13.64% Python3 Size = [4, 8] -> 60.92% GCC11 Size = 1 -> 1.29% GCC11 Size = [4, 8] -> 33.86% size, align0, align1, ret, New Time/Old Time 4, 4, 4, 0, 0.622 4, 4, 4, 1, 0.797 4, 4, 4, -1, 0.805 5, 5, 5, 0, 0.623 5, 5, 5, 1, 0.777 5, 5, 5, -1, 0.802 6, 6, 6, 0, 0.625 6, 6, 6, 1, 0.813 6, 6, 6, -1, 0.788 7, 7, 7, 0, 0.625 7, 7, 7, 1, 0.799 7, 7, 7, -1, 0.795 8, 8, 8, 0, 0.625 8, 8, 8, 1, 0.848 8, 8, 8, -1, 0.914 9, 9, 9, 0, 0.625 Size = 65: size, align0, align1, ret, New Time/Old Time 65, 0, 0, 0, 1.103 65, 0, 0, 1, 1.216 65, 0, 0, -1, 1.227 65, 65, 0, 0, 1.091 65, 0, 65, 1, 1.19 65, 65, 65, -1, 1.215 This is because A) the checks in range [65, 96] are now unrolled 2x and B) because smaller values <= 16 are now given a hotter path. By contrast the SSE4 version has a branch for Size = 80. The unrolled version has get better performance for returns which need both comparisons. size, align0, align1, ret, New Time/Old Time 128, 4, 8, 0, 0.858 128, 4, 8, 1, 0.879 128, 4, 8, -1, 0.888 As well, out of microbenchmark environments that are not full predictable the branch will have a real-cost. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `7cbc03d030`)	2022-05-16 18:55:16 -07:00
Noah Goldstein	4ff6ae069b	x86: Small improvements for wcslen Just a few QOL changes. 1. Prefer `add` > `lea` as it has high execution units it can run on. 2. Don't break macro-fusion between `test` and `jcc` 3. Reduce code size by removing gratuitous padding bytes (-90 bytes). geometric_mean(N=20) of all benchmarks New / Original: 0.959 All string/memory tests pass. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `244b415d38`)	2022-05-16 18:55:09 -07:00
Noah Goldstein	80883f4354	x86: Remove AVX str{n}casecmp The rational is: 1. SSE42 has nearly identical logic so any benefit is minimal (3.4% regression on Tigerlake using SSE42 versus AVX across the benchtest suite). 2. AVX2 version covers the majority of targets that previously prefered it. 3. The targets where AVX would still be best (SnB and IVB) are becoming outdated. All in all the saving the code size is worth it. All string/memory tests pass. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `305769b2a1`)	2022-05-16 18:55:02 -07:00
Noah Goldstein	b13a2e68eb	x86: Add EVEX optimized str{n}casecmp geometric_mean(N=40) of all benchmarks EVEX / SSE42: .621 All string/memory tests pass. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `84e7c46df4`)	2022-05-16 18:54:52 -07:00
Noah Goldstein	3051cf3e74	x86: Add AVX2 optimized str{n}casecmp geometric_mean(N=40) of all benchmarks AVX2 / SSE42: .702 All string/memory tests pass. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `bbf8122234`)	2022-05-16 18:54:41 -07:00
Noah Goldstein	3605c74407	x86: Optimize str{n}casecmp TOLOWER logic in strcmp-sse42.S Slightly faster method of doing TOLOWER that saves an instruction. Also replace the hard coded 5-byte no with .p2align 4. On builds with CET enabled this misaligned entry to strcasecmp. geometric_mean(N=40) of all benchmarks New / Original: .920 All string/memory tests pass. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `d154758e61`)	2022-05-16 18:54:27 -07:00
Noah Goldstein	5997011826	x86: Optimize str{n}casecmp TOLOWER logic in strcmp.S Slightly faster method of doing TOLOWER that saves an instruction. Also replace the hard coded 5-byte no with .p2align 4. On builds with CET enabled this misaligned entry to strcasecmp. geometric_mean(N=40) of all benchmarks New / Original: .894 All string/memory tests pass. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `670b54bc58`)	2022-05-16 18:54:17 -07:00
Noah Goldstein	a4b1cae068	x86: Remove strspn-sse2.S and use the generic implementation The generic implementation is faster. geometric_mean(N=20) of all benchmarks New / Original: .710 All string/memory tests pass. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `9c8a6ad620`)	2022-05-16 18:54:09 -07:00
Noah Goldstein	3811544655	x86: Remove strpbrk-sse2.S and use the generic implementation The generic implementation is faster (see strcspn commit). All string/memory tests pass. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `6533585352`)	2022-05-16 18:53:59 -07:00
Noah Goldstein	0dafa75e3c	x86: Remove strcspn-sse2.S and use the generic implementation The generic implementation is faster. geometric_mean(N=20) of all benchmarks New / Original: .678 All string/memory tests pass. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `fe28e7d9d9`)	2022-05-16 18:53:48 -07:00
Noah Goldstein	0a2da01110	x86: Optimize strspn in strspn-c.c Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of _mm_cmpistri. Also change offset to unsigned to avoid unnecessary sign extensions. geometric_mean(N=20) of all benchmarks that dont fallback on sse2; New / Original: .901 All string/memory tests pass. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `412d103431`)	2022-05-16 18:53:39 -07:00
Noah Goldstein	0ae1006967	x86: Optimize strcspn and strpbrk in strcspn-c.c Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of _mm_cmpistri. Also change offset to unsigned to avoid unnecessary sign extensions. geometric_mean(N=20) of all benchmarks that dont fallback on sse2/strlen; New / Original: .928 All string/memory tests pass. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `30d627d477`)	2022-05-16 18:53:28 -07:00
Noah Goldstein	dd6d3a0bbc	x86: Code cleanup in strchr-evex and comment justifying branch Small code cleanup for size: -81 bytes. Add comment justifying using a branch to do NULL/non-null return. All string/memory tests pass and no regressions in benchtests. geometric_mean(N=20) of all benchmarks New / Original: .985 Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `ec285ea904`)	2022-05-16 18:53:19 -07:00
Noah Goldstein	3c55c20756	x86: Code cleanup in strchr-avx2 and comment justifying branch Small code cleanup for size: -53 bytes. Add comment justifying using a branch to do NULL/non-null return. All string/memory tests pass and no regressions in benchtests. geometric_mean(N=20) of all benchmarks Original / New: 1.00 Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `a6fbf4d51e`)	2022-05-16 18:53:07 -07:00
Adhemerval Zanella	dd457606ca	x86_64: Remove bcopy optimizations The symbols is not present in current POSIX specification and compiler already generates memmove call. (cherry picked from commit `bf92893a14`)	2022-05-16 18:52:57 -07:00
H.J. Lu	37f373e334	x86-64: Remove bzero weak alias in SS2 memset commit `3d9f171bfb` Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Feb 7 05:55:15 2022 -0800 x86-64: Optimize bzero added the optimized bzero. Remove bzero weak alias in SS2 memset to avoid undefined __bzero in memset-sse2-unaligned-erms. (cherry picked from commit `0fb8800029`)	2022-05-16 18:52:47 -07:00
H.J. Lu	6cba46c858	x86_64/multiarch: Sort sysdep_routines and put one entry per line (cherry picked from commit `c328d0152d`)	2022-05-16 18:52:35 -07:00
H.J. Lu	8de6e4a199	x86: Improve L to support L(XXX_SYMBOL (YYY, ZZZ)) (cherry picked from commit `1283948f23`)	2022-05-16 18:52:19 -07:00
Noah Goldstein	e123f08ad5	x86: Fix fallback for wcsncmp_avx2 in strcmp-avx2.S [BZ #28896 ] Overflow case for __wcsncmp_avx2_rtm should be __wcscmp_avx2_rtm not __wcscmp_avx2. commit `ddf0992cf5` Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Sun Jan 9 16:02:21 2022 -0600 x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755] Set the wrong fallback function for `__wcsncmp_avx2_rtm`. It was set to fallback on to `__wcscmp_avx2` instead of `__wcscmp_avx2_rtm` which can cause spurious aborts. This change will need to be backported. All string/memory tests pass. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `9fef7039a7`)	2022-05-05 09:13:13 -07:00
Noah Goldstein	5373c90f2e	x86: Fix bug in strncmp-evex and strncmp-avx2 [BZ #28895 ] Logic can read before the start of `s1` / `s2` if both `s1` and `s2` are near the start of a page. To avoid having the result contimated by these comparisons the `strcmp` variants would mask off these comparisons. This was missing in the `strncmp` variants causing the bug. This commit adds the masking to `strncmp` so that out of range comparisons don't affect the result. test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass as well a full xcheck on x86_64 linux. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `e108c02a5e`)	2022-05-05 09:11:49 -07:00
Noah Goldstein	70509f9b48	x86: Set .text section in memset-vec-unaligned-erms commit `3d9f171bfb` Author: H.J. Lu <hjl.tools@gmail.com> Date: Mon Feb 7 05:55:15 2022 -0800 x86-64: Optimize bzero Remove setting the .text section for the code. This commit adds that back. (cherry picked from commit `7912236f4a`)	2022-05-05 09:11:13 -07:00
H.J. Lu	5cb6329652	x86-64: Optimize bzero memset with zero as the value to set is by far the majority value (99%+ for Python3 and GCC). bzero can be slightly more optimized for this case by using a zero-idiom xor for broadcasting the set value to a register (vector or GPR). Co-developed-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit `3d9f171bfb`)	2022-05-05 09:10:53 -07:00
Noah Goldstein	190ea5f7e4	x86: Remove SSSE3 instruction for broadcast in memset.S (SSE2 Only) commit `b62ace2740` Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Sun Feb 6 00:54:18 2022 -0600 x86: Improve vec generation in memset-vec-unaligned-erms.S Revert usage of 'pshufb' in broadcast logic as it is an SSSE3 instruction and memset.S is restricted to only SSE2 instructions. (cherry picked from commit `1b0c60f95b`)	2022-05-05 08:54:23 -07:00
Noah Goldstein	ea19c490a3	x86: Improve vec generation in memset-vec-unaligned-erms.S No bug. Split vec generation into multiple steps. This allows the broadcast in AVX2 to use 'xmm' registers for the L(less_vec) case. This saves an expensive lane-cross instruction and removes the need for 'vzeroupper'. For SSE2 replace 2x 'punpck' instructions with zero-idiom 'pxor' for byte broadcast. Results for memset-avx2 small (geomean of N = 20 benchset runs). size, New Time, Old Time, New / Old 0, 4.100, 3.831, 0.934 1, 5.074, 4.399, 0.867 2, 4.433, 4.411, 0.995 4, 4.487, 4.415, 0.984 8, 4.454, 4.396, 0.987 16, 4.502, 4.443, 0.987 All relevant string/wcsmbs tests are passing. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `b62ace2740`)	2022-05-05 08:54:11 -07:00
H.J. Lu	53ddafe917	x86-64: Fix strcmp-evex.S Change "movl %edx, %rdx" to "movl %edx, %edx" in: commit `8418eb3ff4` Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Mon Jan 10 15:35:39 2022 -0600 x86: Optimize strcmp-evex.S (cherry picked from commit `0e0199a9e0`)	2022-05-05 08:54:03 -07:00
H.J. Lu	d299032743	x86-64: Fix strcmp-avx2.S Change "movl %edx, %rdx" to "movl %edx, %edx" in: commit `b77b06e0e2` Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Mon Jan 10 15:35:38 2022 -0600 x86: Optimize strcmp-avx2.S (cherry picked from commit `c15efd011c`)	2022-05-05 08:53:50 -07:00
Noah Goldstein	c41a66767d	x86: Optimize strcmp-evex.S Optimization are primarily to the loop logic and how the page cross logic interacts with the loop. The page cross logic is at times more expensive for short strings near the end of a page but not crossing the page. This is done to retest the page cross conditions with a non-faulty check and to improve the logic for entering the loop afterwards. This is only particular cases, however, and is general made up for by more than 10x improvements on the transition from the page cross -> loop case. The non-page cross cases as well are nearly universally improved. test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit `8418eb3ff4`)	2022-05-05 08:53:42 -07:00
Noah Goldstein	0d5b36c8cc	x86: Optimize strcmp-avx2.S Optimization are primarily to the loop logic and how the page cross logic interacts with the loop. The page cross logic is at times more expensive for short strings near the end of a page but not crossing the page. This is done to retest the page cross conditions with a non-faulty check and to improve the logic for entering the loop afterwards. This is only particular cases, however, and is general made up for by more than 10x improvements on the transition from the page cross -> loop case. The non-page cross cases are improved most for smaller sizes [0, 128] and go about even for (128, 4096]. The loop page cross logic is improved so some more significant speedup is seen there as well. test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit `b77b06e0e2`)	2022-05-05 08:53:34 -07:00
Joseph Myers	97cb8227b8	Add HWCAP2_AFP, HWCAP2_RPRES from Linux 5.17 to AArch64 bits/hwcap.h Add the new HWCAP2_AFP and HWCAP2_RPRES constants from Linux 5.17. Tested with build-many-glibcs.py for aarch64-linux-gnu. (cherry picked from commit `866c599182`)	2022-05-03 11:08:52 +02:00
Szabolcs Nagy	c108e87026	aarch64: Add HWCAP2_ECV from Linux 5.16 Indicates the availability of enhanced counter virtualization extension of armv8.6-a with self-synchronized virtual counter CNTVCTSS_EL0 usable in userspace. (cherry picked from commit `5a1be8ebdf`)	2022-05-03 11:08:52 +02:00
Joseph Myers	f858bc3093	Add SOL_MPTCP, SOL_MCTP from Linux 5.16 to bits/socket.h Linux 5.16 adds constants SOL_MPTCP and SOL_MCTP to the getsockopt / setsockopt levels; add these constants to bits/socket.h. Tested for x86_64. (cherry picked from commit `fdc1ae67fe`)	2022-05-03 11:08:52 +02:00
Joseph Myers	0499c3a95f	Update kernel version to 5.17 in tst-mman-consts.py This patch updates the kernel version in the test tst-mman-consts.py to 5.17. (There are no new MAP_* constants covered by this test in 5.17 that need any other header changes.) Tested with build-many-glibcs.py. (cherry picked from commit `23808a422e`)	2022-05-03 11:08:52 +02:00
Joseph Myers	81181ba5d9	Update kernel version to 5.16 in tst-mman-consts.py This patch updates the kernel version in the test tst-mman-consts.py to 5.16. (There are no new MAP_* constants covered by this test in 5.16 that need any other header changes.) Tested with build-many-glibcs.py. (cherry picked from commit `790a607e23`)	2022-05-03 11:08:52 +02:00
Joseph Myers	6af165658d	Update syscall lists for Linux 5.17 Linux 5.17 has one new syscall, set_mempolicy_home_node. Update syscall-names.list and regenerate the arch-syscall.h headers with build-many-glibcs.py update-syscalls. Tested with build-many-glibcs.py. (cherry picked from commit `8ef9196b26`)	2022-05-03 11:08:52 +02:00
Joseph Myers	5146b73d72	Add ARPHRD_CAN, ARPHRD_MCTP to net/if_arp.h Add the constant ARPHRD_MCTP, from Linux 5.15, to net/if_arp.h, along with ARPHRD_CAN which was added to Linux in version 2.6.25 (commit cd05acfe65ed2cf2db683fa9a6adb8d35635263b, "[CAN]: Allocate protocol numbers for PF_CAN") but apparently missed for glibc at the time. Tested for x86_64. (cherry picked from commit `a94d9659cd`)	2022-05-03 11:07:10 +02:00
Joseph Myers	fd5dbfd1cd	Update kernel version to 5.15 in tst-mman-consts.py This patch updates the kernel version in the test tst-mman-consts.py to 5.15. (There are no new MAP_* constants covered by this test in 5.15 that need any other header changes.) Tested with build-many-glibcs.py. (cherry picked from commit `5c3ece451d`)	2022-05-03 11:07:07 +02:00
Joseph Myers	bc6fba3c80	Add PF_MCTP, AF_MCTP from Linux 5.15 to bits/socket.h Linux 5.15 adds a new address / protocol family PF_MCTP / AF_MCTP; add these constants to bits/socket.h. Tested for x86_64. (cherry picked from commit `bdeb7a8fa9`)	2022-05-03 11:07:03 +02:00
DJ Delorie	c66c92181d	posix/glob.c: update from gnulib Copied from gnulib/lib/glob.c in order to fix rhbz 1982608 Also fixes swbz 25659 Reviewed-by: Carlos O'Donell <carlos@redhat.com> Tested-by: Carlos O'Donell <carlos@redhat.com> (cherry picked from commit `7c477b57a3`)	2022-04-28 11:57:23 -04:00
Adhemerval Zanella	88a8637cb4	linux: Fix fchmodat with AT_SYMLINK_NOFOLLOW for 64 bit time_t (BZ#29097) The AT_SYMLINK_NOFOLLOW emulation ues the default 32 bit stat internal calls, which fails with EOVERFLOW if the file constains timestamps beyond 2038. Checked on i686-linux-gnu. (cherry picked from commit `118a2aee07`)	2022-04-28 10:10:30 -03:00
Carlos O'Donell	55640ed3fd	i386: Regenerate ulps These failures were caught while building glibc master for Fedora Rawhide which is built with '-mtune=generic -msse2 -mfpmath=sse' using gcc 11.3 (gcc-11.3.1-2.fc35) on a Cascadelake Intel Xeon processor. (cherry picked from commit `e465d97653`)	2022-04-27 21:20:43 -04:00
Adhemerval Zanella	9681691402	linux: Fix missing internal 64 bit time_t stat usage These are two missing spots initially done by `52a5fe70a2`. Checked on i686-linux-gnu. (cherry picked from commit `834ddd0432`)	2022-04-27 14:52:26 -03:00
Noah Goldstein	c796418d00	x86: Optimize L(less_vec) case in memcmp-evex-movbe.S No bug. Optimizations are twofold. 1) Replace page cross and 0/1 checks with masked load instructions in L(less_vec). In applications this reduces branch-misses in the hot [0, 32] case. 2) Change controlflow so that L(less_vec) case gets the fall through. Change 2) helps copies in the [0, 32] size range but comes at the cost of copies in the [33, 64] size range. From profiles of GCC and Python3, 94%+ and 99%+ of calls are in the [0, 32] range so this appears to the the right tradeoff. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `abddd61de0`)	2022-04-26 18:18:16 -07:00
H.J. Lu	f3a99b2216	x86: Don't set Prefer_No_AVX512 for processors with AVX512 and AVX-VNNI Don't set Prefer_No_AVX512 on processors with AVX512 and AVX-VNNI since they won't lower CPU frequency when ZMM load and store instructions are used. (cherry picked from commit `ceeffe968c`)	2022-04-26 18:18:16 -07:00
Noah Goldstein	4bbd0f866a	x86-64: Use notl in EVEX strcmp [BZ #28646 ] Must use notl %edi here as lower bits are for CHAR comparisons potentially out of range thus can be 0 without indicating mismatch. This fixes BZ #28646. Co-Authored-By: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `4df1fa6ddc`)	2022-04-26 18:18:16 -07:00
Noah Goldstein	7cb126e7e7	x86: Shrink memcmp-sse4.S code size No bug. This implementation refactors memcmp-sse4.S primarily with minimizing code size in mind. It does this by removing the lookup table logic and removing the unrolled check from (256, 512] bytes. memcmp-sse4 code size reduction : -3487 bytes wmemcmp-sse4 code size reduction: -1472 bytes The current memcmp-sse4.S implementation has a large code size cost. This has serious adverse affects on the ICache / ITLB. While in micro-benchmarks the implementations appears fast, traces of real-world code have shown that the speed in micro benchmarks does not translate when the ICache/ITLB are not primed, and that the cost of the code size has measurable negative affects on overall application performance. See https://research.google/pubs/pub48320/ for more details. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `2f9062d717`)	2022-04-26 18:18:16 -07:00
Noah Goldstein	cecbac5212	x86: Double size of ERMS rep_movsb_threshold in dl-cacheinfo.h No bug. This patch doubles the rep_movsb_threshold when using ERMS. Based on benchmarks the vector copy loop, especially now that it handles 4k aliasing, is better for these medium ranged. On Skylake with ERMS: Size, Align1, Align2, dst>src,(rep movsb) / (vec copy) 4096, 0, 0, 0, 0.975 4096, 0, 0, 1, 0.953 4096, 12, 0, 0, 0.969 4096, 12, 0, 1, 0.872 4096, 44, 0, 0, 0.979 4096, 44, 0, 1, 0.83 4096, 0, 12, 0, 1.006 4096, 0, 12, 1, 0.989 4096, 0, 44, 0, 0.739 4096, 0, 44, 1, 0.942 4096, 12, 12, 0, 1.009 4096, 12, 12, 1, 0.973 4096, 44, 44, 0, 0.791 4096, 44, 44, 1, 0.961 4096, 2048, 0, 0, 0.978 4096, 2048, 0, 1, 0.951 4096, 2060, 0, 0, 0.986 4096, 2060, 0, 1, 0.963 4096, 2048, 12, 0, 0.971 4096, 2048, 12, 1, 0.941 4096, 2060, 12, 0, 0.977 4096, 2060, 12, 1, 0.949 8192, 0, 0, 0, 0.85 8192, 0, 0, 1, 0.845 8192, 13, 0, 0, 0.937 8192, 13, 0, 1, 0.939 8192, 45, 0, 0, 0.932 8192, 45, 0, 1, 0.927 8192, 0, 13, 0, 0.621 8192, 0, 13, 1, 0.62 8192, 0, 45, 0, 0.53 8192, 0, 45, 1, 0.516 8192, 13, 13, 0, 0.664 8192, 13, 13, 1, 0.659 8192, 45, 45, 0, 0.593 8192, 45, 45, 1, 0.575 8192, 2048, 0, 0, 0.854 8192, 2048, 0, 1, 0.834 8192, 2061, 0, 0, 0.863 8192, 2061, 0, 1, 0.857 8192, 2048, 13, 0, 0.63 8192, 2048, 13, 1, 0.629 8192, 2061, 13, 0, 0.627 8192, 2061, 13, 1, 0.62 Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `475b63702e`)	2022-04-26 18:18:16 -07:00

1 2 3 4 5 ...

14580 Commits