Commit Graph

14580 Commits

Author SHA1 Message Date
Noah Goldstein
aadd0a1c7c x86: Put wcs{n}len-sse4.1 in the sse4.1 text section
Previously was missing but the two implementations shouldn't get in
the sse2 (generic) text section.

(cherry picked from commit afc6e4328f)
2022-07-18 20:45:21 -07:00
Noah Goldstein
d201c59177 x86: Align entry for memrchr to 64-bytes.
The function was tuned around 64-byte entry alignment and performs
better for all sizes with it.

As well, different code paths were explicitly written to touch the
minimum number of cache lines, i.e. sizes <= 32 touch only the entry
cache line.

(cherry picked from commit 227afaa672)
2022-07-18 20:45:21 -07:00
Noah Goldstein
c51d8d383c x86: Add BMI1/BMI2 checks for ISA_V3 check
BMI1/BMI2 are part of the ISA V3 requirements:
https://en.wikipedia.org/wiki/X86-64

And defined by GCC when building with `-march=x86-64-v3`

(cherry picked from commit 8da9f346cb)
2022-07-18 20:45:21 -07:00
Noah Goldstein
ba1c3f23d9 x86: Cleanup bounds checking in large memcpy case
1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
   Previously was using `__x86_rep_movsb_threshold` and should
   have been using `__x86_shared_non_temporal_threshold`.

2. Avoid reloading __x86_shared_non_temporal_threshold before
   the L(large_memcpy_4x) bounds check.

3. Document the second bounds check for L(large_memcpy_4x)
   more clearly.

(cherry picked from commit 89a25c6f64)
2022-07-18 20:45:21 -07:00
Noah Goldstein
94b0dc9419 x86: Add bounds x86_non_temporal_threshold
The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed
by memmove-vec-unaligned-erms.

The lower-bound is needed because memmove-vec-unaligned-erms unrolls
the loop aggressively in the L(large_memcpy_4x) case.

The upper-bound is needed because memmove-vec-unaligned-erms
left-shifts the value of `x86_non_temporal_threshold` by
LOG_4X_MEMCPY_THRESH (4) which without a bound may overflow.

The lack of lower-bound can be a correctness issue. The lack of
upper-bound cannot.
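
A minimal C sketch of the two bounds described above. The function and macro names are hypothetical and are not glibc's; the sketch assumes the scaling by LOG_4X_MEMCPY_THRESH is a left shift, which is the direction in which a SIZE_MAX / 16 bound prevents overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_4X_MEMCPY_THRESH 4
/* Bounds quoted in the commit message; the macro names are hypothetical. */
#define NT_THRESHOLD_MIN ((size_t) 16448)
#define NT_THRESHOLD_MAX (SIZE_MAX >> LOG_4X_MEMCPY_THRESH) /* SIZE_MAX / 16 */

/* Return true if a requested threshold lies inside the assumed bounds.  */
static bool
nt_threshold_in_bounds (size_t requested)
{
  return requested >= NT_THRESHOLD_MIN && requested <= NT_THRESHOLD_MAX;
}

/* Scale an in-bounds threshold by 16.  The assertion documents why the
   upper bound exists: a larger value would lose high bits in the shift.  */
static size_t
large_memcpy_4x_bound (size_t threshold)
{
  assert (threshold <= NT_THRESHOLD_MAX);
  return threshold << LOG_4X_MEMCPY_THRESH;
}
```

How an out-of-bounds request is handled (rejected, clamped, or replaced by a default) is not stated here, so the sketch only reports validity.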

(cherry picked from commit b446822b6a)
2022-07-18 20:45:21 -07:00
Noah Goldstein
9d50e162ee x86: Add sse42 implementation to strcmp's ifunc
This has been missing since the ifuncs were added.

The performance of SSE4.2 is preferable to SSE2.

Measured on Tigerlake with N = 20 runs.
Geometric Mean of all benchmarks SSE4.2 / SSE2: 0.906

(cherry picked from commit ff439c4717)
2022-07-18 20:45:21 -07:00
Noah Goldstein
6e008c884d x86: Fix misordered logic for setting rep_movsb_stop_threshold
Move the setting of `rep_movsb_stop_threshold` to after the tunables
have been collected so that the `rep_movsb_stop_threshold` (which
is used to redirect control flow to the non_temporal case) will
use any user value for `non_temporal_threshold` (set using
glibc.cpu.x86_non_temporal_threshold)

(cherry picked from commit 0355915514)
2022-07-18 20:45:21 -07:00
Noah Goldstein
fc54e1fae8 x86: Align varshift table to 32-bytes
This ensures the load will never split a cache line.
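
A small C sketch of why 32-byte alignment is enough. It assumes a 64-byte cache line, a 31-byte table, and 16-byte vector loads at byte offsets 0 through 15; these sizes and all names below are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64      /* assumed cache line size */
#define VARSHIFT_TABLE_SIZE 31  /* assumed table size in bytes */
#define VEC_LOAD_SIZE 16        /* assumed size of each vector load */

/* True if a load of SIZE bytes at ADDR touches two cache lines.  */
static bool
load_splits_cache_line (uintptr_t addr, size_t size)
{
  assert (size > 0);
  return addr / CACHE_LINE_SIZE != (addr + size - 1) / CACHE_LINE_SIZE;
}

/* True if any in-bounds vector load from a table at TABLE_BASE splits
   a cache line.  */
static bool
any_table_load_splits (uintptr_t table_base)
{
  for (size_t off = 0; off + VEC_LOAD_SIZE <= VARSHIFT_TABLE_SIZE; off++)
    if (load_splits_cache_line (table_base + off, VEC_LOAD_SIZE))
      return true;
  return false;
}
```

With a 32-byte aligned base the whole table sits inside one 32-byte block, so every load stays within a single line; a base that is only 16-byte aligned can straddle a line boundary.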

(cherry picked from commit 0f91811333)
2022-07-18 20:45:21 -07:00
Noah Goldstein
820504e3ed x86: ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST expect no transactions
Give fall-through path to `vzeroupper` and taken-path to `vzeroall`.

Generally even on machines with RTM the expectation is the
string-library functions will not be called in transactions.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit c28db9cb29)
2022-07-18 20:45:21 -07:00
Noah Goldstein
3c87383a20 x86: Shrink code size of memchr-evex.S
This is not meant as a performance optimization. The previous code was
far too liberal in aligning targets and wasted code size unnecessarily.

The total code size saving is: 64 bytes

There are no non-negligible changes in the benchmarks.
Geometric Mean of all benchmarks New / Old: 1.000

Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 56da3fe1dd)
2022-07-18 20:45:21 -07:00
Noah Goldstein
a910d7e164 x86: Shrink code size of memchr-avx2.S
This is not meant as a performance optimization. The previous code was
far too liberal in aligning targets and wasted code size unnecessarily.

The total code size saving is: 59 bytes

There are no major changes in the benchmarks.
Geometric Mean of all benchmarks New / Old: 0.967

Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 6dcbb7d95d)

x86: Fix page cross case in rawmemchr-avx2 [BZ #29234]

commit 6dcbb7d95d
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Mon Jun 6 21:11:33 2022 -0700

    x86: Shrink code size of memchr-avx2.S

Changed how the page cross case aligned string (rdi) in
rawmemchr. This was incompatible with how
`L(cross_page_continue)` expected the pointer to be aligned and
would cause rawmemchr to read data that started before the
beginning of the string. What it would read was in valid memory
but could count CHAR matches resulting in an incorrect return
value.

This commit fixes that issue by essentially reverting the changes to
the L(page_cross) case as they didn't really matter.

Test cases added and all pass with the new code (and were confirmed
to fail with the old code).
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 2c9af8421d)
2022-07-18 20:45:21 -07:00
Noah Goldstein
b05bd59823 x86: Optimize memrchr-avx2.S
The new code:
    1. prioritizes smaller user-arg lengths more.
    2. optimizes target placement more carefully
    3. reuses logic more
    4. fixes up various inefficiencies in the logic. The biggest
       case here is the `lzcnt` logic for checking returns which
       saves either a branch or multiple instructions.

The total code size saving is: 306 bytes
Geometric Mean of all benchmarks New / Old: 0.760

Regressions:
There are some regressions. Particularly where the length (user arg
length) is large but the position of the match char is near the
beginning of the string (in first VEC). This case has roughly a
10-20% regression.

This is because the new logic gives the hot path for immediate matches
to shorter lengths (the more common input). This case has roughly
a 15-45% speedup.

Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit af5306a735)
2022-07-18 20:45:21 -07:00
Noah Goldstein
83a986e9fb x86: Optimize memrchr-evex.S
The new code:
    1. prioritizes smaller user-arg lengths more.
    2. optimizes target placement more carefully
    3. reuses logic more
    4. fixes up various inefficiencies in the logic. The biggest
       case here is the `lzcnt` logic for checking returns which
       saves either a branch or multiple instructions.

The total code size saving is: 263 bytes
Geometric Mean of all benchmarks New / Old: 0.755

Regressions:
There are some regressions. Particularly where the length (user arg
length) is large but the position of the match char is near the
beginning of the string (in first VEC). This case has roughly a
20% regression.

This is because the new logic gives the hot path for immediate matches
to shorter lengths (the more common input). This case has roughly
a 35% speedup.

Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit b4209615a0)
2022-07-18 20:45:21 -07:00
Noah Goldstein
4901009dad x86: Optimize memrchr-sse2.S
The new code:
    1. prioritizes smaller lengths more.
    2. optimizes target placement more carefully.
    3. reuses logic more.
    4. fixes up various inefficiencies in the logic.

The total code size saving is: 394 bytes
Geometric Mean of all benchmarks New / Old: 0.874

Regressions:
    1. The page cross case is now colder, especially re-entry from the
       page cross case if a match is not found in the first VEC
       (roughly 50%). My general opinion with this patch is this is
       acceptable given the "coldness" of this case (less than 4%) and
       generally performance improvement in the other far more common
       cases.

    2. There are some regressions 5-15% for medium/large user-arg
       lengths that have a match in the first VEC. This is because the
       logic was rewritten to optimize finds in the first VEC if the
       user-arg length is shorter (where we see roughly 20-50%
       performance improvements). It is not always the case this is a
       regression. My intuition is some frontend quirk is partially
       explaining the data although I haven't been able to find the
       root cause.

Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 731feee386)
2022-07-18 20:45:21 -07:00
Noah Goldstein
e805606193 x86: Add COND_VZEROUPPER that can replace vzeroupper if no ret
The RTM vzeroupper mitigation has no way of replacing an inline
vzeroupper that is not immediately before a return.

This can be useful when hoisting a vzeroupper to save code size
for example:

```
L(foo):
	cmpl	%eax, %edx
	jz	L(bar)
	tzcntl	%eax, %eax
	addq	%rdi, %rax
	VZEROUPPER_RETURN

L(bar):
	xorl	%eax, %eax
	VZEROUPPER_RETURN
```

Can become:

```
L(foo):
	COND_VZEROUPPER
	cmpl	%eax, %edx
	jz	L(bar)
	tzcntl	%eax, %eax
	addq	%rdi, %rax
	ret

L(bar):
	xorl	%eax, %eax
	ret
```

This code does not change any existing functionality.

There is no difference in the objdump of libc.so before and after this
patch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit dd5c483b25)
2022-07-18 20:45:20 -07:00
Noah Goldstein
70be93d1c5 x86: Create header for VEC classes in x86 strings library
This patch does not touch any existing code and is only meant to be a
tool for future patches so that simple source files can more easily be
maintained to target multiple VEC classes.

There is no difference in the objdump of libc.so before and after this
patch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 8a780a6b91)
2022-07-18 20:45:20 -07:00
Raghuveer Devulapalli
82a707aeb7 x86_64: Add strstr function with 512-bit EVEX
Adding a 512-bit EVEX version of strstr. The algorithm works as follows:

(1) We spend a few cycles at the beginning to peek into the needle. We
locate an edge in the needle (first occurrence of 2 consecutive distinct
characters) and also store the first 64-bytes into a zmm register.

(2) We search for the edge in the haystack by looking into one cache
line of the haystack at a time. This avoids having to read past a page
boundary which can cause a seg fault.

(3) If an edge is found in the haystack we first compare the first
64-bytes of the needle (already stored in a zmm register) before we
proceed with a full string compare performed byte by byte.
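
The three steps can be illustrated with a scalar C sketch. This is not the AVX-512 implementation: it checks one haystack position at a time instead of one cache line at a time, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Step (1): index of the first pair of adjacent, distinct characters in
   NEEDLE.  Returns 0 when no such pair exists (e.g. "aaa").  */
static size_t
find_edge (const char *needle)
{
  for (size_t i = 0; needle[i] != '\0' && needle[i + 1] != '\0'; i++)
    if (needle[i] != needle[i + 1])
      return i;
  return 0;
}

/* Steps (2) and (3): use the two edge characters as a cheap filter and
   run the full comparison only where the filter matches.  */
static const char *
strstr_edge (const char *hay, const char *needle)
{
  assert (hay != NULL && needle != NULL);
  size_t nlen = strlen (needle);
  if (nlen == 0)
    return hay;
  if (nlen == 1)
    return strchr (hay, needle[0]);

  size_t hlen = strlen (hay);
  size_t edge = find_edge (needle);   /* edge + 1 < nlen holds here */
  for (size_t i = 0; i + nlen <= hlen; i++)
    if (hay[i + edge] == needle[edge]
        && hay[i + edge + 1] == needle[edge + 1]
        && memcmp (hay + i, needle, nlen) == 0)
      return hay + i;
  return NULL;
}
```

Filtering on two distinct adjacent characters rejects runs of a repeated character in the haystack that a single-character filter would keep matching.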

Benchmarking results: (old = strstr_sse2_unaligned, new = strstr_avx512)

Geometric mean of all benchmarks: new / old =  0.66

Difficult skiptable(0) : new / old =  0.02
Difficult skiptable(1) : new / old =  0.01
Difficult 2-way : new / old =  0.25
Difficult testing first 2 : new / old =  1.26
Difficult skiptable(0) : new / old =  0.05
Difficult skiptable(1) : new / old =  0.06
Difficult 2-way : new / old =  0.26
Difficult testing first 2 : new / old =  1.05
Difficult skiptable(0) : new / old =  0.42
Difficult skiptable(1) : new / old =  0.24
Difficult 2-way : new / old =  0.21
Difficult testing first 2 : new / old =  1.04
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 5082a287d5)

x86: Remove __mmask intrinsics in strstr-avx512.c

The intrinsics are not available before GCC7 and using standard
operators generates code of equivalent or better quality.

Removed:
    _cvtmask64_u64
    _kshiftri_mask64
    _kand_mask64

Geometric Mean of 5 Runs of Full Benchmark Suite New / Old: 0.958

(cherry picked from commit f2698954ff)
2022-07-18 20:45:20 -07:00
H.J. Lu
f6bc52f080 x86-64: Ignore r_addend for R_X86_64_GLOB_DAT/R_X86_64_JUMP_SLOT
According to x86-64 psABI, r_addend should be ignored for R_X86_64_GLOB_DAT
and R_X86_64_JUMP_SLOT.  Since linkers always set their r_addends to 0, we
can ignore their r_addends.
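
A minimal C sketch of the rule, using a simplified relocation record. The struct, the function, and the SKETCH_ prefixed names are hypothetical; the numeric relocation types follow the x86-64 psABI.

```c
#include <assert.h>
#include <stdint.h>

/* Relocation type numbers from the x86-64 psABI (names are local).  */
enum
{
  SKETCH_R_X86_64_64 = 1,
  SKETCH_R_X86_64_GLOB_DAT = 6,
  SKETCH_R_X86_64_JUMP_SLOT = 7
};

/* Simplified stand-in for an Elf64_Rela entry.  */
typedef struct
{
  uint64_t r_type;
  int64_t r_addend;
} sketch_rela;

/* Value to store at the relocated address for symbol value SYM_VALUE.  */
static uint64_t
reloc_value (const sketch_rela *r, uint64_t sym_value)
{
  switch (r->r_type)
    {
    case SKETCH_R_X86_64_GLOB_DAT:
    case SKETCH_R_X86_64_JUMP_SLOT:
      return sym_value;                 /* S: r_addend is ignored */
    case SKETCH_R_X86_64_64:
      return sym_value + (uint64_t) r->r_addend;   /* S + A */
    default:
      assert (0 && "relocation type not covered by this sketch");
      return 0;
    }
}
```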

Reviewed-by: Fangrui Song <maskray@google.com>
(cherry picked from commit f8587a6189)
2022-07-18 20:45:20 -07:00
Sunil K Pandey
8ab861d295 x86_64: Implement evex512 version of strlen, strnlen, wcslen and wcsnlen
This patch implements the following evex512 versions of string functions.
Perf gain for evex512 version is up to 50% as compared to evex,
depending on length and alignment.

Placeholder function, not used by any processor at the moment.

- String length function using 512 bit vectors.
- String N length using 512 bit vectors.
- Wide string length using 512 bit vectors.
- Wide string N length using 512 bit vectors.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 9c66efb86f)
2022-07-18 20:45:20 -07:00
Adhemerval Zanella
eb9aa96fac x86_64: Remove bzero optimization
Both symbols are marked as legacy in POSIX.1-2001 and removed on
POSIX.1-2008, although the prototypes are defined for _GNU_SOURCE
or _DEFAULT_SOURCE.

GCC also replaces bcopy with memmove and bzero with memset in its default
configuration (to actually get a bzero libc call, the code must omit
the string.h inclusion and be built with -fno-builtin), so it is
highly unlikely programs are actually calling the libc bzero symbol.

On a recent Linux distro (Ubuntu 22.04), there are no bzero calls
by the installed binaries.

  $ cat count_bstring.sh
  #!/bin/bash

  files=`IFS=':';for i in $PATH; do test -d "$i" && find "$i" -maxdepth 1 -executable -type f; done`
  total=0
  for file in $files; do
    symbols=`objdump -R $file 2>&1`
    if [ $? -eq 0 ]; then
      ncalls=`echo $symbols | grep -w $1 | wc -l`
      ((total=total+ncalls))
      if [ $ncalls -gt 0 ]; then
        echo "$file: $ncalls"
      fi
    fi
  done
  echo "TOTAL=$total"
  $ ./count_bstring.sh bzero
  TOTAL=0

Checked on x86_64-linux-gnu.

(cherry picked from commit 9403b71ae9)
2022-07-18 20:45:20 -07:00
Sunil K Pandey
8d324019e6 x86_64: Remove end of line trailing spaces
This commit removes trailing spaces introduced by the following commit.

commit a775a7a3eb
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Wed Jun 23 01:56:29 2021 -0400

    x86: Fix overflow bug in wcsnlen-sse4_1 and wcsnlen-avx2 [BZ #27974]
2022-07-18 20:45:09 -07:00
Adhemerval Zanella
4b246b2bbd linux: Fix mq_timereceive check for 32 bit fallback code (BZ 29304)
On success, mq_receive() and mq_timedreceive() return the number of
bytes in the received message, so the check must test whether the
value is larger than 0.

Checked on i686-linux-gnu.

(cherry picked from commit 71d87d85bf)
2022-06-30 10:46:39 -03:00
Adhemerval Zanella
bb4148283f nios2: Remove _dl_skip_args usage (BZ# 29187)
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove ld.so's own arguments and thus _dl_skip_args
is always 0.  So there is no need to adjust the argc or argv.

Checked with qemu-user that arguments are correctly passed on both
constructors and main program.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 4868ba5d25)
2022-06-10 09:15:00 -03:00
Adhemerval Zanella
96944f0f81 hppa: Remove _dl_skip_args usage (BZ# 29165)
Unlike other architectures, hppa creates an unrelated stack frame
where the ld.so argc/argv adjustments done by ad43cac44a are not
applied to the argc/argv saved/restored by _dl_start_user.

Instead, load _dl_argc and _dl_argv directly rather than adjusting them
using the _dl_skip_args value.

Checked on hppa-linux-gnu.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 6242602273)
2022-06-10 09:13:54 -03:00
Adhemerval Zanella
a7ec6363a3 nptl: Fix __libc_cleanup_pop_restore asynchronous restore (BZ#29214)
This was due to a wrong revert done on 404656009b.

Checked on x86_64-linux-gnu.

(cherry picked from commit c7d36dcecc)
2022-06-08 17:15:08 -03:00
Matheus Castanho
4c92a10412 powerpc: Fix VSX register number on __strncpy_power9 [BZ #29197]
__strncpy_power9 initializes VR 18 with zeroes to be used throughout the
code, including when zero-padding the destination string. However, the
v18 reference was mistakenly being used for stxv and stxvl, which take a
VSX vector as operand. The code ended up using the uninitialized VSR 18
register by mistake.

Both occurrences have been changed to use the proper VSX number for VR 18
(i.e. VSR 50).
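
The numbering relationship can be written as a one-line C helper. The function name is hypothetical; the mapping itself is the architectural one, in which the 32 vector registers overlay VSRs 32 through 63.

```c
#include <assert.h>

/* Map a vector register number (VR 0..31) to the VSX register number
   that names the same physical register (VSR 32..63).  */
static unsigned int
vr_to_vsr (unsigned int vr)
{
  assert (vr < 32);
  return vr + 32;
}
```

So an instruction that takes a VSX operand must be given 50, not 18, to reach the register initialized as VR 18.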

Tested on powerpc, powerpc64 and powerpc64le.

Signed-off-by: Kewen Lin <linkw@gcc.gnu.org>
(cherry picked from commit 0218463dd8)
2022-06-07 15:34:20 -03:00
Adhemerval Zanella
61fd3e0e74 socket: Use 64 bit stat for isfdtype (BZ# 29209)
This is a missing spot initially from 52a5fe70a2.

Checked on i686-linux-gnu.

(cherry picked from commit 87f1ec12e7)
2022-06-01 13:34:51 -03:00
Adhemerval Zanella
f9c3e57ac2 posix: Use 64 bit stat for fpathconf (_PC_ASYNC_IO) (BZ# 29208)
This is a missing spot initially from 52a5fe70a2.

Checked on i686-linux-gnu.

(cherry picked from commit 6e7137f28c)
2022-06-01 13:34:51 -03:00
Adhemerval Zanella
9db6a597ef posix: Use 64 bit stat for posix_fallocate fallback (BZ# 29207)
This is a missing spot initially from 52a5fe70a2.

Checked on i686-linux-gnu.

(cherry picked from commit 574ba60fc8)
2022-06-01 13:34:51 -03:00
Arjun Shankar
ff450cdbde Fix deadlock when pthread_atfork handler calls pthread_atfork or dlclose
In multi-threaded programs, registering via pthread_atfork,
de-registering implicitly via dlclose, or running pthread_atfork
handlers during fork was protected by an internal lock.  This meant
that a pthread_atfork handler attempting to register another handler or
dlclose a dynamically loaded library would lead to a deadlock.

This commit fixes the deadlock in the following way:

During the execution of handlers at fork time, the atfork lock is
released prior to the execution of each handler and taken again upon its
return.  Any handler registrations or de-registrations that occurred
during the execution of the handler are accounted for before proceeding
with further handler execution.

If a handler that hasn't been executed yet gets de-registered by another
handler during fork, it will not be executed.   If a handler gets
registered by another handler during fork, it will not be executed
during that particular fork.

The possibility that handlers may now be registered or deregistered
during handler execution means that identifying the next handler to be
run after a given handler may register/de-register others requires some
bookkeeping.  The fork_handler struct has an additional field, 'id',
which is assigned sequentially during registration.  Thus, handlers are
executed in ascending order of 'id' during 'prepare', and descending
order of 'id' during parent/child handler execution after the fork.

Two tests are included:

* tst-atfork3: Adhemerval Zanella <adhemerval.zanella@linaro.org>
  This test exercises calling dlclose from prepare, parent, and child
  handlers.

* tst-atfork4: This test exercises calling pthread_atfork and dlclose
  from the prepare handler.

[BZ #24595, BZ #27054]

Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 52a103e237)
2022-05-30 12:38:32 +02:00
Noah Goldstein
83ae8287c1 x86: Fallback {str|wcs}cmp RTM in the ncmp overflow case [BZ #29127]
Re-cherry-pick commit c627209832 for strcmp-avx2.S change which was
omitted in initial cherry pick because at the time this bug was not
present on release branch.

Fixes BZ #29127.

In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. This would have
no checks around vzeroupper and would trigger spurious
aborts. This commit fixes that.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM.

Co-authored-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c627209832)
2022-05-25 15:05:05 -07:00
Szabolcs Nagy
b2585cae28 linux: Add a getauxval test [BZ #23293]
This is for bug 23293 and it relies on the glibc test system running
tests via explicit ld.so invocation by default.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 9faf5262c7)
2022-05-19 16:48:47 +01:00
Szabolcs Nagy
c73c79af7d rtld: Use generic argv adjustment in ld.so [BZ #23293]
When an executable is invoked as

  ./ld.so [ld.so-args] ./exe [exe-args]

then the argv is adjusted in ld.so before calling the entry point of
the executable so ld.so args are not visible to it.  On most targets
this requires moving argv, env and auxv on the stack to ensure correct
stack alignment at the entry point.  This had several issues:

- The code for this adjustment on the stack is written in asm as part
  of the target specific ld.so _start code which is hard to maintain.

- The adjustment is done after _dl_start returns, where it's too late
  to update GLRO(dl_auxv), as it is already readonly, so it points to
  memory that was clobbered by the adjustment. This is bug 23293.

- _environ is also wrong in ld.so after the adjustment, but it is
  likely not used after _dl_start returns so this is not user visible.

- _dl_argv was updated, but for this it was moved out of relro, which
  changes security properties across targets unnecessarily.

This patch introduces a generic _dl_start_args_adjust function that
handles the argument adjustments after ld.so processed its own args
and before relro protection is applied.
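
As a rough illustration of the idea only, the following C sketch drops leading entries from an argv array. It is not the glibc function: the real adjustment also has to move env and auxv on the stack, and the name `drop_leading_args` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Remove the first SKIP entries of ARGV in place and return the new
   argument count.  The terminating NULL pointer is moved as well.  */
static int
drop_leading_args (int argc, char **argv, int skip)
{
  assert (argv != NULL && skip >= 0 && skip <= argc);
  memmove (argv, argv + skip, (size_t) (argc - skip + 1) * sizeof *argv);
  return argc - skip;
}
```

For `./ld.so --library-path /lib ./exe arg`, dropping the first three entries leaves `./exe arg`, which is what the executable's entry point should see.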

The same algorithm is used on all targets, _dl_skip_args is now 0, so
existing target specific adjustment code is no longer used.  The bug
affects aarch64, alpha, arc, arm, csky, ia64, nios2, s390-32 and sparc,
other targets don't need the change in principle, only for consistency.

The GNU Hurd start code relied on _dl_skip_args after dl_main returned,
now it checks directly if args were adjusted and fixes the Hurd startup
data accordingly.

Follow up patches can remove _dl_skip_args and DL_ARGV_NOT_RELRO.

Tested on aarch64-linux-gnu and cross tested on i686-gnu.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit ad43cac44a)
2022-05-19 16:48:47 +01:00
Stefan Liebler
89b638f48a S390: Enable static PIE
This commit enables static PIE on 64bit.  On 31bit, static PIE is
not supported.

A new configure check in sysdeps/s390/s390-64/configure.ac also performs
a minimal test for requirements in ld:
Ensure you also have those patches for:
- binutils (ld)
  - "[PR ld/22263] s390: Avoid dynamic TLS relocs in PIE"
    https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=26b1426577b5dcb32d149c64cca3e603b81948a9
    (Tested by configure check above)
    Otherwise there will be a R_390_TLS_TPOFF relocation, which fails to
    be processed in _dl_relocate_static_pie() as static TLS map is not setup.
  - "s390: Add DT_JMPREL pointing to .rela.[i]plt with static-pie"
    https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=d942d8db12adf4c9e5c7d9ed6496a779ece7149e
    (We can't test it in configure as we are not able to link a static PIE
    executable if the system glibc lacks static PIE support)
    Otherwise there won't be DT_JMPREL, DT_PLTRELA, DT_PLTRELASZ entries
    and the IFUNC symbols are not processed, which leads to crashes.

- kernel (the mentioned links to the commits belong to 5.19 merge window):
  - "s390/mmap: increase stack/mmap gap to 128MB"
    https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features&id=f2f47d0ef72c30622e62471903ea19446ea79ee2
  - "s390/vdso: move vdso mapping to its own function"
    https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features&id=57761da4dc5cd60bed2c81ba0edb7495c3c740b8
  - "s390/vdso: map vdso above stack"
    https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features&id=9e37a2e8546f9e48ea76c839116fa5174d14e033
  - "s390/vdso: add vdso randomization"
    https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features&id=41cd81abafdc4e58a93fcb677712a76885e3ca25
  (We can't test the kernel of the target system)
  Otherwise if /proc/sys/kernel/randomize_va_space is turned off (0),
  static PIE executables like ldconfig will crash.  During startup, sbrk is
  used to enlarge the HEAP.  Unfortunately the underlying brk syscall fails
  as there is not enough space after the HEAP.  Then the address of the TLS
  image is invalid and the following memcpy in __libc_setup_tls() leads
  to a segfault.
  If /proc/sys/kernel/randomize_va_space is activated (default: 2), there
  is enough space after HEAP.

- glibc
  - "Linux: Define MMAP_CALL_INTERNAL"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=c1b68685d438373efe64e5f076f4215723004dfb
  - "i386: Remove OPTIMIZE_FOR_GCC_5 from Linux libc-do-syscall.S"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=6e5c7a1e262961adb52443ab91bd2c9b72316402
  - "i386: Honor I386_USE_SYSENTER for 6-argument Linux system calls"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=60f0f2130d30cfd008ca39743027f1e200592dff
  - "ia64: Always define IA64_USE_NEW_STUB as a flag macro"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=18bd9c3d3b1b6a9182698c85354578d1d58e9d64
  - "Linux: Implement a useful version of _startup_fatal"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=a2a6bce7d7e52c1c34369a7da62c501cc350bc31
  - "Linux: Introduce __brk_call for invoking the brk system call"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=b57ab258c1140bc45464b4b9908713e3e0ee35aa
  - "csu: Implement and use _dl_early_allocate during static startup"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=f787e138aa0bf677bf74fa2a08595c446292f3d7
  The mentioned patch series by Florian Weimer avoids the mentioned failing
  sbrk syscall by falling back to mmap.

This commit also adjusts startup code in start.S to be ready for static PIE.
We have to add a wrapper function for main as we are not allowed to use
GOT relocations before __libc_start_main is called.
(Compare also to:
- commit 14d886edbd
  "aarch64: fix start code for static pie"
- commit 3d1d79283e
  "aarch64: fix static pie enabled libc when main is in a shared library"
)

(cherry picked from commit 728894dba4)
2022-05-19 17:15:57 +02:00
Florian Weimer
ede8d94d15 csu: Implement and use _dl_early_allocate during static startup
This implements mmap fallback for a brk failure during TLS
allocation.
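
A user-space C sketch of the fallback pattern, assuming a POSIX system that provides sbrk and anonymous mmap. The function name is hypothetical, and the sketch uses the public sbrk and mmap wrappers, which early startup code inside libc cannot rely on in the same way.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Try to extend the heap first; if that fails, fall back to an
   anonymous private mapping.  Returns NULL if both attempts fail.  */
static void *
early_allocate_sketch (size_t size)
{
  assert (size > 0 && size <= INTPTR_MAX);

  void *p = sbrk ((intptr_t) size);
  if (p != (void *) -1)
    return p;

  p = mmap (NULL, size, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return p == MAP_FAILED ? NULL : p;
}
```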

scripts/tls-elf-edit.py is updated to support the new patching method.
The script no longer requires that the input object is of ET_DYN
type.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit f787e138aa)
2022-05-19 12:13:53 +02:00
Florian Weimer
43d77ef9b8 Linux: Introduce __brk_call for invoking the brk system call
Alpha and sparc can now use the generic implementation.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit b57ab258c1)
2022-05-19 12:13:53 +02:00
Florian Weimer
e7ca2a475c Linux: Implement a useful version of _startup_fatal
On i386 and ia64, the TCB is not available at this point.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit a2a6bce7d7)
2022-05-19 12:13:53 +02:00
Florian Weimer
b2387bea84 ia64: Always define IA64_USE_NEW_STUB as a flag macro
And keep the previous definition if it exists.  This allows
disabling IA64_USE_NEW_STUB while keeping USE_DL_SYSINFO defined.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 18bd9c3d3b)
2022-05-19 12:13:53 +02:00
Florian Weimer
b38c9cdb58 Linux: Define MMAP_CALL_INTERNAL
Unlike MMAP_CALL, this avoids a TCB dependency for an errno update
on failure.

<mmap_internal.h> cannot be included as is on several architectures
due to the definition of page_unit, so introduce a separate header
file for the definition of MMAP_CALL and MMAP_CALL_INTERNAL,
<mmap_call.h>.

Reviewed-by: Stefan Liebler <stli@linux.ibm.com>
(cherry picked from commit c1b68685d4)
2022-05-19 12:13:53 +02:00
Florian Weimer
1a5b9d1a23 i386: Honor I386_USE_SYSENTER for 6-argument Linux system calls
Introduce an int-80h-based version of __libc_do_syscall and use
it if I386_USE_SYSENTER is defined as 0.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 60f0f2130d)
2022-05-19 12:13:53 +02:00
Florian Weimer
1e7b011f87 i386: Remove OPTIMIZE_FOR_GCC_5 from Linux libc-do-syscall.S
After commit a78e6a10d0
("i386: Remove broken CAN_USE_REGISTER_ASM_EBP (bug 28771)"),
it is never defined.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 6e5c7a1e26)
2022-05-19 12:13:53 +02:00
Fangrui Song
be9240c84c elf: Remove __libc_init_secure
After 73fc4e28b9,
__libc_enable_secure_decided is always 0 and a statically linked
executable may overwrite __libc_enable_secure without considering
AT_SECURE.

The __libc_enable_secure has been correctly initialized in _dl_aux_init,
so just remove __libc_enable_secure_decided and __libc_init_secure.
This allows us to remove some startup_get*id functions from
22b79ed7f4.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 3e9acce8c5)
2022-05-19 12:13:52 +02:00
Florian Weimer
ff900fad89 Linux: Consolidate auxiliary vector parsing (redo)
And optimize it slightly.

This is commit 8c8510ab27 revised.

In _dl_aux_init in elf/dl-support.c, use an explicit loop
and -fno-tree-loop-distribute-patterns to avoid memset.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 73fc4e28b9)
2022-05-19 12:13:51 +02:00
Florian Weimer
28bdb03b1b Linux: Include <dl-auxv.h> in dl-sysdep.c only for SHARED
Otherwise, <dl-auxv.h> on POWER ends up being included twice,
once in dl-sysdep.c, once in dl-support.c.  That leads to a linker
failure due to multiple definitions of _dl_cache_line_size.

Fixes commit d96d2995c1
("Revert "Linux: Consolidate auxiliary vector parsing").

(cherry picked from commit 098c795e85)
2022-05-19 11:49:06 +02:00
Florian Weimer
1cc4ddfeeb Revert "Linux: Consolidate auxiliary vector parsing"
This reverts commit 8c8510ab27.  The
revert is not perfect because the commit included a bug fix for
_dl_sysdep_start with an empty argv, introduced in commit
2d47fa6862 ("Linux: Remove
DL_FIND_ARG_COMPONENTS"), and this bug fix is kept.

The revert is necessary because the reverted commit introduced an
early memset call on aarch64, which leads to crash due to lack of TCB
initialization.

(cherry picked from commit d96d2995c1)
2022-05-19 11:49:06 +02:00
Florian Weimer
4b9cd5465d Linux: Consolidate auxiliary vector parsing
And optimize it slightly.

The large switch statement in _dl_sysdep_start can be replaced with
a large array.  This reduces source code and binary size.  On
i686-linux-gnu:

Before:

   text	   data	    bss	    dec	    hex	filename
   7791	     12	      0	   7803	   1e7b	elf/dl-sysdep.os

After:

   text	   data	    bss	    dec	    hex	filename
   7135	     12	      0	   7147	   1beb	elf/dl-sysdep.os
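
The switch-to-array idea can be sketched in C as follows. The entry type, the array limit, and the SKETCH_ prefixed names are hypothetical; the AT_* numbers are the Linux values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_AT_NULL 0
#define SKETCH_AT_PAGESZ 6
#define SKETCH_AT_SECURE 23
#define SKETCH_AUXV_LIMIT 32    /* hypothetical number of tracked types */

/* Simplified stand-in for an auxiliary vector entry.  */
typedef struct
{
  uint64_t a_type;
  uint64_t a_val;
} sketch_auxv_entry;

/* Store each known entry at values[a_type] instead of dispatching on
   a_type with one switch case per type.  */
static void
parse_auxv (const sketch_auxv_entry *av, uint64_t values[SKETCH_AUXV_LIMIT])
{
  assert (av != NULL && values != NULL);

  /* Explicit zeroing loop, written out rather than calling memset.  */
  for (size_t i = 0; i < SKETCH_AUXV_LIMIT; i++)
    values[i] = 0;

  for (; av->a_type != SKETCH_AT_NULL; av++)
    if (av->a_type < SKETCH_AUXV_LIMIT)
      values[av->a_type] = av->a_val;
}
```

Callers then read `values[SKETCH_AT_PAGESZ]` and similar slots directly; types at or above the limit are skipped.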

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 8c8510ab27)
2022-05-19 11:49:06 +02:00
Florian Weimer
08728256fa Linux: Assume that NEED_DL_SYSINFO_DSO is always defined
The definition itself is still needed for generic code.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit f19fc997a5)
2022-05-19 11:49:06 +02:00
Florian Weimer
458733fffe Linux: Remove DL_FIND_ARG_COMPONENTS
The generic definition is always used since the Native Client
port has been removed.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 2d47fa6862)
2022-05-19 11:49:06 +02:00
Florian Weimer
2139b1848e Linux: Remove HAVE_AUX_SECURE, HAVE_AUX_XID, HAVE_AUX_PAGESIZE
They are always defined.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit b9c3d3382f)
2022-05-19 11:49:06 +02:00
Florian Weimer
b0bd6a1323 elf: Merge dl-sysdep.c into the Linux version
The generic version is the de-facto Linux implementation.  It
requires an auxiliary vector, so Hurd does not use it.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 91c0a47ffb)
2022-05-19 11:43:53 +02:00
Noah Goldstein
596c9a32cc x86: Optimize {str|wcs}rchr-evex
The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.

Geometric Mean of all benchmarks New / Old: 0.755
See email for all results.

Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit c966099cdc)
2022-05-16 18:55:54 -07:00
Noah Goldstein
00f09a14d2 x86: Optimize {str|wcs}rchr-avx2
The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.

Geometric Mean of all benchmarks New / Old: 0.832
See email for all results.

Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit df7e295d18)
2022-05-16 18:55:45 -07:00
Noah Goldstein
0a11305416 x86: Optimize {str|wcs}rchr-sse2
The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.

Geometric Mean of all benchmarks New / Old: 0.741
See email for all results.

Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 5307aa9c18)
2022-05-16 18:55:37 -07:00
Noah Goldstein
df5de87260 x86: Cleanup page cross code in memcmp-avx2-movbe.S
The old code was both inefficient and wasted code size. The new code is
smaller (-62 bytes) and has comparable or better performance in the page
cross case.

geometric_mean(N=20) of page cross cases New / Original: 0.960

size, align0, align1, ret, New Time/Old Time
   1,   4095,      0,   0,             1.001
   1,   4095,      0,   1,             0.999
   1,   4095,      0,  -1,               1.0
   2,   4094,      0,   0,               1.0
   2,   4094,      0,   1,               1.0
   2,   4094,      0,  -1,               1.0
   3,   4093,      0,   0,               1.0
   3,   4093,      0,   1,               1.0
   3,   4093,      0,  -1,               1.0
   4,   4092,      0,   0,             0.987
   4,   4092,      0,   1,               1.0
   4,   4092,      0,  -1,               1.0
   5,   4091,      0,   0,             0.984
   5,   4091,      0,   1,             1.002
   5,   4091,      0,  -1,             1.005
   6,   4090,      0,   0,             0.993
   6,   4090,      0,   1,             1.001
   6,   4090,      0,  -1,             1.003
   7,   4089,      0,   0,             0.991
   7,   4089,      0,   1,               1.0
   7,   4089,      0,  -1,             1.001
   8,   4088,      0,   0,             0.875
   8,   4088,      0,   1,             0.881
   8,   4088,      0,  -1,             0.888
   9,   4087,      0,   0,             0.872
   9,   4087,      0,   1,             0.879
   9,   4087,      0,  -1,             0.883
  10,   4086,      0,   0,             0.878
  10,   4086,      0,   1,             0.886
  10,   4086,      0,  -1,             0.873
  11,   4085,      0,   0,             0.878
  11,   4085,      0,   1,             0.881
  11,   4085,      0,  -1,             0.879
  12,   4084,      0,   0,             0.873
  12,   4084,      0,   1,             0.889
  12,   4084,      0,  -1,             0.875
  13,   4083,      0,   0,             0.873
  13,   4083,      0,   1,             0.863
  13,   4083,      0,  -1,             0.863
  14,   4082,      0,   0,             0.838
  14,   4082,      0,   1,             0.869
  14,   4082,      0,  -1,             0.877
  15,   4081,      0,   0,             0.841
  15,   4081,      0,   1,             0.869
  15,   4081,      0,  -1,             0.876
  16,   4080,      0,   0,             0.988
  16,   4080,      0,   1,              0.99
  16,   4080,      0,  -1,             0.989
  17,   4079,      0,   0,             0.978
  17,   4079,      0,   1,             0.981
  17,   4079,      0,  -1,              0.98
  18,   4078,      0,   0,             0.981
  18,   4078,      0,   1,              0.98
  18,   4078,      0,  -1,             0.985
  19,   4077,      0,   0,             0.977
  19,   4077,      0,   1,             0.979
  19,   4077,      0,  -1,             0.986
  20,   4076,      0,   0,             0.977
  20,   4076,      0,   1,             0.986
  20,   4076,      0,  -1,             0.984
  21,   4075,      0,   0,             0.977
  21,   4075,      0,   1,             0.983
  21,   4075,      0,  -1,             0.988
  22,   4074,      0,   0,             0.983
  22,   4074,      0,   1,             0.994
  22,   4074,      0,  -1,             0.993
  23,   4073,      0,   0,              0.98
  23,   4073,      0,   1,             0.992
  23,   4073,      0,  -1,             0.995
  24,   4072,      0,   0,             0.989
  24,   4072,      0,   1,             0.989
  24,   4072,      0,  -1,             0.991
  25,   4071,      0,   0,              0.99
  25,   4071,      0,   1,             0.999
  25,   4071,      0,  -1,             0.996
  26,   4070,      0,   0,             0.993
  26,   4070,      0,   1,             0.995
  26,   4070,      0,  -1,             0.998
  27,   4069,      0,   0,             0.993
  27,   4069,      0,   1,             0.999
  27,   4069,      0,  -1,               1.0
  28,   4068,      0,   0,             0.997
  28,   4068,      0,   1,               1.0
  28,   4068,      0,  -1,             0.999
  29,   4067,      0,   0,             0.996
  29,   4067,      0,   1,             0.999
  29,   4067,      0,  -1,             0.999
  30,   4066,      0,   0,             0.991
  30,   4066,      0,   1,             1.001
  30,   4066,      0,  -1,             0.999
  31,   4065,      0,   0,             0.988
  31,   4065,      0,   1,             0.998
  31,   4065,      0,  -1,             0.998
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 23102686ec)
2022-05-16 18:55:24 -07:00
Noah Goldstein
ffe75982cc x86: Remove memcmp-sse4.S
Code didn't actually use any sse4 instructions since `ptest` was
removed in:

commit 2f9062d717
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Wed Nov 10 16:18:56 2021 -0600

    x86: Shrink memcmp-sse4.S code size

The new memcmp-sse2 implementation is also faster.

geometric_mean(N=20) of page cross cases SSE2 / SSE4: 0.905

Note there are two regressions preferring SSE2 for Size = 1 and Size =
65.

Size = 1:
size, align0, align1, ret, New Time/Old Time
   1,      1,      1,   0,               1.2
   1,      1,      1,   1,             1.197
   1,      1,      1,  -1,               1.2

This is intentional. Size == 1 is significantly less hot based on
profiles of GCC11 and Python3 than sizes [4, 8] (which is made
hotter).

Python3 Size = 1        -> 13.64%
Python3 Size = [4, 8]   -> 60.92%

GCC11   Size = 1        ->  1.29%
GCC11   Size = [4, 8]   -> 33.86%

size, align0, align1, ret, New Time/Old Time
   4,      4,      4,   0,             0.622
   4,      4,      4,   1,             0.797
   4,      4,      4,  -1,             0.805
   5,      5,      5,   0,             0.623
   5,      5,      5,   1,             0.777
   5,      5,      5,  -1,             0.802
   6,      6,      6,   0,             0.625
   6,      6,      6,   1,             0.813
   6,      6,      6,  -1,             0.788
   7,      7,      7,   0,             0.625
   7,      7,      7,   1,             0.799
   7,      7,      7,  -1,             0.795
   8,      8,      8,   0,             0.625
   8,      8,      8,   1,             0.848
   8,      8,      8,  -1,             0.914
   9,      9,      9,   0,             0.625

Size = 65:
size, align0, align1, ret, New Time/Old Time
  65,      0,      0,   0,             1.103
  65,      0,      0,   1,             1.216
  65,      0,      0,  -1,             1.227
  65,     65,      0,   0,             1.091
  65,      0,     65,   1,              1.19
  65,     65,     65,  -1,             1.215

This is because A) the checks in range [65, 96] are now unrolled 2x
and B) because smaller values <= 16 are now given a hotter path. By
contrast the SSE4 version has a branch for Size = 80. The unrolled
version gets better performance for returns which need both
comparisons.

size, align0, align1, ret, New Time/Old Time
 128,      4,      8,   0,             0.858
 128,      4,      8,   1,             0.879
 128,      4,      8,  -1,             0.888

As well, outside of microbenchmarks, in environments that are not fully
predictable, the branch will have a real cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 7cbc03d030)
2022-05-16 18:55:16 -07:00
Noah Goldstein
4ff6ae069b x86: Small improvements for wcslen
Just a few QOL changes.
    1. Prefer `add` over `lea` as it has more execution units it can
       run on.
    2. Don't break macro-fusion between `test` and `jcc`
    3. Reduce code size by removing gratuitous padding bytes (-90
       bytes).

geometric_mean(N=20) of all benchmarks New / Original: 0.959

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 244b415d38)
2022-05-16 18:55:09 -07:00
Noah Goldstein
80883f4354 x86: Remove AVX str{n}casecmp
The rationale is:

1. SSE42 has nearly identical logic so any benefit is minimal (3.4%
   regression on Tigerlake using SSE42 versus AVX across the
   benchtest suite).
2. AVX2 version covers the majority of targets that previously
   preferred it.
3. The targets where AVX would still be best (SnB and IVB) are
   becoming outdated.

All in all, the code size saving is worth it.

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 305769b2a1)
2022-05-16 18:55:02 -07:00
Noah Goldstein
b13a2e68eb x86: Add EVEX optimized str{n}casecmp
geometric_mean(N=40) of all benchmarks EVEX / SSE42: .621

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 84e7c46df4)
2022-05-16 18:54:52 -07:00
Noah Goldstein
3051cf3e74 x86: Add AVX2 optimized str{n}casecmp
geometric_mean(N=40) of all benchmarks AVX2 / SSE42: .702

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit bbf8122234)
2022-05-16 18:54:41 -07:00
Noah Goldstein
3605c74407 x86: Optimize str{n}casecmp TOLOWER logic in strcmp-sse42.S
Slightly faster method of doing TOLOWER that saves an
instruction.

Also replace the hard-coded 5-byte nop with .p2align 4. On builds with
CET enabled this misaligned the entry to strcasecmp.

geometric_mean(N=40) of all benchmarks New / Original: .920

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit d154758e61)
2022-05-16 18:54:27 -07:00
Noah Goldstein
5997011826 x86: Optimize str{n}casecmp TOLOWER logic in strcmp.S
Slightly faster method of doing TOLOWER that saves an
instruction.

Also replace the hard-coded 5-byte nop with .p2align 4. On builds with
CET enabled this misaligned the entry to strcasecmp.

geometric_mean(N=40) of all benchmarks New / Original: .894

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 670b54bc58)
2022-05-16 18:54:17 -07:00
Noah Goldstein
a4b1cae068 x86: Remove strspn-sse2.S and use the generic implementation
The generic implementation is faster.

geometric_mean(N=20) of all benchmarks New / Original: .710

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 9c8a6ad620)
2022-05-16 18:54:09 -07:00
Noah Goldstein
3811544655 x86: Remove strpbrk-sse2.S and use the generic implementation
The generic implementation is faster (see strcspn commit).

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 6533585352)
2022-05-16 18:53:59 -07:00
Noah Goldstein
0dafa75e3c x86: Remove strcspn-sse2.S and use the generic implementation
The generic implementation is faster.

geometric_mean(N=20) of all benchmarks New / Original: .678

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit fe28e7d9d9)
2022-05-16 18:53:48 -07:00
Noah Goldstein
0a2da01110 x86: Optimize strspn in strspn-c.c
Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of
_mm_cmpistri. Also change offset to unsigned to avoid unnecessary
sign extensions.
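
A minimal sketch of the compare-and-movemask idea for one 16-byte
block, assuming an x86 target with SSE2 and a GCC-compatible compiler
(for `__builtin_ctz`). The helper name `sketch_first_nul16` is
hypothetical and this is not the code from strspn-c.c.

```c
#include <emmintrin.h>

/* Return the index of the first NUL byte in the 16 bytes at P, or 16
   if none of them is NUL.  All 16 bytes must be readable.  */
static unsigned int
sketch_first_nul16 (const char *p)
{
  __m128i chunk = _mm_loadu_si128 ((const __m128i *) p);
  /* Each byte of IS_NUL is 0xff where CHUNK holds a NUL, else 0.  */
  __m128i is_nul = _mm_cmpeq_epi8 (chunk, _mm_setzero_si128 ());
  /* Collect the top bit of each byte into a 16-bit mask.  */
  unsigned int mask = (unsigned int) _mm_movemask_epi8 (is_nul);
  return mask == 0 ? 16 : (unsigned int) __builtin_ctz (mask);
}
```

The result is kept unsigned, in line with the note above about
avoiding unnecessary sign extensions.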

geometric_mean(N=20) of all benchmarks that don't fall back on
sse2; New / Original: .901

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 412d103431)
2022-05-16 18:53:39 -07:00
Noah Goldstein
0ae1006967 x86: Optimize strcspn and strpbrk in strcspn-c.c
Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of
_mm_cmpistri. Also change offset to unsigned to avoid unnecessary
sign extensions.

geometric_mean(N=20) of all benchmarks that don't fall back on
sse2/strlen; New / Original: .928

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 30d627d477)
2022-05-16 18:53:28 -07:00
Noah Goldstein
dd6d3a0bbc x86: Code cleanup in strchr-evex and comment justifying branch
Small code cleanup for size: -81 bytes.

Add comment justifying using a branch to do NULL/non-null return.

All string/memory tests pass and no regressions in benchtests.

geometric_mean(N=20) of all benchmarks New / Original: .985
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit ec285ea904)
2022-05-16 18:53:19 -07:00
Noah Goldstein
3c55c20756 x86: Code cleanup in strchr-avx2 and comment justifying branch
Small code cleanup for size: -53 bytes.

Add comment justifying using a branch to do NULL/non-null return.

All string/memory tests pass and no regressions in benchtests.

geometric_mean(N=20) of all benchmarks Original / New: 1.00
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit a6fbf4d51e)
2022-05-16 18:53:07 -07:00
Adhemerval Zanella
dd457606ca x86_64: Remove bcopy optimizations
The symbol is not present in the current POSIX specification and the
compiler already generates a memmove call.

(cherry picked from commit bf92893a14)
2022-05-16 18:52:57 -07:00
H.J. Lu
37f373e334 x86-64: Remove bzero weak alias in SS2 memset
commit 3d9f171bfb
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Feb 7 05:55:15 2022 -0800

    x86-64: Optimize bzero

added the optimized bzero.  Remove bzero weak alias in SSE2 memset to
avoid undefined __bzero in memset-sse2-unaligned-erms.

(cherry picked from commit 0fb8800029)
2022-05-16 18:52:47 -07:00
H.J. Lu
6cba46c858 x86_64/multiarch: Sort sysdep_routines and put one entry per line
(cherry picked from commit c328d0152d)
2022-05-16 18:52:35 -07:00
H.J. Lu
8de6e4a199 x86: Improve L to support L(XXX_SYMBOL (YYY, ZZZ))
(cherry picked from commit 1283948f23)
2022-05-16 18:52:19 -07:00
Noah Goldstein
e123f08ad5 x86: Fix fallback for wcsncmp_avx2 in strcmp-avx2.S [BZ #28896]
Overflow case for __wcsncmp_avx2_rtm should be __wcscmp_avx2_rtm not
__wcscmp_avx2.

commit ddf0992cf5
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Sun Jan 9 16:02:21 2022 -0600

    x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]

Set the wrong fallback function for `__wcsncmp_avx2_rtm`. It was set
to fallback on to `__wcscmp_avx2` instead of `__wcscmp_avx2_rtm` which
can cause spurious aborts.

This change will need to be backported.

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 9fef7039a7)
2022-05-05 09:13:13 -07:00
Noah Goldstein
5373c90f2e x86: Fix bug in strncmp-evex and strncmp-avx2 [BZ #28895]
Logic can read before the start of `s1` / `s2` if both `s1` and `s2`
are near the start of a page. To avoid having the result contaminated by
these comparisons the `strcmp` variants would mask off these
comparisons. This was missing in the `strncmp` variants causing
the bug. This commit adds the masking to `strncmp` so that out of
range comparisons don't affect the result.
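
One way to picture the masking, as a scalar sketch rather than the
assembly in the commit: assume a 32-bit mask with bit i set when byte i
of an aligned 32-byte block compared unequal (or was NUL), and an
offset giving where the string really starts inside that block. The
helper name `sketch_drop_leading` is hypothetical.

```c
#include <stdint.h>

/* Discard the mask bits that belong to bytes before the start of the
   string.  After the shift, bit 0 corresponds to the first real byte,
   so comparisons from out-of-range bytes cannot affect the result.  */
static uint32_t
sketch_drop_leading (uint32_t mask, unsigned int offset)
{
  /* OFFSET is reduced modulo 32 so the shift count is always valid.  */
  return mask >> (offset & 31);
}
```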

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass, as
well as a full xcheck on x86_64 linux.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit e108c02a5e)
2022-05-05 09:11:49 -07:00
Noah Goldstein
70509f9b48 x86: Set .text section in memset-vec-unaligned-erms
commit 3d9f171bfb
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Feb 7 05:55:15 2022 -0800

    x86-64: Optimize bzero

Removed setting the .text section for the code. This commit
adds that back.

(cherry picked from commit 7912236f4a)
2022-05-05 09:11:13 -07:00
H.J. Lu
5cb6329652 x86-64: Optimize bzero
memset with zero as the value to set is by far the majority value (99%+
for Python3 and GCC).

bzero can be slightly more optimized for this case by using a zero-idiom
xor for broadcasting the set value to a register (vector or GPR).

Co-developed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 3d9f171bfb)
2022-05-05 09:10:53 -07:00
Noah Goldstein
190ea5f7e4 x86: Remove SSSE3 instruction for broadcast in memset.S (SSE2 Only)
commit b62ace2740
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Sun Feb 6 00:54:18 2022 -0600

    x86: Improve vec generation in memset-vec-unaligned-erms.S

Revert usage of 'pshufb' in broadcast logic as it is an SSSE3
instruction and memset.S is restricted to only SSE2 instructions.

(cherry picked from commit 1b0c60f95b)
2022-05-05 08:54:23 -07:00
Noah Goldstein
ea19c490a3 x86: Improve vec generation in memset-vec-unaligned-erms.S
No bug.

Split vec generation into multiple steps. This allows the
broadcast in AVX2 to use 'xmm' registers for the L(less_vec)
case. This saves an expensive lane-cross instruction and removes
the need for 'vzeroupper'.

For SSE2 replace 2x 'punpck' instructions with zero-idiom 'pxor' for
byte broadcast.

Results for memset-avx2 small (geomean of N = 20 benchset runs).

size, Old Time, New Time, New / Old
   0,    4.100,    3.831,     0.934
   1,    5.074,    4.399,     0.867
   2,    4.433,    4.411,     0.995
   4,    4.487,    4.415,     0.984
   8,    4.454,    4.396,     0.987
  16,    4.502,    4.443,     0.987

All relevant string/wcsmbs tests are passing.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit b62ace2740)
2022-05-05 08:54:11 -07:00
H.J. Lu
53ddafe917 x86-64: Fix strcmp-evex.S
Change "movl %edx, %rdx" to "movl %edx, %edx" in:

commit 8418eb3ff4
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Mon Jan 10 15:35:39 2022 -0600

    x86: Optimize strcmp-evex.S

(cherry picked from commit 0e0199a9e0)
2022-05-05 08:54:03 -07:00
H.J. Lu
d299032743 x86-64: Fix strcmp-avx2.S
Change "movl %edx, %rdx" to "movl %edx, %edx" in:

commit b77b06e0e2
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Mon Jan 10 15:35:38 2022 -0600

    x86: Optimize strcmp-avx2.S

(cherry picked from commit c15efd011c)
2022-05-05 08:53:50 -07:00
Noah Goldstein
c41a66767d x86: Optimize strcmp-evex.S
Optimizations are primarily to the loop logic and how the page cross
logic interacts with the loop.

The page cross logic is at times more expensive for short strings near
the end of a page but not crossing the page. This is done to retest
the page cross conditions with a non-faulty check and to improve the
logic for entering the loop afterwards. This is only in particular cases,
however, and is generally made up for by more than 10x improvements on
the transition from the page cross -> loop case.
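
For readers unfamiliar with the page cross condition, a minimal scalar
sketch, assuming 4096-byte pages and 32-byte vector loads. Both
constants and the helper name are assumptions for illustration; the
commit itself implements the check in assembly.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u   /* Assumed page size.  */
#define SKETCH_VEC_SIZE 32u      /* Assumed vector load width.  */

/* Return true if a SKETCH_VEC_SIZE-byte load starting at ADDR would
   touch the following page, which may be unmapped and fault.  */
static bool
sketch_load_crosses_page (uintptr_t addr)
{
  return (addr & (SKETCH_PAGE_SIZE - 1))
         > SKETCH_PAGE_SIZE - SKETCH_VEC_SIZE;
}
```

A string that starts near the end of a page but ends before it still
trips this check, which is the "near the end of a page but not
crossing the page" case described above.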

The non-page cross cases as well are nearly universally improved.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 8418eb3ff4)
2022-05-05 08:53:42 -07:00
Noah Goldstein
0d5b36c8cc x86: Optimize strcmp-avx2.S
Optimizations are primarily to the loop logic and how the page cross
logic interacts with the loop.

The page cross logic is at times more expensive for short strings near
the end of a page but not crossing the page. This is done to retest
the page cross conditions with a non-faulty check and to improve the
logic for entering the loop afterwards. This is only in particular cases,
however, and is generally made up for by more than 10x improvements on
the transition from the page cross -> loop case.

The non-page cross cases are improved most for smaller sizes [0, 128]
and go about even for (128, 4096]. The loop page cross logic is
improved so some more significant speedup is seen there as well.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit b77b06e0e2)
2022-05-05 08:53:34 -07:00
Joseph Myers
97cb8227b8 Add HWCAP2_AFP, HWCAP2_RPRES from Linux 5.17 to AArch64 bits/hwcap.h
Add the new HWCAP2_AFP and HWCAP2_RPRES constants from Linux 5.17.
Tested with build-many-glibcs.py for aarch64-linux-gnu.

(cherry picked from commit 866c599182)
2022-05-03 11:08:52 +02:00
Szabolcs Nagy
c108e87026 aarch64: Add HWCAP2_ECV from Linux 5.16
Indicates the availability of enhanced counter virtualization extension
of armv8.6-a with self-synchronized virtual counter CNTVCTSS_EL0 usable
in userspace.

(cherry picked from commit 5a1be8ebdf)
2022-05-03 11:08:52 +02:00
Joseph Myers
f858bc3093 Add SOL_MPTCP, SOL_MCTP from Linux 5.16 to bits/socket.h
Linux 5.16 adds constants SOL_MPTCP and SOL_MCTP to the getsockopt /
setsockopt levels; add these constants to bits/socket.h.

Tested for x86_64.

(cherry picked from commit fdc1ae67fe)
2022-05-03 11:08:52 +02:00
Joseph Myers
0499c3a95f Update kernel version to 5.17 in tst-mman-consts.py
This patch updates the kernel version in the test tst-mman-consts.py
to 5.17.  (There are no new MAP_* constants covered by this test in
5.17 that need any other header changes.)

Tested with build-many-glibcs.py.

(cherry picked from commit 23808a422e)
2022-05-03 11:08:52 +02:00
Joseph Myers
81181ba5d9 Update kernel version to 5.16 in tst-mman-consts.py
This patch updates the kernel version in the test tst-mman-consts.py
to 5.16.  (There are no new MAP_* constants covered by this test in
5.16 that need any other header changes.)

Tested with build-many-glibcs.py.

(cherry picked from commit 790a607e23)
2022-05-03 11:08:52 +02:00
Joseph Myers
6af165658d Update syscall lists for Linux 5.17
Linux 5.17 has one new syscall, set_mempolicy_home_node.  Update
syscall-names.list and regenerate the arch-syscall.h headers with
build-many-glibcs.py update-syscalls.

Tested with build-many-glibcs.py.

(cherry picked from commit 8ef9196b26)
2022-05-03 11:08:52 +02:00
Joseph Myers
5146b73d72 Add ARPHRD_CAN, ARPHRD_MCTP to net/if_arp.h
Add the constant ARPHRD_MCTP, from Linux 5.15, to net/if_arp.h, along
with ARPHRD_CAN which was added to Linux in version 2.6.25 (commit
cd05acfe65ed2cf2db683fa9a6adb8d35635263b, "[CAN]: Allocate protocol
numbers for PF_CAN") but apparently missed for glibc at the time.

Tested for x86_64.

(cherry picked from commit a94d9659cd)
2022-05-03 11:07:10 +02:00
Joseph Myers
fd5dbfd1cd Update kernel version to 5.15 in tst-mman-consts.py
This patch updates the kernel version in the test tst-mman-consts.py
to 5.15.  (There are no new MAP_* constants covered by this test in
5.15 that need any other header changes.)

Tested with build-many-glibcs.py.

(cherry picked from commit 5c3ece451d)
2022-05-03 11:07:07 +02:00
Joseph Myers
bc6fba3c80 Add PF_MCTP, AF_MCTP from Linux 5.15 to bits/socket.h
Linux 5.15 adds a new address / protocol family PF_MCTP / AF_MCTP; add
these constants to bits/socket.h.

Tested for x86_64.

(cherry picked from commit bdeb7a8fa9)
2022-05-03 11:07:03 +02:00
DJ Delorie
c66c92181d posix/glob.c: update from gnulib
Copied from gnulib/lib/glob.c in order to fix rhbz 1982608
Also fixes swbz 25659

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 7c477b57a3)
2022-04-28 11:57:23 -04:00
Adhemerval Zanella
88a8637cb4 linux: Fix fchmodat with AT_SYMLINK_NOFOLLOW for 64 bit time_t (BZ#29097)
The AT_SYMLINK_NOFOLLOW emulation uses the default 32 bit stat internal
calls, which fails with EOVERFLOW if the file contains timestamps
beyond 2038.
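
The overflow condition can be sketched as a range check, assuming the
32 bit stat path stores seconds in a signed 32-bit field. The helper
name `sketch_fits_time32` is hypothetical and not part of glibc.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true if SECONDS (seconds since the epoch) can be represented
   in a signed 32-bit time field.  INT32_MAX seconds corresponds to
   2038-01-19 03:14:07 UTC; anything later does not fit.  */
static bool
sketch_fits_time32 (int64_t seconds)
{
  return seconds >= INT32_MIN && seconds <= INT32_MAX;
}
```

A timestamp for which this returns false is the kind that makes the 32
bit stat call report EOVERFLOW.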

Checked on i686-linux-gnu.

(cherry picked from commit 118a2aee07)
2022-04-28 10:10:30 -03:00
Carlos O'Donell
55640ed3fd i386: Regenerate ulps
These failures were caught while building glibc master for Fedora
Rawhide which is built with '-mtune=generic -msse2 -mfpmath=sse'
using gcc 11.3 (gcc-11.3.1-2.fc35) on a Cascadelake Intel Xeon
processor.

(cherry picked from commit e465d97653)
2022-04-27 21:20:43 -04:00
Adhemerval Zanella
9681691402 linux: Fix missing internal 64 bit time_t stat usage
These are two missing spots initially done by 52a5fe70a2.

Checked on i686-linux-gnu.

(cherry picked from commit 834ddd0432)
2022-04-27 14:52:26 -03:00
Noah Goldstein
c796418d00 x86: Optimize L(less_vec) case in memcmp-evex-movbe.S
No bug.
Optimizations are twofold.

1) Replace page cross and 0/1 checks with masked load instructions in
   L(less_vec). In applications this reduces branch-misses in the
   hot [0, 32] case.
2) Change control flow so that L(less_vec) case gets the fall through.

Change 2) helps copies in the [0, 32] size range but comes at the cost
of copies in the [33, 64] size range.  From profiles of GCC and
Python3, 94%+ and 99%+ of calls are in the [0, 32] range so this
appears to be the right tradeoff.
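
A masked load needs a mask selecting only the first `len` bytes of the
vector. The commit message does not show how that mask is built; the
following scalar sketch is one assumed formulation for a 32-byte
vector, with a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Return a mask with the low LEN bits set, one bit per byte of a
   32-byte vector.  LEN == 32 is handled separately because shifting a
   32-bit value by 32 is undefined in C.  */
static uint32_t
sketch_len_mask (unsigned int len)
{
  assert (len <= 32);
  return len == 32 ? UINT32_MAX : (UINT32_C (1) << len) - 1;
}
```

Bytes whose mask bit is clear are never accessed by a masked load, so
no separate page cross or size 0/1 branch is required.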

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit abddd61de0)
2022-04-26 18:18:16 -07:00
H.J. Lu
f3a99b2216 x86: Don't set Prefer_No_AVX512 for processors with AVX512 and AVX-VNNI
Don't set Prefer_No_AVX512 on processors with AVX512 and AVX-VNNI since
they won't lower CPU frequency when ZMM load and store instructions are
used.

(cherry picked from commit ceeffe968c)
2022-04-26 18:18:16 -07:00
Noah Goldstein
4bbd0f866a x86-64: Use notl in EVEX strcmp [BZ #28646]
Must use notl %edi here as lower bits are for CHAR comparisons
potentially out of range thus can be 0 without indicating mismatch.
This fixes BZ #28646.

Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 4df1fa6ddc)
2022-04-26 18:18:16 -07:00
Noah Goldstein
7cb126e7e7 x86: Shrink memcmp-sse4.S code size
No bug.

This implementation refactors memcmp-sse4.S primarily with minimizing
code size in mind. It does this by removing the lookup table logic and
removing the unrolled check from (256, 512] bytes.

memcmp-sse4 code size reduction : -3487 bytes
wmemcmp-sse4 code size reduction: -1472 bytes

The current memcmp-sse4.S implementation has a large code size
cost. This has serious adverse effects on the ICache / ITLB. While
in micro-benchmarks the implementation appears fast, traces of
real-world code have shown that the speed in micro-benchmarks does not
translate when the ICache/ITLB are not primed, and that the cost
of the code size has measurable negative effects on overall
application performance.

See https://research.google/pubs/pub48320/ for more details.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 2f9062d717)
2022-04-26 18:18:16 -07:00
Noah Goldstein
cecbac5212 x86: Double size of ERMS rep_movsb_threshold in dl-cacheinfo.h
No bug.

This patch doubles the rep_movsb_threshold when using ERMS. Based on
benchmarks the vector copy loop, especially now that it handles 4k
aliasing, is better for these medium-range sizes.

On Skylake with ERMS:

Size,   Align1, Align2, dst>src,(rep movsb) / (vec copy)
4096,   0,      0,      0,      0.975
4096,   0,      0,      1,      0.953
4096,   12,     0,      0,      0.969
4096,   12,     0,      1,      0.872
4096,   44,     0,      0,      0.979
4096,   44,     0,      1,      0.83
4096,   0,      12,     0,      1.006
4096,   0,      12,     1,      0.989
4096,   0,      44,     0,      0.739
4096,   0,      44,     1,      0.942
4096,   12,     12,     0,      1.009
4096,   12,     12,     1,      0.973
4096,   44,     44,     0,      0.791
4096,   44,     44,     1,      0.961
4096,   2048,   0,      0,      0.978
4096,   2048,   0,      1,      0.951
4096,   2060,   0,      0,      0.986
4096,   2060,   0,      1,      0.963
4096,   2048,   12,     0,      0.971
4096,   2048,   12,     1,      0.941
4096,   2060,   12,     0,      0.977
4096,   2060,   12,     1,      0.949
8192,   0,      0,      0,      0.85
8192,   0,      0,      1,      0.845
8192,   13,     0,      0,      0.937
8192,   13,     0,      1,      0.939
8192,   45,     0,      0,      0.932
8192,   45,     0,      1,      0.927
8192,   0,      13,     0,      0.621
8192,   0,      13,     1,      0.62
8192,   0,      45,     0,      0.53
8192,   0,      45,     1,      0.516
8192,   13,     13,     0,      0.664
8192,   13,     13,     1,      0.659
8192,   45,     45,     0,      0.593
8192,   45,     45,     1,      0.575
8192,   2048,   0,      0,      0.854
8192,   2048,   0,      1,      0.834
8192,   2061,   0,      0,      0.863
8192,   2061,   0,      1,      0.857
8192,   2048,   13,     0,      0.63
8192,   2048,   13,     1,      0.629
8192,   2061,   13,     0,      0.627
8192,   2061,   13,     1,      0.62

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 475b63702e)
2022-04-26 18:18:16 -07:00