glibc/manual/tunables.texi

487 lines
20 KiB
Plaintext
Raw Normal View History

@node Tunables
@c @node Tunables, , Internal Probes, Top
@c %MENU% Tunable switches to alter libc internal behavior
@chapter Tunables
@cindex tunables
@dfn{Tunables} are a feature in @theglibc{} that allows application authors and
distribution maintainers to alter the runtime library behavior to match
their workload. These are implemented as a set of switches that may be
modified in different ways. The current default method to do this is via
the @env{GLIBC_TUNABLES} environment variable by setting it to a string
of colon-separated @var{name}=@var{value} pairs. For example, the following
example enables malloc checking and sets the malloc trim threshold to 128
bytes:
@example
GLIBC_TUNABLES=glibc.malloc.trim_threshold=128:glibc.malloc.check=3
export GLIBC_TUNABLES
@end example
Tunables are not part of the @glibcadj{} stable ABI, and they are
subject to change or removal across releases. Additionally, the method to
modify tunable values may change between releases and across distributions.
It is possible to implement multiple `frontends' for the tunables allowing
distributions to choose their preferred method at build time.
Finally, the set of tunables available may vary between distributions as
the tunables feature allows distributions to add their own tunables under
their own namespace.
@menu
* Tunable names:: The structure of a tunable name
* Memory Allocation Tunables:: Tunables in the memory allocation subsystem
* Dynamic Linking Tunables:: Tunables in the dynamic linking subsystem
Add elision tunables This patch adds several new tunables to control the behavior of elision on supported platforms[1]. Since elision now depends on tunables, we should always *compile* with elision enabled, and leave the code disabled, but available for runtime selection. This gives us *much* better compile-time testing of the existing code to avoid bit-rot[2]. Tested on ppc, ppc64, ppc64le, s390x and x86_64. [1] This part of the patch was initially proposed by Paul Murphy but was "staled" because the framework have changed since the patch was originally proposed: https://patchwork.sourceware.org/patch/10342/ [2] This part of the patch was inititally proposed as a RFC by Carlos O'Donnell. Make sense to me integrate this on the patch: https://sourceware.org/ml/libc-alpha/2017-05/msg00335.html * elf/dl-tunables.list: Add elision parameters. * manual/tunables.texi: Add entries about elision tunable. * sysdeps/unix/sysv/linux/powerpc/elision-conf.c: Add callback functions to dynamically enable/disable elision. Add multiple callbacks functions to set elision parameters. Deleted __libc_enable_secure check. * sysdeps/unix/sysv/linux/s390/elision-conf.c: Likewise. * sysdeps/unix/sysv/linux/x86/elision-conf.c: Likewise. * configure: Regenerated. * configure.ac: Option enable_lock_elision was deleted. * config.h.in: ENABLE_LOCK_ELISION flag was deleted. * config.make.in: Remove references to enable_lock_elision. * manual/install.texi: Elision configure option was removed. * INSTALL: Regenerated to remove enable_lock_elision. * nptl/Makefile: Disable elision so it can verify error case for destroying a mutex. * sysdeps/powerpc/nptl/elide.h: Cleanup ENABLE_LOCK_ELISION check. Deleted macros for the case when ENABLE_LOCK_ELISION was not defined. * sysdeps/s390/configure: Regenerated. * sysdeps/s390/configure.ac: Remove references to enable_lock_elision.. * nptl/tst-mutex8.c: Deleted all #ifndef ENABLE_LOCK_ELISION from the test. * sysdeps/powerpc/powerpc32/sysdep.h: Deleted all ENABLE_LOCK_ELISION checks. * sysdeps/powerpc/powerpc64/sysdep.h: Likewise. * sysdeps/powerpc/sysdep.h: Likewise. * sysdeps/s390/nptl/bits/pthreadtypes-arch.h: Likewise. * sysdeps/unix/sysv/linux/powerpc/force-elision.h: Likewise. * sysdeps/unix/sysv/linux/s390/elision-conf.h: Likewise. * sysdeps/unix/sysv/linux/s390/force-elision.h: Likewise. * sysdeps/unix/sysv/linux/s390/lowlevellock.h: Likewise. * sysdeps/unix/sysv/linux/s390/Makefile: Remove references to enable-lock-elision. Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.vnet.ibm.com>
2017-12-06 00:24:14 +08:00
* Elision Tunables:: Tunables in elision subsystem
Mutex: Add pthread mutex tunables This patch does not have any functionality change, we only provide a spin count tunes for pthread adaptive spin mutex. The tunable glibc.pthread.mutex_spin_count tunes can be used by system administrator to squeeze system performance according to different hardware capabilities and workload characteristics. The maximum value of spin count is limited to 32767 to avoid the overflow of mutex->__data.__spins variable with the possible type of short in pthread_mutex_lock (). The default value of spin count is set to 100 with the reference to the previous number of times of spinning via trylock. This value would be architecture-specific and can be tuned with kinds of benchmarks to fit most cases in future. I would extend my appreciation sincerely to H.J.Lu for his help to refine this patch series. * manual/tunables.texi (POSIX Thread Tunables): New node. * nptl/Makefile (libpthread-routines): Add pthread_mutex_conf. * nptl/nptl-init.c: Include pthread_mutex_conf.h (__pthread_initialize_minimal_internal) [HAVE_TUNABLES]: Call __pthread_tunables_init. * nptl/pthreadP.h (MAX_ADAPTIVE_COUNT): Remove. (max_adaptive_count): Define. * nptl/pthread_mutex_conf.c: New file. * nptl/pthread_mutex_conf.h: New file. * sysdeps/generic/adaptive_spin_count.h: New file. * sysdeps/nptl/dl-tunables.list: New file. * nptl/pthread_mutex_lock.c (__pthread_mutex_lock): Use max_adaptive_count () not MAX_ADAPTIVE_COUNT. * nptl/pthread_mutex_timedlock.c (__pthrad_mutex_timedlock): Likewise. Suggested-by: Andi Kleen <andi.kleen@intel.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com> Signed-off-by: Kemi.wang <kemi.wang@intel.com>
2018-12-01 23:03:33 +08:00
* POSIX Thread Tunables:: Tunables in the POSIX thread subsystem
* Hardware Capability Tunables:: Tunables that modify the hardware
capabilities seen by @theglibc{}
@end menu
@node Tunable names
@section Tunable names
@cindex Tunable names
@cindex Tunable namespaces
A tunable name is split into three components, a top namespace, a tunable
namespace and the tunable name. The top namespace for tunables implemented in
@theglibc{} is @code{glibc}. Distributions that choose to add custom tunables
in their maintained versions of @theglibc{} may choose to do so under their own
top namespace.
The tunable namespace is a logical grouping of tunables in a single
module. This currently holds no special significance, although that may
change in the future.
The tunable name is the actual name of the tunable. It is possible that
different tunable namespaces may have tunables within them that have the
same name, likewise for top namespaces. Hence, we only support
identification of tunables by their full name, i.e. with the top
namespace, tunable namespace and tunable name, separated by periods.
@node Memory Allocation Tunables
@section Memory Allocation Tunables
@cindex memory allocation tunables
@cindex malloc tunables
@cindex tunables, malloc
@deftp {Tunable namespace} glibc.malloc
Memory allocation behavior can be modified by setting any of the
following tunables in the @code{malloc} namespace:
@end deftp
@deftp Tunable glibc.malloc.check
This tunable supersedes the @env{MALLOC_CHECK_} environment variable and is
identical in features.
Setting this tunable to a non-zero value enables a special (less
efficient) memory allocator for the malloc family of functions that is
designed to be tolerant against simple errors such as double calls of
free with the same argument, or overruns of a single byte (off-by-one
bugs). Not all such errors can be protected against, however, and memory
leaks can result. Any detected heap corruption results in immediate
termination of the process.
Like @env{MALLOC_CHECK_}, @code{glibc.malloc.check} has a problem in that it
diverges from normal program behavior by writing to @code{stderr}, which could
by exploited in SUID and SGID binaries. Therefore, @code{glibc.malloc.check}
is disabled by default for SUID and SGID binaries. This can be enabled again
by the system administrator by adding a file @file{/etc/suid-debug}; the
content of the file could be anything or even empty.
@end deftp
@deftp Tunable glibc.malloc.top_pad
This tunable supersedes the @env{MALLOC_TOP_PAD_} environment variable and is
identical in features.
This tunable determines the amount of extra memory in bytes to obtain from the
system when any of the arenas need to be extended. It also specifies the
number of bytes to retain when shrinking any of the arenas. This provides the
necessary hysteresis in heap size such that excessive amounts of system calls
can be avoided.
The default value of this tunable is @samp{0}.
@end deftp
@deftp Tunable glibc.malloc.perturb
This tunable supersedes the @env{MALLOC_PERTURB_} environment variable and is
identical in features.
If set to a non-zero value, memory blocks are initialized with values depending
on some low order bits of this tunable when they are allocated (except when
allocated by calloc) and freed. This can be used to debug the use of
uninitialized or freed heap memory. Note that this option does not guarantee
that the freed block will have any specific values. It only guarantees that the
content the block had before it was freed will be overwritten.
The default value of this tunable is @samp{0}.
@end deftp
@deftp Tunable glibc.malloc.mmap_threshold
This tunable supersedes the @env{MALLOC_MMAP_THRESHOLD_} environment variable
and is identical in features.
When this tunable is set, all chunks larger than this value in bytes are
allocated outside the normal heap, using the @code{mmap} system call. This way
it is guaranteed that the memory for these chunks can be returned to the system
on @code{free}. Note that requests smaller than this threshold might still be
allocated via @code{mmap}.
If this tunable is not set, the default value is set to @samp{131072} bytes and
the threshold is adjusted dynamically to suit the allocation patterns of the
program. If the tunable is set, the dynamic adjustment is disabled and the
value is set as static.
@end deftp
@deftp Tunable glibc.malloc.trim_threshold
This tunable supersedes the @env{MALLOC_TRIM_THRESHOLD_} environment variable
and is identical in features.
The value of this tunable is the minimum size (in bytes) of the top-most,
releasable chunk in an arena that will trigger a system call in order to return
memory to the system from that arena.
If this tunable is not set, the default value is set as 128 KB and the
threshold is adjusted dynamically to suit the allocation patterns of the
program. If the tunable is set, the dynamic adjustment is disabled and the
value is set as static.
@end deftp
@deftp Tunable glibc.malloc.mmap_max
This tunable supersedes the @env{MALLOC_MMAP_MAX_} environment variable and is
identical in features.
The value of this tunable is maximum number of chunks to allocate with
@code{mmap}. Setting this to zero disables all use of @code{mmap}.
The default value of this tunable is @samp{65536}.
@end deftp
@deftp Tunable glibc.malloc.arena_test
This tunable supersedes the @env{MALLOC_ARENA_TEST} environment variable and is
identical in features.
The @code{glibc.malloc.arena_test} tunable specifies the number of arenas that
can be created before the test on the limit to the number of arenas is
conducted. The value is ignored if @code{glibc.malloc.arena_max} is set.
The default value of this tunable is 2 for 32-bit systems and 8 for 64-bit
systems.
@end deftp
@deftp Tunable glibc.malloc.arena_max
This tunable supersedes the @env{MALLOC_ARENA_MAX} environment variable and is
identical in features.
This tunable sets the number of arenas to use in a process regardless of the
number of cores in the system.
The default value of this tunable is @code{0}, meaning that the limit on the
number of arenas is determined by the number of CPU cores online. For 32-bit
systems the limit is twice the number of cores online and on 64-bit systems, it
is 8 times the number of cores online.
@end deftp
@deftp Tunable glibc.malloc.tcache_max
The maximum size of a request (in bytes) which may be met via the
per-thread cache. The default (and maximum) value is 1032 bytes on
64-bit systems and 516 bytes on 32-bit systems.
@end deftp
@deftp Tunable glibc.malloc.tcache_count
The maximum number of chunks of each size to cache. The default is 7.
The upper limit is 65535. If set to zero, the per-thread cache is effectively
disabled.
The approximate maximum overhead of the per-thread cache is thus equal
to the number of bins times the chunk count in each bin times the size
of each chunk. With defaults, the approximate maximum overhead of the
per-thread cache is approximately 236 KB on 64-bit systems and 118 KB
on 32-bit systems.
@end deftp
@deftp Tunable glibc.malloc.tcache_unsorted_limit
When the user requests memory and the request cannot be met via the
per-thread cache, the arenas are used to meet the request. At this
time, additional chunks will be moved from existing arena lists to
pre-fill the corresponding cache. While copies from the fastbins,
smallbins, and regular bins are bounded and predictable due to the bin
sizes, copies from the unsorted bin are not bounded, and incur
additional time penalties as they need to be sorted as they're
scanned. To make scanning the unsorted list more predictable and
bounded, the user may set this tunable to limit the number of chunks
that are scanned from the unsorted list while searching for chunks to
pre-fill the per-thread cache with. The default, or when set to zero,
is no limit.
@end deftp
@deftp Tunable glibc.malloc.mxfast
One of the optimizations malloc uses is to maintain a series of ``fast
bins'' that hold chunks up to a specific size. The default and
maximum size which may be held this way is 80 bytes on 32-bit systems
or 160 bytes on 64-bit systems. Applications which value size over
speed may choose to reduce the size of requests which are serviced
from fast bins with this tunable. Note that the value specified
includes malloc's internal overhead, which is normally the size of one
pointer, so add 4 on 32-bit systems or 8 on 64-bit systems to the size
passed to @code{malloc} for the largest bin size to enable.
@end deftp
@node Dynamic Linking Tunables
@section Dynamic Linking Tunables
@cindex dynamic linking tunables
@cindex rtld tunables
@deftp {Tunable namespace} glibc.rtld
Dynamic linker behavior can be modified by setting the
following tunables in the @code{rtld} namespace:
@end deftp
@deftp Tunable glibc.rtld.nns
Sets the number of supported dynamic link namespaces (see @code{dlmopen}).
Currently this limit can be set between 1 and 16 inclusive, the default is 4.
Each link namespace consumes some memory in all thread, and thus raising the
limit will increase the amount of memory each thread uses. Raising the limit
is useful when your application uses more than 4 dynamic link namespaces as
created by @code{dlmopen} with an lmid argument of @code{LM_ID_NEWLM}.
Dynamic linker audit modules are loaded in their own dynamic link namespaces,
but they are not accounted for in @code{glibc.rtld.nns}. They implicitly
increase the per-thread memory usage as necessary, so this tunable does
not need to be changed to allow many audit modules e.g. via @env{LD_AUDIT}.
@end deftp
@deftp Tunable glibc.rtld.optional_static_tls
Sets the amount of surplus static TLS in bytes to allocate at program
startup. Every thread created allocates this amount of specified surplus
static TLS. This is a minimum value and additional space may be allocated
for internal purposes including alignment. Optional static TLS is used for
optimizing dynamic TLS access for platforms that support such optimizations
e.g. TLS descriptors or optimized TLS access for POWER (@code{DT_PPC64_OPT}
and @code{DT_PPC_OPT}). In order to make the best use of such optimizations
the value should be as many bytes as would be required to hold all TLS
variables in all dynamic loaded shared libraries. The value cannot be known
by the dynamic loader because it doesn't know the expected set of shared
libraries which will be loaded. The existing static TLS space cannot be
changed once allocated at process startup. The default allocation of
optional static TLS is 512 bytes and is allocated in every thread.
@end deftp
Add elision tunables This patch adds several new tunables to control the behavior of elision on supported platforms[1]. Since elision now depends on tunables, we should always *compile* with elision enabled, and leave the code disabled, but available for runtime selection. This gives us *much* better compile-time testing of the existing code to avoid bit-rot[2]. Tested on ppc, ppc64, ppc64le, s390x and x86_64. [1] This part of the patch was initially proposed by Paul Murphy but was "staled" because the framework have changed since the patch was originally proposed: https://patchwork.sourceware.org/patch/10342/ [2] This part of the patch was inititally proposed as a RFC by Carlos O'Donnell. Make sense to me integrate this on the patch: https://sourceware.org/ml/libc-alpha/2017-05/msg00335.html * elf/dl-tunables.list: Add elision parameters. * manual/tunables.texi: Add entries about elision tunable. * sysdeps/unix/sysv/linux/powerpc/elision-conf.c: Add callback functions to dynamically enable/disable elision. Add multiple callbacks functions to set elision parameters. Deleted __libc_enable_secure check. * sysdeps/unix/sysv/linux/s390/elision-conf.c: Likewise. * sysdeps/unix/sysv/linux/x86/elision-conf.c: Likewise. * configure: Regenerated. * configure.ac: Option enable_lock_elision was deleted. * config.h.in: ENABLE_LOCK_ELISION flag was deleted. * config.make.in: Remove references to enable_lock_elision. * manual/install.texi: Elision configure option was removed. * INSTALL: Regenerated to remove enable_lock_elision. * nptl/Makefile: Disable elision so it can verify error case for destroying a mutex. * sysdeps/powerpc/nptl/elide.h: Cleanup ENABLE_LOCK_ELISION check. Deleted macros for the case when ENABLE_LOCK_ELISION was not defined. * sysdeps/s390/configure: Regenerated. * sysdeps/s390/configure.ac: Remove references to enable_lock_elision.. * nptl/tst-mutex8.c: Deleted all #ifndef ENABLE_LOCK_ELISION from the test. * sysdeps/powerpc/powerpc32/sysdep.h: Deleted all ENABLE_LOCK_ELISION checks. * sysdeps/powerpc/powerpc64/sysdep.h: Likewise. * sysdeps/powerpc/sysdep.h: Likewise. * sysdeps/s390/nptl/bits/pthreadtypes-arch.h: Likewise. * sysdeps/unix/sysv/linux/powerpc/force-elision.h: Likewise. * sysdeps/unix/sysv/linux/s390/elision-conf.h: Likewise. * sysdeps/unix/sysv/linux/s390/force-elision.h: Likewise. * sysdeps/unix/sysv/linux/s390/lowlevellock.h: Likewise. * sysdeps/unix/sysv/linux/s390/Makefile: Remove references to enable-lock-elision. Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.vnet.ibm.com>
2017-12-06 00:24:14 +08:00
@node Elision Tunables
@section Elision Tunables
@cindex elision tunables
@cindex tunables, elision
@deftp {Tunable namespace} glibc.elision
Contended locks are usually slow and can lead to performance and scalability
issues in multithread code. Lock elision will use memory transactions to under
certain conditions, to elide locks and improve performance.
Elision behavior can be modified by setting the following tunables in
the @code{elision} namespace:
@end deftp
@deftp Tunable glibc.elision.enable
The @code{glibc.elision.enable} tunable enables lock elision if the feature is
supported by the hardware. If elision is not supported by the hardware this
tunable has no effect.
Elision tunables are supported for 64-bit Intel, IBM POWER, and z System
architectures.
@end deftp
@deftp Tunable glibc.elision.skip_lock_busy
The @code{glibc.elision.skip_lock_busy} tunable sets how many times to use a
non-transactional lock after a transactional failure has occurred because the
lock is already acquired. Expressed in number of lock acquisition attempts.
The default value of this tunable is @samp{3}.
@end deftp
@deftp Tunable glibc.elision.skip_lock_internal_abort
The @code{glibc.elision.skip_lock_internal_abort} tunable sets how many times
the thread should avoid using elision if a transaction aborted for any reason
other than a different thread's memory accesses. Expressed in number of lock
acquisition attempts.
The default value of this tunable is @samp{3}.
@end deftp
@deftp Tunable glibc.elision.skip_lock_after_retries
The @code{glibc.elision.skip_lock_after_retries} tunable sets how many times
to try to elide a lock with transactions, that only failed due to a different
thread's memory accesses, before falling back to regular lock.
Expressed in number of lock elision attempts.
This tunable is supported only on IBM POWER, and z System architectures.
The default value of this tunable is @samp{3}.
@end deftp
@deftp Tunable glibc.elision.tries
The @code{glibc.elision.tries} sets how many times to retry elision if there is
chance for the transaction to finish execution e.g., it wasn't
aborted due to the lock being already acquired. If elision is not supported
by the hardware this tunable is set to @samp{0} to avoid retries.
The default value of this tunable is @samp{3}.
@end deftp
@deftp Tunable glibc.elision.skip_trylock_internal_abort
The @code{glibc.elision.skip_trylock_internal_abort} tunable sets how many
times the thread should avoid trying the lock if a transaction aborted due to
reasons other than a different thread's memory accesses. Expressed in number
of try lock attempts.
The default value of this tunable is @samp{3}.
@end deftp
Mutex: Add pthread mutex tunables This patch does not have any functionality change, we only provide a spin count tunes for pthread adaptive spin mutex. The tunable glibc.pthread.mutex_spin_count tunes can be used by system administrator to squeeze system performance according to different hardware capabilities and workload characteristics. The maximum value of spin count is limited to 32767 to avoid the overflow of mutex->__data.__spins variable with the possible type of short in pthread_mutex_lock (). The default value of spin count is set to 100 with the reference to the previous number of times of spinning via trylock. This value would be architecture-specific and can be tuned with kinds of benchmarks to fit most cases in future. I would extend my appreciation sincerely to H.J.Lu for his help to refine this patch series. * manual/tunables.texi (POSIX Thread Tunables): New node. * nptl/Makefile (libpthread-routines): Add pthread_mutex_conf. * nptl/nptl-init.c: Include pthread_mutex_conf.h (__pthread_initialize_minimal_internal) [HAVE_TUNABLES]: Call __pthread_tunables_init. * nptl/pthreadP.h (MAX_ADAPTIVE_COUNT): Remove. (max_adaptive_count): Define. * nptl/pthread_mutex_conf.c: New file. * nptl/pthread_mutex_conf.h: New file. * sysdeps/generic/adaptive_spin_count.h: New file. * sysdeps/nptl/dl-tunables.list: New file. * nptl/pthread_mutex_lock.c (__pthread_mutex_lock): Use max_adaptive_count () not MAX_ADAPTIVE_COUNT. * nptl/pthread_mutex_timedlock.c (__pthrad_mutex_timedlock): Likewise. Suggested-by: Andi Kleen <andi.kleen@intel.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com> Signed-off-by: Kemi.wang <kemi.wang@intel.com>
2018-12-01 23:03:33 +08:00
@node POSIX Thread Tunables
@section POSIX Thread Tunables
@cindex pthread mutex tunables
@cindex thread mutex tunables
@cindex mutex tunables
@cindex tunables thread mutex
@deftp {Tunable namespace} glibc.pthread
The behavior of POSIX threads can be tuned to gain performance improvements
according to specific hardware capabilities and workload characteristics by
setting the following tunables in the @code{pthread} namespace:
@end deftp
@deftp Tunable glibc.pthread.mutex_spin_count
The @code{glibc.pthread.mutex_spin_count} tunable sets the maximum number of times
a thread should spin on the lock before calling into the kernel to block.
Adaptive spin is used for mutexes initialized with the
@code{PTHREAD_MUTEX_ADAPTIVE_NP} GNU extension. It affects both
@code{pthread_mutex_lock} and @code{pthread_mutex_timedlock}.
The thread spins until either the maximum spin count is reached or the lock
is acquired.
The default value of this tunable is @samp{100}.
@end deftp
@node Hardware Capability Tunables
@section Hardware Capability Tunables
@cindex hardware capability tunables
@cindex hwcap tunables
@cindex tunables, hwcap
@cindex hwcaps tunables
@cindex tunables, hwcaps
tunables: Add IFUNC selection and cache sizes The current IFUNC selection is based on microbenchmarks in glibc. It should give the best performance for most workloads. But other choices may have better performance for a particular workload or on the hardware which wasn't available at the selection was made. The environment variable, GLIBC_TUNABLES=glibc.tune.ifunc=-xxx,yyy,-zzz...., can be used to enable CPU/ARCH feature yyy, disable CPU/ARCH feature yyy and zzz, where the feature name is case-sensitive and has to match the ones in cpu-features.h. It can be used by glibc developers to override the IFUNC selection to tune for a new processor or improve performance for a particular workload. It isn't intended for normal end users. NOTE: the IFUNC selection may change over time. Please check all multiarch implementations when experimenting. Also, GLIBC_TUNABLES=glibc.tune.x86_non_temporal_threshold=NUMBER is provided to set threshold to use non temporal store to NUMBER, GLIBC_TUNABLES=glibc.tune.x86_data_cache_size=NUMBER to set data cache size, GLIBC_TUNABLES=glibc.tune.x86_shared_cache_size=NUMBER to set shared cache size. * elf/dl-tunables.list (tune): Add ifunc, x86_non_temporal_threshold, x86_data_cache_size and x86_shared_cache_size. * manual/tunables.texi: Document glibc.tune.ifunc, glibc.tune.x86_data_cache_size, glibc.tune.x86_shared_cache_size and glibc.tune.x86_non_temporal_threshold. * sysdeps/unix/sysv/linux/x86/dl-sysdep.c: New file. * sysdeps/x86/cpu-tunables.c: Likewise. * sysdeps/x86/cacheinfo.c (init_cacheinfo): Check and get data cache size, shared cache size and non temporal threshold from cpu_features. * sysdeps/x86/cpu-features.c [HAVE_TUNABLES] (TUNABLE_NAMESPACE): New. [HAVE_TUNABLES] Include <unistd.h>. [HAVE_TUNABLES] Include <elf/dl-tunables.h>. [HAVE_TUNABLES] (TUNABLE_CALLBACK (set_ifunc)): Likewise. [HAVE_TUNABLES] (init_cpu_features): Use TUNABLE_GET to set IFUNC selection, data cache size, shared cache size and non temporal threshold. * sysdeps/x86/cpu-features.h (cpu_features): Add data_cache_size, shared_cache_size and non_temporal_threshold.
2017-06-20 23:33:29 +08:00
@cindex data_cache_size tunables
@cindex tunables, data_cache_size
@cindex shared_cache_size tunables
@cindex tunables, shared_cache_size
@cindex non_temporal_threshold tunables
@cindex tunables, non_temporal_threshold
@deftp {Tunable namespace} glibc.cpu
Behavior of @theglibc{} can be tuned to assume specific hardware capabilities
by setting the following tunables in the @code{cpu} namespace:
@end deftp
@deftp Tunable glibc.cpu.hwcap_mask
This tunable supersedes the @env{LD_HWCAP_MASK} environment variable and is
identical in features.
2018-01-24 03:40:44 +08:00
The @code{AT_HWCAP} key in the Auxiliary Vector specifies instruction set
extensions available in the processor at runtime for some architectures. The
@code{glibc.cpu.hwcap_mask} tunable allows the user to mask out those
capabilities at runtime, thus disabling use of those extensions.
@end deftp
tunables: Add IFUNC selection and cache sizes The current IFUNC selection is based on microbenchmarks in glibc. It should give the best performance for most workloads. But other choices may have better performance for a particular workload or on the hardware which wasn't available at the selection was made. The environment variable, GLIBC_TUNABLES=glibc.tune.ifunc=-xxx,yyy,-zzz...., can be used to enable CPU/ARCH feature yyy, disable CPU/ARCH feature yyy and zzz, where the feature name is case-sensitive and has to match the ones in cpu-features.h. It can be used by glibc developers to override the IFUNC selection to tune for a new processor or improve performance for a particular workload. It isn't intended for normal end users. NOTE: the IFUNC selection may change over time. Please check all multiarch implementations when experimenting. Also, GLIBC_TUNABLES=glibc.tune.x86_non_temporal_threshold=NUMBER is provided to set threshold to use non temporal store to NUMBER, GLIBC_TUNABLES=glibc.tune.x86_data_cache_size=NUMBER to set data cache size, GLIBC_TUNABLES=glibc.tune.x86_shared_cache_size=NUMBER to set shared cache size. * elf/dl-tunables.list (tune): Add ifunc, x86_non_temporal_threshold, x86_data_cache_size and x86_shared_cache_size. * manual/tunables.texi: Document glibc.tune.ifunc, glibc.tune.x86_data_cache_size, glibc.tune.x86_shared_cache_size and glibc.tune.x86_non_temporal_threshold. * sysdeps/unix/sysv/linux/x86/dl-sysdep.c: New file. * sysdeps/x86/cpu-tunables.c: Likewise. * sysdeps/x86/cacheinfo.c (init_cacheinfo): Check and get data cache size, shared cache size and non temporal threshold from cpu_features. * sysdeps/x86/cpu-features.c [HAVE_TUNABLES] (TUNABLE_NAMESPACE): New. [HAVE_TUNABLES] Include <unistd.h>. [HAVE_TUNABLES] Include <elf/dl-tunables.h>. [HAVE_TUNABLES] (TUNABLE_CALLBACK (set_ifunc)): Likewise. [HAVE_TUNABLES] (init_cpu_features): Use TUNABLE_GET to set IFUNC selection, data cache size, shared cache size and non temporal threshold. * sysdeps/x86/cpu-features.h (cpu_features): Add data_cache_size, shared_cache_size and non_temporal_threshold.
2017-06-20 23:33:29 +08:00
@deftp Tunable glibc.cpu.hwcaps
The @code{glibc.cpu.hwcaps=-xxx,yyy,-zzz...} tunable allows the user to
tunables: Add IFUNC selection and cache sizes The current IFUNC selection is based on microbenchmarks in glibc. It should give the best performance for most workloads. But other choices may have better performance for a particular workload or on the hardware which wasn't available at the selection was made. The environment variable, GLIBC_TUNABLES=glibc.tune.ifunc=-xxx,yyy,-zzz...., can be used to enable CPU/ARCH feature yyy, disable CPU/ARCH feature yyy and zzz, where the feature name is case-sensitive and has to match the ones in cpu-features.h. It can be used by glibc developers to override the IFUNC selection to tune for a new processor or improve performance for a particular workload. It isn't intended for normal end users. NOTE: the IFUNC selection may change over time. Please check all multiarch implementations when experimenting. Also, GLIBC_TUNABLES=glibc.tune.x86_non_temporal_threshold=NUMBER is provided to set threshold to use non temporal store to NUMBER, GLIBC_TUNABLES=glibc.tune.x86_data_cache_size=NUMBER to set data cache size, GLIBC_TUNABLES=glibc.tune.x86_shared_cache_size=NUMBER to set shared cache size. * elf/dl-tunables.list (tune): Add ifunc, x86_non_temporal_threshold, x86_data_cache_size and x86_shared_cache_size. * manual/tunables.texi: Document glibc.tune.ifunc, glibc.tune.x86_data_cache_size, glibc.tune.x86_shared_cache_size and glibc.tune.x86_non_temporal_threshold. * sysdeps/unix/sysv/linux/x86/dl-sysdep.c: New file. * sysdeps/x86/cpu-tunables.c: Likewise. * sysdeps/x86/cacheinfo.c (init_cacheinfo): Check and get data cache size, shared cache size and non temporal threshold from cpu_features. * sysdeps/x86/cpu-features.c [HAVE_TUNABLES] (TUNABLE_NAMESPACE): New. [HAVE_TUNABLES] Include <unistd.h>. [HAVE_TUNABLES] Include <elf/dl-tunables.h>. [HAVE_TUNABLES] (TUNABLE_CALLBACK (set_ifunc)): Likewise. [HAVE_TUNABLES] (init_cpu_features): Use TUNABLE_GET to set IFUNC selection, data cache size, shared cache size and non temporal threshold. * sysdeps/x86/cpu-features.h (cpu_features): Add data_cache_size, shared_cache_size and non_temporal_threshold.
2017-06-20 23:33:29 +08:00
enable CPU/ARCH feature @code{yyy}, disable CPU/ARCH feature @code{xxx}
and @code{zzz} where the feature name is case-sensitive and has to match
the ones in @code{sysdeps/x86/cpu-features.h}.
This tunable is specific to i386 and x86-64.
@end deftp
@deftp Tunable glibc.cpu.cached_memopt
The @code{glibc.cpu.cached_memopt=[0|1]} tunable allows the user to
powerpc: POWER8 memcpy optimization for cached memory On POWER8, unaligned memory accesses to cached memory has little impact on performance as opposed to its ancestors. It is disabled by default and will only be available when the tunable glibc.tune.cached_memopt is set to 1. __memcpy_power8_cached __memcpy_power7 ============================================================ max-size=4096: 33325.70 ( 12.65%) 38153.00 max-size=8192: 32878.20 ( 11.17%) 37012.30 max-size=16384: 33782.20 ( 11.61%) 38219.20 max-size=32768: 33296.20 ( 11.30%) 37538.30 max-size=65536: 33765.60 ( 10.53%) 37738.40 * manual/tunables.texi (Hardware Capability Tunables): Document glibc.tune.cached_memopt. * sysdeps/powerpc/cpu-features.c: New file. * sysdeps/powerpc/cpu-features.h: New file. * sysdeps/powerpc/dl-procinfo.c [!IS_IN(ldconfig)]: Add _dl_powerpc_cpu_features. * sysdeps/powerpc/dl-tunables.list: New file. * sysdeps/powerpc/ldsodefs.h: Include cpu-features.h. * sysdeps/powerpc/powerpc32/power4/multiarch/init-arch.h (INIT_ARCH): Initialize use_aligned_memopt. * sysdeps/powerpc/powerpc64/dl-machine.h [defined(SHARED && IS_IN(rtld))]: Restrict dl_platform_init availability and initialize CPU features used by tunables. * sysdeps/powerpc/powerpc64/multiarch/Makefile (sysdep_routines): Add memcpy-power8-cached. * sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c: Add __memcpy_power8_cached. * sysdeps/powerpc/powerpc64/multiarch/memcpy.c: Likewise. * sysdeps/powerpc/powerpc64/multiarch/memcpy-power8-cached.S: New file. Reviewed-by: Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com>
2017-12-12 03:39:42 +08:00
enable optimizations recommended for cacheable memory. If set to
@code{1}, @theglibc{} assumes that the process memory image consists
of cacheable (non-device) memory only. The default, @code{0},
indicates that the process may use device memory.
This tunable is specific to powerpc, powerpc64 and powerpc64le.
@end deftp
@deftp Tunable glibc.cpu.name
The @code{glibc.cpu.name=xxx} tunable allows the user to tell @theglibc{} to
assume that the CPU is @code{xxx} where xxx may have one of these values:
@code{generic}, @code{falkor}, @code{thunderxt88}, @code{thunderx2t99},
@code{thunderx2t99p1}, @code{ares}, @code{emag}, @code{kunpeng}.
This tunable is specific to aarch64.
@end deftp
@deftp Tunable glibc.cpu.x86_data_cache_size
The @code{glibc.cpu.x86_data_cache_size} tunable allows the user to set
tunables: Add IFUNC selection and cache sizes The current IFUNC selection is based on microbenchmarks in glibc. It should give the best performance for most workloads. But other choices may have better performance for a particular workload or on the hardware which wasn't available at the selection was made. The environment variable, GLIBC_TUNABLES=glibc.tune.ifunc=-xxx,yyy,-zzz...., can be used to enable CPU/ARCH feature yyy, disable CPU/ARCH feature yyy and zzz, where the feature name is case-sensitive and has to match the ones in cpu-features.h. It can be used by glibc developers to override the IFUNC selection to tune for a new processor or improve performance for a particular workload. It isn't intended for normal end users. NOTE: the IFUNC selection may change over time. Please check all multiarch implementations when experimenting. Also, GLIBC_TUNABLES=glibc.tune.x86_non_temporal_threshold=NUMBER is provided to set threshold to use non temporal store to NUMBER, GLIBC_TUNABLES=glibc.tune.x86_data_cache_size=NUMBER to set data cache size, GLIBC_TUNABLES=glibc.tune.x86_shared_cache_size=NUMBER to set shared cache size. * elf/dl-tunables.list (tune): Add ifunc, x86_non_temporal_threshold, x86_data_cache_size and x86_shared_cache_size. * manual/tunables.texi: Document glibc.tune.ifunc, glibc.tune.x86_data_cache_size, glibc.tune.x86_shared_cache_size and glibc.tune.x86_non_temporal_threshold. * sysdeps/unix/sysv/linux/x86/dl-sysdep.c: New file. * sysdeps/x86/cpu-tunables.c: Likewise. * sysdeps/x86/cacheinfo.c (init_cacheinfo): Check and get data cache size, shared cache size and non temporal threshold from cpu_features. * sysdeps/x86/cpu-features.c [HAVE_TUNABLES] (TUNABLE_NAMESPACE): New. [HAVE_TUNABLES] Include <unistd.h>. [HAVE_TUNABLES] Include <elf/dl-tunables.h>. [HAVE_TUNABLES] (TUNABLE_CALLBACK (set_ifunc)): Likewise. [HAVE_TUNABLES] (init_cpu_features): Use TUNABLE_GET to set IFUNC selection, data cache size, shared cache size and non temporal threshold. * sysdeps/x86/cpu-features.h (cpu_features): Add data_cache_size, shared_cache_size and non_temporal_threshold.
2017-06-20 23:33:29 +08:00
data cache size in bytes for use in memory and string routines.
This tunable is specific to i386 and x86-64.
@end deftp
@deftp Tunable glibc.cpu.x86_shared_cache_size
The @code{glibc.cpu.x86_shared_cache_size} tunable allows the user to
tunables: Add IFUNC selection and cache sizes The current IFUNC selection is based on microbenchmarks in glibc. It should give the best performance for most workloads. But other choices may have better performance for a particular workload or on the hardware which wasn't available at the selection was made. The environment variable, GLIBC_TUNABLES=glibc.tune.ifunc=-xxx,yyy,-zzz...., can be used to enable CPU/ARCH feature yyy, disable CPU/ARCH feature yyy and zzz, where the feature name is case-sensitive and has to match the ones in cpu-features.h. It can be used by glibc developers to override the IFUNC selection to tune for a new processor or improve performance for a particular workload. It isn't intended for normal end users. NOTE: the IFUNC selection may change over time. Please check all multiarch implementations when experimenting. Also, GLIBC_TUNABLES=glibc.tune.x86_non_temporal_threshold=NUMBER is provided to set threshold to use non temporal store to NUMBER, GLIBC_TUNABLES=glibc.tune.x86_data_cache_size=NUMBER to set data cache size, GLIBC_TUNABLES=glibc.tune.x86_shared_cache_size=NUMBER to set shared cache size. * elf/dl-tunables.list (tune): Add ifunc, x86_non_temporal_threshold, x86_data_cache_size and x86_shared_cache_size. * manual/tunables.texi: Document glibc.tune.ifunc, glibc.tune.x86_data_cache_size, glibc.tune.x86_shared_cache_size and glibc.tune.x86_non_temporal_threshold. * sysdeps/unix/sysv/linux/x86/dl-sysdep.c: New file. * sysdeps/x86/cpu-tunables.c: Likewise. * sysdeps/x86/cacheinfo.c (init_cacheinfo): Check and get data cache size, shared cache size and non temporal threshold from cpu_features. * sysdeps/x86/cpu-features.c [HAVE_TUNABLES] (TUNABLE_NAMESPACE): New. [HAVE_TUNABLES] Include <unistd.h>. [HAVE_TUNABLES] Include <elf/dl-tunables.h>. [HAVE_TUNABLES] (TUNABLE_CALLBACK (set_ifunc)): Likewise. [HAVE_TUNABLES] (init_cpu_features): Use TUNABLE_GET to set IFUNC selection, data cache size, shared cache size and non temporal threshold. * sysdeps/x86/cpu-features.h (cpu_features): Add data_cache_size, shared_cache_size and non_temporal_threshold.
2017-06-20 23:33:29 +08:00
set shared cache size in bytes for use in memory and string routines.
@end deftp
@deftp Tunable glibc.cpu.x86_non_temporal_threshold
The @code{glibc.cpu.x86_non_temporal_threshold} tunable allows the user
Reversing calculation of __x86_shared_non_temporal_threshold The __x86_shared_non_temporal_threshold determines when memcpy on x86 uses non_temporal stores to avoid pushing other data out of the last level cache. This patch proposes to revert the calculation change made by H.J. Lu's patch of June 2, 2017. H.J. Lu's patch selected a threshold suitable for a single thread getting maximum performance. It was tuned using the single threaded large memcpy micro benchmark on an 8 core processor. The last change changes the threshold from using 3/4 of one thread's share of the cache to using 3/4 of the entire cache of a multi-threaded system before switching to non-temporal stores. Multi-threaded systems with more than a few threads are server-class and typically have many active threads. If one thread consumes 3/4 of the available cache for all threads, it will cause other active threads to have data removed from the cache. Two examples show the range of the effect. John McCalpin's widely parallel Stream benchmark, which runs in parallel and fetches data sequentially, saw a 20% slowdown with this patch on an internal system test of 128 threads. This regression was discovered when comparing OL8 performance to OL7. An example that compares normal stores to non-temporal stores may be found at https://vgatherps.github.io/2018-09-02-nontemporal/. A simple test shows performance loss of 400 to 500% due to a failure to use nontemporal stores. These performance losses are most likely to occur when the system load is heaviest and good performance is critical. The tunable x86_non_temporal_threshold can be used to override the default for the knowledgable user who really wants maximum cache allocation to a single thread in a multi-threaded system. The manual entry for the tunable has been expanded to provide more information about its purpose. modified: sysdeps/x86/cacheinfo.c modified: manual/tunables.texi
2020-09-29 04:11:28 +08:00
to set threshold in bytes for non temporal store. Non temporal stores
give a hint to the hardware to move data directly to memory without
displacing other data from the cache. This tunable is used by some
platforms to determine when to use non temporal stores in operations
like memmove and memcpy.
tunables: Add IFUNC selection and cache sizes The current IFUNC selection is based on microbenchmarks in glibc. It should give the best performance for most workloads. But other choices may have better performance for a particular workload or on the hardware which wasn't available at the selection was made. The environment variable, GLIBC_TUNABLES=glibc.tune.ifunc=-xxx,yyy,-zzz...., can be used to enable CPU/ARCH feature yyy, disable CPU/ARCH feature yyy and zzz, where the feature name is case-sensitive and has to match the ones in cpu-features.h. It can be used by glibc developers to override the IFUNC selection to tune for a new processor or improve performance for a particular workload. It isn't intended for normal end users. NOTE: the IFUNC selection may change over time. Please check all multiarch implementations when experimenting. Also, GLIBC_TUNABLES=glibc.tune.x86_non_temporal_threshold=NUMBER is provided to set threshold to use non temporal store to NUMBER, GLIBC_TUNABLES=glibc.tune.x86_data_cache_size=NUMBER to set data cache size, GLIBC_TUNABLES=glibc.tune.x86_shared_cache_size=NUMBER to set shared cache size. * elf/dl-tunables.list (tune): Add ifunc, x86_non_temporal_threshold, x86_data_cache_size and x86_shared_cache_size. * manual/tunables.texi: Document glibc.tune.ifunc, glibc.tune.x86_data_cache_size, glibc.tune.x86_shared_cache_size and glibc.tune.x86_non_temporal_threshold. * sysdeps/unix/sysv/linux/x86/dl-sysdep.c: New file. * sysdeps/x86/cpu-tunables.c: Likewise. * sysdeps/x86/cacheinfo.c (init_cacheinfo): Check and get data cache size, shared cache size and non temporal threshold from cpu_features. * sysdeps/x86/cpu-features.c [HAVE_TUNABLES] (TUNABLE_NAMESPACE): New. [HAVE_TUNABLES] Include <unistd.h>. [HAVE_TUNABLES] Include <elf/dl-tunables.h>. [HAVE_TUNABLES] (TUNABLE_CALLBACK (set_ifunc)): Likewise. [HAVE_TUNABLES] (init_cpu_features): Use TUNABLE_GET to set IFUNC selection, data cache size, shared cache size and non temporal threshold. * sysdeps/x86/cpu-features.h (cpu_features): Add data_cache_size, shared_cache_size and non_temporal_threshold.
2017-06-20 23:33:29 +08:00
This tunable is specific to i386 and x86-64.
@end deftp
@deftp Tunable glibc.cpu.x86_rep_movsb_threshold
The @code{glibc.cpu.x86_rep_movsb_threshold} tunable allows the user to
set threshold in bytes to start using "rep movsb". The value must be
greater than zero, and currently defaults to 2048 bytes.
This tunable is specific to i386 and x86-64.
@end deftp
@deftp Tunable glibc.cpu.x86_rep_stosb_threshold
The @code{glibc.cpu.x86_rep_stosb_threshold} tunable allows the user to
set threshold in bytes to start using "rep stosb". The value must be
greater than zero, and currently defaults to 2048 bytes.
This tunable is specific to i386 and x86-64.
@end deftp
@deftp Tunable glibc.cpu.x86_ibt
The @code{glibc.cpu.x86_ibt} tunable allows the user to control how
indirect branch tracking (IBT) should be enabled. Accepted values are
@code{on}, @code{off}, and @code{permissive}. @code{on} always turns
on IBT regardless of whether IBT is enabled in the executable and its
dependent shared libraries. @code{off} always turns off IBT regardless
of whether IBT is enabled in the executable and its dependent shared
libraries. @code{permissive} is the same as the default which disables
IBT on non-CET executables and shared libraries.
This tunable is specific to i386 and x86-64.
@end deftp
@deftp Tunable glibc.cpu.x86_shstk
The @code{glibc.cpu.x86_shstk} tunable allows the user to control how
the shadow stack (SHSTK) should be enabled. Accepted values are
@code{on}, @code{off}, and @code{permissive}. @code{on} always turns on
SHSTK regardless of whether SHSTK is enabled in the executable and its
dependent shared libraries. @code{off} always turns off SHSTK regardless
of whether SHSTK is enabled in the executable and its dependent shared
libraries. @code{permissive} changes how dlopen works on non-CET shared
libraries. By default, when SHSTK is enabled, dlopening a non-CET shared
library returns an error. With @code{permissive}, it turns off SHSTK
instead.
This tunable is specific to i386 and x86-64.
@end deftp