glibc

mirror of git://sourceware.org/git/glibc.git synced 2024-11-21 01:12:26 +08:00

H.J. Lu a057f5f8cd X86-64: Use non-temporal store in memcpy on large data	2016-04-12 08:10:47 -07:00

The large memcpy micro benchmark in glibc shows that there is a
regression with large data on Haswell machine.  non-temporal store in
memcpy on large data can improve performance significantly.  This
patch adds a threshold to use non temporal store which is 6 times of
shared cache size.  When size is above the threshold, non temporal
store will be used, but avoid non-temporal store if there is overlap
between destination and source since destination may be in cache when
source is loaded.

For size below 8 vector register width, we load all data into registers
and store them together.  Only forward and backward loops, which move 4
vector registers at a time, are used to support overlapping addresses.
For forward loop, we load the last 4 vector register width of data and
the first vector register width of data into vector registers before the
loop and store them after the loop.  For backward loop, we load the first
4 vector register width of data and the last vector register width of
data into vector registers before the loop and store them after the loop.

	[BZ #19928]
	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
	New.
	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to
	6 times of shared cache size.
	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
	(VMOVNT): New.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
	(VMOVNT): Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
	(VMOVNT): Likewise.
	(VMOVU): Changed to movups for smaller code sizes.
	(VMOVA): Changed to movaps for smaller code sizes.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
	comments.
	(PREFETCH): New.
	(PREFETCH_SIZE): Likewise.
	(PREFETCHED_LOAD_SIZE): Likewise.
	(PREFETCH_ONE_SET): Likewise.
	Rewrite to use forward and backward loops, which move 4 vector
	registers at a time, to support overlapping addresses and use
	non temporal store if size is above the threshold and there is
	no overlap between destination and source.
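
The decisions described in the commit message above can be modelled in portable C. This is only an illustrative sketch, not the assembly in memmove-vec-unaligned-erms.S. The names `plan_copy`, `copy_forward`, `regions_overlap` and `struct copy_plan` are hypothetical, `VEC_SIZE` is assumed to be 32 bytes (one AVX register), the exact handling of the "8 vector register width" boundary is a guess, and picking the backward loop whenever the destination lies above the source is an assumption of this sketch. Plain `memcpy` into local buffers stands in for vector loads and stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VEC_SIZE 32  /* assumption: one AVX register, in bytes */

/* Hypothetical summary of the choices the commit message describes. */
struct copy_plan {
    int small;         /* load all data into registers, then store */
    int backward;      /* use the backward 4-vector loop */
    int non_temporal;  /* use non-temporal stores in the loop */
};

static int regions_overlap(const void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
    return d < s ? s - d < n : d - s < n;
}

static struct copy_plan plan_copy(const void *dst, const void *src,
                                  size_t n, size_t shared_cache_size)
{
    struct copy_plan p = {0, 0, 0};
    /* Threshold from the message: 6 times the shared cache size. */
    size_t nt_threshold = shared_cache_size * 6;

    if (n < 8 * VEC_SIZE) {          /* boundary handling is assumed */
        p.small = 1;
        return p;
    }
    /* Assumption: go backward when dst is above src, which is the
       safe direction if the two regions overlap. */
    p.backward = (uintptr_t)dst > (uintptr_t)src;
    /* Non-temporal only above the threshold and without overlap. */
    p.non_temporal = n > nt_threshold && !regions_overlap(dst, src, n);
    return p;
}

/* Forward loop model: save the first vector and the last 4 vectors
   before the loop, move 4 vectors per iteration, store the saved
   data after the loop.  Safe when dst is below src, even if the
   regions overlap. */
static void copy_forward(unsigned char *dst, const unsigned char *src,
                         size_t n)
{
    unsigned char head[VEC_SIZE];
    unsigned char tail[4 * VEC_SIZE];
    unsigned char chunk[4 * VEC_SIZE];
    size_t off = VEC_SIZE;

    assert(n >= 8 * VEC_SIZE);
    memcpy(head, src, VEC_SIZE);
    memcpy(tail, src + n - 4 * VEC_SIZE, 4 * VEC_SIZE);

    while (n - off > 4 * VEC_SIZE) {
        memcpy(chunk, src + off, sizeof chunk);  /* 4 vector loads */
        memcpy(dst + off, chunk, sizeof chunk);  /* 4 vector stores */
        off += 4 * VEC_SIZE;
    }
    memcpy(dst + n - 4 * VEC_SIZE, tail, sizeof tail);
    memcpy(dst, head, sizeof head);
}
```

Saving the head and tail up front is what lets the loop run on whole 4-vector blocks without a separate remainder path: whatever the loop does not cover falls inside the saved tail, and the saved data cannot be clobbered by earlier stores when the regions overlap. The backward loop described in the message mirrors this with the roles of the first and last vectors swapped.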
aarch64
alpha
arm
generic	Fix crash on getauxval call without HAVE_AUX_VECTOR	2016-04-10 23:58:43 +02:00
gnu
hppa
i386	When disabling SSE, make sure -fpmath is not set to use SSE either	2016-04-09 22:14:24 -04:00
ia64
ieee754	Increase internal precision of ldbl-128ibm decimal printf [BZ #19853 ]	2016-03-31 12:14:33 -05:00
init_array
m68k
mach
microblaze
mips
nacl	Fix build with HAVE_AUX_VECTOR	2016-04-11 10:27:25 +02:00
nios2
nptl
posix
powerpc	powerpc: Add optimized P8 strspn	2016-04-07 15:51:28 -05:00
pthread
s390	S390: Use ahi instead of aghi in 32bit _dl_runtime_resolve.	2016-04-01 10:42:54 +02:00
sh
sparc
tile
unix	VDSO support for MIPS	2016-04-12 11:05:13 +01:00
wordsize-32
wordsize-64
x86	Remove Fast_Copy_Backward from Intel Core processors	2016-04-01 15:09:14 -07:00
x86_64	X86-64: Use non-temporal store in memcpy on large data	2016-04-12 08:10:47 -07:00