binutils-gdb/gdb/testsuite/gdb.base/attach-pie-misread.exp
Pedro Alves 2c8c5d375e testsuite: tcl exec& -> 'kill -9 $pid' is racy (attach-many-short-lived-thread.exp races and others)
The buildbots show that attach-many-short-lived-thread.exp is racy.
But after staring at debug logs and playing with SystemTap scripts for
a (long) while, I figured out that neither GDB, nor the kernel nor the
test's program itself are at fault.

The problem is simply that the testsuite machinery is currently
subject to PID-reuse races.  The attach-many-short-lived-threads.c
test program just happens to be much more susceptible to trigger this
race because threads and processes share the same number space on
Linux, and the test spawns many many short lived threads in
succession, thus enlarging the race window a lot.

Part of the problem is that several tests spawn processes with "exec&"
(in order to test the "attach" command) , and then at the end of the
test, to make sure things are cleaned up, issue a 'remote_spawn "kill
-p $testpid"'.  Since with tcl's "exec&", tcl itself is responsible
for reaping the process's exit status, when we go kill the process,
testpid may have already exited _and_ its status may have (and often
has) been reaped already.  Thus it can happen that another process
meanwhile reuses $testpid, and that "kill" command kills the wrong
process...  Frequently, that happens to be
attach-many-short-lived-thread, but this explains other test's races
as well.

In the attach-many-short-lived-threads test, it sometimes manifests
like this:

 (gdb) file /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads
 Reading symbols from /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads...done.
 (gdb)           Loaded /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads into /home/pedro/gdb/mygit/build/gdb/testsuite/../../gdb/gdb
 attach 5940
 Attaching to program: /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads, process 5940
 warning: process 5940 is a zombie - the process has already terminated
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 ptrace: Operation not permitted.
 (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: attach
 info threads
 No threads.
 (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: no new threads
 set breakpoint always-inserted on
 (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: set breakpoint always-inserted on

Other times the process dies while the test is ongoing (the process is
ptrace-stopped):

 (gdb) print again = 1
 Cannot access memory at address 0x6020cc
 (gdb) FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 2: reset timer in the inferior

(Recall that on Linux, SIGKILL is not interceptable)

And other times it dies just while we're detaching:

 $4 = 319
 (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 2: print seconds_left
 detach
 Can't detach Thread 0x7fb13b7de700 (LWP 1842): No such process
 (gdb) FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 2: detach

GDB mishandles the latter (it should ignore ESRCH while detaching just
like when continuing), but that's another story.

The fix here is to change spawn_wait_for_attach to use Expect's
'spawn' command instead of Tcl's 'exec&' to spawn programs, because
with spawn we control when to wait for/reap the process.  That allows
killing the process by PID without being subject to pid-reuse races,
because even if the process is already dead, the kernel won't reuse
the process's PID until the zombie is reaped.

The other part of the problem lies in DejaGnu itself, unfortunately.
I have occasionally seen tests (attach-many-short-lived-threads
included, but not only that one) die with a random inexplicable
SIGTERM too, and that too is caused by the same reason, except that in
that case, the rogue SIGTERM is sent from this bit in DejaGnu's remote.exp:

    exec sh -c "exec > /dev/null 2>&1 && (kill -2 $pgid || kill -2 $pid) && sleep 5 && (kill $pgid || kill $pid) && sleep 5 && (kill -9 $pgid || kill -9 $pid) &"
    ...
    catch "wait -i $shell_id"

Even if the program exits promptly, that whole cascade of kills
carries on in the background, thus potentially killing the poor
process that manages to reuse $pid...

I sent a fix for that to the DejaGnu list:
 http://lists.gnu.org/archive/html/dejagnu/2015-07/msg00000.html

With both patches in place, I haven't seen
attach-many-short-lived-threads.exp fail again.

Tested on x86_64 Fedora 20, native, gdbserver and extended-gdbserver.

gdb/testsuite/ChangeLog:
2015-07-31  Pedro Alves  <palves@redhat.com>

	* gdb.base/attach-pie-misread.exp: Rename $res to $test_spawn_id.
	Use spawn_id_get_pid.  Wait for spawn id after eof.  Use
	kill_wait_spawned_process instead of explicit "kill -9".
	* gdb.base/attach-pie-noexec.exp: Adjust to spawn_wait_for_attach
	returning a spawn id instead of a pid.  Use spawn_id_get_pid and
	kill_wait_spawned_process.
	* gdb.base/attach-twice.exp: Likewise.
	* gdb.base/attach.exp: Likewise.
	(do_command_attach_tests): Use gdb_spawn_with_cmdline_opts and
	gdb_test_multiple.
	* gdb.base/solib-overlap.exp: Adjust to spawn_wait_for_attach
	returning a spawn id instead of a pid.  Use spawn_id_get_pid and
	kill_wait_spawned_process.
	* gdb.base/valgrind-infcall.exp: Likewise.
	* gdb.multi/multi-attach.exp: Likewise.
	* gdb.python/py-prompt.exp: Likewise.
	* gdb.python/py-sync-interp.exp: Likewise.
	* gdb.server/ext-attach.exp: Likewise.
	* gdb.threads/attach-into-signal.exp (corefunc): Use
	spawn_wait_for_attach, spawn_id_get_pid and
	kill_wait_spawned_process.
	* gdb.threads/attach-many-short-lived-threads.exp: Adjust to
	spawn_wait_for_attach returning a spawn id instead of a pid.  Use
	spawn_id_get_pid and kill_wait_spawned_process.
	* gdb.threads/attach-stopped.exp (corefunc): Use
	spawn_wait_for_attach, spawn_id_get_pid and
	kill_wait_spawned_process.
	* gdb.base/break-interp.exp: Rename $res to $test_spawn_id.
	Use spawn_id_get_pid.  Wait for spawn id after eof.  Use
	kill_wait_spawned_process instead of explicit "kill -9".
	* lib/gdb.exp (can_spawn_for_attach): Adjust comment.
	(kill_wait_spawned_process, spawn_id_get_pid): New procedures.
	(spawn_wait_for_attach): Use spawn instead of exec to spawn
	processes.  Don't map cygwin/windows pids here.  Now returns a
	spawn id list.
2015-07-31 20:06:24 +01:00

200 lines
5.8 KiB
Plaintext

# Copyright 2010-2015 Free Software Foundation, Inc.
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
# This test only works on GNU/Linux.
if { ![isnative] || [is_remote host] || [target_info exists use_gdb_stub]
|| ![istarget *-linux*] || [skip_shlib_tests]} {
continue
}
load_lib prelink-support.exp
standard_testfile .c
set genfile [standard_output_file ${testfile}-gen.h]
set executable $testfile
if {[build_executable_own_libs ${testfile}.exp $executable $srcfile [list additional_flags=-fPIE ldflags=-pie]] == ""} {
return -1
}
# Program Headers:
# Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
# LOAD 0x000000 0x0000000000400000 0x0000000000400000 0x134f5ec 0x134f5ec R E 0x200000
# LOAD 0x134f5f0 0x000000000194f5f0 0x000000000194f5f0 0x1dbc60 0x214088 RW 0x200000
# DYNAMIC 0x134f618 0x000000000194f618 0x000000000194f618 0x000200 0x000200 RW 0x8
#
proc read_phdr {binfile test} {
set readelf_program [gdb_find_readelf]
set command "exec $readelf_program -Wl $binfile"
verbose -log "command is $command"
set result [catch $command output]
verbose -log "result is $result"
verbose -log "output is $output"
if {$result != 0} {
fail $test
return
}
if ![regexp {\nProgram Headers:\n *Type [^\n]* Align\n(.*?)\n\n} $output trash phdr] {
fail "$test (no Program Headers)"
return
}
if ![regexp -line {^ *DYNAMIC +0x[0-9a-f]+ +(0x[0-9a-f]+) } $phdr trash dynamic_vaddr] {
fail "$test (no DYNAMIC found)"
return
}
verbose -log "dynamic_vaddr is $dynamic_vaddr"
set align_max -1
foreach {trash align} [regexp -line -all -inline {^ *LOAD .* (0x[0-9]+)$} $phdr] {
if {$align_max < $align} {
set align_max $align
}
}
verbose -log "align_max is $align_max"
if {$align_max == -1} {
fail "$test (no LOAD found)"
return
}
pass $test
return [list $dynamic_vaddr $align_max]
}
set phdr [read_phdr $binfile "readelf initial scan"]
set dynamic_vaddr [lindex $phdr 0]
set align_max [lindex $phdr 1]
set stub_size [format 0x%x [expr "2 * $align_max - ($dynamic_vaddr & ($align_max - 1))"]]
verbose -log "stub_size is $stub_size"
# On x86_64 it is commonly about 4MB.
if {$stub_size > 25000000} {
xfail "stub size $stub_size is too large"
return
}
set test "generate stub"
set command "exec $binfile $stub_size >$genfile"
verbose -log "command is $command"
set result [catch $command output]
verbose -log "result is $result"
verbose -log "output is $output"
if {$result == 0} {
pass $test
} else {
fail $test
}
set prelink_args [build_executable_own_libs ${test}.exp $executable $srcfile [list "additional_flags=-fPIE -DGEN=\"$genfile\"" "ldflags=-pie"]]
if {$prelink_args == ""} {
return -1
}
# x86_64 file has 25MB, no need to keep it.
file delete -- $genfile
set phdr [read_phdr $binfile "readelf rebuilt with stub_size"]
set dynamic_vaddr_prelinkno [lindex $phdr 0]
if ![prelink_yes $prelink_args] {
return -1
}
set phdr [read_phdr $binfile "readelf with prelink -R"]
set dynamic_vaddr_prelinkyes [lindex $phdr 0]
set first_offset [format 0x%x [expr $dynamic_vaddr_prelinkyes - $dynamic_vaddr_prelinkno]]
verbose -log "first_offset is $first_offset"
set test "first offset is non-zero"
if {$first_offset == 0} {
fail "$test (-fPIE -pie in effect?)"
} else {
pass $test
}
set test "start inferior"
gdb_exit
set test_spawn_id [remote_spawn host $binfile]
if { $test_spawn_id < 0 || $test_spawn_id == "" } {
perror "Spawning $binfile failed."
fail $test
return
}
set testpid [spawn_id_get_pid $test_spawn_id]
gdb_expect {
-re "sleeping\r\n" {
pass $test
}
eof {
fail "$test (eof)"
wait -i $test_spawn_id
return
}
timeout {
fail "$test (timeout)"
kill_wait_spawned_process $test_spawn_id
return
}
}
# Due to alignments it was reproducible with 1 on x86_64 but 2 on i686.
foreach align_mult {1 2} { with_test_prefix "shift-by-$align_mult" {
# FIXME: We believe there is enough room under FIRST_OFFSET.
set shifted_offset [format 0x%x [expr "$first_offset - $align_mult * $align_max"]]
verbose -log "shifted_offset is $shifted_offset"
# For normal prelink (prelink_yes call), we need to supply $prelink_args.
# For the prelink `-r' option below, $prelink_args is not required.
# Moreover, if it was used, the problem would not longer be reproducible
# as the libraries would also get relocated.
set command "exec /usr/sbin/prelink -q -N --no-exec-shield -r $shifted_offset $binfile"
verbose -log "command is $command"
set result [catch $command output]
verbose -log "result is $result"
verbose -log "output is $output"
set test "prelink -r"
if {$result == 0 && $output == ""} {
pass $test
} else {
fail $test
}
clean_restart $executable
set test "attach"
gdb_test_multiple "attach $testpid" $test {
-re "Attaching to program: .*, process $testpid\r\n" {
# Missing "$gdb_prompt $" is intentional.
pass $test
}
}
set test "error on Cannot access memory at address"
gdb_test_multiple "" $test {
-re "\r\nCannot access memory at address .*$gdb_prompt $" {
fail $test
}
-re "$gdb_prompt $" {
pass $test
}
}
gdb_test "detach" "Detaching from program: .*"
}}
kill_wait_spawned_process $test_spawn_id