binutils-gdb/gdb/testsuite/gdb.threads/attach-slow-waitpid.exp

# Copyright 2018-2020 Free Software Foundation, Inc.
gdb: Don't drop SIGSTOP during stop_all_threads

This patch fixes an issue where GDB would sometimes hang when
attaching to a multi-threaded process.  This issue was especially
likely to trigger if the machine (running the inferior) was under
load.

In summary, the problem is an imbalance between two functions in
linux-nat.c, stop_callback and stop_wait_callback.  In stop_callback
we send SIGSTOP to a thread, but _only_ if the thread is not already
stopped and is not already signalled (a signalled thread should stop
soon).  In stop_wait_callback we wait for the SIGSTOP to arrive.
However, we are aware that the thread might have been signalled for
some other reason, so if a signal other than SIGSTOP causes the
thread to stop then we stash that signal away so it can be reported
back later.  If we get a SIGSTOP then it is discarded; after all,
this signal was sent from stop_callback.  Except that this might not
be the case: the SIGSTOP could have been sent to the thread from
elsewhere in GDB, in which case we would not have sent another
SIGSTOP from stop_callback, and the SIGSTOP received in
stop_wait_callback should not be ignored.

Below I've laid out the exact sequence of events that led me to the
above diagnosis.

After attaching to the inferior GDB sends a SIGSTOP to all of the
threads and then returns to the event loop waiting for interesting
things to happen.

Eventually the first target event is detected (this will be the
first SIGSTOP arriving) and GDB calls inferior_event_handler, which
calls fetch_inferior_event.  Inside fetch_inferior_event GDB calls
do_target_wait, which calls target_wait to find a thread with an
event.

The target_wait call ends up in linux_nat_wait_1, which first checks
to see if any threads already have stashed stop events to report.
If there are none then we enter a loop fetching as many events as
possible out of the kernel.  This event fetching is non-blocking,
and we give up once the kernel has no more events ready to give us.

All of the events from the kernel are passed through
linux_nat_filter_event, which stashes the wait status for all of the
threads that reported a SIGSTOP.  These will be returned by future
calls to linux_nat_wait_1.

Let's assume for a moment that we've attached to a multi-threaded
inferior, and that all but one thread has reported its stop during
the initial wait call in linux_nat_wait_1.  The other thread will be
reporting a SIGSTOP, but the kernel has not yet managed to deliver
that signal to GDB before GDB gave up waiting and continued handling
the events it already had.

GDB selects one of the threads that has reported a SIGSTOP and
passes this thread ID back to fetch_inferior_event.  To handle the
thread's SIGSTOP, GDB calls handle_signal_stop, which calls
stop_all_threads.  This calls wait_one, which in turn calls
target_wait.

The first call to target_wait at this point will result in a stashed
wait status being returned, at which point we call setup_inferior.
The call to setup_inferior leads to a call into
try_thread_db_load_1, which results in a call to
linux_stop_and_wait_all_lwps.  This in turn calls stop_callback on
each thread, followed by stop_wait_callback on each thread.

We're now ready to make the mistake.  In stop_callback we see that
our problem thread is not stopped, but is signalled, so it should
stop soon.  As a result we don't send another SIGSTOP.

We then enter stop_wait_callback.  Eventually the problem thread
stops with SIGSTOP, which we _incorrectly_ assume came from
stop_callback, and we discard it.

Once stop_wait_callback has done its damage we return from
linux_stop_and_wait_all_lwps, finish in try_thread_db_load_1, and
eventually unwind back to the call to setup_inferior in
stop_all_threads.  GDB now loops around and performs another
target_wait to get the next event from the inferior.

The target_wait call causes us to once again reach linux_nat_wait_1,
and we pass through some code that calls
resume_stopped_resumed_lwps.  This allows GDB to resume threads that
are physically stopped, but for which GDB doesn't see any good
reason to remain stopped.  In our case, the problem thread which had
its SIGSTOP discarded is stopped, but doesn't have a stashed wait
status to report, and so GDB sets the thread going again.

We are now stuck waiting for an event on the problem thread that
might never arrive.

When considering how to write a test for this bug I struggled.  The
issue was only spotted _randomly_ when a machine was heavily loaded
with many multi-threaded applications, and GDB was being attached
(by script) to all of these applications in parallel.  In one
reproducer I required around 5 applications each of 5 threads per
machine core in order to reproduce the bug 2 out of 3 times.

What we really want to do though is simulate the kernel being slow
to report events through waitpid during the initial attach.  The
solution I came up with was to write an LD_PRELOAD library that
intercepts (some) waitpid calls and rate limits them to one per
second.  Any more than that simply return 0, indicating there's no
event available.  Obviously this can only be applied to waitpid
calls that have the WNOHANG flag set.

Unfortunately, once you ignore a waitpid call GDB can get a bit
stuck.  Usually, once the kernel has made a child status available
to waitpid, GDB will be sent a SIGCHLD signal.  However, if the
kernel makes 5 child statuses available but, due to the preload
library, we only collect one of them, then the kernel will not send
any further SIGCHLD signals.  So when GDB, thinking that the
remaining statuses have not yet arrived, sits waiting for a SIGCHLD,
it will be disappointed.

The solution, implemented within the preload library, is that when
we hold back a waitpid result from GDB we spawn a new thread.  This
thread delays for a short period, and then sends GDB a SIGCHLD.
This causes GDB to retry the waitpid, at which point sufficient time
has passed and our library allows the waitpid call to complete.

gdb/ChangeLog:

	* linux-nat.c (stop_wait_callback): Don't discard SIGSTOP if it
	was requested by GDB.

gdb/testsuite/ChangeLog:

	* gdb.threads/attach-slow-waitpid.c: New file.
	* gdb.threads/attach-slow-waitpid.exp: New file.
	* gdb.threads/slow-waitpid.c: New file.
2018-05-11 06:52:49 +08:00
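The imbalance described in the commit message above can be illustrated with a toy model.  This is only a sketch under stated assumptions: struct toy_lwp, its fields, and the toy_* functions are hypothetical stand-ins invented for illustration, not the real struct lwp_info or the real code in linux-nat.c.  The "fixed" flag is likewise an assumption used to contrast the old and new behaviour in one place.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for GDB's per-thread state.  */
struct toy_lwp
{
  bool stopped;         /* Thread is known to be stopped.  */
  bool signalled;       /* A SIGSTOP is already in flight.  */
  bool stop_requested;  /* Another part of GDB asked for this stop.  */
  int pending_status;   /* Stashed stop signal, or 0 if none.  */
};

/* Model of stop_callback: send SIGSTOP only if the thread is neither
   stopped nor already signalled.  Returns true if a new SIGSTOP was
   sent.  */
bool
toy_stop_callback (struct toy_lwp *lp)
{
  assert (lp != NULL);
  if (lp->stopped || lp->signalled)
    return false;
  lp->signalled = true;
  return true;
}

/* Model of stop_wait_callback once the thread has stopped with SIG.
   With FIXED false, every SIGSTOP is discarded.  With FIXED true, a
   SIGSTOP that was requested elsewhere is stashed instead.  */
void
toy_stop_wait_callback (struct toy_lwp *lp, int sig, bool fixed)
{
  assert (lp != NULL);
  lp->stopped = true;
  if (sig != SIGSTOP)
    {
      /* Some other signal: stash it so it can be reported later.  */
      lp->pending_status = sig;
      return;
    }
  lp->signalled = false;
  if (fixed && lp->stop_requested)
    lp->pending_status = SIGSTOP;
}

/* Model of the check behind resume_stopped_resumed_lwps: a stopped
   thread with nothing to report gets set running again.  */
bool
toy_would_be_resumed (const struct toy_lwp *lp)
{
  return lp->stopped && lp->pending_status == 0;
}
```

In this model, a thread that is already signalled when toy_stop_callback runs receives no second SIGSTOP.  Without the fix its only stop event is thrown away and toy_would_be_resumed reports true, which corresponds to the hang described above.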
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

# This test script tries to expose a bug in some of the uses of
# waitpid in the Linux native support within GDB. The problem was
# spotted on systems which were heavily loaded when attaching to
# threaded test programs. What happened was that during the initial
# attach, the loop of waitpid calls that normally received the stop
# events from each of the threads in the inferior was not receiving a
# stop event for some threads (the kernel just hadn't sent the stop
# event yet).
#
# GDB would then trigger a call to stop_all_threads, which would
# continue to wait for all of the outstanding threads to stop.  When
# the outstanding stop events finally arrived, GDB would
# (incorrectly) discard the stop event, resume the thread, and
# continue to wait for the thread to stop, which it now never
# would.
#
# In order to try and expose this issue reliably, this test preloads a
# library that intercepts waitpid calls. All waitpid calls targeting
# pid -1 with the WNOHANG flag are rate limited so that only 1 per
# second can complete. Additional calls are forced to return 0
# indicating no event waiting. This is enough to trigger the bug
# during the attach phase.
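
The rate limiting described in the comment above could be sketched roughly as follows.  This is not the code from gdb.threads/slow-waitpid.c; rate_limit_allows and throttled_waitpid are hypothetical names, and the clock is passed in as a parameter purely so the decision logic can be exercised in isolation.  A real preload library would instead export a function named waitpid and forward to the genuine one, typically found with dlsym and RTLD_NEXT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>

/* Hypothetical helper: decide whether a call made at time NOW may go
   through.  At most one call per second is allowed.  *LAST_ALLOWED
   holds the time of the last permitted call, or 0 if there has been
   none yet.  */
bool
rate_limit_allows (time_t now, time_t *last_allowed)
{
  assert (last_allowed != NULL);
  if (*last_allowed != 0 && now - *last_allowed < 1)
    return false;
  *last_allowed = now;
  return true;
}

/* Hypothetical wrapper around waitpid.  Only non-blocking waits on
   "any child" (pid -1 with WNOHANG) are throttled; a throttled call
   returns 0, which the caller reads as "no event waiting".  Every
   other call is passed straight through.  */
pid_t
throttled_waitpid (pid_t pid, int *wstatus, int options,
                   time_t now, time_t *last_allowed)
{
  if (pid == -1 && (options & WNOHANG) != 0
      && !rate_limit_allows (now, last_allowed))
    return 0;
  return waitpid (pid, wstatus, options);
}
```

Blocking calls are left alone because returning 0 from them would not be a valid result; only a WNOHANG call may legitimately report that no child has changed state.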

# This test only works on Linux.
if { ![isnative] || [is_remote host] || [use_gdb_stub]
     || ![istarget *-linux*] } {
    continue
}

standard_testfile

set libfile slow-waitpid
set libsrc "${srcdir}/${subdir}/${libfile}.c"
set libobj [standard_output_file ${libfile}.so]

with_test_prefix "compile preload library" {
    # Compile the preload library.  We only get away with this as we
    # limit this test to running when ISNATIVE is true.
    if { [gdb_compile_shlib_pthreads \
              $libsrc $libobj {debug}] != "" } then {
        return -1
    }
}

with_test_prefix "compile test executable" {
    # Compile the test program.
    if { [gdb_compile_pthreads \
              "${srcdir}/${subdir}/${srcfile}" "${binfile}" \
              executable {debug}] != "" } {
        return -1
    }
}

# Spawn GDB with LIB preloaded with LD_PRELOAD.

proc gdb_spawn_with_ld_preload {lib} {
    global env

    save_vars { env(LD_PRELOAD) } {
        if { ![info exists env(LD_PRELOAD) ]
             || $env(LD_PRELOAD) == "" } {
            set env(LD_PRELOAD) "$lib"
        } else {
            append env(LD_PRELOAD) ":$lib"
        }

        gdb_start
    }
}

# Run test program in the background.
set test_spawn_id [spawn_wait_for_attach $binfile]
set testpid [spawn_id_get_pid $test_spawn_id]

# Start GDB with preload library in place.
[gdb/testsuite] Bail out after gdb_start error in gdb.threads/attach-slow-waitpid.exp

When building gdb using CFLAGS/CXXFLAGS+=-fsanitize=address and
LDFLAGS+=-lasan, and running test-case
gdb.threads/attach-slow-waitpid.exp, we get:
...
spawn gdb -nw -nx -data-directory data-directory^M
==16079==ASan runtime does not come first in initial library list; \
  you should either link runtime to your application or manually preload \
  it with LD_PRELOAD.^M
ERROR: (eof) GDB never initialized.
ERROR: : spawn id exp10 not open
    while executing
"expect {
-i exp10 -timeout 120
        -re "Kill the program being debugged. .y or n. $" {
            send_gdb "y\n" answer
            verbose "\t\tKilling previous pro..."
    ("uplevel" body line 1)
    invoked from within
"uplevel $body" NONE : spawn id exp10 not open
WARNING: remote_expect statement without a default case
ERROR: : spawn id exp10 not open
    while executing
"expect {
-i exp10 -timeout 120
        -re "Reading symbols from.*LZMA support was disabled.*$gdb_prompt $" {
            verbose "\t\tLoaded $arg into $GDB; .gnu_..."
    ("uplevel" body line 1)
    invoked from within
"uplevel $body" NONE : spawn id exp10 not open
ERROR: Couldn't load attach-slow-waitpid into GDB (eof).
ERROR: Couldn't send attach 16070 to GDB.
UNRESOLVED: gdb.threads/attach-slow-waitpid.exp: attach to target
...

Bail out at the first ERROR, such that we have instead:
...
ERROR: (eof) GDB never initialized.
UNTESTED: gdb.threads/attach-slow-waitpid.exp: \
  Couldn't start GDB with preloaded lib
...

Tested on x86_64-linux.

gdb/testsuite/ChangeLog:

2020-07-20  Tom de Vries  <tdevries@suse.de>

	* gdb.threads/attach-slow-waitpid.exp: Bail out if gdb_start fails.
2020-07-20 16:54:31 +08:00
if { [gdb_spawn_with_ld_preload $libobj] == -1 } {
    # Make sure we get UNTESTED rather than UNRESOLVED.
    set errcnt 0
    untested "Couldn't start GDB with preloaded lib"
    return -1
}

# Load binary, and attach to running program.
gdb_load ${binfile}
gdb_test "attach $testpid" "Attaching to program.*" "attach to target"

gdb_exit

# Kill off the test program.
kill_wait_spawned_process $test_spawn_id