Avoid /proc/pid/mem races (PR 28065)

PR 28065 (gdb.threads/access-mem-running-thread-exit.exp intermittent failure) shows that GDB can hit an unexpected scenario -- it can happen that the kernel manages to open a /proc/PID/task/LWP/mem file, but then reading from the file returns 0/EOF, even though the process hasn't exited or execed. "0" out of read/write is normally what you get when the address space of the process the file was open for is gone, because the process execed or exited. So when GDB gets the 0, it returns memory access failure. In the bad case in question, the process hasn't execed or exited, so GDB fails a memory access when the access should have worked. GDB has code in place to gracefully handle the case of opening the /proc/PID/task/LWP/mem just while the LWP is exiting -- most often the open fails with EACCES or ENOENT. When it happens, GDB just tries opening the file for a different thread of the process. The testcase is written such that it stresses GDB's logic of closing/reopening the /proc/PID/task/LWP/mem file, by constantly spawning short lived threads. However, there's a window where the kernel manages to find the thread, but the thread exits just after and clears its address space pointer. In this case, the kernel creates a file successfully, but the file ends up with no address space associated, so a subsequent read/write returns 0/EOF too, just like if the whole process had execed or exited. This is the case in question that GDB does not handle. Oleg Nesterov gave this suggestion as workaround for that race: gdb can open(/proc/pid/mem) and then read (say) /proc/pid/statm. If statm reports something non-zero, then open() was "successfull". I think that might work. However, I didn't try it, because I realized we have another nasty race that that wouldn't fix. The other race I realized is that because we close/reopen the /proc/PID/task/LWP/mem file when GDB switches to a different inferior, then it can happen that GDB reopens /proc/PID/task/LWP/mem just after a thread execs, and before GDB has seen the corresponding exec event. I.e., we can open a /proc/PID/task/LWP/mem file accessing the post-exec address space thinking we're accessing the pre-exec address space. A few months back, Simon, Oleg and I discussed a similar race: [Bug gdb/26754] Race condition when resuming threads and one does an exec https://sourceware.org/bugzilla/show_bug.cgi?id=26754 The solution back then was to make the kernel fail any ptrace operation until the exec event is consumed, with this kernel commit: commit dbb5afad100a828c97e012c6106566d99f041db6 Author: Oleg Nesterov <oleg@redhat.com> AuthorDate: Wed May 12 15:33:08 2021 +0200 Commit: Linus Torvalds <torvalds@linux-foundation.org> CommitDate: Wed May 12 10:45:22 2021 -0700 ptrace: make ptrace() fail if the tracee changed its pid unexpectedly This however, only applies to ptrace, not to the /proc/pid/mem file opening case. Also, even if it did apply to the file open case, we would want to support current kernels until such a fix is more wide spread anyhow. So all in all, this commit gives up on the idea of only ever keeping one /proc/pid/mem file descriptor open. Instead, make GDB open a /proc/pid/mem per inferior, and keep it open until the inferior exits, is detached or execs. Make GDB open the file right after the inferior is created or is attached to or forks, at which point we know the inferior is stable and stopped and isn't thus going to exec, or have a thread exit, and so the file open won't fail (unless the whole process is SIGKILLed from outside GDB, at which point it doesn't matter whether we open the file). This way, we avoid both races described above, at the expense of using more file descriptors (one per inferior). Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=28065 Change-Id: Iff943b95126d0f98a7973a07e989e4f020c29419
2024-12-15 04:31:49 +08:00 · 2021-09-14 19:01:37 +01:00 · 2021-09-14 19:01:37 +01:00 · 8a89ddbda2
commit 8a89ddbda2
parent 707ed39ac5
2 changed files with 147 additions and 191 deletions
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@ -69,6 +69,7 @@
 #include "gdbsupport/scope-exit.h"
 #include "gdbsupport/gdb-sigmask.h"
 #include "gdbsupport/common-debug.h"
+#include <unordered_map>

 /* This comment documents high-level logic of this file.

@ -280,7 +281,8 @@ static int lwp_status_pending_p (struct lwp_info *lp);

 static void save_stop_reason (struct lwp_info *lp);

-static void maybe_close_proc_mem_file (pid_t pid);
+static void close_proc_mem_file (pid_t pid);
+static void open_proc_mem_file (ptid_t ptid);


 /* LWP accessors.  */
@ -514,6 +516,8 @@ linux_nat_target::follow_fork (inferior *child_inf, ptid_t child_ptid,
 		  && !signal_pass_state (gdb_signal_from_host (signo)))
 		signo = 0;
 	      ptrace (PTRACE_DETACH, child_pid, 0, signo);
+
+	      close_proc_mem_file (child_pid);
 	    }
 	}

@ -1059,6 +1063,8 @@ linux_nat_target::create_inferior (const char *exec_file,
  pass_signals ({});

  inf_ptrace_target::create_inferior (exec_file, allargs, env, from_tty);
+
+  open_proc_mem_file (inferior_ptid);
 }

 /* Callback for linux_proc_attach_tgid_threads.  Attach to PTID if not
@ -1204,6 +1210,8 @@ linux_nat_target::attach (const char *args, int from_tty)

  lp->stopped = 1;

+  open_proc_mem_file (lp->ptid);
+
  /* Save the wait status to report later.  */
  lp->resumed = 1;
  linux_nat_debug_printf ("waitpid %ld, saving status %s",
@ -1455,7 +1463,7 @@ linux_nat_target::detach (inferior *inf, int from_tty)
      detach_success (inf);
    }

-  maybe_close_proc_mem_file (pid);
+  close_proc_mem_file (pid);
 }

 /* Resume execution of the inferior process.  If STEP is nonzero,
@ -1894,6 +1902,8 @@ linux_handle_extended_wait (struct lwp_info *lp, int status)

      if (event == PTRACE_EVENT_FORK || event == PTRACE_EVENT_VFORK)
 	{
+	  open_proc_mem_file (child_ptid);
+
 	  /* The arch-specific native code may need to know about new
 	     forks even if those end up never mapped to an
 	     inferior.  */
@ -1999,9 +2009,13 @@ linux_handle_extended_wait (struct lwp_info *lp, int status)
    {
      linux_nat_debug_printf ("Got exec event from LWP %ld", lp->ptid.lwp ());

-      /* Close the /proc/<pid>/mem file if it was open for this
-	 inferior.  */
-      maybe_close_proc_mem_file (lp->ptid.pid ());
+      /* Close the previous /proc/PID/mem file for this inferior,
+	 which was using the address space which is now gone.
+	 Reading/writing from this file would return 0/EOF.  */
+      close_proc_mem_file (lp->ptid.pid ());
+
+      /* Open a new file for the new address space.  */
+      open_proc_mem_file (lp->ptid);

      ourstatus->set_execd
 	(make_unique_xstrdup (linux_proc_pid_to_exec_file (pid)));
@ -3566,9 +3580,7 @@ linux_nat_target::mourn_inferior ()

  purge_lwp_list (pid);

-  /* Close the /proc/<pid>/mem file if it was open for this
-     inferior.  */
-  maybe_close_proc_mem_file (pid);
+  close_proc_mem_file (pid);

  if (! forks_exist_p ())
    /* Normal case, no other forks available.  */
@ -3765,91 +3777,134 @@ linux_nat_target::pid_to_exec_file (int pid)
  return linux_proc_pid_to_exec_file (pid);
 }

-/* Keep the /proc/<pid>/mem file open between memory accesses, as a
-   cache to avoid constantly closing/opening the file in the common
-   case of multiple memory reads/writes from/to the same inferior.
-   Note we don't keep a file open per inferior to avoid keeping too
-   many file descriptors open, which can run into resource limits.  */
-static struct
+/* Object representing an /proc/PID/mem open file.  We keep one such
+   file open per inferior.
+
+   It might be tempting to think about only ever opening one file at
+   most for all inferiors, closing/reopening the file as we access
+   memory of different inferiors, to minimize number of file
+   descriptors open, which can otherwise run into resource limits.
+   However, that does not work correctly -- if the inferior execs and
+   we haven't processed the exec event yet, and, we opened a
+   /proc/PID/mem file, we will get a mem file accessing the post-exec
+   address space, thinking we're opening it for the pre-exec address
+   space.  That is dangerous as we can poke memory (e.g. clearing
+   breakpoints) in the post-exec memory by mistake, corrupting the
+   inferior.  For that reason, we open the mem file as early as
+   possible, right after spawning, forking or attaching to the
+   inferior, when the inferior is stopped and thus before it has a
+   chance of execing.
+
+   Note that after opening the file, even if the thread we opened it
+   for subsequently exits, the open file is still usable for accessing
+   memory.  It's only when the whole process exits or execs that the
+   file becomes invalid, at which point reads/writes return EOF.  */
+
+class proc_mem_file
 {
-  /* The LWP this open file is for.  Note that after opening the file,
-     even if the thread subsequently exits, the open file is still
-     usable for accessing memory.  It's only when the whole process
-     exits or execs that the file becomes invalid (at which point
-     reads/writes return EOF).  */
-  ptid_t ptid;
+public:
+  proc_mem_file (ptid_t ptid, int fd)
+    : m_ptid (ptid), m_fd (fd)
+  {
+    gdb_assert (m_fd != -1);
+  }

-  /* The file descriptor.  -1 if file is not open.  */
-  int fd = -1;
-
-  /* Close FD and clear it to -1.  */
-  void close ()
+  ~proc_mem_file ()
  {
    linux_nat_debug_printf ("closing fd %d for /proc/%d/task/%ld/mem",
-			    fd, ptid.pid (), ptid.lwp ());
-    ::close (fd);
-    fd = -1;
+			    m_fd, m_ptid.pid (), m_ptid.lwp ());
+    close (m_fd);
  }
-} last_proc_mem_file;

-/* Close the /proc/<pid>/mem file if its LWP matches PTID.  */
+  DISABLE_COPY_AND_ASSIGN (proc_mem_file);
+
+  int fd ()
+  {
+    return m_fd;
+  }
+
+private:
+  /* The LWP this file was opened for.  Just for debugging
+     purposes.  */
+  ptid_t m_ptid;
+
+  /* The file descriptor.  */
+  int m_fd = -1;
+};
+
+/* The map between an inferior process id, and the open /proc/PID/mem
+   file.  This is stored in a map instead of in a per-inferior
+   structure because we need to be able to access memory of processes
+   which don't have a corresponding struct inferior object.  E.g.,
+   with "detach-on-fork on" (the default), and "follow-fork parent"
+   (also default), we don't create an inferior for the fork child, but
+   we still need to remove breakpoints from the fork child's
+   memory.  */
+static std::unordered_map<int, proc_mem_file> proc_mem_file_map;
+
+/* Close the /proc/PID/mem file for PID.  */

 static void
-maybe_close_proc_mem_file (pid_t pid)
+close_proc_mem_file (pid_t pid)
 {
-  if (last_proc_mem_file.ptid.pid () == pid)
-    last_proc_mem_file.close ();
+  proc_mem_file_map.erase (pid);
 }

-/* Helper for linux_proc_xfer_memory_partial.  Accesses /proc via
-   PTID.  Returns -1 on error, with errno set.  Returns number of
-   read/written bytes otherwise.  Returns 0 on EOF, which indicates
-   the address space is gone (because the process exited or
-   execed).  */
+/* Open the /proc/PID/mem file for the process (thread group) of PTID.
+   We actually open /proc/PID/task/LWP/mem, as that's the LWP we know
+   exists and is stopped right now.  We prefer the
+   /proc/PID/task/LWP/mem form over /proc/LWP/mem to avoid tid-reuse
+   races, just in case this is ever called on an already-waited
+   LWP.  */

-static ssize_t
-linux_proc_xfer_memory_partial_pid (ptid_t ptid,
-				    gdb_byte *readbuf, const gdb_byte *writebuf,
-				    ULONGEST offset, LONGEST len)
+static void
+open_proc_mem_file (ptid_t ptid)
+{
+  auto iter = proc_mem_file_map.find (ptid.pid ());
+  gdb_assert (iter == proc_mem_file_map.end ());
+
+  char filename[64];
+  xsnprintf (filename, sizeof filename,
+	     "/proc/%d/task/%ld/mem", ptid.pid (), ptid.lwp ());
+
+  int fd = gdb_open_cloexec (filename, O_RDWR | O_LARGEFILE, 0).release ();
+
+  if (fd == -1)
+    {
+      warning (_("opening /proc/PID/mem file for lwp %d.%ld failed: %s (%d)"),
+	       ptid.pid (), ptid.lwp (),
+	       safe_strerror (errno), errno);
+      return;
+    }
+
+  proc_mem_file_map.emplace (std::piecewise_construct,
+			     std::forward_as_tuple (ptid.pid ()),
+			     std::forward_as_tuple (ptid, fd));
+
+  linux_nat_debug_printf ("opened fd %d for lwp %d.%ld\n",
+			  fd, ptid.pid (), ptid.lwp ());
+}
+
+/* Implement the to_xfer_partial target method using /proc/PID/mem.
+   Because we can use a single read/write call, this can be much more
+   efficient than banging away at PTRACE_PEEKTEXT.  Also, unlike
+   PTRACE_PEEKTEXT/PTRACE_POKETEXT, this works with running
+   threads.  */
+
+static enum target_xfer_status
+linux_proc_xfer_memory_partial (gdb_byte *readbuf, const gdb_byte *writebuf,
+				ULONGEST offset, LONGEST len,
+				ULONGEST *xfered_len)
 {
  ssize_t ret;

-  /* As long as we're hitting the same inferior, the previously open
-     file is good, even if the thread it was open for exits.  */
-  if (last_proc_mem_file.fd != -1
-      && last_proc_mem_file.ptid.pid () != ptid.pid ())
-    last_proc_mem_file.close ();
+  auto iter = proc_mem_file_map.find (inferior_ptid.pid ());
+  if (iter == proc_mem_file_map.end ())
+    return TARGET_XFER_EOF;

-  if (last_proc_mem_file.fd == -1)
-    {
-      /* Actually use /proc/<pid>/task/<lwp>/mem instead of
-	 /proc/<lwp>/mem to avoid PID-reuse races, as we may be trying
-	 to read memory via a thread which we've already reaped.
-	 /proc/<lwp>/mem could open a file for the wrong process.  If
-	 the LWPID is reused for the same process it's OK, we can read
-	 memory through it just fine.  If the LWPID is reused for a
-	 different process, then the open will fail because the path
-	 won't exist.  */
-      char filename[64];
-      xsnprintf (filename, sizeof filename,
-		 "/proc/%d/task/%ld/mem", ptid.pid (), ptid.lwp ());
+  int fd = iter->second.fd ();

-      last_proc_mem_file.fd
-	= gdb_open_cloexec (filename, O_RDWR | O_LARGEFILE, 0).release ();
-
-      if (last_proc_mem_file.fd == -1)
-	{
-	  linux_nat_debug_printf ("opening %s failed: %s (%d)\n",
-				  filename, safe_strerror (errno), errno);
-	  return -1;
-	}
-      last_proc_mem_file.ptid = ptid;
-
-      linux_nat_debug_printf ("opened fd %d for %s",
-			      last_proc_mem_file.fd, filename);
-    }
-
-  int fd = last_proc_mem_file.fd;
+  gdb_assert (fd != -1);

  /* Use pread64/pwrite64 if available, since they save a syscall and can
     handle 64-bit offsets even on 32-bit platforms (for instance, SPARC
@ -3866,125 +3921,24 @@ linux_proc_xfer_memory_partial_pid (ptid_t ptid,

  if (ret == -1)
    {
-      linux_nat_debug_printf ("accessing fd %d for pid %ld failed: %s (%d)\n",
-			      fd, ptid.lwp (),
+      linux_nat_debug_printf ("accessing fd %d for pid %d failed: %s (%d)\n",
+			      fd, inferior_ptid.pid (),
 			      safe_strerror (errno), errno);
+      return TARGET_XFER_EOF;
    }
  else if (ret == 0)
    {
-      linux_nat_debug_printf ("accessing fd %d for pid %ld got EOF\n",
-			      fd, ptid.lwp ());
-    }
-
-  return ret;
-}
-
-/* Implement the to_xfer_partial target method using /proc/<pid>/mem.
-   Because we can use a single read/write call, this can be much more
-   efficient than banging away at PTRACE_PEEKTEXT.  Also, unlike
-   PTRACE_PEEKTEXT/PTRACE_POKETEXT, this works with running
-   threads.  */
-
-static enum target_xfer_status
-linux_proc_xfer_memory_partial (gdb_byte *readbuf, const gdb_byte *writebuf,
-				ULONGEST offset, LONGEST len,
-				ULONGEST *xfered_len)
-{
-  /* Unlike PTRACE_PEEKTEXT/PTRACE_POKETEXT, reading/writing from/to
-     /proc/<pid>/mem works with running threads, and even exited
-     threads if the file was already open.  If we need to open or
-     reopen the /proc file though, we may get an EACCES error
-     ("Permission denied"), meaning the thread is gone but its exit
-     status isn't reaped yet, or ENOENT if the thread is gone and
-     already reaped.  In that case, just try opening the file for
-     another thread in the process.  If all threads fail, then it must
-     mean the whole process exited, in which case there's nothing else
-     to do and we just fail the memory access.
-
-     Note we don't simply always access via the leader thread because
-     the leader may exit without exiting the whole process.  See
-     gdb.threads/leader-exit.exp, for example.  */
-
-  /* It's frequently the case that the selected thread is stopped, and
-     is thus not likely to exit (unless something kills the process
-     outside our control, with e.g., SIGKILL).  Give that one a try
-     first.
-
-     Also, inferior_ptid may actually point at an LWP not in lwp_list.
-     This happens when we're detaching from a fork child that we don't
-     want to debug ("set detach-on-fork on"), and the breakpoints
-     module uninstalls breakpoints from the fork child.  Which process
-     to access is given by inferior_ptid.  */
-  int res = linux_proc_xfer_memory_partial_pid (inferior_ptid,
-						readbuf, writebuf,
-						offset, len);
-  if (res == 0)
-    {
-      /* EOF means the address space is gone, the whole
-	 process exited or execed.  */
+      /* EOF means the address space is gone, the whole process exited
+	 or execed.  */
+      linux_nat_debug_printf ("accessing fd %d for pid %d got EOF\n",
+			      fd, inferior_ptid.pid ());
      return TARGET_XFER_EOF;
    }
-  else if (res != -1)
-    {
-      *xfered_len = res;
-      return TARGET_XFER_OK;
-    }
  else
    {
-      /* If we simply raced with the thread exiting (EACCES), or the
-	 current thread is THREAD_EXITED (ENOENT), try some other
-	 thread.  It's easier to handle an ENOENT failure than check
-	 for THREAD_EXIT upfront because this function is called
-	 before a thread for inferior_ptid is added to the thread
-	 list.  */
-      if (errno != EACCES && errno != ENOENT)
-	return TARGET_XFER_EOF;
-    }
-
-  int cur_pid = current_inferior ()->pid;
-
-  if (inferior_ptid.pid () != cur_pid)
-    {
-      /* We're accessing a fork child, and the access above failed.
-	 Don't bother iterating the LWP list, since there's no other
-	 LWP for this process.  */
-      return TARGET_XFER_EOF;
-    }
-
-  /* Iterate over LWPs of the current inferior, trying to access
-     memory through one of them.  */
-  for (lwp_info *lp : all_lwps ())
-    {
-      if (lp->ptid.pid () != cur_pid)
-	continue;
-
-      res = linux_proc_xfer_memory_partial_pid (lp->ptid,
-						readbuf, writebuf,
-						offset, len);
-
-      if (res == 0)
-	{
-	  /* EOF means the address space is gone, the whole process
-	     exited or execed.  */
-	  return TARGET_XFER_EOF;
-	}
-      else if (res == -1)
-	{
-	  if (errno == EACCES)
-	    {
-	      /* This LWP is gone, try another one.  */
-	      continue;
-	    }
-
-	  return TARGET_XFER_EOF;
-	}
-
-      *xfered_len = res;
+      *xfered_len = ret;
      return TARGET_XFER_OK;
    }
-
-  /* No luck.  */
-  return TARGET_XFER_EOF;
 }

 /* Parse LINE as a signal set and add its set bits to SIGS.  */
--- a/gdb/testsuite/gdb.threads/access-mem-running-thread-exit.exp
+++ b/gdb/testsuite/gdb.threads/access-mem-running-thread-exit.exp
@ -26,12 +26,14 @@
 # memory through exits.
 #
 # The test spawns two processes and alternates memory accesses between
-# them to force flushing per-process caches.  At the time of writing,
-# the Linux backend accesses inferior memory via /proc/<pid>/mem, and
-# keeps one such file open, as a cache.  Alternating inferiors forces
-# opening such file for a different process, which fails if GDB tries
-# to open the file for a thread that exited.  The test does ensures
-# those reopen/fail code paths are exercised.
+# them to force flushing per-process caches.  When the testcase was
+# originally written, the Linux backend would access inferior memory
+# via /proc/PID/mem, and kept one such file open, as a cache.
+# Alternating inferiors would force re-opening such file for a
+# different process, which would fail if GDB tried to open the file
+# for a thread that exited.  The test thus ensured those reopen/fail
+# code paths were exercised.  Nowadays, GDB keeps one /proc/PID/mem
+# file open per inferior.

 standard_testfile