[DTrace-devel] [PATCH 3/3] proc: fix race between dt_proc_signal receipt and waitpid()

Nick Alcock nick.alcock at oracle.com
Tue Oct 31 13:51:41 UTC 2023


There is one tricky race in the hit-waitpid-with-a-signal approach used
in the last commit: sending messages to the monitor thread can race with
waitpid() itself.

We check to see if a proxy message has arrived immediately before
starting waitpid() (which requires some annoying prototype changes to
Pwait() to pass the relevant variable down), but there is a race here:

		if (return_early && *return_early > 0) {
			if (block && waitpid_lock_hook)
				waitpid_lock_hook(P, P->wrap_arg, 0);
			return 0;
		}
-->
                err = waitpid(P->pid, &status, __WALL | (!block ? WNOHANG : 0));

At the indicated point, if a proxy signal hits, waitpid() will not
return EINTR because it hasn't started yet; nor will the change the
signal handler makes to return_early be spotted, because that check has
already been made.  The elegant way to fix this would be something like
ppoll() or pselect() but for waitpid(): a waitpid() variant which
atomically applies a signal mask as it starts up.
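
For comparison, here is the shape of the window-free pattern that
ppoll() provides for file descriptors.  This is an illustrative sketch
only -- none of these names appear in the patch: the wakeup signal
stays blocked across the flag check and is only unblocked, atomically,
once the kernel is committed to sleeping.

	#define _GNU_SOURCE
	#include <poll.h>
	#include <signal.h>

	/*
	 * Illustrative only: the check-then-sleep is race-free because
	 * the wakeup signal can only be delivered while ppoll() sleeps.
	 */
	static int
	sleep_unless_flagged(int fd, int wakeup_sig,
			     volatile sig_atomic_t *flag)
	{
		struct pollfd pfd = { .fd = fd, .events = POLLIN };
		sigset_t while_sleeping;

		/* wakeup_sig is assumed blocked on entry. */
		sigfillset(&while_sleeping);
		sigdelset(&while_sleeping, wakeup_sig);

		if (*flag)	/* cannot race: signal still blocked */
			return 0;
		return ppoll(&pfd, 1, NULL, &while_sleeping);
	}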

But we don't have that, and we're not likely to any time soon. Is there
some alternative approach?  This race is, unfortunately, quite easy to
hit at dtrace -c startup: connection and breakpoint setup involves a
whole bunch of Pwait() calls, and at the same time the monitor thread
is likely to be servicing a whole bunch of proxy calls too, so the two
are likely to intersect badly quite fast.

Reviving test/unittest/pid/tst.manypids.sh -- whose purpose appears to
be to LD_PRELOAD a more or less random pile of libraries, some of which
might have constructors that call malloc(), then run a lot of sleep(1)s
under those preloads and watch them for mallocs -- so that it preloads
libraries that actually exist on Linux triggers this problem easily.

The key is that the race window is only brief.  We can exploit this by
setting up a timer that repeatedly hits the monitor thread with the
proxy-call signal: if the first signal is missed, it will soon be
followed by many more, one of which will land while the waitpid() is
running and -EINTR it properly.  Keep sending signals, and as long as
the monitor thread is working at all it will soon be woken up.
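
The receiving end of all these signals was added in the previous
commit and amounts to nothing more than setting a per-thread flag.  A
rough sketch of its shape (the handler name and exact declaration here
are assumptions, not quoted from that commit):

	/*
	 * Sketch only: each delivery of dt_proc_signal, whether from
	 * pthread_kill() or from the timer, sets the flag and knocks
	 * any in-progress waitpid() out with EINTR.
	 */
	static __thread volatile sig_atomic_t waitpid_interrupted;

	static void
	proxy_wakeup_handler(int sig)
	{
		waitpid_interrupted = 1;
	}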

The only tricky part is that doing a pthread_kill() from a timer
(rather than just a kill()) requires the non-portable, "intended only
for use by threading libraries" SIGEV_THREAD_ID flag to timer_create(),
in conjunction with a nonportable field in struct sigevent: but that
field is part of the ABI and cannot change, and SIGEV_THREAD_ID is used
by glibc's implementation of SIGEV_THREAD, so it can be assumed to have
stable semantics as well.  We can use it without fear.  (The only other
wrinkle is getting the Linux-facing thread ID, for which we need
gettid() even if glibc is too old to provide it.)

To avoid the expense of doing all this madness on every proxy call, we
set up the actual timer in advance, as the monitor thread starts, and
only arm it when a proxy_call() happens.  We disarm it twice: once in
proxy_call(), in case the signal hit when the monitor thread wasn't in
waitpid() at all, and once in the waitpid_lock_hook, so that if it *is*
in waitpid() the timer is disarmed immediately after waitpid() returns
and doesn't keep hitting us with pointless signals once they are no
longer needed.

We set the interval to 1ms, a reasonable extra latency on proxy calls
when we are unlucky enough to hit this race window, and an initial
expiry of 1ms so that in the vast majority of cases we never take the
cost of the timer firing at all.  (Any additional CPU load incurred by
this change is in the noise even summed over hundreds of runs.  It adds
three syscalls per proxy call, which is not going to be noticeable
given the sheer overhead of doing anything at all with ptrace() and the
fact that a proxy call already involves cross-thread condvar operations
and lock manipulation.)

The timer should only ever fire briefly, but if anything goes badly
enough wrong the timer might keep firing for some time.  If the timer
*does* fire repeatedly, no harm is done: each firing has no effect
other than to set waitpid_interrupted and force some syscall to return
-EINTR.  Hopefully we've surrounded all blocking syscalls this might
interrupt with an EINTR loop... and if not, that's a bug that needs
fixing regardless.
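
The idiom in question is just the usual retry loop.  A minimal sketch,
with an invented function name for illustration:

	#include <errno.h>
	#include <unistd.h>

	/*
	 * Sketch of the EINTR-retry idiom: blocking calls that are not
	 * meant to be woken by the pinger simply go back to sleep.
	 */
	static ssize_t
	retrying_read(int fd, void *buf, size_t len)
	{
		ssize_t ret;

		do
			ret = read(fd, buf, len);
		while (ret < 0 && errno == EINTR);

		return ret;
	}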

(An audit of relevant bits of dt_proc and libproc is underway. I expect
a few changes, but not very many.)

Signed-off-by: Nick Alcock <nick.alcock at oracle.com>
---
 Makeconfig                        |  1 +
 include/port.h                    |  4 ++
 libdtrace/dt_proc.c               | 80 +++++++++++++++++++++++++++++--
 libdtrace/dt_proc.h               |  3 ++
 libport/Build                     |  8 ++--
 libport/gettid.c                  | 24 ++++++++++
 test/unittest/pid/tst.manypids.sh |  2 +-
 7 files changed, 115 insertions(+), 7 deletions(-)
 create mode 100644 libport/gettid.c

diff --git a/Makeconfig b/Makeconfig
index 8b9bcda2d22f4..768ca9c95ea47 100644
--- a/Makeconfig
+++ b/Makeconfig
@@ -102,3 +102,4 @@ $(eval $(call check-symbol-rule,LIBFUSE3,fuse_session_receive_buf,fuse3))
 endif
 $(eval $(call check-header-rule,FUSE_NUMA,fuse_set_numa,fuse/fuse_lowlevel))
 $(eval $(call check-header-symbol-rule,CLOSE_RANGE,close_range(3,~0U,0),c,unistd))
+$(eval $(call check-header-rule,GETTID,gettid,unistd))
diff --git a/include/port.h b/include/port.h
index 1d8ec97c4ab77..4b87f4adf91c3 100644
--- a/include/port.h
+++ b/include/port.h
@@ -44,6 +44,10 @@ unsigned long linux_version_code(void);
 int close_range(unsigned int first, unsigned int last, unsigned int flags);
 #endif
 
+#ifndef HAVE_GETTID
+pid_t gettid(void);
+#endif
+
 /*
  * New open() flags not supported in OL6 glibc.
  */
diff --git a/libdtrace/dt_proc.c b/libdtrace/dt_proc.c
index 75112ff0c4db0..a3298fdbc68f2 100644
--- a/libdtrace/dt_proc.c
+++ b/libdtrace/dt_proc.c
@@ -586,12 +586,27 @@ dt_proc_error(dtrace_hdl_t *dtp, dt_proc_t *dpr, const char *format, ...)
  * the control thread are related to the child process, and because some calls
  * to the child process are themselves involved in the implementation of the
  * exec-retry protocol.)
+ *
+ * The actual call involves
+ *  - write a message down the proxy pipe.
+ *  - hit the control thread with the dt_proc_signal, which will wake up
+ *    the waitpid() in Pwait() if it's waiting there and force an early exit to
+ *    check the proxy pipe, and will set the thread-local waitpid_interrupted
+ *    variable which Pwait() checks before it enters waitpid().
+ *  - arm a timer which repeatedly hits the control thread with this signal;
+ *    it is disarmed by the control thread.
+ *  - wait for the dpr_proxy_rq to be reset and the dpr_msg_cv to be
+ *    signalled, indicating that the request is done.
+ *  - if an exec() has happened, jump out to the dpr_proxy_exec_retry
+ *    pad, which will attempt to reattach to the new process.
  */
 
 static long
 proxy_call(dt_proc_t *dpr, long (*proxy_rq)(), int exec_retry)
 {
 	char junk = '\0'; /* unimportant */
+	struct itimerspec pinger = {0};
+	struct itimerspec nonpinger = {0};
 
 	dpr->dpr_proxy_rq = proxy_rq;
 
@@ -613,10 +628,40 @@ proxy_call(dt_proc_t *dpr, long (*proxy_rq)(), int exec_retry)
 	}
 	pthread_kill(dpr->dpr_tid, dpr->dpr_hdl->dt_proc_signal);
 
+	/*
+	 * This timer's sole purpose is to hit the control thread with a signal
+	 * if we are unlucky enough for the initial signal to strike in the gap
+	 * between checking if the signal has hit and entering the waitpid().
+	 * If this race hits, the proxy call latency will be at least as great
+	 * as the interval of the timer, so it shouldn't be too long: but the
+	 * value isn't that important otherwise.  (If it's too short, it'll
+	 * waste time delivering useless signals.)
+	 *
+	 * Because this is only solving a rare race, if it can't be armed it's
+	 * not too serious a problem, and we can more or less just keep going.
+	 */
+	pinger.it_value.tv_nsec = 1000000; /* arbitrary, not too long: 1ms */
+	pinger.it_interval.tv_nsec = 1000000;
+	if (timer_settime(dpr->dpr_proxy_timer, 0, &pinger, NULL) < 0)
+		dt_proc_error(dpr->dpr_hdl, dpr,
+			      "Cannot arm fallback wakeup timer: %s\n",
+			      strerror(errno));
+
 	while (dpr->dpr_proxy_rq != NULL)
 		pthread_cond_wait(&dpr->dpr_msg_cv, &dpr->dpr_lock);
 
-	dpr->dpr_lock_holder = pthread_self();
+	/*
+	 * Disarm the timer again.  This is also done from
+	 * dt_proc_waitpid_lock() so that the signal stops as soon as the
+	 * waitpid() is done: but if the control thread was not waiting at
+	 * waitpid() at all, we'll want to disarm it regardless.
+	 */
+	if (timer_settime(dpr->dpr_proxy_timer, 0, &nonpinger, NULL) < 0)
+		dt_proc_error(dpr->dpr_hdl, dpr,
+			      "Cannot disarm fallback wakeup timer: %s\n",
+			      strerror(errno));
+
+	dpr->dpr_lock_holder = pthread_self();
 
 	if (!dpr->dpr_done && exec_retry && dpr->dpr_proxy_exec_retry &&
 	    *unwinder_pad)
@@ -816,6 +861,7 @@ dt_proc_control(void *arg)
 	dt_proc_control_data_t * volatile datap = arg;
 	dtrace_hdl_t * volatile dtp = datap->dpcd_hdl;
 	dt_proc_t * volatile dpr = datap->dpcd_proc;
+	struct sigevent sev = {0};
 	int err;
 	jmp_buf exec_jmp;
 
@@ -842,8 +888,22 @@ dt_proc_control(void *arg)
 	 */
 	dt_proc_lock(dpr);
 
+	/*
+	 * Set up the machinery to allow the proxy thread to make requests of
+	 * us: two ends of a pipe and one timer to signal this thread with the
+	 * dt_proc_signal.  The timer is not yet armed.
+	 */
+
 	dpr->dpr_proxy_fd[0] = datap->dpcd_proxy_fd[0];
 	dpr->dpr_proxy_fd[1] = datap->dpcd_proxy_fd[1];
+	sev.sigev_notify = SIGEV_SIGNAL | SIGEV_THREAD_ID;
+	sev.sigev_signo = dtp->dt_proc_signal;
+	sev._sigev_un._tid = gettid();
+	if (timer_create(CLOCK_MONOTONIC, &sev, &dpr->dpr_proxy_timer) < 0) {
+		dt_proc_error(dtp, dpr, "failed to create proxy timer for %i: %s\n",
+			      dpr->dpr_pid, strerror(errno));
+		pthread_exit(NULL);
+	}
 
 	/*
 	 * Either create the process, or grab it.  Whichever, on failure, quit
@@ -1351,6 +1411,7 @@ dt_proc_control_cleanup(void *arg)
 	dpr->dpr_proxy_errno = ESRCH;
 	dpr->dpr_proxy_rq = NULL;
 	pthread_cond_signal(&dpr->dpr_msg_cv);
+	timer_delete(dpr->dpr_proxy_timer);
 
 	/*
 	 * Death-notification queueing is complicated by the fact that we might
@@ -2087,8 +2148,21 @@ dt_proc_waitpid_lock(struct ps_prochandle *P, void *arg, int waitpidding)
 
 	if (waitpidding)
 		dt_proc_unlock(dpr);
-	else
-		dt_proc_lock(dpr);
+	else {
+		struct itimerspec nonpinger = {0};
+
+		/*
+		 * A waitpid() is done.  Disarm the signal-pinging timer
+		 * immediately: the waitpid() has woken up, so its job is done.
+		 */
+
+		if (timer_settime(dpr->dpr_proxy_timer, 0, &nonpinger, NULL) < 0)
+			dt_proc_error(dpr->dpr_hdl, dpr,
+				      "Cannot disarm fallback wakeup timer: %s\n",
+				      strerror(errno));
+
+		dt_proc_lock(dpr);
+	}
 }
 
 /*
diff --git a/libdtrace/dt_proc.h b/libdtrace/dt_proc.h
index 90ab6c9c8d4c1..1ca1aeb6f797a 100644
--- a/libdtrace/dt_proc.h
+++ b/libdtrace/dt_proc.h
@@ -8,6 +8,8 @@
 #ifndef	_DT_PROC_H
 #define	_DT_PROC_H
 
+#include <signal.h>
+#include <time.h>
 #include <rtld_db.h>
 #include <libproc.h>
 #include <dtrace.h>
@@ -35,6 +37,7 @@ typedef struct dt_proc {
 	pthread_t dpr_tid;		/* control thread (or zero if none) */
 	pid_t dpr_pid;			/* pid of process */
 	int dpr_proxy_fd[2];		/* proxy request pipe from main thread */
+	timer_t dpr_proxy_timer;	/* control-thread-signalling timer */
 	uint_t dpr_refs;		/* reference count */
 	uint8_t dpr_stop;		/* stop mask: see flag bits below */
 	uint8_t dpr_done;		/* done flag: ctl thread has exited */
diff --git a/libport/Build b/libport/Build
index e043a27efa5b7..48ed99e2b0170 100644
--- a/libport/Build
+++ b/libport/Build
@@ -8,10 +8,12 @@ LIBS += libport
 
 libport_TARGET = libport
 libport_DIR := $(current-dir)
-ifdef HAVE_CLOSE_RANGE
 libport_SOURCES = gmatch.c linux_version_code.c strlcat.c strlcpy.c p_online.c time.c daemonize.c
-else
-libport_SOURCES = gmatch.c linux_version_code.c strlcat.c strlcpy.c p_online.c time.c daemonize.c close_range.c
+ifndef HAVE_CLOSE_RANGE
+libport_SOURCES += close_range.c
+endif
+ifndef HAVE_GETTID
+libport_SOURCES += gettid.c
 endif
 libport_LIBSOURCES := libport
 libport_CPPFLAGS := -Ilibdtrace
diff --git a/libport/gettid.c b/libport/gettid.c
new file mode 100644
index 0000000000000..cca785027ef61
--- /dev/null
+++ b/libport/gettid.c
@@ -0,0 +1,24 @@
+/*
+ * Oracle Linux DTrace; get the kernel-level ID of the current thread.
+ * Copyright (c) 2023, Oracle and/or its affiliates. All rights reserved.
+ * Licensed under the Universal Permissive License v 1.0 as shown at
+ * http://oss.oracle.com/licenses/upl.
+ */
+
+#include <sys/syscall.h>
+#include <unistd.h>
+
+/*
+ * This is only used if glibc does not provide an implementation.
+ *
+ * The kernel always has an implementation of this function.
+ */
+
+pid_t gettid(void)
+{
+#ifdef __NR_gettid
+	return syscall(__NR_gettid);
+#else
+#error The kernel does not define gettid(), and glibc is too old to implement it.
+#endif
+}
diff --git a/test/unittest/pid/tst.manypids.sh b/test/unittest/pid/tst.manypids.sh
index 8f853c0f5bf25..b8ab152ca402a 100755
--- a/test/unittest/pid/tst.manypids.sh
+++ b/test/unittest/pid/tst.manypids.sh
@@ -15,7 +15,7 @@ dtrace=$1
 
 declare -a pids
 
-for lib in `ls -1 /lib/lib*.so.1 | grep -v ld.so.1`; do
+for lib in `ls -1 /lib64/lib*.so.1 | grep -v libthread_db.so.1`; do
 	preload=$lib:${preload}
 done
 
-- 
2.42.0.271.g85384428f1



