[DTrace-devel] [PATCH v6] proc: fix race between dt_proc_signal receipt and waitpid()

Nick Alcock nick.alcock at oracle.com
Sat Oct 28 00:45:25 UTC 2023


So it turns out that the unfixable race mentioned in the parent commit is
quite easy to hit at dtrace -c startup: connection and breakpoint setup
involves a whole bunch of Pwait() calls, and at the same time the control
thread is likely to be doing a whole bunch of proxy calls too.  Reviving
test/unittest/pid/tst.manypids.so (whose purpose appears to be to LD_PRELOAD
a massive pile of libraries, some of who might have constructors that
malloc(), then run a lot of sleep(1)s under those preloads and watch for
mallocs) so that it preloads libraries that actually exist on Linux easily
triggers this problem.

It turns out the race is not unfixable! As noted in the last commit, the
problem is that if the proxy-call signal hits the control thread immediately
before the waitpid() starts, it won't exit with -EINTR because it hasn't
started yet, but it's too late for us to check the variable set by the
signal handler, so we miss the signal arriving and deadlock.

The solution is to set up a timer that *repeatedly* hits the control thread
with the proxy-call signal; thus, if the first one is missed it will soon be
followed by many more, one of which will soon hit when the waitpid() is
running and -EINTR it properly. The only tricky part is that doing a
pthread_kill() from a timer (rather than just a kill()) requires a
non-portable "intended only for use by threading libraries" SIGEV_THREAD_ID
flag to timer_create(), in conjunction with a use of a nonportable field in
the struct sigevent: but the latter field is part of the ABI and cannot
change, and the SIGEV_THREAD_ID option is used by glibc's implementation of
SIGEV_THREAD and thus can be assumed to have stable semantics as well. So we
can use it without fear.  (The only tricky part is getting the Linux-facing
thread ID, for which we need gettid() even if glibc doesn't provide it.)

To avoid excessive expense doing all this on every proxy call, we set up the
actual timer in advance, as the control thread starts, and only arm it when
the proxy_call happens.  We disarm it twice: in proxy_call() in case the
signal hit when the control thread wasn't in waitpid() at all, and in the
waitpid_lock_hook so that if it *is* in waitpid(), it is disarmed
immediately after waitpid() returns and doesn't keep hitting us with
pointless signals after they are no longer needed.

We set the interval to 1ms as a reasonable latency on proxy calls when we
are unlucky enough to hit this race window, and an initial value of 1ms so
that we don't take the cost of firing the timer at all in the vast majority
of cases.  (Any additional CPU load incurred by this change is in the noise
even summed over hundreds of runs.)

(If the timer *does* fire repeatedly, no harm is done: each firing has no
effect other than to set waitpid_interrupted and force some syscall to
return -EINTR.  Hopefully we've surrounded all blocking syscalls this might
interrupt with an EINTR loop... and if not, that's a bug that needs fixing
regardless.)
---
 Makeconfig                        |  1 +
 include/port.h                    |  4 ++
 libdtrace/dt_proc.c               | 80 +++++++++++++++++++++++++++++--
 libdtrace/dt_proc.h               |  3 ++
 libport/Build                     |  8 ++--
 libport/gettid.c                  | 24 ++++++++++
 test/unittest/pid/tst.manypids.sh |  2 +-
 7 files changed, 115 insertions(+), 7 deletions(-)
 create mode 100644 libport/gettid.c

More fallout.  This one's a genuine separate commit, fixing the "unfixable"
race mentioned earlier by the simple means of YELLING A LOT UNTIL WE GET
THE MESSAGE ACROSS.

Maybe I should just resend this series: all the rerolls and fixups
are getting a bit confusing...

diff --git a/Makeconfig b/Makeconfig
index 8b9bcda2d22f4..768ca9c95ea47 100644
--- a/Makeconfig
+++ b/Makeconfig
@@ -102,3 +102,4 @@ $(eval $(call check-symbol-rule,LIBFUSE3,fuse_session_receive_buf,fuse3))
 endif
 $(eval $(call check-header-rule,FUSE_NUMA,fuse_set_numa,fuse/fuse_lowlevel))
 $(eval $(call check-header-symbol-rule,CLOSE_RANGE,close_range(3,~0U,0),c,unistd))
+$(eval $(call check-header-rule,GETTID,gettid,unistd))
diff --git a/include/port.h b/include/port.h
index 9df5f8dfe8be1..3fdee049a6a3a 100644
--- a/include/port.h
+++ b/include/port.h
@@ -46,6 +46,10 @@ unsigned long linux_version_code(void);
 int close_range(unsigned int first, unsigned int last, unsigned int flags);
 #endif
 
+#ifndef HAVE_GETTID
+pid_t gettid(void);
+#endif
+
 /*
  * New open() flags not supported in OL6 glibc.
  */
diff --git a/libdtrace/dt_proc.c b/libdtrace/dt_proc.c
index 171917fbba29c..4347df20bf84f 100644
--- a/libdtrace/dt_proc.c
+++ b/libdtrace/dt_proc.c
@@ -586,12 +586,27 @@ dt_proc_error(dtrace_hdl_t *dtp, dt_proc_t *dpr, const char *format, ...)
  * the control thread are related to the child process, and because some calls
  * to the child process are themselves involved in the implementation of the
  * exec-retry protocol.)
+ *
+ * The actual call involves
+ *  - write a message down the proxy pipe.
+ *  - hit the control thread with the dt_proc_signal, which will wake up
+ *    the waitpid() in Pwait() if it's waiting there and force an early exit to
+ *    check the proxy pipe, and will set the thread-local waitpid_interrupted
+ *    variable wich Pwait() checks before it enters waitpid().
+ *  - arm a timer which repeatedly hits the control thread with this signal;
+ *    it is disarmed by the control thread.
+ *  - wait for the dpr_proxy_rq to be reset and the dpr_msg_cv to be
+ *    signalled, indciating that the request is done.
+ *  - if an exec() has happened, jump out to the dpr_proxy_exec_retry
+ *    pad, which will attempt to reattach to the new process.
  */
 
 static long
 proxy_call(dt_proc_t *dpr, long (*proxy_rq)(), int exec_retry)
 {
 	char junk = '\0'; /* unimportant */
+	struct itimerspec pinger = {0};
+	struct itimerspec nonpinger = {0};
 
 	dpr->dpr_proxy_rq = proxy_rq;
 
@@ -613,10 +628,40 @@ proxy_call(dt_proc_t *dpr, long (*proxy_rq)(), int exec_retry)
 	}
 	pthread_kill(dpr->dpr_tid, dpr->dpr_hdl->dt_proc_signal);
 
+	/*
+	 * This timer's sole purpose is to hit the control thread with a signal
+	 * if we are unlucky enough for the initial signal to strike in the gap
+	 * between checking if the signal has hit and entering the waitpid().
+	 * If this race hits, the proxy call latency will be at least as great
+	 * as the interval of the timer, so it shouldn't be too long: but the
+	 * value isn't that important otherwise.  (If it's too short, it'll
+	 * waste time delivering useless signals.)
+	 *
+	 * Because this is only solving a rare race, if it can't be armed it's
+	 * not too serious a problem, and we can more or less just keep going.
+	 */
+	pinger.it_value.tv_nsec = 1000000; /* arbitrary, not too long: 1ms */
+	pinger.it_interval.tv_nsec = 1000000;
+	if (timer_settime(dpr->dpr_proxy_timer, 0, &pinger, NULL) < 0)
+		dt_proc_error(dpr->dpr_hdl, dpr,
+			      "Cannot create fallback wakeup timer: %s\n",
+			      strerror(errno));
+
 	while (dpr->dpr_proxy_rq != NULL)
 		pthread_cond_wait(&dpr->dpr_msg_cv, &dpr->dpr_lock);
 
-	dpr->dpr_lock_holder = pthread_self();
+	/*
+	 * Disarm the timer again.  This is also done from
+	 * dt_proc_waitpid_lock() so that the signal stops as soon as the
+	 * waitpid() is done: but if the control thread was not waiting at
+	 * waitpid() at all, we'll want to disarm it regardless.
+	 */
+	if (timer_settime(dpr->dpr_proxy_timer, 0, &nonpinger, NULL) < 0)
+		dt_proc_error(dpr->dpr_hdl, dpr,
+			      "Cannot disarm fallback wakeup timer: %s\n",
+			      strerror(errno));
+
+        dpr->dpr_lock_holder = pthread_self();
 
 	if (!dpr->dpr_done && exec_retry && dpr->dpr_proxy_exec_retry &&
 	    *unwinder_pad)
@@ -816,6 +861,7 @@ dt_proc_control(void *arg)
 	dt_proc_control_data_t * volatile datap = arg;
 	dtrace_hdl_t * volatile dtp = datap->dpcd_hdl;
 	dt_proc_t * volatile dpr = datap->dpcd_proc;
+	struct sigevent sev = {0};
 	int err;
 	jmp_buf exec_jmp;
 
@@ -842,8 +888,22 @@ dt_proc_control(void *arg)
 	 */
 	dt_proc_lock(dpr);
 
+	/*
+	 * Set up the machinery to allow the proxy thread to make requests of
+	 * us: two ends of a pipe and one timer to signal this thread with the
+	 * dt_proc_signal.  The timer is not yet armed.
+	 */
+
 	dpr->dpr_proxy_fd[0] = datap->dpcd_proxy_fd[0];
 	dpr->dpr_proxy_fd[1] = datap->dpcd_proxy_fd[1];
+	sev.sigev_notify = SIGEV_SIGNAL | SIGEV_THREAD_ID;
+	sev.sigev_signo = dtp->dt_proc_signal;
+	sev._sigev_un._tid = gettid();
+	if (timer_create(CLOCK_MONOTONIC, &sev, &dpr->dpr_proxy_timer) < 0) {
+		dt_proc_error(dtp, dpr, "failed to arm proxy timer for %i: %s\n",
+			      dpr->dpr_pid, strerror(errno));
+		pthread_exit(NULL);
+	}
 
 	/*
 	 * Either create the process, or grab it.  Whichever, on failure, quit
@@ -1359,6 +1419,7 @@ dt_proc_control_cleanup(void *arg)
 	dpr->dpr_proxy_errno = ESRCH;
 	dpr->dpr_proxy_rq = NULL;
 	pthread_cond_signal(&dpr->dpr_msg_cv);
+	timer_delete(dpr->dpr_proxy_timer);
 
 	/*
 	 * Death-notification queueing is complicated by the fact that we might
@@ -2095,8 +2156,21 @@ dt_proc_waitpid_lock(struct ps_prochandle *P, void *arg, int waitpidding)
 
 	if (waitpidding)
 		dt_proc_unlock(dpr);
-	else
-		dt_proc_lock(dpr);
+	else {
+		struct itimerspec nonpinger = {0};
+
+                /*
+		 * A waitpid() is done.  Disarm the signal-pinging timer
+		 * immediately: the waitpid() has woken up, so its job is done.
+		 */
+
+		if (timer_settime(dpr->dpr_proxy_timer, 0, &nonpinger, NULL) < 0)
+			dt_proc_error(dpr->dpr_hdl, dpr,
+				      "Cannot disarm fallback wakeup timer: %s\n",
+				      strerror(errno));
+
+                dt_proc_lock(dpr);
+	}
 }
 
 /*
diff --git a/libdtrace/dt_proc.h b/libdtrace/dt_proc.h
index 90ab6c9c8d4c1..1ca1aeb6f797a 100644
--- a/libdtrace/dt_proc.h
+++ b/libdtrace/dt_proc.h
@@ -8,6 +8,8 @@
 #ifndef	_DT_PROC_H
 #define	_DT_PROC_H
 
+#include <signal.h>
+#include <time.h>
 #include <rtld_db.h>
 #include <libproc.h>
 #include <dtrace.h>
@@ -35,6 +37,7 @@ typedef struct dt_proc {
 	pthread_t dpr_tid;		/* control thread (or zero if none) */
 	pid_t dpr_pid;			/* pid of process */
 	int dpr_proxy_fd[2];		/* proxy request pipe from main thread */
+	timer_t dpr_proxy_timer;	/* control-thread-signalling timer */
 	uint_t dpr_refs;		/* reference count */
 	uint8_t dpr_stop;		/* stop mask: see flag bits below */
 	uint8_t dpr_done;		/* done flag: ctl thread has exited */
diff --git a/libport/Build b/libport/Build
index e043a27efa5b7..48ed99e2b0170 100644
--- a/libport/Build
+++ b/libport/Build
@@ -8,10 +8,12 @@ LIBS += libport
 
 libport_TARGET = libport
 libport_DIR := $(current-dir)
-ifdef HAVE_CLOSE_RANGE
 libport_SOURCES = gmatch.c linux_version_code.c strlcat.c strlcpy.c p_online.c time.c daemonize.c
-else
-libport_SOURCES = gmatch.c linux_version_code.c strlcat.c strlcpy.c p_online.c time.c daemonize.c close_range.c
+ifndef HAVE_CLOSE_RANGE
+libport_SOURCES += close_range.c
+endif
+ifndef HAVE_GETTID
+libport_SOURCES += gettid.c
 endif
 libport_LIBSOURCES := libport
 libport_CPPFLAGS := -Ilibdtrace
diff --git a/libport/gettid.c b/libport/gettid.c
new file mode 100644
index 0000000000000..cca785027ef61
--- /dev/null
+++ b/libport/gettid.c
@@ -0,0 +1,24 @@
+/*
+ * Oracle Linux DTrace; get the kernel-level ID of the current thread.
+ * Copyright (c) 2023, Oracle and/or its affiliates. All rights reserved.
+ * Licensed under the Universal Permissive License v 1.0 as shown at
+ * http://oss.oracle.com/licenses/upl.
+ */
+
+#include <sys/syscall.h>
+#include <unistd.h>
+
+/*
+ * This is only used if glibc does not provide an implementation.
+ *
+ * The kernel always has an implementation of this function.
+ */
+
+pid_t gettid(void)
+{
+#ifdef __NR_gettid
+	return syscall(__NR_gettid);
+#else
+#error The kernel does not define gettid(), and glibc is too old to implement it.
+#endif
+}
diff --git a/test/unittest/pid/tst.manypids.sh b/test/unittest/pid/tst.manypids.sh
index 8f853c0f5bf25..b8ab152ca402a 100755
--- a/test/unittest/pid/tst.manypids.sh
+++ b/test/unittest/pid/tst.manypids.sh
@@ -15,7 +15,7 @@ dtrace=$1
 
 declare -a pids
 
-for lib in `ls -1 /lib/lib*.so.1 | grep -v ld.so.1`; do
+for lib in `ls -1 /lib64/lib*.so.1 | grep -v libthread_db.so.1`; do
 	preload=$lib:${preload}
 done
 

base-commit: cb3a9e81efd9d4fbc413b9427f86af079b5faab6
prerequisite-patch-id: ce4dfaceb9baa012a354643f839cc413de5fd9d9
prerequisite-patch-id: b525e5cb144cc160e18531916c8e4bba2666a39a
prerequisite-patch-id: 010d59644b481e24a4fe5075d7cc7cca45f845b4
prerequisite-patch-id: 27b3d5f1b197280a050965e5050d7d9c70d58cd9
-- 
2.42.0.271.g85384428f1




More information about the DTrace-devel mailing list