[DTrace-devel] [PATCH 3/3] proc: fix race between dt_proc_signal receipt and waitpid()

Kris Van Hees kris.van.hees at oracle.com
Wed Nov 8 05:35:17 UTC 2023


On Tue, Oct 31, 2023 at 01:51:41PM +0000, Nick Alcock wrote:
> There is one tricky race in the hit-waitpid-with-a-signal approach used
> in the last commit: sending messages to the monitor thread can race with
> waitpid() itself.
> 
> We check to see if a proxy message has arrived immediately before
> starting waitpid() (which requires some annoying prototype changes to
> Pwait() to pass the relevant variable down), but there is a race here:
> 
> 		if (return_early && *return_early > 0) {
> 			if (block && waitpid_lock_hook)
> 				waitpid_lock_hook(P, P->wrap_arg, 0);
> 			return 0;
> 		}
> -->
>                 err = waitpid(P->pid, &status, __WALL | (!block ? WNOHANG : 0));
> 
> At the indicated point, if a proxy signal hits, waitpid() will not
> return EINTR because it hasn't started yet, but the return_early
> variable changing its state (via the signal handler) will not be spotted
> either because it's arrived too late. The elegant way to fix this would
> be something like ppoll() or pselect() but for waitpid(): a waitpid()
> variant which atomically applies a signal mask as it starts up.
> 
> But we don't have that, and we're not likely to any time soon. Is there
> some alternative approach? This race is, unfortunately, quite easy to
> hit at dtrace -c startup: connection and breakpoint setup involves a
> whole bunch of Pwait() calls, and at the same time the monitor thread is
> likely to be doing a whole bunch of proxy calls too, so they're likely
> to intersect badly quite fast.
> 
> Reviving test/unittest/pid/tst.manypids.sh -- whose purpose appears to
> be to LD_PRELOAD a more or less random pile of libraries, some of who
> might have constructors that malloc(), then run a lot of sleep(1)s under
> those preloads and watch them for mallocs -- so that it preloads
> libraries that actually exist on Linux easily triggers this problem.
> 
> The key is that the race is only brief. We can exploit this by setting
> up a timer that repeatedly hits the monitor thread with the proxy-call
> signal; if the first one is missed it will soon be followed by many
> more, one of which will soon hit when the waitpid() is running and
> -EINTR it properly: keep sending signals and as long as the monitor
> thread is working at all it will soon get woken up.
> 
> The only tricky part is that doing a pthread_kill() from a timer (rather
> than just a kill()) requires a non-portable "intended only for use by
> threading libraries" SIGEV_THREAD_ID flag to timer_create(), in
> conjunction with a use of a nonportable field in the struct sigevent:
> but the latter field is part of the ABI and cannot change, and the
> SIGEV_THREAD_ID option is used by glibc's implementation of SIGEV_THREAD
> and thus can be assumed to have stable semantics as well. So we can use
> it without fear.  (The only tricky part is getting the Linux-facing
> thread ID, for which we need gettid() even if glibc is too old to
> provide it.)
> 
> To avoid excessive expense doing all this madness on every proxy call,
> we set up the actual timer in advance, as the monitor thread starts, and
> only arm it when the proxy_call() happens.  We disarm it twice: in
> proxy_call() in case the signal hit when the monitor thread wasn't in
> waitpid() at all, and in the waitpid_lock_hook so that if it *is* in
> waitpid(), it is disarmed immediately after waitpid() returns and
> doesn't keep hitting us with pointless signals after they are no longer
> needed.
> 
> We set the interval to 1ms as a reasonable latency on proxy calls when
> we are unlucky enough to hit this race window, and an initial value of
> 1ms so that we don't take the cost of firing the timer at all in the
> vast majority of cases.  (Any additional CPU load incurred by this
> change is in the noise even summed over hundreds of runs. It adds three
> syscalls per proxy call, which given the sheer overhead of doing
> anything at all with ptrace() and the fact that doing a proxy call
> already involves cross-thread condvar operations and lock manipulation
> is not going to be noticeable.)
> 
> The timer should only ever fire briefly, but if anything goes
> badly-enough wrong the timer might keep firing for some time. If the
> timer *does* fire repeatedly, no harm is done: each firing has no effect
> other than to set waitpid_interrupted and force some syscall to return
> -EINTR.  Hopefully we've surrounded all blocking syscalls this might
> interrupt with an EINTR loop... and if not, that's a bug that needs
> fixing regardless.
> 
> (An audit of relevant bits of dt_proc and libproc is underway. I expect
> a few changes, but not very many.)
> 
> Signed-off-by: Nick Alcock <nick.alcock at oracle.com>

Reviewed-by: Kris Van Hees <kris.van.hees at oracle.com>

... assuming you fix a tiny nit mentioned below...

> ---
>  Makeconfig                        |  1 +
>  include/port.h                    |  4 ++
>  libdtrace/dt_proc.c               | 80 +++++++++++++++++++++++++++++--
>  libdtrace/dt_proc.h               |  3 ++
>  libport/Build                     |  8 ++--
>  libport/gettid.c                  | 24 ++++++++++
>  test/unittest/pid/tst.manypids.sh |  2 +-
>  7 files changed, 115 insertions(+), 7 deletions(-)
>  create mode 100644 libport/gettid.c
> 
> diff --git a/Makeconfig b/Makeconfig
> index 8b9bcda2d22f4..768ca9c95ea47 100644
> --- a/Makeconfig
> +++ b/Makeconfig
> @@ -102,3 +102,4 @@ $(eval $(call check-symbol-rule,LIBFUSE3,fuse_session_receive_buf,fuse3))
>  endif
>  $(eval $(call check-header-rule,FUSE_NUMA,fuse_set_numa,fuse/fuse_lowlevel))
>  $(eval $(call check-header-symbol-rule,CLOSE_RANGE,close_range(3,~0U,0),c,unistd))
> +$(eval $(call check-header-rule,GETTID,gettid,unistd))
> diff --git a/include/port.h b/include/port.h
> index 1d8ec97c4ab77..4b87f4adf91c3 100644
> --- a/include/port.h
> +++ b/include/port.h
> @@ -44,6 +44,10 @@ unsigned long linux_version_code(void);
>  int close_range(unsigned int first, unsigned int last, unsigned int flags);
>  #endif
>  
> +#ifndef HAVE_GETTID
> +pid_t gettid(void);
> +#endif
> +
>  /*
>   * New open() flags not supported in OL6 glibc.
>   */
> diff --git a/libdtrace/dt_proc.c b/libdtrace/dt_proc.c
> index 75112ff0c4db0..a3298fdbc68f2 100644
> --- a/libdtrace/dt_proc.c
> +++ b/libdtrace/dt_proc.c
> @@ -586,12 +586,27 @@ dt_proc_error(dtrace_hdl_t *dtp, dt_proc_t *dpr, const char *format, ...)
>   * the control thread are related to the child process, and because some calls
>   * to the child process are themselves involved in the implementation of the
>   * exec-retry protocol.)
> + *
> + * The actual call involves
> + *  - write a message down the proxy pipe.
> + *  - hit the control thread with the dt_proc_signal, which will wake up
> + *    the waitpid() in Pwait() if it's waiting there and force an early exit to
> + *    check the proxy pipe, and will set the thread-local waitpid_interrupted
> + *    variable wich Pwait() checks before it enters waitpid().
> + *  - arm a timer which repeatedly hits the control thread with this signal;
> + *    it is disarmed by the control thread.
> + *  - wait for the dpr_proxy_rq to be reset and the dpr_msg_cv to be
> + *    signalled, indciating that the request is done.
> + *  - if an exec() has happened, jump out to the dpr_proxy_exec_retry
> + *    pad, which will attempt to reattach to the new process.
>   */
>  
>  static long
>  proxy_call(dt_proc_t *dpr, long (*proxy_rq)(), int exec_retry)
>  {
>  	char junk = '\0'; /* unimportant */
> +	struct itimerspec pinger = {0};
> +	struct itimerspec nonpinger = {0};
>  
>  	dpr->dpr_proxy_rq = proxy_rq;
>  
> @@ -613,10 +628,40 @@ proxy_call(dt_proc_t *dpr, long (*proxy_rq)(), int exec_retry)
>  	}
>  	pthread_kill(dpr->dpr_tid, dpr->dpr_hdl->dt_proc_signal);
>  
> +	/*
> +	 * This timer's sole purpose is to hit the control thread with a signal
> +	 * if we are unlucky enough for the initial signal to strike in the gap
> +	 * between checking if the signal has hit and entering the waitpid().
> +	 * If this race hits, the proxy call latency will be at least as great
> +	 * as the interval of the timer, so it shouldn't be too long: but the
> +	 * value isn't that important otherwise.  (If it's too short, it'll
> +	 * waste time delivering useless signals.)
> +	 *
> +	 * Because this is only solving a rare race, if it can't be armed it's
> +	 * not too serious a problem, and we can more or less just keep going.
> +	 */
> +	pinger.it_value.tv_nsec = 1000000; /* arbitrary, not too long: 1ms */
> +	pinger.it_interval.tv_nsec = 1000000;
> +	if (timer_settime(dpr->dpr_proxy_timer, 0, &pinger, NULL) < 0)
> +		dt_proc_error(dpr->dpr_hdl, dpr,
> +			      "Cannot create fallback wakeup timer: %s\n",
> +			      strerror(errno));
> +
>  	while (dpr->dpr_proxy_rq != NULL)
>  		pthread_cond_wait(&dpr->dpr_msg_cv, &dpr->dpr_lock);
>  
> -	dpr->dpr_lock_holder = pthread_self();
> +	/*
> +	 * Disarm the timer again.  This is also done from
> +	 * dt_proc_waitpid_lock() so that the signal stops as soon as the
> +	 * waitpid() is done: but if the control thread was not waiting at
> +	 * waitpid() at all, we'll want to disarm it regardless.
> +	 */
> +	if (timer_settime(dpr->dpr_proxy_timer, 0, &nonpinger, NULL) < 0)
> +		dt_proc_error(dpr->dpr_hdl, dpr,
> +			      "Cannot disarm fallback wakeup timer: %s\n",
> +			      strerror(errno));
> +
> +        dpr->dpr_lock_holder = pthread_self();

Whitespace messed up (spaces vs tab).

>  
>  	if (!dpr->dpr_done && exec_retry && dpr->dpr_proxy_exec_retry &&
>  	    *unwinder_pad)
> @@ -816,6 +861,7 @@ dt_proc_control(void *arg)
>  	dt_proc_control_data_t * volatile datap = arg;
>  	dtrace_hdl_t * volatile dtp = datap->dpcd_hdl;
>  	dt_proc_t * volatile dpr = datap->dpcd_proc;
> +	struct sigevent sev = {0};
>  	int err;
>  	jmp_buf exec_jmp;
>  
> @@ -842,8 +888,22 @@ dt_proc_control(void *arg)
>  	 */
>  	dt_proc_lock(dpr);
>  
> +	/*
> +	 * Set up the machinery to allow the proxy thread to make requests of
> +	 * us: two ends of a pipe and one timer to signal this thread with the
> +	 * dt_proc_signal.  The timer is not yet armed.
> +	 */
> +
>  	dpr->dpr_proxy_fd[0] = datap->dpcd_proxy_fd[0];
>  	dpr->dpr_proxy_fd[1] = datap->dpcd_proxy_fd[1];
> +	sev.sigev_notify = SIGEV_SIGNAL | SIGEV_THREAD_ID;
> +	sev.sigev_signo = dtp->dt_proc_signal;
> +	sev._sigev_un._tid = gettid();
> +	if (timer_create(CLOCK_MONOTONIC, &sev, &dpr->dpr_proxy_timer) < 0) {
> +		dt_proc_error(dtp, dpr, "failed to arm proxy timer for %i: %s\n",
> +			      dpr->dpr_pid, strerror(errno));
> +		pthread_exit(NULL);
> +	}
>  
>  	/*
>  	 * Either create the process, or grab it.  Whichever, on failure, quit
> @@ -1351,6 +1411,7 @@ dt_proc_control_cleanup(void *arg)
>  	dpr->dpr_proxy_errno = ESRCH;
>  	dpr->dpr_proxy_rq = NULL;
>  	pthread_cond_signal(&dpr->dpr_msg_cv);
> +	timer_delete(dpr->dpr_proxy_timer);
>  
>  	/*
>  	 * Death-notification queueing is complicated by the fact that we might
> @@ -2087,8 +2148,21 @@ dt_proc_waitpid_lock(struct ps_prochandle *P, void *arg, int waitpidding)
>  
>  	if (waitpidding)
>  		dt_proc_unlock(dpr);
> -	else
> -		dt_proc_lock(dpr);
> +	else {
> +		struct itimerspec nonpinger = {0};
> +
> +                /*
> +		 * A waitpid() is done.  Disarm the signal-pinging timer
> +		 * immediately: the waitpid() has woken up, so its job is done.
> +		 */
> +
> +		if (timer_settime(dpr->dpr_proxy_timer, 0, &nonpinger, NULL) < 0)
> +			dt_proc_error(dpr->dpr_hdl, dpr,
> +				      "Cannot disarm fallback wakeup timer: %s\n",
> +				      strerror(errno));
> +
> +                dt_proc_lock(dpr);
> +	}
>  }
>  
>  /*
> diff --git a/libdtrace/dt_proc.h b/libdtrace/dt_proc.h
> index 90ab6c9c8d4c1..1ca1aeb6f797a 100644
> --- a/libdtrace/dt_proc.h
> +++ b/libdtrace/dt_proc.h
> @@ -8,6 +8,8 @@
>  #ifndef	_DT_PROC_H
>  #define	_DT_PROC_H
>  
> +#include <signal.h>
> +#include <time.h>
>  #include <rtld_db.h>
>  #include <libproc.h>
>  #include <dtrace.h>
> @@ -35,6 +37,7 @@ typedef struct dt_proc {
>  	pthread_t dpr_tid;		/* control thread (or zero if none) */
>  	pid_t dpr_pid;			/* pid of process */
>  	int dpr_proxy_fd[2];		/* proxy request pipe from main thread */
> +	timer_t dpr_proxy_timer;	/* control-thread-signalling timer */
>  	uint_t dpr_refs;		/* reference count */
>  	uint8_t dpr_stop;		/* stop mask: see flag bits below */
>  	uint8_t dpr_done;		/* done flag: ctl thread has exited */
> diff --git a/libport/Build b/libport/Build
> index e043a27efa5b7..48ed99e2b0170 100644
> --- a/libport/Build
> +++ b/libport/Build
> @@ -8,10 +8,12 @@ LIBS += libport
>  
>  libport_TARGET = libport
>  libport_DIR := $(current-dir)
> -ifdef HAVE_CLOSE_RANGE
>  libport_SOURCES = gmatch.c linux_version_code.c strlcat.c strlcpy.c p_online.c time.c daemonize.c
> -else
> -libport_SOURCES = gmatch.c linux_version_code.c strlcat.c strlcpy.c p_online.c time.c daemonize.c close_range.c
> +ifndef HAVE_CLOSE_RANGE
> +libport_SOURCES += close_range.c
> +endif
> +ifndef HAVE_GETTID
> +libport_SOURCES += gettid.c
>  endif
>  libport_LIBSOURCES := libport
>  libport_CPPFLAGS := -Ilibdtrace
> diff --git a/libport/gettid.c b/libport/gettid.c
> new file mode 100644
> index 0000000000000..cca785027ef61
> --- /dev/null
> +++ b/libport/gettid.c
> @@ -0,0 +1,24 @@
> +/*
> + * Oracle Linux DTrace; get the kernel-level ID of the current thread.
> + * Copyright (c) 2023, Oracle and/or its affiliates. All rights reserved.
> + * Licensed under the Universal Permissive License v 1.0 as shown at
> + * http://oss.oracle.com/licenses/upl.
> + */
> +
> +#include <sys/syscall.h>
> +#include <unistd.h>
> +
> +/*
> + * This is only used if glibc does not provide an implementation.
> + *
> + * The kernel always has an implementation of this function.
> + */
> +
> +pid_t gettid(void)
> +{
> +#ifdef __NR_gettid
> +	return syscall(__NR_gettid);
> +#else
> +#error The kernel does not define gettid(), and glibc is too old to implement it.
> +#endif
> +}
> diff --git a/test/unittest/pid/tst.manypids.sh b/test/unittest/pid/tst.manypids.sh
> index 8f853c0f5bf25..b8ab152ca402a 100755
> --- a/test/unittest/pid/tst.manypids.sh
> +++ b/test/unittest/pid/tst.manypids.sh
> @@ -15,7 +15,7 @@ dtrace=$1
>  
>  declare -a pids
>  
> -for lib in `ls -1 /lib/lib*.so.1 | grep -v ld.so.1`; do
> +for lib in `ls -1 /lib64/lib*.so.1 | grep -v libthread_db.so.1`; do
>  	preload=$lib:${preload}
>  done
>  
> -- 
> 2.42.0.271.g85384428f1
> 
> 



More information about the DTrace-devel mailing list