[DTrace-devel] [PATCH v2 08/23] dtprobed: add the DOF stash

Nick Alcock nick.alcock at oracle.com
Mon Nov 27 16:47:14 UTC 2023


The DOF stash is a runtime directory (/run/dtrace/stash by default) in which
the dtprobed daemon keeps track of incoming DOF.  There's a comment
describing it at the top of dtprobed/dof_stash.c, but in brief we have two
directories under /run/dtrace/stash:

- dof/, which contains raw DOF named by mapping
- dof-pid/, which tracks DOF on a PID-by-PID and mapping-by-mapping basis.
  This contains subdirectories $pid/$mapping containing both the same raw
  DOF (hardlinked) and parsed DOF split up by probe, in
  $pid/$mapping/parsed/$spec.
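
For illustration only (not code from this patch): the hardlinks double as a
reference count. The comment at the top of dof_stash.c treats raw DOF in dof/
as dead once its link count falls back to 1, i.e. once no per-PID "raw" link
remains. A minimal sketch of that check, assuming the hypothetical helper
names link_count_at() and dof_is_dead():

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Number of names referring to a file, or -1 if it cannot be examined.
 * (Hypothetical helper, not part of dtprobed.)
 */
static long
link_count_at(int dirfd, const char *name)
{
	struct stat s;

	if (fstatat(dirfd, name, &s, AT_SYMLINK_NOFOLLOW) < 0)
		return -1;
	return (long) s.st_nlink;
}

/*
 * True if only the name in dof/ is left, so no PID references the DOF.
 * (Hypothetical helper, not part of dtprobed.)
 */
static int
dof_is_dead(int dof_dirfd, const char *name)
{
	return link_count_at(dof_dirfd, name) == 1;
}
```

Counting links this way means no separate refcount file has to be kept
consistent across daemon restarts: the filesystem maintains it.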

To aid consumer globbing of probes we also put the DOF for each individual
probe in /run/dtrace/probes/$pid/$prv$pid/$mod/$fun/$prb; in future we may
introduce further symlinks not in per-pid directories to ease matching of
probes across processes.
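
As a rough illustration of the naming scheme (again, not code from this
patch), a consumer could compose the per-probe path as below. probe_path() is
a hypothetical helper, and the state directory is passed in rather than
assumed to be /run/dtrace:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Format $statedir/probes/$pid/$prv$pid/$mod/$fun/$prb into buf.
 * Returns 0 on success, -1 if the result does not fit.
 * (Hypothetical helper, not part of dtprobed or libdtrace.)
 */
static int
probe_path(char *buf, size_t len, const char *statedir, pid_t pid,
	   const char *prv, const char *mod, const char *fun, const char *prb)
{
	int n = snprintf(buf, len, "%s/probes/%li/%s%li/%s/%s/%s", statedir,
			 (long) pid, prv, (long) pid, mod, fun, prb);

	return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```

Because every component is its own directory level, a glob such as
probes/1234/*/*/main/* matches on any field without parsing file contents.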

Consumers can use inotify watches to detect newly added or removed DOF or
probes: because all the DOF is in one directory, a very small number of
watches is needed to keep track of all the DOF in the system, and only one
watch per PID is needed to track it on a PID-by-PID basis. (Currently,
nothing is doing this, but it is now a possibility where it never was
before.)
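
Since nothing does this yet, the following is only a sketch of what such a
consumer might look like, using the standard Linux inotify calls. watch_dir()
and next_event() are hypothetical names, and which directory to watch is left
to the caller:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Start watching one directory for entries appearing or vanishing. */
static int
watch_dir(const char *path)
{
	int ifd = inotify_init1(IN_CLOEXEC);

	if (ifd < 0)
		return -1;
	if (inotify_add_watch(ifd, path, IN_CREATE | IN_DELETE) < 0) {
		close(ifd);
		return -1;
	}
	return ifd;
}

/*
 * Block until an event arrives and report the first one in the buffer.
 * Any further events returned by the same read() are dropped: a real
 * consumer would walk the whole buffer.
 */
static int
next_event(int ifd, char *name, size_t len, uint32_t *mask)
{
	char buf[sizeof(struct inotify_event) + NAME_MAX + 1]
		__attribute__((aligned(__alignof__(struct inotify_event))));
	const struct inotify_event *ev = (const struct inotify_event *) buf;
	ssize_t n = read(ifd, buf, sizeof(buf));

	if (n < (ssize_t) sizeof(*ev) || ev->len == 0)
		return -1;
	snprintf(name, len, "%s", ev->name);
	*mask = ev->mask;
	return 0;
}
```

One watch on the probes directory would report PIDs coming and going; one
more per PID directory would report that process's providers.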

In a future commit, dtprobed will emit all the DOF it receives into this
structure, removing it when processes exit, and DTrace will use it to create
and remove probes.

dof_stash.c itself provides functions to add new raw DOF to the stash, to
add a stream of parsed DOF corresponding to a given piece of raw DOF to the
stash, to remove DOF as processes issue DTRACEHIOC_REMOVE requests, to check
for any DOF that needs reparsing to handle changes to struct dof_parsed, and
to remove entries from the DOF stash that are left behind by processes that
died uncleanly without issuing DTRACEHIOC_REMOVEs. (Currently, we don't deal
with newly-started processes that reuse the same PID as such a process, or
processes that leave DOF behind on exec() and then add more: this
improvement will come soon.)

It does not do any DOF parsing: that is left up to the caller (dtprobed.c).

All the writes to the dof_stash are done via the *at() functions to avoid
any risk of bugs causing the file creations and most especially deletions it
carries out from leaking outside /run/dtrace.  (We are not concerned about
symlink attacks etc because only root can write to /run/dtrace or any
directories underneath it.)
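
A minimal sketch of the pattern, not the code in this patch: a directory is
opened once as an anchor, and every later creation or deletion names a single
path component relative to that fd. open_anchor() and put_at() are
hypothetical names:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open a directory purely as an anchor for later *at() calls. */
static int
open_anchor(const char *path)
{
	return open(path, O_PATH | O_DIRECTORY | O_CLOEXEC);
}

/*
 * Create a new file relative to the anchor and fill it.  Fails if the
 * name already exists; removes the file again if the write falls short.
 */
static int
put_at(int dirfd, const char *name, const void *buf, size_t len)
{
	int fd = openat(dirfd, name, O_CREAT | O_EXCL | O_WRONLY, 0644);

	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t) len) {
		close(fd);
		unlinkat(dirfd, name, 0);
		return -1;
	}
	return close(fd);
}
```

With no absolute paths being assembled, a miscomputed name can at worst touch
the wrong entry inside the anchored directory, provided the name itself is
not "..", which is why dof_stash.c rejects "." and ".." as components.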

Signed-off-by: Nick Alcock <nick.alcock at oracle.com>
---
 dtprobed/Build            |    2 +-
 dtprobed/dof_stash.c      | 1621 +++++++++++++++++++++++++++++++++++++
 dtprobed/dof_stash.h      |   44 +
 dtprobed/dtprobed.service |    2 +
 dtrace.spec               |    4 +
 libcommon/dof_parser.h    |   20 +-
 6 files changed, 1686 insertions(+), 7 deletions(-)
 create mode 100644 dtprobed/dof_stash.c
 create mode 100644 dtprobed/dof_stash.h

diff --git a/dtprobed/Build b/dtprobed/Build
index c44dd40ab3d2..f1e0dfff456f 100644
--- a/dtprobed/Build
+++ b/dtprobed/Build
@@ -26,7 +26,7 @@ dtprobed_CPPFLAGS := -I. -Idtprobed -Ilibproc -Ilibcommon -Ilibport
 dtprobed_CFLAGS := $(shell pkg-config --cflags $(LIBFUSE))
 dtprobed_LIBS := -lcommon -lproc -lcommon -lport -lelf $(shell pkg-config --libs $(LIBFUSE))
 dtprobed_DEPS := libproc.a libcommon.a libport.a
-dtprobed_SOURCES := dtprobed.c seccomp-assistance.c
+dtprobed_SOURCES := dtprobed.c dof_stash.c seccomp-assistance.c
 dtprobed_LIBSOURCES := libproc libcommon
 
 ifdef HAVE_LIBSYSTEMD
diff --git a/dtprobed/dof_stash.c b/dtprobed/dof_stash.c
new file mode 100644
index 000000000000..a3a7af163398
--- /dev/null
+++ b/dtprobed/dof_stash.c
@@ -0,0 +1,1621 @@
+/*
+ * Oracle Linux DTrace; DOF storage for later probe removal.
+ * Copyright (c) 2023, Oracle and/or its affiliates. All rights reserved.
+ * Licensed under the Universal Permissive License v 1.0 as shown at
+ * http://oss.oracle.com/licenses/upl.
+ *
+ * The DOF stash is the principal state preservation mechanism used by dtprobed.
+ * Keeping DOF on disk keeps memory management complexity low, and makes it
+ * possible for dtprobed to track DOF across daemon restarts and upgrades.
+ *
+ * The state storage structure is as follows:
+ *
+ * /run/dtrace/stash: Things private to dtprobed.
+ *
+ * /run/dtrace/probes: Per-probe info, written by dtprobed, read by DTrace.
+ *
+ * /run/dtrace/stash/dof/$dev-$ino: DOF contributed by particular mappings,
+ * in raw form (as received from the client).
+ *
+ * /run/dtrace/stash/dof-pid/$pid/$dev-$ino/raw: hardlink to the DOF for a given
+ * PID. Pruned of dead processes at startup and on occasion: entries also
+ * deleted on receipt of DTRACEHIOC_REMOVE ioctls. The principal effect of this
+ * is to bump the link count for the corresponding DOF in the dof/ directory:
+ * when this falls to 1, the DOF is considered dead and the corresponding probe
+ * is removed.
+ *
+ * /run/dtrace/stash/dof-pid/$pid/$dev-$ino/dh: Serialized form of the
+ * dof_helper_t received from a given client. Preserved for all
+ * DTRACEHIOC_ADDDOFs, even if the DOF was already contributed by some other
+ * process.  Keeping this around allows the DOF to be re-parsed as needed, even
+ * by a later re-execution of a different version of the same daemon with a
+ * different dof_parsed_t structure, and to be re-parsed with appropriate
+ * addresses for whichever mapping of this DOF is under consideration.
+ *
+ * /run/dtrace/stash/dof-pid/$pid/$dev-$ino/parsed: parsed DOF for a given PID.
+ * Generated from the helper for PIDs that are the first appearance of a given
+ * piece of DOF and used for the common case where that DOF has only one mapping
+ * until its deletion (short-lived processes, mostly).  Deleted at daemon
+ * startup.
+ *
+ * /run/dtrace/stash/dof-pid/$pid/$gen: symlink to a $dev-$ino dir. "$gen" is an
+ * incrementing generation counter starting at zero for every unique PID, used
+ * by clients to uniquely identify DOF pieces to remove.
+ *
+ * /run/dtrace/stash/dof-pid/$pid/next-gen: The next generation counter to use
+ * for a new piece of DOF from this process.  Encoded as a sparse file whose
+ * size is the next counter value to use.
+ *
+ * Per-probe info:
+ *
+ * /run/dtrace/probes/$pid/$prv$pid/$mod/$fun/$prb: Parsed representation of one
+ * probe in a given process. Removed by dtprobed when the process dies, or if
+ * all mappings containing the probe are unmapped.  Used by DTrace for tracing
+ * by PID.  Hardlink to...
+ *
+ * /run/dtrace/stash/dof-pid/$pid/$dev-$ino/parsed/$prv$pid:$mod:$fun:$prb:
+ * parsed DOF for a given probe for a given PID.  Generated from the helper for
+ * PIDs that are the first appearance of a given piece of DOF and used for the
+ * common case where that DOF has only one mapping until its deletion
+ * (short-lived processes, mostly).  Deleted at daemon startup if the
+ * dof_parsed_t structure changed.  The file consists of a uint64_t version
+ * number (DOF_PARSED_VERSION), a uint64_t count of tracepoints, and then
+ * that many struct dof_parsed_t's of type DIT_TRACEPOINT.
+ *
+ * In future, symlinks in /run/dtrace/probes/$prv$pid may exist, used for
+ * globbing provider names across processes.
+ */
+
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <assert.h>
+#include "dof_stash.h"
+
+#include <dt_list.h>
+
+#ifdef HAVE_FUSE_LOG
+#include <fuse_log.h>
+#else
+#include "rpl_fuse_log.h"
+#endif
+
+#include <port.h>
+
+static int dof_dir, pid_dir, probes_dir;
+
+/*
+ * Like close(), but only try to close non-negative fds.
+ */
+static int
+maybe_close(int fd)
+{
+	if (fd > -1)
+		return close(fd);
+	return 0;
+}
+
+/*
+ * Like mkdirat(), but already-existing directories are considered
+ * successful.
+ */
+static int
+mkexistingdirat(int dirfd, const char *name, mode_t mode)
+{
+	if (mkdirat(dirfd, name, mode) < 0) {
+		if (errno == EEXIST) {
+			struct stat s;
+
+			if (fstatat(dirfd, name, &s, 0) == 0 &&
+			    S_ISDIR(s.st_mode))
+				return 0;
+			errno = EEXIST;
+		}
+		return -1;
+	}
+	return 0;
+}
+
+/*
+ * Make a runtime state directory and return its fd.
+ */
+static int
+make_state_dir(const char *dir)
+{
+	int fd;
+
+        if (mkdir(dir, 0755) < 0 && errno != EEXIST) {
+		fuse_log(FUSE_LOG_ERR, "dtprobed: "
+			 "cannot create state directory %s: %s\n",
+			 dir, strerror(errno));
+		return -1;
+	}
+
+	if ((fd = open(dir, O_PATH | O_CLOEXEC)) < 0) {
+		fuse_log(FUSE_LOG_ERR, "dtprobed: "
+			 "cannot open state directory %s: %s\n",
+			 dir, strerror(errno));
+		return -1;
+	}
+	return fd;
+}
+
+/*
+ * Make a state directory under some other directory and return an fd to it.
+ *
+ * If "readable" is set, the fd is opened with O_RDONLY rather than O_PATH.
+ */
+static int
+make_state_dirat(int dirfd, const char *name, const char *errmsg, int readable)
+{
+	int fd;
+	int flags = O_PATH;
+
+	if (readable)
+		flags = O_RDONLY;
+
+	if (mkexistingdirat(dirfd, name, 0755) < 0) {
+		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot create %s dir named %s: %s\n",
+			 errmsg, name, strerror(errno));
+		return -1;
+	}
+
+	if ((fd = openat(dirfd, name, flags | O_CLOEXEC)) < 0) {
+		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot open %s dir named %s: %s\n",
+			 errmsg, name, strerror(errno));
+		return -1;
+	}
+	return fd;
+}
+
+/*
+ * Initialize the DOF-stashing machinery, if needed.
+ */
+int
+dof_stash_init(const char *statedir)
+{
+	int root;
+	int stash;
+
+	if (statedir == NULL)
+		statedir="/run/dtrace";
+
+        if ((root = make_state_dir(statedir)) < 0)
+		return -1;
+
+        if ((stash = make_state_dirat(root, "stash", "statedir", 0)) < 0)
+		return -1;
+
+	if ((dof_dir = make_state_dirat(stash, "dof", "DOF", 0)) < 0)
+		return -1;
+        if ((pid_dir = make_state_dirat(stash, "dof-pid", "per-pid DOF", 1)) < 0)
+		return -1;
+
+        if ((probes_dir = make_state_dirat(root, "probes", "probesdir", 1)) < 0)
+		return -1;
+
+        close(root);
+	close(stash);
+
+	return 0;
+}
+
+/*
+ * Compose a provider's name from pieces.
+ */
+static char *
+make_provider_name(const char *prov, pid_t pid)
+{
+	char *ret;
+
+        if (asprintf(&ret, "%s%li", prov, (long) pid) < 0) {
+		fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory making provider name\n");
+		return NULL;
+	}
+	return ret;
+}
+
+/*
+ * Compose a full probespec's name from pieces.
+ */
+static char *
+make_probespec_name(const char *prov, const char *mod, const char *fn,
+		    const char *prb)
+{
+	char *ret;
+
+        if (asprintf(&ret, "%s:%s:%s:%s", prov, mod, fn, prb) < 0) {
+		fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory making probespec\n");
+		return NULL;
+	}
+	return ret;
+}
+
+/*
+ * Split the pieces back up again.  The input spec becomes the provider name.
+ * The pieces' lifetime is no longer than that of the input string.
+ */
+static int
+split_probespec_name(char *spec, const char **mod, const char **fn,
+		     const char **prb)
+{
+	char *tmp;
+
+	tmp = strchr(spec, ':');
+	if (tmp == NULL)
+		return -1;
+	*tmp = 0;
+	*mod = tmp + 1;
+
+	tmp = strchr(*mod, ':');
+	if (tmp == NULL)
+		return -1;
+	*tmp = 0;
+	*fn = tmp + 1;
+
+	tmp = strchr(*fn, ':');
+	if (tmp == NULL)
+		return -1;
+	*tmp = 0;
+	*prb = tmp + 1;
+
+	return 0;
+}
+
+/*
+ * Compose a DOF directory's name from pieces.
+ */
+static char *
+make_dof_name(dev_t dev, ino_t ino)
+{
+	char *ret;
+
+        if (asprintf(&ret, "%li-%li", (long) dev, (long) ino) < 0) {
+		fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory making DOF name\n");
+		return NULL;
+	}
+	return ret;
+}
+
+/*
+ * Split the pieces back up again.
+ */
+static int
+split_dof_name(const char *dof_name, dev_t *dev, ino_t *ino)
+{
+	if (sscanf(dof_name, "%li-%li", (long *) dev, (long *) ino) < 2) {
+		fuse_log(FUSE_LOG_ERR, "dtprobed: per-pid directory name %s unparseable\n",
+			 dof_name);
+		return -1;
+	}
+	return 0;
+}
+
+/*
+ * A trivial wrapper.
+ */
+static char *
+make_numeric_name(long num)
+{
+	char *ret;
+
+        if (asprintf(&ret, "%li", num) < 0) {
+		fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory making filename\n");
+		return NULL;
+	}
+	return ret;
+}
+
+
+/*
+ * Push a piece of parsed DOF into a list to be emitted by
+ * dof_stash_write_parsed().
+ */
+int
+dof_stash_push_parsed(dt_list_t *accum, dof_parsed_t *parsed)
+{
+	dof_parsed_list_t *entry;
+
+	if ((entry = calloc(1, sizeof(dof_parsed_list_t))) == NULL)
+		return -1;
+
+	entry->parsed = parsed;
+        dt_list_append(accum, entry);
+
+        return 0;
+}
+
+/*
+ * Free all the accumulated parsed DOF in the passed list.
+ */
+void
+dof_stash_flush(dt_list_t *accum)
+{
+	dof_parsed_list_t *accump;
+	dof_parsed_list_t *last_accump = NULL;
+
+	for (accump = dt_list_next(accum); accump != NULL;
+	     accump = dt_list_next(accump)) {
+		dt_list_delete(accum, accump);
+		free(accump->parsed);
+		free(last_accump);
+		last_accump = accump;
+	}
+	free(last_accump);
+}
+
+/*
+ * Write a chunk of data to an fd.
+ */
+static int
+write_chunk(int fd, const void *buf, size_t size)
+{
+	char *bufptr = (char *) buf;
+
+	while (size > 0) {
+		ssize_t len = write(fd, bufptr, size);
+		if (len < 0 && errno != EINTR)
+			return -1;
+		if (len < 0)
+			continue;	/* EINTR: retry without advancing */
+		size -= len;
+		bufptr += len;
+	}
+	return 0;
+}
+
+/*
+ * Write out a piece of raw DOF.  Returns the length of the file written,
+ * or 0 if none was needed (or -1 on error).
+ */
+static int
+dof_stash_write(int dirfd, const char *name, const void *buf, size_t size)
+{
+	struct stat s;
+	int fd;
+
+	/*
+	 * Sanity check: if the DOF already exists but is not the same size
+	 * as the DOF we were just given, complain, and replace it and its
+	 * parsed equivalent.  If it exists at the same size, skip the write.
+	 *
+	 * If we can't even unlink it or write it out, we give up -- the stash
+	 * has failed and we won't be able to do anything it implies.
+	 *
+	 * (This is only a crude check -- obviously distinct raw DOF could be
+	 * the same size by pure chance.)
+	 */
+	if (fstatat(dirfd, name, &s, 0) == 0) {
+		if (s.st_size == size)
+			return 0;
+
+		fuse_log(FUSE_LOG_ERR, "dtprobed: DOF %s already exists, "
+			 "but is %zx bytes long, not %zx: replacing\n",
+			 name, (size_t) s.st_size, size);
+		if (unlinkat(dirfd, name, 0) < 0) {
+			fuse_log(FUSE_LOG_ERR, "dtprobed: cannot remove old DOF %s: %s\n",
+				 name, strerror(errno));
+			return -1;
+		}
+	}
+
+	if ((fd = openat(dirfd, name, O_CREAT | O_EXCL | O_WRONLY, 0644)) < 0) {
+		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot write out raw DOF: %s\n",
+			 strerror(errno));
+		return -1;
+	}
+
+	if (write_chunk(fd, buf, size) < 0)
+		goto err;
+
+	if (close(fd) < 0)
+		goto err;
+	return size;
+
+  err:
+	fuse_log(FUSE_LOG_ERR, "dtprobed: cannot write out DOF: %s\n",
+		 strerror(errno));
+	unlinkat(dirfd, name, 0);
+	close(fd);
+	return -1;
+}
+
+/*
+ * Write out all the parsed DOF from the passed-in accumulator.  Split out from
+ * dof_stash_add() because we don't want to create anything in the DOF stash for
+ * a given piece of DOF until we have verified that all the parsed data for all
+ * the probes in that DOF is valid.
+ *
+ * Each parsed file starts with a single 64-bit word giving the version of
+ * dof_parsed_t in use, to ensure that readers never use an incompatible
+ * version, and a single uint64_t giving the number of tracepoints following,
+ * each of which is a struct dof_parsed_t. The name of the probe is not given,
+ * since the filename (or, for the probes/ dirs, the file and directory names)
+ * already contains it.
+ *
+ * We also write the parsed version out to the parsed/version file in the parsed
+ * mapping dir, to make it easier to determine whether to reparse all the DOF
+ * for outdatedness.
+ */
+int
+dof_stash_write_parsed(pid_t pid, dev_t dev, ino_t ino, dt_list_t *accum)
+{
+	char *pid_name = make_numeric_name(pid);
+	char *dof_name = make_dof_name(dev, ino);
+	char *provpid_name = NULL;
+	int perpid_dir, perpid_dof_dir, probepid_dir, parsed_mapping_dir,
+		prov_dir = -1, mod_dir = -1, fun_dir = -1;
+        dof_parsed_list_t *accump;
+	int version_fd, fd = -1;
+	const char *op;
+	const char *probe_err = "unknown probe";
+	uint64_t version = DOF_PARSED_VERSION;
+	int state = -1;
+	int err;
+
+	if (dt_list_next(accum) == NULL) {
+		free(pid_name);
+		free(dof_name);
+		return 0;
+	}
+
+        if ((perpid_dir = openat(pid_dir, pid_name, O_PATH | O_CLOEXEC)) < 0) {
+		fuse_log(FUSE_LOG_ERR, "dtprobed: PID %i, %lx/%lx: open failed for PID dir for parsed DOF write: %s\n",
+			 pid, dev, ino, strerror(errno));
+		goto early_err;
+	}
+
+	op = "per-mapping open";
+        if ((perpid_dof_dir = openat(perpid_dir, dof_name, O_PATH | O_CLOEXEC)) < 0)
+		goto close_perpid_err;
+
+	op = "per-mapping parsed DOF mkdirat";
+	if ((parsed_mapping_dir = make_state_dirat(perpid_dof_dir, "parsed", op, 0)) < 0)
+		goto close_perpid_dof_err;
+
+	op = "probe PID mkdirat";
+	if ((probepid_dir = make_state_dirat(probes_dir, pid_name, op, 0)) < 0)
+		goto remove_parsed_mapping_err;
+
+        if ((version_fd = openat(parsed_mapping_dir, "version",
+				 O_CREAT | O_TRUNC | O_WRONLY, 0644)) < 0)
+		goto remove_probepid_dir_err;
+
+	if (write_chunk(version_fd, &version, sizeof(uint64_t)) < 0) {
+		close(version_fd);
+		goto remove_probepid_dir_err;
+	}
+	close(version_fd);
+
+	/*
+	 * We can just use assert() to check for overruns here, because any
+	 * overruns are not corrupted input but a seriously buggy parser.
+	 *
+	 * From this point on, we remove as little as possible on error, just
+	 * a single probe's worth: better some probes survive an error
+	 * condition than none do.
+	 */
+	for (accump = dt_list_next(accum); accump != NULL;
+	     accump = dt_list_next(accump)) {
+		char *mod, *fun, *prb;
+		char *specfn = NULL;
+
+		switch (accump->parsed->type) {
+		/*
+		 * Provider: make new provider dir.
+		 */
+		case DIT_PROVIDER: {
+			state = accump->parsed->type;
+
+			free(provpid_name);
+
+			provpid_name = make_provider_name(accump->parsed->provider.name, pid);
+			probe_err = accump->parsed->provider.name;
+
+			op = "parsed provider close";
+			if (maybe_close(prov_dir) < 0)
+				goto err_provider;
+
+			op = "probe provider mkdirat";
+			if ((prov_dir = make_state_dirat(probepid_dir, provpid_name,
+							 op, 1)) < 0)
+				goto err_provider;
+			break;
+		}
+		/*
+		 * Probe: make (mod, fun, probe) dirs as needed: close
+		 * out file, open new one, make hardlink to it.
+		 */
+		case DIT_PROBE: {
+                        assert(state >= 0);
+			state = accump->parsed->type;
+
+			mod = accump->parsed->probe.name;
+			assert(accump->parsed->size > (mod - (char *) accump->parsed));
+			fun = mod + strlen(mod) + 1;
+			assert(accump->parsed->size > (fun - (char *) accump->parsed));
+			prb = fun + strlen(fun) + 1;
+			assert(accump->parsed->size >= (prb - (char *) accump->parsed));
+
+			op = "probe name construction";
+			probe_err = provpid_name;
+			if ((specfn = make_probespec_name(provpid_name, mod, fun, prb)) == NULL)
+				goto err_provider;
+
+			/*
+			 * Ban "." and ".." as name components.  Obviously names
+			 * containing dots are commonplace (shared libraries,
+			 * for instance), but allowing straight . and .. would
+			 * have obviously horrible consequences.  They can't be
+			 * filenames anyway, and you can't create them with
+			 * dtrace -h because they aren't valid C identifier
+			 * names.
+			 */
+			op = "probe name validation";
+			probe_err = specfn;
+
+                        if (strcmp(accump->parsed->provider.name, ".") == 0 ||
+			    strcmp(accump->parsed->provider.name, "..") == 0 ||
+			    strcmp(mod, ".") == 0 ||
+			    strcmp(mod, "..") == 0 ||
+			    strcmp(fun, ".") == 0 ||
+			    strcmp(fun, "..") == 0 ||
+			    strcmp(prb, ".") == 0 ||
+			    strcmp(prb, "..") == 0)
+				goto err_provider;
+
+			op = "parsed probe close";
+
+			/*
+			 * This rather weird pattern carries out all closes and
+			 * then errors if any of them fail.
+			 */
+                        err = maybe_close(mod_dir);
+			err |= maybe_close(fun_dir);
+			err |= maybe_close(fd);
+
+			if (err != 0)
+				goto err_provider;
+
+			op = "probe module";
+
+			if ((mod_dir = make_state_dirat(prov_dir, mod, op, 0)) < 0)
+				goto err_provider;
+
+			op = "probe fun";
+			if ((fun_dir = make_state_dirat(mod_dir, fun, op, 0)) < 0)
+				goto err_probemod;
+
+			op = "parsed probe creation";
+			if ((fd = openat(parsed_mapping_dir, specfn,
+					 O_CREAT | O_EXCL | O_WRONLY, 0644)) < 0)
+				goto err_probefn;
+
+			op = "parsed probe hardlink";
+			if (linkat(parsed_mapping_dir, specfn, fun_dir, prb, 0) < 0)
+				goto err_parsed_probe;
+
+			op = "parsed version write";
+			if (write_chunk(fd, &version, sizeof(uint64_t)) < 0)
+				goto err_probe_link;
+
+			op = "parsed nprobes write";
+			_Static_assert(sizeof(accump->parsed->probe.ntp) == sizeof(uint64_t),
+				       "chunk write assumes parsed.probe.ntp is a uint64_t");
+                        if (write_chunk(fd, &accump->parsed->probe.ntp,
+					sizeof(accump->parsed->probe.ntp)) < 0)
+				goto err_probe_link;
+			break;
+		}
+
+		/* Tracepoint: add to already-open file.  */
+		case DIT_TRACEPOINT:
+			assert(state == DIT_PROBE || state == DIT_TRACEPOINT);
+			state = accump->parsed->type;
+
+			if (write_chunk(fd, accump->parsed, accump->parsed->size) < 0)
+				goto err_probe_link;
+			break;
+
+		/* Error return from parser.  */
+		case DIT_ERR:
+			op = "parsing";
+			fuse_log(FUSE_LOG_ERR, "dtprobed: parser error: %s\n",
+				 accump->parsed->err.err);
+			goto err_provider;
+
+		default:
+			/*
+			 * This is a parser bug, and a bad one.  Just fail
+			 * hard.
+			 */
+			fuse_log(FUSE_LOG_ERR, "dtprobed: PID %i, %lx/%lx: internal error: corrupt parsed DOF: type %i\n",
+				 pid, dev, ino, accump->parsed->type);
+			assert(0);
+		}
+
+		free(specfn);
+		continue;
+
+	err_probe_link:
+		unlinkat(fun_dir, prb, 0);
+	err_parsed_probe:
+		unlinkat(parsed_mapping_dir, specfn, 0);
+	err_probefn:
+		unlinkat(mod_dir, fun, AT_REMOVEDIR);
+	err_probemod:
+		if (prov_dir >= 0) /* Always true: placate compiler */
+			unlinkat(prov_dir, mod, AT_REMOVEDIR);
+	err_provider:
+		maybe_close(fd);
+		maybe_close(fun_dir);
+		maybe_close(mod_dir);
+		maybe_close(prov_dir);
+		unlinkat(probepid_dir, provpid_name, AT_REMOVEDIR);
+		fuse_log(FUSE_LOG_ERR, "dtprobed: PID %i, %lx/%lx: %s failed for parsed DOF write for %s: %s\n",
+			 pid, dev, ino, op, probe_err, strerror(errno));
+		free(specfn);
+		goto err_free;
+	}
+
+	err = maybe_close(prov_dir);
+	err |= maybe_close(mod_dir);
+	err |= maybe_close(fun_dir);
+	err |= maybe_close(fd);
+
+	if (err != 0) {
+		unlinkat(perpid_dof_dir, "parsed", AT_REMOVEDIR);
+		goto err_free;
+	}
+
+	close(probepid_dir);
+	close(parsed_mapping_dir);
+	close(perpid_dof_dir);
+	close(perpid_dir);
+	free(provpid_name);
+	free(pid_name);
+	free(dof_name);
+	return 0;
+
+	/* Errors requiring freeing and error return but no removals.  */
+err_free:
+	close(probepid_dir);
+	close(parsed_mapping_dir);
+	close(perpid_dof_dir);
+	close(perpid_dir);
+	close(prov_dir);
+	free(provpid_name);
+	free(pid_name);
+	free(dof_name);
+	return -1;
+
+	/*
+	 * Errors before parsed-probe creation starts require wholesale
+	 * removals of stale dirs.
+	 */
+
+remove_probepid_dir_err:
+	close(probepid_dir);
+	unlinkat(probes_dir, pid_name, AT_REMOVEDIR);
+	unlinkat(parsed_mapping_dir, "version", 0);
+remove_parsed_mapping_err:
+	close(parsed_mapping_dir);
+	unlinkat(perpid_dof_dir, "parsed", AT_REMOVEDIR);
+close_perpid_dof_err:
+	close(perpid_dof_dir);
+close_perpid_err:
+	fuse_log(FUSE_LOG_ERR, "dtprobed: PID %i, %lx/%lx: %s failed for parsed DOF write: %s\n",
+		 pid, dev, ino, op, strerror(errno));
+	close(perpid_dir);
+early_err:
+	free(pid_name);
+	free(dof_name);
+	return -1;
+}
+
+/*
+ * Write out the DOF helper.  A trivial wrapper.
+ */
+static
+int dof_stash_write_dh(int dirfd, const dof_helper_t *dh)
+{
+	int fd;
+
+	if ((fd = openat(dirfd, "dh", O_CREAT | O_EXCL | O_WRONLY, 0644)) < 0)
+		goto err;
+
+	if (write_chunk(fd, dh, sizeof(dof_helper_t)) < 0) {
+		close(fd);
+		goto err;
+	}
+
+        if (close(fd) < 0)
+		goto err;
+
+	return 0;
+
+ err:
+	fuse_log(FUSE_LOG_ERR, "dtprobed: cannot write out DOF helper: %s\n",
+		 strerror(errno));
+	return -1;
+}
+
+/*
+ * Add a piece of raw DOF from a given (pid, dev, ino) triplet.
+ *
+ * Return the new DOF generation number, or -1 on error.
+ */
+int
+dof_stash_add(pid_t pid, dev_t dev, ino_t ino, const dof_helper_t *dh,
+	      const void *dof, size_t size)
+{
+	char *dof_name = make_dof_name(dev, ino);
+	char *pid_name = make_numeric_name(pid);
+	int perpid_dir = -1;
+	int perpid_dof_dir = -1;
+	int gen_fd = -1;
+	struct stat gen_stat;
+	char *gen_name = NULL;
+	int new_pid = 1;
+	int new_dof = 0;
+	int err = -1;
+
+	if (!pid_name || !dof_name)
+		goto out_free;
+
+
+	/* Make the directories. */
+
+        perpid_dir = make_state_dirat(pid_dir, pid_name, "PID", 0);
+
+        if (perpid_dir < 0)
+		goto out_free;
+
+        perpid_dof_dir = make_state_dirat(perpid_dir, dof_name, "per-pid DOF", 0);
+
+        if (perpid_dof_dir < 0)
+		goto err_unlink_nomsg;
+
+        fuse_log(FUSE_LOG_DEBUG, "%i: adding DOF in mapping %lx/%lx, l_addr %lx\n",
+		 pid, dev, ino, dh->dofhp_addr);
+
+	/* Get the current generation counter. */
+
+        if ((gen_fd = openat(perpid_dir, "next-gen", O_CREAT | O_WRONLY, 0644)) < 0) {
+		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot open DOF generation counter for PID "
+			 "%i: %s\n", pid, strerror(errno));
+		goto err_unlink_nomsg;
+	}
+
+        if (fstat(gen_fd, &gen_stat) < 0) {
+		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot stat DOF generation counter for PID "
+			 "%i: %s\n", pid, strerror(errno));
+		goto err_unlink_nomsg;
+	}
+	if (gen_stat.st_size > 0)
+		new_pid = 0;
+
+	/*
+	 * Write out the raw DOF, if not already known, and link it into its
+	 * per-PID dir: write the helper directly out.  dof_stash_write tells us
+	 * whether the DOF actually needed writing or not (via a straight size
+	 * comparison, which suffices to spot almost everything other than the
+	 * pathological case of stale DOF from a killed process whose inode is
+	 * unlinked and replaced with another identically-numbered inode with
+	 * DOF of the *exact same length*: this seems too unlikely to be worth
+	 * bothering with.)
+	 */
+	new_dof = 1;
+	switch (dof_stash_write(dof_dir, dof_name, dof, size)) {
+	case 0: new_dof = 0; break;
+	case -1: goto err_unlink_nomsg; break;
+	}
+
+        if (linkat(dof_dir, dof_name, perpid_dof_dir, "raw", 0) < 0)
+		goto err_unlink_msg;
+
+	if (dof_stash_write_dh(perpid_dof_dir, dh) < 0)
+		goto err_unlink_nomsg;
+
+	/*
+	 * Update the generation counter and mark this DOF's generation.  On
+	 * error after this point we leak generation counter values.
+	 */
+	if (ftruncate(gen_fd, gen_stat.st_size + 1) < 0) {
+		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot update generation counter for PID %i: %s\n",
+			 pid, strerror(errno));
+		goto err_unlink_nomsg;
+	}
+
+	gen_name = make_numeric_name(gen_stat.st_size);
+	if (symlinkat(dof_name, perpid_dir, gen_name) < 0)
+		goto err_unlink_msg;
+
+        err = 0;
+
+out_free:
+        free(dof_name);
+	free(pid_name);
+
+        maybe_close(perpid_dir);
+        maybe_close(perpid_dof_dir);
+	maybe_close(gen_fd);
+
+        return err == 0 ? gen_stat.st_size : err;
+
+err_unlink_msg:
+	fuse_log(FUSE_LOG_ERR, "Cannot link pids for DOF from PID %i, "
+		 "DOF mapping %lx/%lx into place: %s\n", pid, dev, ino,
+		 strerror(errno));
+err_unlink_nomsg:
+        unlinkat(perpid_dof_dir, "raw", 0);
+        unlinkat(perpid_dof_dir, "dh", 0);
+	unlinkat(perpid_dir, dof_name, AT_REMOVEDIR);
+
+	if (gen_name)
+		unlinkat(perpid_dir, gen_name, 0);
+
+	if (new_pid) {
+		unlinkat(perpid_dir, "next-gen", 0);
+		unlinkat(pid_dir, pid_name, AT_REMOVEDIR);
+	}
+
+        if (new_dof)
+		unlinkat(dof_dir, dof_name, 0);
+
+	goto out_free;
+}
+
+/*
+ * The guts of unlinkat_many (below).  Returns -1 on error, or positive
+ * ENOTEMPTY if a subdir deletion failed because the dir was nonempty (any
+ * positive return leads to unwinding without doing any more deletions).
+ */
+static int
+unlinkat_many_internal(int dirfd, int diroff, const char **dirs)
+{
+	int fd;
+	struct stat s;
+	const char *err;
+	int flags = 0;
+	int ret;
+
+	/*
+	 * Open the dir/file at the next level down, then recurse to delete the
+	 * next one in first.  This is a directory unless the next element in
+	 * the list is NULL.
+	 */
+
+	if (dirs[diroff] == NULL)
+		return 0;
+
+	if (dirs[diroff + 1] != NULL)
+		flags = AT_REMOVEDIR;
+
+	if (fstatat(dirfd, dirs[diroff], &s, AT_SYMLINK_NOFOLLOW) < 0) {
+		if (errno == ENOENT)
+			return 0;
+		fuse_log(FUSE_LOG_ERR, "Cannot unlink %s during probe state dir deletion: %s\n",
+			 dirs[diroff], strerror(errno));
+		return -1;
+	}
+
+	if (flags == AT_REMOVEDIR && !S_ISDIR(s.st_mode)) {
+		err = "must be a directory, but is not";
+		goto err_keep_going;
+	} else if (flags == 0 && !S_ISREG(s.st_mode)) {
+		err = "not a regular file";
+		goto err_keep_going;
+	}
+
+	if ((fd = openat(dirfd, dirs[diroff], O_RDONLY | O_CLOEXEC)) < 0) {
+		fuse_log(FUSE_LOG_ERR, "Cannot open %s during probe state dir deletion: %s\n",
+			 dirs[diroff], strerror(errno));
+		char foo[PATH_MAX];
+		sprintf(foo, "ls -l /proc/%i/fd", getpid());
+		system(foo);
+		return -1;
+	}
+
+        ret = unlinkat_many_internal(fd, diroff + 1, dirs);
+	close(fd);
+
+	if (ret < 0)
+		return -1;
+	else if (ret > 0)
+		return 0;			/* ENOTEMPTY */
+
+	/*
+	 * Now do the deletion.
+	 */
+	if (unlinkat(dirfd, dirs[diroff], flags) < 0) {
+		if (errno == ENOENT)
+			return 0;
+		if (errno == ENOTEMPTY)
+			return ENOTEMPTY;
+        }
+	return 0;
+
+err_keep_going:
+	fuse_log(FUSE_LOG_ERR, "Cannot unlink %s during probe state dir deletion: "
+		 "%s\n", dirs[diroff], err);
+	return 0;
+}
+
+/*
+ * Unlink a file and then a bunch of directories, as given by the dirs
+ * argument (the last entry in which is a file).  Each directory in the dirs
+ * array is a subdirectory of the one before it.  If any directory removal
+ * fails due to non-emptiness, all later removals are skipped.  If any directory
+ * does not exist, removals start from that point.
+ *
+ * The last entry must be a file: all earlier entries must be directories.  If
+ * this is violated, just return without unlinking anything (but do log an error).
+ */
+static int
+unlinkat_many(int dirfd, const char **dirs)
+{
+	return unlinkat_many_internal(dirfd, 0, dirs);
+}
+
+/*
+ * Determine if a file or directory (in the DOF stash) should be deleted.
+ */
+static int
+refcount_cleanup_p(int dirfd, const char *name, int isdir)
+{
+	struct stat s;
+
+	if (fstatat(dirfd, name, &s, 0) != 0) {
+		fuse_log(FUSE_LOG_ERR, "Cannot stat %s for cleanup: %s\n",
+			 name, strerror(errno));
+		return -1;
+	}
+
+	if ((isdir && s.st_nlink != 2) || (!isdir && s.st_nlink != 1))
+		return 0;
+
+	return 1;
+}
+
+
+/*
+ * Delete a file or directory (in the DOF stash) if it has no other links.
+ */
+static int
+refcount_cleanup(int dirfd, const char *name, int isdir)
+{
+	switch (refcount_cleanup_p(dirfd, name, isdir)) {
+	case -1: return -1;
+	case 0: return 0;
+	default: break;
+	}
+
+	if (unlinkat(dirfd, name, isdir ? AT_REMOVEDIR : 0) < 0) {
+		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot remove old DOF %s: %s\n",
+			 name, strerror(errno));
+		return -1;
+	}
+	return 0;
+}
+
+/*
+ * Remove the parsed directory under a given mapping directory.
+ */
+static int
+dof_stash_remove_parsed(pid_t pid, int mapping_dir, const char *mapping)
+{
+	char *pid_name = make_numeric_name(pid);
+	int parsed_dir;
+	DIR *parsed;
+	struct dirent *ent;
+
+	if (((parsed_dir = openat(mapping_dir, "parsed", O_RDONLY | O_CLOEXEC)) < 0) ||
+	    ((parsed = fdopendir(parsed_dir)) == NULL)) {
+		int ret = (errno == ENOENT) ? 0 : -1;
+
+		if (ret < 0)
+			fuse_log(FUSE_LOG_ERR, "cannot open per-PID parsed DOF directory for PID %i mapping %s for cleanup: %s\n",
+				 pid, mapping, strerror(errno));
+
+		if (parsed_dir >= 0)
+			close(parsed_dir);
+
+		free(pid_name);
+		return ret;
+	}
+
+	/*
+	 * Scan the parsed dir, clean up all the matching per-probe
+	 * dirs, and delete the parsed dir's contents too.
+	 */
+
+        while (errno = 0, (ent = readdir(parsed)) != NULL) {
+		const char *mod, *fun, *prb;
+		const char *unlink_err;
+		char *prov;
+
+                if (errno != 0)
+			goto err;
+
+		if (ent->d_type != DT_REG)
+			continue;
+
+		if ((prov = strdup(ent->d_name)) == NULL)
+			goto err;
+
+		if (unlinkat(parsed_dir, ent->d_name, 0) < 0) {
+			unlink_err = "parsed dir";
+			goto next;
+		}
+
+		if (strcmp(ent->d_name, "version") == 0) {
+			free(prov);
+			continue;
+		}
+
+		if (split_probespec_name(prov, &mod, &fun, &prb) < 0) {
+			free(prov);
+			continue;
+		}
+
+		const char *dirnames[] = {pid_name, prov, mod, fun, prb, NULL};
+		if (unlinkat_many(probes_dir, dirnames) < 0) {
+			unlink_err = "parsed probes dir entries";
+			goto next;
+		}
+		free(prov);
+		continue;
+
+	  next:
+		free(prov);
+                fuse_log(FUSE_LOG_ERR, "cannot unlink per-PID %s %s (mapping %s) for cleanup: %s\n",
+			 ent->d_name, unlink_err, mapping, strerror(errno));
+	}
+	closedir(parsed);
+	free(pid_name);
+
+        return 0;
+
+  err:
+	closedir(parsed);
+	free(pid_name);
+	fuse_log(FUSE_LOG_ERR, "cannot open per-PID parsed DOF directory entry for PID %i mapping %s for cleanup: %s\n",
+		 pid, mapping, strerror(errno));
+	return -1;
+}
+
+/*
+ * Remove a piece of DOF, identified by generation counter.  Return -1 on error.
+ *
+ * Not knowing about the DOF is not an error.
+ */
+int
+dof_stash_remove(pid_t pid, int gen)
+{
+	char *pid_name = make_numeric_name(pid);
+	char *gen_name = make_numeric_name(gen);
+	char *gen_linkname = NULL;
+	int perpid_dir = -1;
+	int perpid_dof_dir = -1;
+	struct stat pid_stat;
+	struct stat gen_stat;
+	int err = -1;
+	const char *unlink_err = NULL;
+
+	/*
+	 * Figure out the per-PID DOF directory by following the gen-counter
+	 * symlink (by hand, because we want to reuse its name in the non-
+	 * PID-specific dof/ subdir too).
+	 */
+
+	if (!pid_name || !gen_name)
+		goto oom;
+
+        if ((perpid_dir = openat(pid_dir, pid_name, O_PATH | O_CLOEXEC)) < 0)
+		goto out;
+
+	if (fstatat(perpid_dir, gen_name, &gen_stat, AT_SYMLINK_NOFOLLOW) < 0)
+		goto out;
+
+	if ((gen_linkname = malloc(gen_stat.st_size + 1)) == NULL)
+		goto oom;
+
+	/*
+	 * Truncated link -> concurrent modifications, not allowed.
+	 */
+	if (readlinkat(perpid_dir, gen_name, gen_linkname, gen_stat.st_size + 1) ==
+	    gen_stat.st_size + 1) {
+		fuse_log(FUSE_LOG_ERR, "dtprobed: internal error: state dir is under concurrent modification, maybe multiple dtprobeds are running\n");
+		goto err;
+	}
+	gen_linkname[gen_stat.st_size] = 0;
+
+	perpid_dof_dir = openat(perpid_dir, gen_linkname, O_PATH | O_CLOEXEC);
+
+        if (strchr(gen_linkname, '/') != NULL ||
+	    perpid_dof_dir < 0) {
+                if (perpid_dof_dir >= 0)
+			close(perpid_dof_dir);
+                fuse_log(FUSE_LOG_ERR, "dtprobed: internal error: generation link for PID %i generation %i is %s, which is not valid\n",
+			 pid, gen, gen_linkname);
+		goto err;
+	}
+
+	if (faccessat(dof_dir, gen_linkname, F_OK, 0) < 0) {
+		fuse_log(FUSE_LOG_ERR, "dtprobed: internal error: generation link for PID %i generation %i is %s, which has no raw DOF\n",
+			 pid, gen, gen_linkname);
+		goto err;
+	}
+
+	/*
+	 * Unlink the per-PID raw and parsed DOF and the probe dirs, then check
+	 * the link counts on the non-per-PID stuff; if no other links remain,
+	 * unlink that too.
+	 */
+
+	fuse_log(FUSE_LOG_DEBUG, "gen_name: %s; gen_linkname: %s; perpid_dof_dir: %i\n", gen_name, gen_linkname, perpid_dof_dir);
+        if (unlinkat(perpid_dof_dir, "raw", 0) != 0 && errno != ENOENT) {
+		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot unlink per-PID raw DOF for PID %i generation %i: %s\n",
+			 pid, gen, strerror(errno));
+	}
+
+	if (dof_stash_remove_parsed(pid, perpid_dof_dir, gen_linkname) < 0)
+		unlink_err = "parsed probes dir entries";
+
+	if (unlinkat(perpid_dof_dir, "parsed", AT_REMOVEDIR) < 0)
+		unlink_err = "parsed dir";
+
+	if (unlinkat(perpid_dof_dir, "dh", 0) < 0)
+		unlink_err = "dh";
+
+	if (unlinkat(perpid_dir, gen_linkname, AT_REMOVEDIR) < 0)
+		unlink_err = gen_linkname;
+
+	if (unlinkat(perpid_dir, gen_name, 0) < 0)
+		unlink_err = gen_name;
+
+	if (refcount_cleanup(dof_dir, gen_linkname, 0) < 0)
+		unlink_err = gen_linkname;
+
+	/*
+	 * Clean up the PID directory itself, if it is now empty.  We can't just
+	 * use refcount_cleanup here because we also have to delete the
+	 * generation counter right before deleting the directory.
+	 */
+
+	if (fstatat(pid_dir, pid_name, &pid_stat, 0) < 0) {
+		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot stat per-PID for pid %i: %s\n",
+			 pid, strerror(errno));
+		goto err;
+	}
+
+	if (pid_stat.st_nlink == 2) {
+		if (unlinkat(perpid_dir, "next-gen", 0) < 0) {
+			fuse_log(FUSE_LOG_ERR, "dtprobed: cannot clean up gen counter in per-PID dir for PID %i: %s\n",
+				 pid, strerror(errno));
+			goto err;
+		}
+		refcount_cleanup(pid_dir, pid_name, 1);
+	}
+
+	if (unlink_err)
+		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot unlink per-PID DOF %s for PID %i generation %i: %s\n",
+			 unlink_err, pid, gen, strerror(errno));
+
+ out:
+	err = 0;
+
+ err:
+        free(pid_name);
+	free(gen_name);
+	free(gen_linkname);
+
+        maybe_close(perpid_dir);
+        maybe_close(perpid_dof_dir);
+
+	return err;
+
+oom:
+	fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory unlinking DOF\n");
+	goto err;
+}
+
+/*
+ * Prune dead processes out of the DOF stash.  This is not a correctness
+ * operation, just a space-waste reducer. If a process with DOF died uncleanly
+ * and its PID was recycled since a cleanup was last run, we'll leave some DOF
+ * around for a bit, but there are no correctness implications: either the new
+ * process has no DOF at all, in which case we'll never learn about it, or it
+ * does, in which case either it has identical mappings (in which case we'll
+ * just reuse them) or it has different ones (in which case we'll just add to
+ * them): in neither case is there any negative consequence of the nonexistent
+ * mappings other than a bit of space wastage.  (This implies that any process
+ * reading DOF out of the state directory must check that a process that claims
+ * to have DOF at a given mapping actually has a corresponding mapping with the
+ * same dev/ino pair at the right address. There's no need for that process to
+ * clean it up if found to be wrong, though: this function will eventually get
+ * around to it.)
+ */
+void
+dof_stash_prune_dead(void)
+{
+	DIR *all_pids_dir;
+	struct dirent *pid_ent;
+	int tmp;
+
+	fuse_log(FUSE_LOG_DEBUG, "Pruning dead PIDs\n");
+
+	if ((tmp = dup(pid_dir)) < 0) {
+		fuse_log(FUSE_LOG_ERR, "cannot dup() for cleanup: %s\n", strerror(errno));
+		return;
+	}
+
+        if ((all_pids_dir = fdopendir(tmp)) == NULL) {
+		close(tmp);
+		fuse_log(FUSE_LOG_ERR, "cannot clean up per-PID DOF directory: %s\n",
+			 strerror(errno));
+		return;
+	}
+	rewinddir(all_pids_dir);
+
+	/*
+	 * Work over all the PIDs.
+	 */
+	while (errno = 0, (pid_ent = readdir(all_pids_dir)) != NULL) {
+		DIR *perpid_dir;
+		struct dirent *mapping_ent;
+		char *end_pid;
+		pid_t pid;
+
+		if (errno != 0)
+			goto scan_failure;
+
+		if (pid_ent->d_type != DT_DIR)
+			continue;
+
+		/*
+		 * Only directories with numeric names can be PIDs: skip all the
+		 * rest.
+		 */
+		pid = strtol(pid_ent->d_name, &end_pid, 10);
+		if (*end_pid != '\0')
+			continue;
+
+		if (kill(pid, 0) == 0 || errno != ESRCH)
+			continue;
+
+		/*
+		 * Process dead: clean up its DOF, deleting everything in the
+		 * per-PID directory by calling dof_stash_remove() on every
+		 * mapping in turn.
+		 */
+
+		if ((tmp = openat(pid_dir, pid_ent->d_name, O_RDONLY | O_CLOEXEC)) < 0) {
+			fuse_log(FUSE_LOG_ERR,"cannot open per-PID DOF mappings directory for cleanup: %s\n",
+				 strerror(errno));
+			goto out;
+		}
+
+		if ((perpid_dir = fdopendir(tmp)) == NULL) {
+			fuse_log(FUSE_LOG_ERR, "cannot clean up per-PID DOF mappings: %s\n",
+				 strerror(errno));
+			close(tmp);
+			goto out;
+		}
+
+		fuse_log(FUSE_LOG_DEBUG, "Pruning dead PID %s\n", pid_ent->d_name);
+
+		/* Work over all the mappings in this PID. */
+
+                while (errno = 0, (mapping_ent = readdir(perpid_dir)) != NULL) {
+			int gen;
+			char *end_gen;
+
+                        if (errno != 0) {
+				fuse_log(FUSE_LOG_ERR, "cannot read per-PID DOF mappings for cleanup: %s\n",
+					 strerror(errno));
+				closedir(perpid_dir);
+				goto out;
+			}
+
+			if (mapping_ent->d_type != DT_LNK)
+				continue;
+
+			fuse_log(FUSE_LOG_DEBUG, "Working over generation %s\n", mapping_ent->d_name);
+
+			gen = strtol(mapping_ent->d_name, &end_gen, 10);
+			if (*end_gen != '\0')
+				continue;
+
+                        dof_stash_remove(pid, gen);
+		}
+		closedir(perpid_dir);
+		errno = 0;
+	}
+
+	if (errno != 0)
+		goto scan_failure;
+
+out:
+	closedir(all_pids_dir);
+	return;
+
+scan_failure:
+	fuse_log(FUSE_LOG_ERR, "cannot scan per-PID DOF directory for cleanup: %s\n",
+		 strerror(errno));
+	goto out;
+}
+
+/*
+ * Read a file into a buffer and return it.  If READ_SIZE is non-negative, read
+ * only that many bytes (and silently fail if the file isn't at least that
+ * long).
+ */
+static void *
+read_file(int fd, ssize_t read_size, size_t *size)
+{
+	struct stat s;
+	char *buf;
+	char *bufptr;
+	int len;
+
+        if (fstat(fd, &s) < 0) {
+		fuse_log(FUSE_LOG_ERR, "cannot stat: %s\n", strerror(errno));
+		return NULL;
+	}
+	if (read_size >= 0) {
+		if (s.st_size < read_size)
+			return NULL;
+	} else
+		read_size = s.st_size;
+
+	if ((buf = malloc(read_size)) == NULL) {
+		fuse_log(FUSE_LOG_ERR, "out of memory allocating %zi bytes\n",
+			 read_size);
+		return NULL;
+	}
+
+	if (size)
+		*size = read_size;
+
+	bufptr = buf;
+	while ((len = read(fd, bufptr, read_size)) < read_size) {
+		if (len < 0) {
+			if (errno != EINTR) {
+				fuse_log(FUSE_LOG_ERR, "cannot read: %s\n",
+					 strerror(errno));
+				free(buf);
+				return NULL;
+			}
+			continue;
+		}
+		if (len == 0) {
+			/* Unexpected EOF: the file shrank under us. */
+			free(buf);
+			return NULL;
+		}
+		read_size -= len;
+		bufptr += len;
+	}
+	return buf;
+}
+
+/*
+ * Reparse all DOF.  Mappings that cannot be reparsed are simply ignored, on the
+ * grounds that most DOF, most of the time, is not used, so this will likely be
+ * ignorable.
+ *
+ * Normally, DOF whose parsed representation has the same DOF_PARSED_VERSION as
+ * the currently running daemon is left alone: if force is set, even that is
+ * reparsed.
+ */
+int
+reparse_dof(int out, int in,
+	    int (*reparse)(int pid, int out, int in, dev_t dev, ino_t inum,
+			   dof_helper_t *dh, const void *in_buf, size_t in_bufsz,
+			   int reparsing),
+	    int force)
+{
+	DIR *all_pids_dir;
+	struct dirent *pid_ent;
+	int err = -1;
+	int tmp;
+
+	if ((tmp = dup(pid_dir)) < 0) {
+		fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot dup(): %s\n", strerror(errno));
+		return -1;
+	}
+
+        if ((all_pids_dir = fdopendir(tmp)) == NULL) {
+		close(tmp);
+		fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot read per-PID DOF directory: %s\n",
+			 strerror(errno));
+		return -1;
+	}
+
+	/* Work over all the PIDs. */
+
+	rewinddir(all_pids_dir);
+	while (errno = 0, (pid_ent = readdir(all_pids_dir)) != NULL) {
+		int perpid_fd;
+		char *end_pid;
+		pid_t pid;
+		DIR *perpid_dir;
+		struct dirent *mapping_ent;
+
+		if (errno != 0)
+			goto scan_failure;
+
+		if (pid_ent->d_type != DT_DIR)
+			continue;
+
+		/*
+		 * Only directories with numeric names can be PIDs: skip all the
+		 * rest.
+		 */
+		pid = strtol(pid_ent->d_name, &end_pid, 10);
+		if (*end_pid != '\0')
+			continue;
+
+                if ((perpid_fd = openat(pid_dir, pid_ent->d_name, O_RDONLY | O_CLOEXEC)) < 0) {
+			fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot open per-PID DOF mappings directory for PID %s: %s\n",
+				 pid_ent->d_name, strerror(errno));
+			goto out;
+		}
+
+		if ((tmp = dup(perpid_fd)) < 0) {
+			fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot dup(): %s\n", strerror(errno));
+			close(perpid_fd);
+			goto out;
+		}
+
+		if ((perpid_dir = fdopendir(tmp)) == NULL) {
+			fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot read per-PID DOF mappings for pid %s: %s\n",
+				 pid_ent->d_name, strerror(errno));
+			close(tmp);
+			close(perpid_fd);
+			goto out;
+		}
+
+		/* Work over all the mappings in this PID. */
+
+                while (errno = 0, (mapping_ent = readdir(perpid_dir)) != NULL) {
+			int mapping_fd;
+			int fd;
+			dev_t dev;
+			ino_t ino;
+			size_t dof_size, dh_size;
+			void *dof = NULL;
+			void *dh = NULL;
+
+                        if (errno != 0) {
+				fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot read per-PID DOF mappings for pid %s: %s\n",
+					 pid_ent->d_name, strerror(errno));
+				goto perpid_err;
+			}
+
+			if (mapping_ent->d_type != DT_DIR)
+				continue;
+
+			/*
+			 * Hidden files cannot be mappings.  (This also skips
+			 * . and ..).
+			 */
+			if (mapping_ent->d_name[0] == '.')
+				continue;
+
+                        if ((mapping_fd = openat(perpid_fd, mapping_ent->d_name,
+						 O_PATH | O_CLOEXEC)) < 0) {
+				fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot open per-PID DOF mapping for PID %s, mapping %s: %s\n",
+					 pid_ent->d_name, mapping_ent->d_name, strerror(errno));
+				goto perpid_err;
+			}
+
+			fuse_log(FUSE_LOG_DEBUG, "Parsed DOF is present: evaluating\n");
+
+			/*
+			 * Only DOF with a different parsed representation gets
+			 * reparsed.
+			 */
+			if (!force) {
+				if ((fd = openat(mapping_fd, "parsed/version",
+						 O_RDONLY | O_CLOEXEC)) >= 0) {
+					uint64_t *parsed_version;
+					int current;
+
+					parsed_version = read_file(fd, sizeof(uint64_t),
+								   NULL);
+					close(fd);
+
+					current = (parsed_version != NULL &&
+						   *parsed_version == DOF_PARSED_VERSION);
+					free(parsed_version);
+
+					if (current) {
+						fuse_log(FUSE_LOG_DEBUG, "No need to reparse\n");
+						close(mapping_fd);
+						continue;
+					}
+				}
+			}
+
+			fuse_log(FUSE_LOG_DEBUG, "Deleting\n");
+
+			if (dof_stash_remove_parsed(pid, mapping_fd, mapping_ent->d_name) < 0) {
+				fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot delete parsed DOF for PID %s, mapping %s: %s\n",
+					 pid_ent->d_name, mapping_ent->d_name, strerror(errno));
+				close(mapping_fd);
+				goto perpid_err;
+			}
+
+			if (split_dof_name(mapping_ent->d_name, &dev, &ino) < 0) {
+				fuse_log(FUSE_LOG_ERR, "when reparsing DOF for PID %s, cannot derive dev/inum from %s: ignored\n",
+					 pid_ent->d_name, mapping_ent->d_name);
+				close(mapping_fd);
+				continue;
+			}
+
+			if ((fd = openat(mapping_fd, "raw", O_RDONLY | O_CLOEXEC)) < 0) {
+				fuse_log(FUSE_LOG_ERR, "when reparsing DOF, cannot open raw DOF for PID %s, mapping %s: ignored: %s\n",
+					 pid_ent->d_name, mapping_ent->d_name, strerror(errno));
+				close(mapping_fd);
+				continue;
+			}
+
+			if ((dof = read_file(fd, -1, &dof_size)) == NULL) {
+				fuse_log(FUSE_LOG_ERR, "when reparsing DOF, cannot read raw DOF for PID %s, mapping %s: ignored: %s\n",
+					    pid_ent->d_name, mapping_ent->d_name, strerror(errno));
+				close(mapping_fd);
+				close(fd);
+				continue;
+			}
+			close(fd);
+
+			if ((fd = openat(mapping_fd, "dh", O_RDONLY | O_CLOEXEC)) < 0) {
+				fuse_log(FUSE_LOG_ERR, "when reparsing DOF, cannot open dh for PID %s, mapping %s: ignored: %s\n",
+					    pid_ent->d_name, mapping_ent->d_name, strerror(errno));
+				free(dof);
+				close(mapping_fd);
+				continue;
+			}
+
+			if ((dh = read_file(fd, -1, &dh_size)) == NULL) {
+				fuse_log(FUSE_LOG_ERR, "when reparsing DOF, cannot read dh for PID %s, mapping %s: ignored: %s\n",
+					    pid_ent->d_name, mapping_ent->d_name, strerror(errno));
+				free(dof);
+				close(mapping_fd);
+				close(fd);
+				continue;
+			}
+			close(fd);
+
+			fuse_log(FUSE_LOG_DEBUG, "Reparsing DOF for PID %s, mapping %s\n",
+				 pid_ent->d_name, mapping_ent->d_name);
+			if (reparse(pid, out, in, dev, ino, dh, dof, dof_size, 1) < 0)
+				fuse_log(FUSE_LOG_ERR, "when reparsing DOF, cannot parse DOF for PID %s, mapping %s: ignored\n",
+					    pid_ent->d_name, mapping_ent->d_name);
+			free(dof);
+			free(dh);
+			close(mapping_fd);
+
+			continue;
+
+		perpid_err:
+			closedir(perpid_dir);
+			close(perpid_fd);
+			goto out;
+		}
+		closedir(perpid_dir);
+		close(perpid_fd);
+		errno = 0;
+	}
+
+	if (errno != 0)
+		goto scan_failure;
+
+	err = 0;
+out:
+	closedir(all_pids_dir);
+	return err;
+
+scan_failure:
+	fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot scan per-PID DOF directory: %s\n",
+		 strerror(errno));
+	goto out;
+}
diff --git a/dtprobed/dof_stash.h b/dtprobed/dof_stash.h
new file mode 100644
index 000000000000..2754a8180622
--- /dev/null
+++ b/dtprobed/dof_stash.h
@@ -0,0 +1,44 @@
+/*
+ * Oracle Linux DTrace; DOF storage for later probe removal.
+ * Copyright (c) 2023, Oracle and/or its affiliates. All rights reserved.
+ * Licensed under the Universal Permissive License v 1.0 as shown at
+ * http://oss.oracle.com/licenses/upl.
+ */
+
+#ifndef	_DOF_STASH_H
+#define	_DOF_STASH_H
+
+#include <sys/types.h>
+#include <stdint.h>
+
+#include <dt_list.h>
+#include "dof_parser.h"
+
+typedef struct dof_parsed_list {
+	dt_list_t list;
+	dof_parsed_t *parsed;
+} dof_parsed_list_t;
+
+int dof_stash_init(const char *statedir);
+
+int dof_stash_push_parsed(dt_list_t *accum, dof_parsed_t *parsed);
+
+int dof_stash_write_parsed(pid_t pid, dev_t dev, ino_t ino, dt_list_t *accum);
+
+void dof_stash_flush(dt_list_t *accum);
+
+int dof_stash_add(pid_t pid, dev_t dev, ino_t ino, const dof_helper_t *dh,
+		  const void *dof, size_t size);
+
+int dof_stash_remove(pid_t pid, int gen);
+
+void dof_stash_prune_dead(void);
+
+int
+reparse_dof(int out, int in,
+	    int (*reparse)(int pid, int out, int in, dev_t dev, ino_t inum,
+			   dof_helper_t *dh, const void *in_buf,
+			   size_t in_bufsz, int reparsing),
+	    int force);
+
+#endif
diff --git a/dtprobed/dtprobed.service b/dtprobed/dtprobed.service
index 95f0a1d6d606..70443f08fb9b 100644
--- a/dtprobed/dtprobed.service
+++ b/dtprobed/dtprobed.service
@@ -20,6 +20,8 @@ ProtectHome=true
 PrivateDevices=false
 PrivateNetwork=true
 ProtectControlGroups=true
+RuntimeDirectory=dtrace
+RuntimeDirectoryPreserve=restart
 
 [Install]
 WantedBy=basic.target
diff --git a/dtrace.spec b/dtrace.spec
index 6d81222088fb..5306256efdfa 100644
--- a/dtrace.spec
+++ b/dtrace.spec
@@ -185,6 +185,10 @@ make DESTDIR=$RPM_BUILD_ROOT VERSION=%{version} \
      HDRPREFIX="$RPM_BUILD_ROOT/usr/include" \
      install install-test
 
+%if "%{?dist}" == ".el7"
+sed -i '/^ProtectSystem=/d; /^ProtectControlGroups=/d; /^RuntimeDirectory/d;' $RPM_BUILD_ROOT/usr/lib/systemd/system/dtprobed.service
+%endif
+
 # Because systemtap creates a sdt.h header file we have to rename
 # ours and then shift theirs out of the way.
 mv $RPM_BUILD_ROOT/usr/include/sys/sdt.h \
diff --git a/libcommon/dof_parser.h b/libcommon/dof_parser.h
index 7ab18f4dee01..41ef5a193b7f 100644
--- a/libcommon/dof_parser.h
+++ b/libcommon/dof_parser.h
@@ -1,6 +1,6 @@
 /*
  * Oracle Linux DTrace; DOF parser interface with the outside world
- * Copyright (c) 2022, Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2022, 2023, Oracle and/or its affiliates. All rights reserved.
  * Licensed under the Universal Permissive License v 1.0 as shown at
  * http://oss.oracle.com/licenses/upl.
  */
@@ -15,11 +15,10 @@
 #include <dtrace/helpers.h>
 
 /*
- * Result of DOF probe parsing (currently used only for probe creation. We
- * receive a provider info structure, followed by N probe info structures each
- * of which is followed by possibly many tracepoint info structures, all tagged.
- * Things not useful for probe creation (like args, translated types, etc) are
- * not returned.
+ * Result of DOF probe parsing.  We receive a provider info structure, followed
+ * by N probe info structures each of which is followed by possibly many
+ * tracepoint info structures, all tagged.  Things not useful for probe creation
+ * (like args, translated types, etc) are not returned.
  *
  * On error, a DIT_ERR structure is returned with an error message.
  */
@@ -31,6 +30,15 @@ typedef enum dof_parsed_info {
 	DIT_ERR = 3
 } dof_parsed_info_t;
 
+/*
+ * Bump this whenever dof_parsed changes.
+ *
+ * Files consisting of arrays of dof_parsed have a single 64-bit word at the
+ * start which is the version of the dof_parseds within it.  The data flowing
+ * over the stream from the seccomped parser has no such prefix.
+ */
+#define DOF_PARSED_VERSION 1
+
 typedef struct dof_parsed {
 	/*
 	 * Size of this instance of this structure.
-- 
2.42.0.271.g85384428f1

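A note on the read loop in read_file() above: a loop that wants an exact
number of bytes has to cope with three outcomes of read(): a short read (keep
going), EINTR (retry), and end-of-file before the requested length (give up,
otherwise the loop never terminates).  Below is a minimal sketch of such a
loop under those assumptions; read_exact() is a hypothetical helper, not a
function in the patch, and it reads into a caller-supplied buffer rather than
allocating one.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Read exactly N bytes from FD into BUF.  Returns 0 on success, -1 on a read
 * error or if end-of-file arrives before N bytes have been read.
 * (Hypothetical helper for illustration; not part of the patch.)
 */
int
read_exact(int fd, void *buf, size_t n)
{
	char *p = buf;

	while (n > 0) {
		ssize_t len = read(fd, p, n);

		if (len < 0) {
			if (errno == EINTR)
				continue;	/* interrupted: retry */
			return -1;
		}
		if (len == 0)
			return -1;		/* EOF before N bytes */

		p += len;
		n -= len;
	}
	return 0;
}
```

A caller checking the 64-bit version word at the start of a parsed DOF file
could read sizeof(uint64_t) bytes this way and compare the result against
DOF_PARSED_VERSION.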


