[DTrace-devel] [PATCH v3 06/21] dtprobed: add the DOF stash

Kris Van Hees kris.van.hees at oracle.com
Thu Feb 15 04:32:53 UTC 2024


Continuing at dof_stash_write_parsed()...

On Tue, Jan 16, 2024 at 09:13:02PM +0000, Nick Alcock via DTrace-devel wrote:
> The DOF stash is a runtime directory (/run/dtrace/stash by default) in which
> the dtprobed daemon keeps track of incoming DOF.  There's a comment
> describing it at the top of dtprobed/dof_stash.c, but in brief we have two
> directories under /run/dtrace/stash:
> 
> - dof/, which contains raw DOF named by mapping
> - dof-pid/, which tracks DOF on a PID-by-PID and mapping-by-mapping basis.
>   This contains subdirectories $pid/$mapping containing both the same raw
>   DOF (hardlinked) and parsed DOF split up by probe, in
>   $pid/$mapping/parsed/$spec.
> 
> To aid consumer globbing of probes we also put the DOF for each individual
> probe in /run/dtrace/probes/$pid/$prv$pid/$mod/$fun/$prb; in future we may
> introduce further symlinks not in per-pid directories to ease matching of
> probes across processes.
> 
> Consumers can use inotify watches to detect newly added or removed DOF or
> probes: because all the DOF is in one directory, a very small number of
> watches is needed to keep track of all the DOF in the system, and only one
> watch per PID is needed to track it on a PID-by-PID basis. (Currently,
> nothing is doing this, but it is now a possibility where it never was
> before.)
> 
> In a future commit, dtprobed will emit all the DOF it receives into this
> structure, removing it when processes exit, and DTrace will use it to create
> and remove probes.
> 
> dof_stash.c itself provides functions to add new raw DOF to the stash, to
> add a stream of parsed DOF corresponding to a given piece of raw DOF to the
> stash, to remove DOF as processes issue DTRACEHIOC_REMOVE requests, to check
> for any DOF that needs reparsing to handle changes to struct dof_parsed, and
> to remove entries from the DOF stash that is left behind by processes that
> died uncleanly without issuing DTRACEHIOC_REMOVEs. (Currently, we don't deal
> with newly-started processes that reuse the same PID as such a process, or
> processes that leave DOF behind on exec() and then add more: this
> improvement will come soon.)
> 
> It does not do any DOF parsing: that is left up to the caller (dtprobed.c).
> 
> All the writes to the dof_stash are done via the *at() functions to avoid
> any risk of bugs causing the file creations and most especially deletions it
> carries out from leaking outside /run/dtrace.  (We are not concerned about
> symlink attacks etc because only root can write to /run/dtrace or any
> directories underneath it.)
> 
> Signed-off-by: Nick Alcock <nick.alcock at oracle.com>
> ---
>  dtprobed/Build            |    2 +-
>  dtprobed/dof_stash.c      | 1621 +++++++++++++++++++++++++++++++++++++
>  dtprobed/dof_stash.h      |   44 +
>  dtprobed/dtprobed.service |    2 +
>  dtrace.spec               |    4 +
>  libcommon/dof_parser.h    |   20 +-
>  6 files changed, 1686 insertions(+), 7 deletions(-)
>  create mode 100644 dtprobed/dof_stash.c
>  create mode 100644 dtprobed/dof_stash.h
> 
> diff --git a/dtprobed/Build b/dtprobed/Build
> index c44dd40ab3d2..f1e0dfff456f 100644
> --- a/dtprobed/Build
> +++ b/dtprobed/Build
> @@ -26,7 +26,7 @@ dtprobed_CPPFLAGS := -I. -Idtprobed -Ilibproc -Ilibcommon -Ilibport
>  dtprobed_CFLAGS := $(shell pkg-config --cflags $(LIBFUSE))
>  dtprobed_LIBS := -lcommon -lproc -lcommon -lport -lelf $(shell pkg-config --libs $(LIBFUSE))
>  dtprobed_DEPS := libproc.a libcommon.a libport.a
> -dtprobed_SOURCES := dtprobed.c seccomp-assistance.c
> +dtprobed_SOURCES := dtprobed.c dof_stash.c seccomp-assistance.c
>  dtprobed_LIBSOURCES := libproc libcommon
>  
>  ifdef HAVE_LIBSYSTEMD
> diff --git a/dtprobed/dof_stash.c b/dtprobed/dof_stash.c
> new file mode 100644
> index 000000000000..a3a7af163398
> --- /dev/null
> +++ b/dtprobed/dof_stash.c
> @@ -0,0 +1,1621 @@
> +/*
> + * Oracle Linux DTrace; DOF storage for later probe removal.
> + * Copyright (c) 2023, Oracle and/or its affiliates. All rights reserved.
> + * Licensed under the Universal Permissive License v 1.0 as shown at
> + * http://oss.oracle.com/licenses/upl.
> + *
> + * The DOF stash is the principal state preservation mechanism used by dtprobed.
> + * Keeping DOF on disk keeps memory management complexity low, and makes it
> + * possible for dtprobed to track DOF across daemon restarts and upgrades.
> + *
> + * The state storage structure is as follows:
> + *
> + * /run/dtrace/stash: Things private to dtprobed.
> + *
> + * /run/dtrace/probes: Per-probe info, written by dtprobed, read by DTrace.
> + *
> + * /run/dtrace/stash/dof/$dev-$ino: DOF contributed by particular mappings,
> + * in raw form (as received from the client).
> + *
> + * /run/dtrace/stash/dof-pid/$pid/$dev-$ino/raw: hardlink to the DOF for a given
> + * PID. Pruned of dead processes at startup and on occasion: entries also
> + * deleted on receipt of DTRACEHIOC_REMOVE ioctls. The principal effect of this
> + * is to bump the link count for the corresponding DOF in dof/ directory: when
> + * this falls to 1, the DOF is considered dead and the corresponding probe is
> + * removed.
> + *
> + * /run/dtrace/stash/dof-pid/$pid/$dev-$ino/dh: Serialized form of the
> + * dof_helper_t received from a given client. Preserved for all
> + * DTRACEHIOC_ADDDOFs, even if the DOF was already contributed by some other
> + * process.  Keeping this around allows the DOF to be re-parsed as needed, even
> + * by a later re-execution of a different version of the same daemon with a
> + * different dof_parsed_t structure, and to be re-parsed with appropriate
> + * addresses for whichever mapping of this DOF is under consideration.
> + *
> + * /usr/dtrace/stash/dof-pid/$pid/$dev-$ino/parsed: parsed DOF for a given PID.
> + * Generated from the helper for PIDs that are the first appearance of a given
> + * piece of DOF and used for the common case where that DOF has only one mapping
> + * until its deletion (short-lived processes, mostly).  Deleted at daemon
> + * startup.
> + *
> + * /run/dtrace/stash/dof-pid/$pid/$gen: symlink to a $dev-$ino dir. "$gen" is an
> + * incrementing generation counter starting at zero for every unique PID, used
> + * by clients to uniquely identify DOF pieces to remove.
> + *
> + * /run/dtrace/stash/dof-pid/$pid/next-gen: The next generation counter to use
> + * for a new piece of DOF from this process.  Encoded as a sparse file whose
> + * size is the next counter value to use.
> + *
> + * Per-probe info:
> + *
> + * /run/dtrace/probes/$pid/$prv$pid/$mod/$fun/$prb: Parsed representation of one
> + * probe in a given process. Removed by dtprobed when the process dies, or if
> + * all mappings containing the probe are unmmapped.  Used by DTrace for tracing
> + * by PID.  Hardlink to...
> + *
> + * /run/dtrace/stash/dof-pid/$pid/$dev-$ino/parsed/$prv$pid:$mod:$fun:$prb:
> + * parsed DOF for a given probe for a given PID.  Generated from the helper for
> + * PIDs that are the first appearance of a given piece of DOF and used for the
> + * common case where that DOF has only one mapping until its deletion
> + * (short-lived processes, mostly).  Deleted at daemon startup if the
> + * dof_parsed_t structure changed.  The file consists of a uint64_t version
> + * number (DOF_PARSED_VERSION), a uint64_t count of tracepoints, and then
> + * that many struct dof_parsed_t's of type DIT_TRACEPOINT.
> + *
> + * In future, symlinks in /run/dtrace/probes/$prv$pid may exist, used for
> + * globbing provider names across processes.
> + */
> +
> +#include <sys/stat.h>
> +#include <sys/types.h>
> +#include <dirent.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <inttypes.h>
> +#include <signal.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <assert.h>
> +#include "dof_stash.h"
> +
> +#include <dt_list.h>
> +
> +#ifdef HAVE_FUSE_LOG
> +#include <fuse_log.h>
> +#else
> +#include "rpl_fuse_log.h"
> +#endif
> +
> +#include <port.h>
> +
> +static int dof_dir, pid_dir, probes_dir;
> +
> +/*
> + * Like close(), but only try to close positively-numbered fds.
> + */
> +static int
> +maybe_close(int fd)
> +{
> +	if (fd > -1)
> +		return close(fd);
> +	return 0;
> +}
> +
> +/*
> + * Like mkdirat(), but already-existing directories are considered
> + * successful.
> + */
> +static int
> +mkexistingdirat(int dirfd, const char *name, mode_t mode)
> +{
> +	if (mkdirat(dirfd, name, mode) < 0) {
> +		if (errno == EEXIST) {
> +			struct stat s;
> +
> +			if (fstatat(dirfd, name, &s, 0) == 0 &&
> +			    S_ISDIR(s.st_mode))
> +				return 0;
> +			errno = EEXIST;
> +		}
> +		return -1;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Make a runtime state directory and return its fd.
> + */
> +static int
> +make_state_dir(const char *dir)
> +{
> +	int fd;
> +
> +        if (mkdir(dir, 0755) < 0 && errno != EEXIST) {
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: "
> +			 "cannot create state directory %s: %s\n",
> +			 dir, strerror(errno));
> +		return -1;
> +	}
> +
> +	if ((fd = open(dir, O_PATH | O_CLOEXEC)) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: "
> +			 "cannot open state directory %s: %s\n",
> +			 dir, strerror(errno));
> +		return -1;
> +	}
> +	return fd;
> +}
> +
> +/*
> + * Make a state directory under some other directory and return an fd to it.
> + *
> + * If "readable" is set, the fd is opened with O_RDONLY rather than O_PATH.
> + */
> +static int
> +make_state_dirat(int dirfd, const char *name, const char *errmsg, int readable)
> +{
> +	int fd;
> +	int flags = O_PATH;
> +
> +	if (readable)
> +		flags = O_RDONLY;
> +
> +	if (mkexistingdirat(dirfd, name, 0755) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot create %s dir named %s: %s\n",
> +			 errmsg, name, strerror(errno));
> +		return -1;
> +	}
> +
> +	if ((fd = openat(dirfd, name, flags | O_CLOEXEC)) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot open %s dir named %s: %s\n",
> +			 errmsg, name, strerror(errno));
> +		return -1;
> +	}
> +	return fd;
> +}
> +
> +/*
> + * Initialize the DOF-stashing machinery, if needed.
> + */
> +int
> +dof_stash_init(const char *statedir)
> +{
> +	int root;
> +	int stash;
> +
> +	if (statedir == NULL)
> +		statedir="/run/dtrace";
> +
> +        if ((root = make_state_dir(statedir)) < 0)
> +		return -1;
> +
> +        if ((stash = make_state_dirat(root, "stash", "statedir", 0)) < 0)
> +		return -1;
> +
> +	if ((dof_dir = make_state_dirat(stash, "dof", "DOF", 0)) < 0)
> +		return -1;
> +        if ((pid_dir = make_state_dirat(stash, "dof-pid", "per-pid DOF", 1)) < 0)
> +		return -1;
> +
> +        if ((probes_dir = make_state_dirat(root, "probes", "probesdir", 1)) < 0)
> +		return -1;
> +
> +        close(root);
> +	close(stash);
> +
> +	return 0;
> +}
> +
> +/*
> + * Compose a provider's name from pieces.
> + */
> +static char *
> +make_provider_name(const char *prov, pid_t pid)
> +{
> +	char *ret;
> +
> +        if (asprintf(&ret, "%s%li", prov, (long) pid) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory making provider name\n");
> +		return NULL;
> +	}
> +	return ret;
> +}
> +
> +/*
> + * Compose a full probespec's name from pieces.
> + */
> +static char *
> +make_probespec_name(const char *prov, const char *mod, const char *fn,
> +		    const char *prb)
> +{
> +	char *ret;
> +
> +        if (asprintf(&ret, "%s:%s:%s:%s", prov, mod, fn, prb) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory making probespec\n");
> +		return NULL;
> +	}
> +	return ret;
> +}
> +
> +/*
> + * Split the pieces back up again.  The input spec becomes the provider name.
> + * The pieces' lifetime is no longer than that of the input string.
> + */
> +static int
> +split_probespec_name(char *spec, const char **mod, const char **fn,
> +		     const char **prb)
> +{
> +	char *tmp;
> +
> +	tmp = strchr(spec, ':');
> +	if (tmp == NULL)
> +		return -1;
> +	*tmp = 0;
> +	*mod = tmp + 1;
> +
> +	tmp = strchr(*mod, ':');
> +	if (tmp == NULL)
> +		return -1;
> +	*tmp = 0;
> +	*fn = tmp + 1;
> +
> +	tmp = strchr(*fn, ':');
> +	if (tmp == NULL)
> +		return -1;
> +	*tmp = 0;
> +	*prb = tmp + 1;
> +
> +	return 0;
> +}
> +
> +/*
> + * Compose a DOF directory's name from pieces.
> + */
> +static char *
> +make_dof_name(dev_t dev, ino_t ino)
> +{
> +	char *ret;
> +
> +        if (asprintf(&ret, "%li-%li", (long) dev, (long) ino) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory making DOF name\n");
> +		return NULL;
> +	}
> +	return ret;
> +}
> +
> +/*
> + * Split the pieces back up again.
> + */
> +static int
> +split_dof_name(const char *dof_name, dev_t *dev, ino_t *ino)
> +{
> +	if (sscanf(dof_name, "%li-%li", (long *) dev, (long *) ino) < 2) {
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: per-pid directory name %s unparseable\n",
> +			 dof_name);
> +		return -1;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * A trivial wrapper.
> + */
> +static char *
> +make_numeric_name(long num)
> +{
> +	char *ret;
> +
> +        if (asprintf(&ret, "%li", num) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory making filename\n");
> +		return NULL;
> +	}
> +	return ret;
> +}
> +
> +
> +/*
> + * Push a piece of parsed DOF into a list to be emitted by
> + * dof_stash_write_parsed().
> + */
> +int
> +dof_stash_push_parsed(dt_list_t *accum, dof_parsed_t *parsed)
> +{
> +	dof_parsed_list_t *entry;
> +
> +	if ((entry = calloc(1, sizeof(dof_parsed_list_t))) == NULL)
> +		return -1;
> +
> +	entry->parsed = parsed;
> +        dt_list_append(accum, entry);
> +
> +        return 0;
> +}
> +
> +/*
> + * Free all the accumulated parsed DOF in the passed list.
> + */
> +void
> +dof_stash_flush(dt_list_t *accum)
> +{
> +	dof_parsed_list_t *accump;
> +	dof_parsed_list_t *last_accump = NULL;
> +
> +	for (accump = dt_list_next(accum); accump != NULL;
> +	     accump = dt_list_next(accump)) {
> +		dt_list_delete(accum, accump);
> +		free(accump->parsed);
> +		free(last_accump);
> +		last_accump = accump;
> +	}
> +	free(last_accump);
> +}
> +
> +/*
> + * Write a chunk of data to an fd.
> + */
> +static int
> +write_chunk(int fd, const void *buf, size_t size)
> +{
> +	char *bufptr = (char *) buf;
> +
> +	while (size > 0) {
> +		size_t len;
> +
> +                if ((len = write(fd, bufptr, size)) < 0 && errno != EINTR)
> +			return -1;
> +
> +		size -= len;
> +		bufptr += len;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Write out a piece of raw DOF.  Returns the length of the file written,
> + * or 0 if none was needed (or -1 on error).
> + */
> +static int
> +dof_stash_write(int dirfd, const char *name, const void *buf, size_t size)
> +{
> +	struct stat s;
> +	int fd;
> +
> +        /*
> +	 * Sanity check: if the DOF already exists but is not the same size
> +	 * as the DOF we already have, complain, and replace it and its parsed
> +	 * equivalent.  If it does exist, there's no need to write it out.
> +	 *
> +	 * If we can't even unlink it or write it out, we give up -- the stash
> +	 * has failed and we won't be able to do anything it implies.
> +	 *
> +	 * (This is only a crude check -- obviously distinct raw DOF could be
> +	 * the same size by pure chance.)
> +	 */
> +	if (fstatat(dirfd, name, &s, 0) == 0) {
> +		if (s.st_size == size)
> +			return 0;
> +
> +                fuse_log(FUSE_LOG_ERR, "dtprobed: DOF %s already exists, "
> +			 "but is %zx bytes long, not %zx: replacing\n",
> +			 name, s.st_size, size);
> +		if (unlinkat(dirfd, name, 0) < 0) {
> +			fuse_log(FUSE_LOG_ERR, "dtprobed: cannot remove old DOF %s: %s\n",
> +				 name, strerror(errno));
> +			return -1;
> +		}
> +	}
> +
> +	if ((fd = openat(dirfd, name, O_CREAT | O_EXCL | O_WRONLY, 0644)) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot write out raw DOF: %s\n",
> +			 strerror(errno));
> +		return -1;
> +	}
> +
> +	if (write_chunk(fd, buf, size) < 0)
> +		goto err;
> +
> +	if (close(fd) < 0)
> +		goto err;
> +	return size + sizeof(uint64_t);
> +
> +  err:
> +	fuse_log(FUSE_LOG_ERR, "dtprobed: cannot write out DOF: %s\n",
> +		 strerror(errno));
> +	unlinkat(dirfd, name, 0);
> +	close(fd);
> +	return -1;
> +}
> +

Review continues here...

> +/*
> + * Write out all the parsed DOF from the passed-in accumulator.  Split out from
> + * dof_stash_add() because we don't want to create anything in the DOF stash for
> + * a given piece of DOF until we have verified that all the parsed data for all
> + * the probes in that DOF is valid.
> + *
> + * Each parsed file starts with a single 64-bit word giving the version of
> + * dof_parsed_t in use, to ensure that readers never use an incompatible
> + * version, and a single uint64_t giving the number of tracepoints following,
> + * each of which is a struct dof_parsed_t. The name of the probe is not given,
> + * since the filename (or, for the probes/ dirs, the file and directory names)
> + * already contains it.
> + *
> + * We also write the parsed version out to the parsed/version file in the parsed
> + * mapping dir, to make it easier to determine whether to reparse all the DOF
> + * for outdatedness.
> + */
> +int
> +dof_stash_write_parsed(pid_t pid, dev_t dev, ino_t ino, dt_list_t *accum)
> +{
> +	char *pid_name = make_numeric_name(pid);
> +	char *dof_name = make_dof_name(dev, ino);
> +	char *provpid_name = NULL;
> +	int perpid_dir, perpid_dof_dir, probepid_dir, parsed_mapping_dir,
> +		prov_dir = -1, mod_dir = -1, fun_dir = -1;
> +        dof_parsed_list_t *accump;
> +	int version_fd, fd = -1;

Why oh why use 'fd' when all other directory and file descriptors have actual
descriptive names?  Especially since fd is carried across multiple iterations
in the loop.

> +	const char *op;
> +	const char *probe_err = "unknown probe";
> +	uint64_t version = DOF_PARSED_VERSION;
> +	int state = -1;
> +	int err;
> +
> +	if (dt_list_next(accum) == NULL) {
> +		free(pid_name);
> +		free(dof_name);
> +		return 0;
> +	}

Why not

	if (dt_list_next(accum) == NULL)
		goto early_out;

and set a label below at the successful case return that ends with these exact
three statements?  Might as well, given the presence of so many other gotos
anyway in this function.

> +
> +        if ((perpid_dir = openat(pid_dir, pid_name, O_PATH | O_CLOEXEC)) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: PID %i, %lx/%lx: open failed for PID dir for parsed DOF write: %s\n",
> +			 pid, dev, ino, strerror(errno));
> +		goto early_err;
> +	}
> +
> +	op = "per-mapping open";
> +        if ((perpid_dof_dir = openat(perpid_dir, dof_name, O_PATH | O_CLOEXEC)) < 0)
> +		goto close_perpid_err;
> +
> +	op = "per-mapping parsed DOF mkdirat";
> +	if ((parsed_mapping_dir = make_state_dirat(perpid_dof_dir, "parsed", op, 0)) < 0)
> +		goto close_perpid_dof_err;
> +
> +	op = "probe PID mkdirat";
> +	if ((probepid_dir = make_state_dirat(probes_dir, pid_name, op, 0)) < 0)
> +		goto remove_parsed_mapping_err;
> +
> +        if ((version_fd = openat(parsed_mapping_dir, "version",
> +				 O_CREAT | O_TRUNC | O_WRONLY, 0644)) < 0)
> +		goto remove_probepid_dir_err;
> +
> +	if (write_chunk(version_fd, &version, sizeof(uint64_t)) < 0) {
> +		close(version_fd);

Shouldn't this perform an unlinkat() on version_fd?

> +		goto remove_probepid_dir_err;

The code path starting at the remove_probepid_dir_err label does not actually
contain an unlinkat() for probepid_dir, so it never gets cleaned up in case of
error?

> +	}
> +	close(version_fd);
> +
> +	/*
> +	 * We can just use assert() to check for overruns here, because any
> +	 * overruns are not corrupted input but a seriously buggy parser.
> +	 *
> +	 * On error, from this point on, we remove as little as possible on
> +	 * error, just a single probe's worth: better some probes survive an
> +	 * error condition than none do.
> +	 */
> +	for (accump = dt_list_next(accum); accump != NULL;
> +	     accump = dt_list_next(accump)) {
> +		char *mod, *fun, *prb;
> +		char *specfn = NULL;
> +
> +		switch (accump->parsed->type) {
> +		/*
> +		 * Provider: make new provider dir.
> +		 */
> +		case DIT_PROVIDER: {
> +			state = accump->parsed->type;
> +
> +			free(provpid_name);
> +
> +			provpid_name = make_provider_name(accump->parsed->provider.name, pid);
> +			probe_err = accump->parsed->provider.name;
> +
> +			op = "parsed provider close";
> +			if (maybe_close(prov_dir) < 0)
> +				goto err_provider;
> +
> +			op = "probe provider mkdirat";
> +			if ((prov_dir = make_state_dirat(probepid_dir, provpid_name,
> +							 op, 1)) < 0)
> +				goto err_provider;
> +			break;
> +		}
> +		/*
> +		 * Probe: make (mod, fun, probe) dirs as needed: close
> +		 * out file, open new one, make hardlink to it.
> +		 */
> +		case DIT_PROBE: {
> +                        assert(state >= 0);

This is ok, but DIT_TRACEPOINT uses:

			assert(state == DIT_PROBE || state == DIT_TRACEPOINT);

which is much better, right?  So why not use that here?

> +			state = accump->parsed->type;
> +
> +			mod = accump->parsed->probe.name;
> +			assert(accump->parsed->size > (mod - (char *) accump->parsed));
> +			fun = mod + strlen(mod) + 1;
> +			assert(accump->parsed->size > (fun - (char *) accump->parsed));
> +			prb = fun + strlen(fun) + 1;
> +			assert(accump->parsed->size >= (prb - (char *) accump->parsed));
> +
> +			op = "probe name construction";
> +			probe_err = provpid_name;
> +			if ((specfn = make_probespec_name(provpid_name, mod, fun, prb)) == NULL)
> +				goto err_provider;
> +
> +			/*
> +			 * Ban "." and ".." as name components.  Obviously names
> +			 * containing dots are commonplace (shared libraries,
> +			 * for instance), but allowing straight . and .. would
> +			 * have obviously horrible consequences.  They can't be
> +			 * filenames anyway, and you can't create them with
> +			 * dtrace -h because they aren't valid C identifier
> +			 * names.
> +			 */
> +			op = "probe name validation";
> +			probe_err = specfn;
> +
> +                        if (strcmp(accump->parsed->provider.name, ".") == 0 ||
> +			    strcmp(accump->parsed->provider.name, "..") == 0 ||
> +			    strcmp(mod, ".") == 0 ||
> +			    strcmp(mod, "..") == 0 ||
> +			    strcmp(fun, ".") == 0 ||
> +			    strcmp(fun, "..") == 0 ||
> +			    strcmp(prb, ".") == 0 ||
> +			    strcmp(prb, "..") == 0)
> +				goto err_provider;
> +
> +			op = "parsed probe close";
> +
> +			/*
> +			 * This rather weird pattern carries out all closes and
> +			 * then errors if any of them fail.
> +			 */
> +                        err = maybe_close(mod_dir);
> +			err |= maybe_close(fun_dir);
> +			err |= maybe_close(fd);

Why not do this at the beginning of the DIT_PROBE processing?  Obviously, if
a previous DIT_PROBE existed, this new one ends the previous one.  There does
not seem to be any point in not doing these closes immediately.  Even if the
processing of this probe fails somehow, the previous one is known to be done.

> +
> +			if (err != 0)
> +				goto err_provider;
> +
> +			op = "probe module";
> +
> +			if ((mod_dir = make_state_dirat(prov_dir, mod, op, 0)) < 0)
> +				goto err_provider;
> +
> +			op = "probe fun";
> +			if ((fun_dir = make_state_dirat(mod_dir, fun, op, 0)) < 0)
> +				goto err_probemod;
> +
> +			op = "parsed probe creation";
> +			if ((fd = openat(parsed_mapping_dir, specfn,
> +					 O_CREAT | O_EXCL | O_WRONLY, 0644)) < 0)
> +				goto err_probefn;
> +
> +			op = "parsed probe hardlink";
> +			if (linkat(parsed_mapping_dir, specfn, fun_dir, prb, 0) < 0)
> +				goto err_parsed_probe;
> +
> +			op = "parsed version write";
> +			if (write_chunk(fd, &version, sizeof(uint64_t)) < 0)
> +				goto err_probe_link;
> +
> +			op = "parsed nprobes write";
> +			_Static_assert(sizeof(accump->parsed->probe.ntp) == sizeof(uint64_t),
> +				       "chunk write assumes parsed.probe.ntp is a uint64_t");

Why not use a regular assert or something else?  Since this is the only use
of static assertions in the entire code base, perhaps use the typical assert
suffices (and we can always do an audit of the code base and introduce static
assertions if we find there are enough uses for it)?

> +                        if (write_chunk(fd, &accump->parsed->probe.ntp,
> +					sizeof(accump->parsed->probe.ntp)) < 0)
> +				goto err_probe_link;
> +			break;
> +		}
> +
> +		/* Tracepoint: add to already-open file.  */
> +		case DIT_TRACEPOINT:
> +			assert(state == DIT_PROBE || state == DIT_TRACEPOINT);
> +			state = accump->parsed->type;
> +
> +			if (write_chunk(fd, accump->parsed, accump->parsed->size) < 0)
> +				goto err_probe_link;
> +			break;
> +
> +		/* Error return from parser.  */
> +		case DIT_ERR:
> +			op = "parsing";
> +			fuse_log(FUSE_LOG_ERR, "dtprobed: parser error: %s\n",
> +				 accump->parsed->err.err);
> +			goto err_provider;
> +
> +		default:
> +			/*
> +			 * This is a parser bug, and a bad one.  Just fail
> +			 * hard.
> +			 */
> +			fuse_log(FUSE_LOG_ERR, "dtprobed: PID %i, %lx/%lx: internal error: corrupt parsed DOF: type %i\n",
> +				 pid, dev, ino, accump->parsed->type);
> +			assert(1);
> +		}
> +
> +		free(specfn);
> +		continue;
> +
> +	err_probe_link:
> +		unlinkat(fun_dir, prb, 0);
> +	err_parsed_probe:
> +		unlinkat(parsed_mapping_dir, specfn, 0);
> +	err_probefn:
> +		unlinkat(mod_dir, fun, AT_REMOVEDIR);
> +	err_probemod:
> +		if (prov_dir >= 0) /* Always true: placate compiler */
> +			unlinkat(prov_dir, mod, AT_REMOVEDIR);
> +	err_provider:
> +		maybe_close(fd);
> +		maybe_close(fun_dir);
> +		maybe_close(mod_dir);
> +		maybe_close(prov_dir);
> +		unlinkat(probepid_dir, provpid_name, AT_REMOVEDIR);
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: PID %i, %lx/%lx: %s failed for parsed DOF write for %s: %s\n",
> +			 pid, dev, ino, op, probe_err, strerror(errno));
> +		free(specfn);
> +		goto err_free;
> +	}
> +
> +	err = maybe_close(prov_dir);
> +	err |= maybe_close(mod_dir);
> +	err |= maybe_close(fun_dir);
> +	err |= maybe_close(fd);
> +
> +        if (err != 0) {
> +		unlinkat(perpid_dof_dir, "parsed", 0);
> +		goto err_free;
> +	}
> +
> +	close(probepid_dir);
> +	close(parsed_mapping_dir);
> +	close(perpid_dof_dir);
> +	close(perpid_dir);
> +	free(provpid_name);

early_out: (see above for explanation)

> +	free(pid_name);
> +	free(dof_name);
> +	return 0;
> +
> +	/* Errors requiring freeing and error return but no removals.  */
> +err_free:
> +	close(probepid_dir);
> +	close(parsed_mapping_dir);
> +	close(perpid_dof_dir);
> +	close(perpid_dir);
> +	close(prov_dir);
> +	free(provpid_name);
> +	free(pid_name);
> +	free(dof_name);
> +	return -1;
> +
> +	/*
> +	 * Errors before parsed-probe creation starts require wholesale
> +	 * removals of stale dirs.
> +	 */
> +
> +remove_probepid_dir_err:
> +	close(probepid_dir);
> +	unlinkat(probes_dir, pid_name, AT_REMOVEDIR);
> +	unlinkat(parsed_mapping_dir, "version", 0);
> +remove_parsed_mapping_err:
> +	close(parsed_mapping_dir);
> +	unlinkat(perpid_dof_dir, "parsed", AT_REMOVEDIR);
> +close_perpid_dof_err:
> +	close(perpid_dof_dir);
> +close_perpid_err:
> +	fuse_log(FUSE_LOG_ERR, "dtprobed: PID %i, %lx/%lx: %s failed for parsed DOF write: %s\n",
> +		 pid, dev, ino, op, strerror(errno));
> +	close(perpid_dir);
> +early_err:
> +	free(pid_name);
> +	free(dof_name);
> +	return -1;
> +}
> +
> +/*
> + * Write out the DOF helper.  A trivial wrapper.
> + */
> +static
> +int dof_stash_write_dh(int dirfd, const dof_helper_t *dh)
> +{
> +	int fd;
> +
> +        if ((fd = openat(dirfd, "dh", O_CREAT | O_EXCL | O_WRONLY, 0644)) < 0)
> +		goto err;
> +
> +	if (write_chunk(fd, dh, sizeof(dof_helper_t *)) < 0) {
> +		close(fd);
> +		goto err;
> +	}
> +
> +        if (close(fd) < 0)
> +		goto err;
> +
> +	return 0;
> +
> + err:
> +	fuse_log(FUSE_LOG_ERR, "dtprobed: cannot write out DOF helper: %s\n",
> +		 strerror(errno));

I would think that this function takes care of the unlink if the writee fails.
Why leave that to the caller, given that this function actually creates the
file?

> +	return -1;
> +}
> +
> +/*
> + * Add a piece of raw DOF from a given (pid, dev, ino) triplet.
> + *
> + * Return the new DOF generation number, or -1 on error.
> + */
> +int
> +dof_stash_add(pid_t pid, dev_t dev, ino_t ino, const dof_helper_t *dh,
> +	      const void *dof, size_t size)
> +{
> +	char *dof_name = make_dof_name(dev, ino);
> +	char *pid_name = make_numeric_name(pid);
> +	int perpid_dir = -1;
> +	int perpid_dof_dir = -1;
> +	int gen_fd = -1;
> +	struct stat gen_stat;
> +	char *gen_name = NULL;
> +	int new_pid = 1;
> +	int new_dof = 0;
> +	int err = -1;
> +
> +	if (!pid_name || !dof_name)
> +		goto out_free;
> +
> +
> +	/* Make the directories. */
> +
> +        perpid_dir = make_state_dirat(pid_dir, pid_name, "PID", 0);
> +
> +        if (perpid_dir < 0)
> +		goto out_free;
> +
> +        perpid_dof_dir = make_state_dirat(perpid_dir, dof_name, "per-pid DOF", 0);
> +
> +        if (perpid_dof_dir < 0)
> +		goto err_unlink_nomsg;
> +
> +        fuse_log(FUSE_LOG_DEBUG, "%i: adding DOF in mapping %lx/%lx, l_addr %lx\n",
> +		 pid, dev, ino, dh->dofhp_addr);
> +
> +	/* Get the current generation counter. */
> +
> +        if ((gen_fd = openat(perpid_dir, "next-gen", O_CREAT | O_WRONLY, 0644)) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot open DOF generation counter for PID "
> +			 "%i: %s\n", pid, strerror(errno));
> +		goto err_unlink_nomsg;
> +	}
> +
> +        if (fstat(gen_fd, &gen_stat) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot stat DOF generation counter for PID "
> +			 "%i: %s\n", pid, strerror(errno));
> +		goto err_unlink_nomsg;
> +	}
> +	if (gen_stat.st_size > 0)
> +		new_pid = 0;
> +
> +	/*
> +	 * Write out the raw DOF, if not already known, and link it into its
> +	 * per-PID dir: write the helper directly out.  dof_stash_write tells us
> +	 * whether the DOF actually needed writing or not (via a straight size
> +	 * comparison, which suffices to spot almost everything other than the
> +	 * pathological case of stale DOF from a killed process whose inode is
> +	 * unlinked and replaced with another identically-numbered inode with
> +	 * DOF of the *exact same length*: this seems too unlikely to be worth
> +	 * bothering with.)
> +	 */
> +	new_dof = 1;
> +	switch (dof_stash_write(dof_dir, dof_name, dof, size)) {
> +	case 0: new_dof = 0; break;
> +	case -1: goto err_unlink_nomsg; break;
> +	}

Why this odd switch statement?  For one, the -1 case doesn't need a break
because it has a goto right before that break.  Couldn't this be simply
written as:

	new_dof = 1;
	if (dof_stash_write(dof_dir, dof_name, dof, size) == -1)
		goto err_unlink_nomsg;
	new_dof = 0;

or even:

	new_dof = dof_stash_write(dof_dir, dof_name, dof, size);
	if (new_dof == -1)
		goto err_unlink_nomsg;

> +
> +        if (linkat(dof_dir, dof_name, perpid_dof_dir, "raw", 0) < 0)
> +		goto err_unlink_msg;
> +
> +	if (dof_stash_write_dh(perpid_dof_dir, dh) < 0)
> +		goto err_unlink_nomsg;
> +
> +	/*
> +	 * Update the generation counter and mark this DOF's generation.  On
> +	 * error after this point we leak generation counter values.
> +	 */
> +	if(ftruncate(gen_fd, gen_stat.st_size + 1) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot update generation counter for PID %i: %s\n",
> +			 pid, strerror(errno));
> +		goto err_unlink_nomsg;
> +	}
> +
> +	gen_name = make_numeric_name(gen_stat.st_size);
> +	if (symlinkat(dof_name, perpid_dir, gen_name) < 0)
> +		goto err_unlink_msg;
> +
> +        err = 0;
> +
> +out_free:
> +        free(dof_name);
> +	free(pid_name);
> +
> +        maybe_close(perpid_dir);
> +        maybe_close(perpid_dof_dir);
> +	maybe_close(gen_fd);
> +
> +        return err == 0 ? gen_stat.st_size : err;
> +
> +err_unlink_msg:
> +	fuse_log(FUSE_LOG_ERR, "Cannot link pids for DOF from PID %i, "
> +		 "DOF mapping %lx/%lx into place: %s\n", pid, dev, ino,
> +		 strerror(errno));
> +err_unlink_nomsg:
> +        unlinkat(perpid_dof_dir, "raw", 0);
> +        unlinkat(perpid_dof_dir, "dh", 0);
> +	unlinkat(perpid_dir, dof_name, AT_REMOVEDIR);
> +
> +	if (gen_name)
> +		unlinkat(perpid_dir, gen_name, 0);
> +
> +	if (new_pid) {
> +		unlinkat(perpid_dir, "next-gen", 0);
> +		unlinkat(pid_dir, pid_name, AT_REMOVEDIR);
> +	}
> +
> +        if (new_dof)
> +		unlinkat(dof_dir, dof_name, 0);
> +
> +	goto out_free;
> +}
> +
> +/*
> + * The guts of unlinkat_many (below).  Returns -1 on error, or positive
> + * ENOTEMPTY if a subdir deletion failed because the dir was nonempty (any
> + * positive return leads to unwinding without doing any more deletions).
> + */
> +static int
> +unlinkat_many_internal(int dirfd, int diroff, const char **dirs)
> +{
> +	int fd;
> +	struct stat s;
> +	const char *err;
> +	int flags = 0;
> +	int ret;
> +
> +	/*
> +	 * Open the dir/file at the next level down, then recurse to delete the
> +	 * next one in first.  This is a directory unless the next element in
> +	 * the list is NULL.
> +	 */
> +
> +	if (dirs[diroff] == NULL)
> +		return 0;
> +
> +	if (dirs[diroff + 1] != NULL)
> +		flags = AT_REMOVEDIR;
> +
> +	if (fstatat(dirfd, dirs[diroff], &s, AT_SYMLINK_NOFOLLOW) < 0) {
> +		if (errno == ENOENT)
> +			return 0;
> +		fuse_log(FUSE_LOG_ERR, "Cannot unlink %s during probe state dir deletion: %s\n",
> +			 dirs[diroff], strerror(errno));
> +		return -1;
> +	}
> +
> +	if (flags == AT_REMOVEDIR && !S_ISDIR(s.st_mode)) {
> +		err = "must be a directory, but is not";
> +		goto err_keep_going;
> +	} else if (flags == 0 && !S_ISREG(s.st_mode)) {
> +		err = "not a regular file";
> +		goto err_keep_going;
> +	}
> +
> +	if ((fd = openat(dirfd, dirs[diroff], O_RDONLY | O_CLOEXEC)) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "Cannot open %s during probe state dir deletion: %s\n",
> +			 dirs[diroff], strerror(errno));
> +		char foo[PATH_MAX];
> +		sprintf(foo, "ls -l /proc/%i/fd", getpid());
> +		system(foo);

Isn't this a potential security concern?  Which 'ls' is going to get executed?
It seems to me that the potential red flags that this could raise do not really
justify the limited benefit of havinng this output collected.  Is this worth
it?

> +		return -1;
> +	}
> +
> +        ret = unlinkat_many_internal(fd, diroff + 1, dirs);
> +	close(fd);
> +
> +	if (ret < 0)
> +		return -1;
> +	else if (ret > 0)
> +		return 0;			/* ENOTEMPTY */
> +
> +	/*
> +	 * Now do the deletion.
> +	 */
> +	if (unlinkat(dirfd, dirs[diroff], flags) < 0) {
> +		if (errno == ENOENT)
> +			return 0;
> +		if (errno == ENOTEMPTY)
> +			return ENOTEMPTY;
> +        }
> +	return 0;
> +
> +err_keep_going:
> +	fuse_log(FUSE_LOG_ERR, "Cannot unlink %s during probe state dir deletion: "
> +		 "%s\n", dirs[diroff], err);
> +	return 0;
> +}
> +
> +/*
> + * Unlink a file and then a bunch of directories, as given by the names
> + * argument (the last entry in which is a file).  Each directory in the dirs
> + * array is a subdirectory of the one before it.  If any directory removal
> + * fails due to non-emptiness, all later removals are skipped.  If any directory
> + * does not exist, removals start from that point.

How do I reconcile this comment with the actual function?  Where is a file
being unlinked?  What names argument?

> + *
> + * The first entry must be a file: the next must be directories.  If this is
> + * violated, just return without unlinking anything (but do log an error).

First entry of what?  Next of what?  Where is this validated?

> + */
> +static int
> +unlinkat_many(int dirfd, const char **dirs)
> +{
> +	return unlinkat_many_internal(dirfd, 0, dirs);
> +}
> +
> +/*
> + * Determine if a file or directory (in the DOF stash) should be deleted.
> + */
> +static int
> +refcount_cleanup_p(int dirfd, const char *name, int isdir)
> +{
> +	struct stat s;
> +
> +	if (fstatat(dirfd, name, &s, 0) != 0) {
> +		fuse_log(FUSE_LOG_ERR, "Cannot stat %s for cleanup: %s\n",
> +			 name, strerror(errno));
> +		return -1;
> +	}
> +
> +	if ((isdir && s.st_nlink != 2) || (!isdir && s.st_nlink != 1))
> +		return 0;
> +
> +	return 1;
> +}
> +
> +
> +/*
> + * Delete a file or directory (in the DOF stash) if it has no other links.
> + */
> +static int
> +refcount_cleanup(int dirfd, const char *name, int isdir)
> +{
> +	switch (refcount_cleanup_p(dirfd, name, isdir)) {
> +	case -1: return -1;
> +	case 0: return 0;
> +	default: break;
> +	}
> +
> +	if (unlinkat(dirfd, name, isdir ? AT_REMOVEDIR : 0) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot remove old DOF %s: %s\n",
> +			 name, strerror(errno));
> +		return -1;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Remove the parsed directory under a given mapping directory.
> + */
> +static int
> +dof_stash_remove_parsed(pid_t pid, int mapping_dir, const char *mapping)
> +{
> +	char *pid_name = make_numeric_name(pid);
> +	int parsed_dir;
> +	DIR *parsed;
> +	struct dirent *ent;
> +
> +        if (((parsed_dir = openat(mapping_dir, "parsed", O_RDONLY | O_CLOEXEC)) < 0) ||
> +	    ((parsed = fdopendir(parsed_dir)) == NULL)) {
> +
> +		if (parsed_dir >= 0)
> +			close(parsed_dir);
> +
> +		if (errno == ENOENT)
> +			return 0;
> +
> +                fuse_log(FUSE_LOG_ERR, "cannot open per-PID parsed DOF directory for PID %i mapping %s for cleanup: %s\n",
> +                         pid, mapping, strerror(errno));
> +		free(pid_name);
> +		return -1;
> +        }
> +
> +	/*
> +	 * Scan the parsed dir, clean up all the matching per-probe
> +	 * dirs, and delete the parsed dir's contents too.
> +	 */
> +
> +        while (errno = 0, (ent = readdir(parsed)) != NULL) {
> +		const char *mod, *fun, *prb;
> +		const char *unlink_err;
> +		char *prov;
> +
> +                if (errno != 0)
> +			goto err;
> +
> +		if (ent->d_type != DT_REG)
> +			continue;
> +
> +		if ((prov = strdup(ent->d_name)) == NULL)
> +			goto err;
> +
> +		if (unlinkat(parsed_dir, ent->d_name, 0) < 0) {
> +			unlink_err = "parsed dir";
> +			goto next;
> +		}
> +
> +                if (strcmp(ent->d_name, "version") == 0)
> +			continue;
> +
> +		if (split_probespec_name(prov, &mod, &fun, &prb) < 0)
> +			continue;
> +
> +		const char *dirnames[] = {pid_name, prov, mod, fun, prb, NULL};
> +		if (unlinkat_many(probes_dir, dirnames) < 0) {
> +			unlink_err = "parsed probes dir entries";
> +			goto next;
> +		}
> +		free(prov);
> +		continue;
> +
> +	  next:
> +		free(prov);
> +                fuse_log(FUSE_LOG_ERR, "cannot unlink per-PID %s %s (mapping %s) for cleanup: %s\n",
> +			 ent->d_name, unlink_err, mapping, strerror(errno));
> +	}
> +	closedir(parsed);
> +	free(pid_name);
> +
> +        return 0;
> +
> +  err:
> +	closedir(parsed);
> +	free(pid_name);
> +	fuse_log(FUSE_LOG_ERR, "cannot open per-PID parsed DOF directory entry for PID %i mapping %s for cleanup: %s\n",
> +		 pid, mapping, strerror(errno));
> +	return -1;
> +}
> +
> +/*
> + * Remove a piece of DOF, identified by generation counter.  Return -1 on error.
> + *
> + * Not knowing about the DOF is not an error.
> + */
> +int
> +dof_stash_remove(pid_t pid, int gen)
> +{
> +	char *pid_name = make_numeric_name(pid);
> +	char *gen_name = make_numeric_name(gen);
> +	char *gen_linkname = NULL;
> +	int perpid_dir = -1;
> +	int perpid_dof_dir = -1;
> +	struct stat pid_stat;
> +	struct stat gen_stat;
> +	int err = -1;
> +	const char *unlink_err = NULL;
> +
> +	/*
> +	 * Figure out the per-PID DOF directory by following the gen-counter
> +	 * symlink (by hand, because we want to reuse its name in the non-
> +	 * PID-specific dof/ subdir too).
> +	 */
> +
> +	if (!pid_name || !gen_name)
> +		goto oom;
> +
> +        if ((perpid_dir = openat(pid_dir, pid_name, O_PATH | O_CLOEXEC)) < 0)
> +		goto out;
> +
> +	if (fstatat(perpid_dir, gen_name, &gen_stat, AT_SYMLINK_NOFOLLOW) < 0)
> +		goto out;
> +
> +	if ((gen_linkname = malloc(gen_stat.st_size + 1)) == NULL)
> +		goto oom;
> +
> +	/*
> +	 * Truncated dir -> concurrent modifications, not allowed.
> +	 */
> +	if (readlinkat(perpid_dir, gen_name, gen_linkname, gen_stat.st_size + 1) ==
> +	    gen_stat.st_size + 1) {
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: internal error: state dir is under concurrent modification, maybe multiple dtprobeds are running\n");

OK, this could happen (but never should).  But isn't the possibility of
concurrent modifications pretty much ignored everywhere else?  If it is a
possible concern, wouldn't it need to be addressed everywhere (and honestly,
in that case there is nothing that can be done I think other than to abort).

> +		goto err;
> +	}
> +	gen_linkname[gen_stat.st_size] = 0;
> +
> +	perpid_dof_dir = openat(perpid_dir, gen_linkname, O_PATH | O_CLOEXEC);
> +
> +        if (strchr(gen_linkname, '/') != NULL ||
> +	    perpid_dof_dir < 0) {
> +                if (perpid_dof_dir >= 0)
> +			close(perpid_dof_dir);
> +                fuse_log(FUSE_LOG_ERR, "dtprobed: internal error: generation link for PID %i generation %i is %s, which is not valid\n",
> +			 pid, gen, gen_linkname);
> +		goto err;
> +	}
> +
> +	if (faccessat(dof_dir, gen_linkname, F_OK, 0) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: internal error: generation link for PID %i generation %i is %s, which has no raw DOF\n",
> +			 pid, gen, gen_linkname);
> +		goto err;
> +	}
> +
> +	/*
> +	 * Unlink the per-PID raw and parsed DOF and the probe dirs, then check
> +	 * the link counts on the non-per-PID stuff; if zero, unlink that too.
> +	 */
> +
> +	fuse_log(FUSE_LOG_DEBUG, "gen_name: %s; gen_linkname: %s; perpid_dof_dir: %i\n", gen_name, gen_linkname, perpid_dof_dir);
> +        if (unlinkat(perpid_dof_dir, "raw", 0) != 0 && errno != ENOENT) {
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot unlink per-PID raw DOF for PID %i generation %i: %s\n",
> +			 pid, gen, strerror(errno));
> +	}
> +
> +	if (dof_stash_remove_parsed(pid, perpid_dof_dir, gen_linkname) < 0)
> +		unlink_err = "parsed probes dir entries";
> +
> +	if (unlinkat(perpid_dof_dir, "parsed", AT_REMOVEDIR) < 0)
> +		unlink_err = "parsed dir";
> +
> +	if (unlinkat(perpid_dof_dir, "dh", 0) < 0)
> +		unlink_err = "dh";
> +
> +	if (unlinkat(perpid_dir, gen_linkname, AT_REMOVEDIR) < 0)
> +		unlink_err = gen_linkname;
> +
> +	if (unlinkat(perpid_dir, gen_name, 0) < 0)
> +		unlink_err = gen_name;
> +
> +	if (refcount_cleanup(dof_dir, gen_linkname, 0) < 0)
> +		unlink_err = gen_linkname;
> +
> +	refcount_cleanup(dof_dir, gen_linkname, 0);
> +
> +	/*
> +	 * Clean up the PID directory itself, if it is now empty.  We can't just
> +	 * use refcount_cleanup here because we also have to delete the
> +	 * generation counter right before deleting the directory.
> +	 */
> +
> +	if (fstatat(pid_dir, pid_name, &pid_stat, 0) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot stat per-PID for pid %i: %s\n",
> +			 pid, strerror(errno));
> +		goto err;
> +	}
> +
> +	if (pid_stat.st_nlink == 2) {
> +		if (unlinkat(perpid_dir, "next-gen", 0) < 0) {
> +			fuse_log(FUSE_LOG_ERR, "dtprobed: cannot clean up gen counter in per-PID dir for PID %i: %s\n",
> +				 pid, strerror(errno));
> +			goto err;
> +		}
> +		refcount_cleanup(pid_dir, pid_name, 1);
> +	}
> +
> +	if (unlink_err)
> +		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot unlink per-PID DOF %s for PID %i generation %i: %s\n",
> +			 unlink_err, pid, gen, strerror(errno));
> +
> + out:
> +	err = 0;
> +
> + err:
> +        free(pid_name);
> +	free(gen_name);
> +	free(gen_linkname);
> +
> +        maybe_close(perpid_dir);
> +        maybe_close(perpid_dof_dir);
> +
> +	return err;
> +
> +oom:
> +	fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory unlinking DOF\n");
> +	goto err;
> +}
> +
> +/*
> + * Prune dead processes out of the DOF stash.  This is not a correctness
> + * operation, just a space-waste reducer. If a process with DOF died uncleanly
> + * and its PID was recycled since a cleanup was last run, we'll leave some DOF
> + * around for a bit, but there are no correctness implications: either the new
> + * process has no DOF at all, in which case we'll never learn about it, or it
> + * does, in which case either it has identical mappings (in which case we'll
> + * just reuse them) or it has different ones (in which case we'll just add to
> + * them): in neither case is there any negative consequence of the nonexistent
> + * mappings other than a bit of space wastage.  (This implies that any process
> + * reading DOF out of the state directory must check that a process that claims
> + * to have DOF at a given mapping actually has a corresponding mapping with the
> + * same dev/ino pair at the right address. There's no need for that process to
> + * clean it up if found to be wrong, though: this function will eventually get
> + * around to it.)
> + */
> +void
> +dof_stash_prune_dead(void)
> +{
> +	DIR *all_pids_dir;
> +	struct dirent *pid_ent;
> +	int tmp;
> +
> +	fuse_log(FUSE_LOG_DEBUG, "Pruning dead PIDs\n");
> +
> +	if ((tmp = dup(pid_dir)) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "cannot dup() for cleanup: %s\n", strerror(errno));
> +		return;
> +	}
> +
> +        if ((all_pids_dir = fdopendir(tmp)) == NULL) {
> +		close(tmp);
> +		fuse_log(FUSE_LOG_ERR, "cannot clean up per-PID DOF directory: %s\n",
> +			 strerror(errno));
> +		return;
> +	}
> +	rewinddir(all_pids_dir);
> +
> +	/*
> +	 * Work over all the PIDs.
> +	 */
> +	while (errno = 0, (pid_ent = readdir(all_pids_dir)) != NULL) {
> +		DIR *perpid_dir;
> +		struct dirent *mapping_ent;
> +		char *end_pid;
> +		pid_t pid;
> +
> +		if (errno != 0)
> +			goto scan_failure;
> +
> +		if (pid_ent->d_type != DT_DIR)
> +			continue;
> +
> +		/*
> +		 * Only directories with numeric names can be PIDs: skip all the
> +		 * rest.
> +		 */
> +		pid = strtol(pid_ent->d_name, &end_pid, 10);
> +		if (*end_pid != '\0')
> +			continue;
> +
> +		if (kill(pid, 0) == 0 || errno != ESRCH)
> +			continue;
> +
> +		/*
> +		 * Process dead: clean up its PIDs, deleting everything in the
> +		 * per-PID directory by calling dof_stash_remove() on every
> +		 * mapping in turn.
> +		 */
> +
> +		if ((tmp = openat(pid_dir, pid_ent->d_name, O_RDONLY | O_CLOEXEC)) < 0) {
> +			fuse_log(FUSE_LOG_ERR,"cannot open per-PID DOF mappings directory for cleanup: %s\n",
> +				 strerror(errno));
> +			goto out;
> +		}
> +
> +		if ((perpid_dir = fdopendir(tmp)) == NULL) {
> +			fuse_log(FUSE_LOG_ERR, "cannot clean up per-PID DOF mappings: %s\n",
> +				 strerror(errno));
> +			close(tmp);
> +			goto out;
> +		}
> +
> +		fuse_log(FUSE_LOG_DEBUG, "Pruning dead PID %s\n", pid_ent->d_name);
> +
> +		/* Work over all the mappings in this PID. */
> +
> +                while (errno = 0, (mapping_ent = readdir(perpid_dir)) != NULL) {
> +			int gen;
> +			char *end_gen;
> +
> +                        if (errno != 0) {
> +				fuse_log(FUSE_LOG_ERR, "cannot read per-PID DOF mappings for cleanup: %s\n",
> +					 strerror(errno));
> +				closedir(perpid_dir);
> +				goto out;
> +			}
> +
> +			if (mapping_ent->d_type != DT_LNK)
> +				continue;
> +
> +			fuse_log(FUSE_LOG_DEBUG, "Working over generation %s\n", mapping_ent->d_name);
> +
> +			gen = strtol(mapping_ent->d_name, &end_gen, 10);
> +			if (*end_gen != '\0')
> +				continue;
> +
> +                        dof_stash_remove(pid, gen);
> +		}
> +		closedir(perpid_dir);
> +		errno = 0;
> +	}
> +
> +	if (errno != 0)
> +		goto scan_failure;
> +
> +out:
> +	closedir(all_pids_dir);
> +	return;
> +
> +scan_failure:
> +	fuse_log(FUSE_LOG_ERR, "cannot scan per-PID DOF directory for cleanup: %s\n",
> +		 strerror(errno));
> +	goto out;
> +}
> +
> +/*
> + * Read a file into a buffer and return it.  If READ_SIZE is non-negative, read
> + * only that many bytes (and silently fail if the file isn't at least that
> + * long).
> + */
> +static void *
> +read_file(int fd, ssize_t read_size, size_t *size)
> +{
> +	struct stat s;
> +	char *buf;
> +	char *bufptr;
> +	int len;
> +
> +        if (fstat(fd, &s) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "cannot stat: %s\n", strerror(errno));
> +		return NULL;
> +	}
> +	if (read_size >= 0) {
> +		if (s.st_size < read_size)
> +			return NULL;
> +	} else
> +		read_size = s.st_size;
> +
> +	if ((buf = malloc(read_size)) == NULL) {
> +		fuse_log(FUSE_LOG_ERR, "out of memory allocating %zi bytes\n",
> +			 read_size);
> +		return NULL;
> +	}
> +
> +	if (size)
> +		*size = read_size;
> +
> +        bufptr = buf;
> +	while ((len = read(fd, bufptr, read_size)) < read_size) {
> +		if (len < 0) {
> +			if (errno != EINTR) {
> +				fuse_log(FUSE_LOG_ERR, "cannot read: %s",
> +					 strerror(errno));
> +				free(buf);
> +				return NULL;
> +			}
> +			continue;
> +		}
> +		read_size -= len;
> +		bufptr += len;
> +	}
> +	return buf;
> +}
> +
> +/*
> + * Reparse all DOF.  Mappings that cannot be reparsed are simply ignored, on the
> + * grounds that most DOF, most of the time, is not used, so this will likely be
> + * ignorable.
> + *
> + * Normally, DOF with the same DOF_VERSION as the currently running daemon is
> + * left alone: if force is set, even that is reparsed.
> + */
> +int
> +reparse_dof(int out, int in,
> +	    int (*reparse)(int pid, int out, int in, dev_t dev, ino_t inum,
> +			   dof_helper_t *dh, const void *in_buf, size_t in_bufsz,
> +			   int reparsing),
> +	    int force)
> +{
> +	DIR *all_pids_dir;
> +	struct dirent *pid_ent;
> +	int err = -1;
> +	int tmp;
> +
> +	if ((tmp = dup(pid_dir)) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot dup(): %s\n", strerror(errno));
> +		return -1;
> +	}
> +
> +        if ((all_pids_dir = fdopendir(tmp)) == NULL) {
> +		close(tmp);
> +		fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot read per-PID DOF directory: %s\n",
> +			 strerror(errno));
> +		return -1;
> +	}
> +
> +	/* Work over all the PIDs. */
> +
> +	rewinddir(all_pids_dir);
> +	while (errno = 0, (pid_ent = readdir(all_pids_dir)) != NULL) {
> +		int perpid_fd;
> +		char *end_pid;
> +		pid_t pid;
> +		DIR *perpid_dir;
> +		struct dirent *mapping_ent;
> +
> +		if (errno != 0)
> +			goto scan_failure;
> +
> +		if (pid_ent->d_type != DT_DIR)
> +			continue;
> +
> +		/*
> +		 * Only directories with numeric names can be PIDs: skip all the
> +		 * rest.
> +		 */
> +		pid = strtol(pid_ent->d_name, &end_pid, 10);
> +		if (*end_pid != '\0')
> +			continue;
> +
> +                if ((perpid_fd = openat(pid_dir, pid_ent->d_name, O_RDONLY | O_CLOEXEC)) < 0) {
> +			fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot open per-PID DOF mappings directory for PID %s: %s\n",
> +				 pid_ent->d_name, strerror(errno));
> +			goto out;
> +		}
> +
> +		if ((tmp = dup(perpid_fd)) < 0) {
> +			fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot dup(): %s\n", strerror(errno));
> +			close(perpid_fd);
> +			goto out;
> +		}
> +
> +		if ((perpid_dir = fdopendir(tmp)) == NULL) {
> +			fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot read per-PID DOF mappings for pid %s: %s\n",
> +				 pid_ent->d_name, strerror(errno));
> +			close(tmp);
> +			close(perpid_fd);
> +			goto out;
> +		}
> +
> +		/* Work over all the mappings in this PID. */
> +
> +                while (errno = 0, (mapping_ent = readdir(perpid_dir)) != NULL) {
> +			int mapping_fd;
> +			int fd;
> +			dev_t dev;
> +			ino_t ino;
> +			size_t dof_size, dh_size;
> +			void *dof = NULL;
> +			void *dh = NULL;
> +
> +                        if (errno != 0) {
> +				fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot read per-PID DOF mappings for pid %s: %s\n",
> +					 pid_ent->d_name, strerror(errno));
> +				goto perpid_err;
> +			}
> +
> +			if (mapping_ent->d_type != DT_DIR)
> +				continue;
> +
> +			/*
> +			 * Hidden files cannot be mappings.  (This also skips
> +			 * . and ..).
> +			 */
> +			if (mapping_ent->d_name[0] == '.')
> +				continue;
> +
> +                        if ((mapping_fd = openat(perpid_fd, mapping_ent->d_name,
> +						 O_PATH | O_CLOEXEC)) < 0) {
> +				fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot open per-PID DOF mapping for PID %s, mapping %s: %s\n",
> +					 pid_ent->d_name, mapping_ent->d_name, strerror(errno));
> +				goto perpid_err;
> +			}
> +
> +			fuse_log(FUSE_LOG_DEBUG, "Parsed DOF is present: evaluating\n");
> +
> +			/*
> +			 * Only DOF with a different parsed representation gets
> +			 * reparsed.
> +			 */
> +			if (!force) {
> +				if ((fd = openat(mapping_fd, "parsed/version",
> +						 O_RDONLY | O_CLOEXEC)) == 0) {
> +					uint64_t *parsed_version;
> +
> +					parsed_version = read_file(fd, sizeof(uint64_t),
> +								   NULL);
> +					close(fd);
> +
> +					if (parsed_version != NULL &&
> +					    *parsed_version == DOF_PARSED_VERSION) {
> +						fuse_log(FUSE_LOG_DEBUG, "No need to reparse\n");
> +						continue;
> +					}
> +
> +					free(parsed_version);
> +				}
> +			}
> +
> +			fuse_log(FUSE_LOG_DEBUG, "Deleting\n");
> +
> +			if (dof_stash_remove_parsed(pid, mapping_fd, mapping_ent->d_name) < 0) {
> +				fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot delete parsed DOF for PID %s, mapping %s: %s\n",
> +					 pid_ent->d_name, mapping_ent->d_name, strerror(errno));
> +				close(mapping_fd);
> +				goto perpid_err;
> +			}
> +
> +			if (split_dof_name(mapping_ent->d_name, &dev, &ino) < 0) {
> +				fuse_log(FUSE_LOG_ERR, "when reparsing DOF for PID %s, cannot derive dev/inum from %s: ignored\n",
> +					 pid_ent->d_name, mapping_ent->d_name);
> +				continue;
> +			}
> +
> +			if ((fd = openat(mapping_fd, "raw", O_RDONLY | O_CLOEXEC)) < 0) {
> +				fuse_log(FUSE_LOG_ERR, "when reparsing DOF, cannot open raw DOF for PID %s, mapping %s: ignored: %s\n",
> +					 pid_ent->d_name, mapping_ent->d_name, strerror(errno));
> +				close(mapping_fd);
> +				continue;
> +			}
> +
> +			if ((dof = read_file(fd, -1, &dof_size)) == NULL) {
> +				fuse_log(FUSE_LOG_ERR, "when reparsing DOF, cannot read raw DOF for PID %s, mapping %s: ignored: %s\n",
> +					    pid_ent->d_name, mapping_ent->d_name, strerror(errno));
> +				close(mapping_fd);
> +				close(fd);
> +				continue;
> +			}
> +			close(fd);
> +
> +			if ((fd = openat(mapping_fd, "dh", O_RDONLY | O_CLOEXEC)) < 0) {
> +				fuse_log(FUSE_LOG_ERR, "when reparsing DOF, cannot open dh for PID %s, mapping %s: ignored: %s\n",
> +					    pid_ent->d_name, mapping_ent->d_name, strerror(errno));
> +				free(dof);
> +				close(mapping_fd);
> +				continue;
> +			}
> +
> +			if ((dh = read_file(fd, -1, &dh_size)) == NULL) {
> +				fuse_log(FUSE_LOG_ERR, "when reparsing DOF, cannot read dh for PID %s, mapping %s: ignored: %s\n",
> +					    pid_ent->d_name, mapping_ent->d_name, strerror(errno));
> +				free(dof);
> +				close(mapping_fd);
> +				close(fd);
> +				continue;
> +			}
> +			close(fd);
> +
> +			fuse_log(FUSE_LOG_DEBUG, "Reparsing DOF for PID %s, mapping %s\n",
> +				 pid_ent->d_name, mapping_ent->d_name);
> +			if (reparse(pid, out, in, dev, ino, dh, dof, dof_size, 1) < 0)
> +				fuse_log(FUSE_LOG_ERR, "when reparsing DOF, cannot parse DOF for PID %s, mapping %s: ignored\n",
> +					    pid_ent->d_name, mapping_ent->d_name);
> +			free(dof);
> +			free(dh);
> +			close(mapping_fd);
> +
> +			continue;
> +
> +		perpid_err:
> +			closedir(perpid_dir);
> +			close(perpid_fd);
> +			goto out;
> +		}
> +		closedir(perpid_dir);
> +		close(perpid_fd);
> +		errno = 0;
> +	}
> +
> +	if (errno != 0)
> +		goto scan_failure;
> +
> +	err = 0;
> +out:
> +	closedir(all_pids_dir);
> +	return err;
> +
> +scan_failure:
> +	fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot scan per-PID DOF directory: %s\n",
> +		 strerror(errno));
> +	goto out;
> +}
> diff --git a/dtprobed/dof_stash.h b/dtprobed/dof_stash.h
> new file mode 100644
> index 000000000000..2754a8180622
> --- /dev/null
> +++ b/dtprobed/dof_stash.h
> @@ -0,0 +1,44 @@
> +/*
> + * Oracle Linux DTrace; DOF storage for later probe removal.
> + * Copyright (c) 2023, Oracle and/or its affiliates. All rights reserved.
> + * Licensed under the Universal Permissive License v 1.0 as shown at
> + * http://oss.oracle.com/licenses/upl.
> + */
> +
> +#ifndef	_DOF_STASH_H
> +#define	_DOF_STASH_H
> +
> +#include <sys/types.h>
> +#include <stdint.h>
> +
> +#include <dt_list.h>
> +#include "dof_parser.h"
> +
> +typedef struct dof_parsed_list {
> +	dt_list_t list;
> +	dof_parsed_t *parsed;
> +} dof_parsed_list_t;
> +
> +int dof_stash_init(const char *statedir);
> +
> +int dof_stash_push_parsed(dt_list_t *accum, dof_parsed_t *parsed);
> +
> +int dof_stash_write_parsed(pid_t pid, dev_t dev, ino_t ino, dt_list_t *accum);
> +
> +void dof_stash_flush(dt_list_t *accum);
> +
> +int dof_stash_add(pid_t pid, dev_t dev, ino_t ino, const dof_helper_t *dh,
> +		  const void *dof, size_t size);
> +
> +int dof_stash_remove(pid_t pid, int gen);
> +
> +void dof_stash_prune_dead(void);
> +
> +int
> +reparse_dof(int out, int in,
> +	    int (*reparse)(int pid, int out, int in, dev_t dev, ino_t inum,
> +			   dof_helper_t *dh, const void *in_buf,
> +			   size_t in_bufsz, int reparsing),
> +	    int force);
> +
> +#endif
> diff --git a/dtprobed/dtprobed.service b/dtprobed/dtprobed.service
> index 95f0a1d6d606..70443f08fb9b 100644
> --- a/dtprobed/dtprobed.service
> +++ b/dtprobed/dtprobed.service
> @@ -20,6 +20,8 @@ ProtectHome=true
>  PrivateDevices=false
>  PrivateNetwork=true
>  ProtectControlGroups=true
> +RuntimeDirectory=/run/dtrace
> +RuntimeDirectoryPreserve=restart
>  
>  [Install]
>  WantedBy=basic.target
> diff --git a/dtrace.spec b/dtrace.spec
> index 777dd28ded84..c1fbd17f4180 100644
> --- a/dtrace.spec
> +++ b/dtrace.spec
> @@ -185,6 +185,10 @@ make DESTDIR=$RPM_BUILD_ROOT VERSION=%{version} \
>       HDRPREFIX="$RPM_BUILD_ROOT/usr/include" \
>       install install-test
>  
> +%if "%{?dist}" == ".el7"
> +sed -i '/^ProtectSystem=/d; /^ProtectControlGroups=/d; /^RuntimeDirectory/d;' $RPM_BUILD_ROOT/usr/lib/systemd/system/dtprobed.service
> +%endif
> +
>  # Because systemtap creates a sdt.h header file we have to rename
>  # ours and then shift theirs out of the way.
>  mv $RPM_BUILD_ROOT/usr/include/sys/sdt.h \
> diff --git a/libcommon/dof_parser.h b/libcommon/dof_parser.h
> index 7ab18f4dee01..41ef5a193b7f 100644
> --- a/libcommon/dof_parser.h
> +++ b/libcommon/dof_parser.h
> @@ -1,6 +1,6 @@
>  /*
>   * Oracle Linux DTrace; DOF parser interface with the outside world
> - * Copyright (c) 2022, Oracle and/or its affiliates. All rights reserved.
> + * Copyright (c) 2022, 2023, Oracle and/or its affiliates. All rights reserved.
>   * Licensed under the Universal Permissive License v 1.0 as shown at
>   * http://oss.oracle.com/licenses/upl.
>   */
> @@ -15,11 +15,10 @@
>  #include <dtrace/helpers.h>
>  
>  /*
> - * Result of DOF probe parsing (currently used only for probe creation. We
> - * receive a provider info structure, followed by N probe info structures each
> - * of which is followed by possibly many tracepoint info structures, all tagged.
> - * Things not useful for probe creation (like args, translated types, etc) are
> - * not returned.
> + * Result of DOF probe parsing.  We receive a provider info structure, followed
> + * by N probe info structures each of which is followed by possibly many
> + * tracepoint info structures, all tagged.  Things not useful for probe creation
> + * (like args, translated types, etc) are not returned.

Well, probe args are actually useful (and needed) for probe creation, as are
translated types and mapping info.  I understand that we currently do not
support them yet, but it is misleading to write in this comment that those are
not needed for probe creation, because we actually do consider them part of the
probe info.

>   *
>   * On error, a DIT_ERR structure is returned with an error message.
>   */
> @@ -31,6 +30,15 @@ typedef enum dof_parsed_info {
>  	DIT_ERR = 3
>  } dof_parsed_info_t;
>  
> +/*
> + * Bump this whenever dof_parsed changes.
> + *
> + * Files consisting of arrays of dof_parsed have a single 64-bit word at the
> + * start which is the version of the dof_parseds within it.  The data flowing
> + * over the stream from the seccomped parser has no such prefix.
> + */
> +#define DOF_PARSED_VERSION 1
> +
>  typedef struct dof_parsed {
>  	/*
>  	 * Size of this instance of this structure.
> -- 
> 2.43.0.272.gce700b77fd
> 
> 
> _______________________________________________
> DTrace-devel mailing list
> DTrace-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/dtrace-devel



More information about the DTrace-devel mailing list