[DTrace-devel] [PATCH 03/23] dtprobed: add the DOF stash
Nick Alcock
nick.alcock at oracle.com
Thu Feb 22 18:38:59 UTC 2024
The DOF stash is a runtime directory (/run/dtrace by default) in which
the dtprobed daemon will keep track of incoming DOF. There's a comment
describing it at the top of dtprobed/dof_stash.c, but in brief we have three
directories under this:
- stash/dof/, which contains raw DOF named by mapping
- stash/dof-pid/, which tracks DOF on a PID-by-PID and mapping-by-mapping
basis. This contains subdirectories $pid/$mapping containing both the same
raw DOF (hardlinked) and parsed DOF split up by probe, in
$pid/$mapping/parsed/$prv:$mod:$fun$:prb.
- probes/, which contains parsed DOF for each individual probe in
/run/dtrace/probes/$pid/$prv$pid/$mod/$fun/$prb to aid consumer globbing
of probes; in future we may introduce further symlinks not in per-pid
directories to ease matching of probes across processes.
Consumers can use inotify watches to detect newly added or removed DOF or
probes: because all the DOF is in one directory, a very small number of
watches is needed to keep track of all the DOF in the system, and only one
watch per PID is needed to track it on a PID-by-PID basis. (Currently,
nothing is doing this, but it is now a possibility where it never was
before.)
In a future commit, dtprobed will emit all the DOF it receives into this
structure, removing it when processes exit, and DTrace will use it to create
and remove probes.
dof_stash.c itself provides functions to add new raw DOF to the stash, to
add a stream of parsed DOF corresponding to a given piece of raw DOF to the
stash, to remove DOF as processes issue DTRACEHIOC_REMOVE requests, to check
for any DOF that needs reparsing to handle changes to struct dof_parsed, and
to remove entries from the DOF stash that is left behind by processes that
died uncleanly without issuing DTRACEHIOC_REMOVEs. (Currently, we don't deal
with newly-started processes that reuse the same PID as such a process, or
processes that leave DOF behind on exec() and then add more: this
improvement will come soon.)
It does not do any DOF parsing: that is left up to the caller (dtprobed.c).
All the writes to the dof_stash are done via the *at() functions to avoid
any risk of bugs causing the file creations and most especially deletions it
carries out from leaking outside /run/dtrace. (We are not concerned about
symlink attacks etc because only root can write to /run/dtrace or any
directories underneath it.)
Signed-off-by: Nick Alcock <nick.alcock at oracle.com>
Reviewed-by: Kris Van Hees <kris.van.hees at oracle.com>
---
dtprobed/Build | 4 +-
dtprobed/dof_stash.c | 1639 ++++++++++++++++++++++++++++++++++++++++
dtprobed/dof_stash.h | 41 +
libcommon/dof_parser.h | 20 +-
4 files changed, 1696 insertions(+), 8 deletions(-)
create mode 100644 dtprobed/dof_stash.c
create mode 100644 dtprobed/dof_stash.h
diff --git a/dtprobed/Build b/dtprobed/Build
index 610e1352..0f5d1c08 100644
--- a/dtprobed/Build
+++ b/dtprobed/Build
@@ -1,5 +1,5 @@
# Oracle Linux DTrace.
-# Copyright (c) 2022, 2023, Oracle and/or its affiliates. All rights reserved.
+# Copyright (c) 2022, 2024, Oracle and/or its affiliates. All rights reserved.
# Licensed under the Universal Permissive License v 1.0 as shown at
# http://oss.oracle.com/licenses/upl.
@@ -26,7 +26,7 @@ dtprobed_CPPFLAGS := -I. -Idtprobed -Ilibproc -Ilibcommon -Ilibport
dtprobed_CFLAGS := $(shell pkg-config --cflags $(LIBFUSE))
dtprobed_LIBS := -lcommon -lproc -lcommon -lport -lelf $(shell pkg-config --libs $(LIBFUSE))
dtprobed_DEPS := libproc.a libcommon.a libport.a
-dtprobed_SOURCES := dtprobed.c seccomp-assistance.c
+dtprobed_SOURCES := dtprobed.c dof_stash.c seccomp-assistance.c
dtprobed_LIBSOURCES := libproc libcommon
ifdef HAVE_LIBSYSTEMD
diff --git a/dtprobed/dof_stash.c b/dtprobed/dof_stash.c
new file mode 100644
index 00000000..7b80256d
--- /dev/null
+++ b/dtprobed/dof_stash.c
@@ -0,0 +1,1639 @@
+/*
+ * Oracle Linux DTrace; DOF state storage management.
+ * Copyright (c) 2024, Oracle and/or its affiliates. All rights reserved.
+ * Licensed under the Universal Permissive License v 1.0 as shown at
+ * http://oss.oracle.com/licenses/upl.
+ *
+ * The DOF stash is the principal state preservation mechanism used by dtprobed.
+ * Keeping DOF on disk keeps memory management complexity low, makes it
+ * possible for dtprobed to track DOF across daemon restarts and upgrades,
+ * and gives the rest of dtrace a place to acquire the DOF contributed by
+ * processes on the system for USDT probe operations.
+ *
+ * The state storage structure is as follows:
+ *
+ * /run/dtrace/stash/: Things private to dtprobed.
+ *
+ * .../dof/$dev-$ino: DOF contributed by particular mappings, in raw form
+ * (as received from some probe-containing program).
+ *
+ * .../dof-pid/$pid/$dev-$ino/: contains everything relating to DOF
+ * contributed by a particular USDT-containing ELF object within a given
+ * process (henceforth "DOF source"). Each DTRACEHIOC_ADDDOF ioctl call
+ * creates one of these.
+ *
+ * .../dof-pid/$pid/$gen: symlink to a single $dev-$ino dir. "$gen" is an
+ * incrementing generation counter starting at zero for every unique PID,
+ * used to uniquely identify DOF pieces to remove.
+ *
+ * .../dof-pid/$pid/next-gen: The next generation counter to use for a new
+ * piece of DOF from this process. Encoded as a sparse file whose size is
+ * the next counter value to use.
+ *
+ * .../dof-pid/$pid/$dev-$ino/raw: hardlink to the DOF for a given DOF
+ * source. Pruned of dead processes at startup and on occasion: entries also
+ * deleted on receipt of DTRACEHIOC_REMOVE ioctls. A hardlink is used in
+ * order to bump the link count for the corresponding DOF in the dof/
+ * directory: when this link count falls to 1, the DOF is considered dead and
+ * the corresponding probe is removed.
+ *
+ * .../dof-pid/$pid/$dev-$ino/dh: Raw form of the dof_helper_t received from
+ * a given DTRACEHIOC_ADDDOF, serialized straight to disk with no changes.
+ * Preserved for all DTRACEHIOC_ADDDOFs, even if the DOF was already
+ * contributed by some other process. Keeping this around allows the DOF to
+ * be re-parsed as needed, even by a later re-execution of a different
+ * version of the same daemon with a different dof_parsed_t structure, and to
+ * be re-parsed with appropriate addresses for whichever mapping of this DOF
+ * is under consideration.
+ *
+ * .../dof-pid/$pid/$dev-$ino/parsed/: Directory containing parsed DOF for a
+ * given DOF source. Deleted when empty.
+ *
+ * .../dof-pid/$pid/$dev-$ino/parsed/$prv:$mod:$fun:$prb: parsed DOF for a
+ * given PID/mapping/probe triplet. Generated from the helper for PIDs that
+ * are the first appearance of a given piece of DOF. Deleted at daemon
+ * startup if the dof_parsed_t structure changed. The file consists of a
+ * uint64_t version number (DOF_PARSED_VERSION), then a stream of
+ * dof_parsed_t records, the first of type DIT_PROVIDER, the second
+ * DIT_PROBE, then as many struct dof_parsed_t's of type DIT_TRACEPOINT as
+ * the DIT_PROBE record specifies.
+ *
+ * /run/dtrace/probes/: Per-probe info, written by dtprobed, read by DTrace.
+ *
+ * .../$pid/$prv$pid/$mod/$fun/$prb: Hardlink from $prv$pid:$mod:$fun:$prb
+ * above; parsed representation of one probe in a given process. Removed by
+ * dtprobed when the process dies, or if all mappings containing the probe
+ * are unmmapped. Used by DTrace for tracing by PID.
+ *
+ * (If a probe exists in multiple different mappings in a given process, they
+ * must all be identical.)
+ *
+ * In future, symlinks in /run/dtrace/probes/$prv$pid may exist, used for
+ * globbing provider names across processes.
+ */
+
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <assert.h>
+#include "dof_stash.h"
+
+#include <dt_list.h>
+
+#ifdef HAVE_FUSE_LOG
+#include <fuse_log.h>
+#else
+#include "rpl_fuse_log.h"
+#endif
+
+#include <port.h>
+
+static int dof_dir, pid_dir, probes_dir;
+
+/*
+ * Like close(), but only try to close positively-numbered fds.
+ */
+static int
+maybe_close(int fd)
+{
+ if (fd >= 0)
+ return close(fd);
+ return 0;
+}
+
+/*
+ * Make a runtime state directory and return its fd.
+ */
+static int
+make_state_dir(const char *dir)
+{
+ int fd;
+
+ if (mkdir(dir, 0755) < 0 && errno != EEXIST) {
+ fuse_log(FUSE_LOG_ERR, "dtprobed: "
+ "cannot create state directory %s: %s\n",
+ dir, strerror(errno));
+ return -1;
+ }
+
+ if ((fd = open(dir, O_PATH | O_CLOEXEC)) < 0) {
+ fuse_log(FUSE_LOG_ERR, "dtprobed: "
+ "cannot open state directory %s: %s\n",
+ dir, strerror(errno));
+ return -1;
+ }
+ return fd;
+}
+
+/*
+ * Make a state directory under some other directory and return an fd to it.
+ *
+ * If "readable" is set, the fd is opened with O_RDONLY rather than O_PATH.
+ */
+static int
+make_state_dirat(int dirfd, const char *name, const char *errmsg, int readable)
+{
+ int fd;
+ int flags = O_PATH;
+
+ if (readable)
+ flags = O_RDONLY;
+
+ if (mkdirat(dirfd, name, 0755) < 0) {
+ if (errno == EEXIST) {
+ struct stat s;
+
+ if (fstatat(dirfd, name, &s, 0) == 0 &&
+ S_ISDIR(s.st_mode))
+ goto open_fd;
+ errno = EEXIST;
+ }
+ fuse_log(FUSE_LOG_ERR, "dtprobed: cannot create %s dir named %s: %s\n",
+ errmsg, name, strerror(errno));
+ return -1;
+ }
+
+open_fd:
+ if ((fd = openat(dirfd, name, flags | O_CLOEXEC)) < 0) {
+ fuse_log(FUSE_LOG_ERR, "dtprobed: cannot open %s dir named %s: %s\n",
+ errmsg, name, strerror(errno));
+ return -1;
+ }
+ return fd;
+}
+
+/*
+ * Initialize the DOF-stashing machinery, if needed.
+ */
+int
+dof_stash_init(const char *statedir)
+{
+ int root;
+ int stash;
+
+ if (statedir == NULL)
+ statedir="/run/dtrace";
+
+ if ((root = make_state_dir(statedir)) < 0)
+ return -1;
+
+ if ((stash = make_state_dirat(root, "stash", "statedir", 0)) < 0)
+ return -1;
+
+ if ((dof_dir = make_state_dirat(stash, "dof", "DOF", 0)) < 0)
+ return -1;
+ if ((pid_dir = make_state_dirat(stash, "dof-pid", "per-pid DOF", 1)) < 0)
+ return -1;
+
+ if ((probes_dir = make_state_dirat(root, "probes", "probesdir", 1)) < 0)
+ return -1;
+
+ close(root);
+ close(stash);
+
+ return 0;
+}
+
+/*
+ * Compose the provider+pid name of the per-PID instance of a provider from
+ * pieces.
+ */
+static char *
+make_provpid_name(const char *prov, pid_t pid)
+{
+ char *ret;
+
+ if (asprintf(&ret, "%s%li", prov, (long) pid) < 0) {
+ fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory making provider name\n");
+ return NULL;
+ }
+ return ret;
+}
+
+/*
+ * Compose a full probespec's name from pieces.
+ */
+static char *
+make_probespec_name(const char *prov, const char *mod, const char *fn,
+ const char *prb)
+{
+ char *ret;
+
+ if (asprintf(&ret, "%s:%s:%s:%s", prov, mod, fn, prb) < 0) {
+ fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory making probespec\n");
+ return NULL;
+ }
+ return ret;
+}
+
+/*
+ * Split the pieces back up again. The input spec becomes the provider name.
+ * The pieces' lifetime is no longer than that of the input string.
+ */
+static int
+split_probespec_name(char *spec, const char **mod, const char **fn,
+ const char **prb)
+{
+ char *tmp;
+
+ tmp = strchr(spec, ':');
+ if (tmp == NULL)
+ return -1;
+ *tmp = 0;
+ *mod = tmp + 1;
+
+ tmp = strchr(*mod, ':');
+ if (tmp == NULL)
+ return -1;
+ *tmp = 0;
+ *fn = tmp + 1;
+
+ tmp = strchr(*fn, ':');
+ if (tmp == NULL)
+ return -1;
+ *tmp = 0;
+ *prb = tmp + 1;
+
+ return 0;
+}
+
+/*
+ * Compose a DOF directory's name from pieces.
+ */
+static char *
+make_dof_name(dev_t dev, ino_t ino)
+{
+ char *ret;
+
+ if (asprintf(&ret, "%li-%li", (long) dev, (long) ino) < 0) {
+ fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory making DOF name\n");
+ return NULL;
+ }
+ return ret;
+}
+
+/*
+ * Split the pieces back up again.
+ */
+static int
+split_dof_name(const char *dof_name, dev_t *dev, ino_t *ino)
+{
+ if (sscanf(dof_name, "%li-%li", (long *) dev, (long *) ino) < 2) {
+ fuse_log(FUSE_LOG_ERR, "dtprobed: per-pid directory name %s unparseable\n",
+ dof_name);
+ return -1;
+ }
+ return 0;
+}
+
+/*
+ * A trivial wrapper.
+ */
+static char *
+make_numeric_name(long num)
+{
+ char *ret;
+
+ if (asprintf(&ret, "%li", num) < 0) {
+ fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory making filename\n");
+ return NULL;
+ }
+ return ret;
+}
+
+
+/*
+ * Push a piece of parsed DOF into a list to be emitted by
+ * dof_stash_write_parsed().
+ */
+int
+dof_stash_push_parsed(dt_list_t *accum, dof_parsed_t *parsed)
+{
+ dof_parsed_list_t *entry;
+
+ if ((entry = calloc(1, sizeof(dof_parsed_list_t))) == NULL)
+ return -1;
+
+ entry->parsed = parsed;
+ dt_list_append(accum, entry);
+
+ return 0;
+}
+
+/*
+ * Free all the accumulated parsed DOF in the passed list.
+ */
+void
+dof_stash_free(dt_list_t *accum)
+{
+ dof_parsed_list_t *accump;
+ dof_parsed_list_t *last_accump = NULL;
+
+ for (accump = dt_list_next(accum); accump != NULL;
+ accump = dt_list_next(accump)) {
+ dt_list_delete(accum, accump);
+ free(accump->parsed);
+ free(last_accump);
+ last_accump = accump;
+ }
+ free(last_accump);
+}
+
+/*
+ * Write a chunk of data to an fd.
+ */
+static int
+write_chunk(int fd, const void *buf, size_t size)
+{
+ char *bufptr = (char *) buf;
+
+ while (size > 0) {
+ size_t len;
+
+ if ((len = write(fd, bufptr, size)) < 0 && errno != EINTR)
+ return -1;
+
+ size -= len;
+ bufptr += len;
+ }
+ return 0;
+}
+
+/*
+ * Write out a piece of raw DOF. Returns the length of the file written,
+ * or 0 if none was needed (or -1 on error).
+ */
+static int
+dof_stash_write_raw(int dirfd, const char *name, const void *buf, size_t size)
+{
+ struct stat s;
+ int fd;
+
+ /*
+ * Sanity check: if the DOF already exists but is not the same size as
+ * the DOF we already have, complain, and replace it. If it does exist,
+ * there's no need to write it out.
+ *
+ * If we can't even unlink it or write it out, we give up -- the stash
+ * has failed and we won't be able to do anything it implies.
+ *
+ * (This is only a crude check -- obviously distinct raw DOF could be
+ * the same size by pure chance.)
+ */
+ if (fstatat(dirfd, name, &s, 0) == 0) {
+ if (s.st_size == size)
+ return 0;
+
+ fuse_log(FUSE_LOG_ERR, "dtprobed: DOF %s already exists, "
+ "but is %zx bytes long, not %zx: replacing\n",
+ name, s.st_size, size);
+ if (unlinkat(dirfd, name, 0) < 0) {
+ fuse_log(FUSE_LOG_ERR, "dtprobed: cannot remove old DOF %s: %s\n",
+ name, strerror(errno));
+ return -1;
+ }
+ }
+
+ if ((fd = openat(dirfd, name, O_CREAT | O_EXCL | O_WRONLY, 0644)) < 0) {
+ fuse_log(FUSE_LOG_ERR, "dtprobed: cannot write out raw DOF: %s\n",
+ strerror(errno));
+ return -1;
+ }
+
+ if (write_chunk(fd, buf, size) < 0)
+ goto err;
+
+ if (close(fd) < 0)
+ goto err;
+ return size + sizeof(uint64_t);
+
+err:
+ fuse_log(FUSE_LOG_ERR, "dtprobed: cannot write out DOF: %s\n",
+ strerror(errno));
+ unlinkat(dirfd, name, 0);
+ close(fd);
+ return -1;
+}
+
+/*
+ * Write out all the parsed DOF from the passed-in accumulator. Split out from
+ * dof_stash_add() because we don't want to create anything in the DOF stash for
+ * a given piece of DOF until we have verified that all the parsed data for all
+ * the probes in that DOF is valid.
+ *
+ * Each parsed file is a reduced parser stream (see the comments at the top of
+ * this file), starting with a single 64-bit word giving the version of
+ * dof_parsed_t in use, to ensure that readers never use an incompatible
+ * version.
+ *
+ * We also write the parsed version out to the parsed/version file in the parsed
+ * mapping dir, to make it easier to determine whether to reparse all the DOF
+ * for outdatedness.
+ */
+int
+dof_stash_write_parsed(pid_t pid, dev_t dev, ino_t ino, dt_list_t *accum)
+{
+ char *pid_name = make_numeric_name(pid);
+ char *dof_name = make_dof_name(dev, ino);
+ char *provpid_name = NULL;
+ int perpid_dir, perpid_dof_dir, probepid_dir, parsed_mapping_dir,
+ prov_dir = -1, mod_dir = -1, fun_dir = -1;
+ dof_parsed_list_t *accump;
+ const dof_parsed_t *this_provider = NULL;
+ int version_fd, parsed_fd = -1;
+ const char *op;
+ const char *probe_err = "unknown probe";
+ uint64_t version = DOF_PARSED_VERSION;
+ int state = -1;
+ int err;
+
+ if (dt_list_next(accum) == NULL)
+ goto early_out;
+
+ if ((perpid_dir = openat(pid_dir, pid_name, O_PATH | O_CLOEXEC)) < 0) {
+ fuse_log(FUSE_LOG_ERR, "dtprobed: PID %i, %lx/%lx: open failed for PID dir for parsed DOF write: %s\n",
+ pid, dev, ino, strerror(errno));
+ goto early_err;
+ }
+
+ op = "per-mapping open";
+ if ((perpid_dof_dir = openat(perpid_dir, dof_name, O_PATH | O_CLOEXEC)) < 0)
+ goto close_perpid_err;
+
+ op = "per-mapping parsed DOF mkdirat";
+ if ((parsed_mapping_dir = make_state_dirat(perpid_dof_dir, "parsed", op, 0)) < 0)
+ goto close_perpid_dof_err;
+
+ op = "probe PID mkdirat";
+ if ((probepid_dir = make_state_dirat(probes_dir, pid_name, op, 0)) < 0)
+ goto remove_parsed_mapping_err;
+
+ if ((version_fd = openat(parsed_mapping_dir, "version",
+ O_CREAT | O_EXCL | O_TRUNC | O_WRONLY, 0644)) < 0) {
+ /*
+ * This creation failing indicates that this mapping already
+ * exists, i.e. it exists multiple times within the same
+ * process. ELF constructors will be run repeatedly
+ * in such cases, but we're only interested in the first.
+ *
+ * Note: in the future we may want to track this so that you
+ * need to do two DTRACEHIOC_REMOVEs to delete such things.)
+ */
+ if (errno == EEXIST) {
+ close(probepid_dir);
+ close(parsed_mapping_dir);
+ close(perpid_dof_dir);
+ close(perpid_dir);
+ free(pid_name);
+ free(dof_name);
+ return 0;
+ }
+ goto remove_probepid_dir_err;
+ }
+
+ if (write_chunk(version_fd, &version, sizeof(uint64_t)) < 0) {
+ close(version_fd);
+ goto remove_probepid_dir_err;
+ }
+ close(version_fd);
+
+ /*
+ * We can just use assert() to check for overruns here, because any
+ * overruns are not corrupted input but a seriously buggy parser.
+ *
+ * On error, from this point on, we remove as little as possible on
+ * error, just a single probe's worth: better some probes survive an
+ * error condition than none do.
+ */
+ for (accump = dt_list_next(accum); accump != NULL;
+ accump = dt_list_next(accump)) {
+ char *mod, *fun, *prb;
+ char *parsedfn = NULL;
+
+ switch (accump->parsed->type) {
+ /*
+ * Provider: make new provider dir.
+ */
+ case DIT_PROVIDER: {
+ state = accump->parsed->type;
+
+ free(provpid_name);
+
+ op = "provider name creation";
+ this_provider = accump->parsed;
+ provpid_name = make_provpid_name(this_provider->provider.name, pid);
+ if (!provpid_name)
+ goto err_provider;
+
+ probe_err = this_provider->provider.name;
+
+ op = "parsed provider close";
+ if (maybe_close(prov_dir) < 0)
+ goto err_provider;
+
+ op = "probe provider mkdirat";
+ if ((prov_dir = make_state_dirat(probepid_dir, provpid_name,
+ op, 1)) < 0)
+ goto err_provider;
+ break;
+ }
+ /*
+ * Probe: make (mod, fun, probe) dirs as needed: close
+ * out file, open new one, make hardlink to it.
+ */
+ case DIT_PROBE: {
+ assert(state == DIT_PROVIDER || state == DIT_PROBE ||
+ state == DIT_TRACEPOINT);
+ state = accump->parsed->type;
+
+ /*
+ * This rather weird pattern carries out all closes and
+ * then errors if any of them fail.
+ */
+ op = "parsed probe close";
+ err = maybe_close(mod_dir);
+ err |= maybe_close(fun_dir);
+ err |= maybe_close(parsed_fd);
+
+ if (err != 0)
+ goto err_provider;
+
+ /*
+ * Skip probes with zero tracepoints entirely.
+ */
+ if (accump->parsed->probe.ntp == 0)
+ break;
+
+ mod = accump->parsed->probe.name;
+ assert(accump->parsed->size > (mod - (char *) accump->parsed));
+ fun = mod + strlen(mod) + 1;
+ assert(accump->parsed->size > (fun - (char *) accump->parsed));
+ prb = fun + strlen(fun) + 1;
+ assert(accump->parsed->size >= (prb - (char *) accump->parsed));
+
+ op = "probe name construction";
+ probe_err = this_provider->provider.name;
+ if ((parsedfn = make_probespec_name(this_provider->provider.name,
+ mod, fun, prb)) == NULL)
+ goto err_provider;
+
+ /*
+ * Ban "." and ".." as name components. Obviously names
+ * containing dots are commonplace (shared libraries,
+ * for instance), but allowing straight . and .. would
+ * have obviously horrible consequences. They can't be
+ * filenames anyway, and you can't create them with
+ * dtrace -h because they aren't valid C identifier
+ * names.
+ */
+ op = "probe name validation";
+ probe_err = parsedfn;
+
+ if (strcmp(parsedfn, ".") == 0 ||
+ strcmp(parsedfn, "..") == 0)
+ goto err_provider;
+
+ op = "probe module";
+
+ if ((mod_dir = make_state_dirat(prov_dir, mod, op, 0)) < 0)
+ goto err_provider;
+
+ op = "probe fun";
+ if ((fun_dir = make_state_dirat(mod_dir, fun, op, 0)) < 0)
+ goto err_probemod;
+
+ op = "parsed probe creation";
+ if ((parsed_fd = openat(parsed_mapping_dir, parsedfn,
+ O_CREAT | O_EXCL | O_WRONLY, 0644)) < 0)
+ goto err_probefn;
+
+ op = "parsed probe hardlink";
+ errno = 0;
+ if (linkat(parsed_mapping_dir, parsedfn, fun_dir, prb, 0) < 0
+ && errno != EEXIST)
+ goto err_parsed_probe;
+
+ op = "parsed version write";
+ if (write_chunk(parsed_fd, &version, sizeof(uint64_t)) < 0)
+ goto err_probe_link;
+
+ op = "parsed provider write";
+ if (write_chunk(parsed_fd, this_provider,
+ this_provider->size) < 0)
+ goto err_probe_link;
+
+ op = "parsed probe write";
+ if (write_chunk(parsed_fd, accump->parsed,
+ accump->parsed->size) < 0)
+ goto err_probe_link;
+
+ break;
+ }
+
+ /* Tracepoint: add to already-open file. */
+ case DIT_TRACEPOINT:
+ assert(state == DIT_PROBE || state == DIT_TRACEPOINT);
+ state = accump->parsed->type;
+
+ if (write_chunk(parsed_fd, accump->parsed, accump->parsed->size) < 0)
+ goto err_probe_link;
+ break;
+
+ /* Error return from parser. */
+ case DIT_ERR:
+ op = "parsing";
+ fuse_log(FUSE_LOG_ERR, "dtprobed: parser error: %s\n",
+ accump->parsed->err.err);
+ goto err_provider_close;
+
+ default:
+ /*
+ * This is a parser bug, and a bad one. Just fail
+ * hard.
+ */
+ fuse_log(FUSE_LOG_ERR, "dtprobed: PID %i, %lx/%lx: internal error: corrupt parsed DOF: type %i\n",
+ pid, dev, ino, accump->parsed->type);
+ assert(1);
+ }
+
+ free(parsedfn);
+ continue;
+
+ err_probe_link:
+ unlinkat(fun_dir, prb, 0);
+ err_parsed_probe:
+ unlinkat(parsed_mapping_dir, parsedfn, 0);
+ err_probefn:
+ unlinkat(mod_dir, fun, AT_REMOVEDIR);
+ err_probemod:
+ if (prov_dir >= 0) /* Always true: placate compiler */
+ unlinkat(prov_dir, mod, AT_REMOVEDIR);
+ err_provider_close:
+ maybe_close(parsed_fd);
+ maybe_close(fun_dir);
+ maybe_close(mod_dir);
+ err_provider:
+ maybe_close(prov_dir);
+ unlinkat(probepid_dir, provpid_name, AT_REMOVEDIR);
+ fuse_log(FUSE_LOG_ERR, "dtprobed: PID %i, %lx/%lx: %s failed for parsed DOF write for %s: %s\n",
+ pid, dev, ino, op, probe_err, strerror(errno));
+ free(parsedfn);
+ goto err_free;
+ }
+
+ err = maybe_close(prov_dir);
+ err |= maybe_close(mod_dir);
+ err |= maybe_close(fun_dir);
+ err |= maybe_close(parsed_fd);
+
+ if (err != 0) {
+ unlinkat(perpid_dof_dir, "parsed", 0);
+ goto err_free;
+ }
+
+ close(probepid_dir);
+ close(parsed_mapping_dir);
+ close(perpid_dof_dir);
+ close(perpid_dir);
+ free(provpid_name);
+
+early_out:
+ free(pid_name);
+ free(dof_name);
+ return 0;
+
+ /* Errors requiring freeing and error return but no removals. */
+err_free:
+ close(probepid_dir);
+ close(parsed_mapping_dir);
+ close(perpid_dof_dir);
+ close(perpid_dir);
+ close(prov_dir);
+ free(provpid_name);
+ free(pid_name);
+ free(dof_name);
+ return -1;
+
+ /*
+ * Errors before parsed-probe creation starts require wholesale
+ * removals of stale dirs.
+ */
+
+remove_probepid_dir_err:
+ close(probepid_dir);
+ unlinkat(probes_dir, pid_name, AT_REMOVEDIR);
+ unlinkat(parsed_mapping_dir, "version", 0);
+remove_parsed_mapping_err:
+ close(parsed_mapping_dir);
+ unlinkat(perpid_dof_dir, "parsed", AT_REMOVEDIR);
+close_perpid_dof_err:
+ close(perpid_dof_dir);
+close_perpid_err:
+ fuse_log(FUSE_LOG_ERR, "dtprobed: PID %i, %lx/%lx: %s failed for parsed DOF write: %s\n",
+ pid, dev, ino, op, strerror(errno));
+ close(perpid_dir);
+early_err:
+ free(pid_name);
+ free(dof_name);
+ return -1;
+}
+
+/*
+ * Write out the DOF helper. A trivial wrapper.
+ */
+static
+int dof_stash_write_dh(int dirfd, const dof_helper_t *dh)
+{
+ int fd;
+
+ if ((fd = openat(dirfd, "dh", O_CREAT | O_EXCL | O_WRONLY, 0644)) < 0)
+ goto err;
+
+ if (write_chunk(fd, dh, sizeof(dof_helper_t *)) < 0) {
+ close(fd);
+ goto err;
+ }
+
+ if (close(fd) < 0)
+ goto err;
+
+ return 0;
+
+err:
+ unlinkat(dirfd, "dh", 0);
+ fuse_log(FUSE_LOG_ERR, "dtprobed: cannot write out DOF helper: %s\n",
+ strerror(errno));
+ return -1;
+}
+
+/*
+ * Add a piece of raw DOF from a given (pid, dev, ino) triplet.
+ *
+ * Return the new DOF generation number, or -1 on error.
+ */
+int
+dof_stash_add(pid_t pid, dev_t dev, ino_t ino, const dof_helper_t *dh,
+ const void *dof, size_t size)
+{
+ char *dof_name = make_dof_name(dev, ino);
+ char *pid_name = make_numeric_name(pid);
+ int perpid_dir = -1;
+ int perpid_dof_dir = -1;
+ int gen_fd = -1;
+ struct stat gen_stat;
+ char *gen_name = NULL;
+ int new_pid = 1;
+ int new_dof = 0;
+ int err = -1;
+
+ if (!pid_name || !dof_name)
+ goto out_free;
+
+
+ /* Make the directories. */
+
+ perpid_dir = make_state_dirat(pid_dir, pid_name, "PID", 0);
+
+ if (perpid_dir < 0)
+ goto out_free;
+
+ perpid_dof_dir = make_state_dirat(perpid_dir, dof_name, "per-pid DOF", 0);
+
+ if (perpid_dof_dir < 0)
+ goto err_unlink_nomsg;
+
+ fuse_log(FUSE_LOG_DEBUG, "%i: adding DOF in mapping %lx/%lx, l_addr %lx\n",
+ pid, dev, ino, dh->dofhp_addr);
+
+ /* Get the current generation counter. */
+
+ if ((gen_fd = openat(perpid_dir, "next-gen", O_CREAT | O_WRONLY, 0644)) < 0) {
+ fuse_log(FUSE_LOG_ERR, "dtprobed: cannot open DOF generation counter for PID "
+ "%i: %s\n", pid, strerror(errno));
+ goto err_unlink_nomsg;
+ }
+
+ if (fstat(gen_fd, &gen_stat) < 0) {
+ fuse_log(FUSE_LOG_ERR, "dtprobed: cannot stat DOF generation counter for PID "
+ "%i: %s\n", pid, strerror(errno));
+ goto err_unlink_nomsg;
+ }
+ if (gen_stat.st_size > 0)
+ new_pid = 0;
+
+ /*
+ * Write out the raw DOF, if not already known, and link it into its
+ * per-PID dir: write the helper directly out. dof_stash_write_raw
+ * tells us whether the DOF actually needed writing or not (via a
+ * straight size comparison, which suffices to spot almost everything
+ * other than the pathological case of stale DOF from a killed process
+ * whose inode is unlinked and replaced with another identically-
+ * numbered inode with DOF of the *exact same length*: this seems too
+ * unlikely to be worth bothering with.)
+ *
+ * dof_stash_write_raw() returns 0 if no DOF needed writing because it
+ * already existed, -1 if there was an error, and a positive size
+ * otherwise.
+ */
+ new_dof = 1;
+ switch (dof_stash_write_raw(dof_dir, dof_name, dof, size)) {
+ case 0: new_dof = 0; break;
+ case -1: goto err_unlink_nomsg; break;
+ }
+
+ if (linkat(dof_dir, dof_name, perpid_dof_dir, "raw", 0) < 0)
+ goto err_unlink_msg;
+
+ if (dof_stash_write_dh(perpid_dof_dir, dh) < 0)
+ goto err_unlink_nomsg;
+
+ /*
+ * Update the generation counter and mark this DOF's generation. On
+ * error after this point we leak generation counter values.
+ */
+ if(ftruncate(gen_fd, gen_stat.st_size + 1) < 0) {
+ fuse_log(FUSE_LOG_ERR, "dtprobed: cannot update generation counter for PID %i: %s\n",
+ pid, strerror(errno));
+ goto err_unlink_nomsg;
+ }
+
+ gen_name = make_numeric_name(gen_stat.st_size);
+ if (symlinkat(dof_name, perpid_dir, gen_name) < 0)
+ goto err_unlink_msg;
+
+ err = 0;
+
+out_free:
+ free(dof_name);
+ free(pid_name);
+
+ maybe_close(perpid_dir);
+ maybe_close(perpid_dof_dir);
+ maybe_close(gen_fd);
+
+ return err == 0 ? gen_stat.st_size : err;
+
+err_unlink_msg:
+ fuse_log(FUSE_LOG_ERR, "Cannot link pids for DOF from PID %i, "
+ "DOF mapping %lx/%lx into place: %s\n", pid, dev, ino,
+ strerror(errno));
+err_unlink_nomsg:
+ unlinkat(perpid_dof_dir, "raw", 0);
+ unlinkat(perpid_dir, dof_name, AT_REMOVEDIR);
+
+ if (gen_name)
+ unlinkat(perpid_dir, gen_name, 0);
+
+ if (new_pid) {
+ unlinkat(perpid_dir, "next-gen", 0);
+ unlinkat(pid_dir, pid_name, AT_REMOVEDIR);
+ }
+
+ if (new_dof)
+ unlinkat(dof_dir, dof_name, 0);
+
+ goto out_free;
+}
+
+/*
+ * The guts of unlinkat_many (below). Returns -1 on error, or positive
+ * ENOTEMPTY if a subdir deletion failed because the dir was nonempty (any
+ * positive return leads to unwinding without doing any more deletions).
+ */
+static int
+unlinkat_many_internal(int dirfd, int diroff, const char **names)
+{
+ int fd;
+ struct stat s;
+ int flags = 0;
+ int ret;
+
+ /*
+ * Open the dir/file at the next level down, then recurse to delete the
+ * next one in first. This is a directory unless the next element in
+ * the list is NULL.
+ */
+
+ if (names[diroff] == NULL)
+ return 0;
+
+ if (fstatat(dirfd, names[diroff], &s, AT_SYMLINK_NOFOLLOW) < 0) {
+ if (errno == ENOENT)
+ return 0;
+ fuse_log(FUSE_LOG_ERR, "Cannot unlink %s during probe state dir deletion: %s\n",
+ names[diroff], strerror(errno));
+ return -1;
+ }
+
+ if (S_ISDIR(s.st_mode))
+ flags = AT_REMOVEDIR;
+
+ if ((fd = openat(dirfd, names[diroff], O_RDONLY | O_CLOEXEC)) < 0) {
+ fuse_log(FUSE_LOG_ERR, "Cannot open %s during probe state dir deletion: %s\n",
+ names[diroff], strerror(errno));
+ return -1;
+ }
+
+ ret = unlinkat_many_internal(fd, diroff + 1, names);
+ close(fd);
+
+ if (ret < 0)
+ return -1;
+ else if (ret > 0)
+ return 0; /* ENOTEMPTY */
+
+ /*
+ * Now do the deletion.
+ */
+ if (unlinkat(dirfd, names[diroff], flags) < 0) {
+ if (errno == ENOENT)
+ return 0;
+ if (errno == ENOTEMPTY)
+ return ENOTEMPTY;
+ }
+ return 0;
+}
+
+/*
+ * Act like rm -rd, removing a file and some number of empty directories above
+ * it, as given by the names argument: the last element may be a directory or a
+ * file. Each directory in the names array is a subdirectory of the one before
+ * it. If any directory removal fails due to non-emptiness, all later removals
+ * are skipped. If any directory does not exist, removals start from that
+ * point.
+ */
+static int
+unlinkat_many(int dirfd, const char **names)
+{
+ return unlinkat_many_internal(dirfd, 0, names);
+}
+
+/*
+ * Determine if a file or directory (in the DOF stash) should be deleted.
+ */
+static int
+refcount_cleanup_p(int dirfd, const char *name, int isdir)
+{
+ struct stat s;
+
+ if (fstatat(dirfd, name, &s, 0) != 0) {
+ fuse_log(FUSE_LOG_ERR, "Cannot stat %s for cleanup: %s\n",
+ name, strerror(errno));
+ return -1;
+ }
+
+ if ((isdir && s.st_nlink != 2) || (!isdir && s.st_nlink != 1))
+ return 0;
+
+ return 1;
+}
+
+
+/*
+ * Delete a file or directory (in the DOF stash) if it has no other links.
+ */
+static int
+refcount_cleanup(int dirfd, const char *name, int isdir)
+{
+ switch (refcount_cleanup_p(dirfd, name, isdir)) {
+ case -1: return -1;
+ case 0: return 0;
+ default: break;
+ }
+
+ if (unlinkat(dirfd, name, isdir ? AT_REMOVEDIR : 0) < 0) {
+ fuse_log(FUSE_LOG_ERR, "dtprobed: cannot remove old DOF %s: %s\n",
+ name, strerror(errno));
+ return -1;
+ }
+ return 0;
+}
+
+/*
+ * Remove the parsed directory under a given mapping directory.
+ */
+static int
+dof_stash_remove_parsed(pid_t pid, int mapping_dir, const char *mapping)
+{
+ char *pid_name = make_numeric_name(pid);
+ int parsed_dir;
+ DIR *parsed;
+ struct dirent *ent;
+
+ if (((parsed_dir = openat(mapping_dir, "parsed", O_RDONLY | O_CLOEXEC)) < 0) ||
+ ((parsed = fdopendir(parsed_dir)) == NULL)) {
+
+ if (parsed_dir >= 0)
+ close(parsed_dir);
+
+ if (errno == ENOENT)
+ return 0;
+
+ fuse_log(FUSE_LOG_ERR, "cannot open per-PID parsed DOF directory for PID %i mapping %s for cleanup: %s\n",
+ pid, mapping, strerror(errno));
+ free(pid_name);
+ return -1;
+ }
+
+ /*
+ * Scan the parsed dir, clean up all the matching per-probe
+ * dirs, and delete the parsed dir's contents too.
+ */
+
+ while (errno = 0, (ent = readdir(parsed)) != NULL) {
+ const char *mod, *fun, *prb;
+ const char *unlink_err;
+ char *prov = NULL, *provpid = NULL;
+
+ if (errno != 0)
+ goto err;
+
+ if (ent->d_type != DT_REG)
+ continue;
+
+ if ((prov = strdup(ent->d_name)) == NULL)
+ goto err;
+
+ if (unlinkat(parsed_dir, ent->d_name, 0) < 0) {
+ unlink_err = "parsed dir";
+ goto next;
+ }
+
+ if (strcmp(ent->d_name, "version") == 0)
+ continue;
+
+ if (split_probespec_name(prov, &mod, &fun, &prb) < 0)
+ continue;
+
+ provpid = make_provpid_name(prov, pid);
+ if (!provpid) {
+ free(prov);
+ goto err;
+ }
+
+ const char *dirnames[] = {pid_name, provpid, mod, fun, prb, NULL};
+ if (unlinkat_many(probes_dir, dirnames) < 0) {
+ unlink_err = "parsed probes dir entries";
+ goto next;
+ }
+ free(prov);
+ free(provpid);
+ continue;
+
+ next:
+ free(prov);
+ free(provpid);
+ fuse_log(FUSE_LOG_ERR, "cannot unlink per-PID %s %s (mapping %s) for cleanup: %s\n",
+ ent->d_name, unlink_err, mapping, strerror(errno));
+ }
+ closedir(parsed);
+ free(pid_name);
+
+ return 0;
+
+err:
+ closedir(parsed);
+ free(pid_name);
+ fuse_log(FUSE_LOG_ERR, "cannot open per-PID parsed DOF directory entry for PID %i mapping %s for cleanup: %s\n",
+ pid, mapping, strerror(errno));
+ return -1;
+}
+
+/*
+ * Remove a piece of DOF, identified by generation counter. Return -1 on error.
+ *
+ * Not knowing about the DOF is not an error.
+ */
+int
+dof_stash_remove(pid_t pid, int gen)
+{
+ char *pid_name = make_numeric_name(pid);
+ char *gen_name = make_numeric_name(gen);
+ char *gen_linkname = NULL;
+ int perpid_dir = -1;
+ int perpid_dof_dir = -1;
+ struct stat pid_stat;
+ struct stat gen_stat;
+ int err = -1;
+ const char *unlink_err = NULL;
+
+ /*
+ * Figure out the per-PID DOF directory by following the gen-counter
+ * symlink (by hand, because we want to reuse its name in the non-
+ * PID-specific dof/ subdir too).
+ */
+
+ if (!pid_name || !gen_name)
+ goto oom;
+
+ if ((perpid_dir = openat(pid_dir, pid_name, O_PATH | O_CLOEXEC)) < 0)
+ goto out;
+
+ if (fstatat(perpid_dir, gen_name, &gen_stat, AT_SYMLINK_NOFOLLOW) < 0)
+ goto out;
+
+ if ((gen_linkname = malloc(gen_stat.st_size + 1)) == NULL)
+ goto oom;
+
+ /*
+ * Truncated dir -> concurrent modifications, not allowed.
+ */
+ if (readlinkat(perpid_dir, gen_name, gen_linkname, gen_stat.st_size + 1) ==
+ gen_stat.st_size + 1) {
+ fuse_log(FUSE_LOG_ERR, "dtprobed: internal error: state dir is under concurrent modification, maybe multiple dtprobeds are running\n");
+ exit(1);
+ }
+ gen_linkname[gen_stat.st_size] = 0;
+
+ perpid_dof_dir = openat(perpid_dir, gen_linkname, O_PATH | O_CLOEXEC);
+
+ if (strchr(gen_linkname, '/') != NULL ||
+ perpid_dof_dir < 0) {
+ if (perpid_dof_dir >= 0)
+ close(perpid_dof_dir);
+ fuse_log(FUSE_LOG_ERR, "dtprobed: internal error: generation link for PID %i generation %i is %s, which is not valid\n",
+ pid, gen, gen_linkname);
+ goto err;
+ }
+
+ if (faccessat(dof_dir, gen_linkname, F_OK, 0) < 0) {
+ fuse_log(FUSE_LOG_ERR, "dtprobed: internal error: generation link for PID %i generation %i is %s, which has no raw DOF\n",
+ pid, gen, gen_linkname);
+ goto err;
+ }
+
+ /*
+ * Unlink the per-PID raw and parsed DOF and the probe dirs, then check
+ * the link counts on the non-per-PID stuff; if zero, unlink that too.
+ */
+
+ fuse_log(FUSE_LOG_DEBUG, "gen_name: %s; gen_linkname: %s; perpid_dof_dir: %i\n", gen_name, gen_linkname, perpid_dof_dir);
+ if (unlinkat(perpid_dof_dir, "raw", 0) != 0 && errno != ENOENT) {
+ fuse_log(FUSE_LOG_ERR, "dtprobed: cannot unlink per-PID raw DOF for PID %i generation %i: %s\n",
+ pid, gen, strerror(errno));
+ }
+
+ if (dof_stash_remove_parsed(pid, perpid_dof_dir, gen_linkname) < 0)
+ unlink_err = "parsed probes dir entries";
+
+ if (unlinkat(perpid_dof_dir, "parsed", AT_REMOVEDIR) < 0)
+ unlink_err = "parsed dir";
+
+ if (unlinkat(perpid_dof_dir, "dh", 0) < 0)
+ unlink_err = "dh";
+
+ if (unlinkat(perpid_dir, gen_linkname, AT_REMOVEDIR) < 0)
+ unlink_err = gen_linkname;
+
+ if (unlinkat(perpid_dir, gen_name, 0) < 0)
+ unlink_err = gen_name;
+
+ if (refcount_cleanup(dof_dir, gen_linkname, 0) < 0)
+ unlink_err = gen_linkname;
+
+ refcount_cleanup(dof_dir, gen_linkname, 0);
+
+ /*
+ * Clean up the PID directory itself, if it is now empty. We can't just
+ * use refcount_cleanup here because we also have to delete the
+ * generation counter right before deleting the directory.
+ */
+
+ if (fstatat(pid_dir, pid_name, &pid_stat, 0) < 0) {
+ fuse_log(FUSE_LOG_ERR, "dtprobed: cannot stat per-PID for pid %i: %s\n",
+ pid, strerror(errno));
+ goto err;
+ }
+
+ if (pid_stat.st_nlink == 2) {
+ if (unlinkat(perpid_dir, "next-gen", 0) < 0) {
+ fuse_log(FUSE_LOG_ERR, "dtprobed: cannot clean up gen counter in per-PID dir for PID %i: %s\n",
+ pid, strerror(errno));
+ goto err;
+ }
+ refcount_cleanup(pid_dir, pid_name, 1);
+ }
+
+ if (unlink_err)
+ fuse_log(FUSE_LOG_ERR, "dtprobed: cannot unlink per-PID DOF %s for PID %i generation %i: %s\n",
+ unlink_err, pid, gen, strerror(errno));
+
+out:
+ err = 0;
+
+err:
+ free(pid_name);
+ free(gen_name);
+ free(gen_linkname);
+
+ maybe_close(perpid_dir);
+ maybe_close(perpid_dof_dir);
+
+ return err;
+
+oom:
+ fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory unlinking DOF\n");
+ goto err;
+}
+
+/*
+ * Prune dead processes out of the DOF stash. This is not a correctness
+ * operation, just a space-waste reducer. If a process with DOF died uncleanly
+ * and its PID was recycled since a cleanup was last run, we'll leave some DOF
+ * around for a bit, but there are no correctness implications: either the new
+ * process has no DOF at all, in which case we'll never learn about it, or it
+ * does, in which case either it has identical mappings (in which case we'll
+ * just reuse them) or it has different ones (in which case we'll just add to
+ * them): in neither case is there any negative consequence of the nonexistent
+ * mappings other than a bit of space wastage. (This implies that any process
+ * reading DOF out of the state directory must check that a process that claims
+ * to have DOF at a given mapping actually has a corresponding mapping with the
+ * same dev/ino pair at the right address. There's no need for that process to
+ * clean it up if found to be wrong, though: this function will eventually get
+ * around to it.)
+ */
+void
+dof_stash_prune_dead(void)
+{
+ DIR *all_pids_dir;
+ struct dirent *pid_ent;
+ int tmp;
+
+ fuse_log(FUSE_LOG_DEBUG, "Pruning dead PIDs\n");
+
+ if ((tmp = dup(pid_dir)) < 0) {
+ fuse_log(FUSE_LOG_ERR, "cannot dup() for cleanup: %s\n", strerror(errno));
+ return;
+ }
+
+ if ((all_pids_dir = fdopendir(tmp)) == NULL) {
+ close(tmp);
+ fuse_log(FUSE_LOG_ERR, "cannot clean up per-PID DOF directory: %s\n",
+ strerror(errno));
+ return;
+ }
+ rewinddir(all_pids_dir);
+
+ /*
+ * Work over all the PIDs.
+ */
+ while (errno = 0, (pid_ent = readdir(all_pids_dir)) != NULL) {
+ DIR *perpid_dir;
+ struct dirent *mapping_ent;
+ char *end_pid;
+ pid_t pid;
+
+ if (errno != 0)
+ goto scan_failure;
+
+ if (pid_ent->d_type != DT_DIR)
+ continue;
+
+ /*
+ * Only directories with numeric names can be PIDs: skip all the
+ * rest.
+ */
+ pid = strtol(pid_ent->d_name, &end_pid, 10);
+ if (*end_pid != '\0')
+ continue;
+
+ if (kill(pid, 0) == 0 || errno != ESRCH)
+ continue;
+
+ /*
+ * Process dead: clean up its PIDs, deleting everything in the
+ * per-PID directory by calling dof_stash_remove() on every
+ * mapping in turn.
+ */
+
+ if ((tmp = openat(pid_dir, pid_ent->d_name, O_RDONLY | O_CLOEXEC)) < 0) {
+ fuse_log(FUSE_LOG_ERR,"cannot open per-PID DOF mappings directory for cleanup: %s\n",
+ strerror(errno));
+ goto out;
+ }
+
+ if ((perpid_dir = fdopendir(tmp)) == NULL) {
+ fuse_log(FUSE_LOG_ERR, "cannot clean up per-PID DOF mappings: %s\n",
+ strerror(errno));
+ close(tmp);
+ goto out;
+ }
+
+ fuse_log(FUSE_LOG_DEBUG, "Pruning dead PID %s\n", pid_ent->d_name);
+
+ /* Work over all the mappings in this PID. */
+
+ while (errno = 0, (mapping_ent = readdir(perpid_dir)) != NULL) {
+ int gen;
+ char *end_gen;
+
+ if (errno != 0) {
+ fuse_log(FUSE_LOG_ERR, "cannot read per-PID DOF mappings for cleanup: %s\n",
+ strerror(errno));
+ closedir(perpid_dir);
+ goto out;
+ }
+
+ if (mapping_ent->d_type != DT_LNK)
+ continue;
+
+ fuse_log(FUSE_LOG_DEBUG, "Working over generation %s\n", mapping_ent->d_name);
+
+ gen = strtol(mapping_ent->d_name, &end_gen, 10);
+ if (*end_gen != '\0')
+ continue;
+
+ dof_stash_remove(pid, gen);
+ }
+ closedir(perpid_dir);
+ errno = 0;
+ }
+
+ if (errno != 0)
+ goto scan_failure;
+
+out:
+ closedir(all_pids_dir);
+ return;
+
+scan_failure:
+ fuse_log(FUSE_LOG_ERR, "cannot scan per-PID DOF directory for cleanup: %s\n",
+ strerror(errno));
+ goto out;
+}
+
+/*
+ * Read a file into a buffer and return it. If READ_SIZE is non-negative, read
+ * only that many bytes (and silently fail if the file isn't at least that
+ * long).
+ */
+static void *
+read_file(int fd, ssize_t read_size, size_t *size)
+{
+ struct stat s;
+ char *buf;
+ char *bufptr;
+ int len;
+
+ if (fstat(fd, &s) < 0) {
+ fuse_log(FUSE_LOG_ERR, "cannot stat: %s\n", strerror(errno));
+ return NULL;
+ }
+ if (read_size >= 0) {
+ if (s.st_size < read_size)
+ return NULL;
+ } else
+ read_size = s.st_size;
+
+ if ((buf = malloc(read_size)) == NULL) {
+ fuse_log(FUSE_LOG_ERR, "out of memory allocating %zi bytes\n",
+ read_size);
+ return NULL;
+ }
+
+ if (size)
+ *size = read_size;
+
+ bufptr = buf;
+ while ((len = read(fd, bufptr, read_size)) < read_size) {
+ if (len < 0) {
+ if (errno != EINTR) {
+ fuse_log(FUSE_LOG_ERR, "cannot read: %s",
+ strerror(errno));
+ free(buf);
+ return NULL;
+ }
+ continue;
+ }
+ read_size -= len;
+ bufptr += len;
+ }
+ return buf;
+}
+
+/*
+ * Reparse all DOF. Mappings that cannot be reparsed are simply ignored, on the
+ * grounds that most DOF, most of the time, is not used, so this will likely be
+ * ignorable.
+ *
+ * Normally, DOF with the same DOF_VERSION as the currently running daemon is
+ * left alone: if force is set, even that is reparsed.
+ */
+int
+reparse_dof(int out, int in,
+ int (*reparse)(int pid, int out, int in, dev_t dev, ino_t inum,
+ dof_helper_t *dh, const void *in_buf, size_t in_bufsz,
+ int reparsing),
+ int force)
+{
+ DIR *all_pids_dir;
+ struct dirent *pid_ent;
+ int err = -1;
+ int tmp;
+
+ if ((tmp = dup(pid_dir)) < 0) {
+ fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot dup(): %s\n", strerror(errno));
+ return -1;
+ }
+
+ if ((all_pids_dir = fdopendir(tmp)) == NULL) {
+ close(tmp);
+ fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot read per-PID DOF directory: %s\n",
+ strerror(errno));
+ return -1;
+ }
+
+ /* Work over all the PIDs. */
+
+ rewinddir(all_pids_dir);
+ while (errno = 0, (pid_ent = readdir(all_pids_dir)) != NULL) {
+ int perpid_fd;
+ char *end_pid;
+ pid_t pid;
+ DIR *perpid_dir;
+ struct dirent *mapping_ent;
+
+ if (errno != 0)
+ goto scan_failure;
+
+ if (pid_ent->d_type != DT_DIR)
+ continue;
+
+ /*
+ * Only directories with numeric names can be PIDs: skip all the
+ * rest.
+ */
+ pid = strtol(pid_ent->d_name, &end_pid, 10);
+ if (*end_pid != '\0')
+ continue;
+
+ if ((perpid_fd = openat(pid_dir, pid_ent->d_name, O_RDONLY | O_CLOEXEC)) < 0) {
+ fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot open per-PID DOF mappings directory for PID %s: %s\n",
+ pid_ent->d_name, strerror(errno));
+ goto out;
+ }
+
+ if ((tmp = dup(perpid_fd)) < 0) {
+ fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot dup(): %s\n", strerror(errno));
+ close(perpid_fd);
+ goto out;
+ }
+
+ if ((perpid_dir = fdopendir(tmp)) == NULL) {
+ fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot read per-PID DOF mappings for pid %s: %s\n",
+ pid_ent->d_name, strerror(errno));
+ close(tmp);
+ close(perpid_fd);
+ goto out;
+ }
+
+ /* Work over all the mappings in this PID. */
+
+ while (errno = 0, (mapping_ent = readdir(perpid_dir)) != NULL) {
+ int mapping_fd;
+ int fd;
+ dev_t dev;
+ ino_t ino;
+ size_t dof_size, dh_size;
+ void *dof = NULL;
+ void *dh = NULL;
+
+ if (errno != 0) {
+ fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot read per-PID DOF mappings for pid %s: %s\n",
+ pid_ent->d_name, strerror(errno));
+ goto perpid_err;
+ }
+
+ if (mapping_ent->d_type != DT_DIR)
+ continue;
+
+ /*
+ * Hidden files cannot be mappings. (This also skips
+ * . and ..).
+ */
+ if (mapping_ent->d_name[0] == '.')
+ continue;
+
+ if ((mapping_fd = openat(perpid_fd, mapping_ent->d_name,
+ O_PATH | O_CLOEXEC)) < 0) {
+ fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot open per-PID DOF mapping for PID %s, mapping %s: %s\n",
+ pid_ent->d_name, mapping_ent->d_name, strerror(errno));
+ goto perpid_err;
+ }
+
+ fuse_log(FUSE_LOG_DEBUG, "Parsed DOF is present: evaluating\n");
+
+ /*
+ * Only DOF with a different parsed representation gets
+ * reparsed.
+ */
+ if (!force) {
+ if ((fd = openat(mapping_fd, "parsed/version",
+ O_RDONLY | O_CLOEXEC)) == 0) {
+ uint64_t *parsed_version;
+
+ parsed_version = read_file(fd, sizeof(uint64_t),
+ NULL);
+ close(fd);
+
+ if (parsed_version != NULL &&
+ *parsed_version == DOF_PARSED_VERSION) {
+ fuse_log(FUSE_LOG_DEBUG, "No need to reparse\n");
+ continue;
+ }
+
+ free(parsed_version);
+ }
+ }
+
+ fuse_log(FUSE_LOG_DEBUG, "Deleting\n");
+
+ if (dof_stash_remove_parsed(pid, mapping_fd, mapping_ent->d_name) < 0) {
+ fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot delete parsed DOF for PID %s, mapping %s: %s\n",
+ pid_ent->d_name, mapping_ent->d_name, strerror(errno));
+ close(mapping_fd);
+ goto perpid_err;
+ }
+
+ if (split_dof_name(mapping_ent->d_name, &dev, &ino) < 0) {
+ fuse_log(FUSE_LOG_ERR, "when reparsing DOF for PID %s, cannot derive dev/inum from %s: ignored\n",
+ pid_ent->d_name, mapping_ent->d_name);
+ continue;
+ }
+
+ if ((fd = openat(mapping_fd, "raw", O_RDONLY | O_CLOEXEC)) < 0) {
+ fuse_log(FUSE_LOG_ERR, "when reparsing DOF, cannot open raw DOF for PID %s, mapping %s: ignored: %s\n",
+ pid_ent->d_name, mapping_ent->d_name, strerror(errno));
+ close(mapping_fd);
+ continue;
+ }
+
+ if ((dof = read_file(fd, -1, &dof_size)) == NULL) {
+ fuse_log(FUSE_LOG_ERR, "when reparsing DOF, cannot read raw DOF for PID %s, mapping %s: ignored: %s\n",
+ pid_ent->d_name, mapping_ent->d_name, strerror(errno));
+ close(mapping_fd);
+ close(fd);
+ continue;
+ }
+ close(fd);
+
+ if ((fd = openat(mapping_fd, "dh", O_RDONLY | O_CLOEXEC)) < 0) {
+ fuse_log(FUSE_LOG_ERR, "when reparsing DOF, cannot open dh for PID %s, mapping %s: ignored: %s\n",
+ pid_ent->d_name, mapping_ent->d_name, strerror(errno));
+ free(dof);
+ close(mapping_fd);
+ continue;
+ }
+
+ if ((dh = read_file(fd, -1, &dh_size)) == NULL) {
+ fuse_log(FUSE_LOG_ERR, "when reparsing DOF, cannot read dh for PID %s, mapping %s: ignored: %s\n",
+ pid_ent->d_name, mapping_ent->d_name, strerror(errno));
+ free(dof);
+ close(mapping_fd);
+ close(fd);
+ continue;
+ }
+ close(fd);
+
+ fuse_log(FUSE_LOG_DEBUG, "Reparsing DOF for PID %s, mapping %s\n",
+ pid_ent->d_name, mapping_ent->d_name);
+ if (reparse(pid, out, in, dev, ino, dh, dof, dof_size, 1) < 0)
+ fuse_log(FUSE_LOG_ERR, "when reparsing DOF, cannot parse DOF for PID %s, mapping %s: ignored\n",
+ pid_ent->d_name, mapping_ent->d_name);
+ free(dof);
+ free(dh);
+ close(mapping_fd);
+
+ continue;
+
+ perpid_err:
+ closedir(perpid_dir);
+ close(perpid_fd);
+ goto out;
+ }
+ closedir(perpid_dir);
+ close(perpid_fd);
+ errno = 0;
+ }
+
+ if (errno != 0)
+ goto scan_failure;
+
+ err = 0;
+out:
+ closedir(all_pids_dir);
+ return err;
+
+scan_failure:
+ fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot scan per-PID DOF directory: %s\n",
+ strerror(errno));
+ goto out;
+}
diff --git a/dtprobed/dof_stash.h b/dtprobed/dof_stash.h
new file mode 100644
index 00000000..f18c3c2e
--- /dev/null
+++ b/dtprobed/dof_stash.h
@@ -0,0 +1,41 @@
+/*
+ * Oracle Linux DTrace; DOF storage for later probe removal.
+ * Copyright (c) 2024, Oracle and/or its affiliates. All rights reserved.
+ * Licensed under the Universal Permissive License v 1.0 as shown at
+ * http://oss.oracle.com/licenses/upl.
+ */
+
+#ifndef _DOF_STASH_H
+#define _DOF_STASH_H
+
+#include <sys/types.h>
+#include <stdint.h>
+
+#include <dt_list.h>
+#include "dof_parser.h"
+
+typedef struct dof_parsed_list {
+ dt_list_t list;
+ dof_parsed_t *parsed;
+} dof_parsed_list_t;
+
+int dof_stash_init(const char *statedir);
+
+int dof_stash_push_parsed(dt_list_t *accum, dof_parsed_t *parsed);
+int dof_stash_write_parsed(pid_t pid, dev_t dev, ino_t ino, dt_list_t *accum);
+void dof_stash_free(dt_list_t *accum);
+
+int dof_stash_add(pid_t pid, dev_t dev, ino_t ino, const dof_helper_t *dh,
+ const void *dof, size_t size);
+int dof_stash_remove(pid_t pid, int gen);
+
+void dof_stash_prune_dead(void);
+
+int
+reparse_dof(int out, int in,
+ int (*reparse)(int pid, int out, int in, dev_t dev, ino_t inum,
+ dof_helper_t *dh, const void *in_buf,
+ size_t in_bufsz, int reparsing),
+ int force);
+
+#endif
diff --git a/libcommon/dof_parser.h b/libcommon/dof_parser.h
index 7ab18f4d..33739993 100644
--- a/libcommon/dof_parser.h
+++ b/libcommon/dof_parser.h
@@ -1,6 +1,6 @@
/*
* Oracle Linux DTrace; DOF parser interface with the outside world
- * Copyright (c) 2022, Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2022, 2024, Oracle and/or its affiliates. All rights reserved.
* Licensed under the Universal Permissive License v 1.0 as shown at
* http://oss.oracle.com/licenses/upl.
*/
@@ -15,11 +15,10 @@
#include <dtrace/helpers.h>
/*
- * Result of DOF probe parsing (currently used only for probe creation. We
- * receive a provider info structure, followed by N probe info structures each
- * of which is followed by possibly many tracepoint info structures, all tagged.
- * Things not useful for probe creation (like args, translated types, etc) are
- * not returned.
+ * Result of DOF probe parsing. We receive a provider info structure, followed
+ * by N probe info structures each of which is followed by possibly many
+ * tracepoint info structures, all tagged. Things not yet used for probe
+ * creation (like args, translated types, etc) are not returned.
*
* On error, a DIT_ERR structure is returned with an error message.
*/
@@ -31,6 +30,15 @@ typedef enum dof_parsed_info {
DIT_ERR = 3
} dof_parsed_info_t;
+/*
+ * Bump this whenever dof_parsed changes.
+ *
+ * Files consisting of arrays of dof_parsed have a single 64-bit word at the
+ * start which is the version of the dof_parseds within it. The data flowing
+ * over the stream from the seccomped parser has no such prefix.
+ */
+#define DOF_PARSED_VERSION 1
+
typedef struct dof_parsed {
/*
* Size of this instance of this structure.
--
2.42.0
More information about the DTrace-devel
mailing list