[DTrace-devel] [PATCH v3 06/21] dtprobed: add the DOF stash

Kris Van Hees kris.van.hees at oracle.com
Wed Feb 14 05:44:02 UTC 2024


On Mon, Feb 12, 2024 at 10:21:35AM -0500, Kris Van Hees via DTrace-devel wrote:
> Initial comments (mostly on the comments, that are pretty crucial for people
> to understand what is going on here).
> 
> On Tue, Jan 16, 2024 at 09:13:02PM +0000, Nick Alcock via DTrace-devel wrote:
> > The DOF stash is a runtime directory (/run/dtrace/stash by default) in which
> > the dtprobed daemon keeps track of incoming DOF.  There's a comment
> > describing it at the top of dtprobed/dof_stash.c, but in brief we have two
> > directories under /run/dtrace/stash:
> > 
> > - dof/, which contains raw DOF named by mapping
> > - dof-pid/, which tracks DOF on a PID-by-PID and mapping-by-mapping basis.
> >   This contains subdirectories $pid/$mapping containing both the same raw
> >   DOF (hardlinked) and parsed DOF split up by probe, in
> >   $pid/$mapping/parsed/$spec.
> > 
> > To aid consumer globbing of probes we also put the DOF for each individual
> > probe in /run/dtrace/probes/$pid/$prv$pid/$mod/$fun/$prb; in future we may
> > introduce further symlinks not in per-pid directories to ease matching of
> > probes across processes.
> 
> Mention here whether the stuff under probes is raw or parsed DOF (or both)?
> 
> > Consumers can use inotify watches to detect newly added or removed DOF or
> > probes: because all the DOF is in one directory, a very small number of
> > watches is needed to keep track of all the DOF in the system, and only one
> > watch per PID is needed to track it on a PID-by-PID basis. (Currently,
> > nothing is doing this, but it is now a possibility where it never was
> > before.)
> > 
> > In a future commit, dtprobed will emit all the DOF it receives into this
> > structure, removing it when processes exit, and DTrace will use it to create
> > and remove probes.
> > 
> > dof_stash.c itself provides functions to add new raw DOF to the stash, to
> > add a stream of parsed DOF corresponding to a given piece of raw DOF to the
> > stash, to remove DOF as processes issue DTRACEHIOC_REMOVE requests, to check
> > for any DOF that needs reparsing to handle changes to struct dof_parsed, and
> > to remove entries from the DOF stash that is left behind by processes that
> > died uncleanly without issuing DTRACEHIOC_REMOVEs. (Currently, we don't deal
> > with newly-started processes that reuse the same PID as such a process, or
> > processes that leave DOF behind on exec() and then add more: this
> > improvement will come soon.)
> > 
> > It does not do any DOF parsing: that is left up to the caller (dtprobed.c).
> > 
> > All the writes to the dof_stash are done via the *at() functions to avoid
> > any risk of bugs causing the file creations and most especially deletions it
> > carries out from leaking outside /run/dtrace.  (We are not concerned about
> > symlink attacks etc because only root can write to /run/dtrace or any
> > directories underneath it.)
> > 
> > Signed-off-by: Nick Alcock <nick.alcock at oracle.com>
> > ---
> >  dtprobed/Build            |    2 +-
> >  dtprobed/dof_stash.c      | 1621 +++++++++++++++++++++++++++++++++++++
> >  dtprobed/dof_stash.h      |   44 +
> >  dtprobed/dtprobed.service |    2 +
> >  dtrace.spec               |    4 +
> >  libcommon/dof_parser.h    |   20 +-
> >  6 files changed, 1686 insertions(+), 7 deletions(-)
> >  create mode 100644 dtprobed/dof_stash.c
> >  create mode 100644 dtprobed/dof_stash.h
> > 
> > diff --git a/dtprobed/Build b/dtprobed/Build
> > index c44dd40ab3d2..f1e0dfff456f 100644
> > --- a/dtprobed/Build
> > +++ b/dtprobed/Build
> > @@ -26,7 +26,7 @@ dtprobed_CPPFLAGS := -I. -Idtprobed -Ilibproc -Ilibcommon -Ilibport
> >  dtprobed_CFLAGS := $(shell pkg-config --cflags $(LIBFUSE))
> >  dtprobed_LIBS := -lcommon -lproc -lcommon -lport -lelf $(shell pkg-config --libs $(LIBFUSE))
> >  dtprobed_DEPS := libproc.a libcommon.a libport.a
> > -dtprobed_SOURCES := dtprobed.c seccomp-assistance.c
> > +dtprobed_SOURCES := dtprobed.c dof_stash.c seccomp-assistance.c
> >  dtprobed_LIBSOURCES := libproc libcommon
> >  
> >  ifdef HAVE_LIBSYSTEMD
> > diff --git a/dtprobed/dof_stash.c b/dtprobed/dof_stash.c
> > new file mode 100644
> > index 000000000000..a3a7af163398
> > --- /dev/null
> > +++ b/dtprobed/dof_stash.c
> > @@ -0,0 +1,1621 @@
> > +/*
> > + * Oracle Linux DTrace; DOF storage for later probe removal.
> 
> This is a very narrow scope description of what the DOF stash is, and I would
> say that the removal of probes is probably a more minor use of it when compared
> to enabling probe create/remove by dtrace itself and state retention across
> dtprobed restarts.  Maybe just sya 'DOF state storage management' and leave it
> at that?
> 
> > + * Copyright (c) 2023, Oracle and/or its affiliates. All rights reservedaeve.
> 
> 2024 (and in general all copyright years should be updated to reflect these
> changes are getting applied to the tree in 2024).
> 
> > + * Licensed under the Universal Permissive License v 1.0 as shown at
> > + * http://oss.oracle.com/licenses/upl.
> > + *
> > + * The DOF stash is the principal state preservation mechanism used by dtprobed.
> > + * Keeping DOF on disk keeps memory management complexity low, and makes it
> > + * possible for dtprobed to track DOF across daemon restarts and upgrades.
> 
> Wouldn't you mention here also the important purpose of allow dtrace to get
> to USDT probe data for probe creation and removal?
> 
> > + *
> > + * The state storage structure is as follows:
> > + *
> > + * /run/dtrace/stash: Things private to dtprobed.
> 
> I would put the section I delineate below as #1# here, indented to show that
> it is content that goes under this directory.
> 
> > + *
> > + * /run/dtrace/probes: Per-probe info, written by dtprobed, read by DTrace.
> 
> I would put the section I delineate below as #2# here, indented to show that
> it is content that goes under this directory.  That also means you can drop the
> /run/dtrace, /run/dtrace/stash, and /run/dtrace/probes prefixes which saves
> some space and makes reading the paths a little easier I think.
> 
> Section #1# starts here...
> 
> > + *
> > + * /run/dtrace/stash/dof/$dev-$ino: DOF contributed by particular mappings,
> > + * in raw form (as received from the client).
> 
> What is 'the client'?  That is terminology that seems confusing here because I
> would think it refers to a client of dof-stash, which would be dtprobed and
> dtrace, but then again, perhaps it refers to executables and libraries that
> have DOF info in them, presented to dtprobed using the drti code?  Also, since
> I assume that a $dev-$ino directory corresponds to a (one) particular mapping
> I think that 'by particular mappings' above should be 'a particular mapping'?
> 
> > + *
> > + * /run/dtrace/stash/dof-pid/$pid/$dev-$ino/raw: hardlink to the DOF for a given
> > + * PID. Pruned of dead processes at startup and on occasion: entries also
> > + * deleted on receipt of DTRACEHIOC_REMOVE ioctls. The principal effect of this
> > + * is to bump the link count for the corresponding DOF in dof/ directory: when
> > + * this falls to 1, the DOF is considered dead and the corresponding probe is
> > + * removed.
> 
> I would rephrase this a bit, beecause the way it reads it gives the impression
> that the principal effect of DTRACEHIOC_REMOVE is to dump the link count for
> the corresponding DOF in dof/ which makes no sense whatsoever.  Only after
> re-reading this a few times did I realize that you mean that you are using a
> hardlink to the DOF for a given PID because this has the principal effect of
> bumping the link count of the corresponding DOF in dof/.  So that should be
> mentioned first, and then you can mention how this directory is used in terms
> of creation, removal, and pruning.
> 
> I am also confused by 'hardlink to the DOF for a given PID' when it is clear
> from the $pid/$dev-$ino directory naming (and the nature of processes and their
> USDT probes) that a PID can have USDT probes that are provided by DOF from
> multiple mappings.  So isn't this a hardlink to the DOF for a given mapping
> that is used by a given PID?
> 
> > + *
> > + * /run/dtrace/stash/dof-pid/$pid/$dev-$ino/dh: Serialized form of the
> > + * dof_helper_t received from a given client. Preserved for all
> 
> Again, what is 'a given client'?  Here is would seem to indicate a mapping, or
> at least an executable or shared library that provides DOF through drti.
> 
> > + * DTRACEHIOC_ADDDOFs, even if the DOF was already contributed by some other
> > + * process.  Keeping this around allows the DOF to be re-parsed as needed, even
> > + * by a later re-execution of a different version of the same daemon with a
> > + * different dof_parsed_t structure, and to be re-parsed with appropriate
> > + * addresses for whichever mapping of this DOF is under consideration.
> 
> I assume 'Serialized' means 'parsed' here?  Since that is the terminology used
> throughout this patch in reference to DOF I would stick to that.
> 
> > + *
> > + * /usr/dtrace/stash/dof-pid/$pid/$dev-$ino/parsed: parsed DOF for a given PID.
> > + * Generated from the helper for PIDs that are the first appearance of a given
> > + * piece of DOF and used for the common case where that DOF has only one mapping
> > + * until its deletion (short-lived processes, mostly).  Deleted at daemon
> > + * startup.
> 
> I am confused here.  The description seems to indicate that this is a file,
> but reading firther down it becomes clear that this is a directory.  (Hence my
> suggestion to reorganize this comment block.)
> 
> But then there is the question of what you mean by refer to the first appearance
> of a given piece of DOF and DOF that only has one mapping.  If I have two tasks
> that execute the same executable that contains USDT probes, will only one of
> them have a /parsed directory?  What if that one finishes execution while the
> other is still running?  What if then a third innstance gets started?  Many
> questiond and I do not find clarity in the description of this directory to
> actually know how this is used and managed.  Is this not used for long-lived
> processes?  You peak of 'DOF [that] has only one mapping' but aren't we above
> talking about 'DOF contributed by [a] particular mapping'?  It feels like we
> are mixing terminology.
> 
> > + *
> > + * /run/dtrace/stash/dof-pid/$pid/$gen: symlink to a $dev-$ino dir. "$gen" is an
> > + * incrementing generation counter starting at zero for every unique PID, used
> > + * by clients to uniquely identify DOF pieces to remove.
> 
> What 'clients'?  Which $dev-$ino is $gen a symlink to (since a given pid can
> have multiple $dev-$ino pairs associated with it)?
> 
> > + *
> > + * /run/dtrace/stash/dof-pid/$pid/next-gen: The next generation counter to use
> > + * for a new piece of DOF from this process.  Encoded as a sparse file whose
> > + * size is the next counter value to use.
> 
> Section #1# end here...
> 
> > + *
> > + * Per-probe info:
> > + *
> 
> Section #2# starts here...
> 
> > + * /run/dtrace/probes/$pid/$prv$pid/$mod/$fun/$prb: Parsed representation of one
> > + * probe in a given process. Removed by dtprobed when the process dies, or if
> > + * all mappings containing the probe are unmmapped.  Used by DTrace for tracing
> > + * by PID.  Hardlink to...
> 
> It might be more clear to start with saying that this is a hardlink to <blah>,
> which is the parsed representation of one probe in a given process.  And then
> refer to the actual description of that file.
> 
> Section #3# end here...
> 
> Add to section #1# from here...
> 
> > + *
> > + * /run/dtrace/stash/dof-pid/$pid/$dev-$ino/parsed/$prv$pid:$mod:$fun:$prb:
> > + * parsed DOF for a given probe for a given PID.  Generated from the helper for
> > + * PIDs that are the first appearance of a given piece of DOF and used for the
> > + * common case where that DOF has only one mapping until its deletion
> > + * (short-lived processes, mostly).  Deleted at daemon startup if the
> > + * dof_parsed_t structure changed.  The file consists of a uint64_t version
> > + * number (DOF_PARSED_VERSION), a uint64_t count of tracepoints, and then
> > + * that many struct dof_parsed_t's of type DIT_TRACEPOINT.
> 
> Same question about the 'used for the common case' which seems to imply that
> this is not used in other cases - which then brings up the question where
> dtrace will get USDT probe info for the uncommon cases.
> 
> to here...
> 
> Add to section #2# from here...
> 
> > + *
> > + * In future, symlinks in /run/dtrace/probes/$prv$pid may exist, used for
> > + * globbing provider names across processes.
> 
> to here...
> 
> So, on a higher level...  I would expect things to work as follows (and please
> correct me if/where I am wrong)...
> 
> Evwry piece of DOF data is associated with a unique $dev-$ino pair that
> identifies the ELF object that provided it (executable or shared library).
> 
> When a task starts, the DOF data of the executable (if any) and shared
> libraries (if any) is presented to dtprobed for processing, along with task
> specific information (dof_helper_t - of which the base address of the mapping
> for the associated $dev-$ino pair is the most important piece).
> 
> This means that if multiple tasks contain a mapping for a given $dev-$ino,
> only one copy of the DOF data is needed because the only task-specific data
> is contained in dof_helper_t for each task.  Furthermore, if a task has a
> mapping for a given $dev-$ino, the DOF data for that $dev-$ino apply to the
> task (i.e. it will contain the USDT probes specified in that DOF data).
> 
> So, what is needed is a collection of $dev-$ino directories that contain the
> actual (non-task specific) DOF data.  For practical reasons, storing it in
> raw, parsed, and per-probe form seems convenient.
> 
> And then there is a collection of $pid directories that contain links to the
> associated $dev-$ino directories, and dof_helper_t data for each of them.
> Due to provider names including the pid of the task, this would also require
> links to be created for the task-specific provider names (with pid) to the
> generic ones in the apporpriate $dev-$ino directory (the rest of the probe
> name components are nno-task specific).
> 
> Management-wise, the $pid directories are created for any task that presents
> DOF to dtprobed, and removed when the task terminates (or possibly when a
> DTRACEHIOC_REMOVE ioctl has been issued for all its DOF).  A $dev-$ino
> directory is created the first time DOF is presented for the $dev-$ino pair,
> and remains until the last task that references is has terminated.
> 
> Does that sum it up correctly?  If so, this is not what I understand from the
> existing description in this comment block.
> 
> > + */
> > +
> > +#include <sys/stat.h>
> > +#include <sys/types.h>
> > +#include <dirent.h>
> > +#include <errno.h>
> > +#include <fcntl.h>
> > +#include <inttypes.h>
> > +#include <signal.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <unistd.h>
> > +#include <assert.h>
> > +#include "dof_stash.h"
> > +
> > +#include <dt_list.h>
> > +
> > +#ifdef HAVE_FUSE_LOG
> > +#include <fuse_log.h>
> > +#else
> > +#include "rpl_fuse_log.h"
> > +#endif
> > +
> > +#include <port.h>
> > +
> > +static int dof_dir, pid_dir, probes_dir;
> > +
> > +/*
> > + * Like close(), but only try to close positively-numbered fds.
> > + */
> > +static int
> > +maybe_close(int fd)
> > +{
> > +	if (fd > -1)
> 
> I would think (fd >= 0) is more standard, but either way works.
> 
> > +		return close(fd);
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Like mkdirat(), but already-existing directories are considered
> > + * successful.
> > + */
> > +static int
> > +mkexistingdirat(int dirfd, const char *name, mode_t mode)
> 
> There is only one caller for this function (make_state_dirat) and I would
> think that this functionality could as well just be inlined there.  That also
> makes it nice and similar to make_state_dir.
> 
> > +{
> > +	if (mkdirat(dirfd, name, mode) < 0) {
> > +		if (errno == EEXIST) {
> > +			struct stat s;
> > +
> > +			if (fstatat(dirfd, name, &s, 0) == 0 &&
> > +			    S_ISDIR(s.st_mode))
> > +				return 0;
> > +			errno = EEXIST;
> > +		}
> > +		return -1;
> > +	}
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Make a runtime state directory and return its fd.
> > + */
> > +static int
> > +make_state_dir(const char *dir)
> > +{
> > +	int fd;
> > +
> > +        if (mkdir(dir, 0755) < 0 && errno != EEXIST) {
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: "
> > +			 "cannot create state directory %s: %s\n",
> > +			 dir, strerror(errno));
> > +		return -1;
> > +	}
> > +
> > +	if ((fd = open(dir, O_PATH | O_CLOEXEC)) < 0) {
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: "
> > +			 "cannot open state directory %s: %s\n",
> > +			 dir, strerror(errno));
> > +		return -1;
> > +	}
> > +	return fd;
> > +}
> > +
> > +/*
> > + * Make a state directory under some other directory and return an fd to it.
> > + *
> > + * If "readable" is set, the fd is opened with O_RDONLY rather than O_PATH.
> > + */
> > +static int
> > +make_state_dirat(int dirfd, const char *name, const char *errmsg, int readable)
> > +{
> > +	int fd;
> > +	int flags = O_PATH;
> > +
> > +	if (readable)
> > +		flags = O_RDONLY;
> > +
> > +	if (mkexistingdirat(dirfd, name, 0755) < 0) {
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot create %s dir named %s: %s\n",
> > +			 errmsg, name, strerror(errno));
> > +		return -1;
> > +	}
> > +
> > +	if ((fd = openat(dirfd, name, flags | O_CLOEXEC)) < 0) {
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot open %s dir named %s: %s\n",
> > +			 errmsg, name, strerror(errno));
> > +		return -1;
> > +	}
> > +	return fd;
> > +}
> > +
> > +/*
> > + * Initialize the DOF-stashing machinery, if needed.
> > + */
> > +int
> > +dof_stash_init(const char *statedir)
> > +{
> > +	int root;
> > +	int stash;
> > +
> > +	if (statedir == NULL)
> > +		statedir="/run/dtrace";
> > +
> > +        if ((root = make_state_dir(statedir)) < 0)
> > +		return -1;
> > +
> > +        if ((stash = make_state_dirat(root, "stash", "statedir", 0)) < 0)
> > +		return -1;
> > +
> > +	if ((dof_dir = make_state_dirat(stash, "dof", "DOF", 0)) < 0)
> > +		return -1;
> > +        if ((pid_dir = make_state_dirat(stash, "dof-pid", "per-pid DOF", 1)) < 0)
> > +		return -1;
> > +
> > +        if ((probes_dir = make_state_dirat(root, "probes", "probesdir", 1)) < 0)
> > +		return -1;
> > +
> > +        close(root);
> > +	close(stash);
> > +
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Compose a provider's name from pieces.
> > + */
> > +static char *
> > +make_provider_name(const char *prov, pid_t pid)
> > +{
> > +	char *ret;
> > +
> > +        if (asprintf(&ret, "%s%li", prov, (long) pid) < 0) {
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory making provider name\n");
> > +		return NULL;
> > +	}
> > +	return ret;
> > +}
> 
> This function does not validate against the length limit on provider names?
> This function seems to be something that would belong in a library of common
> functions between dtprobed and dtrace since e.g. dt_prov_uprobe.d and dt_pid.c
> provide similar code.
> 
> > +
> > +/*
> > + * Compose a full probespec's name from pieces.
> > + */
> > +static char *
> > +make_probespec_name(const char *prov, const char *mod, const char *fn,
> > +		    const char *prb)
> > +{
> > +	char *ret;
> > +
> > +        if (asprintf(&ret, "%s:%s:%s:%s", prov, mod, fn, prb) < 0) {
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory making probespec\n");
> > +		return NULL;
> > +	}
> > +	return ret;
> > +}
> 
> This also seems like a good candidate for a more general-use function.  It
> probably could also use some probe name length validation.  Perhaps it could
> also be useful to make use of the dtrace_probedesc_t type to represent the
> quad (prv, mod, fun, prb) as is commonly done in DTrace code?
> 
> > +
> > +/*
> > + * Split the pieces back up again.  The input spec becomes the provider name.
> > + * The pieces' lifetime is no longer than that of the input string.
> > + */
> > +static int
> > +split_probespec_name(char *spec, const char **mod, const char **fn,
> > +		     const char **prb)
> > +{
> > +	char *tmp;
> > +
> > +	tmp = strchr(spec, ':');
> > +	if (tmp == NULL)
> > +		return -1;
> > +	*tmp = 0;
> > +	*mod = tmp + 1;
> > +
> > +	tmp = strchr(*mod, ':');
> > +	if (tmp == NULL)
> > +		return -1;
> > +	*tmp = 0;
> > +	*fn = tmp + 1;
> > +
> > +	tmp = strchr(*fn, ':');
> > +	if (tmp == NULL)
> > +		return -1;
> > +	*tmp = 0;
> > +	*prb = tmp + 1;
> > +
> > +	return 0;
> > +}
> 
> Same comments (general use function, use dtrace_probedesc_t, length validation).
> Also see dtrace_xstr2desc() in dt_subr.c for a very similar function.
> 
> > +
> > +/*
> > + * Compose a DOF directory's name from pieces.
> > + */
> > +static char *
> > +make_dof_name(dev_t dev, ino_t ino)
> > +{
> > +	char *ret;
> > +
> > +        if (asprintf(&ret, "%li-%li", (long) dev, (long) ino) < 0) {
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory making DOF name\n");
> > +		return NULL;
> > +	}
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Split the pieces back up again.
> > + */
> > +static int
> > +split_dof_name(const char *dof_name, dev_t *dev, ino_t *ino)
> > +{
> > +	if (sscanf(dof_name, "%li-%li", (long *) dev, (long *) ino) < 2) {
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: per-pid directory name %s unparseable\n",
> > +			 dof_name);
> > +		return -1;
> > +	}
> > +	return 0;
> > +}
> > +
> > +/*
> > + * A trivial wrapper.
> > + */
> > +static char *
> > +make_numeric_name(long num)
> > +{
> > +	char *ret;
> > +
> > +        if (asprintf(&ret, "%li", num) < 0) {
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory making filename\n");
> > +		return NULL;
> > +	}
> > +	return ret;
> > +}
> > +
> > +
> > +/*
> > + * Push a piece of parsed DOF into a list to be emitted by
> > + * dof_stash_write_parsed().
> > + */
> > +int
> > +dof_stash_push_parsed(dt_list_t *accum, dof_parsed_t *parsed)
> > +{
> > +	dof_parsed_list_t *entry;
> > +
> > +	if ((entry = calloc(1, sizeof(dof_parsed_list_t))) == NULL)
> > +		return -1;
> > +
> > +	entry->parsed = parsed;
> > +        dt_list_append(accum, entry);
> > +
> > +        return 0;
> > +}
> > +
> > +/*
> > + * Free all the accumulated parsed DOF in the passed list.
> > + */
> > +void
> > +dof_stash_flush(dt_list_t *accum)

That is a bit of an unusual function name for a function that frees a list.
When I see *flush I usually think of forcinfg things to be written out to
stable storage or something like that.  Maybe dof_stash_free() would be
better?

> > +{
> > +	dof_parsed_list_t *accump;
> > +	dof_parsed_list_t *last_accump = NULL;
> > +
> > +	for (accump = dt_list_next(accum); accump != NULL;
> > +	     accump = dt_list_next(accump)) {
> > +		dt_list_delete(accum, accump);
> > +		free(accump->parsed);
> > +		free(last_accump);
> > +		last_accump = accump;
> > +	}
> > +	free(last_accump);
> > +}
> > +
> > +/*
> > + * Write a chunk of data to an fd.
> > + */
> > +static int
> > +write_chunk(int fd, const void *buf, size_t size)
> > +{
> > +	char *bufptr = (char *) buf;
> > +
> > +	while (size > 0) {
> > +		size_t len;
> > +
> > +                if ((len = write(fd, bufptr, size)) < 0 && errno != EINTR)
> > +			return -1;
> > +
> > +		size -= len;
> > +		bufptr += len;
> > +	}
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Write out a piece of raw DOF.  Returns the length of the file written,
> > + * or 0 if none was needed (or -1 on error).
> > + */
> > +static int
> > +dof_stash_write(int dirfd, const char *name, const void *buf, size_t size)

For consistency this perhaps would be better named dof_stash_write_raw() ?

Is there a point to passing in dirfd given that the only caller simply passes
in the dof_dir global variable?

> > +{
> > +	struct stat s;
> > +	int fd;
> > +
> > +        /*
> > +	 * Sanity check: if the DOF already exists but is not the same size
> > +	 * as the DOF we already have, complain, and replace it and its parsed
> > +	 * equivalent.  If it does exist, there's no need to write it out.

This comment seems to imply that this function replaces the raw DOF and its
parsed equivalent.  But the code in this function does not actually do that.
So, I assume the comment needs to be updated to reflect reality?

> > +	 *
> > +	 * If we can't even unlink it or write it out, we give up -- the stash
> > +	 * has failed and we won't be able to do anything it implies.
> > +	 *
> > +	 * (This is only a crude check -- obviously distinct raw DOF could be
> > +	 * the same size by pure chance.)

I fear that the check is so crude that it may not catch that many cases.  E.g.
when I recompile an executable or library but do not change the probes, it is
very likely that the offset location of the probes will change, but the DOF
size will not because it is based on statically sized data items.  And so there
is a higher chance that two distinct DOF versions will have the same size.

Something like a checksum may be needed here.

> > +	 */
> > +	if (fstatat(dirfd, name, &s, 0) == 0) {
> > +		if (s.st_size == size)
> > +			return 0;
> > +
> > +                fuse_log(FUSE_LOG_ERR, "dtprobed: DOF %s already exists, "
> > +			 "but is %zx bytes long, not %zx: replacing\n",
> > +			 name, s.st_size, size);
> > +		if (unlinkat(dirfd, name, 0) < 0) {
> > +			fuse_log(FUSE_LOG_ERR, "dtprobed: cannot remove old DOF %s: %s\n",
> > +				 name, strerror(errno));
> > +			return -1;
> > +		}
> > +	}
> > +
> > +	if ((fd = openat(dirfd, name, O_CREAT | O_EXCL | O_WRONLY, 0644)) < 0) {
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot write out raw DOF: %s\n",
> > +			 strerror(errno));
> > +		return -1;
> > +	}
> > +
> > +	if (write_chunk(fd, buf, size) < 0)
> > +		goto err;
> > +
> > +	if (close(fd) < 0)
> > +		goto err;
> > +	return size + sizeof(uint64_t);
> > +
> > +  err:

Goto labels are typically at the left marging in our code.

> > +	fuse_log(FUSE_LOG_ERR, "dtprobed: cannot write out DOF: %s\n",
> > +		 strerror(errno));
> > +	unlinkat(dirfd, name, 0);
> > +	close(fd);
> > +	return -1;
> > +}
> > +

The rest I leave pending reply to my earlier email concerning the directory
structure in /run/dtrace because that impacts how I proceed with reviewing
the remainder of the code.

> > +/*
> > + * Write out all the parsed DOF from the passed-in accumulator.  Split out from
> > + * dof_stash_add() because we don't want to create anything in the DOF stash for
> > + * a given piece of DOF until we have verified that all the parsed data for all
> > + * the probes in that DOF is valid.
> > + *
> > + * Each parsed file starts with a single 64-bit word giving the version of
> > + * dof_parsed_t in use, to ensure that readers never use an incompatible
> > + * version, and a single uint64_t giving the number of tracepoints following,
> > + * each of which is a struct dof_parsed_t. The name of the probe is not given,
> > + * since the filename (or, for the probes/ dirs, the file and directory names)
> > + * already contains it.
> > + *
> > + * We also write the parsed version out to the parsed/version file in the parsed
> > + * mapping dir, to make it easier to determine whether to reparse all the DOF
> > + * for outdatedness.
> > + */
> > +int
> > +dof_stash_write_parsed(pid_t pid, dev_t dev, ino_t ino, dt_list_t *accum)
> > +{
> > +	char *pid_name = make_numeric_name(pid);
> > +	char *dof_name = make_dof_name(dev, ino);
> > +	char *provpid_name = NULL;
> > +	int perpid_dir, perpid_dof_dir, probepid_dir, parsed_mapping_dir,
> > +		prov_dir = -1, mod_dir = -1, fun_dir = -1;
> > +        dof_parsed_list_t *accump;
> > +	int version_fd, fd = -1;
> > +	const char *op;
> > +	const char *probe_err = "unknown probe";
> > +	uint64_t version = DOF_PARSED_VERSION;
> > +	int state = -1;
> > +	int err;
> > +
> > +	if (dt_list_next(accum) == NULL) {
> > +		free(pid_name);
> > +		free(dof_name);
> > +		return 0;
> > +	}
> > +
> > +        if ((perpid_dir = openat(pid_dir, pid_name, O_PATH | O_CLOEXEC)) < 0) {
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: PID %i, %lx/%lx: open failed for PID dir for parsed DOF write: %s\n",
> > +			 pid, dev, ino, strerror(errno));
> > +		goto early_err;
> > +	}
> > +
> > +	op = "per-mapping open";
> > +        if ((perpid_dof_dir = openat(perpid_dir, dof_name, O_PATH | O_CLOEXEC)) < 0)
> > +		goto close_perpid_err;
> > +
> > +	op = "per-mapping parsed DOF mkdirat";
> > +	if ((parsed_mapping_dir = make_state_dirat(perpid_dof_dir, "parsed", op, 0)) < 0)
> > +		goto close_perpid_dof_err;
> > +
> > +	op = "probe PID mkdirat";
> > +	if ((probepid_dir = make_state_dirat(probes_dir, pid_name, op, 0)) < 0)
> > +		goto remove_parsed_mapping_err;
> > +
> > +        if ((version_fd = openat(parsed_mapping_dir, "version",
> > +				 O_CREAT | O_TRUNC | O_WRONLY, 0644)) < 0)
> > +		goto remove_probepid_dir_err;
> > +
> > +	if (write_chunk(version_fd, &version, sizeof(uint64_t)) < 0) {
> > +		close(version_fd);
> > +		goto remove_probepid_dir_err;
> > +	}
> > +	close(version_fd);
> > +
> > +	/*
> > +	 * We can just use assert() to check for overruns here, because any
> > +	 * overruns are not corrupted input but a seriously buggy parser.
> > +	 *
> > +	 * On error, from this point on, we remove as little as possible on
> > +	 * error, just a single probe's worth: better some probes survive an
> > +	 * error condition than none do.
> > +	 */
> > +	for (accump = dt_list_next(accum); accump != NULL;
> > +	     accump = dt_list_next(accump)) {
> > +		char *mod, *fun, *prb;
> > +		char *specfn = NULL;
> > +
> > +		switch (accump->parsed->type) {
> > +		/*
> > +		 * Provider: make new provider dir.
> > +		 */
> > +		case DIT_PROVIDER: {
> > +			state = accump->parsed->type;
> > +
> > +			free(provpid_name);
> > +
> > +			provpid_name = make_provider_name(accump->parsed->provider.name, pid);
> > +			probe_err = accump->parsed->provider.name;
> > +
> > +			op = "parsed provider close";
> > +			if (maybe_close(prov_dir) < 0)
> > +				goto err_provider;
> > +
> > +			op = "probe provider mkdirat";
> > +			if ((prov_dir = make_state_dirat(probepid_dir, provpid_name,
> > +							 op, 1)) < 0)
> > +				goto err_provider;
> > +			break;
> > +		}
> > +		/*
> > +		 * Probe: make (mod, fun, probe) dirs as needed: close
> > +		 * out file, open new one, make hardlink to it.
> > +		 */
> > +		case DIT_PROBE: {
> > +                        assert(state >= 0);
> > +			state = accump->parsed->type;
> > +
> > +			mod = accump->parsed->probe.name;
> > +			assert(accump->parsed->size > (mod - (char *) accump->parsed));
> > +			fun = mod + strlen(mod) + 1;
> > +			assert(accump->parsed->size > (fun - (char *) accump->parsed));
> > +			prb = fun + strlen(fun) + 1;
> > +			assert(accump->parsed->size >= (prb - (char *) accump->parsed));
> > +
> > +			op = "probe name construction";
> > +			probe_err = provpid_name;
> > +			if ((specfn = make_probespec_name(provpid_name, mod, fun, prb)) == NULL)
> > +				goto err_provider;
> > +
> > +			/*
> > +			 * Ban "." and ".." as name components.  Obviously names
> > +			 * containing dots are commonplace (shared libraries,
> > +			 * for instance), but allowing straight . and .. would
> > +			 * have obviously horrible consequences.  They can't be
> > +			 * filenames anyway, and you can't create them with
> > +			 * dtrace -h because they aren't valid C identifier
> > +			 * names.
> > +			 */
> > +			op = "probe name validation";
> > +			probe_err = specfn;
> > +
> > +                        if (strcmp(accump->parsed->provider.name, ".") == 0 ||
> > +			    strcmp(accump->parsed->provider.name, "..") == 0 ||
> > +			    strcmp(mod, ".") == 0 ||
> > +			    strcmp(mod, "..") == 0 ||
> > +			    strcmp(fun, ".") == 0 ||
> > +			    strcmp(fun, "..") == 0 ||
> > +			    strcmp(prb, ".") == 0 ||
> > +			    strcmp(prb, "..") == 0)
> > +				goto err_provider;
> > +
> > +			op = "parsed probe close";
> > +
> > +			/*
> > +			 * This rather weird pattern carries out all closes and
> > +			 * then errors if any of them fail.
> > +			 */
> > +                        err = maybe_close(mod_dir);
> > +			err |= maybe_close(fun_dir);
> > +			err |= maybe_close(fd);
> > +
> > +			if (err != 0)
> > +				goto err_provider;
> > +
> > +			op = "probe module";
> > +
> > +			if ((mod_dir = make_state_dirat(prov_dir, mod, op, 0)) < 0)
> > +				goto err_provider;
> > +
> > +			op = "probe fun";
> > +			if ((fun_dir = make_state_dirat(mod_dir, fun, op, 0)) < 0)
> > +				goto err_probemod;
> > +
> > +			op = "parsed probe creation";
> > +			if ((fd = openat(parsed_mapping_dir, specfn,
> > +					 O_CREAT | O_EXCL | O_WRONLY, 0644)) < 0)
> > +				goto err_probefn;
> > +
> > +			op = "parsed probe hardlink";
> > +			if (linkat(parsed_mapping_dir, specfn, fun_dir, prb, 0) < 0)
> > +				goto err_parsed_probe;
> > +
> > +			op = "parsed version write";
> > +			if (write_chunk(fd, &version, sizeof(uint64_t)) < 0)
> > +				goto err_probe_link;
> > +
> > +			op = "parsed nprobes write";
> > +			_Static_assert(sizeof(accump->parsed->probe.ntp) == sizeof(uint64_t),
> > +				       "chunk write assumes parsed.probe.ntp is a uint64_t");
> > +                        if (write_chunk(fd, &accump->parsed->probe.ntp,
> > +					sizeof(accump->parsed->probe.ntp)) < 0)
> > +				goto err_probe_link;
> > +			break;
> > +		}
> > +
> > +		/* Tracepoint: add to already-open file.  */
> > +		case DIT_TRACEPOINT:
> > +			assert(state == DIT_PROBE || state == DIT_TRACEPOINT);
> > +			state = accump->parsed->type;
> > +
> > +			if (write_chunk(fd, accump->parsed, accump->parsed->size) < 0)
> > +				goto err_probe_link;
> > +			break;
> > +
> > +		/* Error return from parser.  */
> > +		case DIT_ERR:
> > +			op = "parsing";
> > +			fuse_log(FUSE_LOG_ERR, "dtprobed: parser error: %s\n",
> > +				 accump->parsed->err.err);
> > +			goto err_provider;
> > +
> > +		default:
> > +			/*
> > +			 * This is a parser bug, and a bad one.  Just fail
> > +			 * hard.
> > +			 */
> > +			fuse_log(FUSE_LOG_ERR, "dtprobed: PID %i, %lx/%lx: internal error: corrupt parsed DOF: type %i\n",
> > +				 pid, dev, ino, accump->parsed->type);
> > +			assert(1);
> > +		}
> > +
> > +		free(specfn);
> > +		continue;
> > +
> > +	err_probe_link:
> > +		unlinkat(fun_dir, prb, 0);
> > +	err_parsed_probe:
> > +		unlinkat(parsed_mapping_dir, specfn, 0);
> > +	err_probefn:
> > +		unlinkat(mod_dir, fun, AT_REMOVEDIR);
> > +	err_probemod:
> > +		if (prov_dir >= 0) /* Always true: placate compiler */
> > +			unlinkat(prov_dir, mod, AT_REMOVEDIR);
> > +	err_provider:
> > +		maybe_close(fd);
> > +		maybe_close(fun_dir);
> > +		maybe_close(mod_dir);
> > +		maybe_close(prov_dir);
> > +		unlinkat(probepid_dir, provpid_name, AT_REMOVEDIR);
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: PID %i, %lx/%lx: %s failed for parsed DOF write for %s: %s\n",
> > +			 pid, dev, ino, op, probe_err, strerror(errno));
> > +		free(specfn);
> > +		goto err_free;
> > +	}
> > +
> > +	err = maybe_close(prov_dir);
> > +	err |= maybe_close(mod_dir);
> > +	err |= maybe_close(fun_dir);
> > +	err |= maybe_close(fd);
> > +
> > +        if (err != 0) {
> > +		unlinkat(perpid_dof_dir, "parsed", 0);
> > +		goto err_free;
> > +	}
> > +
> > +	close(probepid_dir);
> > +	close(parsed_mapping_dir);
> > +	close(perpid_dof_dir);
> > +	close(perpid_dir);
> > +	free(provpid_name);
> > +	free(pid_name);
> > +	free(dof_name);
> > +	return 0;
> > +
> > +	/* Errors requiring freeing and error return but no removals.  */
> > +err_free:
> > +	close(probepid_dir);
> > +	close(parsed_mapping_dir);
> > +	close(perpid_dof_dir);
> > +	close(perpid_dir);
> > +	close(prov_dir);
> > +	free(provpid_name);
> > +	free(pid_name);
> > +	free(dof_name);
> > +	return -1;
> > +
> > +	/*
> > +	 * Errors before parsed-probe creation starts require wholesale
> > +	 * removals of stale dirs.
> > +	 */
> > +
> > +remove_probepid_dir_err:
> > +	close(probepid_dir);
> > +	unlinkat(probes_dir, pid_name, AT_REMOVEDIR);
> > +	unlinkat(parsed_mapping_dir, "version", 0);
> > +remove_parsed_mapping_err:
> > +	close(parsed_mapping_dir);
> > +	unlinkat(perpid_dof_dir, "parsed", AT_REMOVEDIR);
> > +close_perpid_dof_err:
> > +	close(perpid_dof_dir);
> > +close_perpid_err:
> > +	fuse_log(FUSE_LOG_ERR, "dtprobed: PID %i, %lx/%lx: %s failed for parsed DOF write: %s\n",
> > +		 pid, dev, ino, op, strerror(errno));
> > +	close(perpid_dir);
> > +early_err:
> > +	free(pid_name);
> > +	free(dof_name);
> > +	return -1;
> > +}
> > +
> > +/*
> > + * Write out the DOF helper.  A trivial wrapper.
> > + */
> > +static
> > +int dof_stash_write_dh(int dirfd, const dof_helper_t *dh)
> > +{
> > +	int fd;
> > +
> > +        if ((fd = openat(dirfd, "dh", O_CREAT | O_EXCL | O_WRONLY, 0644)) < 0)
> > +		goto err;
> > +
> > +	if (write_chunk(fd, dh, sizeof(dof_helper_t *)) < 0) {
> > +		close(fd);
> > +		goto err;
> > +	}
> > +
> > +        if (close(fd) < 0)
> > +		goto err;
> > +
> > +	return 0;
> > +
> > + err:
> > +	fuse_log(FUSE_LOG_ERR, "dtprobed: cannot write out DOF helper: %s\n",
> > +		 strerror(errno));
> > +	return -1;
> > +}
> > +
> > +/*
> > + * Add a piece of raw DOF from a given (pid, dev, ino) triplet.
> > + *
> > + * Return the new DOF generation number, or -1 on error.
> > + */
> > +int
> > +dof_stash_add(pid_t pid, dev_t dev, ino_t ino, const dof_helper_t *dh,
> > +	      const void *dof, size_t size)
> > +{
> > +	char *dof_name = make_dof_name(dev, ino);
> > +	char *pid_name = make_numeric_name(pid);
> > +	int perpid_dir = -1;
> > +	int perpid_dof_dir = -1;
> > +	int gen_fd = -1;
> > +	struct stat gen_stat;
> > +	char *gen_name = NULL;
> > +	int new_pid = 1;
> > +	int new_dof = 0;
> > +	int err = -1;
> > +
> > +	if (!pid_name || !dof_name)
> > +		goto out_free;
> > +
> > +
> > +	/* Make the directories. */
> > +
> > +        perpid_dir = make_state_dirat(pid_dir, pid_name, "PID", 0);
> > +
> > +        if (perpid_dir < 0)
> > +		goto out_free;
> > +
> > +        perpid_dof_dir = make_state_dirat(perpid_dir, dof_name, "per-pid DOF", 0);
> > +
> > +        if (perpid_dof_dir < 0)
> > +		goto err_unlink_nomsg;
> > +
> > +        fuse_log(FUSE_LOG_DEBUG, "%i: adding DOF in mapping %lx/%lx, l_addr %lx\n",
> > +		 pid, dev, ino, dh->dofhp_addr);
> > +
> > +	/* Get the current generation counter. */
> > +
> > +        if ((gen_fd = openat(perpid_dir, "next-gen", O_CREAT | O_WRONLY, 0644)) < 0) {
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot open DOF generation counter for PID "
> > +			 "%i: %s\n", pid, strerror(errno));
> > +		goto err_unlink_nomsg;
> > +	}
> > +
> > +        if (fstat(gen_fd, &gen_stat) < 0) {
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot stat DOF generation counter for PID "
> > +			 "%i: %s\n", pid, strerror(errno));
> > +		goto err_unlink_nomsg;
> > +	}
> > +	if (gen_stat.st_size > 0)
> > +		new_pid = 0;
> > +
> > +	/*
> > +	 * Write out the raw DOF, if not already known, and link it into its
> > +	 * per-PID dir: write the helper directly out.  dof_stash_write tells us
> > +	 * whether the DOF actually needed writing or not (via a straight size
> > +	 * comparison, which suffices to spot almost everything other than the
> > +	 * pathological case of stale DOF from a killed process whose inode is
> > +	 * unlinked and replaced with another identically-numbered inode with
> > +	 * DOF of the *exact same length*: this seems too unlikely to be worth
> > +	 * bothering with.)
> > +	 */
> > +	new_dof = 1;
> > +	switch (dof_stash_write(dof_dir, dof_name, dof, size)) {
> > +	case 0: new_dof = 0; break;
> > +	case -1: goto err_unlink_nomsg; break;
> > +	}
> > +
> > +        if (linkat(dof_dir, dof_name, perpid_dof_dir, "raw", 0) < 0)
> > +		goto err_unlink_msg;
> > +
> > +	if (dof_stash_write_dh(perpid_dof_dir, dh) < 0)
> > +		goto err_unlink_nomsg;
> > +
> > +	/*
> > +	 * Update the generation counter and mark this DOF's generation.  On
> > +	 * error after this point we leak generation counter values.
> > +	 */
> > +	if(ftruncate(gen_fd, gen_stat.st_size + 1) < 0) {
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot update generation counter for PID %i: %s\n",
> > +			 pid, strerror(errno));
> > +		goto err_unlink_nomsg;
> > +	}
> > +
> > +	gen_name = make_numeric_name(gen_stat.st_size);
> > +	if (symlinkat(dof_name, perpid_dir, gen_name) < 0)
> > +		goto err_unlink_msg;
> > +
> > +        err = 0;
> > +
> > +out_free:
> > +        free(dof_name);
> > +	free(pid_name);
> > +
> > +        maybe_close(perpid_dir);
> > +        maybe_close(perpid_dof_dir);
> > +	maybe_close(gen_fd);
> > +
> > +        return err == 0 ? gen_stat.st_size : err;
> > +
> > +err_unlink_msg:
> > +	fuse_log(FUSE_LOG_ERR, "Cannot link pids for DOF from PID %i, "
> > +		 "DOF mapping %lx/%lx into place: %s\n", pid, dev, ino,
> > +		 strerror(errno));
> > +err_unlink_nomsg:
> > +        unlinkat(perpid_dof_dir, "raw", 0);
> > +        unlinkat(perpid_dof_dir, "dh", 0);
> > +	unlinkat(perpid_dir, dof_name, AT_REMOVEDIR);
> > +
> > +	if (gen_name)
> > +		unlinkat(perpid_dir, gen_name, 0);
> > +
> > +	if (new_pid) {
> > +		unlinkat(perpid_dir, "next-gen", 0);
> > +		unlinkat(pid_dir, pid_name, AT_REMOVEDIR);
> > +	}
> > +
> > +        if (new_dof)
> > +		unlinkat(dof_dir, dof_name, 0);
> > +
> > +	goto out_free;
> > +}
> > +
> > +/*
> > + * The guts of unlinkat_many (below).  Returns -1 on error, or positive
> > + * ENOTEMPTY if a subdir deletion failed because the dir was nonempty (any
> > + * positive return leads to unwinding without doing any more deletions).
> > + */
> > +static int
> > +unlinkat_many_internal(int dirfd, int diroff, const char **dirs)
> > +{
> > +	int fd;
> > +	struct stat s;
> > +	const char *err;
> > +	int flags = 0;
> > +	int ret;
> > +
> > +	/*
> > +	 * Open the dir/file at the next level down, then recurse to delete the
> > +	 * next one in first.  This is a directory unless the next element in
> > +	 * the list is NULL.
> > +	 */
> > +
> > +	if (dirs[diroff] == NULL)
> > +		return 0;
> > +
> > +	if (dirs[diroff + 1] != NULL)
> > +		flags = AT_REMOVEDIR;
> > +
> > +	if (fstatat(dirfd, dirs[diroff], &s, AT_SYMLINK_NOFOLLOW) < 0) {
> > +		if (errno == ENOENT)
> > +			return 0;
> > +		fuse_log(FUSE_LOG_ERR, "Cannot unlink %s during probe state dir deletion: %s\n",
> > +			 dirs[diroff], strerror(errno));
> > +		return -1;
> > +	}
> > +
> > +	if (flags == AT_REMOVEDIR && !S_ISDIR(s.st_mode)) {
> > +		err = "must be a directory, but is not";
> > +		goto err_keep_going;
> > +	} else if (flags == 0 && !S_ISREG(s.st_mode)) {
> > +		err = "not a regular file";
> > +		goto err_keep_going;
> > +	}
> > +
> > +	if ((fd = openat(dirfd, dirs[diroff], O_RDONLY | O_CLOEXEC)) < 0) {
> > +		fuse_log(FUSE_LOG_ERR, "Cannot open %s during probe state dir deletion: %s\n",
> > +			 dirs[diroff], strerror(errno));
> > +		char foo[PATH_MAX];
> > +		sprintf(foo, "ls -l /proc/%i/fd", getpid());
> > +		system(foo);
> > +		return -1;
> > +	}
> > +
> > +        ret = unlinkat_many_internal(fd, diroff + 1, dirs);
> > +	close(fd);
> > +
> > +	if (ret < 0)
> > +		return -1;
> > +	else if (ret > 0)
> > +		return 0;			/* ENOTEMPTY */
> > +
> > +	/*
> > +	 * Now do the deletion.
> > +	 */
> > +	if (unlinkat(dirfd, dirs[diroff], flags) < 0) {
> > +		if (errno == ENOENT)
> > +			return 0;
> > +		if (errno == ENOTEMPTY)
> > +			return ENOTEMPTY;
> > +        }
> > +	return 0;
> > +
> > +err_keep_going:
> > +	fuse_log(FUSE_LOG_ERR, "Cannot unlink %s during probe state dir deletion: "
> > +		 "%s\n", dirs[diroff], err);
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Unlink a file and then a bunch of directories, as given by the names
> > + * argument (the last entry in which is a file).  Each directory in the dirs
> > + * array is a subdirectory of the one before it.  If any directory removal
> > + * fails due to non-emptiness, all later removals are skipped.  If any directory
> > + * does not exist, removals start from that point.
> > + *
> > + * The first entry must be a file: the next must be directories.  If this is
> > + * violated, just return without unlinking anything (but do log an error).
> > + */
> > +static int
> > +unlinkat_many(int dirfd, const char **dirs)
> > +{
> > +	return unlinkat_many_internal(dirfd, 0, dirs);
> > +}
> > +
> > +/*
> > + * Determine if a file or directory (in the DOF stash) should be deleted.
> > + */
> > +static int
> > +refcount_cleanup_p(int dirfd, const char *name, int isdir)
> > +{
> > +	struct stat s;
> > +
> > +	if (fstatat(dirfd, name, &s, 0) != 0) {
> > +		fuse_log(FUSE_LOG_ERR, "Cannot stat %s for cleanup: %s\n",
> > +			 name, strerror(errno));
> > +		return -1;
> > +	}
> > +
> > +	if ((isdir && s.st_nlink != 2) || (!isdir && s.st_nlink != 1))
> > +		return 0;
> > +
> > +	return 1;
> > +}
> > +
> > +
> > +/*
> > + * Delete a file or directory (in the DOF stash) if it has no other links.
> > + */
> > +static int
> > +refcount_cleanup(int dirfd, const char *name, int isdir)
> > +{
> > +	switch (refcount_cleanup_p(dirfd, name, isdir)) {
> > +	case -1: return -1;
> > +	case 0: return 0;
> > +	default: break;
> > +	}
> > +
> > +	if (unlinkat(dirfd, name, isdir ? AT_REMOVEDIR : 0) < 0) {
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot remove old DOF %s: %s\n",
> > +			 name, strerror(errno));
> > +		return -1;
> > +	}
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Remove the parsed directory under a given mapping directory.
> > + */
> > +static int
> > +dof_stash_remove_parsed(pid_t pid, int mapping_dir, const char *mapping)
> > +{
> > +	char *pid_name = make_numeric_name(pid);
> > +	int parsed_dir;
> > +	DIR *parsed;
> > +	struct dirent *ent;
> > +
> > +        if (((parsed_dir = openat(mapping_dir, "parsed", O_RDONLY | O_CLOEXEC)) < 0) ||
> > +	    ((parsed = fdopendir(parsed_dir)) == NULL)) {
> > +
> > +		if (parsed_dir >= 0)
> > +			close(parsed_dir);
> > +
> > +		if (errno == ENOENT)
> > +			return 0;
> > +
> > +                fuse_log(FUSE_LOG_ERR, "cannot open per-PID parsed DOF directory for PID %i mapping %s for cleanup: %s\n",
> > +                         pid, mapping, strerror(errno));
> > +		free(pid_name);
> > +		return -1;
> > +        }
> > +
> > +	/*
> > +	 * Scan the parsed dir, clean up all the matching per-probe
> > +	 * dirs, and delete the parsed dir's contents too.
> > +	 */
> > +
> > +        while (errno = 0, (ent = readdir(parsed)) != NULL) {
> > +		const char *mod, *fun, *prb;
> > +		const char *unlink_err;
> > +		char *prov;
> > +
> > +                if (errno != 0)
> > +			goto err;
> > +
> > +		if (ent->d_type != DT_REG)
> > +			continue;
> > +
> > +		if ((prov = strdup(ent->d_name)) == NULL)
> > +			goto err;
> > +
> > +		if (unlinkat(parsed_dir, ent->d_name, 0) < 0) {
> > +			unlink_err = "parsed dir";
> > +			goto next;
> > +		}
> > +
> > +                if (strcmp(ent->d_name, "version") == 0)
> > +			continue;
> > +
> > +		if (split_probespec_name(prov, &mod, &fun, &prb) < 0)
> > +			continue;
> > +
> > +		const char *dirnames[] = {pid_name, prov, mod, fun, prb, NULL};
> > +		if (unlinkat_many(probes_dir, dirnames) < 0) {
> > +			unlink_err = "parsed probes dir entries";
> > +			goto next;
> > +		}
> > +		free(prov);
> > +		continue;
> > +
> > +	  next:
> > +		free(prov);
> > +                fuse_log(FUSE_LOG_ERR, "cannot unlink per-PID %s %s (mapping %s) for cleanup: %s\n",
> > +			 ent->d_name, unlink_err, mapping, strerror(errno));
> > +	}
> > +	closedir(parsed);
> > +	free(pid_name);
> > +
> > +        return 0;
> > +
> > +  err:
> > +	closedir(parsed);
> > +	free(pid_name);
> > +	fuse_log(FUSE_LOG_ERR, "cannot open per-PID parsed DOF directory entry for PID %i mapping %s for cleanup: %s\n",
> > +		 pid, mapping, strerror(errno));
> > +	return -1;
> > +}
> > +
> > +/*
> > + * Remove a piece of DOF, identified by generation counter.  Return -1 on error.
> > + *
> > + * Not knowing about the DOF is not an error.
> > + */
> > +int
> > +dof_stash_remove(pid_t pid, int gen)
> > +{
> > +	char *pid_name = make_numeric_name(pid);
> > +	char *gen_name = make_numeric_name(gen);
> > +	char *gen_linkname = NULL;
> > +	int perpid_dir = -1;
> > +	int perpid_dof_dir = -1;
> > +	struct stat pid_stat;
> > +	struct stat gen_stat;
> > +	int err = -1;
> > +	const char *unlink_err = NULL;
> > +
> > +	/*
> > +	 * Figure out the per-PID DOF directory by following the gen-counter
> > +	 * symlink (by hand, because we want to reuse its name in the non-
> > +	 * PID-specific dof/ subdir too).
> > +	 */
> > +
> > +	if (!pid_name || !gen_name)
> > +		goto oom;
> > +
> > +        if ((perpid_dir = openat(pid_dir, pid_name, O_PATH | O_CLOEXEC)) < 0)
> > +		goto out;
> > +
> > +	if (fstatat(perpid_dir, gen_name, &gen_stat, AT_SYMLINK_NOFOLLOW) < 0)
> > +		goto out;
> > +
> > +	if ((gen_linkname = malloc(gen_stat.st_size + 1)) == NULL)
> > +		goto oom;
> > +
> > +	/*
> > +	 * Truncated dir -> concurrent modifications, not allowed.
> > +	 */
> > +	if (readlinkat(perpid_dir, gen_name, gen_linkname, gen_stat.st_size + 1) ==
> > +	    gen_stat.st_size + 1) {
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: internal error: state dir is under concurrent modification, maybe multiple dtprobeds are running\n");
> > +		goto err;
> > +	}
> > +	gen_linkname[gen_stat.st_size] = 0;
> > +
> > +	perpid_dof_dir = openat(perpid_dir, gen_linkname, O_PATH | O_CLOEXEC);
> > +
> > +        if (strchr(gen_linkname, '/') != NULL ||
> > +	    perpid_dof_dir < 0) {
> > +                if (perpid_dof_dir >= 0)
> > +			close(perpid_dof_dir);
> > +                fuse_log(FUSE_LOG_ERR, "dtprobed: internal error: generation link for PID %i generation %i is %s, which is not valid\n",
> > +			 pid, gen, gen_linkname);
> > +		goto err;
> > +	}
> > +
> > +	if (faccessat(dof_dir, gen_linkname, F_OK, 0) < 0) {
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: internal error: generation link for PID %i generation %i is %s, which has no raw DOF\n",
> > +			 pid, gen, gen_linkname);
> > +		goto err;
> > +	}
> > +
> > +	/*
> > +	 * Unlink the per-PID raw and parsed DOF and the probe dirs, then check
> > +	 * the link counts on the non-per-PID stuff; if zero, unlink that too.
> > +	 */
> > +
> > +	fuse_log(FUSE_LOG_DEBUG, "gen_name: %s; gen_linkname: %s; perpid_dof_dir: %i\n", gen_name, gen_linkname, perpid_dof_dir);
> > +        if (unlinkat(perpid_dof_dir, "raw", 0) != 0 && errno != ENOENT) {
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot unlink per-PID raw DOF for PID %i generation %i: %s\n",
> > +			 pid, gen, strerror(errno));
> > +	}
> > +
> > +	if (dof_stash_remove_parsed(pid, perpid_dof_dir, gen_linkname) < 0)
> > +		unlink_err = "parsed probes dir entries";
> > +
> > +	if (unlinkat(perpid_dof_dir, "parsed", AT_REMOVEDIR) < 0)
> > +		unlink_err = "parsed dir";
> > +
> > +	if (unlinkat(perpid_dof_dir, "dh", 0) < 0)
> > +		unlink_err = "dh";
> > +
> > +	if (unlinkat(perpid_dir, gen_linkname, AT_REMOVEDIR) < 0)
> > +		unlink_err = gen_linkname;
> > +
> > +	if (unlinkat(perpid_dir, gen_name, 0) < 0)
> > +		unlink_err = gen_name;
> > +
> > +	if (refcount_cleanup(dof_dir, gen_linkname, 0) < 0)
> > +		unlink_err = gen_linkname;
> > +
> > +	refcount_cleanup(dof_dir, gen_linkname, 0);
> > +
> > +	/*
> > +	 * Clean up the PID directory itself, if it is now empty.  We can't just
> > +	 * use refcount_cleanup here because we also have to delete the
> > +	 * generation counter right before deleting the directory.
> > +	 */
> > +
> > +	if (fstatat(pid_dir, pid_name, &pid_stat, 0) < 0) {
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot stat per-PID for pid %i: %s\n",
> > +			 pid, strerror(errno));
> > +		goto err;
> > +	}
> > +
> > +	if (pid_stat.st_nlink == 2) {
> > +		if (unlinkat(perpid_dir, "next-gen", 0) < 0) {
> > +			fuse_log(FUSE_LOG_ERR, "dtprobed: cannot clean up gen counter in per-PID dir for PID %i: %s\n",
> > +				 pid, strerror(errno));
> > +			goto err;
> > +		}
> > +		refcount_cleanup(pid_dir, pid_name, 1);
> > +	}
> > +
> > +	if (unlink_err)
> > +		fuse_log(FUSE_LOG_ERR, "dtprobed: cannot unlink per-PID DOF %s for PID %i generation %i: %s\n",
> > +			 unlink_err, pid, gen, strerror(errno));
> > +
> > + out:
> > +	err = 0;
> > +
> > + err:
> > +        free(pid_name);
> > +	free(gen_name);
> > +	free(gen_linkname);
> > +
> > +        maybe_close(perpid_dir);
> > +        maybe_close(perpid_dof_dir);
> > +
> > +	return err;
> > +
> > +oom:
> > +	fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory unlinking DOF\n");
> > +	goto err;
> > +}
> > +
> > +/*
> > + * Prune dead processes out of the DOF stash.  This is not a correctness
> > + * operation, just a space-waste reducer. If a process with DOF died uncleanly
> > + * and its PID was recycled since a cleanup was last run, we'll leave some DOF
> > + * around for a bit, but there are no correctness implications: either the new
> > + * process has no DOF at all, in which case we'll never learn about it, or it
> > + * does, in which case either it has identical mappings (in which case we'll
> > + * just reuse them) or it has different ones (in which case we'll just add to
> > + * them): in neither case is there any negative consequence of the nonexistent
> > + * mappings other than a bit of space wastage.  (This implies that any process
> > + * reading DOF out of the state directory must check that a process that claims
> > + * to have DOF at a given mapping actually has a corresponding mapping with the
> > + * same dev/ino pair at the right address. There's no need for that process to
> > + * clean it up if found to be wrong, though: this function will eventually get
> > + * around to it.)
> > + */
> > +void
> > +dof_stash_prune_dead(void)
> > +{
> > +	DIR *all_pids_dir;
> > +	struct dirent *pid_ent;
> > +	int tmp;
> > +
> > +	fuse_log(FUSE_LOG_DEBUG, "Pruning dead PIDs\n");
> > +
> > +	if ((tmp = dup(pid_dir)) < 0) {
> > +		fuse_log(FUSE_LOG_ERR, "cannot dup() for cleanup: %s\n", strerror(errno));
> > +		return;
> > +	}
> > +
> > +        if ((all_pids_dir = fdopendir(tmp)) == NULL) {
> > +		close(tmp);
> > +		fuse_log(FUSE_LOG_ERR, "cannot clean up per-PID DOF directory: %s\n",
> > +			 strerror(errno));
> > +		return;
> > +	}
> > +	rewinddir(all_pids_dir);
> > +
> > +	/*
> > +	 * Work over all the PIDs.
> > +	 */
> > +	while (errno = 0, (pid_ent = readdir(all_pids_dir)) != NULL) {
> > +		DIR *perpid_dir;
> > +		struct dirent *mapping_ent;
> > +		char *end_pid;
> > +		pid_t pid;
> > +
> > +		if (errno != 0)
> > +			goto scan_failure;
> > +
> > +		if (pid_ent->d_type != DT_DIR)
> > +			continue;
> > +
> > +		/*
> > +		 * Only directories with numeric names can be PIDs: skip all the
> > +		 * rest.
> > +		 */
> > +		pid = strtol(pid_ent->d_name, &end_pid, 10);
> > +		if (*end_pid != '\0')
> > +			continue;
> > +
> > +		if (kill(pid, 0) == 0 || errno != ESRCH)
> > +			continue;
> > +
> > +		/*
> > +		 * Process dead: clean up its PIDs, deleting everything in the
> > +		 * per-PID directory by calling dof_stash_remove() on every
> > +		 * mapping in turn.
> > +		 */
> > +
> > +		if ((tmp = openat(pid_dir, pid_ent->d_name, O_RDONLY | O_CLOEXEC)) < 0) {
> > +			fuse_log(FUSE_LOG_ERR,"cannot open per-PID DOF mappings directory for cleanup: %s\n",
> > +				 strerror(errno));
> > +			goto out;
> > +		}
> > +
> > +		if ((perpid_dir = fdopendir(tmp)) == NULL) {
> > +			fuse_log(FUSE_LOG_ERR, "cannot clean up per-PID DOF mappings: %s\n",
> > +				 strerror(errno));
> > +			close(tmp);
> > +			goto out;
> > +		}
> > +
> > +		fuse_log(FUSE_LOG_DEBUG, "Pruning dead PID %s\n", pid_ent->d_name);
> > +
> > +		/* Work over all the mappings in this PID. */
> > +
> > +                while (errno = 0, (mapping_ent = readdir(perpid_dir)) != NULL) {
> > +			int gen;
> > +			char *end_gen;
> > +
> > +                        if (errno != 0) {
> > +				fuse_log(FUSE_LOG_ERR, "cannot read per-PID DOF mappings for cleanup: %s\n",
> > +					 strerror(errno));
> > +				closedir(perpid_dir);
> > +				goto out;
> > +			}
> > +
> > +			if (mapping_ent->d_type != DT_LNK)
> > +				continue;
> > +
> > +			fuse_log(FUSE_LOG_DEBUG, "Working over generation %s\n", mapping_ent->d_name);
> > +
> > +			gen = strtol(mapping_ent->d_name, &end_gen, 10);
> > +			if (*end_gen != '\0')
> > +				continue;
> > +
> > +                        dof_stash_remove(pid, gen);
> > +		}
> > +		closedir(perpid_dir);
> > +		errno = 0;
> > +	}
> > +
> > +	if (errno != 0)
> > +		goto scan_failure;
> > +
> > +out:
> > +	closedir(all_pids_dir);
> > +	return;
> > +
> > +scan_failure:
> > +	fuse_log(FUSE_LOG_ERR, "cannot scan per-PID DOF directory for cleanup: %s\n",
> > +		 strerror(errno));
> > +	goto out;
> > +}
> > +
> > +/*
> > + * Read a file into a buffer and return it.  If READ_SIZE is non-negative, read
> > + * only that many bytes (and silently fail if the file isn't at least that
> > + * long).
> > + */
> > +static void *
> > +read_file(int fd, ssize_t read_size, size_t *size)
> > +{
> > +	struct stat s;
> > +	char *buf;
> > +	char *bufptr;
> > +	int len;
> > +
> > +        if (fstat(fd, &s) < 0) {
> > +		fuse_log(FUSE_LOG_ERR, "cannot stat: %s\n", strerror(errno));
> > +		return NULL;
> > +	}
> > +	if (read_size >= 0) {
> > +		if (s.st_size < read_size)
> > +			return NULL;
> > +	} else
> > +		read_size = s.st_size;
> > +
> > +	if ((buf = malloc(read_size)) == NULL) {
> > +		fuse_log(FUSE_LOG_ERR, "out of memory allocating %zi bytes\n",
> > +			 read_size);
> > +		return NULL;
> > +	}
> > +
> > +	if (size)
> > +		*size = read_size;
> > +
> > +        bufptr = buf;
> > +	while ((len = read(fd, bufptr, read_size)) < read_size) {
> > +		if (len < 0) {
> > +			if (errno != EINTR) {
> > +				fuse_log(FUSE_LOG_ERR, "cannot read: %s",
> > +					 strerror(errno));
> > +				free(buf);
> > +				return NULL;
> > +			}
> > +			continue;
> > +		}
> > +		read_size -= len;
> > +		bufptr += len;
> > +	}
> > +	return buf;
> > +}
> > +
> > +/*
> > + * Reparse all DOF.  Mappings that cannot be reparsed are simply ignored, on the
> > + * grounds that most DOF, most of the time, is not used, so this will likely be
> > + * ignorable.
> > + *
> > + * Normally, DOF with the same DOF_VERSION as the currently running daemon is
> > + * left alone: if force is set, even that is reparsed.
> > + */
> > +int
> > +reparse_dof(int out, int in,
> > +	    int (*reparse)(int pid, int out, int in, dev_t dev, ino_t inum,
> > +			   dof_helper_t *dh, const void *in_buf, size_t in_bufsz,
> > +			   int reparsing),
> > +	    int force)
> > +{
> > +	DIR *all_pids_dir;
> > +	struct dirent *pid_ent;
> > +	int err = -1;
> > +	int tmp;
> > +
> > +	if ((tmp = dup(pid_dir)) < 0) {
> > +		fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot dup(): %s\n", strerror(errno));
> > +		return -1;
> > +	}
> > +
> > +        if ((all_pids_dir = fdopendir(tmp)) == NULL) {
> > +		close(tmp);
> > +		fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot read per-PID DOF directory: %s\n",
> > +			 strerror(errno));
> > +		return -1;
> > +	}
> > +
> > +	/* Work over all the PIDs. */
> > +
> > +	rewinddir(all_pids_dir);
> > +	while (errno = 0, (pid_ent = readdir(all_pids_dir)) != NULL) {
> > +		int perpid_fd;
> > +		char *end_pid;
> > +		pid_t pid;
> > +		DIR *perpid_dir;
> > +		struct dirent *mapping_ent;
> > +
> > +		if (errno != 0)
> > +			goto scan_failure;
> > +
> > +		if (pid_ent->d_type != DT_DIR)
> > +			continue;
> > +
> > +		/*
> > +		 * Only directories with numeric names can be PIDs: skip all the
> > +		 * rest.
> > +		 */
> > +		pid = strtol(pid_ent->d_name, &end_pid, 10);
> > +		if (*end_pid != '\0')
> > +			continue;
> > +
> > +                if ((perpid_fd = openat(pid_dir, pid_ent->d_name, O_RDONLY | O_CLOEXEC)) < 0) {
> > +			fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot open per-PID DOF mappings directory for PID %s: %s\n",
> > +				 pid_ent->d_name, strerror(errno));
> > +			goto out;
> > +		}
> > +
> > +		if ((tmp = dup(perpid_fd)) < 0) {
> > +			fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot dup(): %s\n", strerror(errno));
> > +			close(perpid_fd);
> > +			goto out;
> > +		}
> > +
> > +		if ((perpid_dir = fdopendir(tmp)) == NULL) {
> > +			fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot read per-PID DOF mappings for pid %s: %s\n",
> > +				 pid_ent->d_name, strerror(errno));
> > +			close(tmp);
> > +			close(perpid_fd);
> > +			goto out;
> > +		}
> > +
> > +		/* Work over all the mappings in this PID. */
> > +
> > +                while (errno = 0, (mapping_ent = readdir(perpid_dir)) != NULL) {
> > +			int mapping_fd;
> > +			int fd;
> > +			dev_t dev;
> > +			ino_t ino;
> > +			size_t dof_size, dh_size;
> > +			void *dof = NULL;
> > +			void *dh = NULL;
> > +
> > +                        if (errno != 0) {
> > +				fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot read per-PID DOF mappings for pid %s: %s\n",
> > +					 pid_ent->d_name, strerror(errno));
> > +				goto perpid_err;
> > +			}
> > +
> > +			if (mapping_ent->d_type != DT_DIR)
> > +				continue;
> > +
> > +			/*
> > +			 * Hidden files cannot be mappings.  (This also skips
> > +			 * . and ..).
> > +			 */
> > +			if (mapping_ent->d_name[0] == '.')
> > +				continue;
> > +
> > +                        if ((mapping_fd = openat(perpid_fd, mapping_ent->d_name,
> > +						 O_PATH | O_CLOEXEC)) < 0) {
> > +				fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot open per-PID DOF mapping for PID %s, mapping %s: %s\n",
> > +					 pid_ent->d_name, mapping_ent->d_name, strerror(errno));
> > +				goto perpid_err;
> > +			}
> > +
> > +			fuse_log(FUSE_LOG_DEBUG, "Parsed DOF is present: evaluating\n");
> > +
> > +			/*
> > +			 * Only DOF with a different parsed representation gets
> > +			 * reparsed.
> > +			 */
> > +			if (!force) {
> > +				if ((fd = openat(mapping_fd, "parsed/version",
> > +						 O_RDONLY | O_CLOEXEC)) == 0) {
> > +					uint64_t *parsed_version;
> > +
> > +					parsed_version = read_file(fd, sizeof(uint64_t),
> > +								   NULL);
> > +					close(fd);
> > +
> > +					if (parsed_version != NULL &&
> > +					    *parsed_version == DOF_PARSED_VERSION) {
> > +						fuse_log(FUSE_LOG_DEBUG, "No need to reparse\n");
> > +						continue;
> > +					}
> > +
> > +					free(parsed_version);
> > +				}
> > +			}
> > +
> > +			fuse_log(FUSE_LOG_DEBUG, "Deleting\n");
> > +
> > +			if (dof_stash_remove_parsed(pid, mapping_fd, mapping_ent->d_name) < 0) {
> > +				fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot delete parsed DOF for PID %s, mapping %s: %s\n",
> > +					 pid_ent->d_name, mapping_ent->d_name, strerror(errno));
> > +				close(mapping_fd);
> > +				goto perpid_err;
> > +			}
> > +
> > +			if (split_dof_name(mapping_ent->d_name, &dev, &ino) < 0) {
> > +				fuse_log(FUSE_LOG_ERR, "when reparsing DOF for PID %s, cannot derive dev/inum from %s: ignored\n",
> > +					 pid_ent->d_name, mapping_ent->d_name);
> > +				continue;
> > +			}
> > +
> > +			if ((fd = openat(mapping_fd, "raw", O_RDONLY | O_CLOEXEC)) < 0) {
> > +				fuse_log(FUSE_LOG_ERR, "when reparsing DOF, cannot open raw DOF for PID %s, mapping %s: ignored: %s\n",
> > +					 pid_ent->d_name, mapping_ent->d_name, strerror(errno));
> > +				close(mapping_fd);
> > +				continue;
> > +			}
> > +
> > +			if ((dof = read_file(fd, -1, &dof_size)) == NULL) {
> > +				fuse_log(FUSE_LOG_ERR, "when reparsing DOF, cannot read raw DOF for PID %s, mapping %s: ignored: %s\n",
> > +					    pid_ent->d_name, mapping_ent->d_name, strerror(errno));
> > +				close(mapping_fd);
> > +				close(fd);
> > +				continue;
> > +			}
> > +			close(fd);
> > +
> > +			if ((fd = openat(mapping_fd, "dh", O_RDONLY | O_CLOEXEC)) < 0) {
> > +				fuse_log(FUSE_LOG_ERR, "when reparsing DOF, cannot open dh for PID %s, mapping %s: ignored: %s\n",
> > +					    pid_ent->d_name, mapping_ent->d_name, strerror(errno));
> > +				free(dof);
> > +				close(mapping_fd);
> > +				continue;
> > +			}
> > +
> > +			if ((dh = read_file(fd, -1, &dh_size)) == NULL) {
> > +				fuse_log(FUSE_LOG_ERR, "when reparsing DOF, cannot read dh for PID %s, mapping %s: ignored: %s\n",
> > +					    pid_ent->d_name, mapping_ent->d_name, strerror(errno));
> > +				free(dof);
> > +				close(mapping_fd);
> > +				close(fd);
> > +				continue;
> > +			}
> > +			close(fd);
> > +
> > +			fuse_log(FUSE_LOG_DEBUG, "Reparsing DOF for PID %s, mapping %s\n",
> > +				 pid_ent->d_name, mapping_ent->d_name);
> > +			if (reparse(pid, out, in, dev, ino, dh, dof, dof_size, 1) < 0)
> > +				fuse_log(FUSE_LOG_ERR, "when reparsing DOF, cannot parse DOF for PID %s, mapping %s: ignored\n",
> > +					    pid_ent->d_name, mapping_ent->d_name);
> > +			free(dof);
> > +			free(dh);
> > +			close(mapping_fd);
> > +
> > +			continue;
> > +
> > +		perpid_err:
> > +			closedir(perpid_dir);
> > +			close(perpid_fd);
> > +			goto out;
> > +		}
> > +		closedir(perpid_dir);
> > +		close(perpid_fd);
> > +		errno = 0;
> > +	}
> > +
> > +	if (errno != 0)
> > +		goto scan_failure;
> > +
> > +	err = 0;
> > +out:
> > +	closedir(all_pids_dir);
> > +	return err;
> > +
> > +scan_failure:
> > +	fuse_log(FUSE_LOG_ERR, "reparsing DOF: cannot scan per-PID DOF directory: %s\n",
> > +		 strerror(errno));
> > +	goto out;
> > +}
> > diff --git a/dtprobed/dof_stash.h b/dtprobed/dof_stash.h
> > new file mode 100644
> > index 000000000000..2754a8180622
> > --- /dev/null
> > +++ b/dtprobed/dof_stash.h
> > @@ -0,0 +1,44 @@
> > +/*
> > + * Oracle Linux DTrace; DOF storage for later probe removal.
> > + * Copyright (c) 2023, Oracle and/or its affiliates. All rights reserved.
> > + * Licensed under the Universal Permissive License v 1.0 as shown at
> > + * http://oss.oracle.com/licenses/upl.
> > + */
> > +
> > +#ifndef	_DOF_STASH_H
> > +#define	_DOF_STASH_H
> > +
> > +#include <sys/types.h>
> > +#include <stdint.h>
> > +
> > +#include <dt_list.h>
> > +#include "dof_parser.h"
> > +
> > +typedef struct dof_parsed_list {
> > +	dt_list_t list;
> > +	dof_parsed_t *parsed;
> > +} dof_parsed_list_t;
> > +
> > +int dof_stash_init(const char *statedir);
> > +
> > +int dof_stash_push_parsed(dt_list_t *accum, dof_parsed_t *parsed);
> > +
> > +int dof_stash_write_parsed(pid_t pid, dev_t dev, ino_t ino, dt_list_t *accum);
> > +
> > +void dof_stash_flush(dt_list_t *accum);
> > +
> > +int dof_stash_add(pid_t pid, dev_t dev, ino_t ino, const dof_helper_t *dh,
> > +		  const void *dof, size_t size);
> > +
> > +int dof_stash_remove(pid_t pid, int gen);
> > +
> > +void dof_stash_prune_dead(void);
> > +
> > +int
> > +reparse_dof(int out, int in,
> > +	    int (*reparse)(int pid, int out, int in, dev_t dev, ino_t inum,
> > +			   dof_helper_t *dh, const void *in_buf,
> > +			   size_t in_bufsz, int reparsing),
> > +	    int force);
> > +
> > +#endif
> > diff --git a/dtprobed/dtprobed.service b/dtprobed/dtprobed.service
> > index 95f0a1d6d606..70443f08fb9b 100644
> > --- a/dtprobed/dtprobed.service
> > +++ b/dtprobed/dtprobed.service
> > @@ -20,6 +20,8 @@ ProtectHome=true
> >  PrivateDevices=false
> >  PrivateNetwork=true
> >  ProtectControlGroups=true
> > +RuntimeDirectory=/run/dtrace
> > +RuntimeDirectoryPreserve=restart
> >  
> >  [Install]
> >  WantedBy=basic.target
> > diff --git a/dtrace.spec b/dtrace.spec
> > index 777dd28ded84..c1fbd17f4180 100644
> > --- a/dtrace.spec
> > +++ b/dtrace.spec
> > @@ -185,6 +185,10 @@ make DESTDIR=$RPM_BUILD_ROOT VERSION=%{version} \
> >       HDRPREFIX="$RPM_BUILD_ROOT/usr/include" \
> >       install install-test
> >  
> > +%if "%{?dist}" == ".el7"
> > +sed -i '/^ProtectSystem=/d; /^ProtectControlGroups=/d; /^RuntimeDirectory/d;' $RPM_BUILD_ROOT/usr/lib/systemd/system/dtprobed.service
> > +%endif
> > +
> >  # Because systemtap creates a sdt.h header file we have to rename
> >  # ours and then shift theirs out of the way.
> >  mv $RPM_BUILD_ROOT/usr/include/sys/sdt.h \
> > diff --git a/libcommon/dof_parser.h b/libcommon/dof_parser.h
> > index 7ab18f4dee01..41ef5a193b7f 100644
> > --- a/libcommon/dof_parser.h
> > +++ b/libcommon/dof_parser.h
> > @@ -1,6 +1,6 @@
> >  /*
> >   * Oracle Linux DTrace; DOF parser interface with the outside world
> > - * Copyright (c) 2022, Oracle and/or its affiliates. All rights reserved.
> > + * Copyright (c) 2022, 2023, Oracle and/or its affiliates. All rights reserved.
> >   * Licensed under the Universal Permissive License v 1.0 as shown at
> >   * http://oss.oracle.com/licenses/upl.
> >   */
> > @@ -15,11 +15,10 @@
> >  #include <dtrace/helpers.h>
> >  
> >  /*
> > - * Result of DOF probe parsing (currently used only for probe creation. We
> > - * receive a provider info structure, followed by N probe info structures each
> > - * of which is followed by possibly many tracepoint info structures, all tagged.
> > - * Things not useful for probe creation (like args, translated types, etc) are
> > - * not returned.
> > + * Result of DOF probe parsing.  We receive a provider info structure, followed
> > + * by N probe info structures each of which is followed by possibly many
> > + * tracepoint info structures, all tagged.  Things not useful for probe creation
> > + * (like args, translated types, etc) are not returned.
> >   *
> >   * On error, a DIT_ERR structure is returned with an error message.
> >   */
> > @@ -31,6 +30,15 @@ typedef enum dof_parsed_info {
> >  	DIT_ERR = 3
> >  } dof_parsed_info_t;
> >  
> > +/*
> > + * Bump this whenever dof_parsed changes.
> > + *
> > + * Files consisting of arrays of dof_parsed have a single 64-bit word at the
> > + * start which is the version of the dof_parseds within it.  The data flowing
> > + * over the stream from the seccomped parser has no such prefix.
> > + */
> > +#define DOF_PARSED_VERSION 1
> > +
> >  typedef struct dof_parsed {
> >  	/*
> >  	 * Size of this instance of this structure.
> > -- 
> > 2.43.0.272.gce700b77fd
> > 
> > 
> > _______________________________________________
> > DTrace-devel mailing list
> > DTrace-devel at oss.oracle.com
> > https://oss.oracle.com/mailman/listinfo/dtrace-devel
> 
> _______________________________________________
> DTrace-devel mailing list
> DTrace-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/dtrace-devel



More information about the DTrace-devel mailing list