[DTrace-devel] [PATCH v3 06/21] dtprobed: add the DOF stash

Thu Feb 15 15:41:58 UTC 2024

On Thu, Feb 15, 2024 at 01:13:17PM +0000, Nick Alcock wrote:
> On 15 Feb 2024, Kris Van Hees uttered the following:
> 
> > On Tue, Jan 16, 2024 at 09:13:02PM +0000, Nick Alcock via DTrace-devel wrote:
> >> +/*
> >> + * Write out all the parsed DOF from the passed-in accumulator.  Split out from
> >> + * dof_stash_add() because we don't want to create anything in the DOF stash for
> >> + * a given piece of DOF until we have verified that all the parsed data for all
> >> + * the probes in that DOF is valid.
> >> + *
> >> + * Each parsed file starts with a single 64-bit word giving the version of
> >> + * dof_parsed_t in use, to ensure that readers never use an incompatible
> >> + * version, and a single uint64_t giving the number of tracepoints following,
> >> + * each of which is a struct dof_parsed_t. The name of the probe is not given,
> >> + * since the filename (or, for the probes/ dirs, the file and directory names)
> >> + * already contains it.
> >> + *
> >> + * We also write the parsed version out to the parsed/version file in the parsed
> >> + * mapping dir, to make it easier to determine whether to reparse all the DOF
> >> + * for outdatedness.
> >> + */
> >> +int
> >> +dof_stash_write_parsed(pid_t pid, dev_t dev, ino_t ino, dt_list_t *accum)
> >> +{
> >> +	char *pid_name = make_numeric_name(pid);
> >> +	char *dof_name = make_dof_name(dev, ino);
> >> +	char *provpid_name = NULL;
> >> +	int perpid_dir, perpid_dof_dir, probepid_dir, parsed_mapping_dir,
> >> +		prov_dir = -1, mod_dir = -1, fun_dir = -1;
> >> +        dof_parsed_list_t *accump;
> >> +	int version_fd, fd = -1;
> >
> > Why oh why use 'fd' when all other directory and file descriptors have actual
> > descriptive names?  Especially since fd is carried across multiple iterations
> > in the loop.
> 
> Yes that's awful naming, fixed to parsed_fd.
> 
> >> +	const char *op;
> >> +	const char *probe_err = "unknown probe";
> >> +	uint64_t version = DOF_PARSED_VERSION;
> >> +	int state = -1;
> >> +	int err;
> >> +
> >> +	if (dt_list_next(accum) == NULL) {
> >> +		free(pid_name);
> >> +		free(dof_name);
> >> +		return 0;
> >> +	}
> >
> > Why not
> >
> > 	if (dt_list_next(accum) == NULL)
> > 		goto early_out;
> >
> > and set a label below at the successful case return that ends with these exact
> > three statements?  Might as well, given the presence of so many other gotos
> > anyway in this function.
> 
> Nicely spotted. Done.
> 
> >> +
> >> +        if ((perpid_dir = openat(pid_dir, pid_name, O_PATH | O_CLOEXEC)) < 0) {
> >> +		fuse_log(FUSE_LOG_ERR, "dtprobed: PID %i, %lx/%lx: open failed for PID dir for parsed DOF write: %s\n",
> >> +			 pid, dev, ino, strerror(errno));
> >> +		goto early_err;
> >> +	}
> >> +
> >> +	op = "per-mapping open";
> >> +        if ((perpid_dof_dir = openat(perpid_dir, dof_name, O_PATH | O_CLOEXEC)) < 0)
> >> +		goto close_perpid_err;
> >> +
> >> +	op = "per-mapping parsed DOF mkdirat";
> >> +	if ((parsed_mapping_dir = make_state_dirat(perpid_dof_dir, "parsed", op, 0)) < 0)
> >> +		goto close_perpid_dof_err;
> >> +
> >> +	op = "probe PID mkdirat";
> >> +	if ((probepid_dir = make_state_dirat(probes_dir, pid_name, op, 0)) < 0)
> >> +		goto remove_parsed_mapping_err;
> >> +
> >> +        if ((version_fd = openat(parsed_mapping_dir, "version",
> >> +				 O_CREAT | O_TRUNC | O_WRONLY, 0644)) < 0)
> >> +		goto remove_probepid_dir_err;
> >> +
> >> +	if (write_chunk(version_fd, &version, sizeof(uint64_t)) < 0) {
> >> +		close(version_fd);
> >
> > Shouldn't this perform an unlinkat() on version_fd?
> 
> This is done in remove_probepid_dir_err, via
> 
> 	unlinkat(parsed_mapping_dir, "version", 0);
> 
> >> +		goto remove_probepid_dir_err;
> >
> > The code path starting at the remove_probepid_dir_err label does not actually
> > contain an unlinkat() for probepid_dir, so it never gets cleaned up in case of
> > error?
> 
> What? It contains
> 
> 	unlinkat(probes_dir, pid_name, AT_REMOVEDIR);
> 
> which is the same as the thing probepid_dir is an fd to:
> 
> 	if ((probepid_dir = make_state_dirat(probes_dir, pid_name, op, 0)) < 0)
> 
> You cannot remove probepid_dir (or version_fd): unlinkat() takes an fd
> to a directory and a pathname, not an fd to a file. (Which is *really
> annoying* and made auditing this for leaks much harder than it should
> have been.)

Ah, I missed that.  Thank you.

> >> +		/*
> >> +		 * Probe: make (mod, fun, probe) dirs as needed: close
> >> +		 * out file, open new one, make hardlink to it.
> >> +		 */
> >> +		case DIT_PROBE: {
> >> +                        assert(state >= 0);
> >
> > This is ok, but DIT_TRACEPOINT uses:
> >
> > 			assert(state == DIT_PROBE || state == DIT_TRACEPOINT);
> >
> > which is much better, right?  So why not use that here?
> 
> We're not checking the same thing. DIT_TRACEPOINT needs to validate that
> the tracepoint is preceded by a probe (or a probe and a string of
> tracepoints): DIT_PROBE only needs there to have been at least one
> provider (which means the state might be *anything* this state machine
> can reach, but cannot be -1, the before-we-started state).

Maybe make it explicitly list all three possibilities?  That makes it more
clear I think, and avoids problems if the state machine might need to be
expanded in the future to include other cases where DIT_PROBE is not valid
after just any previous state.

> (This is all outright asserts because it can't fire unless dof_parser.c
> is buggy anyway.)
> 
> >> +			/*
> >> +			 * This rather weird pattern carries out all closes and
> >> +			 * then errors if any of them fail.
> >> +			 */
> >> +                        err = maybe_close(mod_dir);
> >> +			err |= maybe_close(fun_dir);
> >> +			err |= maybe_close(fd);
> >
> > Why not do this at the beginning of the DIT_PROBE processing?  Obviously, if
> > a previous DIT_PROBE existed, this new one ends the previous one.  There does
> > not seem to be any point in not doing these closes immediately.  Even if the
> > processing of this probe fails somehow, the previous one is known to be done.
> 
> I did that originally, but it meant duplicating all of this three times,
> at the start of DIT_PROBE, after the loop, and in the error handlers.
> There's too much error handling and closing noise in this function already...

You already have it three times in this function.  I merely mean that this
instance can be moved to the beginning of the DIT_PROBE case, which seems
better.  It looks odd that other work is done first, even though it does not
in any way affect whether these closes are done, and it also does not depend
on them not being done yet.

> >> +			op = "parsed version write";
> >> +			if (write_chunk(fd, &version, sizeof(uint64_t)) < 0)
> >> +				goto err_probe_link;
> >> +
> >> +			op = "parsed nprobes write";
> >> +			_Static_assert(sizeof(accump->parsed->probe.ntp) == sizeof(uint64_t),
> >> +				       "chunk write assumes parsed.probe.ntp is a uint64_t");
> >
> > Why not use a regular assert or something else?  Since this is the only use
> > of static assertions in the entire code base, perhaps use the typical assert
> > suffices (and we can always do an audit of the code base and introduce static
> > assertions if we find there are enough uses for it)?
> 
> Mostly because this is something that is statically detectable because
> it is purely related to type sizes, and I find it a lot more helpful to
> get compile-time errors when that sort of thing happens than runtime
> assertions. This seems like the same question as "why don't we just make
> everything void * and check types at runtime": well, if you've got any
> choice at all you don't do that, no.
> 
> Static assertions are over a decade old now. Every compiler we support
> has supported them for many years. We can probably use them more :)
> 
> Leaking them in bit by bit seems more likely to *actually get done* than
> some future vast sweep which feels like a lot more work and is
> correspondingly unlikely to ever happen. There is nothing about static
> assertions that requires them to be done wholesale!
> 
> >> +	close(perpid_dir);
> >> +	free(provpid_name);
> >
> > early_out: (see above for explanation)
> 
> Ack!
> 
> >> +/*
> >> + * Write out the DOF helper.  A trivial wrapper.
> >> + */
> >> +static
> >> +int dof_stash_write_dh(int dirfd, const dof_helper_t *dh)
> >> +{
> >> +	int fd;
> >> +
> >> +        if ((fd = openat(dirfd, "dh", O_CREAT | O_EXCL | O_WRONLY, 0644)) < 0)
> >> +		goto err;
> >> +
> >> +	if (write_chunk(fd, dh, sizeof(dof_helper_t *)) < 0) {
> >> +		close(fd);
> >> +		goto err;
> >> +	}
> >> +
> >> +        if (close(fd) < 0)
> >> +		goto err;
> >> +
> >> +	return 0;
> >> +
> >> + err:
> >> +	fuse_log(FUSE_LOG_ERR, "dtprobed: cannot write out DOF helper: %s\n",
> >> +		 strerror(errno));
> >
> > I would think that this function takes care of the unlink if the writee fails.
> > Why leave that to the caller, given that this function actually creates the
> > file?
> 
> Good point, fixed. (With luck this function can go away before the next
> round if relocs truly are unnecessary.)

They are - code review of anything related to them shows that very clearly.
THis does not mean that they may not get used in the future, but right now,
there is nothing that uses them in the current version of DTrace.

> >> +	/*
> >> +	 * Write out the raw DOF, if not already known, and link it into its
> >> +	 * per-PID dir: write the helper directly out.  dof_stash_write tells us
> >> +	 * whether the DOF actually needed writing or not (via a straight size
> >> +	 * comparison, which suffices to spot almost everything other than the
> >> +	 * pathological case of stale DOF from a killed process whose inode is
> >> +	 * unlinked and replaced with another identically-numbered inode with
> >> +	 * DOF of the *exact same length*: this seems too unlikely to be worth
> >> +	 * bothering with.)
> >> +	 */
> >> +	new_dof = 1;
> >> +	switch (dof_stash_write(dof_dir, dof_name, dof, size)) {
> >> +	case 0: new_dof = 0; break;
> >> +	case -1: goto err_unlink_nomsg; break;
> >> +	}
> >
> > Why this odd switch statement?  For one, the -1 case doesn't need a break
> > because it has a goto right before that break.  Couldn't this be simply
> > written as:
> >
> > 	new_dof = 1;
> > 	if (dof_stash_write(dof_dir, dof_name, dof, size) == -1)
> > 		goto err_unlink_nomsg;
> > 	new_dof = 0;
> >
> > or even:
> >
> > 	new_dof = dof_stash_write(dof_dir, dof_name, dof, size);
> > 	if (new_dof == -1)
> > 		goto err_unlink_nomsg;
> 
> dof_stash_write_raw() has a tripartite return value:
> 
>  -1: error
>  >0: DOF written, size given
>  0: *no DOF written*.
> 
> The case statement is handling that, much as it would do for fork()
> (which I modelled this on, return-value-wise).
> 
> I've augmented the comment above it to indicate that.

Ah, my fault.  I was looking at the wrong function when checking what the
return values of dof_stash_write() are.  My stupidity - sorry.

> >> +	if ((fd = openat(dirfd, dirs[diroff], O_RDONLY | O_CLOEXEC)) < 0) {
> >> +		fuse_log(FUSE_LOG_ERR, "Cannot open %s during probe state dir deletion: %s\n",
> >> +			 dirs[diroff], strerror(errno));
> >> +		char foo[PATH_MAX];
> >> +		sprintf(foo, "ls -l /proc/%i/fd", getpid());
> >> +		system(foo);
> >
> > Isn't this a potential security concern?  Which 'ls' is going to get executed?
> > It seems to me that the potential red flags that this could raise do not really
> > justify the limited benefit of havinng this output collected.  Is this worth
> > it?
> 
> As the variable names indicate, that is debugging code I failed to
> remove, dammit! Ripped out with *extreme* prejudice. DAMMIT. I scanned
> for all of this before committing and *still* missed a bit.
> 
> The actual code we wanted is a fuse_log() and a return -1, as usual.
> 
> >> +/*
> >> + * Unlink a file and then a bunch of directories, as given by the names
> >> + * argument (the last entry in which is a file).  Each directory in the dirs
> >> + * array is a subdirectory of the one before it.  If any directory removal
> >> + * fails due to non-emptiness, all later removals are skipped.  If any directory
> >> + * does not exist, removals start from that point.
> >
> > How do I reconcile this comment with the actual function?  Where is a file
> > being unlinked?  What names argument?
> 
> dirs argument, whoops.
> 
> The actual function is your usual "thin base case wrapper around a
> recursive function", so the meat of it is above.
> 
> >> + *
> >> + * The first entry must be a file: the next must be directories.  If this is
> >> + * violated, just return without unlinking anything (but do log an error).
> >
> > First entry of what?  Next of what?  Where is this validated?
> 
> I kind of assumed you could look at its single use:
> 
> 	const char *dirnames[] = {pid_name, prov, mod, fun, prb, NULL};
> 	if (unlinkat_many(probes_dir, dirnames) < 0) {

Sure, but since that conflicted with the comment, I was confused.

> (I generalized it because I just bet we'll be needing this again in
> dtprobed when we get systemwide tracing, for the non-pid-containing
> /run/dtrace/probes/$prov/$mod/$fun/$prb dir.)

Since you are generalizing anyway, why not handle the case where the last
element is a directory as well?  So this could be used to get rid of an
empty directory as well?  It seems odd to make a generalized function that
has such a specific condition on it that seems unnecessary.  Removing a chain
of directories of which the end is an empty directory seems useful.

And you already pass an argument dirnames which would imply that its elements
are directory names.  The fact that the last element (as you say below is to
be a file is rather counter-intuitive for something names dirnames.

> The comment is wrong anyway, bloody dyslexia:
> 
>  * The last entry in dirs must be a file: all before it must be directories.  If
>  * f this is violated, just return without unlinking anything (but do log an
>  * error).
> 
> i.e. this acts like rm -d $pid_name/$prov/$mod/$fun/$prb. I've added a
> reference to rm -rd in the comment as well.
> 
> >> +	/*
> >> +	 * Truncated dir -> concurrent modifications, not allowed.
> >> +	 */
> >> +	if (readlinkat(perpid_dir, gen_name, gen_linkname, gen_stat.st_size + 1) ==
> >> +	    gen_stat.st_size + 1) {
> >> +		fuse_log(FUSE_LOG_ERR, "dtprobed: internal error: state dir is under concurrent modification, maybe multiple dtprobeds are running\n");
> >
> > OK, this could happen (but never should).  But isn't the possibility of
> > concurrent modifications pretty much ignored everywhere else?  If it is a
> > possible concern, wouldn't it need to be addressed everywhere (and honestly,
> > in that case there is nothing that can be done I think other than to abort).
> 
> In most cases the filesystem deals with it for us: you can't see a
> partial directory entry and the whole thing is more or less atomic:
> concurrent modifications do no harm since we end up in the same state
> regardless.
> 
> I was mostly trying to make sure that we didn't accidentally remove the
> wrong thing, which seems like a more unfortunate outcome than most
> others arising from concurrent modification!
> 
> You're right that we should just abort if we spot this, though: that
> solves the problem immediately. In practice, given that two CUSE-using
> daemons can't simultaneously hang off the same device node, this is
> only actually going to happen if the testsuite or something mistakenly
> fires up using the same /run dir as the primary daemon, and if that
> happens I don't think anyone will mind too much if that daemon, or the
> systemwide daemon, suddenly keels over backwards. (The systemwide daemon
> will be insta-restarted by systemd anyway, and it's more or less
> stateless, so nothing bad will result.)
> 
> >>  /*
> >> - * Result of DOF probe parsing (currently used only for probe creation. We
> >> - * receive a provider info structure, followed by N probe info structures each
> >> - * of which is followed by possibly many tracepoint info structures, all tagged.
> >> - * Things not useful for probe creation (like args, translated types, etc) are
> >> - * not returned.
> >> + * Result of DOF probe parsing.  We receive a provider info structure, followed
> >> + * by N probe info structures each of which is followed by possibly many
> >> + * tracepoint info structures, all tagged.  Things not useful for probe creation
> >> + * (like args, translated types, etc) are not returned.
> >
> > Well, probe args are actually useful (and needed) for probe creation, as are
> > translated types and mapping info.  I understand that we currently do not
> > support them yet, but it is misleading to write in this comment that those are
> > not needed for probe creation, because we actually do consider them part of the
> > probe info.
> 
> True, adjusted to "not yet used".
> 
> -- 
> NULL && (void)