[DTrace-devel] [PATCH v3 06/21] dtprobed: add the DOF stash

Thu Feb 15 13:13:17 UTC 2024

On 15 Feb 2024, Kris Van Hees uttered the following:

> On Tue, Jan 16, 2024 at 09:13:02PM +0000, Nick Alcock via DTrace-devel wrote:
>> +/*
>> + * Write out all the parsed DOF from the passed-in accumulator.  Split out from
>> + * dof_stash_add() because we don't want to create anything in the DOF stash for
>> + * a given piece of DOF until we have verified that all the parsed data for all
>> + * the probes in that DOF is valid.
>> + *
>> + * Each parsed file starts with a single 64-bit word giving the version of
>> + * dof_parsed_t in use, to ensure that readers never use an incompatible
>> + * version, and a single uint64_t giving the number of tracepoints following,
>> + * each of which is a struct dof_parsed_t. The name of the probe is not given,
>> + * since the filename (or, for the probes/ dirs, the file and directory names)
>> + * already contains it.
>> + *
>> + * We also write the parsed version out to the parsed/version file in the parsed
>> + * mapping dir, to make it easier to determine whether to reparse all the DOF
>> + * for outdatedness.
>> + */
>> +int
>> +dof_stash_write_parsed(pid_t pid, dev_t dev, ino_t ino, dt_list_t *accum)
>> +{
>> +	char *pid_name = make_numeric_name(pid);
>> +	char *dof_name = make_dof_name(dev, ino);
>> +	char *provpid_name = NULL;
>> +	int perpid_dir, perpid_dof_dir, probepid_dir, parsed_mapping_dir,
>> +		prov_dir = -1, mod_dir = -1, fun_dir = -1;
>> +        dof_parsed_list_t *accump;
>> +	int version_fd, fd = -1;
>
> Why oh why use 'fd' when all other directory and file descriptors have actual
> descriptive names?  Especially since fd is carried across multiple iterations
> in the loop.

Yes that's awful naming, fixed to parsed_fd.

>> +	const char *op;
>> +	const char *probe_err = "unknown probe";
>> +	uint64_t version = DOF_PARSED_VERSION;
>> +	int state = -1;
>> +	int err;
>> +
>> +	if (dt_list_next(accum) == NULL) {
>> +		free(pid_name);
>> +		free(dof_name);
>> +		return 0;
>> +	}
>
> Why not
>
> 	if (dt_list_next(accum) == NULL)
> 		goto early_out;
>
> and set a label below at the successful case return that ends with these exact
> three statements?  Might as well, given the presence of so many other gotos
> anyway in this function.

Nicely spotted. Done.

>> +
>> +        if ((perpid_dir = openat(pid_dir, pid_name, O_PATH | O_CLOEXEC)) < 0) {
>> +		fuse_log(FUSE_LOG_ERR, "dtprobed: PID %i, %lx/%lx: open failed for PID dir for parsed DOF write: %s\n",
>> +			 pid, dev, ino, strerror(errno));
>> +		goto early_err;
>> +	}
>> +
>> +	op = "per-mapping open";
>> +        if ((perpid_dof_dir = openat(perpid_dir, dof_name, O_PATH | O_CLOEXEC)) < 0)
>> +		goto close_perpid_err;
>> +
>> +	op = "per-mapping parsed DOF mkdirat";
>> +	if ((parsed_mapping_dir = make_state_dirat(perpid_dof_dir, "parsed", op, 0)) < 0)
>> +		goto close_perpid_dof_err;
>> +
>> +	op = "probe PID mkdirat";
>> +	if ((probepid_dir = make_state_dirat(probes_dir, pid_name, op, 0)) < 0)
>> +		goto remove_parsed_mapping_err;
>> +
>> +        if ((version_fd = openat(parsed_mapping_dir, "version",
>> +				 O_CREAT | O_TRUNC | O_WRONLY, 0644)) < 0)
>> +		goto remove_probepid_dir_err;
>> +
>> +	if (write_chunk(version_fd, &version, sizeof(uint64_t)) < 0) {
>> +		close(version_fd);
>
> Shouldn't this perform an unlinkat() on version_fd?

This is done in remove_probepid_dir_err, via

	unlinkat(parsed_mapping_dir, "version", 0);

>> +		goto remove_probepid_dir_err;
>
> The code path starting at the remove_probepid_dir_err label does not actually
> contain an unlinkat() for probepid_dir, so it never gets cleaned up in case of
> error?

What? It contains

	unlinkat(probes_dir, pid_name, AT_REMOVEDIR);

which is the same as the thing probepid_dir is an fd to:

	if ((probepid_dir = make_state_dirat(probes_dir, pid_name, op, 0)) < 0)

You cannot remove probepid_dir (or version_fd): unlinkat() takes an fd
to a directory and a pathname, not an fd to a file. (Which is *really
annoying* and made auditing this for leaks much harder than it should
have been.)

>> +		/*
>> +		 * Probe: make (mod, fun, probe) dirs as needed: close
>> +		 * out file, open new one, make hardlink to it.
>> +		 */
>> +		case DIT_PROBE: {
>> +                        assert(state >= 0);
>
> This is ok, but DIT_TRACEPOINT uses:
>
> 			assert(state == DIT_PROBE || state == DIT_TRACEPOINT);
>
> which is much better, right?  So why not use that here?

We're not checking the same thing. DIT_TRACEPOINT needs to validate that
the tracepoint is preceded by a probe (or a probe and a string of
tracepoints): DIT_PROBE only needs there to have been at least one
provider (which means the state might be *anything* this state machine
can reach, but cannot be -1, the before-we-started state).

(This is all outright asserts because it can't fire unless dof_parser.c
is buggy anyway.)

>> +			/*
>> +			 * This rather weird pattern carries out all closes and
>> +			 * then errors if any of them fail.
>> +			 */
>> +                        err = maybe_close(mod_dir);
>> +			err |= maybe_close(fun_dir);
>> +			err |= maybe_close(fd);
>
> Why not do this at the beginning of the DIT_PROBE processing?  Obviously, if
> a previous DIT_PROBE existed, this new one ends the previous one.  There does
> not seem to be any point in not doing these closes immediately.  Even if the
> processing of this probe fails somehow, the previous one is known to be done.

I did that originally, but it meant duplicating all of this three times,
at the start of DIT_PROBE, after the loop, and in the error handlers.
There's too much error handling and closing noise in this function already...

>> +			op = "parsed version write";
>> +			if (write_chunk(fd, &version, sizeof(uint64_t)) < 0)
>> +				goto err_probe_link;
>> +
>> +			op = "parsed nprobes write";
>> +			_Static_assert(sizeof(accump->parsed->probe.ntp) == sizeof(uint64_t),
>> +				       "chunk write assumes parsed.probe.ntp is a uint64_t");
>
> Why not use a regular assert or something else?  Since this is the only use
> of static assertions in the entire code base, perhaps use the typical assert
> suffices (and we can always do an audit of the code base and introduce static
> assertions if we find there are enough uses for it)?

Mostly because this is something that is statically detectable because
it is purely related to type sizes, and I find it a lot more helpful to
get compile-time errors when that sort of thing happens than runtime
assertions. This seems like the same question as "why don't we just make
everything void * and check types at runtime": well, if you've got any
choice at all you don't do that, no.

Static assertions are over a decade old now. Every compiler we support
has supported them for many years. We can probably use them more :)

Leaking them in bit by bit seems more likely to *actually get done* than
some future vast sweep which feels like a lot more work and is
correspondingly unlikely to ever happen. There is nothing about static
assertions that requires them to be done wholesale!

>> +	close(perpid_dir);
>> +	free(provpid_name);
>
> early_out: (see above for explanation)

Ack!

>> +/*
>> + * Write out the DOF helper.  A trivial wrapper.
>> + */
>> +static
>> +int dof_stash_write_dh(int dirfd, const dof_helper_t *dh)
>> +{
>> +	int fd;
>> +
>> +        if ((fd = openat(dirfd, "dh", O_CREAT | O_EXCL | O_WRONLY, 0644)) < 0)
>> +		goto err;
>> +
>> +	if (write_chunk(fd, dh, sizeof(dof_helper_t *)) < 0) {
>> +		close(fd);
>> +		goto err;
>> +	}
>> +
>> +        if (close(fd) < 0)
>> +		goto err;
>> +
>> +	return 0;
>> +
>> + err:
>> +	fuse_log(FUSE_LOG_ERR, "dtprobed: cannot write out DOF helper: %s\n",
>> +		 strerror(errno));
>
> I would think that this function takes care of the unlink if the writee fails.
> Why leave that to the caller, given that this function actually creates the
> file?

Good point, fixed. (With luck this function can go away before the next
round if relocs truly are unnecessary.)

>> +	/*
>> +	 * Write out the raw DOF, if not already known, and link it into its
>> +	 * per-PID dir: write the helper directly out.  dof_stash_write tells us
>> +	 * whether the DOF actually needed writing or not (via a straight size
>> +	 * comparison, which suffices to spot almost everything other than the
>> +	 * pathological case of stale DOF from a killed process whose inode is
>> +	 * unlinked and replaced with another identically-numbered inode with
>> +	 * DOF of the *exact same length*: this seems too unlikely to be worth
>> +	 * bothering with.)
>> +	 */
>> +	new_dof = 1;
>> +	switch (dof_stash_write(dof_dir, dof_name, dof, size)) {
>> +	case 0: new_dof = 0; break;
>> +	case -1: goto err_unlink_nomsg; break;
>> +	}
>
> Why this odd switch statement?  For one, the -1 case doesn't need a break
> because it has a goto right before that break.  Couldn't this be simply
> written as:
>
> 	new_dof = 1;
> 	if (dof_stash_write(dof_dir, dof_name, dof, size) == -1)
> 		goto err_unlink_nomsg;
> 	new_dof = 0;
>
> or even:
>
> 	new_dof = dof_stash_write(dof_dir, dof_name, dof, size);
> 	if (new_dof == -1)
> 		goto err_unlink_nomsg;

dof_stash_write_raw() has a tripartite return value:

 -1: error
 >0: DOF written, size given
 0: *no DOF written*.

The case statement is handling that, much as it would do for fork()
(which I modelled this on, return-value-wise).

I've augmented the comment above it to indicate that.

>> +	if ((fd = openat(dirfd, dirs[diroff], O_RDONLY | O_CLOEXEC)) < 0) {
>> +		fuse_log(FUSE_LOG_ERR, "Cannot open %s during probe state dir deletion: %s\n",
>> +			 dirs[diroff], strerror(errno));
>> +		char foo[PATH_MAX];
>> +		sprintf(foo, "ls -l /proc/%i/fd", getpid());
>> +		system(foo);
>
> Isn't this a potential security concern?  Which 'ls' is going to get executed?
> It seems to me that the potential red flags that this could raise do not really
> justify the limited benefit of havinng this output collected.  Is this worth
> it?

As the variable names indicate, that is debugging code I failed to
remove, dammit! Ripped out with *extreme* prejudice. DAMMIT. I scanned
for all of this before committing and *still* missed a bit.

The actual code we wanted is a fuse_log() and a return -1, as usual.

>> +/*
>> + * Unlink a file and then a bunch of directories, as given by the names
>> + * argument (the last entry in which is a file).  Each directory in the dirs
>> + * array is a subdirectory of the one before it.  If any directory removal
>> + * fails due to non-emptiness, all later removals are skipped.  If any directory
>> + * does not exist, removals start from that point.
>
> How do I reconcile this comment with the actual function?  Where is a file
> being unlinked?  What names argument?

dirs argument, whoops.

The actual function is your usual "thin base case wrapper around a
recursive function", so the meat of it is above.

>> + *
>> + * The first entry must be a file: the next must be directories.  If this is
>> + * violated, just return without unlinking anything (but do log an error).
>
> First entry of what?  Next of what?  Where is this validated?

I kind of assumed you could look at its single use:

	const char *dirnames[] = {pid_name, prov, mod, fun, prb, NULL};
	if (unlinkat_many(probes_dir, dirnames) < 0) {

(I generalized it because I just bet we'll be needing this again in
dtprobed when we get systemwide tracing, for the non-pid-containing
/run/dtrace/probes/$prov/$mod/$fun/$prb dir.)

The comment is wrong anyway, bloody dyslexia:

 * The last entry in dirs must be a file: all before it must be directories.  If
 * f this is violated, just return without unlinking anything (but do log an
 * error).

i.e. this acts like rm -d $pid_name/$prov/$mod/$fun/$prb. I've added a
reference to rm -rd in the comment as well.

>> +	/*
>> +	 * Truncated dir -> concurrent modifications, not allowed.
>> +	 */
>> +	if (readlinkat(perpid_dir, gen_name, gen_linkname, gen_stat.st_size + 1) ==
>> +	    gen_stat.st_size + 1) {
>> +		fuse_log(FUSE_LOG_ERR, "dtprobed: internal error: state dir is under concurrent modification, maybe multiple dtprobeds are running\n");
>
> OK, this could happen (but never should).  But isn't the possibility of
> concurrent modifications pretty much ignored everywhere else?  If it is a
> possible concern, wouldn't it need to be addressed everywhere (and honestly,
> in that case there is nothing that can be done I think other than to abort).

In most cases the filesystem deals with it for us: you can't see a
partial directory entry and the whole thing is more or less atomic:
concurrent modifications do no harm since we end up in the same state
regardless.

I was mostly trying to make sure that we didn't accidentally remove the
wrong thing, which seems like a more unfortunate outcome than most
others arising from concurrent modification!

You're right that we should just abort if we spot this, though: that
solves the problem immediately. In practice, given that two CUSE-using
daemons can't simultaneously hang off the same device node, this is
only actually going to happen if the testsuite or something mistakenly
fires up using the same /run dir as the primary daemon, and if that
happens I don't think anyone will mind too much if that daemon, or the
systemwide daemon, suddenly keels over backwards. (The systemwide daemon
will be insta-restarted by systemd anyway, and it's more or less
stateless, so nothing bad will result.)

>>  /*
>> - * Result of DOF probe parsing (currently used only for probe creation. We
>> - * receive a provider info structure, followed by N probe info structures each
>> - * of which is followed by possibly many tracepoint info structures, all tagged.
>> - * Things not useful for probe creation (like args, translated types, etc) are
>> - * not returned.
>> + * Result of DOF probe parsing.  We receive a provider info structure, followed
>> + * by N probe info structures each of which is followed by possibly many
>> + * tracepoint info structures, all tagged.  Things not useful for probe creation
>> + * (like args, translated types, etc) are not returned.
>
> Well, probe args are actually useful (and needed) for probe creation, as are
> translated types and mapping info.  I understand that we currently do not
> support them yet, but it is misleading to write in this comment that those are
> not needed for probe creation, because we actually do consider them part of the
> probe info.

True, adjusted to "not yet used".

-- 
NULL && (void)