[DTrace-devel] [PATCH v3 06/21] dtprobed: add the DOF stash
Nick Alcock
nick.alcock at oracle.com
Thu Feb 15 13:13:17 UTC 2024
On 15 Feb 2024, Kris Van Hees uttered the following:
> On Tue, Jan 16, 2024 at 09:13:02PM +0000, Nick Alcock via DTrace-devel wrote:
>> +/*
>> + * Write out all the parsed DOF from the passed-in accumulator. Split out from
>> + * dof_stash_add() because we don't want to create anything in the DOF stash for
>> + * a given piece of DOF until we have verified that all the parsed data for all
>> + * the probes in that DOF is valid.
>> + *
>> + * Each parsed file starts with a single 64-bit word giving the version of
>> + * dof_parsed_t in use, to ensure that readers never use an incompatible
>> + * version, and a single uint64_t giving the number of tracepoints following,
>> + * each of which is a struct dof_parsed_t. The name of the probe is not given,
>> + * since the filename (or, for the probes/ dirs, the file and directory names)
>> + * already contains it.
>> + *
>> + * We also write the parsed version out to the parsed/version file in the parsed
>> + * mapping dir, to make it easier to determine whether to reparse all the DOF
>> + * for outdatedness.
>> + */
>> +int
>> +dof_stash_write_parsed(pid_t pid, dev_t dev, ino_t ino, dt_list_t *accum)
>> +{
>> + char *pid_name = make_numeric_name(pid);
>> + char *dof_name = make_dof_name(dev, ino);
>> + char *provpid_name = NULL;
>> + int perpid_dir, perpid_dof_dir, probepid_dir, parsed_mapping_dir,
>> + prov_dir = -1, mod_dir = -1, fun_dir = -1;
>> + dof_parsed_list_t *accump;
>> + int version_fd, fd = -1;
>
> Why oh why use 'fd' when all other directory and file descriptors have actual
> descriptive names? Especially since fd is carried across multiple iterations
> in the loop.
Yes that's awful naming, fixed to parsed_fd.
>> + const char *op;
>> + const char *probe_err = "unknown probe";
>> + uint64_t version = DOF_PARSED_VERSION;
>> + int state = -1;
>> + int err;
>> +
>> + if (dt_list_next(accum) == NULL) {
>> + free(pid_name);
>> + free(dof_name);
>> + return 0;
>> + }
>
> Why not
>
> if (dt_list_next(accum) == NULL)
> goto early_out;
>
> and set a label below at the successful case return that ends with these exact
> three statements? Might as well, given the presence of so many other gotos
> anyway in this function.
Nicely spotted. Done.
>> +
>> + if ((perpid_dir = openat(pid_dir, pid_name, O_PATH | O_CLOEXEC)) < 0) {
>> + fuse_log(FUSE_LOG_ERR, "dtprobed: PID %i, %lx/%lx: open failed for PID dir for parsed DOF write: %s\n",
>> + pid, dev, ino, strerror(errno));
>> + goto early_err;
>> + }
>> +
>> + op = "per-mapping open";
>> + if ((perpid_dof_dir = openat(perpid_dir, dof_name, O_PATH | O_CLOEXEC)) < 0)
>> + goto close_perpid_err;
>> +
>> + op = "per-mapping parsed DOF mkdirat";
>> + if ((parsed_mapping_dir = make_state_dirat(perpid_dof_dir, "parsed", op, 0)) < 0)
>> + goto close_perpid_dof_err;
>> +
>> + op = "probe PID mkdirat";
>> + if ((probepid_dir = make_state_dirat(probes_dir, pid_name, op, 0)) < 0)
>> + goto remove_parsed_mapping_err;
>> +
>> + if ((version_fd = openat(parsed_mapping_dir, "version",
>> + O_CREAT | O_TRUNC | O_WRONLY, 0644)) < 0)
>> + goto remove_probepid_dir_err;
>> +
>> + if (write_chunk(version_fd, &version, sizeof(uint64_t)) < 0) {
>> + close(version_fd);
>
> Shouldn't this perform an unlinkat() on version_fd?
This is done in remove_probepid_dir_err, via
unlinkat(parsed_mapping_dir, "version", 0);
>> + goto remove_probepid_dir_err;
>
> The code path starting at the remove_probepid_dir_err label does not actually
> contain an unlinkat() for probepid_dir, so it never gets cleaned up in case of
> error?
What? It contains
unlinkat(probes_dir, pid_name, AT_REMOVEDIR);
which is the same as the thing probepid_dir is an fd to:
if ((probepid_dir = make_state_dirat(probes_dir, pid_name, op, 0)) < 0)
You cannot remove probepid_dir (or version_fd): unlinkat() takes an fd
to a directory and a pathname, not an fd to a file. (Which is *really
annoying* and made auditing this for leaks much harder than it should
have been.)
>> + /*
>> + * Probe: make (mod, fun, probe) dirs as needed: close
>> + * out file, open new one, make hardlink to it.
>> + */
>> + case DIT_PROBE: {
>> + assert(state >= 0);
>
> This is ok, but DIT_TRACEPOINT uses:
>
> assert(state == DIT_PROBE || state == DIT_TRACEPOINT);
>
> which is much better, right? So why not use that here?
We're not checking the same thing. DIT_TRACEPOINT needs to validate that
the tracepoint is preceded by a probe (or a probe and a string of
tracepoints): DIT_PROBE only needs there to have been at least one
provider (which means the state might be *anything* this state machine
can reach, but cannot be -1, the before-we-started state).
(This is all outright asserts because it can't fire unless dof_parser.c
is buggy anyway.)
>> + /*
>> + * This rather weird pattern carries out all closes and
>> + * then errors if any of them fail.
>> + */
>> + err = maybe_close(mod_dir);
>> + err |= maybe_close(fun_dir);
>> + err |= maybe_close(fd);
>
> Why not do this at the beginning of the DIT_PROBE processing? Obviously, if
> a previous DIT_PROBE existed, this new one ends the previous one. There does
> not seem to be any point in not doing these closes immediately. Even if the
> processing of this probe fails somehow, the previous one is known to be done.
I did that originally, but it meant duplicating all of this three times,
at the start of DIT_PROBE, after the loop, and in the error handlers.
There's too much error handling and closing noise in this function already...
>> + op = "parsed version write";
>> + if (write_chunk(fd, &version, sizeof(uint64_t)) < 0)
>> + goto err_probe_link;
>> +
>> + op = "parsed nprobes write";
>> + _Static_assert(sizeof(accump->parsed->probe.ntp) == sizeof(uint64_t),
>> + "chunk write assumes parsed.probe.ntp is a uint64_t");
>
> Why not use a regular assert or something else? Since this is the only use
> of static assertions in the entire code base, perhaps use the typical assert
> suffices (and we can always do an audit of the code base and introduce static
> assertions if we find there are enough uses for it)?
Mostly because this is something that is statically detectable because
it is purely related to type sizes, and I find it a lot more helpful to
get compile-time errors when that sort of thing happens than runtime
assertions. This seems like the same question as "why don't we just make
everything void * and check types at runtime": well, if you've got any
choice at all you don't do that, no.
Static assertions are over a decade old now. Every compiler we support
has supported them for many years. We can probably use them more :)
Leaking them in bit by bit seems more likely to *actually get done* than
some future vast sweep which feels like a lot more work and is
correspondingly unlikely to ever happen. There is nothing about static
assertions that requires them to be done wholesale!
>> + close(perpid_dir);
>> + free(provpid_name);
>
> early_out: (see above for explanation)
Ack!
>> +/*
>> + * Write out the DOF helper. A trivial wrapper.
>> + */
>> +static
>> +int dof_stash_write_dh(int dirfd, const dof_helper_t *dh)
>> +{
>> + int fd;
>> +
>> + if ((fd = openat(dirfd, "dh", O_CREAT | O_EXCL | O_WRONLY, 0644)) < 0)
>> + goto err;
>> +
>> + if (write_chunk(fd, dh, sizeof(dof_helper_t *)) < 0) {
>> + close(fd);
>> + goto err;
>> + }
>> +
>> + if (close(fd) < 0)
>> + goto err;
>> +
>> + return 0;
>> +
>> + err:
>> + fuse_log(FUSE_LOG_ERR, "dtprobed: cannot write out DOF helper: %s\n",
>> + strerror(errno));
>
> I would think that this function takes care of the unlink if the writee fails.
> Why leave that to the caller, given that this function actually creates the
> file?
Good point, fixed. (With luck this function can go away before the next
round if relocs truly are unnecessary.)
>> + /*
>> + * Write out the raw DOF, if not already known, and link it into its
>> + * per-PID dir: write the helper directly out. dof_stash_write tells us
>> + * whether the DOF actually needed writing or not (via a straight size
>> + * comparison, which suffices to spot almost everything other than the
>> + * pathological case of stale DOF from a killed process whose inode is
>> + * unlinked and replaced with another identically-numbered inode with
>> + * DOF of the *exact same length*: this seems too unlikely to be worth
>> + * bothering with.)
>> + */
>> + new_dof = 1;
>> + switch (dof_stash_write(dof_dir, dof_name, dof, size)) {
>> + case 0: new_dof = 0; break;
>> + case -1: goto err_unlink_nomsg; break;
>> + }
>
> Why this odd switch statement? For one, the -1 case doesn't need a break
> because it has a goto right before that break. Couldn't this be simply
> written as:
>
> new_dof = 1;
> if (dof_stash_write(dof_dir, dof_name, dof, size) == -1)
> goto err_unlink_nomsg;
> new_dof = 0;
>
> or even:
>
> new_dof = dof_stash_write(dof_dir, dof_name, dof, size);
> if (new_dof == -1)
> goto err_unlink_nomsg;
dof_stash_write_raw() has a tripartite return value:
-1: error
>0: DOF written, size given
0: *no DOF written*.
The case statement is handling that, much as it would do for fork()
(which I modelled this on, return-value-wise).
I've augmented the comment above it to indicate that.
>> + if ((fd = openat(dirfd, dirs[diroff], O_RDONLY | O_CLOEXEC)) < 0) {
>> + fuse_log(FUSE_LOG_ERR, "Cannot open %s during probe state dir deletion: %s\n",
>> + dirs[diroff], strerror(errno));
>> + char foo[PATH_MAX];
>> + sprintf(foo, "ls -l /proc/%i/fd", getpid());
>> + system(foo);
>
> Isn't this a potential security concern? Which 'ls' is going to get executed?
> It seems to me that the potential red flags that this could raise do not really
> justify the limited benefit of havinng this output collected. Is this worth
> it?
As the variable names indicate, that is debugging code I failed to
remove, dammit! Ripped out with *extreme* prejudice. DAMMIT. I scanned
for all of this before committing and *still* missed a bit.
The actual code we wanted is a fuse_log() and a return -1, as usual.
>> +/*
>> + * Unlink a file and then a bunch of directories, as given by the names
>> + * argument (the last entry in which is a file). Each directory in the dirs
>> + * array is a subdirectory of the one before it. If any directory removal
>> + * fails due to non-emptiness, all later removals are skipped. If any directory
>> + * does not exist, removals start from that point.
>
> How do I reconcile this comment with the actual function? Where is a file
> being unlinked? What names argument?
dirs argument, whoops.
The actual function is your usual "thin base case wrapper around a
recursive function", so the meat of it is above.
>> + *
>> + * The first entry must be a file: the next must be directories. If this is
>> + * violated, just return without unlinking anything (but do log an error).
>
> First entry of what? Next of what? Where is this validated?
I kind of assumed you could look at its single use:
const char *dirnames[] = {pid_name, prov, mod, fun, prb, NULL};
if (unlinkat_many(probes_dir, dirnames) < 0) {
(I generalized it because I just bet we'll be needing this again in
dtprobed when we get systemwide tracing, for the non-pid-containing
/run/dtrace/probes/$prov/$mod/$fun/$prb dir.)
The comment is wrong anyway, bloody dyslexia:
* The last entry in dirs must be a file: all before it must be directories. If
* f this is violated, just return without unlinking anything (but do log an
* error).
i.e. this acts like rm -d $pid_name/$prov/$mod/$fun/$prb. I've added a
reference to rm -rd in the comment as well.
>> + /*
>> + * Truncated dir -> concurrent modifications, not allowed.
>> + */
>> + if (readlinkat(perpid_dir, gen_name, gen_linkname, gen_stat.st_size + 1) ==
>> + gen_stat.st_size + 1) {
>> + fuse_log(FUSE_LOG_ERR, "dtprobed: internal error: state dir is under concurrent modification, maybe multiple dtprobeds are running\n");
>
> OK, this could happen (but never should). But isn't the possibility of
> concurrent modifications pretty much ignored everywhere else? If it is a
> possible concern, wouldn't it need to be addressed everywhere (and honestly,
> in that case there is nothing that can be done I think other than to abort).
In most cases the filesystem deals with it for us: you can't see a
partial directory entry and the whole thing is more or less atomic:
concurrent modifications do no harm since we end up in the same state
regardless.
I was mostly trying to make sure that we didn't accidentally remove the
wrong thing, which seems like a more unfortunate outcome than most
others arising from concurrent modification!
You're right that we should just abort if we spot this, though: that
solves the problem immediately. In practice, given that two CUSE-using
daemons can't simultaneously hang off the same device node, this is
only actually going to happen if the testsuite or something mistakenly
fires up using the same /run dir as the primary daemon, and if that
happens I don't think anyone will mind too much if that daemon, or the
systemwide daemon, suddenly keels over backwards. (The systemwide daemon
will be insta-restarted by systemd anyway, and it's more or less
stateless, so nothing bad will result.)
>> /*
>> - * Result of DOF probe parsing (currently used only for probe creation. We
>> - * receive a provider info structure, followed by N probe info structures each
>> - * of which is followed by possibly many tracepoint info structures, all tagged.
>> - * Things not useful for probe creation (like args, translated types, etc) are
>> - * not returned.
>> + * Result of DOF probe parsing. We receive a provider info structure, followed
>> + * by N probe info structures each of which is followed by possibly many
>> + * tracepoint info structures, all tagged. Things not useful for probe creation
>> + * (like args, translated types, etc) are not returned.
>
> Well, probe args are actually useful (and needed) for probe creation, as are
> translated types and mapping info. I understand that we currently do not
> support them yet, but it is misleading to write in this comment that those are
> not needed for probe creation, because we actually do consider them part of the
> probe info.
True, adjusted to "not yet used".
--
NULL && (void)
More information about the DTrace-devel
mailing list