[DTrace-devel] [PATCH v3 06/21] dtprobed: add the DOF stash

Nick Alcock nick.alcock at oracle.com
Mon Feb 12 21:59:29 UTC 2024


On 12 Feb 2024, Kris Van Hees stated:

> Initial comments (mostly on the comments, that are pretty crucial for people
> to understand what is going on here).

Thanks!

> On Tue, Jan 16, 2024 at 09:13:02PM +0000, Nick Alcock via DTrace-devel wrote:
>> The DOF stash is a runtime directory (/run/dtrace/stash by default) in which
>> the dtprobed daemon keeps track of incoming DOF.  There's a comment
>> describing it at the top of dtprobed/dof_stash.c, but in brief we have two
>> directories under /run/dtrace/stash:
>> 
>> - dof/, which contains raw DOF named by mapping
>> - dof-pid/, which tracks DOF on a PID-by-PID and mapping-by-mapping basis.
>>   This contains subdirectories $pid/$mapping containing both the same raw
>>   DOF (hardlinked) and parsed DOF split up by probe, in
>>   $pid/$mapping/parsed/$spec.
>> 
>> To aid consumer globbing of probes we also put the DOF for each individual
>> probe in /run/dtrace/probes/$pid/$prv$pid/$mod/$fun/$prb; in future we may
>> introduce further symlinks not in per-pid directories to ease matching of
>> probes across processes.
>
> Mention here whether the stuff under probes is raw or parsed DOF (or both)?

Good point, whoops. (Back when I wrote that comment, it was all parsed...)

>> +++ b/dtprobed/dof_stash.c
>> @@ -0,0 +1,1621 @@
>> +/*
>> + * Oracle Linux DTrace; DOF storage for later probe removal.
>
> This is a very narrow scope description of what the DOF stash is, and I would

Yeah, it's terrible. It wasn't revised after the big change...

> say that the removal of probes is probably a more minor use of it when compared
> to enabling probe create/remove by dtrace itself and state retention across
> dtprobed restarts.  Maybe just sya 'DOF state storage management' and leave it
> at that?

Ack.

>> + * Copyright (c) 2023, Oracle and/or its affiliates. All rights reservedaeve.
>
> 2024 (and in general all copyright years should be updated to reflect these
> changes are getting applied to the tree in 2024).

Ack. I'll... put that off until the review is done, I think, and then do
it all at once.

>> + * Licensed under the Universal Permissive License v 1.0 as shown at
>> + * http://oss.oracle.com/licenses/upl.
>> + *
>> + * The DOF stash is the principal state preservation mechanism used by dtprobed.
>> + * Keeping DOF on disk keeps memory management complexity low, and makes it
>> + * possible for dtprobed to track DOF across daemon restarts and upgrades.
>
> Wouldn't you mention here also the important purpose of allow dtrace to get
> to USDT probe data for probe creation and removal?

Hell yes (now) :) this was also not revised...

>> + *
>> + * The state storage structure is as follows:
>> + *
>> + * /run/dtrace/stash: Things private to dtprobed.
>
> I would put the section I delineate below as #1# here, indented to show that
> it is content that goes under this directory.

Ack, reshuffled.

>> + *
>> + * /run/dtrace/probes: Per-probe info, written by dtprobed, read by DTrace.
>
> I would put the section I delineate below as #2# here, indented to show that
> it is content that goes under this directory.  That also means you can drop the
> /run/dtrace, /run/dtrace/stash, and /run/dtrace/probes prefixes which saves
> some space and makes reading the paths a little easier I think.

Ack. I'm saying e.g.

    .../dof-pid/$pid/$dev-$ino

here now instead, because I think we need *some* sort of prefix to
indicate what's going on, and we obviously can't use . or .. :)

>> + *
>> + * /run/dtrace/stash/dof/$dev-$ino: DOF contributed by particular mappings,
>> + * in raw form (as received from the client).
>
> What is 'the client'?  That is terminology that seems confusing here because I
> would think it refers to a client of dof-stash, which would be dtprobed and
> dtrace, but then again, perhaps it refers to executables and libraries that
> have DOF info in them, presented to dtprobed using the drti code?  Also, since
> I assume that a $dev-$ino directory corresponds to a (one) particular mapping
> I think that 'by particular mappings' above should be 'a particular mapping'?

Yeah, this is awful terminology. "Probe-containing program"?
"Probe-containing executing ELF object"? There really is no good term
for this. Maybe I should just name it after what it contains.

... no, you thought up 'DOF source'. I have a new section for the
directory now:

 *    .../dof-pid/$pid/$dev-$ino: contains everything relating to DOF
 *    contributed by a particular USDT-containing ELF object within a given
 *    process (henceforth "DOF source").  Each DTRACEHIOC_ADDDOF ioctl call
 *    creates one of these.


>> + *
>> + * /run/dtrace/stash/dof-pid/$pid/$dev-$ino/raw: hardlink to the DOF for a given
>> + * PID. Pruned of dead processes at startup and on occasion: entries also
>> + * deleted on receipt of DTRACEHIOC_REMOVE ioctls. The principal effect of this
>> + * is to bump the link count for the corresponding DOF in dof/ directory: when
>> + * this falls to 1, the DOF is considered dead and the corresponding probe is
>> + * removed.
>
> I would rephrase this a bit, beecause the way it reads it gives the impression
> that the principal effect of DTRACEHIOC_REMOVE is to dump the link count for
> the corresponding DOF in dof/ which makes no sense whatsoever.  Only after

Agreed.

> re-reading this a few times did I realize that you mean that you are using a
> hardlink to the DOF for a given PID because this has the principal effect of
> bumping the link count of the corresponding DOF in dof/.  So that should be
> mentioned first, and then you can mention how this directory is used in terms
> of creation, removal, and pruning.

Yes.

> I am also confused by 'hardlink to the DOF for a given PID' when it is clear
> from the $pid/$dev-$ino directory naming (and the nature of processes and their
> USDT probes) that a PID can have USDT probes that are provided by DOF from
> multiple mappings.

It's never been clear to me what the heck is supposed to happen if two
mappings contribute the same probe. v1 just ignored all but the first. I
guess that's what we should do too.  (Not implemented, we failed
horribly: fixed accordingly. Not bothering to validate that they're the
same, we just assume it. Maybe this is too lazy?)

>                                     So isn't this a hardlink to the
> DOF for a given mapping that is used by a given PID?

Yes, exactly. Awful phrasing, sorry.

>> + *
>> + * /run/dtrace/stash/dof-pid/$pid/$dev-$ino/dh: Serialized form of the
>> + * dof_helper_t received from a given client. Preserved for all
>
> Again, what is 'a given client'?  Here is would seem to indicate a mapping, or
> at least an executable or shared library that provides DOF through drti.

This is better expressed in terms of where it comes from, i.e. the
ioctl, since this at least is not the DOF at all. See below.

>> + * DTRACEHIOC_ADDDOFs, even if the DOF was already contributed by some other
>> + * process.  Keeping this around allows the DOF to be re-parsed as needed, even
>> + * by a later re-execution of a different version of the same daemon with a
>> + * different dof_parsed_t structure, and to be re-parsed with appropriate
>> + * addresses for whichever mapping of this DOF is under consideration.
>
> I assume 'Serialized' means 'parsed' here?  Since that is the terminology used
> throughout this patch in reference to DOF I would stick to that.

No, this is the dof_helper_t, not the DOF itself. It is literally a dump
of the dof_helper_t struct: thus, serialized. No parsing. (We write out
both parsed and serialized -- raw -- forms of the DOF itself.)

Hm maybe I should call this 'raw' instead of 'serialized', for
consistency. (Done in the copy above.)

Rephrased to

 *    .../dof-pid/$pid/$dev-$ino/dh: Raw form of the dof_helper_t received from
 *    a given DTRACEHIOC_ADDDOF, serialized straight to disk with no changes.

>> + *
>> + * /usr/dtrace/stash/dof-pid/$pid/$dev-$ino/parsed: parsed DOF for a given PID.
>> + * Generated from the helper for PIDs that are the first appearance of a given
>> + * piece of DOF and used for the common case where that DOF has only one mapping
>> + * until its deletion (short-lived processes, mostly).  Deleted at daemon
>> + * startup.
>
> I am confused here.  The description seems to indicate that this is a file,
> but reading firther down it becomes clear that this is a directory.  (Hence my
> suggestion to reorganize this comment block.)

It, uh, used to be a file? um. yeah it's just a container now.  Totally
missed that.

Also the whole "used for the common case" stuff is just wrong -- that's
residue from the days when I was trying to parse the DOF only once per
mapping. It's always used.

... fixed.  We regenerate "parsed" only when the dof_parsed_t changes.

> But then there is the question of what you mean by refer to the first appearance
> of a given piece of DOF and DOF that only has one mapping.  If I have two tasks
> that execute the same executable that contains USDT probes, will only one of
> them have a /parsed directory?

This stuff is per-PID. There will be one entry for the mapping in
/run/dtrace/dof/$dev-$ino, and two PIDs in /run/dtrace/dof-pid for each
PID. (We deal with the result of fork and exec in a latter commit.)

Most of this stuff is hardlinked so the disk space consumption is quite
low (though tmpfs actually consumes one "inode" per link, um...)

There is no per-executable stuff at all -- there is per-mapping stuff,
per-pid-per-mapping stuff, and per-pid-per-probe stuff.  We never look
to see what executable anything comes from, ever (though libproc could
tell us if we needed it).

>                                What if that one finishes execution while the
> other is still running?  What if then a third innstance gets started?  Many
> questiond and I do not find clarity in the description of this directory to
> actually know how this is used and managed.  Is this not used for long-lived
> processes?  You peak of 'DOF [that] has only one mapping' but aren't we above
> talking about 'DOF contributed by [a] particular mapping'?  It feels like we
> are mixing terminology.

Now I've cleared up the messes above, there are only two distinct cases
here:

When DOF is first contributed by a given PID for a given mapping, we
create /run/dtrace/stash/dof/$dev-$ino with a link count of 1, create
the /run/dtrace/stash/dof-pid/$pid/$dev-$ino directory and populate it,
then hardlink /run/dtrace/probes/$prov$pid/... for every provider it
contributes. Subsequent contributions by other PIDs do all that except
for bumping the link count of /run/dtrace/stash/dof/$dev-$ino (since it
already exists). This makes it possible to determine when providers and
their probes vanish *systemwide*, across all processes, though this is
not yet recorded in /run/dtrace/probes, which seems likely to be useful
for systemwide probing.

Perhaps systemwide probing will introduce
/run/dtrace/probes/$prv/$mod/$fun/$prb stuff, i.e. not per-pid?

Another oversight: aliased mappings are a thing, but if an aliased
mapping contributed DOF twice in one process, the second would fail. I
just fixed that so the second simply does nothing (in the future, maybe
we should bump something so that you need to do two DTRACEHIOC_REMOVEs
to remove such things. I've added a note to that effect.)

>> + *
>> + * /run/dtrace/stash/dof-pid/$pid/$gen: symlink to a $dev-$ino dir. "$gen" is an
>> + * incrementing generation counter starting at zero for every unique PID, used
>> + * by clients to uniquely identify DOF pieces to remove.
>
> What 'clients'?  Which $dev-$ino is $gen a symlink to (since a given pid can
> have multiple $dev-$ino pairs associated with it)?

This is rubbish, it's used by dtprobed :)

$gen is a symlink to a given *mapping* directory, a $dev-$ino dir --
it's used to identify a mapping to remove when a USDT-containing program
lobs an ioctl() its way, since all it gives as an indication of what to
remove is the generation counter handed back by dtprobed when the DOF
was originally contributed.

>> + * /run/dtrace/probes/$pid/$prv$pid/$mod/$fun/$prb: Parsed representation of one
>> + * probe in a given process. Removed by dtprobed when the process dies, or if
>> + * all mappings containing the probe are unmmapped.  Used by DTrace for tracing
>> + * by PID.  Hardlink to...
>
> It might be more clear to start with saying that this is a hardlink to <blah>,
> which is the parsed representation of one probe in a given process.  And then
> refer to the actual description of that file.

yeah, definitely now we shuffled things!

>> + *
>> + * /run/dtrace/stash/dof-pid/$pid/$dev-$ino/parsed/$prv$pid:$mod:$fun:$prb:
>> + * parsed DOF for a given probe for a given PID.  Generated from the helper for
>> + * PIDs that are the first appearance of a given piece of DOF and used for the
>> + * common case where that DOF has only one mapping until its deletion
>> + * (short-lived processes, mostly).  Deleted at daemon startup if the
>> + * dof_parsed_t structure changed.  The file consists of a uint64_t version
>> + * number (DOF_PARSED_VERSION), a uint64_t count of tracepoints, and then
>> + * that many struct dof_parsed_t's of type DIT_TRACEPOINT.
>
> Same question about the 'used for the common case' which seems to imply that
> this is not used in other cases - which then brings up the question where
> dtrace will get USDT probe info for the uncommon cases.

Yeah, still rubbish, rephrased :)

> So, on a higher level...  I would expect things to work as follows (and please
> correct me if/where I am wrong)...
>
> Evwry piece of DOF data is associated with a unique $dev-$ino pair that
> identifies the ELF object that provided it (executable or shared library).

Yes.

> When a task starts, the DOF data of the executable (if any) and shared
> libraries (if any) is presented to dtprobed for processing, along with task
> specific information (dof_helper_t - of which the base address of the mapping
> for the associated $dev-$ino pair is the most important piece).

Yep.

> This means that if multiple tasks contain a mapping for a given $dev-$ino,
> only one copy of the DOF data is needed because the only task-specific data
> is contained in dof_helper_t for each task.  Furthermore, if a task has a
> mapping for a given $dev-$ino, the DOF data for that $dev-$ino apply to the
> task (i.e. it will contain the USDT probes specified in that DOF data).

Yep.

> So, what is needed is a collection of $dev-$ino directories that contain the
> actual (non-task specific) DOF data.  For practical reasons, storing it in
> raw, parsed, and per-probe form seems convenient.

It's not even a directory because the only thing contributed which does
not vary by mapping is the raw DOF. It's a collection of $dev-$ino
*files*, each of which is a piece of raw DOF.

> And then there is a collection of $pid directories that contain links to the
> associated $dev-$ino directories, and dof_helper_t data for each of them.

Yep. The parsed probes go here too, and info to let dtprobed emit a new
ascending generation number when new DOF is contributed, and tie each
ioctl() to a generation number so that it can remove the directory when
presented with the generation number in a REMOVE ioctl.

> Due to provider names including the pid of the task, this would also require
> links to be created for the task-specific provider names (with pid) to the
> generic ones in the apporpriate $dev-$ino directory (the rest of the probe
> name components are nno-task specific).

Yep.

> Management-wise, the $pid directories are created for any task that presents
> DOF to dtprobed, and removed when the task terminates (or possibly when a
> DTRACEHIOC_REMOVE ioctl has been issued for all its DOF).  A $dev-$ino
> directory is created the first time DOF is presented for the $dev-$ino pair,
> and remains until the last task that references is has terminated.

$dev-$ino file, but yes.  (This is written out in the call to
dof_stash_write() in dof_stash_add().)

> Does that sum it up correctly?  If so, this is not what I understand from the
> existing description in this comment block.

Nearly! And all your misunderstandings are 100% accounted for by the
bloody mess of relics that comment block was :(

>> +/*
>> + * Like close(), but only try to close positively-numbered fds.
>> + */
>> +static int
>> +maybe_close(int fd)
>> +{
>> +	if (fd > -1)
>
> I would think (fd >= 0) is more standard, but either way works.

Adjusted.

>> +		return close(fd);
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Like mkdirat(), but already-existing directories are considered
>> + * successful.
>> + */
>> +static int
>> +mkexistingdirat(int dirfd, const char *name, mode_t mode)
>
> There is only one caller for this function (make_state_dirat) and I would
> think that this functionality could as well just be inlined there.  That also
> makes it nice and similar to make_state_dir.

Sure: done.

>> +/*
>> + * Compose a provider's name from pieces.
>> + */
>> +static char *
>> +make_provider_name(const char *prov, pid_t pid)
>> +{
>> +	char *ret;
>> +
>> +        if (asprintf(&ret, "%s%li", prov, (long) pid) < 0) {
>> +		fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory making provider name\n");
>> +		return NULL;
>> +	}
>> +	return ret;
>> +}
>
> This function does not validate against the length limit on provider names?

... why would it need to? It's only being used to create filenames (and
the OS checks its own limits on those already). Nothing in dof_stash.c
cares how long the names are.

Certainly dt_pid.c should validate the lengths, which it looks like it
isn't. I'll add code to do that. (But that's another commit!)

> This function seems to be something that would belong in a library of common
> functions between dtprobed and dtrace since e.g. dt_prov_uprobe.d and dt_pid.c
> provide similar code.

Alas, no -- that fuse_log() isn't something you'd want to do from either
of those places, and if you take that out all this is an asprintf()
call.

>> +
>> +/*
>> + * Compose a full probespec's name from pieces.
>> + */
>> +static char *
>> +make_probespec_name(const char *prov, const char *mod, const char *fn,
>> +		    const char *prb)
>> +{
>> +	char *ret;
>> +
>> +        if (asprintf(&ret, "%s:%s:%s:%s", prov, mod, fn, prb) < 0) {
>> +		fuse_log(FUSE_LOG_ERR, "dtprobed: out of memory making probespec\n");
>> +		return NULL;
>> +	}
>> +	return ret;
>> +}
>
> This also seems like a good candidate for a more general-use function.  It
> probably could also use some probe name length validation.  Perhaps it could
> also be useful to make use of the dtrace_probedesc_t type to represent the
> quad (prv, mod, fun, prb) as is commonly done in DTrace code?

See above. Though if I was using dtrace_probedesc_t I suppose I would
need to length-check it -- but it doesn't provide any clarity
improvement here.

(dtrace_probedesc_t provides a valuable clarity improvement when you're
passing these things around a lot, which of course DTrace does, but here
we don't pass any of these around at all -- they're only used inside
dof_stash_write_parsed().)

>> +
>> +/*
>> + * Split the pieces back up again.  The input spec becomes the provider name.
>> + * The pieces' lifetime is no longer than that of the input string.
>> + */
>> +static int
>> +split_probespec_name(char *spec, const char **mod, const char **fn,
>> +		     const char **prb)
>> +{
>> +	char *tmp;
>> +
>> +	tmp = strchr(spec, ':');
>> +	if (tmp == NULL)
>> +		return -1;
>> +	*tmp = 0;
>> +	*mod = tmp + 1;
>> +
>> +	tmp = strchr(*mod, ':');
>> +	if (tmp == NULL)
>> +		return -1;
>> +	*tmp = 0;
>> +	*fn = tmp + 1;
>> +
>> +	tmp = strchr(*fn, ':');
>> +	if (tmp == NULL)
>> +		return -1;
>> +	*tmp = 0;
>> +	*prb = tmp + 1;
>> +
>> +	return 0;
>> +}
>
> Same comments (general use function, use dtrace_probedesc_t, length validation).
> Also see dtrace_xstr2desc() in dt_subr.c for a very similar hat's.

That's *way* more complex and depends on all sorts of infrastructure
dtprobed doesn't have (dt_set_errno for starters) and does all sorts of
things like idhash lookups and integration into the parser and
$-expansion which surely we must never do even if we see $ in the names
we are given. All we want to do is split three times on colons, nothing
more.

If dtrace_str2desc() was *actually* the oppposite of dtrace_desc2str()
rathe rthan being implemneted in terms of a far more general function,
maybe it could be reused, but...

-- 
NULL && (void)



More information about the DTrace-devel mailing list