[DTrace-devel] [PATCH 15/15] Implement the proc provider

Thu Feb 23 15:24:51 UTC 2023

On Thu, Feb 23, 2023 at 02:00:11PM +0000, Nick Alcock wrote:
> On 23 Feb 2023, Kris Van Hees via DTrace-devel uttered the following:
> 
> > The following probes (and their arguments) are currently supported:
> >
> > proc:::exec
> > proc:::exec-failure
> > proc:::exec-success
> > proc:::exit
> > proc:::lwp-exit
> > proc:::create
> > proc:::lwp-create
> > proc:::start
> > proc:::lwp-start
> > proc:::signal-discard
> > proc:::signal-handle
> > proc:::signal-send
> 
> YES!
> 
> > Various tests now pass and therefore no longer need @@xfail.
> >
> > Signed-off-by: Kris Van Hees <kris.van.hees at oracle.com>
> 
> Reviewed-by: Nick Alcock <nick.alcock at oracle.com>

Thanks.

> (caveat: not yet tested, but that's going to change soon enough)
> 
> > +#define KPROBES			"kprobes"
> > +#define SYSCALLS		"syscalls"
> > +#define UPROBES			"uprobes"
> > +#define PID			"dt_pid"
> 
> Do these defines really add anything?

Oh, shoot, I forgot to remove a bunch of stuff that was meant to be cleaned up
before posting that final patch.  Sorry :(  Obviously, those are not needed.

> > +/*
> > + * Implement proc probes as dependent probes on other probes.  E.g. for the
> 
> (mental gymnastics, consults sticky note, ok, overlying other probes)
> 
> > + * exec* case, call (to be written) dt_probe_add_dependent(probe).  When the
> 
> This has been written, not with those args though :)
> 
> > + * program for the main probe (e.g. syscall::execve:entry) is generated, it
> > + * should loop through any dependent probes, and for each call a (new) hook
> 
> So now it's 'main' and 'dependent'? But in the dependent probe code
> itself you were calling it 'underlying' and 'dependent'. This
> terminology is a bit of a mess :)
> 
> (I don't see what makes the underlying probe any more 'main' than the
> overlying ones. It happens to be the system-level probe...
> 
> Another possible terminology: why not call them system-level and
> synthetic probes, since we're synthesising the dtrace-level/overlying/
> dependent probes out of nothing and the underlying tracing system has no
> idea about them?)
> 
> > + * function that generates a pseudo-trampoline that converts the main probe
> > + * into the dependent probe.  Then it should add calls to all the clauses of
> > + * the dependent probe.
> 
> So this would now be "... a pseudo-trampoline that converts the
> system-level probe into the synthetic probe", or even "that synthesises
> the synthetic probe out of the system-level probe".
> 
> ... maybe?
> 
> > + * The conversion needs to be able to implement predicate-style conditions that
> > + * determine whether the dependent probe is to fire when the main probe does.
> 
> The main probe usually can't have conditions at all, presumably, since
> the OS-level tracing framework doesn't implement them, but if it did
> they'd be ORed together out of all the dependent probes.
> 
> > + * Dependent probes need priorities for cases where two probes are dependent on
> > + * the same underlying probe, but one needs to 'fire' before the other.  One
> > + * such example is proc:::start and proc:::lwp-start.
> > + *
> > + * proc:::exec
> > + *	- Use syscall::exec*:entry
> > + *	- Clear arg1 and beyond (arg0 is already the filename)
> > + * proc:::exec-success
> > + *	- Use syscall::exec*:return
> > + *	- Only when arg0 == 0
> > + *	- Clear all args
> > + * proc:::exec-failure
> > + *	- Use syscall::exec*:return
> > + *	- Only when arg0 != 0
> > + *	- Argument arg0 is already correct
> 
> Because of the save_args / restore_args calls in the dependent probe
> code, all of these get to modify the main/underlying args, and none of
> them leak args into each other, right?
> 
> > + */

This entire comment block is old notes that eventually grew stale.

Not sure if I should bother rewriting it or just remove it.

> > +
> > +typedef struct probe_arg {
> > +	const char	*name;			/* name of probe */
> > +	int		argno;			/* argument number */
> > +	dt_argdesc_t	argdesc;		/* argument description */
> > +} probe_arg_t;
> > +
> > +static probe_arg_t probe_args[] = {
> > +	{ "create", 0, { 0, "struct task_struct *", "psinfo_t *" } },
> > +	{ "exec", 0, { 0, "string", } },
> > +	{ "exec-failure", 0, { 0, "int", } },
> > +	{ "exec-success", },
> > +	{ "exit", 0, { 0, "int", } },
> [...]
> > +	{ "signal-discard", 0, { 0, "struct task_struct *", "lwpsinfo_t *" } },
> > +	{ "signal-discard", 1, { 0, "struct task_struct *", "psinfo_t *" } },
> > +	{ "signal-discard", 2, { 1, "int", } },
> > +	{ "signal-handle", 0, { 0, "int", } },
> > +	{ "signal-handle", 1, { 1, "siginfo_t *", } },
> > +	{ "signal-handle", 2, { 2, "void (*)(void)", } },
> 
> The code in probe_info assumes that all entries for a given probe are
> packed together, and silently misbehaves if that's not true. You should
> probably stick a comment in above probe_args mentioning that.

Sure.

> > +	{ "signal-send", 0, { 0, "struct task_struct *", "lwpsinfo_t *" } },
> > +	{ "signal-send", 1, { 0, "struct task_struct *", "psinfo_t *" } },
> > +	{ "signal-send", 2, { 1, "int", } },
> 
> It took a bit of effort to figure out that the native arg type (e.g
> "int" here) being NULL is used to indicate a probe with zero args. You
> should probably note that too :)

Well, no...  NULL as translated type (see above) means that no translation is
needed.  No values other than the name (see exec-success above) means there are
no arguments and that is simply because the 2nd value in the struct is 0.

> > +static void enable(dtrace_hdl_t *dtp, dt_probe_t *prp)
> > +{
> > +	dt_probe_t		*uprp = NULL;
> > +	dtrace_probedesc_t	pd;
> > +
> > +	if (strcmp(prp->desc->prb, "create") == 0 ||
> > +	    strcmp(prp->desc->prb, "lwp-create") == 0) {
> 
> You use dispatch tables for the args: part of me says a dispatch table
> for this bit would be nice too, but then I try to design it and it gets
> really clunky really fast, so I guess this is reasonable (there's almost
> no repeated code in here anyway, and it's easy to read).
> 
> But sdt providers with lots more probes in them should probably do
> something like that, maybe, just to split things up a bit, or have a
> function like this one with if bodies that call out.

I expect we'll revise this in the future as more SDT providers are implemented
but that is future development.

> > +		pd.id = DTRACE_IDNONE;
> > +		pd.prv = "rawtp";
> > +		pd.mod = "sched";
> > +		pd.fun = "";
> > +		pd.prb = "sched_process_fork";
> 
> This stuff will definitely end up needing to be data-driven in some way
> as kernels change: whether via hardwiring in here or something in the
> libdir/kernel.version stuff like translators, I'm not sure, but honestly
> in the long run I think something like translators would be better.
> 
> Not worth it yet, of course, and this should be easily translatable into
> that sort of model.

Time will tell.  For now, I am waiting to see how complex these things can get
before deciding how to make it more data-driven.  It is just too soon to tell
what we'll need.

> > +/*
> > + * Generate a BPF trampoline for a SDT probe.
> > + *
> > + * The trampoline function is called when a SDT probe triggers, and it must
> > + * satisfy the following prototype:
> > + *
> > + *	int dt_proc(void *data)
> > + *
> > + * The trampoline will populate a dt_dctx_t struct and then call the function
> > + * that implements the compiled D clause.  It returns the value that it gets
> > + * back from that function.
> > + */
> 
> This is the bit that depends on the actual args for the raw tracepoints,
> and probably *does* require something more complex (much like, well,
> translators) when we start getting hit with changing kernels.
> 
> I'm honestly wondering if we can't find a way to literally get
> translators to do the job for us, somehow. The job is so similar...
> 
> (the code does look reasonable insofar as we are forced to do it all by
> hand for now.)
> 
> > +static int probe_info(dtrace_hdl_t *dtp, const dt_probe_t *prp,
> > +		      int *argcp, dt_argdesc_t **argvp)
> > +{
> > +	int		i;
> > +	int		pidx = -1;
> > +	int		argc = 0;
> > +	dt_argdesc_t	*argv = NULL;
> > +
> > +	for (i = 0; i < ARRAY_SIZE(probe_args); i++) {
> > +		probe_arg_t	*arg = &probe_args[i];
> > +
> > +		if (strcmp(arg->name, prp->desc->prb) == 0) {
> > +			if (pidx == -1) {
> > +				pidx = i;
> > +
> > +				if (arg->argdesc.native == NULL)
> > +					break;
> > +			}
> > +
> > +			argc++;
> > +		}
> > +	}
> 
> Needs at least some comment here or above probe_args as noted earlier:
> this function imposes requirements on the structure of probe_args which
> are nowhere described.
> 
> > diff --git a/libdtrace/dt_provider.h b/libdtrace/dt_provider.h
> > index 92fa7d55..5015a275 100644
> > --- a/libdtrace/dt_provider.h
> > +++ b/libdtrace/dt_provider.h
> > @@ -47,13 +47,9 @@ struct dt_probe;
> >   * to multiple args[X] variables.
> >   */
> >  typedef struct dt_argdesc {
> > -#ifdef FIXME
> > -	dtrace_id_t id;
> > -	int ndx;
> > -#endif
> >  	int mapping;
> > -	char *native;
> > -	char *xlate;
> > +	const char *native;
> > +	const char *xlate;
> >  } dt_argdesc_t;
> 
> Oh good!
> 
> > diff --git a/test/unittest/proc/tst.discard.sh b/test/unittest/proc/tst.discard.sh
> > index 8ac82c2e..b157cf66 100755
> > --- a/test/unittest/proc/tst.discard.sh
> > +++ b/test/unittest/proc/tst.discard.sh
> > @@ -8,20 +8,22 @@
> >  # This script tests that the proc:::signal-discard probe fires correctly
> >  # and with the correct arguments.
> >  #
> > -# If this fails, the script will run indefinitely; it relies on the harness
> > -# to time it out.
> > -#
> >  # @@xfail: dtv2
> >  
> >  script()
> >  {
> >  	$dtrace $dt_flags -s /dev/stdin <<EOF
> >  	proc:::signal-discard
> > -	/args[1]->pr_pid == $child &&
> > -	    args[1]->pr_psargs == "$longsleep" && args[2] == SIGHUP/
> > +	/args[1]->pr_pid == $child && args[1]->pr_fname == "sleep" &&
> > +	 args[2] == SIGHUP/
> >  	{
> >  		exit(0);
> >  	}
> > +
> > +	tick-5s
> > +	{
> > +		exit(124);
> > +	}
> 
> ... obviously better than having the harness kill it! I was trying not
> to rely on the fname, though, because there aren't going to be any other
> longsleeps on the system, but sleeps doing actual work that shouldn't be
> picked up are quite common (and yes I know the pr_pid clause should rule
> those out).

Yes... unfortunately pr_psargs doesn't work right now so I had to come up
with an alternative.  For the tst.exitcore.sh case I had to make it a bit
more specific because there are multiple sleeps in the actual sleeper loop,
but since here there is just the long sleep the pr_pid + pr_fname check is
sufficient.