[DTrace-devel] [PATCH 2/7] Implement the sched provider

Sat May 13 01:50:53 UTC 2023

On Fri, May 12, 2023 at 05:08:48PM +0100, Nick Alcock wrote:
> On 9 May 2023, Kris Van Hees via DTrace-devel uttered the following:
> 
> > Signed-off-by: Kris Van Hees <kris.van.hees at oracle.com>
> 
> Nice! I mean I'm fairly sure I don't understand it yet (hence no
> reviewed-by until I do), but nice!
> 
> > +/*
> > + * Implement sched probes as dependent probes on other probes.  E.g. for the
> > + * on-cpu probe, call dt_probe_add_dependent() to add the on-cpu probe to the
> > + * underlying fbt::schedule_tail:entry probe.  When the program for probe
> > + * fbt::schedule_tail:entry) is generated, loop through all dependent probes,
> 
> That looks like a bracket out of place.
> 
> > + * The conversion needs to be able to implement predicate-style conditions that
> > + * determine whether the dependent probe is to fire when the main probe does.
> > + * The sched:::preempt and sched:::sleep probes are an example of dependent
> > + * probes that need a conditional.  If the condition fails, the main probe
> > + * processing should move on to the next dependent probe (if any).  If there
> > + * are no more dependent probes, it will end up skipping to the exit label,
> 
> Really nice comment!
> 
> > +static probe_arg_t probe_args[] = {
> > +#if 0
> 
> (presumably this is being done later)
> 
> > +	/*
> > +	 * Finally, ensure we're in the list of enablings as well.
> > +	 * (This ensures that, among other things, the probes map
> > +	 * gains entries for us.)
> > +	 */
> > +	if (!dt_in_list(&dtp->dt_enablings, prp))
> > +		dt_list_append(&dtp->dt_enablings, prp);
> 
> Aside: at some point in the future we should probably add a hash over
> the enablings so we don't need to traverse it *quite* so often. (Right
> now I'd guess it to be undetectable unless you did something like enable
> fbt::.)

Yes, that is an area that needs attention in the future.  A hash or some other
way to identify quickly whether a probe is already in the list or not.  We
might even just set a flag on the probe since it is not like the probe list is
shared between instances or anything.

> +	if (strcmp(prp->desc->prb, "on-cpu") == 0) {
> +		pd.id = DTRACE_IDNONE;
> +		pd.prv = "fbt";
> +		pd.mod = "";
> +		pd.fun = "schedule_tail";
> +		pd.prb = "entry";
> +
> +		uprp = dt_probe_lookup(dtp, &pd);
> +		assert(uprp != NULL);
> +
> +		dt_probe_add_dependent(dtp, uprp, prp);
> +		dt_probe_enable(dtp, uprp);
> +	} else if (strcmp(prp->desc->prb, "preempt") == 0 ||
> +		   strcmp(prp->desc->prb, "sleep") == 0) {
> +		pd.id = DTRACE_IDNONE;
> +		pd.prv = "sdt";
> +		pd.mod = "sched";
> +		pd.fun = "";
> +		pd.prb = "sched_switch";
> +
> +		uprp = dt_probe_lookup(dtp, &pd);
> +		assert(uprp != NULL);
> +
> +		dt_probe_add_dependent(dtp, uprp, prp);
> +		dt_probe_enable(dtp, uprp);
> 
> So for on-cpu, the probedesc is the description of the underlying fbt
> probe, but for preempt and sleep it's... what, something in sdt:sched,
> which is what we're defining here, even though we do a
> dt_probe_add_dependent() with it just like we do with on-cpu above?
> 
> I guess I don't understand what the underlying probe even is for preempt
> and sleep.

sdt:sched::<probe> is a kernel tracepoint.
The probes that the sched provider is implementing are sched:::

> > +/*
> > + * Generate a BPF trampoline for a SDT probe.
> > + *
> > + * The trampoline function is called when a SDT probe triggers, and it must
> > + * satisfy the following prototype:
> > + *
> > + *	int dt_proc(void *data)
> > + *
> > + * The trampoline will populate a dt_dctx_t struct and then call the function
> > + * that implements the compiled D clause.  It returns the value that it gets
> > + * back from that function.
> > + */
> > +static void trampoline(dt_pcb_t *pcb, uint_t exitlbl)
> > +{
> > +	dtrace_hdl_t	*dtp = pcb->pcb_hdl;
> > +	dt_irlist_t	*dlp = &pcb->pcb_ir;
> > +	dt_probe_t	*prp = pcb->pcb_probe;
> > +
> > +	if (strcmp(prp->desc->prb, "preempt") == 0) {
> > +		emit(dlp, BPF_LOAD(BPF_DW, BPF_REG_0, BPF_REG_7, DMST_ARG(3)));
> > +                emit(dlp, BPF_BRANCH_IMM(BPF_JLT, BPF_REG_0, 1 << 8, pcb->pcb_exitlbl));
> > +	} else if (strcmp(prp->desc->prb, "sleep") == 0) {
> > +		emit(dlp, BPF_LOAD(BPF_DW, BPF_REG_0, BPF_REG_7, DMST_ARG(3)));
> > +                emit(dlp, BPF_BRANCH_IMM(BPF_JGE, BPF_REG_0, 1 << 8, pcb->pcb_exitlbl));
> > +	}
> > +}
> 
> ... that's the conditional: you should probably mention it, since I had
> to dig back to the comment at the top to figure out why only those two
> probes needed naming here. (Describing the condition more clearly than
> this would be nice too -- this is only clear if you know what the
> underlying probe is and what it's doing, and as noted I am already
> confused about that.)

Yes, though I have been debating about how to best document this because it
is quite often very dependent on the underlying probe implementation at the
kernel level.  We don't want to end up documenting too much detail either.

> > +static int probe_info(dtrace_hdl_t *dtp, const dt_probe_t *prp,
> > +		      int *argcp, dt_argdesc_t **argvp)
> > +{
> > +	int		i;
> > +	int		pidx = -1;
> > +	int		argc = 0;
> > +	dt_argdesc_t	*argv = NULL;
> > +
> > +	for (i = 0; i < ARRAY_SIZE(probe_args); i++) {
> > +		probe_arg_t	*arg = &probe_args[i];
> > +
> > +		if (strcmp(arg->name, prp->desc->prb) == 0) {
> > +			if (pidx == -1) {
> > +				pidx = i;
> > +
> > +				if (arg->argdesc.native == NULL)
> > +					break;
> > +			}
> > +
> > +			argc++;
> > +		}
> > +	}
> > +
> > +	if (argc == 0)
> > +		goto done;
> > +
> > +	argv = dt_zalloc(dtp, argc * sizeof(dt_argdesc_t));
> > +	if (!argv)
> > +		return -ENOMEM;
> > +
> > +	for (i = pidx; i < pidx + argc; i++) {
> > +		probe_arg_t	*arg = &probe_args[i];
> > +
> > +		argv[arg->argno] = arg->argdesc;
> > +	}
> 
> I find myself wondering if the dependent probe stuff could reuse
> the guts of probe_info somehow. Everything using a dependent probe is
> going to have a nearly identical probe_info...

No, quite often the probe_info of the underlying probe is quite different
from the dependent probe.