[DTrace-devel] [PATCH 3/4] Drop counters support

Wed May 10 03:32:48 UTC 2023

On Tue, May 09, 2023 at 06:33:36PM -0400, Eugene Loh via DTrace-devel wrote:
> Reviewed-by: Eugene Loh <eugene.loh at oracle.com>
> with questions/comments.
> 
> First, the subject line sounds like some support is being dropped.  Like,
> "Drop support for counters."  So how about changing to "Add support for drop
> counters."  Or, "Add drop-counter support."

Ah, good point.

> On 5/9/23 03:19, Kris Van Hees via DTrace-devel wrote:
> > DTrace provides a mechanism to report dropped events.  A drop occurs
> > when trace data cannot be recorded for a particular reason.  This
> > patch contains the full implementation of 4 categories of drops,
> > primarily because they are all closely related:
> > 
> > - principal buffer drops: reported when the producer failed to add
> >    trace data to the principal buffer
> > - aggregation buffer drops: reported when the producer failed to
> >    allocate an aggreggation and therefore failed to record data
> > - speculation drops: reported when something goes wrong with the
> >    recording of speculative tracing data
> >      + regular drops: reported when speculative data could not be
> >        written to a speculation buffer
> >      + busy drops: reported when a speculation could not be created
> >        because all buffers are busy being committed or discarded
> >      + unavailable drops: reported when no available speculation
> >        buffers were found
> > - dynamic variable drops: reported when a dynamic variable (or
> >    associative array element) could not be allocated
> > 
> > Two mechanism for reporting drops are needed:
> 
> s/mechanism/mechanisms/

Yup.

> > (1) Per-CPU reporting: used for principal and aggregation buffer
> >      drops (stored in the cpuinfo structures)
> > (2) Global reporting: used for speculation and dynamic variable
> >      drops (stored in the state BPF map)
> > 
> > Detection of drops (and subsequent reporting to the user) is done
> > through frequent retrieval of status data.  The handling of status
> > data (and the use of statusrate) is being re-introduced with this
> > patch.
> > 
> > The drop count for speculations is a bit more complex than the other
> > ones because drops can occur both in the producer (when data cannot
> > be written to the trace output buffer) *and* in the consumer (when
> > data cannot be recorded in a speculation buffer.  These separate
> 
> Second parenthetical remark needs a closing parenthesis.

Yup.

> > counts are combined whenever status processing takes place to ensure
> > the correct drop count is presented to the user.
> > 
> > Various tests have updated expected results because drops are now
> > being reported correctly.
> > 
> > Signed-off-by: Kris Van Hees <kris.van.hees at oracle.com>
> > 
> > diff --git a/bpf/get_dvar.c b/bpf/get_dvar.c
> > @@ -1,19 +1,38 @@
> > +/*
> > + * Register an aggregation drop.
> > + */
> 
> "Aggregation drop" is copy/paste error?

Correct.  Fixed.

> > +noinline void *dt_no_dvar(void)
> > +{
> > +	uint32_t	kind = DT_STATE_DYNVAR_DROPS;
> > +	uint32_t	*valp;
> > +
> > +	valp = bpf_map_lookup_elem(&state, &kind);
> > +	if (valp == 0)
> > +		return 0;
> > +
> > +	atomic_add32(valp, 1);
> > +	return 0;
> > +}
> > +
> >   /*
> >    * Dynamic variables are identified using a unique 64-bit key.  Three different
> >    * categories of dynamic variables are supported in DTrace:
> > @@ -80,7 +99,7 @@ noinline void *dt_get_dvar(uint64_t key, uint64_t store, uint64_t nval,
> >   	 * with the default value.
> >   	 */
> >   	if (bpf_map_update_elem(&dvars, &key, dflt, BPF_ANY) < 0)
> > -		return 0;
> > +		return dt_no_dvar();
> >   	val = bpf_map_lookup_elem(&dvars, &key);
> >   	if (val != 0)
> > @@ -126,13 +145,13 @@ noinline void *dt_get_assoc(uint32_t id, const char *tuple, uint64_t store,
> >   		 * actual value.
> >   		 */
> >   		if (bpf_map_update_elem(&tuples, tuple, &dflt_val, BPF_ANY) < 0)
> > -			return 0;
> > +			return dt_no_dvar();
> >   		valp = bpf_map_lookup_elem(&tuples, tuple);
> >   		if (valp == 0)
> >   			return 0;
> 
> Should this one also return dt_no_dvar()?

Hm, yes...  good point.

> >   		*valp = (uint64_t)valp;
> >   		if (bpf_map_update_elem(&tuples, tuple, valp, BPF_ANY) < 0)
> > -			return 0;
> > +			return dt_no_dvar();
> >   		val = *valp;
> >   	} else {
> > diff --git a/bpf/speculation.c b/bpf/speculation.c
> > @@ -19,18 +19,36 @@
> > +/*
> > + * Register an aggregation drop.
> > + */
> 
> "Aggregation drop" is a copy/paste typo.
> 
> > +noinline uint32_t dt_no_spec(uint32_t kind)
> > +{
> > +	uint32_t	*valp;
> > +
> > +	valp = bpf_map_lookup_elem(&state, &kind);
> > +	if (valp == 0)
> > +		return 0;
> > +
> > +	atomic_add32(valp, 1);
> > +	return 0;
> > +}
> > +
> >   /*
> >    * Assign a speculation ID.
> >    */
> >   noinline uint32_t dt_speculation(void)
> >   {
> > -	uint32_t id;
> > -	dt_bpf_specs_t zero;
> > +	uint32_t	id, busy;
> > +	dt_bpf_specs_t	zero;
> > +	dt_bpf_specs_t	*spec;
> >   	__builtin_memset(&zero, 0, sizeof (dt_bpf_specs_t));
> > +	busy = 0;
> 
> Or, do this initialization in the declaration.
> 
> >   #if 1 /* Loops are broken in BPF right now */
> >   #define SEARCH(n)							\
> >   	do {								\
> > diff --git a/libdtrace/dt_bpf.c b/libdtrace/dt_bpf.c
> > @@ -604,26 +604,29 @@ gmap_create_buffers(dtrace_hdl_t *dtp)
> >   static int
> >   gmap_create_cpuinfo(dtrace_hdl_t *dtp)
> >   {
> > -	int			i, fd, rc;
> > +	int			i, rc;
> >   	uint32_t		key = 0;
> >   	dtrace_conf_t		*conf = &dtp->dt_conf;
> >   	size_t			ncpus = conf->max_cpuid + 1;
> >   	dt_bpf_cpuinfo_t	*data;
> >   	cpuinfo_t		*ci;
> > -	data = dt_zalloc(dtp, ncpus * sizeof(dt_bpf_cpuinfo_t));
> > +	data = dt_calloc(dtp, dtp->dt_conf.num_possible_cpus,
> > +			 sizeof(dt_bpf_cpuinfo_t));
> >   	if (data == NULL)
> >   		return dt_set_errno(dtp, EDT_NOMEM);
> >   	for (i = 0, ci = &conf->cpus[0]; i < ncpus; i++, ci++)
> >   		memcpy(&data[ci->cpu_id].ci, ci, sizeof(cpuinfo_t));
> > -	fd = create_gmap(dtp, "cpuinfo", BPF_MAP_TYPE_PERCPU_ARRAY,
> > -			 sizeof(uint32_t), sizeof(dt_bpf_cpuinfo_t), 1);
> > -	if (fd == -1)
> > +	dtp->dt_cpumap_fd = create_gmap(dtp, "cpuinfo",
> > +					BPF_MAP_TYPE_PERCPU_ARRAY,
> > +					sizeof(uint32_t),
> > +					sizeof(dt_bpf_cpuinfo_t), 1);
> > +	if (dtp->dt_cpumap_fd == -1)
> >   		return -1;
> > -	rc = dt_bpf_map_update(fd, &key, data);
> > +	rc = dt_bpf_map_update(dtp->dt_cpumap_fd, &key, data);
> >   	dt_free(dtp, data);
> >   	if (rc == -1)
> >   		return dt_bpf_error(dtp,
> 
> Should these changes to gmap_create_cpuinfo(), along with the introduction
> to dt_cpumap_fd, be put in the preceding patch?

They are not needed in the preceeding patch.  They are only needed with this
patch.  I guess it could be split out into its own patch but I figured I
might as well include it here along with a few other changes that are very
xpecific to drop counter support.

> > diff --git a/libdtrace/dt_cg.c b/libdtrace/dt_cg.c
> > @@ -867,6 +870,36 @@ dt_cg_epilogue(dt_pcb_t *pcb)
> >   		emit(dlp, BPF_MOV_IMM(BPF_REG_5, pcb->pcb_bufoff));
> >   		emit(dlp, BPF_ALU64_IMM(BPF_ADD, BPF_REG_5, 4));
> >   		emit(dlp, BPF_CALL_HELPER(BPF_FUNC_perf_event_output));
> > +
> > +		/*
> > +		 * If writing the trace data to the output buffer failed,
> > +		 * increment the drop count for the principal buffer (no
> > +		 * speculate() in the clause) or for aggregations (speculate()
> > +		 * in the clause).
> > +		 */
> 
> The comment is out-of-order with respect to the code, refers incorrectly to
> aggregations, and is longer than necessary.  How about just stopping the
> comment at "drop count."

I don't think that the order of cases in the comment vs the conditional in the
code is that big of a deal (it is rather obvious), but yes, should mention
speculations rather than aggregtions.

> > +		emit(dlp, BPF_BRANCH_IMM(BPF_JEQ, BPF_REG_0, 0, pcb->pcb_exitlbl));
> > +		if (cflags & DT_CLSFLAG_SPECULATE) {
> > +			idp = dt_dlib_get_map(dtp, "state");
> > +			assert(idp != NULL);
> > +			dt_cg_xsetx(dlp, idp, DT_LBL_NONE, BPF_REG_1, idp->di_id);
> > +			emit(dlp, BPF_LOAD(BPF_DW, BPF_REG_2, BPF_REG_FP, DT_STK_SP));
> > +			emit(dlp, BPF_STORE_IMM(BPF_DW, BPF_REG_2, 0, DT_STATE_SPEC_DROPS));
> > +			emit(dlp, BPF_CALL_HELPER(BPF_FUNC_map_lookup_elem));
> > +			emit(dlp, BPF_BRANCH_IMM(BPF_JEQ, BPF_REG_0, 0, pcb->pcb_exitlbl));
> > +			emit(dlp, BPF_MOV_IMM(BPF_REG_1, 1));
> > +			emit(dlp, BPF_XADD_REG(BPF_W, BPF_REG_0, 0, BPF_REG_1));
> > +		} else {
> > +			idp = dt_dlib_get_map(dtp, "cpuinfo");
> > +			assert(idp != NULL);
> > +			dt_cg_xsetx(dlp, idp, DT_LBL_NONE, BPF_REG_1, idp->di_id);
> > +			emit(dlp, BPF_LOAD(BPF_DW, BPF_REG_2, BPF_REG_FP, DT_STK_SP));
> > +			emit(dlp, BPF_STORE_IMM(BPF_DW, BPF_REG_2, 0, 0));
> > +			emit(dlp, BPF_CALL_HELPER(BPF_FUNC_map_lookup_elem));
> > +			emit(dlp, BPF_BRANCH_IMM(BPF_JEQ, BPF_REG_0, 0, pcb->pcb_exitlbl));
> > +			emit(dlp, BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, offsetof(dt_bpf_cpuinfo_t, buf_drops)));
> > +			emit(dlp, BPF_MOV_IMM(BPF_REG_1, 1));
> > +			emit(dlp, BPF_XADD_REG(BPF_DW, BPF_REG_0, 0, BPF_REG_1));
> 
> Instead of BPF_ADD the offset to %r0, you can just specify the offset in the
> BPF_XADD instruction.

Yes, good point.

> > +		}
> >   	}
> >   	/*
> > @@ -3157,7 +3191,9 @@ dt_cg_store_var(dt_node_t *dnp, dt_irlist_t *dlp, dt_regset_t *drp,
> >   		emit(dlp,  BPF_MOV_REG(reg, BPF_REG_0));
> >   		dt_regset_free(drp, BPF_REG_0);
> > +#if 0
> >   		dt_cg_check_notnull(dlp, drp, reg);
> > +#endif
> >   		if (dnp->dn_flags & DT_NF_REF) {
> >   			size_t	srcsz;
> 
> Why is this #if 0 code still there?

Forgot to remove it.

> > diff --git a/libdtrace/dt_consume.c b/libdtrace/dt_consume.c
> > @@ -2613,8 +2617,8 @@ dt_consume_one(dtrace_hdl_t *dtp, FILE *fp, char *buf,
> >   					    rfunc, flow, quiet, peekflags,
> >   					    last, 0, arg);
> >   	} else if (hdr->type == PERF_RECORD_LOST) {
> > -#ifdef FIXME
> > -		uint64_t	lost;
> > +#if 0
> > +		uint64_t	drops;
> >   		/*
> >   		 * struct {
> 
> Why #if 0?  What is this code waiting for?

It is to be removed.  We cannot use the PERF_RECORD_LOST entry to report
drops because that does not differentiate between kinds.

> > diff --git a/libdtrace/dt_work.c b/libdtrace/dt_work.c
> > @@ -189,5 +266,25 @@ dtrace_work(dtrace_hdl_t *dtp, FILE *fp, dtrace_consume_probe_f *pfunc,
> >   	    DTRACE_WORKSTATUS_ERROR)
> >   		return DTRACE_WORKSTATUS_ERROR;
> > +	/*
> > +	 * If we are stopped, we use dt_get_status() to get any potential
> > +	 * pending speculation drops because we want to ensure that old and new
> > +	 * counts for other drops are identical (lest they be reported more
> > +	 * than once).
> > +	 *
> > +	 * If we are not stopped, we use dt_add_local_status() to adjust the
> > +	 * current drop counters without retrieving producer counts (since they
> > +	 * might have changed and we do not want to report them yet).
> > +	 */
> 
> Put comments in the same order as the code.

Sure.

> > +	if (!dtp->dt_stopped) {
> > +		dt_add_local_status(dtp);
> > +		gen = dtp->dt_statusgen ^ 1;
> > +	} else {
> > +		dt_get_status(dtp);
> > +		gen = dtp->dt_statusgen;
> > +	}
> > +
> > +	dt_handle_status(dtp, &dtp->dt_status[gen], &dtp->dt_status[gen ^ 1]);
> > +
> >   	return rval;
> >   }
> > diff --git a/test/unittest/assocs/tst.store_zero_deletes.d b/test/unittest/assocs/tst.store_zero_deletes.d
> > @@ -12,8 +12,13 @@
> >    * SECTION: Variables/Thread-Local Variables
> >    */
> > +/*
> > + * We use a dynvarsizde that guarantees that we can only store 4 values without
> > + * causing a drop.  Since we use an associative array indexed by an int and
> > + * storing an int value, each element will use 20 bytes.
> > + */
> 
> typo: dynvarsizde
> 
> Also, the int+int=20bytes math is not intuitive.

That is ok - not meant to be intuitive - it is meant as a reminder,

> Finally, this test is changing without change in xfail annotation? So is
> this change irrelevant to pass/fail status?  Presumably what happened was
> this test started failing a few patches ago and now is being fixed.  If so,
> this fix should be put in the earlier patch so that no patch in the series
> introduces a test regression.
> 
> > +#pragma D option dynvarsize=80
> >   #pragma D option quiet
> > -#pragma D option dynvarsize=15
> >   BEGIN
> >   {

Yes, it belongs in the dynvarsize patch.  Thanks for pointing this out.