[DTrace-devel] [PATCH 3/4] Drop counters support

Tue May 9 22:33:36 UTC 2023

Reviewed-by: Eugene Loh <eugene.loh at oracle.com>
with questions/comments.

First, the subject line sounds like some support is being dropped.  
Like, "Drop support for counters."  So how about changing to "Add 
support for drop counters."  Or, "Add drop-counter support."

On 5/9/23 03:19, Kris Van Hees via DTrace-devel wrote:
> DTrace provides a mechanism to report dropped events.  A drop occurs
> when trace data cannot be recorded for a particular reason.  This
> patch contains the full implementation of 4 categories of drops,
> primarily because they are all closely related:
>
> - principal buffer drops: reported when the producer failed to add
>    trace data to the principal buffer
> - aggregation buffer drops: reported when the producer failed to
>    allocate an aggreggation and therefore failed to record data
> - speculation drops: reported when something goes wrong with the
>    recording of speculative tracing data
>      + regular drops: reported when speculative data could not be
>        written to a speculation buffer
>      + busy drops: reported when a speculation could not be created
>        because all buffers are busy being committed or discarded
>      + unavailable drops: reported when no available speculation
>        buffers were found
> - dynamic variable drops: reported when a dynamic variable (or
>    associative array element) could not be allocated
>
> Two mechanism for reporting drops are needed:

s/mechanism/mechanisms/

> (1) Per-CPU reporting: used for principal and aggregation buffer
>      drops (stored in the cpuinfo structures)
> (2) Global reporting: used for speculation and dynamic variable
>      drops (stored in the state BPF map)
>
> Detection of drops (and subsequent reporting to the user) is done
> through frequent retrieval of status data.  The handling of status
> data (and the use of statusrate) is being re-introduced with this
> patch.
>
> The drop count for speculations is a bit more complex than the other
> ones because drops can occur both in the producer (when data cannot
> be written to the trace output buffer) *and* in the consumer (when
> data cannot be recorded in a speculation buffer.  These separate

Second parenthetical remark needs a closing parenthesis.

> counts are combined whenever status processing takes place to ensure
> the correct drop count is presented to the user.
>
> Various tests have updated expected results because drops are now
> being reported correctly.
>
> Signed-off-by: Kris Van Hees <kris.van.hees at oracle.com>
>
> diff --git a/bpf/get_dvar.c b/bpf/get_dvar.c
> @@ -1,19 +1,38 @@
> +/*
> + * Register an aggregation drop.
> + */

"Aggregation drop" is copy/paste error?

> +noinline void *dt_no_dvar(void)
> +{
> +	uint32_t	kind = DT_STATE_DYNVAR_DROPS;
> +	uint32_t	*valp;
> +
> +	valp = bpf_map_lookup_elem(&state, &kind);
> +	if (valp == 0)
> +		return 0;
> +
> +	atomic_add32(valp, 1);
> +	return 0;
> +}
> +
>   /*
>    * Dynamic variables are identified using a unique 64-bit key.  Three different
>    * categories of dynamic variables are supported in DTrace:
> @@ -80,7 +99,7 @@ noinline void *dt_get_dvar(uint64_t key, uint64_t store, uint64_t nval,
>   	 * with the default value.
>   	 */
>   	if (bpf_map_update_elem(&dvars, &key, dflt, BPF_ANY) < 0)
> -		return 0;
> +		return dt_no_dvar();
>   
>   	val = bpf_map_lookup_elem(&dvars, &key);
>   	if (val != 0)
> @@ -126,13 +145,13 @@ noinline void *dt_get_assoc(uint32_t id, const char *tuple, uint64_t store,
>   		 * actual value.
>   		 */
>   		if (bpf_map_update_elem(&tuples, tuple, &dflt_val, BPF_ANY) < 0)
> -			return 0;
> +			return dt_no_dvar();
>   		valp = bpf_map_lookup_elem(&tuples, tuple);
>   		if (valp == 0)
>   			return 0;

Should this one also return dt_no_dvar()?

>   		*valp = (uint64_t)valp;
>   		if (bpf_map_update_elem(&tuples, tuple, valp, BPF_ANY) < 0)
> -			return 0;
> +			return dt_no_dvar();
>   
>   		val = *valp;
>   	} else {
> diff --git a/bpf/speculation.c b/bpf/speculation.c
> @@ -19,18 +19,36 @@
> +/*
> + * Register an aggregation drop.
> + */

"Aggregation drop" is a copy/paste typo.

> +noinline uint32_t dt_no_spec(uint32_t kind)
> +{
> +	uint32_t	*valp;
> +
> +	valp = bpf_map_lookup_elem(&state, &kind);
> +	if (valp == 0)
> +		return 0;
> +
> +	atomic_add32(valp, 1);
> +	return 0;
> +}
> +
>   /*
>    * Assign a speculation ID.
>    */
>   noinline uint32_t dt_speculation(void)
>   {
> -	uint32_t id;
> -	dt_bpf_specs_t zero;
> +	uint32_t	id, busy;
> +	dt_bpf_specs_t	zero;
> +	dt_bpf_specs_t	*spec;
>   
>   	__builtin_memset(&zero, 0, sizeof (dt_bpf_specs_t));
>   
> +	busy = 0;

Or, do this initialization in the declaration.

>   #if 1 /* Loops are broken in BPF right now */
>   #define SEARCH(n)							\
>   	do {								\
> diff --git a/libdtrace/dt_bpf.c b/libdtrace/dt_bpf.c
> @@ -604,26 +604,29 @@ gmap_create_buffers(dtrace_hdl_t *dtp)
>   static int
>   gmap_create_cpuinfo(dtrace_hdl_t *dtp)
>   {
> -	int			i, fd, rc;
> +	int			i, rc;
>   	uint32_t		key = 0;
>   	dtrace_conf_t		*conf = &dtp->dt_conf;
>   	size_t			ncpus = conf->max_cpuid + 1;
>   	dt_bpf_cpuinfo_t	*data;
>   	cpuinfo_t		*ci;
>   
> -	data = dt_zalloc(dtp, ncpus * sizeof(dt_bpf_cpuinfo_t));
> +	data = dt_calloc(dtp, dtp->dt_conf.num_possible_cpus,
> +			 sizeof(dt_bpf_cpuinfo_t));
>   	if (data == NULL)
>   		return dt_set_errno(dtp, EDT_NOMEM);
>   
>   	for (i = 0, ci = &conf->cpus[0]; i < ncpus; i++, ci++)
>   		memcpy(&data[ci->cpu_id].ci, ci, sizeof(cpuinfo_t));
>   
> -	fd = create_gmap(dtp, "cpuinfo", BPF_MAP_TYPE_PERCPU_ARRAY,
> -			 sizeof(uint32_t), sizeof(dt_bpf_cpuinfo_t), 1);
> -	if (fd == -1)
> +	dtp->dt_cpumap_fd = create_gmap(dtp, "cpuinfo",
> +					BPF_MAP_TYPE_PERCPU_ARRAY,
> +					sizeof(uint32_t),
> +					sizeof(dt_bpf_cpuinfo_t), 1);
> +	if (dtp->dt_cpumap_fd == -1)
>   		return -1;
>   
> -	rc = dt_bpf_map_update(fd, &key, data);
> +	rc = dt_bpf_map_update(dtp->dt_cpumap_fd, &key, data);
>   	dt_free(dtp, data);
>   	if (rc == -1)
>   		return dt_bpf_error(dtp,

Should these changes to gmap_create_cpuinfo(), along with the 
introduction to dt_cpumap_fd, be put in the preceding patch?

> diff --git a/libdtrace/dt_cg.c b/libdtrace/dt_cg.c
> @@ -867,6 +870,36 @@ dt_cg_epilogue(dt_pcb_t *pcb)
>   		emit(dlp, BPF_MOV_IMM(BPF_REG_5, pcb->pcb_bufoff));
>   		emit(dlp, BPF_ALU64_IMM(BPF_ADD, BPF_REG_5, 4));
>   		emit(dlp, BPF_CALL_HELPER(BPF_FUNC_perf_event_output));
> +
> +		/*
> +		 * If writing the trace data to the output buffer failed,
> +		 * increment the drop count for the principal buffer (no
> +		 * speculate() in the clause) or for aggregations (speculate()
> +		 * in the clause).
> +		 */

The comment is out-of-order with respect to the code, refers incorrectly 
to aggregations, and is longer than necessary.  How about just stopping 
the comment at "drop count."

> +		emit(dlp, BPF_BRANCH_IMM(BPF_JEQ, BPF_REG_0, 0, pcb->pcb_exitlbl));
> +		if (cflags & DT_CLSFLAG_SPECULATE) {
> +			idp = dt_dlib_get_map(dtp, "state");
> +			assert(idp != NULL);
> +			dt_cg_xsetx(dlp, idp, DT_LBL_NONE, BPF_REG_1, idp->di_id);
> +			emit(dlp, BPF_LOAD(BPF_DW, BPF_REG_2, BPF_REG_FP, DT_STK_SP));
> +			emit(dlp, BPF_STORE_IMM(BPF_DW, BPF_REG_2, 0, DT_STATE_SPEC_DROPS));
> +			emit(dlp, BPF_CALL_HELPER(BPF_FUNC_map_lookup_elem));
> +			emit(dlp, BPF_BRANCH_IMM(BPF_JEQ, BPF_REG_0, 0, pcb->pcb_exitlbl));
> +			emit(dlp, BPF_MOV_IMM(BPF_REG_1, 1));
> +			emit(dlp, BPF_XADD_REG(BPF_W, BPF_REG_0, 0, BPF_REG_1));
> +		} else {
> +			idp = dt_dlib_get_map(dtp, "cpuinfo");
> +			assert(idp != NULL);
> +			dt_cg_xsetx(dlp, idp, DT_LBL_NONE, BPF_REG_1, idp->di_id);
> +			emit(dlp, BPF_LOAD(BPF_DW, BPF_REG_2, BPF_REG_FP, DT_STK_SP));
> +			emit(dlp, BPF_STORE_IMM(BPF_DW, BPF_REG_2, 0, 0));
> +			emit(dlp, BPF_CALL_HELPER(BPF_FUNC_map_lookup_elem));
> +			emit(dlp, BPF_BRANCH_IMM(BPF_JEQ, BPF_REG_0, 0, pcb->pcb_exitlbl));
> +			emit(dlp, BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, offsetof(dt_bpf_cpuinfo_t, buf_drops)));
> +			emit(dlp, BPF_MOV_IMM(BPF_REG_1, 1));
> +			emit(dlp, BPF_XADD_REG(BPF_DW, BPF_REG_0, 0, BPF_REG_1));

Instead of BPF_ADD the offset to %r0, you can just specify the offset in 
the BPF_XADD instruction.

> +		}
>   	}
>   
>   	/*
> @@ -3157,7 +3191,9 @@ dt_cg_store_var(dt_node_t *dnp, dt_irlist_t *dlp, dt_regset_t *drp,
>   		emit(dlp,  BPF_MOV_REG(reg, BPF_REG_0));
>   		dt_regset_free(drp, BPF_REG_0);
>   
> +#if 0
>   		dt_cg_check_notnull(dlp, drp, reg);
> +#endif
>   
>   		if (dnp->dn_flags & DT_NF_REF) {
>   			size_t	srcsz;

Why is this #if 0 code still there?

> diff --git a/libdtrace/dt_consume.c b/libdtrace/dt_consume.c
> @@ -2613,8 +2617,8 @@ dt_consume_one(dtrace_hdl_t *dtp, FILE *fp, char *buf,
>   					    rfunc, flow, quiet, peekflags,
>   					    last, 0, arg);
>   	} else if (hdr->type == PERF_RECORD_LOST) {
> -#ifdef FIXME
> -		uint64_t	lost;
> +#if 0
> +		uint64_t	drops;
>   
>   		/*
>   		 * struct {

Why #if 0?  What is this code waiting for?

> diff --git a/libdtrace/dt_work.c b/libdtrace/dt_work.c
> @@ -189,5 +266,25 @@ dtrace_work(dtrace_hdl_t *dtp, FILE *fp, dtrace_consume_probe_f *pfunc,
>   	    DTRACE_WORKSTATUS_ERROR)
>   		return DTRACE_WORKSTATUS_ERROR;
>   
> +	/*
> +	 * If we are stopped, we use dt_get_status() to get any potential
> +	 * pending speculation drops because we want to ensure that old and new
> +	 * counts for other drops are identical (lest they be reported more
> +	 * than once).
> +	 *
> +	 * If we are not stopped, we use dt_add_local_status() to adjust the
> +	 * current drop counters without retrieving producer counts (since they
> +	 * might have changed and we do not want to report them yet).
> +	 */

Put comments in the same order as the code.

> +	if (!dtp->dt_stopped) {
> +		dt_add_local_status(dtp);
> +		gen = dtp->dt_statusgen ^ 1;
> +	} else {
> +		dt_get_status(dtp);
> +		gen = dtp->dt_statusgen;
> +	}
> +
> +	dt_handle_status(dtp, &dtp->dt_status[gen], &dtp->dt_status[gen ^ 1]);
> +
>   	return rval;
>   }
> diff --git a/test/unittest/assocs/tst.store_zero_deletes.d b/test/unittest/assocs/tst.store_zero_deletes.d
> @@ -12,8 +12,13 @@
>    * SECTION: Variables/Thread-Local Variables
>    */
>   
> +/*
> + * We use a dynvarsizde that guarantees that we can only store 4 values without
> + * causing a drop.  Since we use an associative array indexed by an int and
> + * storing an int value, each element will use 20 bytes.
> + */

typo: dynvarsizde

Also, the int+int=20bytes math is not intuitive.

Finally, this test is changing without change in xfail annotation? So is 
this change irrelevant to pass/fail status?  Presumably what happened 
was this test started failing a few patches ago and now is being fixed.  
If so, this fix should be put in the earlier patch so that no patch in 
the series introduces a test regression.

> +#pragma D option dynvarsize=80
>   #pragma D option quiet
> -#pragma D option dynvarsize=15
>   
>   BEGIN
>   {