[DTrace-devel] [PATCH v2 2/2] Extend the USDT bit mask to multiple words

Tue Jul 22 12:34:37 UTC 2025

On 25 Jun 2025, eugene loh outgrape:

> From: Eugene Loh <eugene.loh at oracle.com>
>
> Currently, USDT is limited to 64 probe descriptions since the
> underlying probe uses a 64-bit mask to decide which probes to execute.
>
> Change to a multi-word bit mask that can be extended to however many
> probe descriptions there are.

Oh yeah!!!!

Not quite happy enough with this one to r-b it, but it's sooo close...
:)

> Also, change the mask words to be 32-bit rather than 64-bit.  The reason
> is that, commonly, there will be fewer than 32 probe descriptions.  In
> this case, we shorten the value of the "USDT prids" BPF map from 16 bytes
>         uint32_t        prid;
>         long long       mask[1];
> down to 8 bytes
>         uint32_t        prid;
>         uint32_t        mask[1];
> (The second member is smaller and no longer costs extra padding.)

Nice!

> We also add an
>         extern int usdt_prids_map_val_extra_bytes;
> to denote how many extra bytes will be needed for the extended mask.

Ugh. Is this really the best we can do? A global variable referenced via
extern? Can't we put it in the dtp or anything? This will fail as soon
as we have more than one dtrace handle... and yes I now that's not
exactly tested and probably doesn't work, but we shouldn't break it
worse when keeping it working as well as it is now is so simple.

> @@ -48,8 +48,9 @@ typedef struct usdt_prids_map_key {
>  } usdt_prids_map_key_t;
>  typedef struct usdt_prids_map_val {
>  	uint32_t	prid;		/* should be dtrace_id_t, sys/dtrace_types.h */
> -	long long	mask;
> +	uint32_t	mask[1];
>  } usdt_prids_map_val_t;
> +extern int usdt_prids_map_val_extra_bytes;

Note that this is not how you're supposed to implement VLAs: [1] is
extremely obsolete (as in, pre-C99) and very deprecated at this point.
[] is the recommended approach, if you can live with its restrictions
(and I think we can here: all we need to do is remove the extra in
extra_bytes and *always* allocate at least one byte, which actually
simplifies the code in usdt_prids_map_val_extra_bytes(). [] does include
the size of necessary padding: mask will start at precisely
sizeof(usdt_prids_map_val_t).)

> diff --git a/libdtrace/dt_prov_uprobe.c b/libdtrace/dt_prov_uprobe.c
> index 5ba50b678..65413cba6 100644
> --- a/libdtrace/dt_prov_uprobe.c
> +++ b/libdtrace/dt_prov_uprobe.c
> @@ -303,6 +303,8 @@ typedef struct list_key {
>  	usdt_prids_map_key_t	key;
>  } list_key_t;
>  
> +int usdt_prids_map_val_extra_bytes;

Ugh!

>  static const dtrace_pattr_t	pattr = {
>  { DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
>  { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
> @@ -403,7 +405,7 @@ clean_usdt_probes(dtrace_hdl_t *dtp)
>  	int			fdprids = dtp->dt_usdt_pridsmap_fd;
>  	int			fdnames = dtp->dt_usdt_namesmap_fd;
>  	usdt_prids_map_key_t	key, nxt;
> -	usdt_prids_map_val_t	val;
> +	usdt_prids_map_val_t	*val = alloca(sizeof(usdt_prids_map_val_t) + usdt_prids_map_val_extra_bytes);

(this bit will need changing slightly if you use [].)

>  
> +void usdt_prids_map_val_extra_bytes_init(dtrace_hdl_t *dtp) {

Wrongly-placed {.

> +	int i, n = 0, w = sizeof(((usdt_prids_map_val_t *)0)->mask[0]);
> +
> +	/* Count how many statements cannot be ignored, regardless of uprp. */
> +	for (i = 0; i < dtp->dt_stmt_nextid; i++) {
> +		dtrace_stmtdesc_t *stp;
> +
> +		stp = dtp->dt_stmts[i];
> +		if (stp == NULL || ignore_clause(dtp, i, NULL))
> +			continue;

I was about to complain about the cost of all these ignore_clause()s,
but of course this is only invoked once per run...

> +		n++;
> +	}
> +
> +	/* Determine how many bytes are needed for this many bits. */
> +	n = (n + CHAR_BIT - 1) / CHAR_BIT;
> +
> +	/* Determine how many words are needed for this many bytes. */
> +	n = (n + w - 1) / w;

Thank God you commented this -- it took me ages to realise how it works.
It feels like it should be a standard idiom, but I've never encountered
it before...

> +	/* Determine how many extra bytes are needed. */
> +	if (n > 1)
> +		usdt_prids_map_val_extra_bytes = (n - 1) * w;
> +	else
> +		usdt_prids_map_val_extra_bytes = 0;

If you use a [] array, this can simply become

usdt_prids_map_val_bytes = n * w;

> +}
> +
>  static int add_probe_uprobe(dtrace_hdl_t *dtp, dt_probe_t *prp)
>  {
>  	dtrace_difo_t   *dp;
> @@ -650,6 +679,7 @@ static int add_probe_usdt(dtrace_hdl_t *dtp, dt_probe_t *prp)
>  	int				fd = dtp->dt_usdt_namesmap_fd;
>  	pid_t				pid;
>  	list_probe_t			*pup;
> +	usdt_prids_map_val_t		*val;
>  
>  	/* Add probe name elements to usdt_names map. */
>  	p = probnam;
> @@ -685,11 +715,11 @@ static int add_probe_usdt(dtrace_hdl_t *dtp, dt_probe_t *prp)
>  	}
>  
>  	/* Add prid and bit mask to usdt_prids map. */
> +	val = alloca(sizeof(usdt_prids_map_val_t) + usdt_prids_map_val_extra_bytes);
>  	for (pup = prp->prv_data; pup != NULL; pup = dt_list_next(pup)) {
>  		dt_probe_t		*uprp = pup->probe;
> -		long long		mask = 0, bit = 1;
> +		uint32_t		iword = 0, mask = 0, bit = 1;
>  		usdt_prids_map_key_t	key;
> -		usdt_prids_map_val_t	val;
>  		dt_uprobe_t		*upp = uprp->prv_data;
>  
>  		/*
> @@ -707,11 +737,15 @@ static int add_probe_usdt(dtrace_hdl_t *dtp, dt_probe_t *prp)
>  				dtrace_stmtdesc_t *stp;
>  
>  				stp = dtp->dt_stmts[n];
> -				if (stp == NULL)
> +				if (stp == NULL || ignore_clause(dtp, n, uprp))
>  					continue;
>  
> -				if (ignore_clause(dtp, n, uprp))
> -					continue;
> +				if (bit == 0) {
> +					val->mask[iword] = mask;
> +					mask = 0;
> +					iword++;
> +					bit = 1;
> +				}
>  
>  				if (dt_gmatch(prp->desc->prv, stp->dtsd_ecbdesc->dted_probe.prv) &&
>  				    dt_gmatch(prp->desc->mod, stp->dtsd_ecbdesc->dted_probe.mod) &&
> @@ -726,11 +760,11 @@ static int add_probe_usdt(dtrace_hdl_t *dtp, dt_probe_t *prp)
>  		key.pid = pid;
>  		key.uprid = uprp->desc->id;
>  
> -		val.prid = prp->desc->id;
> -		val.mask = mask;
> +		val->prid = prp->desc->id;
> +		val->mask[iword] = mask;
>
>  		// FIXME Check return value, but how should errors be handled?
> -		dt_bpf_map_update(dtp->dt_usdt_pridsmap_fd, &key, &val);
> +		dt_bpf_map_update(dtp->dt_usdt_pridsmap_fd, &key, val);
>  	}

... am I missing something, or do you never modify "mask" (the local
variable) anywhere in this function? So it ends up always zero...

> @@ -1451,7 +1485,7 @@ static int trampoline(dt_pcb_t *pcb, uint_t exitlbl)
>  	const list_probe_t	*pop;
>  	uint_t			lbl_exit = pcb->pcb_exitlbl;
>  	dt_ident_t		*usdt_prids = dt_dlib_get_map(dtp, "usdt_prids");
> -	int			n;
> +	int			n, ibit, w = CHAR_BIT * sizeof(((usdt_prids_map_val_t *)0)->mask[0]);

We could really do with better variable names than "n" and "w". w seems
to be the bit-width of a single word: word_width? width?

> @@ -1538,7 +1572,8 @@ static int trampoline(dt_pcb_t *pcb, uint_t exitlbl)
>  	 */
>  	assert(sizeof(usdt_prids_map_key_t) <= DT_STK_SLOT_SZ);
>  	emit(dlp,  BPF_STORE(BPF_W, BPF_REG_FP, DT_TRAMP_SP_SLOT(0), BPF_REG_0));
> -	emit(dlp,  BPF_STORE_IMM(BPF_W, BPF_REG_FP, DT_TRAMP_SP_SLOT(0) + (int)sizeof(pid_t), uprp->desc->id));
> +	emit(dlp,  BPF_STORE_IMM(BPF_W, BPF_REG_FP,
> +		   DT_TRAMP_SP_SLOT(0) + (int)sizeof(pid_t), uprp->desc->id));
>  	dt_cg_xsetx(dlp, usdt_prids, DT_LBL_NONE, BPF_REG_1, usdt_prids->di_id);
>  	emit(dlp,  BPF_MOV_REG(BPF_REG_2, BPF_REG_FP));
>  	emit(dlp,  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, DT_TRAMP_SP_SLOT(0)));
> @@ -1572,8 +1607,8 @@ static int trampoline(dt_pcb_t *pcb, uint_t exitlbl)
>  	emit(dlp,  BPF_LOAD(BPF_W, BPF_REG_1, BPF_REG_0, 0));
>  	emit(dlp,  BPF_STORE(BPF_W, BPF_REG_7, DMST_PRID, BPF_REG_1));
>  
> -	/* Read the bit mask from the table lookup in %r6. */    // FIXME someday, extend this past 64 bits
> -	emit(dlp,  BPF_LOAD(BPF_DW, BPF_REG_6, BPF_REG_0, offsetof(usdt_prids_map_val_t, mask)));
> +	/* Store the value key for reuse. */
> +	emit(dlp,  BPF_STORE(BPF_DW, BPF_REG_FP, DT_TRAMP_SP_SLOT(0), BPF_REG_0));

This all makes sense...

> @@ -1587,21 +1622,24 @@ static int trampoline(dt_pcb_t *pcb, uint_t exitlbl)
>  	/*
>  	 * Hold the bit mask in %r6 between clause calls.
>  	 */
> -	for (n = 0; n < dtp->dt_stmt_nextid; n++) {
> +	for (ibit = n = 0; n < dtp->dt_stmt_nextid; n++) {
>  		dtrace_stmtdesc_t *stp;
>  		dt_ident_t	*idp;
>  		uint_t		lbl_next;
>  
>  		stp = dtp->dt_stmts[n];
> -		if (stp == NULL)
> -			continue;
> -
> -		if (ignore_clause(dtp, n, uprp))
> +		if (stp == NULL || ignore_clause(dtp, n, uprp))
>  			continue;
>  
>  		idp = stp->dtsd_clause;
>  		lbl_next = dt_irlist_label(dlp);

... as does this (although we are now calling ignore_clause() at least
twice for every clause, I think three times)...

> +		/* Load the next word of the bit mask into %r6. */
> +		if (ibit % w == 0) {
> +			emit(dlp,  BPF_LOAD(BPF_DW, BPF_REG_0, BPF_REG_FP, DT_TRAMP_SP_SLOT(0)));
> +			emit(dlp,  BPF_LOAD(BPF_W, BPF_REG_6, BPF_REG_0, offsetof(usdt_prids_map_val_t, mask[ibit / w])));

ok, so this uses the 'w' thing to get an array offset out of a bit
count, and then you can just use the existing repated-shift code. Should
work.

> +		}
> +
>  		/* If the lowest %r6 bit is 0, skip over this clause. */
>  		emit(dlp,  BPF_MOV_REG(BPF_REG_1, BPF_REG_6));
>  		emit(dlp,  BPF_ALU64_IMM(BPF_AND, BPF_REG_1, 1));
> @@ -1629,6 +1667,7 @@ static int trampoline(dt_pcb_t *pcb, uint_t exitlbl)
>  
>  		/* Right-shift %r6. */
>  		emit(dlp,  BPF_ALU64_IMM(BPF_RSH, BPF_REG_6, 1));
> +		ibit++;

... so, this isn't in the loop conditionl because the values are only
filled for non-ignored, non-NULL statements. This assumption is
duplicated in two places (here and in usdt_prids_map_val_extra_bytes())
and if it ever drifts we get uninitialized rubbish or buffer overruns.
Should it be centralized somewhere, somehow? At least it needs a
comment.

> +# We stick 80 probe descriptions in each of 3 scripts to test
> +# USDT's ability to handle hundreds of probe descriptions.

> +++ b/test/unittest/usdt/tst.manyprobedescriptions2.sh

... maybe manyprobedescriptions-bytesize.sh :)

> @@ -0,0 +1,127 @@
> +#!/bin/bash
> +#
> +# Oracle Linux DTrace.
> +# Copyright (c) 2025, Oracle and/or its affiliates. All rights reserved.
> +# Licensed under the Universal Permissive License v 1.0 as shown at
> +# http://oss.oracle.com/licenses/upl.
> +
> +# This test uses many probes and probe descriptions.  Therefore, the
> +# number of BPF programs to load into the kernel -- dt_bpf_load_prog()
> +# calling prp->prov->impl->load_prog(), which is dt_bpf_prog_load() --
> +# and the duration of each load are both increasing.
> +# @@timeout: 400
> +
> +dtrace=$1
> +
> +DIRNAME="$tmpdir/usdt-many_probe_descriptions2.$$.$RANDOM"
> +mkdir -p $DIRNAME
> +cd $DIRNAME
> +
> +# Set the lists.
> +# - The probes will be foo$x$y.
> +# - The probe descriptions will be foo$x* and foo*$y, for each $d.
> +# So if there are nx items in xlist, ny in ylist, and nd in dlist,
> +# - there will be roughly nx*ny probes
> +# - there will be roughly (nx+ny)*nd probe descriptions
> +
> +xlist="a b c d e f g h i j k l m"
> +ylist="n o p q r s t u v w x y z"
> +dlist="0 1 2 3 4 5 6 7 8"

... and with 8... maybe we should also test 0, 1, 7 and 9, checking all
the boundary conditions (and refactor things so we don't need to copy
all of this every time).

(You'll only really see things go wrong under valgrind, but that's fine,
valgrinding should be routine before releases anyway.)

-- 
NULL && (void)