[DTrace-devel] [PATCH v2 2/2] Extend the USDT bit mask to multiple words

Wed Jul 23 00:47:36 UTC 2025

On 7/22/25 08:34, Nick Alcock wrote:

> On 25 Jun 2025, eugene loh outgrape:
>> From: Eugene Loh <eugene.loh at oracle.com>
>>
>> Currently, USDT is limited to 64 probe descriptions since the
>> underlying probe uses a 64-bit mask to decide which probes to execute.
>>
>> Change to a multi-word bit mask that can be extended to however many
>> probe descriptions there are.
> Oh yeah!!!!
>
> Not quite happy enough with this one to r-b it, but it's sooo close...
> :)
>
>> Also, change the mask words to be 32-bit rather than 64-bit.  The reason
>> is that, commonly, there will be fewer than 32 probe descriptions.  In
>> this case, we shorten the value of the "USDT prids" BPF map from 16 bytes
>>          uint32_t        prid;
>>          long long       mask[1];
>> down to 8 bytes
>>          uint32_t        prid;
>>          uint32_t        mask[1];
>> (The second member is smaller and no longer costs extra padding.)
> Nice!
>
>> We also add an
>>          extern int usdt_prids_map_val_extra_bytes;
>> to denote how many extra bytes will be needed for the extended mask.
> Ugh. Is this really the best we can do? A global variable referenced via
> extern? Can't we put it in the dtp or anything? This will fail as soon
> as we have more than one dtrace handle... and yes I now that's not
> exactly tested and probably doesn't work, but we shouldn't break it
> worse when keeping it working as well as it is now is so simple.

Okay, changed.

>> @@ -48,8 +48,9 @@ typedef struct usdt_prids_map_key {
>>   } usdt_prids_map_key_t;
>>   typedef struct usdt_prids_map_val {
>>   	uint32_t	prid;		/* should be dtrace_id_t, sys/dtrace_types.h */
>> -	long long	mask;
>> +	uint32_t	mask[1];
>>   } usdt_prids_map_val_t;
>> +extern int usdt_prids_map_val_extra_bytes;
> Note that this is not how you're supposed to implement VLAs: [1] is
> extremely obsolete (as in, pre-C99) and very deprecated at this point.
> [] is the recommended approach, if you can live with its restrictions
> (and I think we can here: all we need to do is remove the extra in
> extra_bytes and *always* allocate at least one byte, which actually
> simplifies the code in usdt_prids_map_val_extra_bytes(). [] does include
> the size of necessary padding: mask will start at precisely
> sizeof(usdt_prids_map_val_t).)

Also changed.

>> diff --git a/libdtrace/dt_prov_uprobe.c b/libdtrace/dt_prov_uprobe.c
>> index 5ba50b678..65413cba6 100644
>> --- a/libdtrace/dt_prov_uprobe.c
>> +++ b/libdtrace/dt_prov_uprobe.c
>> @@ -303,6 +303,8 @@ typedef struct list_key {
>>   	usdt_prids_map_key_t	key;
>>   } list_key_t;
>>   
>> +int usdt_prids_map_val_extra_bytes;
> Ugh!

Ditto.

>>   static const dtrace_pattr_t	pattr = {
>>   { DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
>>   { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
>> @@ -403,7 +405,7 @@ clean_usdt_probes(dtrace_hdl_t *dtp)
>>   	int			fdprids = dtp->dt_usdt_pridsmap_fd;
>>   	int			fdnames = dtp->dt_usdt_namesmap_fd;
>>   	usdt_prids_map_key_t	key, nxt;
>> -	usdt_prids_map_val_t	val;
>> +	usdt_prids_map_val_t	*val = alloca(sizeof(usdt_prids_map_val_t) + usdt_prids_map_val_extra_bytes);
> (this bit will need changing slightly if you use [].)
>
>>   
>> +void usdt_prids_map_val_extra_bytes_init(dtrace_hdl_t *dtp) {
> Wrongly-placed {.

Got it.

>> +	int i, n = 0, w = sizeof(((usdt_prids_map_val_t *)0)->mask[0]);
>> +
>> +	/* Count how many statements cannot be ignored, regardless of uprp. */
>> +	for (i = 0; i < dtp->dt_stmt_nextid; i++) {
>> +		dtrace_stmtdesc_t *stp;
>> +
>> +		stp = dtp->dt_stmts[i];
>> +		if (stp == NULL || ignore_clause(dtp, i, NULL))
>> +			continue;
> I was about to complain about the cost of all these ignore_clause()s,
> but of course this is only invoked once per run...
>
>> +		n++;
>> +	}
>> +
>> +	/* Determine how many bytes are needed for this many bits. */
>> +	n = (n + CHAR_BIT - 1) / CHAR_BIT;
>> +
>> +	/* Determine how many words are needed for this many bytes. */
>> +	n = (n + w - 1) / w;
> Thank God you commented this -- it took me ages to realise how it works.
> It feels like it should be a standard idiom, but I've never encountered
> it before...
>
>> +	/* Determine how many extra bytes are needed. */
>> +	if (n > 1)
>> +		usdt_prids_map_val_extra_bytes = (n - 1) * w;
>> +	else
>> +		usdt_prids_map_val_extra_bytes = 0;
> If you use a [] array, this can simply become
>
> usdt_prids_map_val_bytes = n * w;

Earlier you wrote, "*always* allocate at least one byte, which actually 
simplifies the code in usdt_prids_map_val_extra_bytes()." So I'm missing 
what you mean in the case n=0.

>> +}
>> +
>>   static int add_probe_uprobe(dtrace_hdl_t *dtp, dt_probe_t *prp)
>>   {
>>   	dtrace_difo_t   *dp;
>> @@ -650,6 +679,7 @@ static int add_probe_usdt(dtrace_hdl_t *dtp, dt_probe_t *prp)
>>   	int				fd = dtp->dt_usdt_namesmap_fd;
>>   	pid_t				pid;
>>   	list_probe_t			*pup;
>> +	usdt_prids_map_val_t		*val;
>>   
>>   	/* Add probe name elements to usdt_names map. */
>>   	p = probnam;
>> @@ -685,11 +715,11 @@ static int add_probe_usdt(dtrace_hdl_t *dtp, dt_probe_t *prp)
>>   	}
>>   
>>   	/* Add prid and bit mask to usdt_prids map. */
>> +	val = alloca(sizeof(usdt_prids_map_val_t) + usdt_prids_map_val_extra_bytes);
>>   	for (pup = prp->prv_data; pup != NULL; pup = dt_list_next(pup)) {
>>   		dt_probe_t		*uprp = pup->probe;
>> -		long long		mask = 0, bit = 1;
>> +		uint32_t		iword = 0, mask = 0, bit = 1;
>>   		usdt_prids_map_key_t	key;
>> -		usdt_prids_map_val_t	val;
>>   		dt_uprobe_t		*upp = uprp->prv_data;
>>   
>>   		/*
>> @@ -707,11 +737,15 @@ static int add_probe_usdt(dtrace_hdl_t *dtp, dt_probe_t *prp)
>>   				dtrace_stmtdesc_t *stp;
>>   
>>   				stp = dtp->dt_stmts[n];
>> -				if (stp == NULL)
>> +				if (stp == NULL || ignore_clause(dtp, n, uprp))
>>   					continue;
>>   
>> -				if (ignore_clause(dtp, n, uprp))
>> -					continue;
>> +				if (bit == 0) {
>> +					val->mask[iword] = mask;
>> +					mask = 0;
>> +					iword++;
>> +					bit = 1;
>> +				}
>>   
>>   				if (dt_gmatch(prp->desc->prv, stp->dtsd_ecbdesc->dted_probe.prv) &&
>>   				    dt_gmatch(prp->desc->mod, stp->dtsd_ecbdesc->dted_probe.mod) &&
>> @@ -726,11 +760,11 @@ static int add_probe_usdt(dtrace_hdl_t *dtp, dt_probe_t *prp)
>>   		key.pid = pid;
>>   		key.uprid = uprp->desc->id;
>>   
>> -		val.prid = prp->desc->id;
>> -		val.mask = mask;
>> +		val->prid = prp->desc->id;
>> +		val->mask[iword] = mask;
>>
>>   		// FIXME Check return value, but how should errors be handled?
>> -		dt_bpf_map_update(dtp->dt_usdt_pridsmap_fd, &key, &val);
>> +		dt_bpf_map_update(dtp->dt_usdt_pridsmap_fd, &key, val);
>>   	}
> ... am I missing something, or do you never modify "mask" (the local
> variable) anywhere in this function? So it ends up always zero...

Perhaps you're missing something?  mask |= bit

>> @@ -1451,7 +1485,7 @@ static int trampoline(dt_pcb_t *pcb, uint_t exitlbl)
>>   	const list_probe_t	*pop;
>>   	uint_t			lbl_exit = pcb->pcb_exitlbl;
>>   	dt_ident_t		*usdt_prids = dt_dlib_get_map(dtp, "usdt_prids");
>> -	int			n;
>> +	int			n, ibit, w = CHAR_BIT * sizeof(((usdt_prids_map_val_t *)0)->mask[0]);
> We could really do with better variable names than "n" and "w". w seems
> to be the bit-width of a single word: word_width? width?

Hopefully, this is good enough:  "w" for "word width" or "width"? Mimic 
other similar code closely.  Seemed like a reasonable choice to me.

>> @@ -1538,7 +1572,8 @@ static int trampoline(dt_pcb_t *pcb, uint_t exitlbl)
>>   	 */
>>   	assert(sizeof(usdt_prids_map_key_t) <= DT_STK_SLOT_SZ);
>>   	emit(dlp,  BPF_STORE(BPF_W, BPF_REG_FP, DT_TRAMP_SP_SLOT(0), BPF_REG_0));
>> -	emit(dlp,  BPF_STORE_IMM(BPF_W, BPF_REG_FP, DT_TRAMP_SP_SLOT(0) + (int)sizeof(pid_t), uprp->desc->id));
>> +	emit(dlp,  BPF_STORE_IMM(BPF_W, BPF_REG_FP,
>> +		   DT_TRAMP_SP_SLOT(0) + (int)sizeof(pid_t), uprp->desc->id));
>>   	dt_cg_xsetx(dlp, usdt_prids, DT_LBL_NONE, BPF_REG_1, usdt_prids->di_id);
>>   	emit(dlp,  BPF_MOV_REG(BPF_REG_2, BPF_REG_FP));
>>   	emit(dlp,  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, DT_TRAMP_SP_SLOT(0)));
>> @@ -1572,8 +1607,8 @@ static int trampoline(dt_pcb_t *pcb, uint_t exitlbl)
>>   	emit(dlp,  BPF_LOAD(BPF_W, BPF_REG_1, BPF_REG_0, 0));
>>   	emit(dlp,  BPF_STORE(BPF_W, BPF_REG_7, DMST_PRID, BPF_REG_1));
>>   
>> -	/* Read the bit mask from the table lookup in %r6. */    // FIXME someday, extend this past 64 bits
>> -	emit(dlp,  BPF_LOAD(BPF_DW, BPF_REG_6, BPF_REG_0, offsetof(usdt_prids_map_val_t, mask)));
>> +	/* Store the value key for reuse. */
>> +	emit(dlp,  BPF_STORE(BPF_DW, BPF_REG_FP, DT_TRAMP_SP_SLOT(0), BPF_REG_0));
> This all makes sense...
>
>> @@ -1587,21 +1622,24 @@ static int trampoline(dt_pcb_t *pcb, uint_t exitlbl)
>>   	/*
>>   	 * Hold the bit mask in %r6 between clause calls.
>>   	 */
>> -	for (n = 0; n < dtp->dt_stmt_nextid; n++) {
>> +	for (ibit = n = 0; n < dtp->dt_stmt_nextid; n++) {
>>   		dtrace_stmtdesc_t *stp;
>>   		dt_ident_t	*idp;
>>   		uint_t		lbl_next;
>>   
>>   		stp = dtp->dt_stmts[n];
>> -		if (stp == NULL)
>> -			continue;
>> -
>> -		if (ignore_clause(dtp, n, uprp))
>> +		if (stp == NULL || ignore_clause(dtp, n, uprp))
>>   			continue;
>>   
>>   		idp = stp->dtsd_clause;
>>   		lbl_next = dt_irlist_label(dlp);
> ... as does this (although we are now calling ignore_clause() at least
> twice for every clause, I think three times)...
>
>> +		/* Load the next word of the bit mask into %r6. */
>> +		if (ibit % w == 0) {
>> +			emit(dlp,  BPF_LOAD(BPF_DW, BPF_REG_0, BPF_REG_FP, DT_TRAMP_SP_SLOT(0)));
>> +			emit(dlp,  BPF_LOAD(BPF_W, BPF_REG_6, BPF_REG_0, offsetof(usdt_prids_map_val_t, mask[ibit / w])));
> ok, so this uses the 'w' thing to get an array offset out of a bit
> count, and then you can just use the existing repated-shift code. Should
> work.
>
>> +		}
>> +
>>   		/* If the lowest %r6 bit is 0, skip over this clause. */
>>   		emit(dlp,  BPF_MOV_REG(BPF_REG_1, BPF_REG_6));
>>   		emit(dlp,  BPF_ALU64_IMM(BPF_AND, BPF_REG_1, 1));
>> @@ -1629,6 +1667,7 @@ static int trampoline(dt_pcb_t *pcb, uint_t exitlbl)
>>   
>>   		/* Right-shift %r6. */
>>   		emit(dlp,  BPF_ALU64_IMM(BPF_RSH, BPF_REG_6, 1));
>> +		ibit++;
> ... so, this isn't in the loop conditionl because the values are only
> filled for non-ignored, non-NULL statements. This assumption is
> duplicated in two places (here and in usdt_prids_map_val_extra_bytes())
> and if it ever drifts we get uninitialized rubbish or buffer overruns.
> Should it be centralized somewhere, somehow? At least it needs a
> comment.

The loop bodies differ.  Okay, though, I added comments.

>> +# We stick 80 probe descriptions in each of 3 scripts to test
>> +# USDT's ability to handle hundreds of probe descriptions.
>> +++ b/test/unittest/usdt/tst.manyprobedescriptions2.sh
> ... maybe manyprobedescriptions-bytesize.sh :)

What is "byte size"?  Anyhow, the point is that there are many probe 
descriptions.

>> @@ -0,0 +1,127 @@
>> +#!/bin/bash
>> +#
>> +# Oracle Linux DTrace.
>> +# Copyright (c) 2025, Oracle and/or its affiliates. All rights reserved.
>> +# Licensed under the Universal Permissive License v 1.0 as shown at
>> +# http://oss.oracle.com/licenses/upl.
>> +
>> +# This test uses many probes and probe descriptions.  Therefore, the
>> +# number of BPF programs to load into the kernel -- dt_bpf_load_prog()
>> +# calling prp->prov->impl->load_prog(), which is dt_bpf_prog_load() --
>> +# and the duration of each load are both increasing.
>> +# @@timeout: 400
>> +
>> +dtrace=$1
>> +
>> +DIRNAME="$tmpdir/usdt-many_probe_descriptions2.$$.$RANDOM"
>> +mkdir -p $DIRNAME
>> +cd $DIRNAME
>> +
>> +# Set the lists.
>> +# - The probes will be foo$x$y.
>> +# - The probe descriptions will be foo$x* and foo*$y, for each $d.
>> +# So if there are nx items in xlist, ny in ylist, and nd in dlist,
>> +# - there will be roughly nx*ny probes
>> +# - there will be roughly (nx+ny)*nd probe descriptions
>> +
>> +xlist="a b c d e f g h i j k l m"
>> +ylist="n o p q r s t u v w x y z"
>> +dlist="0 1 2 3 4 5 6 7 8"
> ... and with 8... maybe we should also test 0, 1, 7 and 9, checking all
> the boundary conditions (and refactor things so we don't need to copy
> all of this every time).

But this is not a case of filling some byte up to 8 bits.  The test has 
nested loops over these lists, appending more probes or probe 
descriptions, as appropriate.  There are 13x13 probes, one after 
another.  There are (13+13)*9 probe descriptions, one after another.  
There are 13x13x9 probe firings, one after another.  It just so happens 
that none of those numbers is a multiple of 8;  the only point is that 
each of those numbers is "big."

Thanks for the review.  Next version momentarily.

> (You'll only really see things go wrong under valgrind, but that's fine,
> valgrinding should be routine before releases anyway.)