[DTrace-devel] [PATCH v2 2/2] Extend the USDT bit mask to multiple words
Eugene Loh
eugene.loh at oracle.com
Wed Jul 23 00:47:36 UTC 2025
On 7/22/25 08:34, Nick Alcock wrote:
> On 25 Jun 2025, eugene loh outgrape:
>> From: Eugene Loh <eugene.loh at oracle.com>
>>
>> Currently, USDT is limited to 64 probe descriptions since the
>> underlying probe uses a 64-bit mask to decide which probes to execute.
>>
>> Change to a multi-word bit mask that can be extended to however many
>> probe descriptions there are.
> Oh yeah!!!!
>
> Not quite happy enough with this one to r-b it, but it's sooo close...
> :)
>
>> Also, change the mask words to be 32-bit rather than 64-bit. The reason
>> is that, commonly, there will be fewer than 32 probe descriptions. In
>> this case, we shorten the value of the "USDT prids" BPF map from 16 bytes
>> uint32_t prid;
>> long long mask[1];
>> down to 8 bytes
>> uint32_t prid;
>> uint32_t mask[1];
>> (The second member is smaller and no longer costs extra padding.)
> Nice!
>
>> We also add an
>> extern int usdt_prids_map_val_extra_bytes;
>> to denote how many extra bytes will be needed for the extended mask.
> Ugh. Is this really the best we can do? A global variable referenced via
> extern? Can't we put it in the dtp or anything? This will fail as soon
> as we have more than one dtrace handle... and yes I now that's not
> exactly tested and probably doesn't work, but we shouldn't break it
> worse when keeping it working as well as it is now is so simple.
Okay, changed.
>> @@ -48,8 +48,9 @@ typedef struct usdt_prids_map_key {
>> } usdt_prids_map_key_t;
>> typedef struct usdt_prids_map_val {
>> uint32_t prid; /* should be dtrace_id_t, sys/dtrace_types.h */
>> - long long mask;
>> + uint32_t mask[1];
>> } usdt_prids_map_val_t;
>> +extern int usdt_prids_map_val_extra_bytes;
> Note that this is not how you're supposed to implement VLAs: [1] is
> extremely obsolete (as in, pre-C99) and very deprecated at this point.
> [] is the recommended approach, if you can live with its restrictions
> (and I think we can here: all we need to do is remove the extra in
> extra_bytes and *always* allocate at least one byte, which actually
> simplifies the code in usdt_prids_map_val_extra_bytes(). [] does include
> the size of necessary padding: mask will start at precisely
> sizeof(usdt_prids_map_val_t).)
Also changed.
>> diff --git a/libdtrace/dt_prov_uprobe.c b/libdtrace/dt_prov_uprobe.c
>> index 5ba50b678..65413cba6 100644
>> --- a/libdtrace/dt_prov_uprobe.c
>> +++ b/libdtrace/dt_prov_uprobe.c
>> @@ -303,6 +303,8 @@ typedef struct list_key {
>> usdt_prids_map_key_t key;
>> } list_key_t;
>>
>> +int usdt_prids_map_val_extra_bytes;
> Ugh!
Ditto.
>> static const dtrace_pattr_t pattr = {
>> { DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
>> { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
>> @@ -403,7 +405,7 @@ clean_usdt_probes(dtrace_hdl_t *dtp)
>> int fdprids = dtp->dt_usdt_pridsmap_fd;
>> int fdnames = dtp->dt_usdt_namesmap_fd;
>> usdt_prids_map_key_t key, nxt;
>> - usdt_prids_map_val_t val;
>> + usdt_prids_map_val_t *val = alloca(sizeof(usdt_prids_map_val_t) + usdt_prids_map_val_extra_bytes);
> (this bit will need changing slightly if you use [].)
>
>>
>> +void usdt_prids_map_val_extra_bytes_init(dtrace_hdl_t *dtp) {
> Wrongly-placed {.
Got it.
>> + int i, n = 0, w = sizeof(((usdt_prids_map_val_t *)0)->mask[0]);
>> +
>> + /* Count how many statements cannot be ignored, regardless of uprp. */
>> + for (i = 0; i < dtp->dt_stmt_nextid; i++) {
>> + dtrace_stmtdesc_t *stp;
>> +
>> + stp = dtp->dt_stmts[i];
>> + if (stp == NULL || ignore_clause(dtp, i, NULL))
>> + continue;
> I was about to complain about the cost of all these ignore_clause()s,
> but of course this is only invoked once per run...
>
>> + n++;
>> + }
>> +
>> + /* Determine how many bytes are needed for this many bits. */
>> + n = (n + CHAR_BIT - 1) / CHAR_BIT;
>> +
>> + /* Determine how many words are needed for this many bytes. */
>> + n = (n + w - 1) / w;
> Thank God you commented this -- it took me ages to realise how it works.
> It feels like it should be a standard idiom, but I've never encountered
> it before...
>
>> + /* Determine how many extra bytes are needed. */
>> + if (n > 1)
>> + usdt_prids_map_val_extra_bytes = (n - 1) * w;
>> + else
>> + usdt_prids_map_val_extra_bytes = 0;
> If you use a [] array, this can simply become
>
> usdt_prids_map_val_bytes = n * w;
Earlier you wrote, "*always* allocate at least one byte, which actually
simplifies the code in usdt_prids_map_val_extra_bytes()." So I'm missing
what you mean in the case n=0.
>> +}
>> +
>> static int add_probe_uprobe(dtrace_hdl_t *dtp, dt_probe_t *prp)
>> {
>> dtrace_difo_t *dp;
>> @@ -650,6 +679,7 @@ static int add_probe_usdt(dtrace_hdl_t *dtp, dt_probe_t *prp)
>> int fd = dtp->dt_usdt_namesmap_fd;
>> pid_t pid;
>> list_probe_t *pup;
>> + usdt_prids_map_val_t *val;
>>
>> /* Add probe name elements to usdt_names map. */
>> p = probnam;
>> @@ -685,11 +715,11 @@ static int add_probe_usdt(dtrace_hdl_t *dtp, dt_probe_t *prp)
>> }
>>
>> /* Add prid and bit mask to usdt_prids map. */
>> + val = alloca(sizeof(usdt_prids_map_val_t) + usdt_prids_map_val_extra_bytes);
>> for (pup = prp->prv_data; pup != NULL; pup = dt_list_next(pup)) {
>> dt_probe_t *uprp = pup->probe;
>> - long long mask = 0, bit = 1;
>> + uint32_t iword = 0, mask = 0, bit = 1;
>> usdt_prids_map_key_t key;
>> - usdt_prids_map_val_t val;
>> dt_uprobe_t *upp = uprp->prv_data;
>>
>> /*
>> @@ -707,11 +737,15 @@ static int add_probe_usdt(dtrace_hdl_t *dtp, dt_probe_t *prp)
>> dtrace_stmtdesc_t *stp;
>>
>> stp = dtp->dt_stmts[n];
>> - if (stp == NULL)
>> + if (stp == NULL || ignore_clause(dtp, n, uprp))
>> continue;
>>
>> - if (ignore_clause(dtp, n, uprp))
>> - continue;
>> + if (bit == 0) {
>> + val->mask[iword] = mask;
>> + mask = 0;
>> + iword++;
>> + bit = 1;
>> + }
>>
>> if (dt_gmatch(prp->desc->prv, stp->dtsd_ecbdesc->dted_probe.prv) &&
>> dt_gmatch(prp->desc->mod, stp->dtsd_ecbdesc->dted_probe.mod) &&
>> @@ -726,11 +760,11 @@ static int add_probe_usdt(dtrace_hdl_t *dtp, dt_probe_t *prp)
>> key.pid = pid;
>> key.uprid = uprp->desc->id;
>>
>> - val.prid = prp->desc->id;
>> - val.mask = mask;
>> + val->prid = prp->desc->id;
>> + val->mask[iword] = mask;
>>
>> // FIXME Check return value, but how should errors be handled?
>> - dt_bpf_map_update(dtp->dt_usdt_pridsmap_fd, &key, &val);
>> + dt_bpf_map_update(dtp->dt_usdt_pridsmap_fd, &key, val);
>> }
> ... am I missing something, or do you never modify "mask" (the local
> variable) anywhere in this function? So it ends up always zero...
Perhaps you're missing something? mask |= bit
>> @@ -1451,7 +1485,7 @@ static int trampoline(dt_pcb_t *pcb, uint_t exitlbl)
>> const list_probe_t *pop;
>> uint_t lbl_exit = pcb->pcb_exitlbl;
>> dt_ident_t *usdt_prids = dt_dlib_get_map(dtp, "usdt_prids");
>> - int n;
>> + int n, ibit, w = CHAR_BIT * sizeof(((usdt_prids_map_val_t *)0)->mask[0]);
> We could really do with better variable names than "n" and "w". w seems
> to be the bit-width of a single word: word_width? width?
Hopefully, this is good enough: "w" for "word width" or "width"? Mimic
other similar code closely. Seemed like a reasonable choice to me.
>> @@ -1538,7 +1572,8 @@ static int trampoline(dt_pcb_t *pcb, uint_t exitlbl)
>> */
>> assert(sizeof(usdt_prids_map_key_t) <= DT_STK_SLOT_SZ);
>> emit(dlp, BPF_STORE(BPF_W, BPF_REG_FP, DT_TRAMP_SP_SLOT(0), BPF_REG_0));
>> - emit(dlp, BPF_STORE_IMM(BPF_W, BPF_REG_FP, DT_TRAMP_SP_SLOT(0) + (int)sizeof(pid_t), uprp->desc->id));
>> + emit(dlp, BPF_STORE_IMM(BPF_W, BPF_REG_FP,
>> + DT_TRAMP_SP_SLOT(0) + (int)sizeof(pid_t), uprp->desc->id));
>> dt_cg_xsetx(dlp, usdt_prids, DT_LBL_NONE, BPF_REG_1, usdt_prids->di_id);
>> emit(dlp, BPF_MOV_REG(BPF_REG_2, BPF_REG_FP));
>> emit(dlp, BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, DT_TRAMP_SP_SLOT(0)));
>> @@ -1572,8 +1607,8 @@ static int trampoline(dt_pcb_t *pcb, uint_t exitlbl)
>> emit(dlp, BPF_LOAD(BPF_W, BPF_REG_1, BPF_REG_0, 0));
>> emit(dlp, BPF_STORE(BPF_W, BPF_REG_7, DMST_PRID, BPF_REG_1));
>>
>> - /* Read the bit mask from the table lookup in %r6. */ // FIXME someday, extend this past 64 bits
>> - emit(dlp, BPF_LOAD(BPF_DW, BPF_REG_6, BPF_REG_0, offsetof(usdt_prids_map_val_t, mask)));
>> + /* Store the value key for reuse. */
>> + emit(dlp, BPF_STORE(BPF_DW, BPF_REG_FP, DT_TRAMP_SP_SLOT(0), BPF_REG_0));
> This all makes sense...
>
>> @@ -1587,21 +1622,24 @@ static int trampoline(dt_pcb_t *pcb, uint_t exitlbl)
>> /*
>> * Hold the bit mask in %r6 between clause calls.
>> */
>> - for (n = 0; n < dtp->dt_stmt_nextid; n++) {
>> + for (ibit = n = 0; n < dtp->dt_stmt_nextid; n++) {
>> dtrace_stmtdesc_t *stp;
>> dt_ident_t *idp;
>> uint_t lbl_next;
>>
>> stp = dtp->dt_stmts[n];
>> - if (stp == NULL)
>> - continue;
>> -
>> - if (ignore_clause(dtp, n, uprp))
>> + if (stp == NULL || ignore_clause(dtp, n, uprp))
>> continue;
>>
>> idp = stp->dtsd_clause;
>> lbl_next = dt_irlist_label(dlp);
> ... as does this (although we are now calling ignore_clause() at least
> twice for every clause, I think three times)...
>
>> + /* Load the next word of the bit mask into %r6. */
>> + if (ibit % w == 0) {
>> + emit(dlp, BPF_LOAD(BPF_DW, BPF_REG_0, BPF_REG_FP, DT_TRAMP_SP_SLOT(0)));
>> + emit(dlp, BPF_LOAD(BPF_W, BPF_REG_6, BPF_REG_0, offsetof(usdt_prids_map_val_t, mask[ibit / w])));
> ok, so this uses the 'w' thing to get an array offset out of a bit
> count, and then you can just use the existing repated-shift code. Should
> work.
>
>> + }
>> +
>> /* If the lowest %r6 bit is 0, skip over this clause. */
>> emit(dlp, BPF_MOV_REG(BPF_REG_1, BPF_REG_6));
>> emit(dlp, BPF_ALU64_IMM(BPF_AND, BPF_REG_1, 1));
>> @@ -1629,6 +1667,7 @@ static int trampoline(dt_pcb_t *pcb, uint_t exitlbl)
>>
>> /* Right-shift %r6. */
>> emit(dlp, BPF_ALU64_IMM(BPF_RSH, BPF_REG_6, 1));
>> + ibit++;
> ... so, this isn't in the loop conditionl because the values are only
> filled for non-ignored, non-NULL statements. This assumption is
> duplicated in two places (here and in usdt_prids_map_val_extra_bytes())
> and if it ever drifts we get uninitialized rubbish or buffer overruns.
> Should it be centralized somewhere, somehow? At least it needs a
> comment.
The loop bodies differ. Okay, though, I added comments.
>> +# We stick 80 probe descriptions in each of 3 scripts to test
>> +# USDT's ability to handle hundreds of probe descriptions.
>> +++ b/test/unittest/usdt/tst.manyprobedescriptions2.sh
> ... maybe manyprobedescriptions-bytesize.sh :)
What is "byte size"? Anyhow, the point is that there are many probe
descriptions.
>> @@ -0,0 +1,127 @@
>> +#!/bin/bash
>> +#
>> +# Oracle Linux DTrace.
>> +# Copyright (c) 2025, Oracle and/or its affiliates. All rights reserved.
>> +# Licensed under the Universal Permissive License v 1.0 as shown at
>> +# http://oss.oracle.com/licenses/upl.
>> +
>> +# This test uses many probes and probe descriptions. Therefore, the
>> +# number of BPF programs to load into the kernel -- dt_bpf_load_prog()
>> +# calling prp->prov->impl->load_prog(), which is dt_bpf_prog_load() --
>> +# and the duration of each load are both increasing.
>> +# @@timeout: 400
>> +
>> +dtrace=$1
>> +
>> +DIRNAME="$tmpdir/usdt-many_probe_descriptions2.$$.$RANDOM"
>> +mkdir -p $DIRNAME
>> +cd $DIRNAME
>> +
>> +# Set the lists.
>> +# - The probes will be foo$x$y.
>> +# - The probe descriptions will be foo$x* and foo*$y, for each $d.
>> +# So if there are nx items in xlist, ny in ylist, and nd in dlist,
>> +# - there will be roughly nx*ny probes
>> +# - there will be roughly (nx+ny)*nd probe descriptions
>> +
>> +xlist="a b c d e f g h i j k l m"
>> +ylist="n o p q r s t u v w x y z"
>> +dlist="0 1 2 3 4 5 6 7 8"
> ... and with 8... maybe we should also test 0, 1, 7 and 9, checking all
> the boundary conditions (and refactor things so we don't need to copy
> all of this every time).
But this is not a case of filling some byte up to 8 bits. The test has
nested loops over these lists, appending more probes or probe
descriptions, as appropriate. There are 13x13 probes, one after
another. There are (13+13)*9 probe descriptions, one after another.
There are 13x13x9 probe firings, one after another. It just so happens
that none of those numbers is a multiple of 8; the only point is that
each of those numbers is "big."
Thanks for the review. Next version momentarily.
> (You'll only really see things go wrong under valgrind, but that's fine,
> valgrinding should be routine before releases anyway.)
More information about the DTrace-devel
mailing list