[DTrace-devel] [PATCH 01/14] Fix stack-skip counts for caller and stackdepth
Eugene Loh
eugene.loh at oracle.com
Mon Jun 16 19:21:09 UTC 2025
On 6/13/25 10:33, Nick Alcock wrote:
> On 22 May 2025, eugene loh told this:
>
>> From: Eugene Loh <eugene.loh at oracle.com>
>>
>> Apparently, when we call the BPF get_stack() helper function,
>> it generally knows how many frames to skip to get the real kernel
>> stack. For fentry/fexit, however, this is apparently not the case,
>> and commit bc65cb44d
>> ("cg: allow providers to specify a skip count for stack retrieval")
>> added the ability to skip frames for fentry/fexit probes.
>>
>> When this "skip" is needed, however, it must be even deeper
>> when we descend into further frames, such as when we call dt_bpf_*()
>> precompiled functions.
> Ack.
>
>> Add this support for dt_bpf_caller() and dt_bpf_stackdepth().
>> That is, if there are stack-skip frames, skip yet one more frame
>> when inside a bpf/get_bvar.c function.
> The general approach here seems sound, but this confuses me. Is the
> assumption that if the skip is not needed, get_stack() will itself know
> how many frames to skip, and will skip dt_bvar_caller() for you?
Yes. I confirmed this with tests.
> (So a zero skip is actually a "the system knows", and a nonzero skip
> causes the system to not skip anything, so you have to do all the
> skipping yourself?
Yes.
> This behaviour is not documented in bpf_get_stack()'s documentation,
> which absolutely does not mean it doesn't do it...)
Yup: I did not learn this from documentation!
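For the record, my mental model of the two regimes, inferred from the tests
rather than from any documentation (so the details are my own reading, not a
documented contract), boils down to this sketch, which is exactly what the
dt_bvar_caller() change below implements:

/*
 * Inferred behavior, not documented:
 *
 *   skip == 0:  bpf_get_stack() hides the dtrace-internal frames itself,
 *               and the probe's caller ends up in buf[1].
 *
 *   skip != 0:  nothing is hidden for us; the provider's skip count covers
 *               the frames down to the probe context, but the extra
 *               dt_bvar_caller() frame we descended into is still present,
 *               so the caller lands one slot deeper, in buf[2].
 */
return skip ? buf[2] : buf[1];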
>> Note that we declare the skip count volatile. The compiler might
>> optimize code that uses the STACK_SKIP value, but we will subsequently
>> perform relocations that adjust this value.
> ... why doesn't this apply to every other extern global variable in
> get_bvar()? They're all similarly relocated...
Right. There is potentially a broader problem. But we simply do not
have evidence of misbehavior in other cases. Ruggedizing other cases
could be the subject of a different patch.
The problem in this case is that the compiler seems to assume
&symbol != 0, which is reasonable except that we violate that assumption
with our relocation tricks.
Consider the C code:
uint64_t dt_bvar_stackdepth(const dt_dctx_t *dctx)
{
        uint32_t bufsiz = (uint32_t) (uint64_t) (&STKSIZ);
        char *buf = dctx->mem + (uint64_t)(&STACK_OFF);
        uint64_t retv;
        volatile uint64_t skip = (uint64_t)(&STACK_SKIP);

        retv = bpf_get_stack(dctx->ctx,
                             buf,
                             bufsiz,
                             skip & BPF_F_SKIP_FIELD_MASK);

        if (skip)
                return retv / sizeof(uint64_t) - 1;    // branch "A"
        return retv / sizeof(uint64_t);                // branch "B"
}
If you omit "volatile", the compiler assumes &STACK_SKIP!=0. The emitted
code has:
*) no run-time "if (skip)" check
*) no code for branch "B"
*) only code for branch "A"
If you include "volatile", however, the compiler caches &STACK_SKIP on
the BPF stack and later performs a run-time check on its value to
correctly execute either branch "A" or branch "B".
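The same effect shows up in a plain, standalone C sketch (illustrative only;
SOME_EXTERN_SYM is a made-up stand-in for STACK_SKIP, not a real DTrace
symbol):

extern char SOME_EXTERN_SYM;

unsigned long depth_without_volatile(unsigned long nframes)
{
        unsigned long skip = (unsigned long)&SOME_EXTERN_SYM;

        /* The compiler may assume &SOME_EXTERN_SYM != 0, fold the test,
           and emit only branch "A". */
        if (skip)
                return nframes - 1;
        return nframes;
}

unsigned long depth_with_volatile(unsigned long nframes)
{
        volatile unsigned long skip = (unsigned long)&SOME_EXTERN_SYM;

        /* The value is re-read at run time, so both branches survive and
           the (relocated) value decides which one executes. */
        if (skip)
                return nframes - 1;
        return nframes;
}

Compiling that at -O2 and comparing the two functions' assembly should show
the same dropped branch "B" described above.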
Incidentally, I looked at this by exercising the function with a D script like this:
profile:::tick-1s,
fbt:vmlinux:ksys_write:entry
{
        trace(stackdepth);
}
Notice that the profile probe has STACK_SKIP==0 while the fbt probe has
STACK_SKIP==4.
I looked at the disassembly in three ways:
*) as emitted by the compiler: objdump -d build/bpf--get_bvar.o
(not too interesting but establishes a baseline; e.g., that a
branch is missing)
*) after relocation for tick-1s (-xdisasm=8 -S)
*) after relocation for fbt::ksys_write:entry (-xdisasm=8 -S)
The latter two confirm the relocations are correct.
>> Signed-off-by: Eugene Loh <eugene.loh at oracle.com>
>> ---
>> bpf/get_bvar.c | 17 ++++++++++++++---
>> 1 file changed, 14 insertions(+), 3 deletions(-)
>>
>> diff --git a/bpf/get_bvar.c b/bpf/get_bvar.c
>> index 9625e764e..99a6503d5 100644
>> --- a/bpf/get_bvar.c
>> +++ b/bpf/get_bvar.c
>> @@ -53,12 +53,17 @@ noinline uint64_t dt_bvar_args(const dt_dctx_t *dctx, uint32_t idx)
>>
>>  noinline uint64_t dt_bvar_caller(const dt_dctx_t *dctx)
>>  {
>> -        uint64_t buf[2] = { 0, };
>> +        uint64_t buf[3] = { 0, };
>> +        volatile uint64_t
>> +                skip = (uint64_t)(&STACK_SKIP);
>>
>>          if (bpf_get_stack(dctx->ctx, buf, sizeof(buf),
>> -                          (uint64_t)(&STACK_SKIP) & BPF_F_SKIP_FIELD_MASK) < 0)
>> +                          skip & BPF_F_SKIP_FIELD_MASK) < 0)
>>                  return 0;
>>
>> +        /* If we had to skip any frames, account for the dt_bvar_caller() frame. */
>> +        if (skip)
>> +                return buf[2];
>>          return buf[1];
>>  }
>>
>> @@ -203,9 +208,11 @@ noinline uint64_t dt_bvar_stackdepth(const dt_dctx_t *dctx)
>>          uint32_t bufsiz = (uint32_t) (uint64_t) (&STKSIZ);
>>          char *buf = dctx->mem + (uint64_t)(&STACK_OFF);
>>          uint64_t retv;
>> +        volatile uint64_t
>> +                skip = (uint64_t)(&STACK_SKIP);
>>
>>          retv = bpf_get_stack(dctx->ctx, buf, bufsiz,
>> -                             (uint64_t)(&STACK_SKIP) & BPF_F_SKIP_FIELD_MASK);
>> +                             skip & BPF_F_SKIP_FIELD_MASK);
>>          if (retv < 0)
>>                  return error(dctx, DTRACEFLT_BADSTACK, 0 /* FIXME */);
>>
>> @@ -217,7 +224,11 @@ noinline uint64_t dt_bvar_stackdepth(const dt_dctx_t *dctx)
>>           * If retv==bufsiz, presumably the stack is larger than what we
>>           * can retrieve. But it's also possible that the buffer was exactly
>>           * large enough. So, leave it to the user to interpret the result.
>> +         *
>> +         * If we had to skip any frames, account for the dt_bvar_stackdepth() frame.
>>           */
>> +        if (skip)
>> +                return retv / sizeof(uint64_t) - 1;
>>          return retv / sizeof(uint64_t);
>>  }