[DTrace-devel] [PATCH 01/14] Fix stack-skip counts for caller and stackdepth

Eugene Loh eugene.loh at oracle.com
Mon Jun 16 19:21:09 UTC 2025


On 6/13/25 10:33, Nick Alcock wrote:

> On 22 May 2025, eugene loh told this:
>
>> From: Eugene Loh <eugene.loh at oracle.com>
>>
>> Apparently, when we call the BPF get_stack() helper function,
>> it generally knows how many frames to skip to get the real kernel
>> stack.  For fentry/fexit, however, this is apparently not the case,
>> and commit bc65cb44d
>> ("cg: allow providers to specify a skip count for stack retrieval")
>> added the ability to skip frames for fentry/fexit probes.
>>
>> When this "skip" is needed, however, it must be even deeper
>> when we descend further frames, such as when we call dt_bpf_*()
>> precompiled functions.
> Ack.
>
>> Add this support for dt_bpf_caller() and dt_bpf_stackdepth().
>> That is, if there are stack-skip frames, skip yet one more frame
>> when inside a bpf/get_bvar.c function.
> The general approach here seems sound, but this confuses me. Is the
> assumption that if the skip is not needed, get_stack() will itself know
> how many frames to skip, and will skip dt_bvar_caller() for you?

Yes.  I confirmed this with tests.

> (So a zero skip is actually a "the system knows", and a nonzero skip
> causes the system to not skip anything, so you have to do all the
> skipping yourself?

Yes.

> This behaviour is not documented in bpf_get_stack()'s documentation,
> which absolutely does not mean it doesn't do it...)

Yup:  I did not learn this from documentation!

>> Note that we declare the skip count volatile.  The compiler might
>> optimize code that uses the STACK_SKIP value, but we will subsequently
>> perform relocations that adjust this value.
> ... why doesn't this apply to every other extern global variable in
> get_bvar()? They're all similarly relocated...

Right.  There is potentially a broader problem.  But we simply do not 
have evidence of misbehavior in other cases.  Ruggedizing other cases 
could be the subject of a different patch.

The problem in this case is that the compiler seems to assume 
&symbol!=0, which is reasonable except that we violate that assumption 
with our relocation tricks.

Consider the C code:

     uint64_t dt_bvar_stackdepth(const dt_dctx_t *dctx)
     {
         uint32_t          bufsiz = (uint32_t) (uint64_t) (&STKSIZ);
         char              *buf = dctx->mem + (uint64_t)(&STACK_OFF);
         uint64_t          retv;
         volatile uint64_t skip = (uint64_t)(&STACK_SKIP);

         retv = bpf_get_stack(dctx->ctx,
                              buf,
                              bufsiz,
                              skip & BPF_F_SKIP_FIELD_MASK);

         if (skip)
                 return retv / sizeof(uint64_t) - 1;      // branch "A"
         return retv / sizeof(uint64_t);                  // branch "B"
     }

If you omit "volatile", the compiler assumes &STACK_SKIP!=0. The emitted 
code has:

     *)  no run-time "if (skip)" check

     *)  no code for branch "B"

     *)  only code for branch "A"

If you include "volatile", however, the compiler caches &STACK_SKIP on 
the BPF stack and later performs a run-time check on its value to 
correctly execute either branch "A" or branch "B".

Incidentally, I examined this by exercising the function with a D script 
like this:

     profile:::tick-1s,
     fbt:vmlinux:ksys_write:entry
     {
        trace(stackdepth);
     }

Notice that the profile probe has STACK_SKIP==0 while the fbt probe has 
STACK_SKIP==4.

I looked at the disassembly in three ways:

     *)  as emitted by the compiler:  objdump -d build/bpf--get_bvar.o
         (not too interesting, but it establishes a baseline;  e.g., that
         a branch is missing)

     *)  after relocation for tick-1s (-xdisasm=8 -S)

     *)  after relocation for fbt::ksys_write:entry (-xdisasm=8 -S)

The latter two confirm the relocations are correct.

>> Signed-off-by: Eugene Loh <eugene.loh at oracle.com>
>> ---
>>   bpf/get_bvar.c | 17 ++++++++++++++---
>>   1 file changed, 14 insertions(+), 3 deletions(-)
>>
>> diff --git a/bpf/get_bvar.c b/bpf/get_bvar.c
>> index 9625e764e..99a6503d5 100644
>> --- a/bpf/get_bvar.c
>> +++ b/bpf/get_bvar.c
>> @@ -53,12 +53,17 @@ noinline uint64_t dt_bvar_args(const dt_dctx_t *dctx, uint32_t idx)
>>   
>>   noinline uint64_t dt_bvar_caller(const dt_dctx_t *dctx)
>>   {
>> -	uint64_t	buf[2] = { 0, };
>> +	uint64_t	buf[3] = { 0, };
>> +	volatile uint64_t
>> +			skip = (uint64_t)(&STACK_SKIP);
>>   
>>   	if (bpf_get_stack(dctx->ctx, buf, sizeof(buf),
>> -			  (uint64_t)(&STACK_SKIP) & BPF_F_SKIP_FIELD_MASK) < 0)
>> +			  skip & BPF_F_SKIP_FIELD_MASK) < 0)
>>   		return 0;
>>   
>> +	/* If we had to skip any frames, account for the dt_bvar_caller() frame. */
>> +	if (skip)
>> +		return buf[2];
>>   	return buf[1];
>>   }
>>   
>> @@ -203,9 +208,11 @@ noinline uint64_t dt_bvar_stackdepth(const dt_dctx_t *dctx)
>>   	uint32_t	bufsiz = (uint32_t) (uint64_t) (&STKSIZ);
>>   	char		*buf = dctx->mem + (uint64_t)(&STACK_OFF);
>>   	uint64_t	retv;
>> +	volatile uint64_t
>> +			skip = (uint64_t)(&STACK_SKIP);
>>   
>>   	retv = bpf_get_stack(dctx->ctx, buf, bufsiz,
>> -			     (uint64_t)(&STACK_SKIP) & BPF_F_SKIP_FIELD_MASK);
>> +			     skip & BPF_F_SKIP_FIELD_MASK);
>>   	if (retv < 0)
>>   		return error(dctx, DTRACEFLT_BADSTACK, 0 /* FIXME */);
>>   
>> @@ -217,7 +224,11 @@ noinline uint64_t dt_bvar_stackdepth(const dt_dctx_t *dctx)
>>   	 * If retv==bufsiz, presumably the stack is larger than what we
>>   	 * can retrieve.  But it's also possible that the buffer was exactly
>>   	 * large enough.  So, leave it to the user to interpret the result.
>> +	 *
>> +	 * If we had to skip any frames, account for the dt_bvar_stackdepth() frame.
>>   	 */
>> +	if (skip)
>> +		return retv / sizeof(uint64_t) - 1;
>>   	return retv / sizeof(uint64_t);
>>   }


