[DTrace-devel] [PATCH v4] Implement TLS variables

Thu Dec 2 01:39:48 UTC 2021

On 12/1/21 2:25 AM, Kris Van Hees wrote:

> On Tue, Nov 30, 2021 at 06:03:38PM -0500, Eugene Loh via DTrace-devel wrote:
>> We should mention in the commit message that the value size in the TLS BPF
>> map is the largest TLS variable size, which has drawbacks.
> I didn't mention that because it is no different from the legacy DTrace
> implementation.  So there is no additional drawback as compared to legacy
> DTrace.  We are not usually using commit messages to document things in
> DTrace that have always been this way.
Okay.  Going forward, however, I suspect there will be fewer and fewer 
developers who know Solaris or legacy DTrace and presumably legacy 
DTrace (on Linux) will become a blip of history.
>> I had some questions about "store only nonzero" for TLS variables.
>>
>>    * E.g., what happens when I load an unassigned TLS variable?
>>      I guess dt_cg_load_var() calls get_tvar(varid, 1), which
>>      looks up the key and presumably gets a NULL value pointer.
>>      Then -- and here is what puzzles me -- we create a hash entry
>>      with the default value (0) and return a pointer to the newly
>>      created zero value.  Isn't that wrong?  Why are we allocating
>>      something in the hash table?
>>
>>      Results are correct, but that's because we're allocating a bunch of
>>      zeroes in the BPF map.  Consider:
>>          $ git diff
>>          diff --git a/libdtrace/dt_cg.c b/libdtrace/dt_cg.c
>>          @@ -2076,7 +2076,7 @@ dt_cg_load_var(dt_node_t *dst, dt_irlist_t
>> *dlp, dt_regset_t *drp)
>>                  dt_regset_free_args(drp);
>>
>>                  emit(dlp,  BPF_BRANCH_IMM(BPF_JNE, BPF_REG_0, 0,
>> lbl_notnull));
>>          -       emit(dlp,  BPF_MOV_IMM(dst->dn_reg, 0));
>>          +       emit(dlp,  BPF_MOV_IMM(dst->dn_reg, 12345678));
>>                  emit(dlp,  BPF_JUMP(lbl_done));
>>                  emitl(dlp, lbl_notnull,
>>                             BPF_MOV_REG(dst->dn_reg, BPF_REG_0));
>>          $ dtrace -n 'self x; BEGIN { trace(self->x); exit(0) }'
>>          dtrace: description 'self x' matched 1 probe
>>          CPU     ID                    FUNCTION:NAME
>>            1      1                           :BEGIN           0
>>      That is, I hack the code to return 12345678 when a variable is
>>      unassigned, but the D script returns 0.
>>
>>      Or, using unhacked bits, consider:
>>          dtrace -n 'self x, y, z; BEGIN { trace(self->x); }'
>>          dtrace -n 'self x, y, z; BEGIN { trace(self->x); trace(self->y); }'
>>          dtrace -n 'self x, y, z; BEGIN { trace(self->x); trace(self->y);
>> trace(self->z); }'
>>      None of them exits, so I can run:
>>          bpftool map dump id `bpftool map | awk '/name dvars/ {print
>> int($1)}'`
>>      I get
>>          key: 00 00 00 00 00 00 00 00  value: 00 00 00 00
>>          key: 00 00 00 00 09 66 00 00  value: 00 00 00 00
>>          Found 2 elements
>>      then
>>          key: 00 00 00 00 00 00 00 00  value: 00 00 00 00
>>          key: 01 00 00 00 15 66 00 00  value: 00 00 00 00
>>          key: 00 00 00 00 15 66 00 00  value: 00 00 00 00
>>          Found 3 elements
>>      then
>>          key: 00 00 00 00 00 00 00 00  value: 00 00 00 00
>>          key: 02 00 00 00 2d 66 00 00  value: 00 00 00 00
>>          key: 01 00 00 00 2d 66 00 00  value: 00 00 00 00
>>          key: 00 00 00 00 2d 66 00 00  value: 00 00 00 00
>>          Found 4 elements
>>      That is, the dvars map shows that unassigned variables are being
>>      allocated in the BPF map.
>>
>>      Here is another way of seeing the problem.  Although
>>          dtrace -xdynvarsize=15 -n 'self x, y, z, w; BEGIN {
>>              trace(self->x); trace(self->y); trace(self->z); trace(self->w);
>> }'
>>      correctly shows the values are 0, that's because it's allocating
>>      zeroes in the BPF map until it runs out of room and then
>>      interprets "ran out of room" as zero values.  Hmm.
>>      So next we try:
>>          dtrace -xdynvarsize=15 -n 'self x, y, z, w; BEGIN {
>>              trace(self->x); trace(self->y); trace(self->z); self->w = 1234;
>> trace(self->w); }'
>>          dtrace: error on enabled probe ID 2 (ID 1: dtrace:::BEGIN): invalid
>> address (0x0) in action #1
>>      This exposes our having run out of room.
> Thanks for the detailed analysis, though you could have saved yourself some
> time by limiting it to just a single instance of demonstrating the problem :)
Ironically, it was the opposite.  I started writing, kept feeling 
unsatisfied with my understanding, etc.  In the end, I was too tired to 
edit what I had written.  "I was too busy to write shorter."  :^(
> The logic for handling zero/non-zero is flawed (as you found) because of the
> multiple modes for using get_tvar().  Fix in v5 (just finished that).
>
>>    * What happens when I store a zero value?  I guess dt_cg_store_var()
>>      calls get_tvar(varid, dnp->dn_reg).  The secret sauce is when
>>      the contents of dnp->dn_reg are 0.  Does that mean the deletion
>>      only occurs when, both, we store by value AND the value is 0?
>>      E.g., the following does not work;  should it?
>>          # cat z.d
>>          #pragma D option dynvarsize=15
>>          struct foo {
>>            int x;
>>          };
>>          self struct foo a, b, c, d;
>>          BEGIN {
>>            self->a.x = 0;
>>            self->b.x = 0;
>>            self->c.x = 0;
>>            self->d.x = 0;
>>            exit(0);
>>          }
>>          ERROR {
>>            printf("ugh\n");
>>            exit(1);
>>          }
>>          # dtrace -s z.d
>>          dtrace: script 'z.d' matched 2 probes
>>          dtrace: error on enabled probe ID 3 (ID 1: dtrace:::BEGIN): invalid
>> address (0x0) in action #1
>>          CPU     ID                    FUNCTION:NAME
>>            3      3                           :ERROR ugh
> Did you actually try this on the legacy version of DTrace?  I assume not
> because I just did and got:
>
> dtrace: script 'z.d' matched 2 probes
> dtrace: error on enabled probe ID 1 (ID 1: dtrace:::BEGIN): invalid address (0x0) in action #1 at DIF offset 8
> CPU     ID                    FUNCTION:NAME
>    2      3                           :ERROR ugh
>
> So I think that the behaviour of my implementation isn't unreasonable.
I tried legacy DTrace, but the problems I encountered there seemed to be 
a different subject and so I decided not to bring them up. Code "like 
this" works on legacy DTrace... there are examples in the test suite.  I 
do not understand the issues.

Anyhow, you explain below some of the questions.
>> There needs to be a more graceful failure (and a test) for the case that
>> dt_options[DTRACEOPT_DYNVARSIZE] < dt_maxtlslen . Currently,
>>      # cat z.d
>>      #pragma D option dynvarsize=15
>>      struct foo {
>>          int x1, x2, x3, x4;
>>      };
>>      self struct foo a;
>>      BEGIN {
>>          trace(self->a.x1);
>>          printf("good\n");
>>          exit(0);
>>      }
>>      ERROR {
>>          printf("ugh\n");
>>          exit(1);
>>      }
>>      # dtrace -s z.d
>>      dtrace: script 'z.d' matched 2 probes
>>      BPF: fd -1 is not pointing to valid bpf_map
>>      BPF: verification time 18 usec
>>      BPF: stack depth
>>      BPF: processed 0 insns (limit 1000000) max_states_per_insn 0
>> total_states 0 peak_states 0 mark_read 0
>>      dtrace: could not enable tracing: BPF program load for 'dtrace:::BEGIN'
>> failed: Bad file descriptor
> It will now cause an invalid pointer fault - which in the future will be a
> dynamic variable drop.  That would be valid behaviour because if you set the
> dynvarsize too low, you do not have room to store any variable so any attempt
> to store one would be a drop.
Hmm.  A characteristic of a good product is good diagnosibility. This 
particular case is such an easy condition to diagnose.  We should help 
users.  If they have TLS variables in their scripts and we know we will 
never handle any of them, we might as well let them know.  "Valid 
behaviour" is a low bar.  In this case, it is very easy for us to do 
much better.
>> I cannot seem to get test/unittest/variables/tvar/tst.struct.d to run on
>> either Solaris or legacy DTrace.
> Yes, that was passing because of the aforementioned issue with loading a
> var that does not exist actually causing it to be created.  With v5, this
> test fails (and is replaced by a new one).  Note though that TLS variables with
> a struct datatype are a can of worms, both in the legacy version and with this
> implementation due to semantics not being properly defined for it.  E.g. there
> is no way to delete a TLS variable with a struct datatype because you cannot
> assign 0 (a scalar) to it.  And you cannot assign a member in a TLS struct
> unless that TLS variable has already been created (as this test failure shows).
Ah.  Good to know.  I was asking about this earlier.  I'd be in favor of 
mentioning such limitations in this patch's commit message.