[DTrace-devel] [PATCH v4] Implement TLS variables

Thu Dec 2 04:30:30 UTC 2021

On Wed, Dec 01, 2021 at 08:39:48PM -0500, Eugene Loh via DTrace-devel wrote:
> On 12/1/21 2:25 AM, Kris Van Hees wrote:
> 
> > On Tue, Nov 30, 2021 at 06:03:38PM -0500, Eugene Loh via DTrace-devel wrote:
> > > We should mention in the commit message that the value size in the TLS BPF
> > > map is the largest TLS variable size, which has drawbacks.
> > I didn't mention that because it is no different from the legacy DTrace
> > implementation.  So there is no additional drawback as compared to legacy
> > DTrace.  We are not usually using commit messages to document things in
> > DTrace that have always been this way.
> Okay.  Going forward, however, I suspect there will be fewer and fewer
> developers who know Solaris or legacy DTrace and presumably legacy DTrace
> (on Linux) will become a blip of history.

I will document it in the comment block at the creation of BPF maps.  This info
does not belong in the commit message.

> > > I had some questions about "store only nonzero" for TLS variables.
> > > 
> > >    * E.g., what happens when I load an unassigned TLS variable?
> > >      I guess dt_cg_load_var() calls get_tvar(varid, 1), which
> > >      looks up the key and presumably gets a NULL value pointer.
> > >      Then -- and here is what puzzles me -- we create a hash entry
> > >      with the default value (0) and return a pointer to the newly
> > >      created zero value.  Isn't that wrong?  Why are we allocating
> > >      something in the hash table?
> > > 
> > >      Results are correct, but that's because we're allocating a bunch of
> > >      zeroes in the BPF map.  Consider:
> > >          $ git diff
> > >          diff --git a/libdtrace/dt_cg.c b/libdtrace/dt_cg.c
> > >          @@ -2076,7 +2076,7 @@ dt_cg_load_var(dt_node_t *dst, dt_irlist_t
> > > *dlp, dt_regset_t *drp)
> > >                  dt_regset_free_args(drp);
> > > 
> > >                  emit(dlp,  BPF_BRANCH_IMM(BPF_JNE, BPF_REG_0, 0,
> > > lbl_notnull));
> > >          -       emit(dlp,  BPF_MOV_IMM(dst->dn_reg, 0));
> > >          +       emit(dlp,  BPF_MOV_IMM(dst->dn_reg, 12345678));
> > >                  emit(dlp,  BPF_JUMP(lbl_done));
> > >                  emitl(dlp, lbl_notnull,
> > >                             BPF_MOV_REG(dst->dn_reg, BPF_REG_0));
> > >          $ dtrace -n 'self x; BEGIN { trace(self->x); exit(0) }'
> > >          dtrace: description 'self x' matched 1 probe
> > >          CPU     ID                    FUNCTION:NAME
> > >            1      1                           :BEGIN           0
> > >      That is, I hack the code to return 12345678 when a variable is
> > >      unassigned, but the D script returns 0.
> > > 
> > >      Or, using unhacked bits, consider:
> > >          dtrace -n 'self x, y, z; BEGIN { trace(self->x); }'
> > >          dtrace -n 'self x, y, z; BEGIN { trace(self->x); trace(self->y); }'
> > >          dtrace -n 'self x, y, z; BEGIN { trace(self->x); trace(self->y);
> > > trace(self->z); }'
> > >      None of them exits, so I can run:
> > >          bpftool map dump id `bpftool map | awk '/name dvars/ {print
> > > int($1)}'`
> > >      I get
> > >          key: 00 00 00 00 00 00 00 00  value: 00 00 00 00
> > >          key: 00 00 00 00 09 66 00 00  value: 00 00 00 00
> > >          Found 2 elements
> > >      then
> > >          key: 00 00 00 00 00 00 00 00  value: 00 00 00 00
> > >          key: 01 00 00 00 15 66 00 00  value: 00 00 00 00
> > >          key: 00 00 00 00 15 66 00 00  value: 00 00 00 00
> > >          Found 3 elements
> > >      then
> > >          key: 00 00 00 00 00 00 00 00  value: 00 00 00 00
> > >          key: 02 00 00 00 2d 66 00 00  value: 00 00 00 00
> > >          key: 01 00 00 00 2d 66 00 00  value: 00 00 00 00
> > >          key: 00 00 00 00 2d 66 00 00  value: 00 00 00 00
> > >          Found 4 elements
> > >      That is, the dvars map shows that unassigned variables are being
> > >      allocated in the BPF map.
> > > 
> > >      Here is another way of seeing the problem.  Although
> > >          dtrace -xdynvarsize=15 -n 'self x, y, z, w; BEGIN {
> > >              trace(self->x); trace(self->y); trace(self->z); trace(self->w);
> > > }'
> > >      correctly shows the values are 0, that's because it's allocating
> > >      zeroes in the BPF map until it runs out of room and then
> > >      interprets "ran out of room" as zero values.  Hmm.
> > >      So next we try:
> > >          dtrace -xdynvarsize=15 -n 'self x, y, z, w; BEGIN {
> > >              trace(self->x); trace(self->y); trace(self->z); self->w = 1234;
> > > trace(self->w); }'
> > >          dtrace: error on enabled probe ID 2 (ID 1: dtrace:::BEGIN): invalid
> > > address (0x0) in action #1
> > >      This exposes our having run out of room.
> > Thanks for the detailed analysis, though you could have saved yourself some
> > time by limiting it to just a single instance of demonstrating the problem :)
> Ironically, it was the opposite.  I started writing, kept feeling
> unsatisfied with my understanding, etc.  In the end, I was too tired to edit
> what I had written.  "I was too busy to write shorter."  :^(
> > The logic for handling zero/non-zero is flawed (as you found) because of the
> > multiple modes for using get_tvar().  Fix in v5 (just finished that).
> > 
> > >    * What happens when I store a zero value?  I guess dt_cg_store_var()
> > >      calls get_tvar(varid, dnp->dn_reg).  The secret sauce is when
> > >      the contents of dnp->dn_reg are 0.  Does that mean the deletion
> > >      only occurs when, both, we store by value AND the value is 0?
> > >      E.g., the following does not work;  should it?
> > >          # cat z.d
> > >          #pragma D option dynvarsize=15
> > >          struct foo {
> > >            int x;
> > >          };
> > >          self struct foo a, b, c, d;
> > >          BEGIN {
> > >            self->a.x = 0;
> > >            self->b.x = 0;
> > >            self->c.x = 0;
> > >            self->d.x = 0;
> > >            exit(0);
> > >          }
> > >          ERROR {
> > >            printf("ugh\n");
> > >            exit(1);
> > >          }
> > >          # dtrace -s z.d
> > >          dtrace: script 'z.d' matched 2 probes
> > >          dtrace: error on enabled probe ID 3 (ID 1: dtrace:::BEGIN): invalid
> > > address (0x0) in action #1
> > >          CPU     ID                    FUNCTION:NAME
> > >            3      3                           :ERROR ugh
> > Did you actually try this on the legacy version of DTrace?  I assume not
> > because I just did and got:
> > 
> > dtrace: script 'z.d' matched 2 probes
> > dtrace: error on enabled probe ID 1 (ID 1: dtrace:::BEGIN): invalid address (0x0) in action #1 at DIF offset 8
> > CPU     ID                    FUNCTION:NAME
> >    2      3                           :ERROR ugh
> > 
> > So I think that the behaviour of my implementation isn't unreasonable.
> I tried legacy DTrace, but the problems I encountered there seemed to be a
> different subject and so I decided not to bring them up. Code "like this"
> works on legacy DTrace... there are examples in the test suite.  I do not
> understand the issues.
> 
> Anyhow, you explain below some of the questions.
> > > There needs to be a more graceful failure (and a test) for the case that
> > > dt_options[DTRACEOPT_DYNVARSIZE] < dt_maxtlslen . Currently,
> > >      # cat z.d
> > >      #pragma D option dynvarsize=15
> > >      struct foo {
> > >          int x1, x2, x3, x4;
> > >      };
> > >      self struct foo a;
> > >      BEGIN {
> > >          trace(self->a.x1);
> > >          printf("good\n");
> > >          exit(0);
> > >      }
> > >      ERROR {
> > >          printf("ugh\n");
> > >          exit(1);
> > >      }
> > >      # dtrace -s z.d
> > >      dtrace: script 'z.d' matched 2 probes
> > >      BPF: fd -1 is not pointing to valid bpf_map
> > >      BPF: verification time 18 usec
> > >      BPF: stack depth
> > >      BPF: processed 0 insns (limit 1000000) max_states_per_insn 0
> > > total_states 0 peak_states 0 mark_read 0
> > >      dtrace: could not enable tracing: BPF program load for 'dtrace:::BEGIN'
> > > failed: Bad file descriptor
> > It will now cause an invalid pointer fault - which in the future will be a
> > dynamic variable drop.  That would be valid behaviour because if you set the
> > dynvarsize too low, you do not have room to store any variable so any attempt
> > to store one would be a drop.
> Hmm.  A characteristic of a good product is good diagnosibility. This
> particular case is such an easy condition to diagnose.  We should help
> users.  If they have TLS variables in their scripts and we know we will
> never handle any of them, we might as well let them know.  "Valid behaviour"
> is a low bar.  In this case, it is very easy for us to do much better.

Sure, but what about a script that uses 2 TLS variables in a clause, but it is
executed with a dynvarsize that only accomodates one variable?  Then it is OK
that we let it go through without a problem and the user will see a dynamic
variable drop?  And no, we cannot statically determine how many TLS variables
actually will be used in all cases.

You can file a github issue for this to suggest an enhacement to implement a
check on dynvarsize < maxtlslen, with appropriate error message being displayed.
But it won't be part of this patch - here I am implementing the behaviour that
exists in DTrace.  Changes to that behaviour are not really part of the port of
TLS variables but rather an enhancement.

> > > I cannot seem to get test/unittest/variables/tvar/tst.struct.d to run on
> > > either Solaris or legacy DTrace.
> > Yes, that was passing because of the aforementioned issue with loading a
> > var that does not exist actually causing it to be created.  With v5, this
> > test fails (and is replaced by a new one).  Note though that TLS variables with
> > a struct datatype are a can of worms, both in the legacy version and with this
> > implementation due to semantics not being properly defined for it.  E.g. there
> > is no way to delete a TLS variable with a struct datatype because you cannot
> > assign 0 (a scalar) to it.  And you cannot assign a member in a TLS struct
> > unless that TLS variable has already been created (as this test failure shows).
> Ah.  Good to know.  I was asking about this earlier.  I'd be in favor of
> mentioning such limitations in this patch's commit message.

The commit message is not documentation.  This is an issue with hoe TLS vars
are handled in DTrace - it is not a limitation of this specific implementation.
This is again an issue that can be tracked as a github issue.  I do not know
when (or how) we can address this - it seems to be a rather significant design
issue.  But it is worth tracking in case we come up with sensible semantics
for this (and implememnt them).