[DTrace-devel] [PATCH v4] Implement TLS variables
Kris Van Hees
kris.van.hees at oracle.com
Thu Dec 2 04:30:30 UTC 2021
On Wed, Dec 01, 2021 at 08:39:48PM -0500, Eugene Loh via DTrace-devel wrote:
> On 12/1/21 2:25 AM, Kris Van Hees wrote:
>
> > On Tue, Nov 30, 2021 at 06:03:38PM -0500, Eugene Loh via DTrace-devel wrote:
> > > We should mention in the commit message that the value size in the TLS BPF
> > > map is the largest TLS variable size, which has drawbacks.
> > I didn't mention that because it is no different from the legacy DTrace
> > implementation. So there is no additional drawback as compared to legacy
> > DTrace. We are not usually using commit messages to document things in
> > DTrace that have always been this way.
> Okay. Going forward, however, I suspect there will be fewer and fewer
> developers who know Solaris or legacy DTrace and presumably legacy DTrace
> (on Linux) will become a blip of history.
I will document it in the comment block at the creation of BPF maps. This info
does not belong in the commit message.
> > > I had some questions about "store only nonzero" for TLS variables.
> > >
> > > * E.g., what happens when I load an unassigned TLS variable?
> > > I guess dt_cg_load_var() calls get_tvar(varid, 1), which
> > > looks up the key and presumably gets a NULL value pointer.
> > > Then -- and here is what puzzles me -- we create a hash entry
> > > with the default value (0) and return a pointer to the newly
> > > created zero value. Isn't that wrong? Why are we allocating
> > > something in the hash table?
> > >
> > > Results are correct, but that's because we're allocating a bunch of
> > > zeroes in the BPF map. Consider:
> > > $ git diff
> > > diff --git a/libdtrace/dt_cg.c b/libdtrace/dt_cg.c
> > > @@ -2076,7 +2076,7 @@ dt_cg_load_var(dt_node_t *dst, dt_irlist_t
> > > *dlp, dt_regset_t *drp)
> > > dt_regset_free_args(drp);
> > >
> > > emit(dlp, BPF_BRANCH_IMM(BPF_JNE, BPF_REG_0, 0,
> > > lbl_notnull));
> > > - emit(dlp, BPF_MOV_IMM(dst->dn_reg, 0));
> > > + emit(dlp, BPF_MOV_IMM(dst->dn_reg, 12345678));
> > > emit(dlp, BPF_JUMP(lbl_done));
> > > emitl(dlp, lbl_notnull,
> > > BPF_MOV_REG(dst->dn_reg, BPF_REG_0));
> > > $ dtrace -n 'self x; BEGIN { trace(self->x); exit(0) }'
> > > dtrace: description 'self x' matched 1 probe
> > > CPU ID FUNCTION:NAME
> > > 1 1 :BEGIN 0
> > > That is, I hack the code to return 12345678 when a variable is
> > > unassigned, but the D script returns 0.
> > >
> > > Or, using unhacked bits, consider:
> > > dtrace -n 'self x, y, z; BEGIN { trace(self->x); }'
> > > dtrace -n 'self x, y, z; BEGIN { trace(self->x); trace(self->y); }'
> > > dtrace -n 'self x, y, z; BEGIN { trace(self->x); trace(self->y);
> > > trace(self->z); }'
> > > None of them exits, so I can run:
> > > bpftool map dump id `bpftool map | awk '/name dvars/ {print
> > > int($1)}'`
> > > I get
> > > key: 00 00 00 00 00 00 00 00 value: 00 00 00 00
> > > key: 00 00 00 00 09 66 00 00 value: 00 00 00 00
> > > Found 2 elements
> > > then
> > > key: 00 00 00 00 00 00 00 00 value: 00 00 00 00
> > > key: 01 00 00 00 15 66 00 00 value: 00 00 00 00
> > > key: 00 00 00 00 15 66 00 00 value: 00 00 00 00
> > > Found 3 elements
> > > then
> > > key: 00 00 00 00 00 00 00 00 value: 00 00 00 00
> > > key: 02 00 00 00 2d 66 00 00 value: 00 00 00 00
> > > key: 01 00 00 00 2d 66 00 00 value: 00 00 00 00
> > > key: 00 00 00 00 2d 66 00 00 value: 00 00 00 00
> > > Found 4 elements
> > > That is, the dvars map shows that unassigned variables are being
> > > allocated in the BPF map.
> > >
> > > Here is another way of seeing the problem. Although
> > > dtrace -xdynvarsize=15 -n 'self x, y, z, w; BEGIN {
> > > trace(self->x); trace(self->y); trace(self->z); trace(self->w);
> > > }'
> > > correctly shows the values are 0, that's because it's allocating
> > > zeroes in the BPF map until it runs out of room and then
> > > interprets "ran out of room" as zero values. Hmm.
> > > So next we try:
> > > dtrace -xdynvarsize=15 -n 'self x, y, z, w; BEGIN {
> > > trace(self->x); trace(self->y); trace(self->z); self->w = 1234;
> > > trace(self->w); }'
> > > dtrace: error on enabled probe ID 2 (ID 1: dtrace:::BEGIN): invalid
> > > address (0x0) in action #1
> > > This exposes our having run out of room.
> > Thanks for the detailed analysis, though you could have saved yourself some
> > time by limiting it to just a single instance of demonstrating the problem :)
> Ironically, it was the opposite. I started writing, kept feeling
> unsatisfied with my understanding, etc. In the end, I was too tired to edit
> what I had written. "I was too busy to write shorter." :^(
> > The logic for handling zero/non-zero is flawed (as you found) because of the
> > multiple modes for using get_tvar(). Fix in v5 (just finished that).
> >
> > > * What happens when I store a zero value? I guess dt_cg_store_var()
> > > calls get_tvar(varid, dnp->dn_reg). The secret sauce is when
> > > the contents of dnp->dn_reg are 0. Does that mean the deletion
> > > only occurs when, both, we store by value AND the value is 0?
> > > E.g., the following does not work; should it?
> > > # cat z.d
> > > #pragma D option dynvarsize=15
> > > struct foo {
> > > int x;
> > > };
> > > self struct foo a, b, c, d;
> > > BEGIN {
> > > self->a.x = 0;
> > > self->b.x = 0;
> > > self->c.x = 0;
> > > self->d.x = 0;
> > > exit(0);
> > > }
> > > ERROR {
> > > printf("ugh\n");
> > > exit(1);
> > > }
> > > # dtrace -s z.d
> > > dtrace: script 'z.d' matched 2 probes
> > > dtrace: error on enabled probe ID 3 (ID 1: dtrace:::BEGIN): invalid
> > > address (0x0) in action #1
> > > CPU ID FUNCTION:NAME
> > > 3 3 :ERROR ugh
> > Did you actually try this on the legacy version of DTrace? I assume not
> > because I just did and got:
> >
> > dtrace: script 'z.d' matched 2 probes
> > dtrace: error on enabled probe ID 1 (ID 1: dtrace:::BEGIN): invalid address (0x0) in action #1 at DIF offset 8
> > CPU ID FUNCTION:NAME
> > 2 3 :ERROR ugh
> >
> > So I think that the behaviour of my implementation isn't unreasonable.
> I tried legacy DTrace, but the problems I encountered there seemed to be a
> different subject and so I decided not to bring them up. Code "like this"
> works on legacy DTrace... there are examples in the test suite. I do not
> understand the issues.
>
> Anyhow, you explain below some of the questions.
> > > There needs to be a more graceful failure (and a test) for the case that
> > > dt_options[DTRACEOPT_DYNVARSIZE] < dt_maxtlslen . Currently,
> > > # cat z.d
> > > #pragma D option dynvarsize=15
> > > struct foo {
> > > int x1, x2, x3, x4;
> > > };
> > > self struct foo a;
> > > BEGIN {
> > > trace(self->a.x1);
> > > printf("good\n");
> > > exit(0);
> > > }
> > > ERROR {
> > > printf("ugh\n");
> > > exit(1);
> > > }
> > > # dtrace -s z.d
> > > dtrace: script 'z.d' matched 2 probes
> > > BPF: fd -1 is not pointing to valid bpf_map
> > > BPF: verification time 18 usec
> > > BPF: stack depth
> > > BPF: processed 0 insns (limit 1000000) max_states_per_insn 0
> > > total_states 0 peak_states 0 mark_read 0
> > > dtrace: could not enable tracing: BPF program load for 'dtrace:::BEGIN'
> > > failed: Bad file descriptor
> > It will now cause an invalid pointer fault - which in the future will be a
> > dynamic variable drop. That would be valid behaviour because if you set the
> > dynvarsize too low, you do not have room to store any variable so any attempt
> > to store one would be a drop.
> Hmm. A characteristic of a good product is good diagnosibility. This
> particular case is such an easy condition to diagnose. We should help
> users. If they have TLS variables in their scripts and we know we will
> never handle any of them, we might as well let them know. "Valid behaviour"
> is a low bar. In this case, it is very easy for us to do much better.
Sure, but what about a script that uses 2 TLS variables in a clause, but it is
executed with a dynvarsize that only accomodates one variable? Then it is OK
that we let it go through without a problem and the user will see a dynamic
variable drop? And no, we cannot statically determine how many TLS variables
actually will be used in all cases.
You can file a github issue for this to suggest an enhacement to implement a
check on dynvarsize < maxtlslen, with appropriate error message being displayed.
But it won't be part of this patch - here I am implementing the behaviour that
exists in DTrace. Changes to that behaviour are not really part of the port of
TLS variables but rather an enhancement.
> > > I cannot seem to get test/unittest/variables/tvar/tst.struct.d to run on
> > > either Solaris or legacy DTrace.
> > Yes, that was passing because of the aforementioned issue with loading a
> > var that does not exist actually causing it to be created. With v5, this
> > test fails (and is replaced by a new one). Note though that TLS variables with
> > a struct datatype are a can of worms, both in the legacy version and with this
> > implementation due to semantics not being properly defined for it. E.g. there
> > is no way to delete a TLS variable with a struct datatype because you cannot
> > assign 0 (a scalar) to it. And you cannot assign a member in a TLS struct
> > unless that TLS variable has already been created (as this test failure shows).
> Ah. Good to know. I was asking about this earlier. I'd be in favor of
> mentioning such limitations in this patch's commit message.
The commit message is not documentation. This is an issue with hoe TLS vars
are handled in DTrace - it is not a limitation of this specific implementation.
This is again an issue that can be tracked as a github issue. I do not know
when (or how) we can address this - it seems to be a rather significant design
issue. But it is worth tracking in case we come up with sensible semantics
for this (and implememnt them).
More information about the DTrace-devel
mailing list