[DTrace-devel] [PATCH] ERROR probe implementation

Wed Jan 20 23:04:28 PST 2021

On Wed, Jan 20, 2021 at 11:23:59PM -0500, Eugene Loh wrote:
> I'm still wading through this, but I figured I'd send what I have.
> 
> I cannot tell if the following files need 2021 copyright notices or if 
> this patch will be applied after some other patch that already updates 
> those files.
>      include/dtrace/difo_defines.h
>      libdtrace/dt_as.c
>      libdtrace/dt_cc.c
>      libdtrace/dt_cg.c
>      libdtrace/dt_consume.c
>      libdtrace/dt_impl.h
>      libdtrace/dt_prov_dtrace.c
>      libdtrace/dtrace.h

Mostly already done in an earlier patch leading up to this.  Some still need
to be tweaked.

> The test .r results files are missing "at DIF offset" clauses.  Is this 
> a loss of functionality?  The commit message explicitly mentions offset 
> as being something we can record.

The commit message documents what the ERROR probe supports in terms of design
(i.e. what it supports as part of DTrace's design).  Right now, we cannot give
instruction offsets yet for the place where the ERROR probe is triggered, so
we pass in -1 which means there is no declared offset.  So, the .r files are
not missing the "at DIF offset" information inasmuch as they do not mention it
because the tests are not expected to emit it.  It does not invalidate the
test.  It can be seen as a minor loss of functionality but the legacy code
allows for this type of situation already.  I do plan on being able to report
the instruction offset at some point in the future - it is a WIP though.

> In dt_work.c, is "#if 0" supposed to be a "#ifdef FIXME" or dead code to 
> be removed?   Same question with regards to other dt_errprog references.
> 
> Same question in dt_cg_epilogue().  In any case, the #if should include 
> the block comment that describes the code;  if the code is dead, then so 
> are the comments.

Yes, that is cleanup that I still have to do.

> The function dt_cg_call_clause() has the typo "furthe".

Thanks.

> Since both dt_cg_tramp_epilogue() and dt_cg_tramp_epilogue_advance() 
> have grown -- and in exactly the same way -- perhaps it makes sense to 
> implement the former by calling the latter.  This could also simplify 
> dt_prov_dtrace.c, the only caller of dt_cg_tramp_epilogue_advance().

The problem is that they both implement the same code at their beginning, then
one implements extra code, and then they again share code.  So I cannot just
call one from the other.

> The two fprintf(stderr, ...) in dt_handle_err() look like debugging code 
> that was not supposed to remain in the patch.

Oops, yes, I missed those when I removed debugging code.  Thanks.

> In bpf/get_bvar.c, the "#define error()" strikes me as funny.  To me, it 
> does not feel like it buys much.  Anyhow, that's an opinion thing.  
> FWIW, if we are okay with the ({...}) construct, then the following 
> advice in CODING-STYLE should be updated:
>          Macros with multiple statements should
>          be enclosed in a do-while(0) block

Well, it makes it possible to write "return error(...)" which is similar to
the typical "return dt_probe_error(...)" that one would use.  To make that
possible we need a macro expression using the ({ ... }) syntax.  That is not
the same as a multi-statement code block.

> In dt_difo_copy():
> 
>    x You have multiple memory allocations.  If any of them fail, you go
>      to no_mem, which simply returns a NULL pointer.  I suppose that's
>      okay, but it's very much unlike related code paths, which diligently
>      free any successful allocations.  Wouldn't consistency dictate cleaning
>      up any allocations in the event that not all allocations are 
> successful?

Oh shoot, I forgot to add those at the end.  Fixing...

>    x The bulk of the function could be written much more compactly
>      and readably something like this:
> 
>          #define DIFO_COPY_FLD(LEN, FLD, TYP) \
>              if (odp->LEN > 0) { \
>                  dp->LEN = odp->LEN; \
>                  dp->FLD = dt_calloc(dtp, dp->LEN, sizeof(TYP)); \
>                  if (dp->FLD == NULL) \
>                      goto no_mem; \
>                  memcpy(dp->FLD, odp->FLD, dp->LEN * sizeof(TYP)); \
>              }
> 
>          DIFO_COPY_FLD(dtdo_len, dtdo_buf, struct bpf_insn)
>          DIFO_COPY_FLD(dtdo_strlen, dtdo_strtab, char)
>          DIFO_COPY_FLD(dtdo_varlen, dtdo_vartab, dtrace_difv_t)
>          DIFO_COPY_FLD(dtdo_brelen, dtdo_breltab, dof_relodesc_t)
>          DIFO_COPY_FLD(dtdo_krelen, dtdo_kreltab, dof_relodesc_t)
>          DIFO_COPY_FLD(dtdo_urelen, dtdo_ureltab, dof_relodesc_t)

Good idea - I will do something like that.

> In bpf/Build, why isn't probe_error.c inserted in alphabetical order?

It actually is in terms of functions we are actually using right now, with
the caveat that I sort get_* and set_* combinations on the get_* variant.
The memcpy.c and strnlen() are as yet not in use so I left them at the end.

> In dt_probe_error_clause():
> 
>    x two variables "name" are declared and used.
>      How about renaming the inner one (or just inlining its single use and
>      doing away with that variable altogether?

It's not really an issue, but yes, just having one is less confusing.

>    x In dt_probe_error_clause(), IMHO, the loop body would be clearer
>      like this:
>          if (ridp != NULL && ...) {
>              ... assign val ...
>          }
>          if (val > 0) {
>              ... do relocation ...
>          } else {
>              ... copy relocation ...
>          }
>      I realize that's an opinion thing, but if the code is going to be 
> left as
>      is, at least perhaps the val declaration can be moved into the 
> innermost
>      block that uses it.

I like you suggestion...  Thanks.

>    x Why "10"?  (I think I know, but a comment would be nice.)

Yes, it's to get past "dt_clause_".  I'll add a comment or perhaps I will
just make "dt_clause+" and "dt_error_" be represented by macros so I can
refer to those rather than constant strings that need to be kept in sync in
multiple places.

> Why was dt_cg_node()'s DT_TOK_PTR/DT_TOK_DOT case handling a NULL 
> pointer specially, and why has this handling been changed?  I realize 
> that NULL pointers are bad addresses, but they are certainly not the 
> only ones.  This is only one very special case.

BPF requires that pointers are validated to not be NULL.  And since a NULL
pointer is bad news, it is appropriate to report it as a BADADDR fault.

Sure, there are many other addresses that are bad addresses as well, but we
currently do not have a mechanism to validate them.  As we implement proper
kernel memory accesses vs BPF program data accesses, support for more generic
BADADDR faults will be added.

> Somewhat related to the above, why are tests changed from
>      -    *(char *)NULL;
>      +    trace(((struct task_struct *)NULL)->pid);
> I guess part of the issue is that some of these constructs will be 
> caught by the BPF verifier.

Very simple...  *(char *)NULL generates code that will not pass the BPF
verifier because we do not generate correct code for it (yet) whereas we
do generate valid code for ((struct task_struct *)NULL)->pid.  The test is
about the ERROR probe firing with a BADADDR fault, so the mechanism for
triggering that is not relevant - picking one that actually works for the
test is prudent.