[DTrace-devel] [PATCH] Optimize dt_cg_store_val() for string values

Mon Nov 15 19:18:13 UTC 2021

My main question is whether the simplified handling of the string-length 
prefix handles the prefix properly.  E.g.

     $ cat s.d
     #pragma D option rawbytes
     BEGIN {
trace("abcdefghijklmopnqrstuvwxyzABCDEFGHIJKLMOPNQRSTUVWXYZ");
       exit(0);
     }
     $ sudo dtrace -xstrsize=64 -s s.d
     dtrace: script 's.d' matched 1 probe
     CPU     ID                    FUNCTION:NAME
       1      1                           :BEGIN
              0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f 
0123456789abcdef
          0: 00 34 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6f 
.4abcdefghijklmo
         10: 70 6e 71 72 73 74 75 76 77 78 79 7a 41 42 43 44 
pnqrstuvwxyzABCD
         20: 45 46 47 48 49 4a 4b 4c 4d 4f 50 4e 51 52 53 54 
EFGHIJKLMOPNQRST
         30: 55 56 57 58 59 5a 00 00 00 00 00 00 00 00 00 00 
UVWXYZ..........
         40: 00 00 00                                         ...

     $ sudo dtrace -xstrsize=8 -s s.d
     dtrace: script 's.d' matched 1 probe
     CPU     ID                    FUNCTION:NAME
       2      1                           :BEGIN
              0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f 
0123456789abcdef
          0: 00 34 61 62 63 64 65 66 67 68 00 .4abcdefgh.

In the first run, the string has 52=0x34 chars and we see that in the 
output.  In the second run, even though the string is truncated to 8 
chars, the prefix is still 0x34.

I guess my second question is whether we're going to get rid of the 
string-length prefix altogether.  If the answer is either "yes" or 
"maybe," I think it'd make sense to do that first before worrying about 
optimizing such code.

Also...

On 11/15/21 12:31 PM, Kris Van Hees via DTrace-devel wrote:
> Signed-off-by: Kris Van Hees <kris.van.hees at oracle.com>
> ---
>   libdtrace/dt_cg.c | 48 +++++++++++++----------------------------------
>   1 file changed, 13 insertions(+), 35 deletions(-)
>
> diff --git a/libdtrace/dt_cg.c b/libdtrace/dt_cg.c
> index 455e5440..930355f8 100644
> --- a/libdtrace/dt_cg.c
> +++ b/libdtrace/dt_cg.c
> @@ -958,9 +958,6 @@ dt_cg_store_val(dt_pcb_t *pcb, dt_node_t *dnp, dtrace_actkind_t kind,
>   
>   		return 0;
>   	} else if (dt_node_is_string(dnp)) {
> -		uint_t		size_ok = dt_irlist_label(dlp);
> -		int		reg;
> -
>   		dt_cg_check_notnull(dlp, drp, dnp->dn_reg);
>   
>   		TRACE_REGSET("store_val(): Begin ");
> @@ -968,49 +965,30 @@ dt_cg_store_val(dt_pcb_t *pcb, dt_node_t *dnp, dtrace_actkind_t kind,
>   				 size + DT_STRLEN_BYTES + 1, 1, pfp, arg);
>   
>   		/*
> -		 * Retrieve the length of the string, limit it to the maximum
> -		 * string size, and store it in the buffer at [%r9 + off].
> +		 * Copy the length of the string from the source string as a
> +		 * half-word (2 bytes) into the buffer at [%r9 + off].
>   		 */
> -		reg = dt_regset_alloc(drp);
> -		if (reg == -1)
> -			longjmp(yypcb->pcb_jmpbuf, EDT_NOREG);
> -		dt_cg_strlen(dlp, drp, reg, dnp->dn_reg);
> -		dt_regset_xalloc(drp, BPF_REG_0);
> -		emit(dlp,  BPF_BRANCH_IMM(BPF_JLT, reg, size, size_ok));

FWIW (not that anyone cares about code that is working and isn't likely 
to survive in any case), that jump might as well be JLE.

> -		emit(dlp,  BPF_MOV_IMM(reg, size));
> -		emitl(dlp, size_ok,
> -			   BPF_MOV_REG(BPF_REG_0, reg));
> -		emit(dlp,  BPF_ALU64_IMM(BPF_RSH, BPF_REG_0, 8));
> -		emit(dlp,  BPF_STORE(BPF_B, BPF_REG_9, off, BPF_REG_0));
> -		emit(dlp,  BPF_STORE(BPF_B, BPF_REG_9, off + 1, reg));
> -		dt_regset_free(drp, BPF_REG_0);
> +		emit(dlp, BPF_LOAD(BPF_H, BPF_REG_0, dnp->dn_reg, 0));
> +		emit(dlp, BPF_STORE(BPF_H, BPF_REG_9, off, BPF_REG_0));

I know I've asked this multiple times before (and you've replied) but 
the use of %r0 without allocating it still confuses me. Specifically, 
let's say that that's a safe practice.  Then why, only a few lines later 
(and wherever we make function calls) do we xalloc/free %r0?

>   		/*
> -		 * Copy the string to the output buffer at
> -		 * [%r9 + off + DT_STRLEN_BYTES] because we need to skip the
> -		 * length prefix.
> +		 * Copy the string data (no more than STRSIZE + 1 bytes) to the
> +		 * buffer at [%r9 + off + DT_STRLEN_BYTES].  We (ab)use the

In what sense is this an abuse?  This seems to be totally the sort of 
thing probe_read_str() is for.  I suggest s/(ab)use/use/.

> +		 * fact that probe_read_str) stops at the terminating NUL byte.
>   		 */
>   		if (dt_regset_xalloc_args(drp) == -1)
>   			longjmp(yypcb->pcb_jmpbuf, EDT_NOREG);
>   
> -		emit(dlp,  BPF_MOV_REG(BPF_REG_1, BPF_REG_9));
> -		emit(dlp,  BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, off + DT_STRLEN_BYTES));
> -		emit(dlp,  BPF_MOV_REG(BPF_REG_2, reg));
> -		dt_regset_free(drp, reg);

Incidentally, as far as the pre-existing code goes... free(reg)? Looking 
a few lines ahead...

> -		emit(dlp,  BPF_MOV_REG(BPF_REG_3, dnp->dn_reg));
> +		emit(dlp, BPF_MOV_REG(BPF_REG_1, BPF_REG_9));
> +		emit(dlp, BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, off + DT_STRLEN_BYTES));
> +		emit(dlp, BPF_MOV_IMM(BPF_REG_2, size + 1));
> +		emit(dlp, BPF_MOV_REG(BPF_REG_3, dnp->dn_reg));
>   		dt_regset_free(drp, dnp->dn_reg);
>   		dt_cg_tstring_free(pcb, dnp);
> -		emit(dlp,  BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, DT_STRLEN_BYTES));
> +		emit(dlp, BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, DT_STRLEN_BYTES));
>   		dt_regset_xalloc(drp, BPF_REG_0);

Here is the xalloc(%r0), even though we've already used %r0 (without 
xalloc'ing it).

> -		emit(dlp,  BPF_CALL_HELPER(BPF_FUNC_probe_read));
> +		emit(dlp, BPF_CALL_HELPER(BPF_FUNC_probe_read_str));
>   		dt_regset_free_args(drp);
> -
> -		/*
> -		 * Write the NUL terminating byte.
> -		 */
> -		emit(dlp,  BPF_MOV_REG(BPF_REG_0, BPF_REG_9));
> -		emit(dlp,  BPF_ALU64_REG(BPF_ADD, BPF_REG_0, reg));

Here the pre-existing code uses reg, which it has already freed.

> -		emit(dlp,  BPF_STORE_IMM(BPF_B, BPF_REG_0, off + DT_STRLEN_BYTES, 0));
>   		dt_regset_free(drp, BPF_REG_0);
>   		TRACE_REGSET("store_val(): End   ");
>