[DTrace-devel] [PATCH 7/7] Add support for inet_ntoa6() subroutine

Tue Aug 1 05:40:40 UTC 2023

As much fun to review as, I imagine, it was to write!  Crazy stuff.  Anyhow:
Reviewed-by: Eugene Loh <eugene.loh at oracle.com>

As usual, a lot of comments.

First of all, is this supposed to work on UEKr6 (5.4.17)?  If not, that 
restriction should be noted in the commit message.  If so, there seem to 
be a few problems getting by the BPF verifier. Specifically, when you 
xor, I think the [0,1] bounds are lost.  I think this is easily fixed by 
and'ing with 1.  Also, when you return from write_hex16(), %r0 is 
unbounded.  I think this is easily fixed by and'ing with some other 
value... 7 or something. The big problem remains that the BPF verifier 
still complains about "variable stack access".  This is probably tough 
to solve. So, I think it makes sense to say that 5.4.17 is not supported.

Also, it'd be great if the code had some high-level roadmap of where 
this very difficult minimal-branching implementation is going.  
Hopefully, it's 100% correct and no one will ever have to look at this 
code again.  Otherwise, it's a nightmare.  Maybe there could be some C 
code in the comments that illustrates what's going on in the BPF where 
each C statement maps to a set of BPF instructions, even if the 
branch-free BPF instructions take some head-scratching to figure out.  
E.g., "d = (c || s) ? 1 : 0" maps to "add %r5, %r0; neg %r5; rsh %r5, 63".

Maybe it should also acknowledge that there are subtle ways that the 
implemented algorithm differs from legacy DTrace (dtrace_dif.c).  Namely:

*)  There is a subtle difference in how ties in the search for numzero 
are broken.  Let's say your previous best was 2 halfwords.  Now, you 
have 5 bytes.  The old way, the new candidate wins, even though we'll 
only record it as 4 bytes, tying with the previous best.  The new way, 
we say the new candidate is no better than the old and so the incumbent 
wins.  I don't think either behavior is more correct than the other, and 
the old behavior is frankly a little weird.  It just means that there 
are theoretically cases where the two implementations might produce 
(equally legal?) results.

*)  There seem to be differences in the ways the two implementations 
decide between ipv4 and v6.  I don't understand this stuff well enough 
to judge if there are any errors here.

Neither of those issues impacts the test cases.

This patch uses bpf/inet_ntoa.S.  That file claims the function is void, 
but inet_ntoa6() uses the return value.  So, that file should document 
the function correctly.  Also, that file claims to fill with ":".  It 
actually fills with ".".

More...

On 7/27/23 11:31, Kris Van Hees via DTrace-devel wrote:
> Signed-off-by: Kris Van Hees <kris.van.hees at oracle.com>
> ---
>   bpf/Build                            |   1 +
>   bpf/inet_ntoa6.S                     | 246 +++++++++++++++++++++++++++
>   libdtrace/dt_cg.c                    |  52 +++++-
>   test/unittest/funcs/tst.inet_ntoa6.d |   1 -
>   4 files changed, 298 insertions(+), 2 deletions(-)
>   create mode 100644 bpf/inet_ntoa6.S
>
> diff --git a/bpf/Build b/bpf/Build
> index 5af8616d..7d15f96c 100644
> --- a/bpf/Build
> +++ b/bpf/Build
> @@ -29,6 +29,7 @@ bpf_dlib_SOURCES = \
>   	get_dvar.c \
>   	index.S \
>   	inet_ntoa.S \
> +	inet_ntoa6.S \
>   	lltostr.S \
>   	mutex_owned.S \
>   	mutex_owner.S \
> diff --git a/bpf/inet_ntoa6.S b/bpf/inet_ntoa6.S
> new file mode 100644
> index 00000000..07f595a0
> --- /dev/null
> +++ b/bpf/inet_ntoa6.S
> @@ -0,0 +1,246 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2022, Oracle and/or its affiliates. All rights reserved.

Year?

> + */
> +
> +#define BPF_FUNC_probe_read		4
> +
> +#define DCTX_RODATA			56

56 hardwired?

> +
> +	.text
> +
> +/*
> + * void write_hex16(const uint16_t src, char *dst)

void?  You set and use the return value.

> + */
> +	.align	4
> +	.global	write_hex16
> +	.type	write_hex16, @function
> +write_hex16:
> +	mov	%r0, 0
> +
> +	/*
> +	 * Branch-free implementation of num-to-hex print function.  Given a
> +	 * number from 0 to 15, this will output a hex digit (0-9a-f) in the
> +	 * output buffer.  It also supports suppression of leading 0s if it is
> +	 * used to output a sequence of digits.
> +	 *
> +	 * Given: c in [0, 15]
> +	 * Then, (c - 9) will be greater than 0 for c in [10, 15].

s/for/only for/

> +	 * Therefore, (-(c - 9)) will have its highest bit set if c in [10, 15].

s/if/iff/

> +	 * Thus, ((-(c - 9)) >> 63) will be 1 if c in [10, 15], and otherwise 0.
> +	 * Therefore, the hex digit (character) representing c can be computed
> +	 * as:
> +	 *	c + '0' + ((-(c - 9)) >> 63) * ('a' - '0' - 10)
> +	 *
> +	 * Let s be a global suppression flag across the digits (initially 0).
> +	 * Let d be (c + s), and therefore ((-d) >> 63) will be 0 iff c and s
> +	 * are both 0, and otherwise 1.  This is used to determine whether the
> +	 * output pointer should be advanced, and also to update the global
> +	 & suppression flag (s += ((-d) >> 63)).

s/&/*/

Also, there is nothing interesting about c+s.  How about defining 
instead d=(-(c+s))>>63 -- that is,
d = (c || s) ? 1 : 0.

Incidentally, is s really a flag?  Or it's a variable that either is or 
isn't zero?  (To me, a "flag" sounds like a particular value, like bit 
17 or something.)

> +	 */

Some variables are named (c, s, d).  The code does not refer to them.  
Either say in comments which registers are which variables or maybe 
define some macros so one can say, e.g., "s" rather than "%r0" or whatever.

> +.macro WRITE_DIGIT n
> +	mov	%r3, %r1
> +	rsh	%r3, 4 * (3 - \n)
> +	and	%r3, 0xf

How about a blank line here?

> +	mov	%r4, %r3
> +	mov	%r5, %r3

No need for "mov %r5, %r3".  See below.

> +	sub	%r4, 9
> +	neg	%r4
> +	rsh	%r4, 63
> +	mul	%r4, 0x27
> +	add	%r4, 0x30

s/0x27/'a' - '0' - 10/
s/0x30/'0'/

> +	add	%r3, %r4
> +	stxb	[%r2 + 0], %r3

How about another blank line here?  Also, instead of "add %r3, %r4", how 
about "add %r4, %r3" and then storing %r3.  That way, you don't clobber 
%r3.  I think it's cleaner to work just in %r4.  Plus, you don't need to 
save %r3 in %r5:  save an instruction.  In the remaining instructions, 
just use %r3 instead of %r5.

> +	add	%r5, %r0
> +	neg	%r5
> +	rsh	%r5, 63				/* 0 if '0', 1 otherwise */

Maybe another blank line here?

> +	add	%r2, %r5
> +	add	%r0, %r5
> +.endm
> +
> +	WRITE_DIGIT 0
> +	WRITE_DIGIT 1
> +	WRITE_DIGIT 2
> +	WRITE_DIGIT 3
> +
> +	/*
> +	 * It is possible that all digits are 0, in which case the output
> +	 * pointer did not advance from its initial value.  We do want a single
> +	 * 0 digit as output though.
> +	 *
> +	 * Since in this case, %r5 will be zero if all digits are zero, and 1
> +	 * otherwise, we can simply use %r0 + (%r5 ^ 1) to ensure that when all
> +	 * digits are 0, we retain the last one.
> +	 */
> +	xor	%r5, 1
> +	add	%r0, %r5
> +	exit
> +	.size	write_hex16, .-write_hex16
> +
> +/*
> + * void inet_ntoa6(const dt_dctx_t *dctx, const uint8_t *src, char *dst,
> + *		   uint32 tbloff)
> + */
> +	.align	4
> +	.global	dt_inet_ntoa6
> +	.type	dt_inet_ntoa6, @function
> +dt_inet_ntoa6:
> +	/* %r9 = dst, %r6 = dctx, %r8 = dctx->rodata + tbloff */

Why %r6=dctx?  Is the cached value used?

> +	mov	%r9, %r3		/* %r9 = dst */
> +	mov	%r6, %r1		/* %r6 = dctx */

Why %r6=dctx?

> +	mov	%r8, %r1		/* %r8 = dctx->rodata + tbloff */
> +	ldxdw	%r8, [%r8 + DCTX_RODATA]

Why "mov %r8, %r1"?  Why not just "ldxdw %r8, [%r1 + DCTX_RODATA]"?

> +	jge	%r4, RODATA_SIZE, .Ldone
> +	add	%r8, %r4
> +
> +	mov	%r3, %r2
> +	mov	%r2, 16
> +	mov	%r1, %fp
> +	add	%r1, -16
> +	call	BPF_FUNC_probe_read	/* rc = probe_read(&fp[-16], 16, src) */
> +	jne	%r0, 0, .Ldone
> +
> +	/*
> +	 * Read the 8 words (16-bit values), and build a bitmap in %r7
> +	 * indicating which words are non-zero.
> +	 *
> +	 * We use an implementation that does not involve branches to reduce
> +	 * complexity for the BPF verifier.
> +	 *
> +	 * The IPv6 address has words in network byte order which may differ
> +	 * from the host byte order.  We store a 2nd copy od the words, with

s/od/of/

> +	 * the byte order reversed (if needed).  We shouldn't need the 2nd copy
> +	 * if the byte order is the same but since the BPF verifier chokes on
> +	 * the output code below due to lack of tracking of relations between
> +	 * register values, the 2nd copy is needed anyway.
> +	 */
> +#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
> +# define NTOH(reg)
> +#else
> +# define NTOH(reg)	endbe	reg, 16
> +#endif
> +.macro GETWORD n
> +	ldxh	%r0, [%fp + (-16 + \n * 2)]
> +	NTOH(%r0)
> +	stxh	[%fp + (-32 + \n * 2)], %r0
> +	neg	%r0
> +	rsh	%r0, 63
> +	lsh	%r0, 7 - \n
> +	or	%r7, %r0
> +.endm
> +
> +	mov	%r7, 0
> +	GETWORD 0
> +	GETWORD 1
> +	GETWORD 2
> +	GETWORD 3
> +	GETWORD 4
> +	GETWORD 5
> +	GETWORD 6
> +	GETWORD 7
> +
> +	/* Set the upper bound for %r7. */
> +	and	%r7, 0xff
> +
> +	/* Handle mapped and embedded IPv4 addresses. */
> +	mov	%r0, %r7
> +	and	%r0, 0xfe
> +	jeq	%r0, 2, .Lipv4
> +	jne	%r0, 6, .Lnotipv4
> +	ldxh	%r0, [%fp + -22]
> +	jeq	%r0, 0xffff, .Lipv4
> +.Lnotipv4:
> +
> +	/*
> +	 * Perform a table lookup to determine the location and length of the
> +	 * longest run of 0-words (if any).
> +	 */
> +	mov	%r0, %r8
> +	add	%r0, %r7
> +	ldxb	%r7, [%r0 + 0]

No need to spend that extra instruction doing this in %r0.  Just "add 
%r7, %r8; ldxb %r7, [%r7 + 0]".

> +
> +	/*
> +	 * Determine the number of words to output at the start of the address.
> +	 * It is found in the upper 4 bits in %r7 (result of the table lookup
> +	 * above).  We place an upper bound to make the BPF verifier happy.
> +	 */
> +	mov	%r6, %r7
> +	rsh	%r6, 4
> +	jgt	%r6, 7, .Ldone
> +	mov	%r8, %fp
> +	add	%r8, -32
> +
> +	/* Write out the %r6 words at the beginning of the address. */
> +.Lpref_loop:
> +	jle	%r6, 0, .Lpref_done
> +	ldxh	%r1, [%r8 + 0]
> +	mov	%r2, %r9
> +	call	write_hex16
> +	add	%r9, %r0
> +	stb	[%r9 + 0], 0x3a
> +	add	%r9, 1
> +	add	%r8, 2
> +	sub	%r6, 1
> +	ja	.Lpref_loop
> +
> +.Lpref_done:
> +	stb	[%r9 + 0], 0x3a
> +
> +	/*
> +	 * Determine the number of zero-words that can be collapsed.  It is
> +	 * found in the lower 4 bits of %r7 (result of the table lookup above).
> +	 * (Note: it is either 0 or >1.)

That is a really confusing comment when the next instructions calculate 
%r7>>4.

> +	 */
> +	mov	%r1, %r7
> +	rsh	%r1, 4				/* #(leading words) */
> +	mov	%r0, %r1
> +	neg	%r0
> +	rsh	%r0, 63
> +	xor	%r0, 1
> +	add	%r9, %r0
> +	mov	%r2, %r7
> +	and	%r2, 0xf			/* #(zero words to collapse) */
> +	add	%r1, %r2			/* #(words used) */
> +	jgt	%r1, 8, .Ldone
> +
> +	mov	%r8, %fp
> +	add	%r8, -32
> +	mov	%r0, %r1
> +	mul	%r0, 2
> +	add	%r8, %r0
> +	mov	%r6, 8
> +	sub	%r6, %r1			/* #(words left) */
> +
> +	stb	[%r9 + 0], 0x3a
> +	mov	%r0, %r6
> +	neg	%r0
> +	rsh	%r0, 63
> +	xor	%r0, 1
> +	add	%r9, %r0
> +
> +.Lpost_loop:
> +	jle	%r6, 0, .Ldone
> +	stb	[%r9 + 0], 0x3a
> +	add	%r9, 1
> +	ldxh	%r1, [%r8 + 0]
> +	mov	%r2, %r9
> +	call	write_hex16
> +	add	%r9, %r0
> +	add	%r8, 2
> +	sub	%r6, 1
> +	ja	.Lpost_loop
> +
> +.Ldone:
> +	stb	[%r9 + 0], 0
> +	mov	%r0, 0

Why?  Who uses that return value?

> +	exit
> +
> +.Lipv4:
> +	mov	%r1, %r6
> +	mov	%r2, %fp
> +	add	%r2, -4
> +	mov	%r3, %r9
> +	call	dt_inet_ntoa
> +	add	%r9, %r0
> +	ja	.Ldone
> +	.size	dt_inet_ntoa6, .-dt_inet_ntoa6
> diff --git a/libdtrace/dt_cg.c b/libdtrace/dt_cg.c
> index ebe2e043..61f58a99 100644
> --- a/libdtrace/dt_cg.c
> +++ b/libdtrace/dt_cg.c
> @@ -5878,6 +5878,56 @@ dt_cg_subr_inet_ntoa(dt_node_t *dnp, dt_irlist_t *dlp, dt_regset_t *drp)
>   	dt_cg_subr_arg_to_tstring(dnp, dlp, drp, "dt_inet_ntoa", DT_ISIGN, 0);
>   }
>   
> +static void
> +dt_cg_subr_inet_ntoa6(dt_node_t *dnp, dt_irlist_t *dlp, dt_regset_t *drp)
> +{
> +	dtrace_hdl_t	*dtp = yypcb->pcb_hdl;
> +	ssize_t		tbloff;
> +	const char	tbl[] = {

Add a comment so that someone understands what this is about and can 
judge whether the entries are correct or not.

> +				0x08, 0x07, 0x06, 0x06, 0x05, 0x05, 0x05, 0x05,
> +				0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04,
> +				0x44, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03,
> +				0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03,
> +				0x35, 0x34, 0x33, 0x33, 0x02, 0x02, 0x02, 0x02,
> +				0x53, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02,
> +				0x44, 0x43, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02,
> +				0x53, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02,
> +				0x26, 0x25, 0x24, 0x24, 0x23, 0x23, 0x23, 0x23,
> +				0x53, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22,
> +				0x44, 0x43, 0x42, 0x42, 0x62, 0x70, 0x70, 0x70,
> +				0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> +				0x35, 0x34, 0x33, 0x33, 0x32, 0x32, 0x32, 0x32,
> +				0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> +				0x44, 0x43, 0x42, 0x42, 0x62, 0x70, 0x70, 0x70,
> +				0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> +				0x17, 0x16, 0x15, 0x15, 0x14, 0x14, 0x14, 0x14,
> +				0x13, 0x13, 0x13, 0x13, 0x13, 0x13, 0x13, 0x13,
> +				0x44, 0x43, 0x12, 0x12, 0x12, 0x12, 0x12, 0x12,
> +				0x53, 0x12, 0x12, 0x12, 0x12, 0x12, 0x12, 0x12,
> +				0x35, 0x34, 0x33, 0x33, 0x32, 0x32, 0x32, 0x32,
> +				0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> +				0x44, 0x43, 0x42, 0x42, 0x62, 0x70, 0x70, 0x70,
> +				0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> +				0x26, 0x25, 0x24, 0x24, 0x23, 0x23, 0x23, 0x23,
> +				0x53, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22,
> +				0x44, 0x43, 0x42, 0x42, 0x62, 0x70, 0x70, 0x70,
> +				0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> +				0x35, 0x34, 0x33, 0x33, 0x32, 0x32, 0x32, 0x32,
> +				0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> +				0x44, 0x43, 0x42, 0x42, 0x62, 0x70, 0x70, 0x70,
> +				0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> +			};
> +
> +	tbloff = dt_rodata_insert(dtp->dt_rodata, tbl, 256);
> +	if (tbloff == -1L)
> +		longjmp(yypcb->pcb_jmpbuf, EDT_NOMEM);
> +	if (tbloff > DIF_STROFF_MAX)
> +		longjmp(yypcb->pcb_jmpbuf, EDT_STR2BIG);
> +
> +	dt_cg_subr_arg_to_tstring(dnp, dlp, drp, "dt_inet_ntoa6",
> +				  DT_ISIMM, tbloff);
> +}
> +
>   typedef void dt_cg_subr_f(dt_node_t *, dt_irlist_t *, dt_regset_t *);
>   
>   static dt_cg_subr_f *_dt_cg_subr[DIF_SUBR_MAX + 1] = {
> @@ -5924,7 +5974,7 @@ static dt_cg_subr_f *_dt_cg_subr[DIF_SUBR_MAX + 1] = {
>   	[DIF_SUBR_NTOHLL]		= &dt_cg_subr_htonll,
>   	[DIF_SUBR_INET_NTOP]		= NULL,
>   	[DIF_SUBR_INET_NTOA]		= &dt_cg_subr_inet_ntoa,
> -	[DIF_SUBR_INET_NTOA6]		= NULL,
> +	[DIF_SUBR_INET_NTOA6]		= &dt_cg_subr_inet_ntoa6,
>   	[DIF_SUBR_D_PATH]		= NULL,
>   	[DIF_SUBR_LINK_NTOP]		= NULL,
>   };
> diff --git a/test/unittest/funcs/tst.inet_ntoa6.d b/test/unittest/funcs/tst.inet_ntoa6.d
> index 971456b1..d62269de 100644
> --- a/test/unittest/funcs/tst.inet_ntoa6.d
> +++ b/test/unittest/funcs/tst.inet_ntoa6.d
> @@ -4,7 +4,6 @@
>    * Licensed under the Universal Permissive License v 1.0 as shown at
>    * http://oss.oracle.com/licenses/upl.
>    */
> -/* @@xfail: dtv2 */
>   
>   #pragma D option quiet
>