[DTrace-devel] [PATCH 7/7] Add support for inet_ntoa6() subroutine

Tue Aug 1 20:57:40 UTC 2023

On Tue, Aug 01, 2023 at 01:40:40AM -0400, Eugene Loh via DTrace-devel wrote:
> As much fun to review as, I imagine, it was to write!  Crazy stuff.  Anyhow:
> Reviewed-by: Eugene Loh <eugene.loh at oracle.com>

Thanks.

> As usual, a lot of comments.
> 
> First of all, is this supposed to work on UEKr6 (5.4.17)?  If not, that
> restriction should be noted in the commit message.  If so, there seem to be
> a few problems getting by the BPF verifier. Specifically, when you xor, I
> think the [0,1] bounds are lost.  I think this is easily fixed by and'ing
> with 1.  Also, when you return from write_hex16(), %r0 is unbounded.  I
> think this is easily fixed by and'ing with some other value... 7 or
> something. The big problem remains that the BPF verifier still complains
> about "variable stack access".  This is probably tough to solve. So, I think
> it makes sense to say that 5.4.17 is not supported.

I already made changes after posting this to accomodate the BPF verifier on
UEK6 (5.4.17 kernels) because it isn't as advanced as the one in newer kernels.

> Also, it'd be great if the code had some high-level roadmap of where this
> very difficult minimal-branching implementation is going.  Hopefully, it's
> 100% correct and no one will ever have to look at this code again. 
> Otherwise, it's a nightmare.  Maybe there could be some C code in the
> comments that illustrates what's going on in the BPF where each C statement
> maps to a set of BPF instructions, even if the branch-free BPF instructions
> take some head-scratching to figure out.  E.g., "d = (c || s) ? 1 : 0" maps
> to "add %r5, %r0; neg %r5; rsh %r5, 63".

The problem with C code is that it is about as obscure as this assembly code
unless you write it as a higher level implementation, which then again will
show branches and therefore become less useful.

I'll add a bit more commentary to explain the use of the bitmap of non-zero
words and the lookup table.

> Maybe it should also acknowledge that there are subtle ways that the
> implemented algorithm differs from legacy DTrace (dtrace_dif.c).  Namely:
> 
> *)  There is a subtle difference in how ties in the search for numzero are
> broken.  Let's say your previous best was 2 halfwords.  Now, you have 5
> bytes.  The old way, the new candidate wins, even though we'll only record
> it as 4 bytes, tying with the previous best.  The new way, we say the new
> candidate is no better than the old and so the incumbent wins.  I don't
> think either behavior is more correct than the other, and the old behavior
> is frankly a little weird.  It just means that there are theoretically cases
> where the two implementations might produce (equally legal?) results.

The legacy version is wrong.  The RFCs all talk in terms of 16-bit words rather
than bytes, so when optimizing the representation by collapsing strings of
zero words using word count rather than byte count make sense.  There is not
a single authority on what the exact behaviour ought to be when there are
muleiple runs of 0 of the same length.  I went with the behaviour provided by
glibc as referencec.
> 
> *)  There seem to be differences in the ways the two implementations decide
> between ipv4 and v6.  I don't understand this stuff well enough to judge if
> there are any errors here.

I do not see where there are differences between legacy and this version when
it comes to IPv4 and IPv6 addresses.  As far as I can see, the behaviour for
mapped and embedded IPv4 addresses is the sme.

> Neither of those issues impacts the test cases.

Correct.

> This patch uses bpf/inet_ntoa.S.  That file claims the function is void, but
> inet_ntoa6() uses the return value.  So, that file should document the
> function correctly.  Also, that file claims to fill with ":".  It actually
> fills with ".".

The code now no longer expects inet_ntoa() to return a value, so this no
longer matters.  I dropped the patch adding a return value from the series.

If bpf/inet_ntoa.S documents is behaviour wrong (i.e. you point out how your
comment claim ':' as separator whereas the implementation correctly uses '.'),
you should subnmit a patch to fix that.  Or I can do it - but outside this
series because it is independent.

> More...
> 
> On 7/27/23 11:31, Kris Van Hees via DTrace-devel wrote:
> > Signed-off-by: Kris Van Hees <kris.van.hees at oracle.com>
> > ---
> >   bpf/Build                            |   1 +
> >   bpf/inet_ntoa6.S                     | 246 +++++++++++++++++++++++++++
> >   libdtrace/dt_cg.c                    |  52 +++++-
> >   test/unittest/funcs/tst.inet_ntoa6.d |   1 -
> >   4 files changed, 298 insertions(+), 2 deletions(-)
> >   create mode 100644 bpf/inet_ntoa6.S
> > 
> > diff --git a/bpf/Build b/bpf/Build
> > index 5af8616d..7d15f96c 100644
> > --- a/bpf/Build
> > +++ b/bpf/Build
> > @@ -29,6 +29,7 @@ bpf_dlib_SOURCES = \
> >   	get_dvar.c \
> >   	index.S \
> >   	inet_ntoa.S \
> > +	inet_ntoa6.S \
> >   	lltostr.S \
> >   	mutex_owned.S \
> >   	mutex_owner.S \
> > diff --git a/bpf/inet_ntoa6.S b/bpf/inet_ntoa6.S
> > new file mode 100644
> > index 00000000..07f595a0
> > --- /dev/null
> > +++ b/bpf/inet_ntoa6.S
> > @@ -0,0 +1,246 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (c) 2022, Oracle and/or its affiliates. All rights reserved.
> 
> Year?

Fixed.

> > + */
> > +
> > +#define BPF_FUNC_probe_read		4
> > +
> > +#define DCTX_RODATA			56
> 
> 56 hardwired?

The trouble is that assembler input files cannot contain struct declarations
and use offsetof(), so the only way to avoid this is to add a utility that
writes out a .h file at build time to provide constants like this.  Worth
considering - though since this is the only occurence (for now) maybe a
comment might suffice?

> > +
> > +	.text
> > +
> > +/*
> > + * void write_hex16(const uint16_t src, char *dst)
> 
> void?  You set and use the return value.

Fixed.

> > + */
> > +	.align	4
> > +	.global	write_hex16
> > +	.type	write_hex16, @function
> > +write_hex16:
> > +	mov	%r0, 0
> > +
> > +	/*
> > +	 * Branch-free implementation of num-to-hex print function.  Given a
> > +	 * number from 0 to 15, this will output a hex digit (0-9a-f) in the
> > +	 * output buffer.  It also supports suppression of leading 0s if it is
> > +	 * used to output a sequence of digits.
> > +	 *
> > +	 * Given: c in [0, 15]
> > +	 * Then, (c - 9) will be greater than 0 for c in [10, 15].
> 
> s/for/only for/

I think the specification of a range already implies the 'only'.

> > +	 * Therefore, (-(c - 9)) will have its highest bit set if c in [10, 15].
> 
> s/if/iff/

Sure.

> > +	 * Thus, ((-(c - 9)) >> 63) will be 1 if c in [10, 15], and otherwise 0.
> > +	 * Therefore, the hex digit (character) representing c can be computed
> > +	 * as:
> > +	 *	c + '0' + ((-(c - 9)) >> 63) * ('a' - '0' - 10)
> > +	 *
> > +	 * Let s be a global suppression flag across the digits (initially 0).
> > +	 * Let d be (c + s), and therefore ((-d) >> 63) will be 0 iff c and s
> > +	 * are both 0, and otherwise 1.  This is used to determine whether the
> > +	 * output pointer should be advanced, and also to update the global
> > +	 & suppression flag (s += ((-d) >> 63)).
> 
> s/&/*/

Thanks.

> Also, there is nothing interesting about c+s.  How about defining instead
> d=(-(c+s))>>63 -- that is,
> d = (c || s) ? 1 : 0.
> 
> Incidentally, is s really a flag?  Or it's a variable that either is or
> isn't zero?  (To me, a "flag" sounds like a particular value, like bit 17 or
> something.)

Rewritten to be (hopefully) a bit more clear.

> > +	 */
> 
> Some variables are named (c, s, d).  The code does not refer to them. 
> Either say in comments which registers are which variables or maybe define
> some macros so one can say, e.g., "s" rather than "%r0" or whatever.

I tried to clarify this a bit more in the new version.

> > +.macro WRITE_DIGIT n
> > +	mov	%r3, %r1
> > +	rsh	%r3, 4 * (3 - \n)
> > +	and	%r3, 0xf
> 
> How about a blank line here?

Sure.

> > +	mov	%r4, %r3
> > +	mov	%r5, %r3
> 
> No need for "mov %r5, %r3".  See below.

Correct.

> > +	sub	%r4, 9
> > +	neg	%r4
> > +	rsh	%r4, 63
> > +	mul	%r4, 0x27
> > +	add	%r4, 0x30
> 
> s/0x27/'a' - '0' - 10/
> s/0x30/'0'/

Good idea.

> > +	add	%r3, %r4
> > +	stxb	[%r2 + 0], %r3
> 
> How about another blank line here?  Also, instead of "add %r3, %r4", how
> about "add %r4, %r3" and then storing %r3.  That way, you don't clobber
> %r3.  I think it's cleaner to work just in %r4.  Plus, you don't need to
> save %r3 in %r5:  save an instruction.  In the remaining instructions, just
> use %r3 instead of %r5.

Done.

> > +	add	%r5, %r0
> > +	neg	%r5
> > +	rsh	%r5, 63				/* 0 if '0', 1 otherwise */
> 
> 
> Maybe another blank line here?

Sure.

> > +	add	%r2, %r5
> > +	add	%r0, %r5
> > +.endm
> > +
> > +	WRITE_DIGIT 0
> > +	WRITE_DIGIT 1
> > +	WRITE_DIGIT 2
> > +	WRITE_DIGIT 3
> > +
> > +	/*
> > +	 * It is possible that all digits are 0, in which case the output
> > +	 * pointer did not advance from its initial value.  We do want a single
> > +	 * 0 digit as output though.
> > +	 *
> > +	 * Since in this case, %r5 will be zero if all digits are zero, and 1
> > +	 * otherwise, we can simply use %r0 + (%r5 ^ 1) to ensure that when all
> > +	 * digits are 0, we retain the last one.
> > +	 */
> > +	xor	%r5, 1
> > +	add	%r0, %r5
> > +	exit
> > +	.size	write_hex16, .-write_hex16
> > +
> > +/*
> > + * void inet_ntoa6(const dt_dctx_t *dctx, const uint8_t *src, char *dst,
> > + *		   uint32 tbloff)
> > + */
> > +	.align	4
> > +	.global	dt_inet_ntoa6
> > +	.type	dt_inet_ntoa6, @function
> > +dt_inet_ntoa6:
> > +	/* %r9 = dst, %r6 = dctx, %r8 = dctx->rodata + tbloff */
> 
> Why %r6=dctx?  Is the cached value used?

Yes, the call to inet_ntoa() needs it.  But I reworked the register use a bit
in v2 since I needed to accomodate some changes because of older kernels.
> 
> > +	mov	%r9, %r3		/* %r9 = dst */
> > +	mov	%r6, %r1		/* %r6 = dctx */
> 
> Why %r6=dctx?
> 
> > +	mov	%r8, %r1		/* %r8 = dctx->rodata + tbloff */
> > +	ldxdw	%r8, [%r8 + DCTX_RODATA]
> 
> Why "mov %r8, %r1"?  Why not just "ldxdw %r8, [%r1 + DCTX_RODATA]"?

No longer an issue (code got moved).

> > +	jge	%r4, RODATA_SIZE, .Ldone
> > +	add	%r8, %r4
> > +
> > +	mov	%r3, %r2
> > +	mov	%r2, 16
> > +	mov	%r1, %fp
> > +	add	%r1, -16
> > +	call	BPF_FUNC_probe_read	/* rc = probe_read(&fp[-16], 16, src) */
> > +	jne	%r0, 0, .Ldone
> > +
> > +	/*
> > +	 * Read the 8 words (16-bit values), and build a bitmap in %r7
> > +	 * indicating which words are non-zero.
> > +	 *
> > +	 * We use an implementation that does not involve branches to reduce
> > +	 * complexity for the BPF verifier.
> > +	 *
> > +	 * The IPv6 address has words in network byte order which may differ
> > +	 * from the host byte order.  We store a 2nd copy od the words, with
> 
> s/od/of/
> 
> > +	 * the byte order reversed (if needed).  We shouldn't need the 2nd copy
> > +	 * if the byte order is the same but since the BPF verifier chokes on
> > +	 * the output code below due to lack of tracking of relations between
> > +	 * register values, the 2nd copy is needed anyway.
> > +	 */
> > +#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
> > +# define NTOH(reg)
> > +#else
> > +# define NTOH(reg)	endbe	reg, 16
> > +#endif
> > +.macro GETWORD n
> > +	ldxh	%r0, [%fp + (-16 + \n * 2)]
> > +	NTOH(%r0)
> > +	stxh	[%fp + (-32 + \n * 2)], %r0
> > +	neg	%r0
> > +	rsh	%r0, 63
> > +	lsh	%r0, 7 - \n
> > +	or	%r7, %r0
> > +.endm
> > +
> > +	mov	%r7, 0
> > +	GETWORD 0
> > +	GETWORD 1
> > +	GETWORD 2
> > +	GETWORD 3
> > +	GETWORD 4
> > +	GETWORD 5
> > +	GETWORD 6
> > +	GETWORD 7
> > +
> > +	/* Set the upper bound for %r7. */
> > +	and	%r7, 0xff
> > +
> > +	/* Handle mapped and embedded IPv4 addresses. */
> > +	mov	%r0, %r7
> > +	and	%r0, 0xfe
> > +	jeq	%r0, 2, .Lipv4
> > +	jne	%r0, 6, .Lnotipv4
> > +	ldxh	%r0, [%fp + -22]
> > +	jeq	%r0, 0xffff, .Lipv4
> > +.Lnotipv4:
> > +
> > +	/*
> > +	 * Perform a table lookup to determine the location and length of the
> > +	 * longest run of 0-words (if any).
> > +	 */
> > +	mov	%r0, %r8
> > +	add	%r0, %r7
> > +	ldxb	%r7, [%r0 + 0]
> 
> No need to spend that extra instruction doing this in %r0.  Just "add %r7,
> %r8; ldxb %r7, [%r7 + 0]".

Already done in v2.

> > +
> > +	/*
> > +	 * Determine the number of words to output at the start of the address.
> > +	 * It is found in the upper 4 bits in %r7 (result of the table lookup
> > +	 * above).  We place an upper bound to make the BPF verifier happy.
> > +	 */
> > +	mov	%r6, %r7
> > +	rsh	%r6, 4
> > +	jgt	%r6, 7, .Ldone
> > +	mov	%r8, %fp
> > +	add	%r8, -32
> > +
> > +	/* Write out the %r6 words at the beginning of the address. */
> > +.Lpref_loop:
> > +	jle	%r6, 0, .Lpref_done
> > +	ldxh	%r1, [%r8 + 0]
> > +	mov	%r2, %r9
> > +	call	write_hex16
> > +	add	%r9, %r0
> > +	stb	[%r9 + 0], 0x3a
> > +	add	%r9, 1
> > +	add	%r8, 2
> > +	sub	%r6, 1
> > +	ja	.Lpref_loop
> > +
> > +.Lpref_done:
> > +	stb	[%r9 + 0], 0x3a
> > +
> > +	/*
> > +	 * Determine the number of zero-words that can be collapsed.  It is
> > +	 * found in the lower 4 bits of %r7 (result of the table lookup above).
> > +	 * (Note: it is either 0 or >1.)
> 
> That is a really confusing comment when the next instructions calculate
> %r7>>4.

Yes, that comment needs a bit more work because we ended up doing more work
here than anticipated.

> > +	 */
> > +	mov	%r1, %r7
> > +	rsh	%r1, 4				/* #(leading words) */
> > +	mov	%r0, %r1
> > +	neg	%r0
> > +	rsh	%r0, 63
> > +	xor	%r0, 1
> > +	add	%r9, %r0
> > +	mov	%r2, %r7
> > +	and	%r2, 0xf			/* #(zero words to collapse) */
> > +	add	%r1, %r2			/* #(words used) */
> > +	jgt	%r1, 8, .Ldone
> > +
> > +	mov	%r8, %fp
> > +	add	%r8, -32
> > +	mov	%r0, %r1
> > +	mul	%r0, 2
> > +	add	%r8, %r0
> > +	mov	%r6, 8
> > +	sub	%r6, %r1			/* #(words left) */
> > +
> > +	stb	[%r9 + 0], 0x3a
> > +	mov	%r0, %r6
> > +	neg	%r0
> > +	rsh	%r0, 63
> > +	xor	%r0, 1
> > +	add	%r9, %r0
> > +
> > +.Lpost_loop:
> > +	jle	%r6, 0, .Ldone
> > +	stb	[%r9 + 0], 0x3a
> > +	add	%r9, 1
> > +	ldxh	%r1, [%r8 + 0]
> > +	mov	%r2, %r9
> > +	call	write_hex16
> > +	add	%r9, %r0
> > +	add	%r8, 2
> > +	sub	%r6, 1
> > +	ja	.Lpost_loop
> > +
> > +.Ldone:
> > +	stb	[%r9 + 0], 0
> > +	mov	%r0, 0
> 
> Why?  Who uses that return value?

Mostly because it is clean (BPF functions generally return 0 if they do not
return any value).  It also helps the verifier because it can count on the
return value to be a constant - since the verifier doesn't know that this is
effectively a void function.

> > +	exit
> > +
> > +.Lipv4:
> > +	mov	%r1, %r6
> > +	mov	%r2, %fp
> > +	add	%r2, -4
> > +	mov	%r3, %r9
> > +	call	dt_inet_ntoa
> > +	add	%r9, %r0
> > +	ja	.Ldone
> > +	.size	dt_inet_ntoa6, .-dt_inet_ntoa6
> > diff --git a/libdtrace/dt_cg.c b/libdtrace/dt_cg.c
> > index ebe2e043..61f58a99 100644
> > --- a/libdtrace/dt_cg.c
> > +++ b/libdtrace/dt_cg.c
> > @@ -5878,6 +5878,56 @@ dt_cg_subr_inet_ntoa(dt_node_t *dnp, dt_irlist_t *dlp, dt_regset_t *drp)
> >   	dt_cg_subr_arg_to_tstring(dnp, dlp, drp, "dt_inet_ntoa", DT_ISIGN, 0);
> >   }
> > +static void
> > +dt_cg_subr_inet_ntoa6(dt_node_t *dnp, dt_irlist_t *dlp, dt_regset_t *drp)
> > +{
> > +	dtrace_hdl_t	*dtp = yypcb->pcb_hdl;
> > +	ssize_t		tbloff;
> > +	const char	tbl[] = {
> 
> Add a comment so that someone understands what this is about and can judge
> whether the entries are correct or not.

Yes, good idea.

> > +				0x08, 0x07, 0x06, 0x06, 0x05, 0x05, 0x05, 0x05,
> > +				0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04,
> > +				0x44, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03,
> > +				0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03,
> > +				0x35, 0x34, 0x33, 0x33, 0x02, 0x02, 0x02, 0x02,
> > +				0x53, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02,
> > +				0x44, 0x43, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02,
> > +				0x53, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02,
> > +				0x26, 0x25, 0x24, 0x24, 0x23, 0x23, 0x23, 0x23,
> > +				0x53, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22,
> > +				0x44, 0x43, 0x42, 0x42, 0x62, 0x70, 0x70, 0x70,
> > +				0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> > +				0x35, 0x34, 0x33, 0x33, 0x32, 0x32, 0x32, 0x32,
> > +				0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> > +				0x44, 0x43, 0x42, 0x42, 0x62, 0x70, 0x70, 0x70,
> > +				0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> > +				0x17, 0x16, 0x15, 0x15, 0x14, 0x14, 0x14, 0x14,
> > +				0x13, 0x13, 0x13, 0x13, 0x13, 0x13, 0x13, 0x13,
> > +				0x44, 0x43, 0x12, 0x12, 0x12, 0x12, 0x12, 0x12,
> > +				0x53, 0x12, 0x12, 0x12, 0x12, 0x12, 0x12, 0x12,
> > +				0x35, 0x34, 0x33, 0x33, 0x32, 0x32, 0x32, 0x32,
> > +				0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> > +				0x44, 0x43, 0x42, 0x42, 0x62, 0x70, 0x70, 0x70,
> > +				0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> > +				0x26, 0x25, 0x24, 0x24, 0x23, 0x23, 0x23, 0x23,
> > +				0x53, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22,
> > +				0x44, 0x43, 0x42, 0x42, 0x62, 0x70, 0x70, 0x70,
> > +				0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> > +				0x35, 0x34, 0x33, 0x33, 0x32, 0x32, 0x32, 0x32,
> > +				0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> > +				0x44, 0x43, 0x42, 0x42, 0x62, 0x70, 0x70, 0x70,
> > +				0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> > +			};
> > +
> > +	tbloff = dt_rodata_insert(dtp->dt_rodata, tbl, 256);
> > +	if (tbloff == -1L)
> > +		longjmp(yypcb->pcb_jmpbuf, EDT_NOMEM);
> > +	if (tbloff > DIF_STROFF_MAX)
> > +		longjmp(yypcb->pcb_jmpbuf, EDT_STR2BIG);
> > +
> > +	dt_cg_subr_arg_to_tstring(dnp, dlp, drp, "dt_inet_ntoa6",
> > +				  DT_ISIMM, tbloff);
> > +}
> > +
> >   typedef void dt_cg_subr_f(dt_node_t *, dt_irlist_t *, dt_regset_t *);
> >   static dt_cg_subr_f *_dt_cg_subr[DIF_SUBR_MAX + 1] = {
> > @@ -5924,7 +5974,7 @@ static dt_cg_subr_f *_dt_cg_subr[DIF_SUBR_MAX + 1] = {
> >   	[DIF_SUBR_NTOHLL]		= &dt_cg_subr_htonll,
> >   	[DIF_SUBR_INET_NTOP]		= NULL,
> >   	[DIF_SUBR_INET_NTOA]		= &dt_cg_subr_inet_ntoa,
> > -	[DIF_SUBR_INET_NTOA6]		= NULL,
> > +	[DIF_SUBR_INET_NTOA6]		= &dt_cg_subr_inet_ntoa6,
> >   	[DIF_SUBR_D_PATH]		= NULL,
> >   	[DIF_SUBR_LINK_NTOP]		= NULL,
> >   };
> > diff --git a/test/unittest/funcs/tst.inet_ntoa6.d b/test/unittest/funcs/tst.inet_ntoa6.d
> > index 971456b1..d62269de 100644
> > --- a/test/unittest/funcs/tst.inet_ntoa6.d
> > +++ b/test/unittest/funcs/tst.inet_ntoa6.d
> > @@ -4,7 +4,6 @@
> >    * Licensed under the Universal Permissive License v 1.0 as shown at
> >    * http://oss.oracle.com/licenses/upl.
> >    */
> > -/* @@xfail: dtv2 */
> >   #pragma D option quiet
> 
> _______________________________________________
> DTrace-devel mailing list
> DTrace-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/dtrace-devel