[DTrace-devel] [PATCH 7/7] Add support for inet_ntoa6() subroutine
Kris Van Hees
kris.van.hees at oracle.com
Tue Aug 1 20:57:40 UTC 2023
On Tue, Aug 01, 2023 at 01:40:40AM -0400, Eugene Loh via DTrace-devel wrote:
> As much fun to review as, I imagine, it was to write! Crazy stuff. Anyhow:
> Reviewed-by: Eugene Loh <eugene.loh at oracle.com>
Thanks.
> As usual, a lot of comments.
>
> First of all, is this supposed to work on UEKr6 (5.4.17)? If not, that
> restriction should be noted in the commit message. If so, there seem to be
> a few problems getting by the BPF verifier. Specifically, when you xor, I
> think the [0,1] bounds are lost. I think this is easily fixed by and'ing
> with 1. Also, when you return from write_hex16(), %r0 is unbounded. I
> think this is easily fixed by and'ing with some other value... 7 or
> something. The big problem remains that the BPF verifier still complains
> about "variable stack access". This is probably tough to solve. So, I think
> it makes sense to say that 5.4.17 is not supported.
I already made changes after posting this to accomodate the BPF verifier on
UEK6 (5.4.17 kernels) because it isn't as advanced as the one in newer kernels.
> Also, it'd be great if the code had some high-level roadmap of where this
> very difficult minimal-branching implementation is going. Hopefully, it's
> 100% correct and no one will ever have to look at this code again.
> Otherwise, it's a nightmare. Maybe there could be some C code in the
> comments that illustrates what's going on in the BPF where each C statement
> maps to a set of BPF instructions, even if the branch-free BPF instructions
> take some head-scratching to figure out. E.g., "d = (c || s) ? 1 : 0" maps
> to "add %r5, %r0; neg %r5; rsh %r5, 63".
The problem with C code is that it is about as obscure as this assembly code
unless you write it as a higher level implementation, which then again will
show branches and therefore become less useful.
I'll add a bit more commentary to explain the use of the bitmap of non-zero
words and the lookup table.
> Maybe it should also acknowledge that there are subtle ways that the
> implemented algorithm differs from legacy DTrace (dtrace_dif.c). Namely:
>
> *) There is a subtle difference in how ties in the search for numzero are
> broken. Let's say your previous best was 2 halfwords. Now, you have 5
> bytes. The old way, the new candidate wins, even though we'll only record
> it as 4 bytes, tying with the previous best. The new way, we say the new
> candidate is no better than the old and so the incumbent wins. I don't
> think either behavior is more correct than the other, and the old behavior
> is frankly a little weird. It just means that there are theoretically cases
> where the two implementations might produce (equally legal?) results.
The legacy version is wrong. The RFCs all talk in terms of 16-bit words rather
than bytes, so when optimizing the representation by collapsing strings of
zero words using word count rather than byte count make sense. There is not
a single authority on what the exact behaviour ought to be when there are
muleiple runs of 0 of the same length. I went with the behaviour provided by
glibc as referencec.
>
> *) There seem to be differences in the ways the two implementations decide
> between ipv4 and v6. I don't understand this stuff well enough to judge if
> there are any errors here.
I do not see where there are differences between legacy and this version when
it comes to IPv4 and IPv6 addresses. As far as I can see, the behaviour for
mapped and embedded IPv4 addresses is the sme.
> Neither of those issues impacts the test cases.
Correct.
> This patch uses bpf/inet_ntoa.S. That file claims the function is void, but
> inet_ntoa6() uses the return value. So, that file should document the
> function correctly. Also, that file claims to fill with ":". It actually
> fills with ".".
The code now no longer expects inet_ntoa() to return a value, so this no
longer matters. I dropped the patch adding a return value from the series.
If bpf/inet_ntoa.S documents is behaviour wrong (i.e. you point out how your
comment claim ':' as separator whereas the implementation correctly uses '.'),
you should subnmit a patch to fix that. Or I can do it - but outside this
series because it is independent.
> More...
>
> On 7/27/23 11:31, Kris Van Hees via DTrace-devel wrote:
> > Signed-off-by: Kris Van Hees <kris.van.hees at oracle.com>
> > ---
> > bpf/Build | 1 +
> > bpf/inet_ntoa6.S | 246 +++++++++++++++++++++++++++
> > libdtrace/dt_cg.c | 52 +++++-
> > test/unittest/funcs/tst.inet_ntoa6.d | 1 -
> > 4 files changed, 298 insertions(+), 2 deletions(-)
> > create mode 100644 bpf/inet_ntoa6.S
> >
> > diff --git a/bpf/Build b/bpf/Build
> > index 5af8616d..7d15f96c 100644
> > --- a/bpf/Build
> > +++ b/bpf/Build
> > @@ -29,6 +29,7 @@ bpf_dlib_SOURCES = \
> > get_dvar.c \
> > index.S \
> > inet_ntoa.S \
> > + inet_ntoa6.S \
> > lltostr.S \
> > mutex_owned.S \
> > mutex_owner.S \
> > diff --git a/bpf/inet_ntoa6.S b/bpf/inet_ntoa6.S
> > new file mode 100644
> > index 00000000..07f595a0
> > --- /dev/null
> > +++ b/bpf/inet_ntoa6.S
> > @@ -0,0 +1,246 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (c) 2022, Oracle and/or its affiliates. All rights reserved.
>
> Year?
Fixed.
> > + */
> > +
> > +#define BPF_FUNC_probe_read 4
> > +
> > +#define DCTX_RODATA 56
>
> 56 hardwired?
The trouble is that assembler input files cannot contain struct declarations
and use offsetof(), so the only way to avoid this is to add a utility that
writes out a .h file at build time to provide constants like this. Worth
considering - though since this is the only occurence (for now) maybe a
comment might suffice?
> > +
> > + .text
> > +
> > +/*
> > + * void write_hex16(const uint16_t src, char *dst)
>
> void? You set and use the return value.
Fixed.
> > + */
> > + .align 4
> > + .global write_hex16
> > + .type write_hex16, @function
> > +write_hex16:
> > + mov %r0, 0
> > +
> > + /*
> > + * Branch-free implementation of num-to-hex print function. Given a
> > + * number from 0 to 15, this will output a hex digit (0-9a-f) in the
> > + * output buffer. It also supports suppression of leading 0s if it is
> > + * used to output a sequence of digits.
> > + *
> > + * Given: c in [0, 15]
> > + * Then, (c - 9) will be greater than 0 for c in [10, 15].
>
> s/for/only for/
I think the specification of a range already implies the 'only'.
> > + * Therefore, (-(c - 9)) will have its highest bit set if c in [10, 15].
>
> s/if/iff/
Sure.
> > + * Thus, ((-(c - 9)) >> 63) will be 1 if c in [10, 15], and otherwise 0.
> > + * Therefore, the hex digit (character) representing c can be computed
> > + * as:
> > + * c + '0' + ((-(c - 9)) >> 63) * ('a' - '0' - 10)
> > + *
> > + * Let s be a global suppression flag across the digits (initially 0).
> > + * Let d be (c + s), and therefore ((-d) >> 63) will be 0 iff c and s
> > + * are both 0, and otherwise 1. This is used to determine whether the
> > + * output pointer should be advanced, and also to update the global
> > + & suppression flag (s += ((-d) >> 63)).
>
> s/&/*/
Thanks.
> Also, there is nothing interesting about c+s. How about defining instead
> d=(-(c+s))>>63 -- that is,
> d = (c || s) ? 1 : 0.
>
> Incidentally, is s really a flag? Or it's a variable that either is or
> isn't zero? (To me, a "flag" sounds like a particular value, like bit 17 or
> something.)
Rewritten to be (hopefully) a bit more clear.
> > + */
>
> Some variables are named (c, s, d). The code does not refer to them.
> Either say in comments which registers are which variables or maybe define
> some macros so one can say, e.g., "s" rather than "%r0" or whatever.
I tried to clarify this a bit more in the new version.
> > +.macro WRITE_DIGIT n
> > + mov %r3, %r1
> > + rsh %r3, 4 * (3 - \n)
> > + and %r3, 0xf
>
> How about a blank line here?
Sure.
> > + mov %r4, %r3
> > + mov %r5, %r3
>
> No need for "mov %r5, %r3". See below.
Correct.
> > + sub %r4, 9
> > + neg %r4
> > + rsh %r4, 63
> > + mul %r4, 0x27
> > + add %r4, 0x30
>
> s/0x27/'a' - '0' - 10/
> s/0x30/'0'/
Good idea.
> > + add %r3, %r4
> > + stxb [%r2 + 0], %r3
>
> How about another blank line here? Also, instead of "add %r3, %r4", how
> about "add %r4, %r3" and then storing %r3. That way, you don't clobber
> %r3. I think it's cleaner to work just in %r4. Plus, you don't need to
> save %r3 in %r5: save an instruction. In the remaining instructions, just
> use %r3 instead of %r5.
Done.
> > + add %r5, %r0
> > + neg %r5
> > + rsh %r5, 63 /* 0 if '0', 1 otherwise */
>
>
> Maybe another blank line here?
Sure.
> > + add %r2, %r5
> > + add %r0, %r5
> > +.endm
> > +
> > + WRITE_DIGIT 0
> > + WRITE_DIGIT 1
> > + WRITE_DIGIT 2
> > + WRITE_DIGIT 3
> > +
> > + /*
> > + * It is possible that all digits are 0, in which case the output
> > + * pointer did not advance from its initial value. We do want a single
> > + * 0 digit as output though.
> > + *
> > + * Since in this case, %r5 will be zero if all digits are zero, and 1
> > + * otherwise, we can simply use %r0 + (%r5 ^ 1) to ensure that when all
> > + * digits are 0, we retain the last one.
> > + */
> > + xor %r5, 1
> > + add %r0, %r5
> > + exit
> > + .size write_hex16, .-write_hex16
> > +
> > +/*
> > + * void inet_ntoa6(const dt_dctx_t *dctx, const uint8_t *src, char *dst,
> > + * uint32 tbloff)
> > + */
> > + .align 4
> > + .global dt_inet_ntoa6
> > + .type dt_inet_ntoa6, @function
> > +dt_inet_ntoa6:
> > + /* %r9 = dst, %r6 = dctx, %r8 = dctx->rodata + tbloff */
>
> Why %r6=dctx? Is the cached value used?
Yes, the call to inet_ntoa() needs it. But I reworked the register use a bit
in v2 since I needed to accomodate some changes because of older kernels.
>
> > + mov %r9, %r3 /* %r9 = dst */
> > + mov %r6, %r1 /* %r6 = dctx */
>
> Why %r6=dctx?
>
> > + mov %r8, %r1 /* %r8 = dctx->rodata + tbloff */
> > + ldxdw %r8, [%r8 + DCTX_RODATA]
>
> Why "mov %r8, %r1"? Why not just "ldxdw %r8, [%r1 + DCTX_RODATA]"?
No longer an issue (code got moved).
> > + jge %r4, RODATA_SIZE, .Ldone
> > + add %r8, %r4
> > +
> > + mov %r3, %r2
> > + mov %r2, 16
> > + mov %r1, %fp
> > + add %r1, -16
> > + call BPF_FUNC_probe_read /* rc = probe_read(&fp[-16], 16, src) */
> > + jne %r0, 0, .Ldone
> > +
> > + /*
> > + * Read the 8 words (16-bit values), and build a bitmap in %r7
> > + * indicating which words are non-zero.
> > + *
> > + * We use an implementation that does not involve branches to reduce
> > + * complexity for the BPF verifier.
> > + *
> > + * The IPv6 address has words in network byte order which may differ
> > + * from the host byte order. We store a 2nd copy od the words, with
>
> s/od/of/
>
> > + * the byte order reversed (if needed). We shouldn't need the 2nd copy
> > + * if the byte order is the same but since the BPF verifier chokes on
> > + * the output code below due to lack of tracking of relations between
> > + * register values, the 2nd copy is needed anyway.
> > + */
> > +#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
> > +# define NTOH(reg)
> > +#else
> > +# define NTOH(reg) endbe reg, 16
> > +#endif
> > +.macro GETWORD n
> > + ldxh %r0, [%fp + (-16 + \n * 2)]
> > + NTOH(%r0)
> > + stxh [%fp + (-32 + \n * 2)], %r0
> > + neg %r0
> > + rsh %r0, 63
> > + lsh %r0, 7 - \n
> > + or %r7, %r0
> > +.endm
> > +
> > + mov %r7, 0
> > + GETWORD 0
> > + GETWORD 1
> > + GETWORD 2
> > + GETWORD 3
> > + GETWORD 4
> > + GETWORD 5
> > + GETWORD 6
> > + GETWORD 7
> > +
> > + /* Set the upper bound for %r7. */
> > + and %r7, 0xff
> > +
> > + /* Handle mapped and embedded IPv4 addresses. */
> > + mov %r0, %r7
> > + and %r0, 0xfe
> > + jeq %r0, 2, .Lipv4
> > + jne %r0, 6, .Lnotipv4
> > + ldxh %r0, [%fp + -22]
> > + jeq %r0, 0xffff, .Lipv4
> > +.Lnotipv4:
> > +
> > + /*
> > + * Perform a table lookup to determine the location and length of the
> > + * longest run of 0-words (if any).
> > + */
> > + mov %r0, %r8
> > + add %r0, %r7
> > + ldxb %r7, [%r0 + 0]
>
> No need to spend that extra instruction doing this in %r0. Just "add %r7,
> %r8; ldxb %r7, [%r7 + 0]".
Already done in v2.
> > +
> > + /*
> > + * Determine the number of words to output at the start of the address.
> > + * It is found in the upper 4 bits in %r7 (result of the table lookup
> > + * above). We place an upper bound to make the BPF verifier happy.
> > + */
> > + mov %r6, %r7
> > + rsh %r6, 4
> > + jgt %r6, 7, .Ldone
> > + mov %r8, %fp
> > + add %r8, -32
> > +
> > + /* Write out the %r6 words at the beginning of the address. */
> > +.Lpref_loop:
> > + jle %r6, 0, .Lpref_done
> > + ldxh %r1, [%r8 + 0]
> > + mov %r2, %r9
> > + call write_hex16
> > + add %r9, %r0
> > + stb [%r9 + 0], 0x3a
> > + add %r9, 1
> > + add %r8, 2
> > + sub %r6, 1
> > + ja .Lpref_loop
> > +
> > +.Lpref_done:
> > + stb [%r9 + 0], 0x3a
> > +
> > + /*
> > + * Determine the number of zero-words that can be collapsed. It is
> > + * found in the lower 4 bits of %r7 (result of the table lookup above).
> > + * (Note: it is either 0 or >1.)
>
> That is a really confusing comment when the next instructions calculate
> %r7>>4.
Yes, that comment needs a bit more work because we ended up doing more work
here than anticipated.
> > + */
> > + mov %r1, %r7
> > + rsh %r1, 4 /* #(leading words) */
> > + mov %r0, %r1
> > + neg %r0
> > + rsh %r0, 63
> > + xor %r0, 1
> > + add %r9, %r0
> > + mov %r2, %r7
> > + and %r2, 0xf /* #(zero words to collapse) */
> > + add %r1, %r2 /* #(words used) */
> > + jgt %r1, 8, .Ldone
> > +
> > + mov %r8, %fp
> > + add %r8, -32
> > + mov %r0, %r1
> > + mul %r0, 2
> > + add %r8, %r0
> > + mov %r6, 8
> > + sub %r6, %r1 /* #(words left) */
> > +
> > + stb [%r9 + 0], 0x3a
> > + mov %r0, %r6
> > + neg %r0
> > + rsh %r0, 63
> > + xor %r0, 1
> > + add %r9, %r0
> > +
> > +.Lpost_loop:
> > + jle %r6, 0, .Ldone
> > + stb [%r9 + 0], 0x3a
> > + add %r9, 1
> > + ldxh %r1, [%r8 + 0]
> > + mov %r2, %r9
> > + call write_hex16
> > + add %r9, %r0
> > + add %r8, 2
> > + sub %r6, 1
> > + ja .Lpost_loop
> > +
> > +.Ldone:
> > + stb [%r9 + 0], 0
> > + mov %r0, 0
>
> Why? Who uses that return value?
Mostly because it is clean (BPF functions generally return 0 if they do not
return any value). It also helps the verifier because it can count on the
return value to be a constant - since the verifier doesn't know that this is
effectively a void function.
> > + exit
> > +
> > +.Lipv4:
> > + mov %r1, %r6
> > + mov %r2, %fp
> > + add %r2, -4
> > + mov %r3, %r9
> > + call dt_inet_ntoa
> > + add %r9, %r0
> > + ja .Ldone
> > + .size dt_inet_ntoa6, .-dt_inet_ntoa6
> > diff --git a/libdtrace/dt_cg.c b/libdtrace/dt_cg.c
> > index ebe2e043..61f58a99 100644
> > --- a/libdtrace/dt_cg.c
> > +++ b/libdtrace/dt_cg.c
> > @@ -5878,6 +5878,56 @@ dt_cg_subr_inet_ntoa(dt_node_t *dnp, dt_irlist_t *dlp, dt_regset_t *drp)
> > dt_cg_subr_arg_to_tstring(dnp, dlp, drp, "dt_inet_ntoa", DT_ISIGN, 0);
> > }
> > +static void
> > +dt_cg_subr_inet_ntoa6(dt_node_t *dnp, dt_irlist_t *dlp, dt_regset_t *drp)
> > +{
> > + dtrace_hdl_t *dtp = yypcb->pcb_hdl;
> > + ssize_t tbloff;
> > + const char tbl[] = {
>
> Add a comment so that someone understands what this is about and can judge
> whether the entries are correct or not.
Yes, good idea.
> > + 0x08, 0x07, 0x06, 0x06, 0x05, 0x05, 0x05, 0x05,
> > + 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04,
> > + 0x44, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03,
> > + 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03,
> > + 0x35, 0x34, 0x33, 0x33, 0x02, 0x02, 0x02, 0x02,
> > + 0x53, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02,
> > + 0x44, 0x43, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02,
> > + 0x53, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02,
> > + 0x26, 0x25, 0x24, 0x24, 0x23, 0x23, 0x23, 0x23,
> > + 0x53, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22,
> > + 0x44, 0x43, 0x42, 0x42, 0x62, 0x70, 0x70, 0x70,
> > + 0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> > + 0x35, 0x34, 0x33, 0x33, 0x32, 0x32, 0x32, 0x32,
> > + 0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> > + 0x44, 0x43, 0x42, 0x42, 0x62, 0x70, 0x70, 0x70,
> > + 0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> > + 0x17, 0x16, 0x15, 0x15, 0x14, 0x14, 0x14, 0x14,
> > + 0x13, 0x13, 0x13, 0x13, 0x13, 0x13, 0x13, 0x13,
> > + 0x44, 0x43, 0x12, 0x12, 0x12, 0x12, 0x12, 0x12,
> > + 0x53, 0x12, 0x12, 0x12, 0x12, 0x12, 0x12, 0x12,
> > + 0x35, 0x34, 0x33, 0x33, 0x32, 0x32, 0x32, 0x32,
> > + 0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> > + 0x44, 0x43, 0x42, 0x42, 0x62, 0x70, 0x70, 0x70,
> > + 0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> > + 0x26, 0x25, 0x24, 0x24, 0x23, 0x23, 0x23, 0x23,
> > + 0x53, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22,
> > + 0x44, 0x43, 0x42, 0x42, 0x62, 0x70, 0x70, 0x70,
> > + 0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> > + 0x35, 0x34, 0x33, 0x33, 0x32, 0x32, 0x32, 0x32,
> > + 0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> > + 0x44, 0x43, 0x42, 0x42, 0x62, 0x70, 0x70, 0x70,
> > + 0x53, 0x52, 0x70, 0x70, 0x62, 0x70, 0x70, 0x70,
> > + };
> > +
> > + tbloff = dt_rodata_insert(dtp->dt_rodata, tbl, 256);
> > + if (tbloff == -1L)
> > + longjmp(yypcb->pcb_jmpbuf, EDT_NOMEM);
> > + if (tbloff > DIF_STROFF_MAX)
> > + longjmp(yypcb->pcb_jmpbuf, EDT_STR2BIG);
> > +
> > + dt_cg_subr_arg_to_tstring(dnp, dlp, drp, "dt_inet_ntoa6",
> > + DT_ISIMM, tbloff);
> > +}
> > +
> > typedef void dt_cg_subr_f(dt_node_t *, dt_irlist_t *, dt_regset_t *);
> > static dt_cg_subr_f *_dt_cg_subr[DIF_SUBR_MAX + 1] = {
> > @@ -5924,7 +5974,7 @@ static dt_cg_subr_f *_dt_cg_subr[DIF_SUBR_MAX + 1] = {
> > [DIF_SUBR_NTOHLL] = &dt_cg_subr_htonll,
> > [DIF_SUBR_INET_NTOP] = NULL,
> > [DIF_SUBR_INET_NTOA] = &dt_cg_subr_inet_ntoa,
> > - [DIF_SUBR_INET_NTOA6] = NULL,
> > + [DIF_SUBR_INET_NTOA6] = &dt_cg_subr_inet_ntoa6,
> > [DIF_SUBR_D_PATH] = NULL,
> > [DIF_SUBR_LINK_NTOP] = NULL,
> > };
> > diff --git a/test/unittest/funcs/tst.inet_ntoa6.d b/test/unittest/funcs/tst.inet_ntoa6.d
> > index 971456b1..d62269de 100644
> > --- a/test/unittest/funcs/tst.inet_ntoa6.d
> > +++ b/test/unittest/funcs/tst.inet_ntoa6.d
> > @@ -4,7 +4,6 @@
> > * Licensed under the Universal Permissive License v 1.0 as shown at
> > * http://oss.oracle.com/licenses/upl.
> > */
> > -/* @@xfail: dtv2 */
> > #pragma D option quiet
>
> _______________________________________________
> DTrace-devel mailing list
> DTrace-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/dtrace-devel
More information about the DTrace-devel
mailing list