[DTrace-devel] [DTrace] DTrace userspace branch dm/2.0-branch-dev-aggs updated. 2ffdae6376cbe280267577edbfc5dc93a776398f

Eugene Loh eugene.loh at oracle.com
Wed Nov 25 09:41:33 PST 2020


I see this is WIP and you're not ready for review, but I happened to 
take a peek anyhow to see where stddev is headed.  Comments below.

On 11/24/2020 08:57 PM, nick.alcock at oracle.com wrote:

> commit 2ffdae6376cbe280267577edbfc5dc93a776398f
> Author: David  Mc Lean <david.mclean at oracle.com>
> Date:   Thu Nov 19 12:04:39 2020 -0800
>
>      aggregations max(), min(), sum(), and stddev() WIP
>
> diff --git a/libdtrace/dt_cg.c b/libdtrace/dt_cg.c
>
> +static void
> +dt_cg_agg_stddev_impl(dt_irlist_t *dlp, dt_regset_t *drp, int dreg, int vreg)
> +{
> +	int idxreg, accreg, hi_reg, medreg, lowreg;
> +
> +	/* labels */
> +	uint_t		Lpos = dt_irlist_label(dlp);
> +
> +
> +	TRACE_REGSET("            Impl: Begin");
> +
> +	/* get some registers to work with */
> +	if ((idxreg = dt_regset_alloc(drp)) == -1 ||
> +	    (accreg = dt_regset_alloc(drp)) == -1 ||
> +	    (hi_reg = dt_regset_alloc(drp)) == -1 ||
> +	    (medreg = dt_regset_alloc(drp)) == -1 ||
> +	    (lowreg = dt_regset_alloc(drp)) == -1)
> +		longjmp(yypcb->pcb_jmpbuf, EDT_NOREG);
> +
> +	/* dst[0]++; increment count of calls eq. value count */
> +	BPFCODE(BPF_MOV_IMM(idxreg, 1))
> +	BPFCODE(BPF_XADD_REG(BPF_DW, dreg, 0, idxreg))
> +	dt_regset_free(drp, idxreg);

1.  You use idxreg below, so you cannot free it here.  (Or you'd have to 
reallocate it or something funny.)  Not sure why it's called idxreg.

2.  The aggregation functions started with this foo() and foo_impl() 
approach, where foo_impl() would get called twice for each update (since 
there were two copies to update).  With the *quantize() functions, which 
have "complicated" conversions from values to bin numbers, there is a 
third function.  We call that other function once and the foo_impl() 
function twice.  In your case, I would imagine a new function that
computes the 128-bit square, placing the result in a pair of registers,
leaving std_impl() to do just the +1, +val, and +valsquared
aggregation.
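
Roughly the shape I have in mind (an untested sketch; the helper name
and how the square registers get plumbed through DT_CG_AGG_IMPL are
made up):

	/*
	 * Called once per record: leave the 128-bit square of vreg in
	 * the register pair (sq_hi, sq_lo).
	 */
	static void
	dt_cg_agg_stddev_square(dt_irlist_t *dlp, dt_regset_t *drp,
				int vreg, int sq_hi, int sq_lo);

	/* Called twice (once per copy): nothing but the four adds. */
	static void
	dt_cg_agg_stddev_impl(dt_irlist_t *dlp, dt_regset_t *drp, int dreg,
			      int vreg, int sq_hi, int sq_lo)
	{
		int	onereg;

		if ((onereg = dt_regset_alloc(drp)) == -1)
			longjmp(yypcb->pcb_jmpbuf, EDT_NOREG);

		BPFCODE(BPF_MOV_IMM(onereg, 1))
		BPFCODE(BPF_XADD_REG(BPF_DW, dreg, 0, onereg))	/* dst[0]++ */
		BPFCODE(BPF_XADD_REG(BPF_DW, dreg, 8, vreg))	/* dst[1] += val */
		BPFCODE(BPF_XADD_REG(BPF_DW, dreg, 16, sq_hi))	/* dst[2] += hi */
		BPFCODE(BPF_XADD_REG(BPF_DW, dreg, 24, sq_lo))	/* dst[3] += lo */

		dt_regset_free(drp, onereg);
	}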

> +	/* dst[1] += val; add value to the total sum */
> +	BPFCODE(BPF_XADD_REG(BPF_DW, dreg, 8, vreg))
> +
> +	/* Handle double-dst[3-4] = sq(val) starting here */
> +
> +	// If value is negative make it positive, sq() is positive anyhow
> +	BPFCODE(BPF_MOV_REG(accreg, vreg))
> +	BPFCODE(BPF_MOV_REG(idxreg, vreg))
> +	BPFCODE(BPF_ALU64_IMM(BPF_AND, accreg, 0x8000000000000000))
> +	BPFCODE(BPF_BRANCH_IMM(BPF_JEQ, accreg, 0, Lpos))

You cannot IMM a 64-bit value.  I think you can just 
BPF_BRANCH_IMM(BPF_JSGE, vreg, 0, Lpos).

> +
> +	// Change to a positive value
> +	BPFCODE(BPF_NEG_REG(idxreg))

That changes the magnitude.  Probably need BPF_MUL_IMM(regfoo, -1) or
something.  Note that negating is not the same as taking the bitwise
complement: x and ~x add up to -1, while x and -x add up to 0.

> +	BPFCODE_lbl = Lpos;

I like blank lines above labels, not below them.  Not sure if Kris has a 
strong preference.
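
With both of those changes, plus the blank line above the label, the
sign fixup could collapse to something like (untested):

	BPFCODE(BPF_MOV_REG(idxreg, vreg))
	BPFCODE(BPF_BRANCH_IMM(BPF_JSGE, vreg, 0, Lpos))
	BPFCODE(BPF_ALU64_IMM(BPF_MUL, idxreg, -1))	/* idxreg = -vreg */

	BPFCODE_lbl = Lpos;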

> +
> +	// Now we are sure idxreg holds our absolute value
> +	// Split our value into two pieces
> +
> +	// Put least significant half in lowreg
> +	BPFCODE(BPF_MOV_REG(accreg, idxreg))
> +	BPFCODE(BPF_ALU64_IMM(BPF_AND, accreg, 0xFFFFFFFF))
> +	BPFCODE(BPF_MOV_REG(lowreg, accreg))

No need for the accreg intermediate register.  Just go to lowreg directly.
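
That is (untested; also note that, if I have the semantics right, an
ALU64 immediate gets sign-extended, so the 0xFFFFFFFF above would become
all ones and mask nothing; a shift pair sidesteps that too):

	BPFCODE(BPF_MOV_REG(lowreg, idxreg))
	BPFCODE(BPF_ALU64_IMM(BPF_LSH, lowreg, 32))
	BPFCODE(BPF_ALU64_IMM(BPF_RSH, lowreg, 32))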

> +
> +	// Put most significant half in hi_reg
> +	BPFCODE(BPF_MOV_REG(accreg, idxreg))
> +	BPFCODE(BPF_ALU64_IMM(BPF_AND, accreg, 0xFFFFFFFF00000000))
> +	BPFCODE(BPF_ALU64_IMM(BPF_RSH, accreg, 32))
> +	BPFCODE(BPF_MOV_REG(hi_reg, accreg))

Again, no need for intermediate register; go to hi_reg directly. Also, 
no need for AND since the RSH will pick up the correct bits.
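
That is, just:

	BPFCODE(BPF_MOV_REG(hi_reg, idxreg))
	BPFCODE(BPF_ALU64_IMM(BPF_RSH, hi_reg, 32))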

> +	// Multiply using algebraic FOIL, to contain interim values to 64 bits
> +

For me personally, FOIL is not helpful... I had to look it up.  There
is no ironclad notion of which factor is first, which term is first,
etc.  High, medium, and low strike me as more descriptive.  That's all
subjective; I'm just voting.  Anyhow, the complexity is not "multiplying
two binomials" but all the cutting and stitching that needs to be done
for 128-bit multiplies with 64-bit registers.
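
Spelled out, the identity being computed (with x = 2^32*hi + lo, both
halves below 2^32) is:

	x^2 = 2^64*(hi*hi) + 2^33*(hi*lo) + (lo*lo)
	       "high"        "medium"      "low"

so the middle term is the one that gets doubled and split across the
two halves of the result.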

> +	// Calculate 'O' (== I) first
> +	// This result represents a value shifted 32 bits to the right
> +	BPFCODE(BPF_MOV_REG(medreg, lowreg))
> +	BPFCODE(BPF_ALU64_REG(BPF_MUL, medreg, hi_reg))
> +
> +	// Calculate 'L' next
> +	BPFCODE(BPF_MOV_REG(accreg, lowreg))
> +	BPFCODE(BPF_ALU64_REG(BPF_MUL, lowreg, accreg))

For L, maybe BPF_ALU64_REG(BPF_MUL, lowreg, lowreg) just works?  No need 
for accreg?

> +	// Calculate 'F' next; this result represents shifted 64 to the right
> +	BPFCODE(BPF_MOV_REG(accreg, hi_reg))
> +	BPFCODE(BPF_ALU64_REG(BPF_MUL, hi_reg, accreg))
> +

Same thing... I'm thinking no need for accreg; just hi *= hi.

> +	// Now add the pieces and carefully place them in the right place
> +
> +	// Shave the high nibble off the low half value to allow for no loss
> +	// by avoiding overflow in the coming addition.
> +
> +	BPFCODE(BPF_MOV_REG(idxreg, lowreg))
> +	// RSH effectively does this for me anyhow
> +	// BPFCODE(BPF_ALU64_IMM(BPF_AND, idxreg, 0xF000000000000000))
> +	BPFCODE(BPF_ALU64_IMM(BPF_RSH, idxreg, 60))
> +	// Clear the saved nibble in the source value
> +	BPFCODE(BPF_ALU64_IMM(BPF_AND, lowreg, 0x0FFFFFFFFFFFFFFF))

Again, IMM cannot be a 64-bit value.  Loading a 64-bit value costs two 
slots.  Then you need to AND.  So, one alternative is to LSH 4 bits and 
then RSH 4 bits.
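
That is, something like (untested):

	BPFCODE(BPF_MOV_REG(idxreg, lowreg))
	BPFCODE(BPF_ALU64_IMM(BPF_RSH, idxreg, 60))	/* save the top nibble */
	BPFCODE(BPF_ALU64_IMM(BPF_LSH, lowreg, 4))	/* clear it in lowreg */
	BPFCODE(BPF_ALU64_IMM(BPF_RSH, lowreg, 4))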

From here on out, it's a little hard for me to follow the code.  I'm
not sure how to solve that.

Incidentally, the DTv1 code has that implementation in terms of 
dtrace_[multiply|shift|add]_128 in dtrace_probe_ctx.c.  Did you decide 
not to go that route?  It's intended to be more general (multiply two 
different factors rather than squaring just one; shift by an arbitrary 
amount rather than a known 32), so it could be simplified a lot before 
implementing in BPF.
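
For reference, the squaring special case collapses to something pretty
small in plain C.  A sketch of the arithmetic to be emitted (square128
is a made-up name, and this is the C, not the BPF):

	#include <stdint.h>

	/* Square a nonnegative 64-bit value into a 128-bit (hi, lo) pair. */
	static void
	square128(uint64_t x, uint64_t *hi, uint64_t *lo)
	{
		uint64_t	xl = x & 0xFFFFFFFFULL;	/* low 32 bits */
		uint64_t	xh = x >> 32;		/* high 32 bits */
		uint64_t	mid = xh * xl;		/* the "medium" term */
		uint64_t	mlo = mid << 33;	/* low word of 2*mid << 32 */
		uint64_t	mhi = mid >> 31;	/* high word of 2*mid << 32 */

		*lo = xl * xl;
		*hi = xh * xh + mhi;	/* cannot overflow 64 bits */

		*lo += mlo;
		if (*lo < mlo)		/* carry out of the low word */
			(*hi)++;
	}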

> +	BPFCODE(BPF_MOV_REG(accreg, medreg)) // similar nibble thing here
> +	BPFCODE(BPF_ALU64_IMM(BPF_AND, accreg, 0x00000000F0000000))
> +	BPFCODE(BPF_ALU64_IMM(BPF_RSH, accreg, 28))
> +	BPFCODE(BPF_ALU64_IMM(BPF_AND, medreg, 0xFFFFFFFF0FFFFFFF))
> +	// The values for 'O' are doubled to represent 'I' too.
> +	BPFCODE(BPF_ALU64_IMM(BPF_MUL, accreg, 2))
> +
> +	// Get the low value part from medreg and add it into lowreg
> +	BPFCODE(BPF_ALU64_REG(BPF_ADD, accreg, idxreg))
> +	BPFCODE(BPF_MOV_REG(idxreg, medreg))
> +	BPFCODE(BPF_ALU64_IMM(BPF_LSH, idxreg, 32))
> +	BPFCODE(BPF_ALU64_IMM(BPF_MUL, idxreg, 2)) // values for 'O' are doubled
> +	BPFCODE(BPF_ALU64_REG(BPF_ADD, lowreg, idxreg))
> +
> +	// Get that last possible high nibble from lowreg
> +	BPFCODE(BPF_MOV_REG(idxreg, lowreg))
> +	BPFCODE(BPF_ALU64_IMM(BPF_RSH, idxreg, 60))
> +	BPFCODE(BPF_ALU64_IMM(BPF_AND, lowreg, 0x0FFFFFFFFFFFFFFF))
> +
> +	// most significant nibble of the low half, + overflow above that nibble
> +	BPFCODE(BPF_ALU64_REG(BPF_ADD, accreg, idxreg))
> +	// Put the final nibble in place for the low half
> +	BPFCODE(BPF_MOV_REG(idxreg, accreg))
> +	BPFCODE(BPF_ALU64_IMM(BPF_AND, lowreg, 0xF))
> +	BPFCODE(BPF_ALU64_IMM(BPF_LSH, idxreg, 60))
> +	BPFCODE(BPF_ALU64_REG(BPF_ADD, lowreg, idxreg))
> +
> +	// Begin putting together the high half value
> +	BPFCODE(BPF_ALU64_IMM(BPF_RSH, accreg, 4))
> +	BPFCODE(BPF_MOV_REG(idxreg, medreg))
> +	BPFCODE(BPF_ALU64_IMM(BPF_RSH, idxreg, 32))
> +	BPFCODE(BPF_ALU64_IMM(BPF_MUL, idxreg, 2)) // values for 'O' are doubled
> +	BPFCODE(BPF_ALU64_REG(BPF_ADD, hi_reg, accreg))
> +	BPFCODE(BPF_ALU64_REG(BPF_ADD, hi_reg, idxreg))
> +
> +	// I might need to account for overflow from adding to the value
> +	// already in dreg location offset 24
> +	BPFCODE(BPF_XADD_REG(BPF_DW, dreg, 16, hi_reg))
> +	BPFCODE(BPF_XADD_REG(BPF_DW, dreg, 24, lowreg))

Again, this XADD stuff makes sense for the std_impl() function, which is 
called twice, but the 128-bit square should perhaps be done only once, 
before std_impl() is called.

> +	dt_regset_free(drp, idxreg);
> +	dt_regset_free(drp, accreg);
> +	dt_regset_free(drp, hi_reg);
> +	dt_regset_free(drp, medreg);
> +	dt_regset_free(drp, lowreg);
> +
> +	TRACE_REGSET("            Impl: End  ");
> +}
> +
>   static void
>   dt_cg_agg_stddev(dt_pcb_t *pcb, dt_ident_t *aid, dt_node_t *dnp,
>   		 dt_irlist_t *dlp, dt_regset_t *drp)
>   {
>   	int	sz = 4 * sizeof(uint64_t);
>   
> -	TRACE_REGSET("    AggStd : Begin");
> +	TRACE_REGSET("    AggStddev : Begin");
>   
>   	if (aid->di_offset == -1)
>   		aid->di_offset = DT_CG_AGG_OFFSET(pcb, sz);
>   
> -	TRACE_REGSET("    AggStd : End  ");
> +	dt_cg_node(dnp->dn_aggfun->dn_args, dlp, drp);
> +	DT_CG_AGG_IMPL(aid, sz, dlp, drp, dt_cg_agg_stddev_impl,
> +		       dnp->dn_aggfun->dn_args->dn_reg);
> +	dt_regset_free(drp, dnp->dn_aggfun->dn_args->dn_reg);
> +
> +	TRACE_REGSET("    AggStddev : End  ");
> +}

I think Kris wants the TRACE_REGSET labels to be fixed width.  So:
"AggStd: End  " or something.  Same for Sum.  And the Begin should come
after the aid->di_offset stuff.  Once the other aggregation functions
have stabilized (I'm working on that), hopefully all this will become
clearer.
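
For the Begin placement, I mean something like:

	if (aid->di_offset == -1)
		aid->di_offset = DT_CG_AGG_OFFSET(pcb, sz);

	TRACE_REGSET("    AggStd: Begin");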


