[DTrace-devel] [PATCH v4] Fix records for aggregation descriptions

Kris Van Hees kris.van.hees at oracle.com
Wed Aug 31 02:32:31 UTC 2022


On Mon, Aug 29, 2022 at 04:47:58PM -0400, Eugene Loh via DTrace-devel wrote:
> The patch uses "keys" in two different ways:
>     - aggregation keys (e.g., see the commit message)
>     - BPF map keys (which adds varID)
> This is confusing.  Arguably, one approach is to break varID off as its own
> record... e.g., dtagd_irec (to add to dtagd_krecs and dtagd_vrecs).  I
> suspect no one would be enthusiastic about this approach, but changing the
> meaning of "keys" from one place to another is not clean.

We cannot avoid the fact that 'keys' is a very generic term.  I would prefer
'tuple', but I believe you have indicated before that it is not consistent with
the docs.  Since both the DTrace docs and the BPF docs refer to keys, we are
pretty much stuck with that term.

I don't think it is confusing though...  clearly we are talking about the
aggregation keys when we refer to indexed aggregations, and we talk about BPF
map keys at the code level when indexing a hashmap.

> The "latch sequence number" raises lots of questions.  First of all, what
> does it mean?  I assume, despite its name, that it is no longer a latch
> sequence number.  The patch also calls it a "data counter", though it
> apparently is not that either -- e.g., dt_agg_one_agg() copies the value for
> MIN/MAX but accumulates it otherwise.  So is it a flag to indicate the
> presence of data?  Is it useful in dtada_data, or will its value always be
> set to indicate the presence of data?  In the long run, will it even be used
> in the BPF map, or will its value always be set even in the BPF map to
> indicate the presence of data?  If the value only ever has one setting, why
> have DT_AGGDATA_COUNTER at all?

At this point in the patch series it is still the latch sequence number.  In
fact, your own patches still refer to it that way at this point in the series.
So I retained that as well.  It is not used beyond the initial check when the
data is retrieved by the consumer, and it is therefore not relevant beyond that
scope.  I chose to use the aggregation function semantics for it because it
helps with debugging at this point in the patch series and is absolutely
harmless.  The min and max functions are data-replacement functions, whereas
all other aggregation functions are truly calculating an aggregate result.
So, for min and max, the latch sequence number is replaced by the new value,
and for the other functions we accumulate the latch sequence number just as
we do the aggregation data.

It is present in the copied data at the consumer level simply because it is
accounted for in the record descriptions and we want to keep the data at its
reported offset.
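
To illustrate the data-replacement vs. accumulation distinction, here is a
minimal sketch of the per-CPU merge semantics (hypothetical and simplified,
not the literal dt_agg_one_agg() from the patch):

	/* src[0] is the latch sequence number, src[1..] the aggregation data. */
	static void
	merge_one(int aggfunc, int64_t *dst, const int64_t *src, uint_t datasz)
	{
		uint_t	i, cnt = datasz / sizeof(int64_t);

		if (src[0] == 0)		/* producer wrote no data */
			return;

		switch (aggfunc) {
		case DT_AGG_MIN:
		case DT_AGG_MAX:
			/* Data-replacement: copy the leading word as-is. */
			dst[0] = src[0];
			if ((aggfunc == DT_AGG_MIN && src[1] < dst[1]) ||
			    (aggfunc == DT_AGG_MAX && src[1] > dst[1]))
				dst[1] = src[1];
			break;
		default:
			/* True aggregation: add every word, including the
			 * leading one. */
			for (i = 0; i < cnt; i++)
				dst[i] += src[i];
		}
	}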

> On 8/25/22 17:22, Kris Van Hees via DTrace-devel wrote:
> > On Thu, Aug 25, 2022 at 04:15:01PM -0400, Kris Van Hees via DTrace-devel wrote:
> > > The implementation of aggregations erronously registered the value of
> > > an aggregation as a series of 64-bit values, each with their own record
> > > description.  However, aggregatios in their most generic form should
> > > comprise the following records:
> > > 
> > > (1) variable ID
> > > (2) keys (values of the index tuple, if any)
> > > (3) latch sequence number
> > > (4) aggregation data
> > > 
> > > (1) and (2) are records that identify the aggregation, whereas (3) and
> > > (4) are data that is generated by the producer.
> 
> What does this mean?  E.g., (2) is also data that is generated by the
> producer.  The real point is presumably that (1) and (2) are used to form
> BPF map keys while (3) and (4) are part of BPF map values.

When we're talking about aggregations, generated data is clearly the actual
aggregation data that is generated by means of the aggregation function.  I
doubt there is confusion here.

> Incidentally, these typos:
>   erronously
>   aggregatios
>   aggregtion
> 
> 
> > > Change the aggregation record descriptions to match the layout described
> > > above.
> > > 
> > > Note that the variable ID is now stored as a 32-bit value whereas DTrace
> > > used to store it as a 64-bit value.  This change is valid because the
> > > variable ID is an identifier ID and therefore cannot exceed the range of
> > > a 32-bit integer,
> > > 
> > > Support for aggregtion keys will be added in later patches.
> > > 
> > > Signed-off-by: Kris Van Hees <kris.van.hees at oracle.com>
> > > Signed-off-by: Eugene Loh <eugene.loh at oracle.com>
> > > Reviewed-by: Kris Van Hees <kris.van.hees at oracle.com>
> > > ---
> > >   cmd/dtrace.c              |  2 +-
> > >   include/dtrace/metadesc.h |  8 +++--
> > >   libdtrace/dt_aggregate.c  | 74 +++++++++++++++++++--------------------
> > >   libdtrace/dt_cg.c         |  9 ++++-
> > >   libdtrace/dt_consume.c    | 41 +++++++++-------------
> > >   libdtrace/dt_impl.h       |  3 ++
> > >   libdtrace/dt_map.c        | 65 ++++++++++++++++++++--------------
> > >   libdtrace/dt_printf.c     | 25 ++++++-------
> > >   8 files changed, 119 insertions(+), 108 deletions(-)
> > > 
> > > diff --git a/cmd/dtrace.c b/cmd/dtrace.c
> > > index 22ac28c3..350cab94 100644
> > > --- a/cmd/dtrace.c
> > > +++ b/cmd/dtrace.c
> > > @@ -682,7 +682,7 @@ bufhandler(const dtrace_bufdata_t *bufdata, void *arg)
> > >   		BUFDUMPSTR(desc, dtagd_name);
> > >   		BUFDUMP(desc, dtagd_varid);
> > >   		BUFDUMP(desc, dtagd_id);
> > > -		BUFDUMP(desc, dtagd_nrecs);
> > > +		BUFDUMP(desc, dtagd_nkrecs + 1);
> > >   		BUFDUMPHDR("");
> > >   	}
> 
> Why nkrecs+1 rather than simply nkrecs?  It seems that we do not use or test
> "dtrace -B" in the first place.  Nevertheless, it is hard for me to see how
> output like
>     dtagd_nkrecs + 1 = 2
> is more interesting than output like
>     dtagd_nkrecs = 1

Yes, I made a mistake here, forgetting that those macros actually print the
2nd argument as the field name and then evaluate it to get its value.  The
whole buffered output code is not tested at all (and never has been, not even
in the legacy version), nor is it particularly well defined.

I'll keep the (incorrect) existing code and add a FIXME.
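
For reference, the behaviour I mean is roughly this (a hypothetical sketch of
the macro semantics, not the actual definition in cmd/dtrace.c):

	#define BUFDUMP(ptr, field) \
		printf("%s = %lld\n", #field, (long long)((ptr)->field))

With a macro like that, BUFDUMP(desc, dtagd_nkrecs + 1) prints the literal
text "dtagd_nkrecs + 1" as the field name, which is not what was intended.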

> 
> > > diff --git a/include/dtrace/metadesc.h b/include/dtrace/metadesc.h
> > > index 20a5b8a0..f75b0a8f 100644
> > > --- a/include/dtrace/metadesc.h
> > > +++ b/include/dtrace/metadesc.h
> > > @@ -57,9 +57,11 @@ typedef struct dtrace_aggdesc {
> > >   	dtrace_aggid_t dtagd_id;		/* aggregation ID */
> > >   	uint64_t dtagd_sig;			/* aggregation signature */
> > >   	uint64_t dtagd_normal;			/* aggregation normalization */
> > > -	uint32_t dtagd_size;			/* size in bytes */
> > > -	int dtagd_nrecs;			/* number of records */
> > > -	dtrace_recdesc_t *dtagd_recs;		/* record descriptions */
> > > +	uint32_t dtagd_ksize;			/* keys size in bytes */
> > > +	uint32_t dtagd_dsize;			/* data size in bytes */
> > > +	int dtagd_nkrecs;			/* number of key records */
> > > +	dtrace_recdesc_t *dtagd_krecs;		/* key record descriptions */
> > > +	dtrace_recdesc_t *dtagd_drecs;		/* data record descriptions */
> > >   } dtrace_aggdesc_t;
> > >   typedef struct dtrace_fmtdesc {
> > > diff --git a/libdtrace/dt_aggregate.c b/libdtrace/dt_aggregate.c
> > > index fa65ecd3..f810efd3 100644
> > > --- a/libdtrace/dt_aggregate.c
> > > +++ b/libdtrace/dt_aggregate.c
> > > @@ -390,7 +390,7 @@ dt_aggregate_aggid(dt_ahashent_t *ent)
> > >   {
> > >   	dtrace_aggdesc_t *agg = ent->dtahe_data.dtada_desc;
> > >   	caddr_t data = ent->dtahe_data.dtada_data;
> > > -	dtrace_recdesc_t *rec = agg->dtagd_recs;
> > > +	dtrace_recdesc_t *rec = agg->dtagd_krecs;
> > >   	/*
> > >   	 * First, we'll check the variable ID in the aggdesc.  If it's valid,
> > > @@ -406,8 +406,8 @@ dt_aggregate_aggid(dt_ahashent_t *ent)
> > >   	return agg->dtagd_varid;
> > >   }
> > > -typedef void (*agg_cpu_f)(dt_ident_t *aid,
> > > -    int64_t *dst, int64_t *src, uint_t realsz);
> > > +typedef void (*agg_cpu_f)(dt_ident_t *aid, int64_t *dst, int64_t *src,
> > > +			  uint_t datasz);
> > >   typedef struct dt_snapstate {
> > >   	dtrace_hdl_t	*dtp;
> > > @@ -418,34 +418,34 @@ typedef struct dt_snapstate {
> > >   } dt_snapstate_t;
> > >   static void
> > > -dt_agg_one_copy(dt_ident_t *aid, int64_t *dst, int64_t *src, uint_t realsz)
> > > +dt_agg_one_copy(dt_ident_t *aid, int64_t *dst, int64_t *src, uint_t datasz)
> > >   {
> > > -	src++;  /* skip latch sequence number */
> > > -	memcpy(dst, src, realsz);
> > > +	memcpy(dst, src, datasz);
> > >   }
> > >   static void
> > > -dt_agg_one_agg(dt_ident_t *aid, int64_t *dst, int64_t *src, uint_t realsz)
> > > +dt_agg_one_agg(dt_ident_t *aid, int64_t *dst, int64_t *src, uint_t datasz)
> > >   {
> > >   	uint_t i, cnt;
> > >   	if (*src == 0)
> > >   		return;
> > > -	src++;  /* skip latch sequence number */
> > >   	switch (((dt_ident_t *)aid->di_iarg)->di_id) {
> > >   	case DT_AGG_MAX:
> > > +		*dst++ = *src++;	/* copy latch sequence number */
> > >   		if (*src > *dst)
> > >   			*dst = *src;
> > >   		break;
> > >   	case DT_AGG_MIN:
> > > +		*dst++ = *src++;	/* copy latch sequence number */
> > >   		if (*src < *dst)
> > >   			*dst = *src;
> > >   		break;
> > >   	default:
> > > -		for (i = 0, cnt = realsz / sizeof(int64_t);
> > > +		for (i = 0, cnt = datasz / sizeof(int64_t);
> > >   		     i < cnt; i++, dst++, src++)
> > >   			*dst += *src;
> > >   	}
> 
> Here is what I mentioned earlier:  the first word is copied for MIN/MAX but
> accumulated otherwise.  So, how should we describe that first word?  Neither
> latch sequence number nor data counter (the two phrases used by this patch)
> makes sense.

At this point, it does not matter.

> > > @@ -461,7 +461,7 @@ dt_aggregate_snap_one(dt_idhash_t *dhp, dt_ident_t *aid, dt_snapstate_t *st)
> > >   	uint64_t		hval = aid->di_id;
> > >   	size_t			ndx = hval % agh->dtah_size;
> > >   	int			rval;
> > > -	uint_t			i, realsz;
> > > +	uint_t			i, datasz;
> > >   	int64_t			*src;
> > >   	rval = dt_aggid_lookup(st->dtp, aid->di_id, &agg);
> > > @@ -471,23 +471,22 @@ dt_aggregate_snap_one(dt_idhash_t *dhp, dt_ident_t *aid, dt_snapstate_t *st)
> > >   	/* point to the latch sequence number */
> > >   	src = (int64_t *)(st->buf + aid->di_offset);
> > > -	/* real size excludes latch sequence number and second data copy, if any */
> > > -	realsz = (aid->di_size - sizeof(uint64_t)) / DT_AGG_NUM_COPIES;
> > > +	datasz = agg->dtagd_dsize;
> > >   	/* See if we already have an entry for this aggregation. */
> > >   	for (h = agh->dtah_hash[ndx]; h != NULL; h = h->dtahe_next) {
> > > -		if (h->dtahe_hval != hval || h->dtahe_size != realsz)
> > > +		if (h->dtahe_hval != hval || h->dtahe_size != datasz)
> > >   			continue;
> > >   		/* Entry found - process the data. */
> > >   		agd = &h->dtahe_data;
> > > -		st->fun(aid, (int64_t *)agd->dtada_data, src, realsz);
> > > +		st->fun(aid, (int64_t *)agd->dtada_data, src, datasz);
> > >   		/* If we keep per-CPU data - process that as well. */
> > >   		if (agd->dtada_percpu != NULL)
> > >   			st->fun(aid, (int64_t *)agd->dtada_percpu[st->cpu],
> > > -			    src, realsz);
> > > +			    src, datasz);
> > >   		return 0;
> > >   	}
> > > @@ -502,20 +501,19 @@ dt_aggregate_snap_one(dt_idhash_t *dhp, dt_ident_t *aid, dt_snapstate_t *st)
> > >   		return dt_set_errno(st->dtp, EDT_NOMEM);
> > >   	agd = &h->dtahe_data;
> > > -	agd->dtada_data = dt_alloc(st->dtp, realsz);
> > > +	agd->dtada_data = dt_alloc(st->dtp, datasz);
> > >   	if (agd->dtada_data == NULL) {
> > >   		dt_free(st->dtp, h);
> > >   		return dt_set_errno(st->dtp, EDT_NOMEM);
> > >   	}
> > > -	src++;
> > > -	memcpy(agd->dtada_data, src, realsz);
> > > -	agd->dtada_size = realsz;
> > > +	memcpy(agd->dtada_data, src, datasz);
> > > +	agd->dtada_size = datasz;
> > >   	agd->dtada_desc = agg;
> > >   	agd->dtada_hdl = st->dtp;
> > >   	h->dtahe_hval = hval;
> > > -	h->dtahe_size = realsz;
> > > +	h->dtahe_size = datasz;
> > >   	if (st->agp->dtat_flags & DTRACE_A_PERCPU) {
> > >   		char	**percpu = dt_calloc(st->dtp,
> > > @@ -530,7 +528,7 @@ dt_aggregate_snap_one(dt_idhash_t *dhp, dt_ident_t *aid, dt_snapstate_t *st)
> > >   		}
> > >   		for (i = 0; i <= st->dtp->dt_conf.max_cpuid; i++) {
> > > -			percpu[i] = dt_zalloc(st->dtp, realsz);
> > > +			percpu[i] = dt_zalloc(st->dtp, datasz);
> > >   			if (percpu[i] == NULL) {
> > >   				while (--i >= 0)
> > >   					dt_free(st->dtp, percpu[i]);
> > > @@ -541,7 +539,7 @@ dt_aggregate_snap_one(dt_idhash_t *dhp, dt_ident_t *aid, dt_snapstate_t *st)
> > >   			}
> > >   		}
> > > -		memcpy(percpu[st->cpu], src, realsz);
> > > +		memcpy(percpu[st->cpu], src, datasz);
> > >   		agd->dtada_percpu = percpu;
> > >   	}
> > > @@ -617,10 +615,10 @@ dt_aggregate_hashcmp(const void *lhs, const void *rhs)
> > >   	dtrace_aggdesc_t *lagg = lh->dtahe_data.dtada_desc;
> > >   	dtrace_aggdesc_t *ragg = rh->dtahe_data.dtada_desc;
> > > -	if (lagg->dtagd_nrecs < ragg->dtagd_nrecs)
> > > +	if (lagg->dtagd_nkrecs < ragg->dtagd_nkrecs)
> > >   		return DT_LESSTHAN;
> > > -	if (lagg->dtagd_nrecs > ragg->dtagd_nrecs)
> > > +	if (lagg->dtagd_nkrecs > ragg->dtagd_nkrecs)
> > >   		return DT_GREATERTHAN;
> > >   	return 0;
> > > @@ -659,8 +657,8 @@ dt_aggregate_keycmp(const void *lhs, const void *rhs)
> > >   	if ((rval = dt_aggregate_hashcmp(lhs, rhs)) != 0)
> > >   		return rval;
> > > -	nrecs = lagg->dtagd_nrecs - 1;
> > > -	assert(nrecs == ragg->dtagd_nrecs - 1);
> > > +	nrecs = lagg->dtagd_nkrecs;
> > > +	assert(nrecs == ragg->dtagd_nkrecs);
> > >   	keypos = dt_keypos + 1 >= nrecs ? 0 : dt_keypos;
> > > @@ -671,8 +669,8 @@ dt_aggregate_keycmp(const void *lhs, const void *rhs)
> > >   		if (ndx >= nrecs)
> > >   			ndx = ndx - nrecs + 1;
> > > -		lrec = &lagg->dtagd_recs[ndx];
> > > -		rrec = &ragg->dtagd_recs[ndx];
> > > +		lrec = &lagg->dtagd_krecs[ndx];
> > > +		rrec = &ragg->dtagd_krecs[ndx];
> > >   		ldata = lh->dtahe_data.dtada_data + lrec->dtrd_offset;
> > >   		rdata = rh->dtahe_data.dtada_data + rrec->dtrd_offset;
> > > @@ -772,18 +770,18 @@ dt_aggregate_valcmp(const void *lhs, const void *rhs)
> > >   	if ((rval = dt_aggregate_hashcmp(lhs, rhs)) != 0)
> > >   		return rval;
> > > -	if (lagg->dtagd_nrecs > ragg->dtagd_nrecs)
> > > +	if (lagg->dtagd_nkrecs > ragg->dtagd_nkrecs)
> > >   		return DT_GREATERTHAN;
> > > -	if (lagg->dtagd_nrecs < ragg->dtagd_nrecs)
> > > +	if (lagg->dtagd_nkrecs < ragg->dtagd_nkrecs)
> > >   		return DT_LESSTHAN;
> > > -	if (lagg->dtagd_nrecs <= 0)
> > > +	if (lagg->dtagd_nkrecs <= 0)
> > >   	    return 0;
> > > -	for (i = 0; i < lagg->dtagd_nrecs; i++) {
> > > -		lrec = &lagg->dtagd_recs[i];
> > > -		rrec = &ragg->dtagd_recs[i];
> > > +	for (i = 0; i < lagg->dtagd_nkrecs; i++) {
> > > +		lrec = &lagg->dtagd_krecs[i];
> > > +		rrec = &ragg->dtagd_krecs[i];
> > >   		if (lrec->dtrd_offset < rrec->dtrd_offset)
> > >   			return DT_LESSTHAN;
> 
> We're cutting the loop short to exclude iteration over the data record.  I'm
> a little unclear what the ramifications are:

Well no, we're not cutting the loop short - it loops over the key records.  But
the original code combined the key checks with the data record check.  I'll add
an explicit check for the data record.

> 1)  At the very least, we do not set lrec and rrec to correspond to the data
> record.  I think you fix that in a later patch, but it should be done here
> in this patch.

Yes.

> 2)  We omit the offset and action checks for the data record... are there
> any repercussions to this?

See above.

I'll have to check but I am pretty sure that there is no point in checking the
action of the key records anymore because those are always the same.  But it
would make sense to check the sizes, I think.
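
Something along these lines, as a sketch only (using the dtagd_drecs /
DT_AGGDATA_RECORD layout this patch introduces):

	lrec = &lagg->dtagd_drecs[DT_AGGDATA_RECORD];
	rrec = &ragg->dtagd_drecs[DT_AGGDATA_RECORD];

	if (lrec->dtrd_size < rrec->dtrd_size)
		return DT_LESSTHAN;
	if (lrec->dtrd_size > rrec->dtrd_size)
		return DT_GREATERTHAN;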

There is other stuff that needs to be fixed here in terms of pre-existing odd
code.  E.g., dt_aggregate_hashcmp() already compares the number of key records,
so there is no point in doing it again in dt_aggregate_valcmp().

> > > @@ -1089,7 +1087,7 @@ dt_aggwalk_rval(dtrace_hdl_t *dtp, dt_ahashent_t *h, int rval)
> > >   		uint32_t size, offs = 0;
> > >   		aggdesc = h->dtahe_data.dtada_desc;
> > > -		rec = &aggdesc->dtagd_recs[aggdesc->dtagd_nrecs - 1];
> > > +		rec = &aggdesc->dtagd_drecs[DT_AGGDATA_RECORD];
> > >   		size = rec->dtrd_size;
> > >   		data = &h->dtahe_data;
> > > @@ -1514,11 +1512,11 @@ dtrace_aggregate_walk_joined(dtrace_hdl_t *dtp, dtrace_aggid_t *aggvars,
> > >   				 * cons up the zaggdata entry for it.
> > >   				 */
> > >   				aggdata = &zaggdata[i].dtahe_data;
> > > -				aggdata->dtada_size = agg->dtagd_size;
> > > +				aggdata->dtada_size = agg->dtagd_dsize;
> > >   				aggdata->dtada_desc = agg;
> > >   				aggdata->dtada_hdl = dtp;
> > >   				zaggdata[i].dtahe_hval = 0;
> > > -				zaggdata[i].dtahe_size = agg->dtagd_size;
> > > +				zaggdata[i].dtahe_size = agg->dtagd_dsize;
> > >   				break;
> > >   			}
> > > @@ -1773,7 +1771,7 @@ dtrace_aggregate_clear(dtrace_hdl_t *dtp)
> > >   	for (h = hash->dtah_all; h != NULL; h = h->dtahe_nextall) {
> > >   		aggdesc = h->dtahe_data.dtada_desc;
> > > -		rec = &aggdesc->dtagd_recs[aggdesc->dtagd_nrecs - 1];
> > > +		rec = &aggdesc->dtagd_drecs[DT_AGGDATA_RECORD];
> > >   		data = &h->dtahe_data;
> > >   		memset(&data->dtada_data[rec->dtrd_offset], 0, rec->dtrd_size);
> > > diff --git a/libdtrace/dt_cg.c b/libdtrace/dt_cg.c
> > > index ebd41906..7bff5895 100644
> > > --- a/libdtrace/dt_cg.c
> > > +++ b/libdtrace/dt_cg.c
> > > @@ -6712,6 +6712,7 @@ static dt_cg_aggfunc_f *_dt_cg_agg[DT_AGG_NUM] = {
> > >   static void
> > >   dt_cg_agg(dt_pcb_t *pcb, dt_node_t *dnp, dt_irlist_t *dlp, dt_regset_t *drp)
> > >   {
> > > +	dtrace_hdl_t	*dtp = pcb->pcb_hdl;
> > >   	dt_ident_t	*aid, *fid;
> > >   	dt_cg_aggfunc_f	*aggfp;
> > > @@ -6744,7 +6745,13 @@ dt_cg_agg(dt_pcb_t *pcb, dt_node_t *dnp, dt_irlist_t *dlp, dt_regset_t *drp)
> > >   	assert(aggfp != NULL);
> > >   	(*aggfp)(pcb, aid, dnp, dlp, drp);
> > > -	dt_aggid_add(pcb->pcb_hdl, aid);
> > > +
> > > +	/*
> > > +	 * Add this aggid if we see it for the first time.  We do this after
> > > +	 * the BPF code generation because aggfp() sets aid->di_size.
> > > +	 */
> > > +	if (dt_aggid_lookup(dtp, aid->di_id, NULL) == -1)
> > > +		dt_aggid_add(dtp, aid);
> > >   }
> > >   void
> > > diff --git a/libdtrace/dt_consume.c b/libdtrace/dt_consume.c
> > > index 973084d6..7e6622d1 100644
> > > --- a/libdtrace/dt_consume.c
> > > +++ b/libdtrace/dt_consume.c
> > > @@ -1001,8 +1001,8 @@ dt_print_bytes(dtrace_hdl_t *dtp, FILE *fp, caddr_t addr,
> > >   		return dt_printf(dtp, fp, "  %-*s", width, s);
> > >   	}
> > > -        /* print the bytes raw */
> > > -        return dt_print_rawbytes(dtp, fp, addr, nbytes);
> > > +	/* print the bytes raw */
> > > +	return dt_print_rawbytes(dtp, fp, addr, nbytes);
> > >   }
> > >   #ifdef FIXME
> > > @@ -1519,7 +1519,7 @@ dt_clear_agg(const dtrace_aggdata_t *aggdata, void *arg)
> > >   	dtrace_aggdesc_t	*agg = aggdata->dtada_desc;
> > >   	dtrace_aggid_t		id = *((dtrace_aggid_t *)arg);
> > > -	if (agg->dtagd_nrecs == 0)
> > > +	if (agg->dtagd_nkrecs == 0)
> > >   		return DTRACE_AGGWALK_NEXT;
> > >   	if (agg->dtagd_varid != id)
> > > @@ -1540,7 +1540,7 @@ dt_trunc_agg(const dtrace_aggdata_t *aggdata, void *arg)
> > >   	dtrace_aggdesc_t	*agg = aggdata->dtada_desc;
> > >   	dtrace_aggid_t		id = trunc->dttd_id;
> > > -	if (agg->dtagd_nrecs == 0)
> > > +	if (agg->dtagd_nkrecs == 0)
> > >   		return DTRACE_AGGWALK_NEXT;
> > >   	if (agg->dtagd_varid != id)
> > > @@ -1620,10 +1620,14 @@ dt_trunc(dtrace_hdl_t *dtp, caddr_t base, dtrace_recdesc_t *rec)
> > >   static int
> > >   dt_print_datum(dtrace_hdl_t *dtp, FILE *fp, dtrace_recdesc_t *rec,
> > > -	       caddr_t addr, size_t size, uint64_t normal, uint64_t sig)
> > > +	       caddr_t addr, uint64_t normal, uint64_t sig)
> > >   {
> > >   	int			err;
> > >   	dtrace_actkind_t	act = rec->dtrd_action;
> > > +	size_t			size = rec->dtrd_size;
> > > +
> > > +	/* Apply the record offset to the base address. */
> > > +	addr += rec->dtrd_offset;
> > >   	switch (act) {
> > >   	case DTRACEACT_STACK:
> > > @@ -1697,31 +1701,22 @@ dt_print_datum(dtrace_hdl_t *dtp, FILE *fp, dtrace_recdesc_t *rec,
> > >   static int
> > >   dt_print_aggs(const dtrace_aggdata_t **aggsdata, int naggvars, void *arg)
> > >   {
> > > -	int			i, aggact = 0;
> > > +	int			i;
> > >   	dtrace_print_aggdata_t	*pd = arg;
> > >   	const dtrace_aggdata_t	*aggdata = aggsdata[0];
> > >   	dtrace_aggdesc_t	*agg = aggdata->dtada_desc;
> > >   	FILE			*fp = pd->dtpa_fp;
> > >   	dtrace_hdl_t		*dtp = pd->dtpa_dtp;
> > >   	dtrace_recdesc_t	*rec;
> > > -	caddr_t			addr;
> > > -	size_t			size;
> > >   	/*
> > >   	 * Iterate over each record description in the key, printing the traced
> > > -	 * data.
> > > +	 * data.  The first record is skipped because it holds the variable ID.
> > >   	 */
> > > -	for (i = 0; i < agg->dtagd_nrecs; i++) {
> > > -		rec = &agg->dtagd_recs[i];
> > > -		addr = aggdata->dtada_data + rec->dtrd_offset;
> > > -		size = rec->dtrd_size;
> > > -
> > > -		if (DTRACEACT_ISAGG(rec->dtrd_action)) {
> > > -			aggact = i;
> > > -			break;
> > > -		}
> > > +	for (i = 1; i < agg->dtagd_nkrecs; i++) {
> > > +		rec = &agg->dtagd_krecs[i];
> > > -		if (dt_print_datum(dtp, fp, rec, addr, size, 1, 0) < 0)
> > > +		if (dt_print_datum(dtp, fp, rec, aggdata->dtada_data, 1, 0) < 0)
> > >   			return DTRACE_AGGWALK_ERROR;
> > >   		if (dt_buffered_flush(dtp, NULL, rec, aggdata,
> > > @@ -1734,14 +1729,12 @@ dt_print_aggs(const dtrace_aggdata_t **aggsdata, int naggvars, void *arg)
> > >   		aggdata = aggsdata[i];
> > >   		agg = aggdata->dtada_desc;
> > > -		rec = &agg->dtagd_recs[aggact];
> > > -		addr = aggdata->dtada_data + rec->dtrd_offset;
> > > -		size = agg->dtagd_size;
> > > +		rec = &agg->dtagd_drecs[DT_AGGDATA_RECORD];
> > >   		assert(DTRACEACT_ISAGG(rec->dtrd_action));
> > >   		normal = aggdata->dtada_desc->dtagd_normal;
> > > -		if (dt_print_datum(dtp, fp, rec, addr, size, normal,
> > > +		if (dt_print_datum(dtp, fp, rec, aggdata->dtada_data, normal,
> > >   				   agg->dtagd_sig) < 0)
> > >   			return DTRACE_AGGWALK_ERROR;
> > > @@ -1781,7 +1774,7 @@ dt_print_agg(const dtrace_aggdata_t *aggdata, void *arg)
> > >   		 * variable that we should print -- skip any other aggregations
> > >   		 * that we encounter.
> > >   		 */
> > > -		if (agg->dtagd_nrecs == 0)
> > > +		if (agg->dtagd_nkrecs == 0)
> > >   			return DTRACE_AGGWALK_NEXT;
> > >   		if (aggvarid != agg->dtagd_varid)
> 
> Here and a few other places... what does "nkrecs==0" mean?  How could this
> happen?

That remains to be seen.  It is an artifact from the legacy version and I have
not quite determined how that was even possible in that original version.

> > > diff --git a/libdtrace/dt_impl.h b/libdtrace/dt_impl.h
> > > index 0e3cf218..704cfed0 100644
> > > --- a/libdtrace/dt_impl.h
> > > +++ b/libdtrace/dt_impl.h
> > > @@ -188,6 +188,9 @@ typedef struct dt_ahash {
> > >   	size_t		dtah_size;		/* size of hash table */
> > >   } dt_ahash_t;
> > > +#define DT_AGGDATA_COUNTER	0
> > > +#define DT_AGGDATA_RECORD	1
> > > +
> > >   /*
> > >    * Why do we need (only) 4 slots?  The maximum amount of string arguments to
> > >    * any function is 2, and if the result is a string as well, that means we may
> > > diff --git a/libdtrace/dt_map.c b/libdtrace/dt_map.c
> > > index 0098e374..cc09fff0 100644
> > > --- a/libdtrace/dt_map.c
> > > +++ b/libdtrace/dt_map.c
> > > @@ -241,11 +241,9 @@ dt_aggid_add(dtrace_hdl_t *dtp, const dt_ident_t *aid)
> > >   {
> > >   	dtrace_id_t		max;
> > >   	dtrace_aggdesc_t	*agg;
> > > -	dtrace_recdesc_t	*recs;
> > > +	dtrace_recdesc_t	*recs, *drecs;
> > >   	dtrace_aggid_t		id = aid->di_id;
> > > -	dt_ident_t		*fid = aid->di_iarg;
> > > -	int			i;
> > > -	uint_t			off = 0;
> > > +	int			nrecs = ((dt_idsig_t *)aid->di_data)->dis_argc;
> 
> nrecs is... nkrecs?  ndrecs?  Their sum?  Ah... none of these!  It isn't
> really a record count at all.  Other inadequate names include nargs (ick,
> no) or nkeys (ick, that gets us back into the problem that "keys" can refer
> to agg keys or BPF map keys).

nrecs is indeed nkrecs here, so yes, I will change the name.  And I will
initialize it as argc + 1, because that makes sense: this is where we move from
the aggregation key count to the count of records that comprise the BPF hashmap
key.

> Ah, how about instead of "nrecs=argc" using "nkrecs=argc+1"? Afterall, this
> function only uses nrecs in the expression "nrecs+1".
> 
> > >   	while (id >= (max = dtp->dt_maxagg) || dtp->dt_adesc == NULL) {
> > >   		dtrace_id_t		nmax = max ? (max << 1) : 1;
> > > @@ -274,39 +272,51 @@ dt_aggid_add(dtrace_hdl_t *dtp, const dt_ident_t *aid)
> > >   	if (agg == NULL)
> > >   		return dt_set_errno(dtp, EDT_NOMEM);
> > > -	/*
> > > -	 * Note the relationship between the aggregation storage
> > > -	 * size (di_size) and the aggregation data size (dtagd_size):
> > > -	 *     di_size = dtagd_size * (num copies) + (size of latch seq #)
> > > -	 */
> > >   	agg->dtagd_id = id;
> > >   	agg->dtagd_name = aid->di_name;
> > >   	agg->dtagd_sig = ((dt_idsig_t *)aid->di_data)->dis_auxinfo;
> > >   	agg->dtagd_normal = 1;
> > >   	agg->dtagd_varid = aid->di_id;
> > > -	agg->dtagd_size = (aid->di_size - sizeof(uint64_t)) / DT_AGG_NUM_COPIES;
> > > -	agg->dtagd_nrecs = agg->dtagd_size / sizeof(uint64_t);
> > > +	agg->dtagd_nkrecs = 1 + nrecs;		/* var ID and keys */
> > > -	recs = dt_calloc(dtp, agg->dtagd_nrecs, sizeof(dtrace_recdesc_t));
> > > +	/* We allocate all records at once (varid, keys, data counter, data). */
> > > +	recs = dt_calloc(dtp, 1 + nrecs + 2, sizeof(dtrace_recdesc_t));
> 
> The comment, referring to four items (varid, keys, data counter, data), does
> not line up with the code (1 + nrecs + 2).  If you go with the

Yes, it does:

	varid = 1
	keys = nrecs
	data counter = 1
	data = 1

So, altogether it is 1 + nrecs + 1 + 1 = 1 + nrecs + 2.

But as mentioned above, I will change nrecs to be nkrecs, defined as 1 + argc,
so the comment and code here will be updated accordingly.
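
In other words, the updated allocation will look roughly like this (a sketch
of the intended change, with nkrecs defined as 1 + argc):

	int	nkrecs = 1 + ((dt_idsig_t *)aid->di_data)->dis_argc;

	/*
	 * Layout: [0] variable ID, [1 .. nkrecs-1] aggregation keys,
	 *         [nkrecs] data counter, [nkrecs+1] aggregation data.
	 */
	recs = dt_calloc(dtp, nkrecs + 2, sizeof(dtrace_recdesc_t));

	agg->dtagd_nkrecs = nkrecs;
	agg->dtagd_krecs = recs;
	agg->dtagd_drecs = &recs[nkrecs];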

> nkrecs=nargs+1 suggestion, the code becomes nkrecs+2.  So maybe the comment
> could become:
>         /* We allocate krecs and drecs together. */
> 
> > >   	if (recs == NULL) {
> > >   		dt_free(dtp, agg);
> > >   		return dt_set_errno(dtp, EDT_NOMEM);
> > >   	}
> > > -	agg->dtagd_recs = recs;
> > > -
> > > -	for (i = 0; i < agg->dtagd_nrecs; i++) {
> > > -		dtrace_recdesc_t	*rec = &recs[i];
> > > -
> > > -		rec->dtrd_action = fid->di_id;
> > > -		rec->dtrd_size = sizeof(uint64_t);
> > > -		rec->dtrd_offset = off;
> > > -		rec->dtrd_alignment = sizeof(uint64_t);
> > > -		rec->dtrd_format = NULL;
> > > -		rec->dtrd_arg = 1;
> > > -
> > > -		off += sizeof(uint64_t);
> > > -	}
> > > +	/* Set up the variable ID record (first record in the keys). */
> > > +	recs[0].dtrd_action = DTRACEACT_DIFEXPR;
> > > +	recs[0].dtrd_size = sizeof(uint32_t);
> > > +	recs[0].dtrd_offset = 0;
> > > +	recs[0].dtrd_alignment = sizeof(uint32_t);
> > > +	recs[0].dtrd_format = NULL;
> > > +	recs[0].dtrd_arg = 1;
> > > +
> > > +	agg->dtagd_krecs = recs;
> > > +	agg->dtagd_ksize = recs[0].dtrd_size;	/* Updated when adding keys */
> > > +
> > > +	/* Set up the data records. */
> > > +	drecs = &recs[nrecs];
> > This needs to be:
> > 
> > 	drecs = &recs[1 + nrecs];
> 
> Yeah, or s/1 + nrecs/nkrecs/.
> 
> > > +
> > > +	drecs[DT_AGGDATA_COUNTER].dtrd_action = ((dt_ident_t *) aid->di_iarg)->di_id;
> > > +	drecs[DT_AGGDATA_COUNTER].dtrd_size = sizeof(uint64_t);
> > > +	drecs[DT_AGGDATA_COUNTER].dtrd_offset = 0;
> > > +	drecs[DT_AGGDATA_COUNTER].dtrd_alignment = sizeof(uint64_t);
> > > +	drecs[DT_AGGDATA_COUNTER].dtrd_format = NULL;
> > > +	drecs[DT_AGGDATA_COUNTER].dtrd_arg = 1;
> > > +
> > > +	drecs[DT_AGGDATA_RECORD].dtrd_action = ((dt_ident_t *) aid->di_iarg)->di_id;
> > > +	drecs[DT_AGGDATA_RECORD].dtrd_size = aid->di_size - sizeof(uint64_t);
> 
> For the time being, I guess you have to divide by DT_AGG_NUM_COPIES.

Ah yes, thanks.

> > > +	drecs[DT_AGGDATA_RECORD].dtrd_offset = sizeof(uint64_t);
> > > +	drecs[DT_AGGDATA_RECORD].dtrd_alignment = sizeof(uint64_t);
> > > +	drecs[DT_AGGDATA_RECORD].dtrd_format = NULL;
> > > +	drecs[DT_AGGDATA_RECORD].dtrd_arg = 1;
> > > +
> > > +	agg->dtagd_drecs = drecs;
> > > +	agg->dtagd_dsize = drecs[DT_AGGDATA_COUNTER].dtrd_size +
> > > +			   drecs[DT_AGGDATA_RECORD].dtrd_size;
> > >   	dtp->dt_adesc[id] = agg;
> > > @@ -338,7 +348,8 @@ dt_aggid_destroy(dtrace_hdl_t *dtp)
> > >   	for (i = 0; i < dtp->dt_maxagg; i++) {
> > >   		if (dtp->dt_adesc[i] != NULL) {
> > > -			dt_free(dtp, dtp->dt_adesc[i]->dtagd_recs);
> > > +			/* Freeing dtagd_recs also frees dtagd_drecs. */
> 
> s/dtagd_recs/dtagd_krecs/
> 
> > > +			dt_free(dtp, dtp->dt_adesc[i]->dtagd_krecs);
> > >   			dt_free(dtp, dtp->dt_adesc[i]);
> > >   		}
> > >   	}
> > > diff --git a/libdtrace/dt_printf.c b/libdtrace/dt_printf.c
> > > index 4fb20b23..4f0e6b8a 100644
> > > --- a/libdtrace/dt_printf.c
> > > +++ b/libdtrace/dt_printf.c
> > > @@ -1264,10 +1264,12 @@ dt_printf_format(dtrace_hdl_t *dtp, FILE *fp, const dt_pfargv_t *pfv,
> > >   	dtrace_aggdesc_t *agg;
> > >   	caddr_t lim = (caddr_t)buf + len, limit;
> > >   	char format[64] = "%";
> > > -	int i, aggrec = 0, curagg = -1;
> > > +	int i, curagg = -1;
> > >   	uint64_t normal, sig;
> > >   	/*
> > > +	 * FIXME: Comment needs rewrite.
> > > +	 *
> 
> Comment needs rewrite.
> 
> > >   	 * If we are formatting an aggregation, set 'aggrec' to the index of
> > >   	 * the final record description (the aggregation result) so we can use
> > >   	 * this record index with any conversion where DT_PFCONV_AGG is set.
> > > @@ -1280,13 +1282,8 @@ dt_printf_format(dtrace_hdl_t *dtp, FILE *fp, const dt_pfargv_t *pfv,
> > >   		assert(aggsdata != NULL);
> > >   		assert(naggvars > 0);
> > > -		if (nrecs == 0)
> > > -			return dt_set_errno(dtp, EDT_DMISMATCH);
> > > -
> > >   		curagg = naggvars > 1 ? 1 : 0;
> > >   		aggdata = aggsdata[0];
> > > -		aggrec = aggdata->dtada_desc->dtagd_nrecs - 1;
> > > -		nrecs--;
> > >   	}
> > >   	for (i = 0; i < pfv->pfv_argc; i++, pfd = pfd->pfd_next) {
> > > @@ -1380,11 +1377,10 @@ dt_printf_format(dtrace_hdl_t *dtp, FILE *fp, const dt_pfargv_t *pfv,
> > >   			if (curagg < naggvars - 1)
> > >   				curagg++;
> > > -			rec = &agg->dtagd_recs[aggrec];
> > > -			addr = aggdata->dtada_data;
> > > +			rec = &agg->dtagd_drecs[DT_AGGDATA_RECORD];
> > > +			addr = aggdata->dtada_data + rec->dtrd_offset;
> > >   			limit = addr + aggdata->dtada_size;
> > >   			normal = agg->dtagd_normal;
> > > -			size = agg->dtagd_size;
> > >   			sig = agg->dtagd_sig;
> > >   			flags = DTRACE_BUFDATA_AGGVAL;
> > >   		} else {
> > > @@ -1409,10 +1405,11 @@ dt_printf_format(dtrace_hdl_t *dtp, FILE *fp, const dt_pfargv_t *pfv,
> > >   			addr = (caddr_t)buf + rec->dtrd_offset;
> > >   			limit = lim;
> > >   			normal = 1;
> > > -			size = rec->dtrd_size;
> > >   			sig = 0;
> > >   		}
> > > +		size = rec->dtrd_size;
> > > +
> > >   		if (addr + size > limit) {
> > >   			dt_dprintf("bad size: addr=%p size=0x%x lim=%p\n",
> > >   			    (void *)addr, rec->dtrd_size, (void *)lim);
> > > @@ -1809,8 +1806,8 @@ static int
> > >   dt_fprinta(const dtrace_aggdata_t *adp, void *arg)
> > >   {
> > >   	const dtrace_aggdesc_t	*agg = adp->dtada_desc;
> > > -	const dtrace_recdesc_t	*rec = agg->dtagd_recs;
> > > -	uint_t			nrecs = agg->dtagd_nrecs;
> > > +	const dtrace_recdesc_t	*rec = agg->dtagd_krecs;
> > > +	uint_t			nrecs = agg->dtagd_nkrecs;
> > >   	dt_pfwalk_t		*pfw = arg;
> > >   	dtrace_hdl_t		*dtp = pfw->pfw_argv->pfv_dtp;
> 
> rec should perhaps not start with the varID record but with the first (agg)
> key record.  So rec has to be bumped up, and nrecs has to be decremented
> accordingly.  Admittedly, this bug predates your patch, but you fix the
> corresponding problem in dt_fprintas()...
> 
> > > @@ -1837,8 +1834,8 @@ dt_fprintas(const dtrace_aggdata_t **aggsdata, int naggvars, void *arg)
> > >   {
> > >   	const dtrace_aggdata_t	*aggdata = aggsdata[0];
> > >   	const dtrace_aggdesc_t	*agg = aggdata->dtada_desc;
> > > -	const dtrace_recdesc_t	*rec = agg->dtagd_recs;
> > > -	uint_t			nrecs = agg->dtagd_nrecs;
> > > +	const dtrace_recdesc_t	*rec = &agg->dtagd_krecs[1];
> > > +	uint_t			nrecs = agg->dtagd_nkrecs - 1;
> > >   	dt_pfwalk_t		*pfw = arg;
> > >   	dtrace_hdl_t		*dtp = pfw->pfw_argv->pfv_dtp;
> > >   	int			i;
> > > -- 
> > > 2.34.1
> > > 
> > > 