[DTrace-devel] [PATCH v4] Fix records for aggregation descriptions

Mon Aug 29 20:47:58 UTC 2022

The patch uses "keys" in two different ways:
     - aggregation keys (e.g., see the commit message)
     - BPF map keys (which adds varID)
This is confusing.  Arguably, one approach is to break varID off as its 
own record... e.g., dtagd_irec (to add to dtagd_krecs and dtagd_vrecs).  
I suspect no one would be enthusiastic about this approach, but changing 
the meaning of "keys" from one place to another is not clean.

The "latch sequence number" raises lots of questions.  First of all, 
what does it mean?  I assume, despite its name, that it is no longer a 
latch sequence number.  The patch also calls it a "data counter", though 
it apparently is not that either -- e.g., dt_agg_one_agg() copies the 
value for MIN/MAX but accumulates it otherwise.  So is it a flag to 
indicate the presence of data?  Is it useful in dtada_data, or will its 
value always be set to indicate the presence of data?  In the long run, 
will it even be used in the BPF map, or will its value always be set 
even in the BPF map to indicate the presence of data?  If the value only 
ever has one setting, why have DT_AGGDATA_COUNTER at all?

On 8/25/22 17:22, Kris Van Hees via DTrace-devel wrote:
> On Thu, Aug 25, 2022 at 04:15:01PM -0400, Kris Van Hees via DTrace-devel wrote:
>> The implementation of aggregations erronously registered the value of
>> an aggregation as a series of 64-bit values, each with their own record
>> description.  However, aggregatios in their most generic form should
>> comprise the following records:
>>
>> (1) variable ID
>> (2) keys (values of the index tuple, if any)
>> (3) latch sequence number
>> (4) aggregation data
>>
>> (1) and (2) are records that identify the aggregation, whereas (3) and
>> (4) are data that is generated by the producer.

What does this mean?  E.g., (2) is also data that is generated by the 
producer.  The real point is presumably that (1) and (2) are used to 
form BPF map keys while (3) and (4) are part of BPF map values.

Incidentally, these typos:
   erronously
   aggregatios
   aggregtion

>> Change the aggregation record descriptions to match the layout described
>> above.
>>
>> Note that the variable ID is now stored as a 32-bit value whereas DTrace
>> used to store it as a 64-bit value.  This change is valid because the
>> variable ID is an identifier ID and therefore cannot exceed the range of
>> a 32-bit integer,
>>
>> Support for aggregtion keys will be added in later patches.
>>
>> Signed-off-by: Kris Van Hees <kris.van.hees at oracle.com>
>> Signed-off-by: Eugene Loh <eugene.loh at oracle.com>
>> Reviewed-by: Kris Van Hees <kris.van.hees at oracle.com>
>> ---
>>   cmd/dtrace.c              |  2 +-
>>   include/dtrace/metadesc.h |  8 +++--
>>   libdtrace/dt_aggregate.c  | 74 +++++++++++++++++++--------------------
>>   libdtrace/dt_cg.c         |  9 ++++-
>>   libdtrace/dt_consume.c    | 41 +++++++++-------------
>>   libdtrace/dt_impl.h       |  3 ++
>>   libdtrace/dt_map.c        | 65 ++++++++++++++++++++--------------
>>   libdtrace/dt_printf.c     | 25 ++++++-------
>>   8 files changed, 119 insertions(+), 108 deletions(-)
>>
>> diff --git a/cmd/dtrace.c b/cmd/dtrace.c
>> index 22ac28c3..350cab94 100644
>> --- a/cmd/dtrace.c
>> +++ b/cmd/dtrace.c
>> @@ -682,7 +682,7 @@ bufhandler(const dtrace_bufdata_t *bufdata, void *arg)
>>   		BUFDUMPSTR(desc, dtagd_name);
>>   		BUFDUMP(desc, dtagd_varid);
>>   		BUFDUMP(desc, dtagd_id);
>> -		BUFDUMP(desc, dtagd_nrecs);
>> +		BUFDUMP(desc, dtagd_nkrecs + 1);
>>   		BUFDUMPHDR("");
>>   	}
>>   

Why nkrecs+1 rather than simply nkrecs?  It seems that we do not use or 
test "dtrace -B" in the first place.  Nevertheless, it is hard for me to 
see how output like
     dtagd_nkrecs + 1 = 2
is more interesting than output like
     dtagd_nkrecs = 1

>> diff --git a/include/dtrace/metadesc.h b/include/dtrace/metadesc.h
>> index 20a5b8a0..f75b0a8f 100644
>> --- a/include/dtrace/metadesc.h
>> +++ b/include/dtrace/metadesc.h
>> @@ -57,9 +57,11 @@ typedef struct dtrace_aggdesc {
>>   	dtrace_aggid_t dtagd_id;		/* aggregation ID */
>>   	uint64_t dtagd_sig;			/* aggregation signature */
>>   	uint64_t dtagd_normal;			/* aggregation normalization */
>> -	uint32_t dtagd_size;			/* size in bytes */
>> -	int dtagd_nrecs;			/* number of records */
>> -	dtrace_recdesc_t *dtagd_recs;		/* record descriptions */
>> +	uint32_t dtagd_ksize;			/* keys size in bytes */
>> +	uint32_t dtagd_dsize;			/* data size in bytes */
>> +	int dtagd_nkrecs;			/* number of key records */
>> +	dtrace_recdesc_t *dtagd_krecs;		/* key record descriptions */
>> +	dtrace_recdesc_t *dtagd_drecs;		/* data record descriptions */
>>   } dtrace_aggdesc_t;
>>   
>>   typedef struct dtrace_fmtdesc {
>> diff --git a/libdtrace/dt_aggregate.c b/libdtrace/dt_aggregate.c
>> index fa65ecd3..f810efd3 100644
>> --- a/libdtrace/dt_aggregate.c
>> +++ b/libdtrace/dt_aggregate.c
>> @@ -390,7 +390,7 @@ dt_aggregate_aggid(dt_ahashent_t *ent)
>>   {
>>   	dtrace_aggdesc_t *agg = ent->dtahe_data.dtada_desc;
>>   	caddr_t data = ent->dtahe_data.dtada_data;
>> -	dtrace_recdesc_t *rec = agg->dtagd_recs;
>> +	dtrace_recdesc_t *rec = agg->dtagd_krecs;
>>   
>>   	/*
>>   	 * First, we'll check the variable ID in the aggdesc.  If it's valid,
>> @@ -406,8 +406,8 @@ dt_aggregate_aggid(dt_ahashent_t *ent)
>>   	return agg->dtagd_varid;
>>   }
>>   
>> -typedef void (*agg_cpu_f)(dt_ident_t *aid,
>> -    int64_t *dst, int64_t *src, uint_t realsz);
>> +typedef void (*agg_cpu_f)(dt_ident_t *aid, int64_t *dst, int64_t *src,
>> +			  uint_t datasz);
>>   
>>   typedef struct dt_snapstate {
>>   	dtrace_hdl_t	*dtp;
>> @@ -418,34 +418,34 @@ typedef struct dt_snapstate {
>>   } dt_snapstate_t;
>>   
>>   static void
>> -dt_agg_one_copy(dt_ident_t *aid, int64_t *dst, int64_t *src, uint_t realsz)
>> +dt_agg_one_copy(dt_ident_t *aid, int64_t *dst, int64_t *src, uint_t datasz)
>>   {
>> -	src++;  /* skip latch sequence number */
>> -	memcpy(dst, src, realsz);
>> +	memcpy(dst, src, datasz);
>>   }
>>   
>>   static void
>> -dt_agg_one_agg(dt_ident_t *aid, int64_t *dst, int64_t *src, uint_t realsz)
>> +dt_agg_one_agg(dt_ident_t *aid, int64_t *dst, int64_t *src, uint_t datasz)
>>   {
>>   	uint_t i, cnt;
>>   
>>   	if (*src == 0)
>>   		return;
>>   
>> -	src++;  /* skip latch sequence number */
>>   	switch (((dt_ident_t *)aid->di_iarg)->di_id) {
>>   	case DT_AGG_MAX:
>> +		*dst++ = *src++;	/* copy latch sequence number */
>>   		if (*src > *dst)
>>   			*dst = *src;
>>   
>>   		break;
>>   	case DT_AGG_MIN:
>> +		*dst++ = *src++;	/* copy latch sequence number */
>>   		if (*src < *dst)
>>   			*dst = *src;
>>   
>>   		break;
>>   	default:
>> -		for (i = 0, cnt = realsz / sizeof(int64_t);
>> +		for (i = 0, cnt = datasz / sizeof(int64_t);
>>   		     i < cnt; i++, dst++, src++)
>>   			*dst += *src;
>>   	}

Here is what I mentioned earlier:  the first word is copied for MIN/MAX 
but accumulated otherwise.  So, how should we describe that first word?  
Neither latch sequence number nor data counter (the two phrases used by 
this patch) makes sense.

>> @@ -461,7 +461,7 @@ dt_aggregate_snap_one(dt_idhash_t *dhp, dt_ident_t *aid, dt_snapstate_t *st)
>>   	uint64_t		hval = aid->di_id;
>>   	size_t			ndx = hval % agh->dtah_size;
>>   	int			rval;
>> -	uint_t			i, realsz;
>> +	uint_t			i, datasz;
>>   	int64_t			*src;
>>   
>>   	rval = dt_aggid_lookup(st->dtp, aid->di_id, &agg);
>> @@ -471,23 +471,22 @@ dt_aggregate_snap_one(dt_idhash_t *dhp, dt_ident_t *aid, dt_snapstate_t *st)
>>   	/* point to the latch sequence number */
>>   	src = (int64_t *)(st->buf + aid->di_offset);
>>   
>> -	/* real size excludes latch sequence number and second data copy, if any */
>> -	realsz = (aid->di_size - sizeof(uint64_t)) / DT_AGG_NUM_COPIES;
>> +	datasz = agg->dtagd_dsize;
>>   
>>   	/* See if we already have an entry for this aggregation. */
>>   	for (h = agh->dtah_hash[ndx]; h != NULL; h = h->dtahe_next) {
>> -		if (h->dtahe_hval != hval || h->dtahe_size != realsz)
>> +		if (h->dtahe_hval != hval || h->dtahe_size != datasz)
>>   			continue;
>>   
>>   		/* Entry found - process the data. */
>>   		agd = &h->dtahe_data;
>>   
>> -		st->fun(aid, (int64_t *)agd->dtada_data, src, realsz);
>> +		st->fun(aid, (int64_t *)agd->dtada_data, src, datasz);
>>   
>>   		/* If we keep per-CPU data - process that as well. */
>>   		if (agd->dtada_percpu != NULL)
>>   			st->fun(aid, (int64_t *)agd->dtada_percpu[st->cpu],
>> -			    src, realsz);
>> +			    src, datasz);
>>   
>>   		return 0;
>>   	}
>> @@ -502,20 +501,19 @@ dt_aggregate_snap_one(dt_idhash_t *dhp, dt_ident_t *aid, dt_snapstate_t *st)
>>   		return dt_set_errno(st->dtp, EDT_NOMEM);
>>   
>>   	agd = &h->dtahe_data;
>> -	agd->dtada_data = dt_alloc(st->dtp, realsz);
>> +	agd->dtada_data = dt_alloc(st->dtp, datasz);
>>   	if (agd->dtada_data == NULL) {
>>   		dt_free(st->dtp, h);
>>   		return dt_set_errno(st->dtp, EDT_NOMEM);
>>   	}
>>   
>> -	src++;
>> -	memcpy(agd->dtada_data, src, realsz);
>> -	agd->dtada_size = realsz;
>> +	memcpy(agd->dtada_data, src, datasz);
>> +	agd->dtada_size = datasz;
>>   	agd->dtada_desc = agg;
>>   	agd->dtada_hdl = st->dtp;
>>   
>>   	h->dtahe_hval = hval;
>> -	h->dtahe_size = realsz;
>> +	h->dtahe_size = datasz;
>>   
>>   	if (st->agp->dtat_flags & DTRACE_A_PERCPU) {
>>   		char	**percpu = dt_calloc(st->dtp,
>> @@ -530,7 +528,7 @@ dt_aggregate_snap_one(dt_idhash_t *dhp, dt_ident_t *aid, dt_snapstate_t *st)
>>   		}
>>   
>>   		for (i = 0; i <= st->dtp->dt_conf.max_cpuid; i++) {
>> -			percpu[i] = dt_zalloc(st->dtp, realsz);
>> +			percpu[i] = dt_zalloc(st->dtp, datasz);
>>   			if (percpu[i] == NULL) {
>>   				while (--i >= 0)
>>   					dt_free(st->dtp, percpu[i]);
>> @@ -541,7 +539,7 @@ dt_aggregate_snap_one(dt_idhash_t *dhp, dt_ident_t *aid, dt_snapstate_t *st)
>>   			}
>>   		}
>>   
>> -		memcpy(percpu[st->cpu], src, realsz);
>> +		memcpy(percpu[st->cpu], src, datasz);
>>   		agd->dtada_percpu = percpu;
>>   	}
>>   
>> @@ -617,10 +615,10 @@ dt_aggregate_hashcmp(const void *lhs, const void *rhs)
>>   	dtrace_aggdesc_t *lagg = lh->dtahe_data.dtada_desc;
>>   	dtrace_aggdesc_t *ragg = rh->dtahe_data.dtada_desc;
>>   
>> -	if (lagg->dtagd_nrecs < ragg->dtagd_nrecs)
>> +	if (lagg->dtagd_nkrecs < ragg->dtagd_nkrecs)
>>   		return DT_LESSTHAN;
>>   
>> -	if (lagg->dtagd_nrecs > ragg->dtagd_nrecs)
>> +	if (lagg->dtagd_nkrecs > ragg->dtagd_nkrecs)
>>   		return DT_GREATERTHAN;
>>   
>>   	return 0;
>> @@ -659,8 +657,8 @@ dt_aggregate_keycmp(const void *lhs, const void *rhs)
>>   	if ((rval = dt_aggregate_hashcmp(lhs, rhs)) != 0)
>>   		return rval;
>>   
>> -	nrecs = lagg->dtagd_nrecs - 1;
>> -	assert(nrecs == ragg->dtagd_nrecs - 1);
>> +	nrecs = lagg->dtagd_nkrecs;
>> +	assert(nrecs == ragg->dtagd_nkrecs);
>>   
>>   	keypos = dt_keypos + 1 >= nrecs ? 0 : dt_keypos;
>>   
>> @@ -671,8 +669,8 @@ dt_aggregate_keycmp(const void *lhs, const void *rhs)
>>   		if (ndx >= nrecs)
>>   			ndx = ndx - nrecs + 1;
>>   
>> -		lrec = &lagg->dtagd_recs[ndx];
>> -		rrec = &ragg->dtagd_recs[ndx];
>> +		lrec = &lagg->dtagd_krecs[ndx];
>> +		rrec = &ragg->dtagd_krecs[ndx];
>>   
>>   		ldata = lh->dtahe_data.dtada_data + lrec->dtrd_offset;
>>   		rdata = rh->dtahe_data.dtada_data + rrec->dtrd_offset;
>> @@ -772,18 +770,18 @@ dt_aggregate_valcmp(const void *lhs, const void *rhs)
>>   	if ((rval = dt_aggregate_hashcmp(lhs, rhs)) != 0)
>>   		return rval;
>>   
>> -	if (lagg->dtagd_nrecs > ragg->dtagd_nrecs)
>> +	if (lagg->dtagd_nkrecs > ragg->dtagd_nkrecs)
>>   		return DT_GREATERTHAN;
>>   
>> -	if (lagg->dtagd_nrecs < ragg->dtagd_nrecs)
>> +	if (lagg->dtagd_nkrecs < ragg->dtagd_nkrecs)
>>   		return DT_LESSTHAN;
>>   
>> -	if (lagg->dtagd_nrecs <= 0)
>> +	if (lagg->dtagd_nkrecs <= 0)
>>   	    return 0;
>>   
>> -	for (i = 0; i < lagg->dtagd_nrecs; i++) {
>> -		lrec = &lagg->dtagd_recs[i];
>> -		rrec = &ragg->dtagd_recs[i];
>> +	for (i = 0; i < lagg->dtagd_nkrecs; i++) {
>> +		lrec = &lagg->dtagd_krecs[i];
>> +		rrec = &ragg->dtagd_krecs[i];
>>   
>>   		if (lrec->dtrd_offset < rrec->dtrd_offset)
>>   			return DT_LESSTHAN;

We're cutting the loop short to exclude iteration over the data record.  
I'm a little unclear what the ramifications are:

1)  At the very least, we do not set lrec and rrec to correspond to the 
data record.  I think you fix that in a later patch, but it should be 
done here in this patch.

2)  We omit the offset and action checks for the data record... are 
there any repercussions to this?

>> @@ -1089,7 +1087,7 @@ dt_aggwalk_rval(dtrace_hdl_t *dtp, dt_ahashent_t *h, int rval)
>>   		uint32_t size, offs = 0;
>>   
>>   		aggdesc = h->dtahe_data.dtada_desc;
>> -		rec = &aggdesc->dtagd_recs[aggdesc->dtagd_nrecs - 1];
>> +		rec = &aggdesc->dtagd_drecs[DT_AGGDATA_RECORD];
>>   		size = rec->dtrd_size;
>>   		data = &h->dtahe_data;
>>   
>> @@ -1514,11 +1512,11 @@ dtrace_aggregate_walk_joined(dtrace_hdl_t *dtp, dtrace_aggid_t *aggvars,
>>   				 * cons up the zaggdata entry for it.
>>   				 */
>>   				aggdata = &zaggdata[i].dtahe_data;
>> -				aggdata->dtada_size = agg->dtagd_size;
>> +				aggdata->dtada_size = agg->dtagd_dsize;
>>   				aggdata->dtada_desc = agg;
>>   				aggdata->dtada_hdl = dtp;
>>   				zaggdata[i].dtahe_hval = 0;
>> -				zaggdata[i].dtahe_size = agg->dtagd_size;
>> +				zaggdata[i].dtahe_size = agg->dtagd_dsize;
>>   				break;
>>   			}
>>   
>> @@ -1773,7 +1771,7 @@ dtrace_aggregate_clear(dtrace_hdl_t *dtp)
>>   
>>   	for (h = hash->dtah_all; h != NULL; h = h->dtahe_nextall) {
>>   		aggdesc = h->dtahe_data.dtada_desc;
>> -		rec = &aggdesc->dtagd_recs[aggdesc->dtagd_nrecs - 1];
>> +		rec = &aggdesc->dtagd_drecs[DT_AGGDATA_RECORD];
>>   		data = &h->dtahe_data;
>>   
>>   		memset(&data->dtada_data[rec->dtrd_offset], 0, rec->dtrd_size);
>> diff --git a/libdtrace/dt_cg.c b/libdtrace/dt_cg.c
>> index ebd41906..7bff5895 100644
>> --- a/libdtrace/dt_cg.c
>> +++ b/libdtrace/dt_cg.c
>> @@ -6712,6 +6712,7 @@ static dt_cg_aggfunc_f *_dt_cg_agg[DT_AGG_NUM] = {
>>   static void
>>   dt_cg_agg(dt_pcb_t *pcb, dt_node_t *dnp, dt_irlist_t *dlp, dt_regset_t *drp)
>>   {
>> +	dtrace_hdl_t	*dtp = pcb->pcb_hdl;
>>   	dt_ident_t	*aid, *fid;
>>   	dt_cg_aggfunc_f	*aggfp;
>>   
>> @@ -6744,7 +6745,13 @@ dt_cg_agg(dt_pcb_t *pcb, dt_node_t *dnp, dt_irlist_t *dlp, dt_regset_t *drp)
>>   	assert(aggfp != NULL);
>>   
>>   	(*aggfp)(pcb, aid, dnp, dlp, drp);
>> -	dt_aggid_add(pcb->pcb_hdl, aid);
>> +
>> +	/*
>> +	 * Add this aggid if we see it for the first time.  We do this after
>> +	 * the BPF code generation because aggfp() sets aid->di_size.
>> +	 */
>> +	if (dt_aggid_lookup(dtp, aid->di_id, NULL) == -1)
>> +		dt_aggid_add(dtp, aid);
>>   }
>>   
>>   void
>> diff --git a/libdtrace/dt_consume.c b/libdtrace/dt_consume.c
>> index 973084d6..7e6622d1 100644
>> --- a/libdtrace/dt_consume.c
>> +++ b/libdtrace/dt_consume.c
>> @@ -1001,8 +1001,8 @@ dt_print_bytes(dtrace_hdl_t *dtp, FILE *fp, caddr_t addr,
>>   		return dt_printf(dtp, fp, "  %-*s", width, s);
>>   	}
>>   
>> -        /* print the bytes raw */
>> -        return dt_print_rawbytes(dtp, fp, addr, nbytes);
>> +	/* print the bytes raw */
>> +	return dt_print_rawbytes(dtp, fp, addr, nbytes);
>>   }
>>   
>>   #ifdef FIXME
>> @@ -1519,7 +1519,7 @@ dt_clear_agg(const dtrace_aggdata_t *aggdata, void *arg)
>>   	dtrace_aggdesc_t	*agg = aggdata->dtada_desc;
>>   	dtrace_aggid_t		id = *((dtrace_aggid_t *)arg);
>>   
>> -	if (agg->dtagd_nrecs == 0)
>> +	if (agg->dtagd_nkrecs == 0)
>>   		return DTRACE_AGGWALK_NEXT;
>>   
>>   	if (agg->dtagd_varid != id)
>> @@ -1540,7 +1540,7 @@ dt_trunc_agg(const dtrace_aggdata_t *aggdata, void *arg)
>>   	dtrace_aggdesc_t	*agg = aggdata->dtada_desc;
>>   	dtrace_aggid_t		id = trunc->dttd_id;
>>   
>> -	if (agg->dtagd_nrecs == 0)
>> +	if (agg->dtagd_nkrecs == 0)
>>   		return DTRACE_AGGWALK_NEXT;
>>   
>>   	if (agg->dtagd_varid != id)
>> @@ -1620,10 +1620,14 @@ dt_trunc(dtrace_hdl_t *dtp, caddr_t base, dtrace_recdesc_t *rec)
>>   
>>   static int
>>   dt_print_datum(dtrace_hdl_t *dtp, FILE *fp, dtrace_recdesc_t *rec,
>> -	       caddr_t addr, size_t size, uint64_t normal, uint64_t sig)
>> +	       caddr_t addr, uint64_t normal, uint64_t sig)
>>   {
>>   	int			err;
>>   	dtrace_actkind_t	act = rec->dtrd_action;
>> +	size_t			size = rec->dtrd_size;
>> +
>> +	/* Apply the record offset to the base address. */
>> +	addr += rec->dtrd_offset;
>>   
>>   	switch (act) {
>>   	case DTRACEACT_STACK:
>> @@ -1697,31 +1701,22 @@ dt_print_datum(dtrace_hdl_t *dtp, FILE *fp, dtrace_recdesc_t *rec,
>>   static int
>>   dt_print_aggs(const dtrace_aggdata_t **aggsdata, int naggvars, void *arg)
>>   {
>> -	int			i, aggact = 0;
>> +	int			i;
>>   	dtrace_print_aggdata_t	*pd = arg;
>>   	const dtrace_aggdata_t	*aggdata = aggsdata[0];
>>   	dtrace_aggdesc_t	*agg = aggdata->dtada_desc;
>>   	FILE			*fp = pd->dtpa_fp;
>>   	dtrace_hdl_t		*dtp = pd->dtpa_dtp;
>>   	dtrace_recdesc_t	*rec;
>> -	caddr_t			addr;
>> -	size_t			size;
>>   
>>   	/*
>>   	 * Iterate over each record description in the key, printing the traced
>> -	 * data.
>> +	 * data.  The first record is skipped because it holds the variable ID.
>>   	 */
>> -	for (i = 0; i < agg->dtagd_nrecs; i++) {
>> -		rec = &agg->dtagd_recs[i];
>> -		addr = aggdata->dtada_data + rec->dtrd_offset;
>> -		size = rec->dtrd_size;
>> -
>> -		if (DTRACEACT_ISAGG(rec->dtrd_action)) {
>> -			aggact = i;
>> -			break;
>> -		}
>> +	for (i = 1; i < agg->dtagd_nkrecs; i++) {
>> +		rec = &agg->dtagd_krecs[i];
>>   
>> -		if (dt_print_datum(dtp, fp, rec, addr, size, 1, 0) < 0)
>> +		if (dt_print_datum(dtp, fp, rec, aggdata->dtada_data, 1, 0) < 0)
>>   			return DTRACE_AGGWALK_ERROR;
>>   
>>   		if (dt_buffered_flush(dtp, NULL, rec, aggdata,
>> @@ -1734,14 +1729,12 @@ dt_print_aggs(const dtrace_aggdata_t **aggsdata, int naggvars, void *arg)
>>   
>>   		aggdata = aggsdata[i];
>>   		agg = aggdata->dtada_desc;
>> -		rec = &agg->dtagd_recs[aggact];
>> -		addr = aggdata->dtada_data + rec->dtrd_offset;
>> -		size = agg->dtagd_size;
>> +		rec = &agg->dtagd_drecs[DT_AGGDATA_RECORD];
>>   
>>   		assert(DTRACEACT_ISAGG(rec->dtrd_action));
>>   		normal = aggdata->dtada_desc->dtagd_normal;
>>   
>> -		if (dt_print_datum(dtp, fp, rec, addr, size, normal,
>> +		if (dt_print_datum(dtp, fp, rec, aggdata->dtada_data, normal,
>>   				   agg->dtagd_sig) < 0)
>>   			return DTRACE_AGGWALK_ERROR;
>>   
>> @@ -1781,7 +1774,7 @@ dt_print_agg(const dtrace_aggdata_t *aggdata, void *arg)
>>   		 * variable that we should print -- skip any other aggregations
>>   		 * that we encounter.
>>   		 */
>> -		if (agg->dtagd_nrecs == 0)
>> +		if (agg->dtagd_nkrecs == 0)
>>   			return DTRACE_AGGWALK_NEXT;
>>   
>>   		if (aggvarid != agg->dtagd_varid)

Here and a few other places... what does "nkrecs==0" mean?  How could 
this happen?

>> diff --git a/libdtrace/dt_impl.h b/libdtrace/dt_impl.h
>> index 0e3cf218..704cfed0 100644
>> --- a/libdtrace/dt_impl.h
>> +++ b/libdtrace/dt_impl.h
>> @@ -188,6 +188,9 @@ typedef struct dt_ahash {
>>   	size_t		dtah_size;		/* size of hash table */
>>   } dt_ahash_t;
>>   
>> +#define DT_AGGDATA_COUNTER	0
>> +#define DT_AGGDATA_RECORD	1
>> +
>>   /*
>>    * Why do we need (only) 4 slots?  The maximum amount of string arguments to
>>    * any function is 2, and if the result is a string as well, that means we may
>> diff --git a/libdtrace/dt_map.c b/libdtrace/dt_map.c
>> index 0098e374..cc09fff0 100644
>> --- a/libdtrace/dt_map.c
>> +++ b/libdtrace/dt_map.c
>> @@ -241,11 +241,9 @@ dt_aggid_add(dtrace_hdl_t *dtp, const dt_ident_t *aid)
>>   {
>>   	dtrace_id_t		max;
>>   	dtrace_aggdesc_t	*agg;
>> -	dtrace_recdesc_t	*recs;
>> +	dtrace_recdesc_t	*recs, *drecs;
>>   	dtrace_aggid_t		id = aid->di_id;
>> -	dt_ident_t		*fid = aid->di_iarg;
>> -	int			i;
>> -	uint_t			off = 0;
>> +	int			nrecs = ((dt_idsig_t *)aid->di_data)->dis_argc;

nrecs is... nkrecs?  ndrecs?  Their sum?  Ah... none of these!  It isn't 
really a record count at all.  Other inadequate names include nargs 
(ick, no) or nkeys (ick, that gets us back into the problem that "keys" 
can refer to agg keys or BPF map keys).

Ah, how about instead of "nrecs=argc" using "nkrecs=argc+1"? Afterall, 
this function only uses nrecs in the expression "nrecs+1".

>>   	while (id >= (max = dtp->dt_maxagg) || dtp->dt_adesc == NULL) {
>>   		dtrace_id_t		nmax = max ? (max << 1) : 1;
>> @@ -274,39 +272,51 @@ dt_aggid_add(dtrace_hdl_t *dtp, const dt_ident_t *aid)
>>   	if (agg == NULL)
>>   		return dt_set_errno(dtp, EDT_NOMEM);
>>   
>> -	/*
>> -	 * Note the relationship between the aggregation storage
>> -	 * size (di_size) and the aggregation data size (dtagd_size):
>> -	 *     di_size = dtagd_size * (num copies) + (size of latch seq #)
>> -	 */
>>   	agg->dtagd_id = id;
>>   	agg->dtagd_name = aid->di_name;
>>   	agg->dtagd_sig = ((dt_idsig_t *)aid->di_data)->dis_auxinfo;
>>   	agg->dtagd_normal = 1;
>>   	agg->dtagd_varid = aid->di_id;
>> -	agg->dtagd_size = (aid->di_size - sizeof(uint64_t)) / DT_AGG_NUM_COPIES;
>> -	agg->dtagd_nrecs = agg->dtagd_size / sizeof(uint64_t);
>> +	agg->dtagd_nkrecs = 1 + nrecs;		/* var ID and keys */
>>   
>> -	recs = dt_calloc(dtp, agg->dtagd_nrecs, sizeof(dtrace_recdesc_t));
>> +	/* We allocate all records at once (varid, keys, data counter, data). */
>> +	recs = dt_calloc(dtp, 1 + nrecs + 2, sizeof(dtrace_recdesc_t));

The comment, referring to four items (varid, keys, data counter, data), 
does not line up with the code (1 + nrecs + 2).  If you go with the 
nkrecs=nargs+1 suggestion, the code becomes nkrecs+2.  So maybe the 
comment could become:
         /* We allocate krecs and drecs together. */

>>   	if (recs == NULL) {
>>   		dt_free(dtp, agg);
>>   		return dt_set_errno(dtp, EDT_NOMEM);
>>   	}
>>   
>> -	agg->dtagd_recs = recs;
>> -
>> -	for (i = 0; i < agg->dtagd_nrecs; i++) {
>> -		dtrace_recdesc_t	*rec = &recs[i];
>> -
>> -		rec->dtrd_action = fid->di_id;
>> -		rec->dtrd_size = sizeof(uint64_t);
>> -		rec->dtrd_offset = off;
>> -		rec->dtrd_alignment = sizeof(uint64_t);
>> -		rec->dtrd_format = NULL;
>> -		rec->dtrd_arg = 1;
>> -
>> -		off += sizeof(uint64_t);
>> -	}
>> +	/* Set up the variable ID record (first record in the keys). */
>> +	recs[0].dtrd_action = DTRACEACT_DIFEXPR;
>> +	recs[0].dtrd_size = sizeof(uint32_t);
>> +	recs[0].dtrd_offset = 0;
>> +	recs[0].dtrd_alignment = sizeof(uint32_t);
>> +	recs[0].dtrd_format = NULL;
>> +	recs[0].dtrd_arg = 1;
>> +
>> +	agg->dtagd_krecs = recs;
>> +	agg->dtagd_ksize = recs[0].dtrd_size;	/* Updated when adding keys */
>> +
>> +	/* Set up the data records. */
>> +	drecs = &recs[nrecs];
> This needs to be:
>
> 	drecs = &recs[1 + nrecs];

Yeah, or s/1 + nrecs/nkrecs/.

>> +
>> +	drecs[DT_AGGDATA_COUNTER].dtrd_action = ((dt_ident_t *) aid->di_iarg)->di_id;
>> +	drecs[DT_AGGDATA_COUNTER].dtrd_size = sizeof(uint64_t);
>> +	drecs[DT_AGGDATA_COUNTER].dtrd_offset = 0;
>> +	drecs[DT_AGGDATA_COUNTER].dtrd_alignment = sizeof(uint64_t);
>> +	drecs[DT_AGGDATA_COUNTER].dtrd_format = NULL;
>> +	drecs[DT_AGGDATA_COUNTER].dtrd_arg = 1;
>> +
>> +	drecs[DT_AGGDATA_RECORD].dtrd_action = ((dt_ident_t *) aid->di_iarg)->di_id;
>> +	drecs[DT_AGGDATA_RECORD].dtrd_size = aid->di_size - sizeof(uint64_t);

For the time being, I guess you have to divide by DT_AGG_NUM_COPIES.

>> +	drecs[DT_AGGDATA_RECORD].dtrd_offset = sizeof(uint64_t);
>> +	drecs[DT_AGGDATA_RECORD].dtrd_alignment = sizeof(uint64_t);
>> +	drecs[DT_AGGDATA_RECORD].dtrd_format = NULL;
>> +	drecs[DT_AGGDATA_RECORD].dtrd_arg = 1;
>> +
>> +	agg->dtagd_drecs = drecs;
>> +	agg->dtagd_dsize = drecs[DT_AGGDATA_COUNTER].dtrd_size +
>> +			   drecs[DT_AGGDATA_RECORD].dtrd_size;
>>   
>>   	dtp->dt_adesc[id] = agg;
>>   
>> @@ -338,7 +348,8 @@ dt_aggid_destroy(dtrace_hdl_t *dtp)
>>   
>>   	for (i = 0; i < dtp->dt_maxagg; i++) {
>>   		if (dtp->dt_adesc[i] != NULL) {
>> -			dt_free(dtp, dtp->dt_adesc[i]->dtagd_recs);
>> +			/* Freeing dtagd_recs also frees dtagd_drecs. */

s/dtagd_recs/dtagd_krecs/

>> +			dt_free(dtp, dtp->dt_adesc[i]->dtagd_krecs);
>>   			dt_free(dtp, dtp->dt_adesc[i]);
>>   		}
>>   	}
>> diff --git a/libdtrace/dt_printf.c b/libdtrace/dt_printf.c
>> index 4fb20b23..4f0e6b8a 100644
>> --- a/libdtrace/dt_printf.c
>> +++ b/libdtrace/dt_printf.c
>> @@ -1264,10 +1264,12 @@ dt_printf_format(dtrace_hdl_t *dtp, FILE *fp, const dt_pfargv_t *pfv,
>>   	dtrace_aggdesc_t *agg;
>>   	caddr_t lim = (caddr_t)buf + len, limit;
>>   	char format[64] = "%";
>> -	int i, aggrec = 0, curagg = -1;
>> +	int i, curagg = -1;
>>   	uint64_t normal, sig;
>>   
>>   	/*
>> +	 * FIXME: Comment needs rewrite.
>> +	 *

Comment needs rewrite.

>>   	 * If we are formatting an aggregation, set 'aggrec' to the index of
>>   	 * the final record description (the aggregation result) so we can use
>>   	 * this record index with any conversion where DT_PFCONV_AGG is set.
>> @@ -1280,13 +1282,8 @@ dt_printf_format(dtrace_hdl_t *dtp, FILE *fp, const dt_pfargv_t *pfv,
>>   		assert(aggsdata != NULL);
>>   		assert(naggvars > 0);
>>   
>> -		if (nrecs == 0)
>> -			return dt_set_errno(dtp, EDT_DMISMATCH);
>> -
>>   		curagg = naggvars > 1 ? 1 : 0;
>>   		aggdata = aggsdata[0];
>> -		aggrec = aggdata->dtada_desc->dtagd_nrecs - 1;
>> -		nrecs--;
>>   	}
>>   
>>   	for (i = 0; i < pfv->pfv_argc; i++, pfd = pfd->pfd_next) {
>> @@ -1380,11 +1377,10 @@ dt_printf_format(dtrace_hdl_t *dtp, FILE *fp, const dt_pfargv_t *pfv,
>>   			if (curagg < naggvars - 1)
>>   				curagg++;
>>   
>> -			rec = &agg->dtagd_recs[aggrec];
>> -			addr = aggdata->dtada_data;
>> +			rec = &agg->dtagd_drecs[DT_AGGDATA_RECORD];
>> +			addr = aggdata->dtada_data + rec->dtrd_offset;
>>   			limit = addr + aggdata->dtada_size;
>>   			normal = agg->dtagd_normal;
>> -			size = agg->dtagd_size;
>>   			sig = agg->dtagd_sig;
>>   			flags = DTRACE_BUFDATA_AGGVAL;
>>   		} else {
>> @@ -1409,10 +1405,11 @@ dt_printf_format(dtrace_hdl_t *dtp, FILE *fp, const dt_pfargv_t *pfv,
>>   			addr = (caddr_t)buf + rec->dtrd_offset;
>>   			limit = lim;
>>   			normal = 1;
>> -			size = rec->dtrd_size;
>>   			sig = 0;
>>   		}
>>   
>> +		size = rec->dtrd_size;
>> +
>>   		if (addr + size > limit) {
>>   			dt_dprintf("bad size: addr=%p size=0x%x lim=%p\n",
>>   			    (void *)addr, rec->dtrd_size, (void *)lim);
>> @@ -1809,8 +1806,8 @@ static int
>>   dt_fprinta(const dtrace_aggdata_t *adp, void *arg)
>>   {
>>   	const dtrace_aggdesc_t	*agg = adp->dtada_desc;
>> -	const dtrace_recdesc_t	*rec = agg->dtagd_recs;
>> -	uint_t			nrecs = agg->dtagd_nrecs;
>> +	const dtrace_recdesc_t	*rec = agg->dtagd_krecs;
>> +	uint_t			nrecs = agg->dtagd_nkrecs;
>>   	dt_pfwalk_t		*pfw = arg;
>>   	dtrace_hdl_t		*dtp = pfw->pfw_argv->pfv_dtp;
>>   

rec should perhaps not start with the varID record but with the first 
(agg) key record.  So rec has to be bumped up, and nrecs has to be 
decremented accordingly.  Admittedly, this bug predates your patch, but 
you fix the corresponding problem in dt_fprintas()...

>> @@ -1837,8 +1834,8 @@ dt_fprintas(const dtrace_aggdata_t **aggsdata, int naggvars, void *arg)
>>   {
>>   	const dtrace_aggdata_t	*aggdata = aggsdata[0];
>>   	const dtrace_aggdesc_t	*agg = aggdata->dtada_desc;
>> -	const dtrace_recdesc_t	*rec = agg->dtagd_recs;
>> -	uint_t			nrecs = agg->dtagd_nrecs;
>> +	const dtrace_recdesc_t	*rec = &agg->dtagd_krecs[1];
>> +	uint_t			nrecs = agg->dtagd_nkrecs - 1;
>>   	dt_pfwalk_t		*pfw = arg;
>>   	dtrace_hdl_t		*dtp = pfw->pfw_argv->pfv_dtp;
>>   	int			i;
>> -- 
>> 2.34.1
>>
>>
>> _______________________________________________
>> DTrace-devel mailing list
>> DTrace-devel at oss.oracle.com
>> https://oss.oracle.com/mailman/listinfo/dtrace-devel
> _______________________________________________
> DTrace-devel mailing list
> DTrace-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/dtrace-devel