[DTrace-devel] [PATCH 40/61] Fix records for aggregation descriptions

Thu Aug 11 04:02:04 UTC 2022

Before I spend too much (more) time on this, I wanted to check in on 
that "larger topic."  While it is true that the layout of aggregation 
data has changed in the interface between producer and consumer, the 
consumer's copy of aggregation data need not change layout -- and in 
fact retaining the legacy layout benefits from staying the same since 
that way we can reuse more legacy code. I'm not saying that the consumer 
cannot change how it stores agg data, but only that it's easier to use 
code that already exists and works.  I tried looking more carefully how 
hard it would be to remove the record descriptor for the agg data.  
E.g., there are print functions that pass record descriptors -- whether 
for keys or agg data or whatever.  So naively that code would want agg 
data to have a record descriptor.  There is also some agg cmp stuff 
that... well, I do not really understand and is perhaps not used. So 
maybe all around the code could use a lot of clean up.  That might be a 
good idea, but that is also maybe a separate project. In other 
instances, you have been in favor of leaving "good enough" code (code 
that exists and is working) alone.  How about in this case?  Again, the 
producer is passing agg data differently to the consumer compared to 
legacy, but at this point leaving the consumer organization of the data 
-- along with its use of a record descriptor for the agg data -- might 
be easiest.  Let me know.

On 8/10/22 13:45, Kris Van Hees wrote:
> On Fri, Jul 08, 2022 at 10:45:24AM -0400, eugene.loh--- via DTrace-devel wrote:
>> From: Eugene Loh <eugene.loh at oracle.com>
>>
>> There is an array of aggregation descriptions, dtp->dt_adesc.
>>
>> In legacy DTrace, an aggregation description's records, in order, describe:
>>
>> *)  The 8-byte, dtrace_aggvarid_t, aggregation variable ID.
>>
>> *)  The aggregation's keys, if any -- one record per key.
>>
>> *)  The aggregation's data.
>>
>> In the port to BPF, the records were changed to be one per each 64-bit
>> word of aggregation data.  Problems include:
>>
>> *)  This breaks some dependent code.
>>
>> *)  There is no description of the keys, if any.
>>
>> *)  There is no use for having a different record for each word of
>>      data... and there are potentially very many such words for
>>      *quantize() aggregations.
>>
>> Change the aggregation description records back to legacy style.
> I am not sure the above actually serves much to explain the patch.  And
> going back to the legacy layout of the description records seems not entirely
> ideal either because that layout was based on how data is laid out in the
> aggregation buffers, and we have changed that (and while you modified that,
> it is still different).
>
> I think it might be better to simply explain that you are updating the
> aggregation description to accomodate aggregation keys.  And that you fix
> the 'one record per 64-bit value' into 'one record for the data'.
>
> That brings up the notion of what the aggdesc should look like.  Since the
> indexed aggregations are going to be stored with the keys as BPF hashmap key,
> and the data as value (i.e. not together), it makes sense to store the keys
> as an array of record descriptions and store the data record separately as
> dtrace_aggdesc->dtagd_drec (data record description).  Storing the data recdesc
> as last element in the array of recdescs seems wrong because the data is not
> stored in the same space as the keys.
>
> And honestly... do we actually need a data record for the data?  I don't think
> so since the only info we need is the size (which you already get from
> aid->di_size and store in the aggdesc), and the offset (in the map value)
> which is a constant value and therefore does not need to be stored anywhere.
>
> So, I don't see a need to store a record description for the data.
>
>> One exception is that the legacy variable ID has changed from an 8-byte
>> dtrace_aggvarid_t to a 4-byte dtrace_aggid_t, to match the 4-byte
>> prefix at the beginning of "tuples".  E.g., see dt_cg_arglist().
> You may want to add that the aggregation ID is a regular identifier ID,
> and thus a 32-bit value so this change is valid.
>
>> In legacy DTrace, aggregation descriptions are requested from the
>> kernel.  In the BPF port, key and data sizes are discovered during
>> code generation.  Therefore, add function dt_aggid_add_rec_data()
>> to complement dt_aggid_add().  (Support for aggregation keys will
>> come in later patches.)
> Based on my comments above about keys vs data, and for consistency, there is
> no need for dt_aggid_add_rec_data().  When you add a function to add record
> descriptions for the keys later in the series, it probably should be named
> dt_agg_rec_add().  I would use dt_agg_* naming and pass in an aggdesc rather
> than the id because in all cases where you need this you would already have
> created an aggdesc anyway.  It would make sense to have dt_aggid_add() return a
> pointer to the aggdesc for that reason.
>
>> Signed-off-by: Eugene Loh <eugene.loh at oracle.com>
>>
>> diff --git a/libdtrace/dt_cg.c b/libdtrace/dt_cg.c
>> index 063ec113..61611c1d 100644
>> --- a/libdtrace/dt_cg.c
>> +++ b/libdtrace/dt_cg.c
>> @@ -6586,8 +6586,10 @@ static dt_cg_aggfunc_f *_dt_cg_agg[DT_AGG_NUM] = {
>>   static void
>>   dt_cg_agg(dt_pcb_t *pcb, dt_node_t *dnp, dt_irlist_t *dlp, dt_regset_t *drp)
>>   {
>> +	dtrace_hdl_t	*dtp = pcb->pcb_hdl;
>>   	dt_ident_t	*aid, *fid;
>>   	dt_cg_aggfunc_f	*aggfp;
>> +	int		nrecs, off;
>>   
>>   	/*
>>   	 * If the aggregation has no aggregating function applied to it, then
>> @@ -6618,7 +6620,23 @@ dt_cg_agg(dt_pcb_t *pcb, dt_node_t *dnp, dt_irlist_t *dlp, dt_regset_t *drp)
>>   	assert(aggfp != NULL);
>>   
>>   	(*aggfp)(pcb, aid, dnp, dlp, drp);
>> -	dt_aggid_add(pcb->pcb_hdl, aid);
>> +
>> +	/* add this agg id if we are seeing it for the first time */
>> +	/* we do this after the BPF code generation since aggfp() sets aid->di_size */
>> +	if (dt_aggid_lookup(dtp, aid->di_id, NULL) == -1) {
> 		dtrace_aggdesc_t	*agg;
>
>> +
>> +		/* count the number of records */
>> +		nrecs = 1;		/* variable ID */
>> +		nrecs++;		/* data */
> There should not be any need for counting records.  You can just use
> ((dt_idsig_t *)aid->di_data)->dis_argc which gives you the number of keys.
> I would not add 1 for the aggid itself because you are storing the record
> for that in dt_aggid_add() itself rather than doing it explicitly here.
>
> As discussed above, I would not include the data record in this because it is
> not stored in the same location as the keys.
>
>> +
>> +		/* add the aid */
>> +		dt_aggid_add(dtp, aid, nrecs);
> 		agg = dt_aggid_add(dtp, aid,
> 				   ((dt_idsig_t *)aid->di_data)->dis_argc);
>
>> +
>> +		off = sizeof(uint32_t);
> You should track the offset as a member in the aggdesc, and have
> dt_agg_rec_add() do the work of tracking the next offset to use for
> subsequent calls.
>
>> +
>> +		/* add record for aggregation data */
>> +		off = dt_aggid_add_rec_data(dtp, aid->di_id, off);
> Should not be needed.  All info that is needed is already present in aggdesc
> once you call dt_aggid_add(), right?  You could just add a dtagd_dsize member
> and be done?
>
> But I think there is a larger topic here though...  Why do we keep using the
> data structures that the legacy version used when we are storing the data in
> a different way.  Sure, copying the aggid, keys, and data into a single block
> of memory and stuffing that into the hash works, but it is the right way?  I
> would think it is more naturel to store varid, keys, and data each in their
> own members on a struct that holds the aggregation data.  That way you do not
> have to read pieces of information from specific offsets etc.  And there is
> no penalty because you are needing to do the copying in multiple memcpy()s
> anyway no matter what data structure approach you use.
>
> But in our new design, I just do not see why we would be storing the data
> along with the keys in a block of memory.  They are already (quite nicely)
> separated as hash key and value in the BPF hash map.
>
> That does impact this and the remainder of patches, but I think it is a
> worthwhile change to make.  And you can then have dtagd_ksize and dtagd_dsize
> as keys size and data size, respectively.  And as mentioned above, no need to
> store a record description for the data since all you need is the size.