[DTrace-devel] [PATCH 2/4] Aggregations are not appropriately reset

Eugene Loh eugene.loh at oracle.com
Fri Jan 8 11:57:09 PST 2021


On 01/07/2021 07:39 PM, Kris Van Hees wrote:

> On Thu, Jan 07, 2021 at 02:05:38PM -0800, Eugene Loh wrote:
>> I've been trying to understand this suggestion, but simply haven't had
>> any luck.
>>
>> Consider a situation in which an aggregation has data on some CPU other
>> than the first CPU.  There has been a snapshot, and so the aggregation
>> is in the hash table.  Now, we do another snapshot, which is what
>> triggers the bug.
>>
>> If I understand the suggestion (and I probably do not), snap_one() (the
>> per-CPU/per-agg snapshot) is passed some function to move data from src
>> to dst.  Well, snap_one() does not know whether it is on the first CPU
>> or not nor whether the passed function copies or aggregates.  It finds
>> that the per-CPU/per-agg sequence latch number is 0.  So, should it skip
>> the data movement function?  It (later) finds that the per-CPU/per-agg
>> sequence latch number is nonzero.  So, should it invoke the src-to-dst
>> function?  If yes and yes, then we reproduce the buggy behavior
>> (aggregating current data with data from an earlier snapshot).
> dt_aggregate_snap() has a loop over all the CPUs, and it calls
> dt_aggregate_snap_cpu() for each CPU.  So, you can call dt_aggregate_snap_cpu()
> for the first CPU separately (passing in a function that copies the data), and
> then loop over the remaining CPUs and call dt_aggregate_snap_cpu() for those
> (passing in a function that aggregates the data).  Since dt_aggregate_snap_cpu()
> calls dt_aggregate_snap_one(), it can pass in the function it got from
> dt_aggregate_snap().  That means that for the first CPU, it will be copying the
> snapshot data, and for all other CPUs it will be aggregating the data.

As I described, that would perpetuate the bug.  The key is that 
snap_one() checks the latch sequence number and does nothing if it's 0.  
So, in the case I described (data is on some CPU other than the first 
one), the snap_one() call for the first CPU does nothing. Then, 
snap_one() for the CPU with data sees that there is data and that the 
aggregation is already in the hash table, and so it aggregates the 
current data with stale data left over from an earlier snapshot.  
That's exactly the bug this patch was supposed to fix.

Alternatively, one could eliminate that optimization (checking whether 
the latch sequence number is nonzero to see if there is any data).  If 
one is willing to forgo that, then I agree the code could be simplified.

Or, snap_one() can always do the copy for the first CPU (if the 
aggregation is already in the hash table), but that requires snap_one() 
knowing that it is the first CPU.

> The only special casing would be for an aggregation that does not exist in
> the hashtable just yet.  In that case, you should always perform a data copy
> (which can be a hardcoded call to that function, ignoring the passed in one).
>
>> I guess I also do not understand why the top-level snapshot would ever
>> pass an opaque function rather than a simple flag that indicates "first
>> CPU."
> It reduces some code duplication and/or avoids the entire goto accumulate mess
> I wrote, and

The big savings would come from moving the "data aggregation" code out 
into its own function.  Once that is done, there are multiple ways of 
cleaning up the "goto accumulate" stuff.  Passing a function in, while 
not a big deal, isn't free either.  Anyhow, we can resolve those other 
issues before worrying much about this one.

> removes the need to know about CPU index-to-id mapping outside of
> the loop in dt_aggregate_snap().

Not a big deal... a "one-liner."  But, again, if one is willing to pass 
in a function, then it's even easier to pass in a boolean value 
instead... firstcpu (or not).

>
>> Anyhow, yes, this stuff is complicated, but I was trying to keep the
>> code deltas small, especially since we don't yet know how we intend to
>> support stuff like truncate.
>>
>>
>> On 01/06/2021 11:19 PM, Kris Van Hees wrote:
>>
>>> I think the implementation is a little bit messy (and granted, it was already
>>> quite messy to begin with).  Looking at it more closely, I wonder whether it
>>> might be better to have dt_aggregate_snap() pass a function pointer to
>>> dt_aggregate_snap_cpu() which stores it in the snap state (st) as the function
>>> to be used in dt_aggregate_snap_one() for handling aggregation data for a
>>> specific aggregation on a specific CPU.  You would then define one function to
>>> do the data copy implementation and one to do the data aggregation
>>> implementation.
>>>
>>> From dt_aggregate_snap() you'd pass the data copy version for the first CPU in
>>> the system, and the data aggregation version for all other CPUs.  You would
>>> still be able to override the processing function (copy or aggregate) where
>>> needed, e.g. when dealing with an aggregation that does not exist in the hash
>>> table.
>>>
>>> Doing this would eliminate the firstcpu conditionals and I think it will
>>> streamline the code a bit in general.  You can even get rid of the rather ugly
>>> 'goto accumulate' that I put in there to avoid code duplication.
>>>
>>> On Wed, Jan 06, 2021 at 08:39:11PM -0500, eugene.loh at oracle.com wrote:
>>>> From: Eugene Loh <eugene.loh at oracle.com>
>>>>
>>>> In DTv1, each user-space aggregation snapshot would read data from the
>>>> kernel and aggregate, not only over CPUs but also with previous
>>>> snapshots.
>>>>
>>>> In DTv2, the kernel maintains aggregations in BPF maps.  Each user-space
>>>> snapshot should still aggregate over CPUs, but not with previous data.
>>>>
>>>> Reset aggregations appropriately with each snapshot.
>>>>
>>>> Tweak testing to catch this problem.  The test suite will catch this
>>>> problem even more once other functionality has been implemented.
>>>>
>>>> https://github.com/oracle/dtrace-utils/issues/4
>>>> Signed-off-by: Eugene Loh <eugene.loh at oracle.com>
>>>>
>>>> diff --git a/libdtrace/dt_aggregate.c b/libdtrace/dt_aggregate.c
>>>> index d59f107e..ba32d9d5 100644
>>>> --- a/libdtrace/dt_aggregate.c
>>>> +++ b/libdtrace/dt_aggregate.c
>>>> @@ -424,16 +424,17 @@ dt_aggregate_snap_one(dt_idhash_t *dhp, dt_ident_t *aid, dt_snapstate_t *st)
>>>>    	uint_t			i;
>>>>    	int64_t			*src;
>>>>    	int			realsz;
>>>> +	int			firstcpu = (st->cpu == st->dtp->dt_conf.cpus[0].cpu_id);
>>>>    
>>>>    	rval = dt_aggid_lookup(st->dtp, aid->di_id, &agg);
>>>>    	if (rval != 0)
>>>>    		return rval;
>>>>    
>>>> -	/* check latch sequence number to see if there is any data */
>>>> +	/* skip if no data (latch sequence number is zero) */
>>>> +	/* (except, if first CPU, still need to clear any previous data) */
>>>>    	src = (int64_t *)(st->buf + aid->di_offset);
>>>> -	if (*src == 0)
>>>> +	if (*src == 0 && !firstcpu)
>>>>    		return 0;
>>>> -	src++;
>>>>    
>>>>    	/* real size excludes latch sequence number and second data copy */
>>>>    	realsz = (aid->di_size - sizeof(uint64_t)) / 2;
>>>> @@ -456,21 +457,31 @@ dt_aggregate_snap_one(dt_idhash_t *dhp, dt_ident_t *aid, dt_snapstate_t *st)
>>>>    				: NULL;
>>>>    
>>>>    accumulate:
>>>> -		switch (((dt_ident_t *)aid->di_iarg)->di_id) {
>>>> -		case DT_AGG_MAX:
>>>> -			if (*src > *dst)
>>>> -				*dst = *src;
>>>> +		/* skip over latch sequence number to get to the data */
>>>> +		src++;
>>>>    
>>>> -			break;
>>>> -		case DT_AGG_MIN:
>>>> -			if (*src < *dst)
>>>> -				*dst = *src;
>>>> +		/* either copy in initial data or combine with existing data */
>>>> +		if (dst != (int64_t *)agd->dtada_data || firstcpu) {
>>>> +			/* if per-CPU or it's the first CPU, just copy in */
>>>> +			memcpy(dst, src, realsz);
>>>> +		} else {
>>>> +			/* otherwise, combine with existing data */
>>>> +			switch (((dt_ident_t *)aid->di_iarg)->di_id) {
>>>> +			case DT_AGG_MAX:
>>>> +				if (*src > *dst)
>>>> +					*dst = *src;
>>>>    
>>>> -			break;
>>>> -		default:
>>>> -			for (i = 0, cnt = realsz / sizeof(int64_t);
>>>> -			     i < cnt; i++, dst++, src++)
>>>> -				*dst += *src;
>>>> +				break;
>>>> +			case DT_AGG_MIN:
>>>> +				if (*src < *dst)
>>>> +					*dst = *src;
>>>> +
>>>> +				break;
>>>> +			default:
>>>> +				for (i = 0, cnt = realsz / sizeof(int64_t);
>>>> +				     i < cnt; i++, dst++, src++)
>>>> +					*dst += *src;
>>>> +			}
>>>>    		}
>>>>    
>>>>    		/* If we keep per-CPU data - process that as well. */
>>>> @@ -484,7 +495,10 @@ accumulate:
>>>>    		return 0;
>>>>    	}
>>>>    
>>>> -	/* Not found - add it. */
>>>> +	/* Not found - add it unless there is no data. */
>>>> +	if (*src == 0)
>>>> +		return 0;
>>>> +
>>>>    	h = dt_zalloc(st->dtp, sizeof(dt_ahashent_t));
>>>>    	if (h == NULL)
>>>>    		return dt_set_errno(st->dtp, EDT_NOMEM);
>>>> @@ -496,6 +510,8 @@ accumulate:
>>>>    		return dt_set_errno(st->dtp, EDT_NOMEM);
>>>>    	}
>>>>    
>>>> +	/* skip over latch sequence number to get to the data */
>>>> +	src++;
>>>>    	memcpy(agd->dtada_data, src, realsz);
>>>>    	agd->dtada_size = realsz;
>>>>    	agd->dtada_desc = agg;


