[DTrace-devel] [PATCH v3] Add rodata support

Mon Aug 21 17:54:43 UTC 2023

I'm trying to drop issues as much as I reasonably can, but there are 
still some items below.

And another issue:  what about alignment?  What if I want to add some 
data that needs 4- or 8-byte alignment?  Do I do that by fabricating 
some bogus data of the right size and inserting that first, then 
inserting the data I want?  But what if the bogus data magically matches 
data I've already inserted, and the intended padding does not result?  
Or, no promises of alignment?

On 8/21/23 11:53, Kris Van Hees wrote:

> On Fri, Aug 18, 2023 at 03:17:23PM -0400, Eugene Loh wrote:
>> On 8/17/23 22:04, Kris Van Hees wrote:
>>> On Thu, Aug 17, 2023 at 06:10:25PM -0400, Eugene Loh via DTrace-devel wrote:
>>>> On 8/17/23 10:53, Kris Van Hees via DTrace-devel wrote:
>>>> I'd make that second argument NULL.? (See my comment later about
>>>> dt_rodata_write().)
>>> I disagree (likewise, see below).
>>>>> +dt_rodata_t *
>>>>> +dt_rodata_create(size_t bufsz)
>>>>> +{
>>>>> +	dt_rodata_t	*dp = malloc(sizeof(dt_rodata_t));
>>>>> +	size_t		nbuckets = _dtrace_strbuckets;
>>>>> +
>>>>> +	assert(bufsz != 0);
>>>>> +
>>>>> +	if (dp == NULL)
>>>>> +		return NULL;
>>>>> +
>>>>> +	memset(dp, 0, sizeof(dt_rodata_t));
>>>>> +	dp->hash = malloc(nbuckets * sizeof(dt_rohash_t *));
>>>>> +	if (dp->hash == NULL)
>>>>> +		goto err;
>>>>> +
>>>>> +	memset(dp->hash, 0, nbuckets * sizeof(dt_rohash_t *));
>>>>> +	dp->hashsz = nbuckets;
>>>>> +	dp->bufs = NULL;
>>>>> +	dp->ptr = NULL;
>>>>> +	dp->nbufs = 0;
>>>>> +	dp->bufsz = bufsz;
>>>>> +
>>>>> +	if (dt_rodata_grow(dp) == -1)
>>>>> +		goto err;
>>>> Why not let the first dt_rodata_grow() be made by the first "insert"?
>>> Since we want a first NULL element.
>> Why?? And why not insert it with an insert call?
> Because you cannot add NULL to the rodata and because I need it there so I
> can have a guaranteed value for NULL (offset 0).  I believe I commented on
> that already before (as quoted below).

Does the value in byte 0 matter?  I'm guessing it does not.  It seems to 
me the point is not that we want a 0 in the first byte. Rather, we 
simply want to bump dp->ptr and dp->size up by 1 so that whenever an 
offset of 0 is returned we know a NULL pointer had been passed in.

>>>>> +
>>>>> +	/* Pre-populate the read-only data with a 0-byte at offset 0.  */
>>>>> +	*dp->ptr++ = 0;
>>>>> +	dp->size = 1;
>>>>> +	dp->count = 1;
>>>> Why prepopulate?? And why not with an "insert" call?
>>> To guarantee a NULL element at offset 0.
>>>>> + * storage memory block.
>>>>> + */
>>>>> +ssize_t
>>>>> +dt_rodata_copy(const char *s, size_t n, size_t off, char *buf)
>>>> Why not just inline this function?? There is no other copy function to use.
>>> But we can support multiple ones, and in fact, when strtab is implemented based
>>> on rodata, we can use a single writer function and supply it with a custom
>>> copy function (distinct from what rodata needs).
>> What would that function look like?? Today, dt_strtab_copystr() is
>> simply a memcpy.
> A function like that can (and should - though never did because the original
> DTrace code is not always as robust as we'd like) ensure that we do not try
> to write more data than the output buffer can accept.  Or a function can be
> added (as I have done in debugging) to output the string table rather than
> writing it to a memory block.
>
>>>>> +{
>>>>> +	memcpy(buf + off, s, n);
>>>>> +	return n;
>>>>> +}
>>>>> +
>>>>> +ssize_t
>>>>> +dt_rodata_write(const dt_rodata_t *dp, dt_rodata_copy_f *func, void *private)
>>>> Why make the caller specify func?? In particular, the only func used is
>>>> dt_rodata_copy(), which is in this file but the callers are outside of this
>>>> file.? At the very least, insert "if (func == NULL) func = dt_rodata_copy"
>>>> so that a caller does not HAVE to specify func.
>>> See above.  I don't like the concept of a default function because that leads
>>> to needing to go look at the writer function to figure out what the default is.
>>> There is no real benefit from using a default.
>> Maybe it would help if I knew what other functions were being considered.
> I honestly do not see that as being relevant for a code review.

The patch adds 100s of lines.  That's bad.  To judge the patch, one 
needs to ask "is this worth it?" and "does it do this correctly?" All 
that is hard to judge when I don't know what "this" is.

> The code
> provides for using a callback function that is used for the actual processing
> of the data chunks, and assuming the code here does what it needs to do: call
> the callback function for each chunk in order, and verify return value of the
> callback function to ensure the data was processed without problems, it does
> what it is meant to do.
>
>> Right now, my understanding is that rodata is some big linear space
>> where I can dump blocks of data.? I'm given the linear offset to each
>> block.? There is protection against duplicates.? The "write" function
>> presumably dumps that linear space to somewhere designated by a
>> pointer.? Why should I need to specify anything other than that
>> pointer?? Why should I have to go looking at the writer function to
>> figure out what the func argument is supposed to be?
> Because, as I explain above, this function can support more callbacks.  It is
> quite common when this programming technique is used that you need to consider
> the call chain:
>
> File A defines a callback function cb and calls an iterator function for some
> opaque data structure, passing in cb to do the actual work.  This way the
> actual implementation details of the data structure are encapsulated and in
> a sense hidden from the code using it.  Which means it can change without the
> need to change the code that uses it.

It seems to me that an iterator is very different.  In that case, you have a bunch of like things and the iterator does not need to know what those things are.

Here, I *thought* we were simply packing a memory area.  Stuff is loaded in (without producing any duplicates) and then we flush/dump it out.  I would think nothing beyond memcpy would be needed at this point.  If any processing had to take place on the data, it would already have taken place.  Specifically, if the data is going to grow, you're in trouble since the offsets are already baked in.  If the data is going to shrink, you'll be wasting space.  The only thing we know about the data is that they're sitting in some linear space.