[DTrace-devel] [PATCH v4 14/25] libcommon: move uprobes creation into the common library

Mon Oct 10 11:59:58 UTC 2022

On 8 Oct 2022, Kris Van Hees outgrape:

> On Sat, Oct 08, 2022 at 02:16:33AM -0400, Kris Van Hees wrote:
>> On Sat, Oct 08, 2022 at 01:38:00AM -0400, Kris Van Hees via DTrace-devel wrote:
>> > On Fri, Oct 07, 2022 at 11:25:13AM +0100, Nick Alcock via DTrace-devel wrote:
>> > > +/*
>> > > + * Encode a NAME suitably for representation in a uprobe.  All non-alphanumeric,
>> > > + * non-_ characters are replaced with __XX where XX is the hex encoding of the
>> > > + * ASCII code of the byte. __ itself is replaced with ___.  The first letter
>> > > + * gets a similar transformation applied even to digits, because anything
>> > > + * resembling consistency in naming rules is obviously just wrong.
>> > 
>> > This is not really needed.  DTrace uses identifiers for the provider, module,
>> > and function names (i.e. [A-Za-z_][0-9A-Za-z_]+) and an identifier that is allowed to contain - as probe name (and that - is rendered as __ in places where it
>> > has to be an identifier, such as in the USDT data).  So, the only encoding that
>> > would need to be done is turning - into __.  But, even that is probably not
>> > needed because the code that would need to call this function (directly or
>> > indirectly) won't be dealing with probe names with a - in them.  This should
>> > only be used in the process of uprobe based probe creation.  Which is used by
>> > the dtrace provider (BEGIN and END probes) and pid/usdt.  The pid probes are
>> > all either entry, return, or an offset in a function, and the usdt probes are
>> > created based on the DOF which has probe names that already had - translated
>> > into __.

Err... if that's true, how come the first and only encoding problem I
had, that led to all this stuff getting written in the first place, was
that the probe names we get out of the DOF have undecorated -'s in? :)
(Try turning the encoding off and you'll see them all over the place.)

If I was hallucinating and they've gone, that's great, it means we won't
lose so much space on encoding overhead.

>> > > + * The DIFFERENTIATOR byte must be different for every encoded string in the uprobe.
>> > > + * (It is otherwise ignored, and discarded at decoding time.)
>> > 
>> > It might be worth actually explaining why the differentiator character is
>> > needed, i.e. to handle the case where two probe description components are
>> > the same identifier.
>> > 
>> > On the other hand, I do not think you should do that here.  Since it is a
>> > constant, and you already have a similar (actually, identical) value associated
>> > with the 4 argument strings, you can simply hardcode it where the argument
>> > string for the uprobe is constructed.
>> > 
>> > So with no encoding needed, and no differentiator needed, you can get rid of
>> > this.

How I wish this were true :)

> Ugh, I stand corrected.  I of course overlooked the case of the module name
> being a filename (albeit with various limitations), which is a case that does
> need encoding of course.

It's worse than that -- it's an soname derived from ld.so, and God knows
what crazy sonames future ld.so's might permit.

> So, perhaps some encoding is still going to be needed but I would urge you to be
> quite conservative about it because size limitations on these strings can really
> get in our way.

One advantage of splitting the probe identifiers into multiple
parameters is that encoding of : is rarely needed (rather than being
universal). So we're really not losing many bytes on this -- and it's
changeable at any time in the future to a better scheme, because this is
a strictly DTrace->DTrace internal thing.

>> > > +	if (asprintf(&name, "dt_pid/dt_%s%llx_%llx_%lx",
>> > > +		isret ? "ret_" : "",
>> > 
>> > What is the point in adding a "ret_" prefix?  The code that will parse the
>> > uprobes list reads uprobe_events which specifies 'p' or 'r' as the first
>> > character on each line listing a probe, indicating whether it is a regular
>> > probe or a uretprobe.  That is sufficient so no additional "ret_" prefix is
>> > needed.
>> 
>> My mistake - it is not needed for probe lookup, but something is needed to
>> ensure that regular probes and return probes at the same location have a
>> different name.

Yep!

>>                 I would still prefer to see less verbose strings because we
>> are duplicating information.  E,g, if the group is dt_pid, then we already
>> know these are DTrace probes so there is no need to put a dt_ prefer on the
>> uprobe name.  So, I'd suggest using:
>> 
>> 	if (asprintf(&name, "dt_pid/%c_%s%llx_%llx_%lx",
>> 		isret ? 'p' : 'r',

That's much nicer than what I was doing! Adjusted.

>> > This can simply become:
>> > 
>> > 		if (asprintf(&args, "%s1=\\1 %s2=\\2 %s3=\\3 %s4=\\4",
>> > 			     prv, mod, fun, prb) < 0)
>> > 			return NULL;
>> 
>> I forgot about the possibility that (perhaps) some element might be an empty
>> string (unlikely and perhaps not actually possible but anyway)...  So, perhaps
>> it would be better (and easier) to use:
>> 
>>  		if (asprintf(&args, "P%s=\\1 M%s=\\2 F%s=\\3 N%s=\\4",
>>  			     prv, mod, fun, prb) < 0)
>>  			return NULL;
>> 
>> in which case you can simply derive the probe description components by
>> copying from the 2nd character.

sscanf can even do that for us:

switch (sscanf(buf, "p:dt_pid/p_%llx_%llx_%lx %ms "
	       "P%m[^= ]=\\1 M%m[^= ]=\\2 F%m[^= ]=\\3 N%m[^= ]=\\4", &dev_ll, &inum_ll,
	       &addr, &spec, &eprv, &emod, &efun, &eprb)) {

No actual *code* needed at all :)

This also means we can drop a few lines of complexity from the encoder:
if the first char is always set here, we don't need to arrange to encode
digits if they happen to be the first char.

(changes made locally, usdt and pid tests still passing, will check out
all your other mails before running a full test.)