[DTrace-devel] [PATCH 5/5 v2] Add a profile provider

Fri Jun 19 22:58:54 PDT 2020

On 06/16/2020 09:54 PM, Kris Van Hees wrote:

> On Tue, Jun 16, 2020 at 12:58:59PM -0700, Eugene Loh wrote:
>> On 06/15/2020 09:48 PM, Kris Van Hees wrote:
>>> On Mon, Jun 15, 2020 at 02:18:05PM -0400, eugene.loh at oracle.com wrote:
>>>> From: Eugene Loh <eugene.loh at oracle.com>
>>>>
>>>> The probes use perf_events with PERF_TYPE_SOFTWARE and
>>>> PERF_COUNT_SW_CPU_CLOCK.
>>>>
>>>> There are many FIXMEs.  Among them, the trampoline does not
>>>> yet write the entire DTrace context... at least not the DTrace
>>>> probe arguments.  Also, profile-n is not yet implemented.
>>>>
>>>> For other providers, the work of setting a probe's event_fd is
>>>> done in bpf_attach(), but the profile provider is sufficiently
>>>> different that it makes more sense to do so in the provider's
>>>> probe_info() function.  That is, we would like to set the
>>>> probe's event_id and event_fd in probe_info().  On the other
>>> I don't believe this makes sense.  The point is that a probe is not created
>>> (or enabled) until we are going to use it.
>> I wanted to get clarification on what you mean.  Currently, we "create"
>> probes -- like dtrace's uprobes and fbt's kprobes -- during the
>> providers' probe_info() functions.  This occurs during, e.g.,
>> cmd/dtrace.c's "fourth phase" of processing command-line options -- that
>> is, when D is being compiled and before we do any "list" or "exec" stuff
>> (i.e., whether or not we're going to use the probes). I assume such
>> behavior is inconsistent with the point you're making?
> Ok, I expressed myself in a somewhat confusing way due to the overloaded
> concept of what a probe is.  I am refering to the creation of the actual
> underlying perf event as probe creation.  I would e.g. argue that although
> the kprobe does not exist yet for an FBT probe until probe_info() is called,
> the probe actually does exist since the symbol is listed in the list of
> functions that can be traced using kprobes.  So for the case of FBT probes we
> have:
>
>    1. A function is listed in available_filter_functions and we can therefore
>       say that it exists as a probe (we actually create it as a probe when we
>       call populate() in the FBT provider implementation.

"We can say"?  It sounds like a semantic game.  I get what you're saying 
about fbt/kprobes, but I think the real point is that we are forced to 
create the probe (in the "kprobe" sense even if not in the "perf_event" 
sense) at this time so that, e.g., args can be discovered... so that 
probe_info() can return something about the probe.

It seems to me there is nothing magical about 
available_filter_functions.  All kinds of things can be traced, whether 
they're in that list or not.  E.g., this business of "well it exists 
because it could exist" is even more tortured in the case of the dtrace 
provider, which looks for some libdtrace, hunts for symbols in it, and 
then creates corresponding uprobes.  The dtrace provider is creating 
uprobes here that are not part of any fixed "available probes" list.

Or, would you say that the dtrace provider's uprobes are sufficiently 
different from the "available_filter_functions" situation ("fixed" list 
of things that could be traced versus the reality that any function 
could be traced) that we should change the dtrace provider so that it 
does not create its uprobes until we're ready to attach BPF programs to 
them?

Anyhow, regardless of how we want to express any of that, I guess the 
big objectives are:

*)  Ideally, we should not create any probes until we're ready to attach 
BPF programs to them and run.

*)  We do, however, want to find out some stuff about probes earlier on.

Those objectives can conflict for some providers where we need at least 
to create some kprobe so we can figure out its args.  Since we're 
creating the kprobe anyhow, we leave it to be used later for tracing.

>    2. A kprobe is created when probe_info() is called because we need to be
>       able to retrieve probe specific information.  This is actually aimed at
>       determining argument information.  The event_id is retrieved at this point
>       because we are parsing the format file for the kprobe anyway.  We still
>       do not use it for anything really.

Right.  This is the practical issue.  I understand this need.  So, we 
"create the probe" here, for some definition of that phrase.

>    3. A perf event is created for the kprobe so that we can associate a BPF
>       program with it.  This is where the underlying event is actually created
>       as something we can attach to, and this is therefore the point at which
>       we "enable" the probe.
>
> Note that for kprobes and uprobes we could actually use a variant of the
> perf_event_open() system call where the kprobe/uprobe is instantiated at the
> time of creating the perf event.  It gains us auto-cleanup but it means that
> we do not have access to the information probe_info() needs until point 3
> above which is actually too late.
>
> So, the event_fd wouldn't be something that we can fill in at the time of
> probe_info() because the perf event isn't going to exist yet.  That should
> still be perfectly fine for profile-n and tick-n probes because you do not
> need the perf event yet at this point.  As long as we have the dt_probe_t for
> the probe we ought to be OK.
>
>>> During the populate and provide
>>> processing all that really ought to be done is create the probes with their
>>> relevant (provider specific) data.  Then when we are ready to actually start
>>> enabling probes and loading programs we ought to make the perf_event_open
>>> calls to instantiate the kernel level events/probes.
>>>
>>> Since the profile provider is quite different from the other existing providers
>>> (and pid provider will be similarly different) it makes sense to introduce
>>> another callback function for the provider API:
>>>
>>> 	int enable(dtrace_hdl_t *dtp, const dt_probe_t *prp)
>>>
>>> This function is called for a specifc probe in the provider, and is expected
>>> to return a file descriptor (typically return value of perf_event_open()) or
>>> -1 upon error.  The caller should assign this fd (return value) to the event_fd
>>> field in the dt_probe_t struct.
>> I can look at such a function, but I have some questions:
>>
>> *)  Would this be to replace the event_fd initialization in dt_bpf_attach()?
>> *)  In fact, wouldn't it make sense to eliminate dt_bpf_attach()
>> completely, and just have dt_bpf_stmt() call some
>> prp->prov->impl->enable(dtp, prp, fd) to attach the BPF program (fd) to
>> the D "probe"?  I bring this up due to the side discussion we were
>> having about how support for "profile-n" probes (which are implemented
>> with many, per-CPU perf_events) need multiple event_fd's per D probe.
>> Specifically, the profile-provider-specific enable (or attach) function
>> would not return an event_fd.
> Indeed, we need to replace dt_bpf_attach() altogether - so instead of an
> enable() function we probably really should simply add an attach() function
> with prototype:
>
> 	int attach(dtrace_hdl_t *dtp, const dt_probe_t *prp, int bpf_fd)
>
> returning 0 on success and -1 if it fails (and set errno correctly etc).  Funny
> thing is that I used to have a provider-specific attach function in the version
> I prepared for inclusion in the kernel tree (a long time ago).

const prp?

Currently, there is a prp->event_fd.  It presumably will not exist prior 
to this call.  If it is to persist after the call, prp should perhaps 
not be const.  Or?  What did you have in mind here? E.g., calling 
perf_event_open(), using the fd, and then throwing it away?

>>> This was the work of actually creating and enabling the underlying kernel perf
>>> event can be handled by each provider based on its unique requirements.
>>>
>>>> hand, the function was declared to take a "const" probe.
>>>> This didn't even make sense, however, since the function's
>>>> only caller, dt_probe_args_info(), retrieved information
>>>> specifically so that it could set fields like event_id.  So,
>>>> this commit removes the "const" from the probe in probe_info()'s
>>>> declaration.
>>> Why didn't it make sense that probe_info() specifies the 'const' modifier on
>>> the prp argument?  The probe_info() function does not modify any fields in
>>> prp (dt_probe_t), nor is it expected to.  So 'const' is sensible here.
>> The only call chain that calls the providers' probe_info() functions is
>> dt_probe_info()
>> -> dt_probe_args_info()
>> -> prp->prov->impl->probe_info()
>> Not only does this call chain modify the "dt_probe_t *prp", it actually
>> creates it from scratch.  So while it was the case that probe_info()
>> wasn't itself modifying prp, it was only being used to construct prp.
> No, no.  You mentioned that 'const' didn't even make sense here but you are
> refering to how you changed what probe_info() is supposed to do as a reason.
> That doesn't hold.  If you look at the code prior to your patch, it should be
> quite clear that 'const' is warranted here because prp is not modified at all.

Clearly prp was not modified at all.  After all, the code compiled with 
the const.

My point was taking a step back and looking at how to make things work 
for a new provider.  Currently, the only place probe_info() is called is 
(as I said earlier):
     dt_probe_info()
     -> dt_probe_args_info()
     -> prp->prov->impl->probe_info()
The top function creates prp.  The next function modifies it.  The leaf 
function (as you point out) does not modify prp but simply returns 
information (er, that its caller uses to modify prp).  The top function 
is supposed to be provider-agnostic.  The next function is also supposed 
to be provider-agnostic... but it is not.  It assumes probe_info() will 
return an event_id that can be assigned to prp->event_id.  That's fine 
for tracepoint-like providers -- which up until now has meant all of our 
providers -- but it is not sufficient going forward.  That is, this 
business of setting prp->event_id is a provider-specific operation.  So 
it makes sense at least to me to push that down into the 
provider-specific callee.  After all, what are our other choices?  (I 
can think of some, but I do not find them attractive.)  Pushing this 
provider-specific assignment down into the provider-specific function 
means removing the const from probe_info().  I found this a very 
reasonable change to make since, after all, probe_info() was being 
called only to return information to a caller who is then using the info 
to modify prp.

>> It did not seem important to retain that const.  In particular, if the
>> profile-provider probe_info() function were used to create the probe --
>> which is what the dtrace and fbt providers were already doing anyhow --
>> then not much other code had to be touched.
>>
>> But I think I can do your "enable()" suggestion, in which case this
>> const business does not need to be addressed.
> Indeed, which is why I flagged that argument because the 'const' modifier for
> prp as argument to probe_info() is valid.
>
> In summary, what the profile provider ought to do is:
>
> 1. Populate the probe list with the standard profile-n and tick-n probes.
> 2. Create a new probe (at the DTrace level) using provide() to allow dynamic
>     creation of probes.  No actual perf event is created yet - this should just
>     create a probe using dt_probe_insert() and set the needed data in the
>     provider specific struct for the probe.
> 3. Provide sensible data when probe_info() is called.
> 4. Create the actual perf event(s) when attach() is called.  Attach the given
>     BPF program (ref'd by fd) to the newly created perf event(s).

I think I'm generally on board with that plan, but with one proviso.  
Note that the provider-specific function probe_info() is supposed to 
return information about args.  For some providers, that means we want 
to set a provider-specific prp->event_id (since we need to create some 
probe to get its args).  Meanwhile, there are some other providers that 
do not even have a prp->event_id.  So, to me it makes sense to make the 
provider-specific function able to modify prp, so that some providers 
can set some prp->event_id that other providers do not have.  (It 
wouldn't be called prp->event_id.  It would be in prp behind some 
opaque, provider-specific pointer.)