[DTrace-devel] [PATCH] Review perf_event_open() calls

Kris Van Hees kris.van.hees at oracle.com
Thu Oct 29 13:12:10 PDT 2020


On Thu, Oct 22, 2020 at 04:45:24PM -0700, Eugene Loh wrote:
> 
> ----- kris.van.hees at oracle.com wrote:
> 
> > On Wed, Oct 14, 2020 at 05:35:45PM -0400, eugene.loh at oracle.com
> > wrote:
> > > From: Eugene Loh <eugene.loh at oracle.com>
> > > 
> > > Our perf_event_open() calls should be cleaned up a little.  The issues
> > > are minor, with no user-observable effects, but cleaner code is better.
> > > 
> > > Set size in "struct perf_event_attr".
> > 
> > Sure, though see my suggestions below.
> > 
> > > Use flag PERF_FLAG_FD_CLOEXEC.
> > 
> > This is actually not necessary.  I am not opposed to it but the code as it
> > was didn't pose an issue because we're not using execve().
> 
> Okay.  I'd like to understand this better.
> 
> There are cases where we call execvp(), but they happen before perf_event_open()
> is called, e.g. to preprocess D scripts during compilation or to launch -c
> processes.  Because they precede perf_event_open(), they don't matter.
> 
> There are some cases where we call execlp() during the consumer, like someday
> for pcap processing.
> 
> But the main thing I was wondering about was consistency.  That is, it either
> makes sense for us to use CLOEXEC or not, and we might as well employ some
> consistent policy.

It does not hurt, and it is not a bad idea, so yes, OK to add it for sure.  The
fact that there was some inconsistency probably stems from different pieces of
the code having been written at different times (the buffer code, for example,
was taken from a tiny implementation I did a long time ago).

I agree it is good to have consistency in this.
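
For reference, the practical difference is that passing the flag sets
close-on-exec atomically at open time, whereas setting it afterwards with
fcntl() leaves a small window if another thread forks and execs in between.
A rough sketch (not our code; the wrapper names are made up, and
PERF_FLAG_FD_CLOEXEC needs Linux 3.14 or later):

#include <fcntl.h>
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Atomic: the fd is never visible without FD_CLOEXEC set. */
static int open_with_cloexec(struct perf_event_attr *attr, int cpu)
{
	return syscall(__NR_perf_event_open, attr, -1, cpu, -1,
		       PERF_FLAG_FD_CLOEXEC);
}

/* Non-atomic: a fork()+exec() in another thread between the two calls
 * would leak the fd into the child. */
static int open_then_cloexec(struct perf_event_attr *attr, int cpu)
{
	int fd = syscall(__NR_perf_event_open, attr, -1, cpu, -1, 0);

	if (fd >= 0)
		fcntl(fd, F_SETFD, FD_CLOEXEC);
	return fd;
}

Since we do not execve() after the events are opened, either way works here,
which is why the flag was never strictly needed.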

> > > Use PERF_SAMPLE_RAW only for the perf_event buffer to which BPF programs
> > > will write.
> > 
> > That is actually harmless, but yes, we can drop it.
> > 
> > > Distribute perf_events evenly among CPUs.  (Note that perf_event buffers
> > > and profile-n events are one-per-CPU.  Further, tick-n events are
> > > distributed randomly for load balancing.  The only change needed is to
> > > distribute tp_attach() randomly over CPUs rather than mapping them all
> > > to CPU 0.)
> > 
> > Can you explain your rationale here, i.e. point out where there is an issue?
> > As you mention, perf_event ring buffers (our output buffers) are per-CPU, and
> > profile-n events are (as required) per-CPU.  The tick-n events being spread
> > across all CPUs makes sense also because they are events that fire as timer
> > events (and are out of necessity tied to a specific CPU).  I am not quite sure
> > where spreading SDT events over the various CPUs accomplishes load balancing
> > because those events are triggered by executing code and therefore execute on
> > the same CPU as where the executable code was running.  So I am unclear on
> > what exactly you are load balancing here?
> 
> Mostly, I just did not understand, and I guess I still don't know how
> tick-n and SDT are different.  The perf_event_open() man page says cpu>=0
> means monitoring only on the specified CPU.  That does not seem to be the
> case and it would not make sense if we are running all SDTs on cpu=0.
> 
> Maybe I'll just rephrase as a question:  what does perf_event_open()'s cpu
> argument mean, at least in the case of our SDT usage?

The main issue here is that not all perf events are the same; in fact there is
quite a variety of functional differences between the different events that
can be accessed as perf events.

For tick-n probes we are creating a perf event that has a timer as its trigger,
and timers need to be associated with a specific CPU at the lowest level.  So,
because the timer is created by the perf event, the CPU that is passed in is
actually the CPU on which the timer is expected to run.
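
As a sketch (not the libdtrace code itself; the wrapper name and period value
are illustrative), a tick-n style event ends up looking something like:

#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_tick_event(int cpu, uint64_t period)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_CPU_CLOCK;
	attr.size = sizeof(attr);
	attr.sample_period = period;	/* cpu-clock counts in ns */
	attr.wakeup_events = 1;

	/* pid = -1, cpu >= 0: the timer behind this event runs on (and
	 * fires on) the given CPU. */
	return syscall(__NR_perf_event_open, &attr, -1, cpu, -1,
		       PERF_FLAG_FD_CLOEXEC);
}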

For tracepoint based probes, the perf event handling code in the kernel is not
actually creating an event source that is associated with a CPU; rather, it
associates the perf event with the tracepoint such that when the tracepoint is
hit, a perf event is triggered.  Since tracepoints are triggered by code
execution, they can trigger on any CPU, and thus the perf event is not tied to
a specific CPU.  The cpu argument is therefore ignored in this case.
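
So for an SDT probe the call ends up looking roughly like this (a sketch, not
tp_attach() itself; event_id is assumed to be the tracepoint's numeric id as
exported by the kernel under the tracing events directory):

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_tracepoint_event(int event_id, int cpu)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_TRACEPOINT;
	attr.size = sizeof(attr);
	attr.config = event_id;		/* tracepoint id */
	attr.sample_period = 1;
	attr.wakeup_events = 1;

	/* pid is -1, so the kernel still wants a valid cpu here, but the
	 * tracepoint fires on whichever CPU executes the traced code. */
	return syscall(__NR_perf_event_open, &attr, -1, cpu, -1,
		       PERF_FLAG_FD_CLOEXEC);
}

The BPF program is then attached to the resulting fd with
ioctl(fd, PERF_EVENT_IOC_SET_BPF, bpf_fd), as in the patch below.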

> > > Signed-off-by: Eugene Loh <eugene.loh at oracle.com>
> > > ---
> > >  libdtrace/dt_peb.c          | 1 +
> > >  libdtrace/dt_prov_profile.c | 3 +--
> > >  libdtrace/dt_prov_sdt.c     | 8 ++++++--
> > >  3 files changed, 8 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/libdtrace/dt_peb.c b/libdtrace/dt_peb.c
> > > index 0ef1c419..b8c49e2f 100644
> > > --- a/libdtrace/dt_peb.c
> > > +++ b/libdtrace/dt_peb.c
> > > @@ -75,6 +75,7 @@ dt_peb_open(dt_peb_t *peb)
> > >  	memset(&attr, 0, sizeof(attr));
> > >  	attr.config = PERF_COUNT_SW_BPF_OUTPUT;
> > >  	attr.type = PERF_TYPE_SOFTWARE;
> > > +	attr.size = sizeof(struct perf_event_attr);
> > 
> > attr.size = sizeof(attr);
> > 
> > >  	attr.sample_type = PERF_SAMPLE_RAW;
> > >  	attr.sample_period = 1;
> > >  	attr.wakeup_events = 1;
> > > diff --git a/libdtrace/dt_prov_profile.c b/libdtrace/dt_prov_profile.c
> > > index 8541bee6..22cd968a 100644
> > > --- a/libdtrace/dt_prov_profile.c
> > > +++ b/libdtrace/dt_prov_profile.c
> > > @@ -293,7 +293,6 @@ static int attach(dtrace_hdl_t *dtp, const dt_probe_t *prp, int bpf_fd)
> > >  	memset(&attr, 0, sizeof(attr));
> > >  	attr.type = PERF_TYPE_SOFTWARE;
> > >  	attr.config = PERF_COUNT_SW_CPU_CLOCK;
> > > -	attr.sample_type = PERF_SAMPLE_RAW;
> > >  	attr.size = sizeof(struct perf_event_attr);
> > >  	attr.wakeup_events = 1;
> > >  	attr.freq = 0;
> > > @@ -307,7 +306,7 @@ static int attach(dtrace_hdl_t *dtp, const dt_probe_t *prp, int bpf_fd)
> > >  			j = rand() % dtp->dt_conf.num_online_cpus;
> > >  
> > >  		fd = perf_event_open(&attr, -1, dtp->dt_conf.cpus[j].cpu_id,
> > > -				     -1, 0);
> > > +				     -1, PERF_FLAG_FD_CLOEXEC);
> > >  		if (fd < 0)
> > >  			continue;
> > >  		if (ioctl(fd, PERF_EVENT_IOC_SET_BPF, bpf_fd) < 0) {
> > > diff --git a/libdtrace/dt_prov_sdt.c b/libdtrace/dt_prov_sdt.c
> > > index 4b2ac31b..8775103f 100644
> > > --- a/libdtrace/dt_prov_sdt.c
> > > +++ b/libdtrace/dt_prov_sdt.c
> > > @@ -77,14 +77,18 @@ int tp_attach(dtrace_hdl_t *dtp, const dt_probe_t *prp, int bpf_fd)
> > >  	if (datap->event_fd == -1) {
> > >  		int			fd;
> > >  		struct perf_event_attr	attr = { 0, };
> > > +		int			cpu_index, cpu;
> > > +
> > > +		cpu_index = rand() % dtp->dt_conf.num_online_cpus;
> > > +		cpu = dtp->dt_conf.cpus[cpu_index].cpu_id;
> > >  
> > >  		attr.type = PERF_TYPE_TRACEPOINT;
> > > -		attr.sample_type = PERF_SAMPLE_RAW;
> > > +		attr.size = sizeof(struct perf_event_attr);
> > 
> > attr.size = sizeof(attr);
> > 
> > >  		attr.sample_period = 1;
> > >  		attr.wakeup_events = 1;
> > >  		attr.config = datap->event_id;
> > >  
> > > -		fd = perf_event_open(&attr, -1, 0, -1, 0);
> > > +		fd = perf_event_open(&attr, -1, cpu, -1, PERF_FLAG_FD_CLOEXEC);
> > >  		if (fd < 0)
> > >  			return dt_set_errno(dtp, errno);
> > >  
> > > -- 
> > > 2.18.4
> > > 
> > > 
> 
> _______________________________________________
> DTrace-devel mailing list
> DTrace-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/dtrace-devel


