[DTrace-devel] [PATCH 01/16] Add a CPC provider

Eugene Loh eugene.loh at oracle.com
Tue Feb 21 23:56:57 UTC 2023


Unless otherwise noted below, changes made as indicated.

On 2/21/23 16:53, Kris Van Hees wrote:

> On Thu, Jan 26, 2023 at 09:23:14PM -0500, eugene.loh--- via DTrace-devel wrote:
>> diff --git a/libdtrace/dt_cg.c b/libdtrace/dt_cg.c
>> @@ -400,6 +400,80 @@ dt_cg_tramp_copy_args_from_regs(dt_pcb_t *pcb, int called)
>> +void
>> +dt_cg_tramp_copy_PC_from_regs(dt_pcb_t *pcb)
> Mind if I rename this to be: dt_cg_tramp_copy_pc_from_regs()
> since most function names are all lowercase.

Okay, though I guess I will do the renaming since a v2 patch is 
apparently needed anyhow.

>> +{
>> +	dtrace_hdl_t	*dtp = pcb->pcb_hdl;
>> +	dt_regset_t	*drp = pcb->pcb_regs;
>> +	dt_irlist_t	*dlp = &pcb->pcb_ir;
>> +	uint_t		Luser = dt_irlist_label(dlp);
>> +	uint_t		Ldone = dt_irlist_label(dlp);
>> +
>> +	if (dt_regset_xalloc_args(drp) == -1)
>> +		longjmp(yypcb->pcb_jmpbuf, EDT_NOREG);
>> +
>> +	/* place the PC in %r3, arg0, and arg1 */
>> +        emit(dlp, BPF_LOAD(BPF_DW, BPF_REG_3, BPF_REG_8, PT_REGS_IP));
>> +        emit(dlp, BPF_STORE(BPF_DW, BPF_REG_7, DMST_ARG(0), BPF_REG_3));
>> +        emit(dlp, BPF_STORE(BPF_DW, BPF_REG_7, DMST_ARG(1), BPF_REG_3));
>> +
>> +	/* check if the PC is kernel or user space */
>> +	if (dtp->dt_bpfhelper[BPF_FUNC_probe_read_kernel] == BPF_FUNC_probe_read_kernel) {
>> +		/*
>> +		 * Use probe_read_kernel() if it is really is probe_read_kernel().
>> +		 * On older kernels, it does not exist and is aliased to something else.
>> +		 */
> Are you sure this is valid?  I thought that e.g. x86_64 allowed reading/writing
> fromto userspace addresses using probe_read_kernel().  Did you check that?


Reading from user space with probe_read_kernel()?  Or do you mean with 
probe_read()?  I think the history is complicated.  At first, 
probe_read() was indiscriminate.  Then the kernel and user variants were 
introduced and probe_read() became picky.  Then it was changed to be 
tolerant on platforms (like x86_64) when kernel and user space could be 
distinguished.  So, where there is a probe_read_kernel(), I use that to 
check if the address is kernel or user space -- and the tests check the 
correctness.  On older kernels, where probe_read_kernel() does not exist 
and probe_read() is indiscriminate, a different approach is used.  So I 
think things are fine.

>> +
>> +		/* test just a single byte */
>> +		emit(dlp,  BPF_MOV_IMM(BPF_REG_2, 1));
>> +
>> +		/* safe to write to FP+DT_STK_SP_BASE, which becomes the clause stack */
>> +		emit(dlp,  BPF_MOV_REG(BPF_REG_1, BPF_REG_FP));
>> +		emit(dlp,  BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, DT_STK_SP_BASE));
>> +
>> +		/* bpf_probe_read_kernel(%fp + DT_STK_SP, 1, PC) */
>> +		dt_regset_xalloc(drp, BPF_REG_0);
>> +		emit(dlp,  BPF_CALL_HELPER(BPF_FUNC_probe_read_kernel));
>> +
>> +		/* if there was a problem, assume it was user space */
>> +		emit(dlp,  BPF_BRANCH_IMM(BPF_JNE, BPF_REG_0, 0, Luser));
>> +		dt_regset_free(drp, BPF_REG_0);
>> +	} else {
>> +		/*
>> +		 * If no real probe_read_kernel() exists, just test the highest bit.
>> +		 * This is not as robust, but probably works just fine for us.
>> +		 */
>> +
>> +		/* if the highest bit is 0, assume it was user space */
>> +		emit(dlp, BPF_ALU64_IMM(BPF_RSH, BPF_REG_3, 63));
>> +		emit(dlp,  BPF_BRANCH_IMM(BPF_JEQ, BPF_REG_3, 0, Luser));
>> +	}
>> +
>> +	/* PC is kernel space (zero out arg1) */
>> +	emit(dlp,  BPF_STORE_IMM(BPF_DW, BPF_REG_7, DMST_ARG(1), 0));
>> +	emit(dlp,  BPF_JUMP(Ldone));
>> +
>> +	/* PC is user space (zero out arg0) */
>> +	emitl(dlp, Luser,
>> +		   BPF_STORE_IMM(BPF_DW, BPF_REG_7, DMST_ARG(0), 0));
>> +
>> +	/* done */
>> +	emitl(dlp, Ldone,
>> +	      BPF_NOP());
>> +	dt_regset_free_args(drp);
>> +}
>> +
>>   /*
>>    * Copy return value from a dt_pt_regs structure referenced by %r8 to
>>    * mst->arg[1].  Zero the other args.
>> diff --git a/libdtrace/dt_open.c b/libdtrace/dt_open.c
>> @@ -1261,6 +1262,7 @@ dtrace_close(dtrace_hdl_t *dtp)
>>   	dt_pfdict_destroy(dtp);
>>   	dt_dof_fini(dtp);
>>   	dt_probe_fini(dtp);
>> +	// FIXME: add some dt_prov_fini() that will iterate over providers and call provider-specific fini()'s; CPC will call pfm_terminate()
> I would stick with /* */ comment style since that is what we do pretty much
> everywhere else.  What is the impact of not having that yet at this point?

Okay, will switch //, here and subsequently except for where it's 
basically just copying and pasting stuff that pre-exists in multiple 
other files.

The impact is negligible.  Just nice to clean things up explicitly, but 
when the DTrace session ends it gets cleaned up anyhow.

>
>>   
>>   	dt_htab_destroy(dtp, dtp->dt_provs);
>>   
>> diff --git a/libdtrace/dt_prov_cpc.c b/libdtrace/dt_prov_cpc.c
>> +
>> +typedef struct cpc_probe_data {
>> +	char	*name;
>> +	int	*fds;
>> +} cpc_probe_data_t;
> Can we rename this to be cpc_probe_t (similar to profile_probe_t, etc)?

Okay, done.

>> +
>> +/*
>> + * Probe translators.  As we discover which events can actually be used
>> + * on the system, we put them in a linked list that translates from names
>> + * we present to the D/CPC user to names used by the underlying system.
>> + * Importantly, CPC wants no '-' in probe names.  And, stylistically, we
>> + * prefer lower-case probe names.
>> + */
> I would not refer to this as 'translators' because DTrace has a specific
> meaning when talking about translators.  How about "Probe name mapping" ?

Okay, done.

>> +typedef struct cpc_probex {
>> +	char			*Dname;
>> +	char			*pfmname;
>> +	struct cpc_probex	*next;
>> +} cpc_probex_t;
>> +/*
>> + * FIXME: Handle this differently?  E.g., add a member to dt_provimpl_t that
>> + * points to provider-specific data?
>> + */
>> +static cpc_probex_t	*first_probex;
> Why not use a dt_list since that is the typical way we implement lists in this
> codebase?  This would indeed be something to implement using an opaque pointer
> in dt_provimpl_t to some private provider struct.

I'll take a look at using dt_list.

>> +static int populate(dtrace_hdl_t *dtp)
>> +{
>> +	int		n = 0;
>> +
>> +	dt_provider_create(dtp, prvname, &dt_cpc, &pattr);
>> +	first_probex = NULL;
>> +
>> +	/* incidentally, pfm_strerror(pfm_initialize()) describes the error */
>> +	if (pfm_initialize() != PFM_SUCCESS)
>> +		return 0;
>> +
>> +	/* loop over PMUs (FWIW, ipmu=PFM_PMU_PERF_EVENT is among them) */
>> +	for (pfm_pmu_t ipmu = PFM_PMU_NONE; ipmu < PFM_PMU_MAX; ipmu ++) {
>> +		pfm_pmu_info_t pmuinfo;
>> +
>> +		memset(&pmuinfo, 0, sizeof(pmuinfo));
>> +		pmuinfo.size = sizeof(pfm_pmu_info_t);
>> +		if (pfm_get_pmu_info(ipmu , &pmuinfo) != PFM_SUCCESS || pmuinfo.is_present == 0)
>> +			continue;
>> +
>> +		/*
>> +		 * At this point, we have interesting information like:
>> +		 *     - pmuinfo.nevents
>> +		 *     - pmuinfo.name
>> +		 *     - pmuinfo.desc
>> +		 *     - pmuinfo.type = PFM_PMU_TYPE_[UNKNOWN|CORE|UNCORE]
>> +		 *     - pmuinfo.num_cntrs
>> +		 *     - pmuinfo.fixed_num_cntrs
>> +		 *     - pmuinfo.max_encoding // number of event codes returned by pfm_get_event_encoding()
> Comment within comment?  How about replacing // with -?

Okay, though fwiw there are quite a few pre-existing instances of 
comments (//) within comments (/* */) in our code.  No matter. Change made.

>> +		 */
>> +
>> +		/* loop over events */
>> +		for (int ievt = pmuinfo.first_event; ievt != -1; ievt = pfm_get_event_next(ievt)) {
>> +
>> +			/*
>> +			 * Convert opaque integer index ievt into a name evt.name.
>> +			 */
>> +
>> +			pfm_event_info_t evtinfo;
>> +
>> +			memset(&evtinfo, 0, sizeof(evtinfo));
>> +			evtinfo.size = sizeof(evtinfo);
>> +
>> +			/* PFM_OS_[NONE|PERF_EVENT|PERF_EVENT_EXT] */
>> +			if (pfm_get_event_info(ievt, PFM_OS_PERF_EVENT, &evtinfo) != PFM_SUCCESS)
>> +				continue;
>> +
>> +			/*
>> +			 * At this point, we have interesting information like:
>> +			 *     - evtinfo.name
>> +			 *     - evtinfo.desc   // a little verbose and does not say that much
>> +			 *     - evtinfo.nattrs
>> +			 *     - evtinfo.dtype  // should be PFM_DTYPE_UINT64
>> +			 *     - evtinfo.idx    // should be ievt
>> +			 *     - evtinfo.equiv  // some equivalent name, or "(null)"
> See above.  Replace // with -.
>> +			 */
>> +
>> +			/*
>> +			 * Convert the event name into perf_event attr.
>> +			 */
>> +			pfm_perf_encode_arg_t encoding;
>> +			struct perf_event_attr attr;
>> +			char *fstr = NULL;
>> +
>> +			memset(&encoding, 0, sizeof(encoding));
>> +			memset(&attr    , 0, sizeof(  attr  ));
> This looks a bit odd.  I'd drop the extra spacing - it doesn't really add much
> in terms of readability.
>
>> +			encoding.size = sizeof(encoding);
>> +			encoding.attr = &attr;
>> +			encoding.fstr = &fstr;
>> +
>> +			// os = [PFM_OS_PERF_EVENT | PFM_OS_PERF_EVENT_EXT]
>> +			// Note that pfm_strerror(pfm_get_os_event_encoding(...) describes any error.
> Please use /* */ comments.
>
>> +			if (pfm_get_os_event_encoding(evtinfo.name, PFM_PLM0 | PFM_PLM3, PFM_OS_PERF_EVENT, &encoding) != PFM_SUCCESS) {
>> +				if (fstr)
>> +					free(fstr);    // is this necessary if we errored out?
> Please use /* */ comments.  Also...  it this needed or not?  I would assume
> that testing can indicate whether it is needed?

I don't know if it's needed, but I'm reluctant to throw the check out 
simply because I found a case where it isn't needed.

>> +				continue;
>> +			}
>> +
>> +			// At this point, ievt is what we requested, while encoding.idx corresponds to fstr.
>> +			// Meanwhile, fstr will have some ":u=1:k=1" that we would otherwise want to modify.
> Please use /* */ comments.  Also, I do not actually understand the relevance
> of this comment.  Or what it means.  Can you explain?

These comments date back to when I was starting, and so I wanted to say 
something about what is known at what point.  Today, things seem to 
work, but I'm still reluctant to throw away too many of these notes.  
E.g., the CPC provider does not yet support raw events or optional 
attributes.  Anyhow, these are notes giving hints about how ievt 
compares to encoding.idx, what fstr looks like, etc.  I'd like to worry 
first about, say, implementing raw events and opt attrs before worrying 
about cleaning up vestigial notes.

>
>> +			if (fstr)
>> +				free(fstr);
>> +
>> +			/*
>> +			 * Now attr is largely set up.  Note:
>> +			 *     - attr.size is still 0, which is okay
>> +			 *     - attr.freq is 0, which is okay
>> +			 *     - attr.wakeup_events is 0, which we can change
>> +			 */
>> +			attr.wakeup_events = 1;
>> +
>> +			/*
>> +			 * Check attr with perf_event_open().
>> +			 */
>> +			int fd = perf_event_open(&attr, -1, 0 /* FIXME: cpu */, -1, 0);
>> +			if (fd < 0)
>> +				continue;
>> +			close(fd);
>> +
>> +			/*
>> +			 * We convert '-' to '_' to conform to CPC practices
>> +			 * and convert to lower-case characters (for stylistic reasons).
>> +			 */
>> +			cpc_probex_t *next_probex;
>> +			next_probex = dt_zalloc(dtp, sizeof(cpc_probe_data_t));   // FIXME: should free in some probe_destroy() or something/??
>> +			if (next_probex == NULL)
>> +				continue;
>> +			next_probex->pfmname = strdup(evtinfo.name);              // FIXME: check nonNULL; free somewhere
>> +			next_probex->Dname = strdup(evtinfo.name);                // FIXME: check nonNULL; free somewhere
> Please use /* */ comments.  Also...  These FIXMEs seem like things that should
> be resolved in this patch.

I'll give this more thought, but will probably send the v2 patch out 
first.  The cleanup would be clumsy (I think... but again, I can think 
about this more) and of dubious value.  It would be executed at the end 
of the session, when everything will disappear anyhow. So, there would 
be clumsiness as well as delay getting this patch out but for no benefit.

Again, though, I'll try to think about this more.

>> +			for (unsigned char *p = next_probex->Dname; *p; p++)
>> +				*p = (*p == '-') ? '_' : tolower(*p);
>> +			next_probex->next = first_probex;
>> +			first_probex = next_probex;
>> +
>> +			/*
>> +			 * Compose a CPC probe name by adding mode "all" and a sample period
>> +			 * big enough that even the fastest firing probe will not be unreasonable.
>> +			 */
>> +			char *suffix = "-all-1000000000";
>> +			char *s = dt_zalloc(dtp, strlen(next_probex->Dname) + strlen(suffix) + 1);  // FIXME: what if s==NULL?
> Please use /* */ comments.  And yes, what if s == NULL?  That should be
> handled.
>
>> +			sprintf(s, "%s%s", next_probex->Dname, suffix);
>> +
>> +			/*
>> +			 * If this probe is not yet there (likely!), add it.
>> +			 */
>> +			dtrace_probedesc_t pd;
>> +			pd.id = DTRACE_IDNONE;
>> +			pd.prv = prvname;
>> +			pd.mod = modname;
>> +			pd.fun = funname;
>> +			pd.prb = s;
>> +			if (dt_probe_lookup(dtp, &pd) == NULL && cpc_probe_insert(dtp, s))
>> +				n++;
>> +
>> +			dt_free(dtp, s);
>> +		}
>> +	}
>> +
>> +	return n;
>> +}
>> +
>> +static int decode_event(struct perf_event_attr *ap, const char *name) {
>> +	cpc_probex_t *probex;
>> +
>> +	// find the probe translator for this D name
> /* */ comments.
> Do not call it translator.

Okay.

>> +	for (probex = first_probex; probex; probex = probex->next)
>> +		if (strcmp(name, probex->Dname) == 0)
>> +			break;
>> +	if (probex == NULL)
>> +		return -1;
>> +
>> +	// fill in the attr for this pfm name
>>
>> +	pfm_perf_encode_arg_t encoding;
>> +	char *fstr = NULL;
>> +	int ret;
> All variable declarations should be at the beginning of the code block (or
> function).

Okay.

>> +
>> +	memset(&encoding, 0, sizeof(encoding));
>> +	encoding.size = sizeof(encoding);
>> +	encoding.attr = ap;
>> +	encoding.fstr = &fstr;
>> +
>> +	// os = [PFM_OS_PERF_EVENT | PFM_OS_PERF_EVENT_EXT]
>> +	// Note that pfm_strerror(pfm_get_os_event_encoding(...) describes any error.
>> +	ret = pfm_get_os_event_encoding(probex->pfmname, PFM_PLM0 | PFM_PLM3, PFM_OS_PERF_EVENT, &encoding);
>> +	if (fstr)
>> +		free(fstr);    // FIXME: is this necessary if we errored out?  if not, we do not need to define ret?
> /* */ comments.  And you should know whether you need this or not, right?

Again, I'm not confident here and so am playing safe with the check.

>> +	return (ret == PFM_SUCCESS) ? 0 : -1;
>> +}
>> +
>> +static int decode_probename(struct perf_event_attr *ap, const char *name) {
>> +	char buf[256];  // FIXME: fixed length???
> Isn't there a symbolic name for the name length limit?  Is there a known limit?

Oops.  Lazy.  I guess DTRACE_NAMELEN.  Fixed.



More information about the DTrace-devel mailing list