[DTrace-devel] [PATCH v4 20/25] usdt: DTrace userspace side

Kris Van Hees kris.van.hees at oracle.com
Sun Oct 16 17:53:36 UTC 2022


This is a partial review, mostly because there are some more fundamental things
in it that may affect the rest of the code in various ways.

On Fri, Oct 07, 2022 at 11:25:16AM +0100, Nick Alcock via DTrace-devel wrote:
> This is implemented almost entirely in the pid provider, which is
> reassuringly similar to how it was done in the in-kernel days. They're
> really very closely-related beasts, and the same code can handle both
> easily enough.
> 
> This does several things that are sufficiently intertwined that putting
> them in one commit seems most readable:
> 
>  - implements USDT probe discovery, ripping out a lot of old ioctl stuff
>    and obsolete code handling stuff like structure-copying thunks in the
>    Solaris C library and a bunch of obsolete functions around DOF
>    acquisition (keeping one which might well be revived in the next
>    phase), and adding dt_pid_create_usdt_probes, which scans the
>    systemwide uprobe list and creates DTrace-side underlying probes for
>    all of them that are relevant (see below), using an sscanf-based
>    parser: the uprobe naming scheme was designed so that it works with
>    the rather extreme limitations of such parsers.  Thanks to the %m
>    conversion specifier there is no risk of buffer overrunning if the
>    name components are unexpectedly long.
> 
>    Right now this can only create probes for specific processes (those
>    named on the command line in probe names, as usual), but in future
>    it'll grow the ability to make probes for everything dtprobed has
>    spotted probes for.  Because it is driven by the systemwide uprobe
>    list, it can create probes for processes that started before DTrace
>    did, just like the old in-kernel model.

Why is this the case?  If dtprobed creates the uprobes, then all the USDT
probes in the entire system as known to dtrace so it can perform a probe match
across all of them.  I do not see why that would not work.

>  - rejigs dt_prov_pid to use the new uprobe_create mechanism to make pid
>    probes, with names consistent with those dtprobed creates for USDT
>    probes; the uprobes have names like dt_pid/dt_$dev_$ino_$addr, possibly
>    with a ret_ portion if this is a uretprobe.  USDT probes pass their

This is no longer correct.  It got changed in an earlier commit to use
dt_pid/[pr]_$dev_$ino_$addr instead (p for regular probe, r for return probe).

>    args down as encoded, arg-number-appended uprobe arguments, viz:
> 
>    p:dt_pid/dt_2d_231d_7fbe933165fd /tmp/runtest.17112/usdt-dlclose1.123064.9010/livelib.so:0x15fd test_prov1=\1 livelib__2eso2=\2 go3=\3 go4=\4
> 
>    (The numeric suffix is added and stripped off automatically by the
>    name en/decoder, and makes sure that no two "args" have the same
>    name, even if the probe component is the same, as above.)

This is no longer correct either.  Aside from the aforementioned change to
the probe name, we also changed the argument list to use P...=\1 M...=\2 etc.
> 
>  - adjusts provide_pid so it can be called at USDT uprobe discovery time
>    to create underlying probes that user-requested USDT probes will then
>    utilize.  The changes here look rather extreme but are actually quite
>    small: the underlying probe stuff is split into a new function,
>    provide_pid_underlying(), and provide_pid can be asked to do nothingg
>    but call this; the probe name follows the new dtprobed rules, looking
>    like pid:$inum::$offset, and thus is identical for probes created for
>    USDT and for pids at the same offset.
> 
>    The struct pid_probe attached to the underlying probe gains a device
>    number (which it should always have had) and keeps track of the
>    underlying uprobe name from create_uprobe or USDT probe discovery
>    and remembers whether or not DTrace created it (if dtprobed created
>    it, dtrace must not delete it).
> 
>    provide_pid itself is adjusted to call provide_pid_underlying as
>    needed, but also to do more work if the probe already exists: USDT
>    probes can be associated with more than one underlying probe (if the
>    probe appears repeatedly in a program), so if it is repeatedly
>    provide_pid'ed with different offsets, we spot this and chain it into
>    all the necessary underlying probes, and also keep track of all the
>    underlying probes associated with an overlying probe so we can enable
>    them all (see next item).
> 
>  - enabling gets a little more complex.  We can no longer get away with
>    just enabling the underlying probe, because things not in the
>    enablings list don't get entries in the probename strtab, and their
>    probename is liable to be empty: in any case, the underlying probes
>    aren't going to get ->enable called for them because the user never
>    named them (only their overlying probe).  So we intern both the
>    overlying *and* the underlying probe in enablings (by getting
>    ->enable for the overlying probe to walk its list of underlying
>    probes and enable all of them), and arrange for the overlying probe
>    to have no trampoline: in conjunction with a prior commit in this
>    series this causes only the underlying probe to have any BPF code
>    generated for it.

The paragraph above seems very confusing to me.  And I am pretty certain I
know how this is supposed to work :)  E.g. the original code did not 'get
away with' anything.  First of all, only the underlying probes were added to
the enablings list because that list was meant to be the probes that need
to be enabled at the OS level and get BPF programs associated with them.  So,
adding overlying probes to that would have been wrong.  But as you noticed (and
as I reported elsewhere), the side effect was that the probe name components
of the ovelrying probe (the one users care about) were thereby not available
in the strtab which made probename etc fail.

So, USDT is not what makes this necessary... it is needed even for pid probes
and the fact that we did not was a bug.  So, that part should probably be its
own tiny patch (with a test, or if we have one - I think we might - remove the
xfail).

Your reference to what probes ->enable() gets called for is confusing (and
possibly incorrect).  I am not sure because I am a bit lost whether it is 
describing the old behaviour, the new, or the transition from old to new.

I think you also need to be a bit more clear when you say "this causes only
the underlying probe to have any BPF code generated for it."  Perhaps refer
to things like trampoline, clause prologue, etc (things like that) so it is
clear what BPF code you are referring to.

>  - Trampoline generation has to adapt to this, but also has to use a
>    less kludgy way of figuring out the pids the trampoline applies to:
>    rather than parsing the name apart on the spot, we ask dt_pid, which
>    already has code to *properly* parse apart both pid and usdt names
>    and extract the pid from them.

So, it is more like 'laso can use' rather than 'also has to use' right?
I.e. I agree it is a good and beneficial change, but it is less necessity and
more a matter of being more convenient (and better) to use the function.

In all, I don't think this even merits mentioning, esp. since the trampoline
change is already implied by the previous point.

>  - stack arg handling needs a bit of a tweak.  usdt probes (but not any
>    other sorts of probes) are placed after the setup of args for a
>    function call but obviously before the actual call, since the
>    "function" doesn't exist.  On platforms on which
>    PT_REGS_ARGSTKBASE > 0, we have to remember to *not* apply the
>    ARGSTKBASE adjustment when a usdt probe is being called, since
>    that applies to the actual function call, which hasn't happened.
>    (In theory, pid offset probes might need this, but figuring out when
>    this would be needed would be tricky: we'd need to do it only when
>    the probe was placed at the actual site of a function call -- and
>    even that would leave us in trouble if the probe was placed in the
>    middle of arg setup!)
> 
> Signed-off-by: Nick Alcock <nick.alcock at oracle.com>
> ---
>  include/dtrace/pid.h    |  24 ++-
>  libdtrace/Build         |   1 +
>  libdtrace/dt_cg.c       |  12 +-
>  libdtrace/dt_cg.h       |   2 +-
>  libdtrace/dt_pid.c      | 338 +++++++++++++++++++++++++------
>  libdtrace/dt_pid.h      |  11 +-
>  libdtrace/dt_prov_fbt.c |   2 +-
>  libdtrace/dt_prov_pid.c | 432 ++++++++++++++++++++++++++++------------
>  libdtrace/dt_prov_sdt.c |  31 +--
>  libdtrace/dt_provider.h |   4 +-
>  10 files changed, 635 insertions(+), 222 deletions(-)
> 
> diff --git a/include/dtrace/pid.h b/include/dtrace/pid.h
> index 77f07ce38ca7..687691af454f 100644
> --- a/include/dtrace/pid.h
> +++ b/include/dtrace/pid.h
> @@ -2,7 +2,7 @@
>   * Licensed under the Universal Permissive License v 1.0 as shown at
>   * http://oss.oracle.com/licenses/upl.
>   *
> - * Copyright (c) 2009, 2021, Oracle and/or its affiliates. All rights reserved.
> + * Copyright (c) 2009, 2022, Oracle and/or its affiliates. All rights reserved.
>   */
>  
>  /*
> @@ -13,6 +13,7 @@
>  #ifndef _DTRACE_PID_H
>  #define _DTRACE_PID_H
>  
> +#include <sys/types.h>
>  #include <dirent.h>
>  #include <dtrace/universal.h>
>  
> @@ -22,20 +23,33 @@ typedef enum pid_probetype {
>  	DTPPT_RETURN,
>  	DTPPT_OFFSETS,
>  	DTPPT_POST_OFFSETS,
> -	DTPPT_IS_ENABLED
> +	DTPPT_USDT,
> +	DTPPT_IS_ENABLED,
> +	DTPPT_UNDERLYING
>  } pid_probetype_t;

These probetypes should reflect the style of probe that is being used based on
their implementation - not based on how they are being used.  Each of the
original types has its own characteristics at the low level implementation.
DTPPT_USDT is not a different type.  It is a form of DTPPT_OFFSETS or
DTPPT_IS_ENABLED probe.

In the same sense, I am not too sure I would agree that DTPPT_UNDERLYING makes
sense because you can have underlying probes that are entry probes, return
probes, offset probes, ...  So, that again is not really a probe type but
rather a higher level implementation detail.

I think that part of the extra complexity here (and in the following struct
and other code) stems from the fact that you are implementig USDT probes using
the pid provider callbacks etc.  That obviously can work (as you did) but it
seems overly complex to me when you could just use the approach the legacy
version employed, where usdt is implemented using its own provider callbacks
(or operations - whichever terminology you prefer).  They can (and probably
would) use common code, but USDT probes are definitely different from pid
probes and keeping some of the implementation separate makes sense and avoids
needing to carry info around to know what kind of userspace probe you are
dealing with.

>  typedef struct pid_probespec {
> -	pid_t pps_pid;				/* task PID */
>  	pid_probetype_t pps_type;		/* probe type */
> +	const char *pps_prv;			/* provider (without pid) */
>  	char *pps_mod;				/* probe module (object) */
> +	const char *pps_usdt_mod;		/* module of usdt probe */
> +	const char *pps_usdt_prb;		/* name of usdt probe */
>  	char pps_fun[DTRACE_FUNCNAMELEN];	/* probe function */
> -	ino_t pps_ino;				/* object inode */
> +	dev_t pps_dev;				/* object device node */
> +	ino_t pps_inum;				/* object inode */
>  	char *pps_fn;				/* object full filename */
> +	char *pps_prb;				/* probe name */
> +	uint64_t pps_addr;			/* final object base address */
> +	char *pps_uprobe_name;			/* non-NULL if uprobe exists */
> +
> +	/*
> +	 * Fields below this point do not apply to probes of type
> +	 * DTPPT_UNDERLYING.
> +	 */
> +	pid_t pps_pid;				/* task PID */
>  	uint64_t pps_pc;			/* probe address */
>  	uint64_t pps_vaddr;			/* object base address */
>  	uint64_t pps_size;			/* function size (in bytes) */
> -	uint8_t pps_glen;			/* glob pattern length */
>  	char pps_gstr[1];			/* glob pattern string */
>  } pid_probespec_t;
>  
> diff --git a/libdtrace/Build b/libdtrace/Build
> index 2e553c67b634..57bffff86e84 100644
> --- a/libdtrace/Build
> +++ b/libdtrace/Build
> @@ -84,6 +84,7 @@ dt_consume.c_CFLAGS := -Wno-pedantic
>  dt_debug.c_CFLAGS := -Wno-prio-ctor-dtor
>  dt_cg.c_CFLAGS := -Wno-pedantic
>  dt_dis.c_CFLAGS := -Wno-pedantic
> +dt_pid.c_CFLAGS := -Wno-pedantic
>  dt_proc.c_CFLAGS := -Wno-pedantic
>  dt_prov_dtrace.c_CFLAGS := -Wno-pedantic
>  dt_prov_fbt.c_CFLAGS := -Wno-pedantic
> diff --git a/libdtrace/dt_cg.c b/libdtrace/dt_cg.c
> index 9f9d0fb03028..bf147bd3cb4a 100644
> --- a/libdtrace/dt_cg.c
> +++ b/libdtrace/dt_cg.c
> @@ -267,12 +267,15 @@ dt_cg_tramp_copy_regs(dt_pcb_t *pcb, int rp)
>  
>  /*
>   * Copy arguments from a dt_pt_regs structure referenced by the 'rp' argument.
> + * If 'called' is nonzero, the registers are laid out as when inside the
> + * function: if zero, they are laid out as at the call instruction, before the
> + * function is called (as is done for e.g. usdt).

Instead of using a 'called' flag (which I know I will have trouble remembering)
why not simply pass in the stack offset to access arguments values from?  I.e.
PT_REGS_ARGSTKBASE or 0 (or possibly some other value if there is another case
we need to cover).  It also seems more appropriate for the caller to determine
where to access argument values from rather than hardcoding it here.

>   *
>   * The caller must ensure that %r7 contains the value set by the
>   * dt_cg_tramp_prologue*() functions.
>   */
>  void
> -dt_cg_tramp_copy_args_from_regs(dt_pcb_t *pcb, int rp)
> +dt_cg_tramp_copy_args_from_regs(dt_pcb_t *pcb, int rp, int called)

So this would become:

	dt_cg_tramp_copy_args_from_regs(dt_pcb_t *pcb, int rp, int argstkbase)

and below s/PT_REGS_ARGSTKBASE/argstkbase/

>  {
>  	dtrace_hdl_t	*dtp = pcb->pcb_hdl;
>  	dt_irlist_t	*dlp = &pcb->pcb_ir;
> @@ -318,13 +321,13 @@ dt_cg_tramp_copy_args_from_regs(dt_pcb_t *pcb, int rp)
>  	 *		rc = bpf_probe_read[_user](dctx->mst->argv[i],
>  	 *					   sizeof(uint64_t),
>  	 *					   &sp[i - PT_REGS_ARGC +
> -	 *						   PT_REGS_ARGSTKBASE]);
> +	 *						   (called ? PT_REGS_ARGSTKBASE : 0)]);
>  	 *				// mov %r1, %r7
>  	 *				// add %r1, DMST_ARG(i)
>  	 *				// mov %r2, sizeof(uint64_t)
>  	 *				// lddw %r3, [%rp + PT_REGS_SP]
>  	 *				// add %r3, (i - PT_REGS_ARGC +
> -	 *						 PT_REGS_ARGSTKBASE) *
> +	 *						 (called ? PT_REGS_ARGSTKBASE : 0)) *
>  	 *					    sizeof(uint64_t)
>  	 *				// call bpf_probe_read[_user]
>  	 *		if (rc != 0)
> @@ -344,7 +347,8 @@ dt_cg_tramp_copy_args_from_regs(dt_pcb_t *pcb, int rp)
>  		emit(dlp,  BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, DMST_ARG(i)));
>  		emit(dlp,  BPF_MOV_IMM(BPF_REG_2, sizeof(uint64_t)));
>  		emit(dlp,  BPF_LOAD(BPF_DW, BPF_REG_3, rp, PT_REGS_SP));
> -		emit(dlp,  BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, (i - PT_REGS_ARGC + PT_REGS_ARGSTKBASE) * sizeof(uint64_t)));
> +		emit(dlp,  BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, (i - PT_REGS_ARGC +
> +			    (called ? PT_REGS_ARGSTKBASE : 0)) * sizeof(uint64_t)));
>  		emit(dlp,  BPF_CALL_HELPER(dtp->dt_bpfhelper[BPF_FUNC_probe_read_user]));
>  		emit(dlp,  BPF_BRANCH_IMM(BPF_JEQ, BPF_REG_0, 0, lbl_ok));
>  		emit(dlp,  BPF_STORE_IMM(BPF_DW, BPF_REG_7, DMST_ARG(i), 0));
> diff --git a/libdtrace/dt_cg.h b/libdtrace/dt_cg.h
> index a7606bc00f84..1bfe378714b3 100644
> --- a/libdtrace/dt_cg.h
> +++ b/libdtrace/dt_cg.h
> @@ -24,7 +24,7 @@ extern void dt_cg_tramp_prologue_act(dt_pcb_t *pcb, dt_activity_t act);
>  extern void dt_cg_tramp_prologue(dt_pcb_t *pcb);
>  extern void dt_cg_tramp_clear_regs(dt_pcb_t *pcb);
>  extern void dt_cg_tramp_copy_regs(dt_pcb_t *pcb, int rp);
> -extern void dt_cg_tramp_copy_args_from_regs(dt_pcb_t *pcb, int rp);
> +extern void dt_cg_tramp_copy_args_from_regs(dt_pcb_t *pcb, int rp, int called);

extern void dt_cg_tramp_copy_args_from_regs(dt_pcb_t *pcb, int rp,
					    int argstkbase);

>  extern void dt_cg_tramp_copy_rval_from_regs(dt_pcb_t *pcb, int rp);
>  extern void dt_cg_tramp_call_clauses(dt_pcb_t *pcb, const dt_probe_t *prp,
>  				     dt_activity_t act);
> diff --git a/libdtrace/dt_pid.c b/libdtrace/dt_pid.c
> index 4b143966ff3c..c0826ff7c521 100644
> --- a/libdtrace/dt_pid.c
> +++ b/libdtrace/dt_pid.c
> @@ -15,9 +15,11 @@
>  #include <libgen.h>
>  #include <stddef.h>
>  #include <sys/ioctl.h>
> +#include <sys/sysmacros.h>
>  
>  #include <mutex.h>
>  #include <port.h>
> +#include <uprobes.h>
>  
>  #include <dt_impl.h>
>  #include <dt_program.h>
> @@ -34,7 +36,8 @@ typedef struct dt_pid_probe {
>  	const char *dpp_func;
>  	const char *dpp_name;
>  	const char *dpp_obj;
> -	ino_t dpp_ino;
> +	dev_t dpp_dev;
> +	ino_t dpp_inum;
>  	char *dpp_fname;
>  	uintptr_t dpp_pc;
>  	uintptr_t dpp_vaddr;
> @@ -45,6 +48,8 @@ typedef struct dt_pid_probe {
>  	uint_t dpp_last_taken;
>  } dt_pid_probe_t;
>  
> +static char *dt_pid_de_pid(char *buf, const char *prv);
> +
>  /*
>   * Compose the lmid and object name into the canonical representation. We
>   * omit the lmid for the default link map for convenience.
> @@ -96,16 +101,17 @@ dt_pid_create_fbt_probe(struct ps_prochandle *P, dtrace_hdl_t *dtp,
>  {
>  	const dt_provider_t	*pvp;
>  
> +	psp->pps_prv = "pid";
>  	psp->pps_type = type;
>  	psp->pps_pc = (uintptr_t)symp->st_value;
> +	psp->pps_addr = psp->pps_pc - psp->pps_vaddr;
>  	psp->pps_size = (size_t)symp->st_size;
> -	psp->pps_glen = 0;		/* no glob pattern */
> -	psp->pps_gstr[0] = '\0';
> +	psp->pps_gstr[0] = '\0';		/* no glob pattern */
>  
>  	/* Make sure we have a PID provider. */
>  	pvp = dtp->dt_prov_pid;
>  	if (pvp == NULL) {
> -		pvp = dt_provider_lookup(dtp, "pid");
> +		pvp = dt_provider_lookup(dtp, psp->pps_prv);
>  		if (pvp == NULL)
>  			return 0;
>  
> @@ -125,10 +131,10 @@ dt_pid_create_glob_offset_probes(struct ps_prochandle *P, dtrace_hdl_t *dtp,
>  {
>  	psp->pps_type = DTPPT_OFFSETS;
>  	psp->pps_pc = (uintptr_t)symp->st_value;
> +	psp->pps_addr = psp->pps_pc - psp->pps_vaddr;
>  	psp->pps_size = (size_t)symp->st_size;
> -	psp->pps_glen = strlen(pattern);
>  
> -	strncpy(psp->pps_gstr, pattern, psp->pps_glen + 1);
> +	strcpy(psp->pps_gstr, pattern);
>  
>  	/* Create a probe using 'psp'. */
>  
> @@ -168,7 +174,8 @@ dt_pid_per_sym(dt_pid_probe_t *pp, const GElf_Sym *symp, const char *func)
>  
>  	psp->pps_pid = pid;
>  	psp->pps_mod = dt_pid_objname(pp->dpp_lmid, pp->dpp_obj);
> -	psp->pps_ino = pp->dpp_ino;
> +	psp->pps_dev = pp->dpp_dev;
> +	psp->pps_inum = pp->dpp_inum;
>  	psp->pps_fn = strdup(pp->dpp_fname);
>  	psp->pps_vaddr = pp->dpp_vaddr;
>  	strcpy_safe(psp->pps_fun, sizeof(psp->pps_fun), func);
> @@ -306,7 +313,8 @@ dt_pid_per_mod(void *arg, const prmap_t *pmp, const char *obj)
>  
>  	dt_Plmid(pp->dpp_dtp, pid, pmp->pr_vaddr, &pp->dpp_lmid);
>  
> -	pp->dpp_ino = pmp->pr_inum;
> +	pp->dpp_dev = pmp->pr_dev;
> +	pp->dpp_inum = pmp->pr_inum;
>  	pp->dpp_vaddr = pmp->pr_file->first_segment->pr_vaddr;
>  
>  	/*
> @@ -459,7 +467,8 @@ dt_pid_fix_mod(dt_pid_probe_t *pp, dtrace_probedesc_t *pdp, dtrace_hdl_t *dtp,
>  		return NULL;
>  
>  	dt_Pobjname(dtp, pid, pmp->pr_vaddr, m, sizeof(m));
> -	pp->dpp_fname = strdup(m);
> +	if (pp)
> +		pp->dpp_fname = strdup(m);
>  	if ((obj = strrchr(m, '/')) == NULL)
>  		obj = &m[0];
>  	else
> @@ -484,7 +493,8 @@ dt_pid_create_pid_probes(dtrace_probedesc_t *pdp, dtrace_hdl_t *dtp,
>  	pp.dpp_pr = dpr->dpr_proc;
>  	pp.dpp_pcb = pcb;
>  	pp.dpp_nmatches = 0;
> -	pp.dpp_ino = 0;
> +	pp.dpp_dev = makedev(0, 0);
> +	pp.dpp_inum = 0;
>  
>  	/*
>  	 * Prohibit self-grabs.  (This is banned anyway by libproc, but this way
> @@ -570,7 +580,226 @@ dt_pid_create_pid_probes(dtrace_probedesc_t *pdp, dtrace_hdl_t *dtp,
>  	return ret;
>  }
>  
> -#if 0
> +/*
> + * Rescan the PID uprobe list and create suitable underlying probes.
> + *
> + * If dpr is set, just set up probes relating to mappings found in that one
> + * process.  (dpr must in this case be locked.)
> + *
> + * If pdp is set, create overlying USDT probes for the specified probe
> + * description.
> + *
> + * Return 0 on success or -1 on error.  (Failure to create specific underlying
> + * probes is not an error.)
> + */
> +static int
> +dt_pid_create_usdt_probes(dtrace_hdl_t *dtp, dt_proc_t *dpr,
> +			  dtrace_probedesc_t *pdp, dt_pcb_t *pcb)

Any reason why the arguments to this function were changed from their
original ordering (which matches dt_pid_create_pid_probes() whereas you new
ordering does not)?

> +{
> +	const dt_provider_t *pvp;
> +	FILE *f;
> +	char *buf = NULL;
> +	size_t sz;
> +	int ret = 0;
> +
> +	pvp = dtp->dt_prov_pid;
> +	if (!pvp) {
> +		pvp = dt_provider_lookup(dtp, "pid");
> +		assert(pvp != NULL);
> +		dtp->dt_prov_pid = pvp;
> +	}
> +
> +	assert(pvp->impl != NULL && pvp->impl->provide_pid != NULL);
> +
> +	f = fopen(TRACEFS "uprobe_events", "r");
> +	if (!f) {
> +		dt_dprintf("cannot open " TRACEFS "uprobe_events: %s\n",
> +		    strerror(errno));
> +		return -1;
> +	}
> +
> +	/*
> +	 * Systemwide probing: not yet implemented.
> +	 */
> +	assert(dpr != NULL);
> +
> +	dt_dprintf("Scanning for usdt probes matching %i\n", dpr->dpr_pid);
> +
> +	/*
> +	 * We are only interested in pid uprobes, not any other uprobes that may
> +	 * exist.  Some of these may be for pid probes, some for usdt: we create
> +	 * underlying probes for all of them, except that we may only be
> +	 * interested in probes belonging to mappings in a particular process,
> +	 * in which case we create probes for that process only.
> +	 */
> +	while (getline(&buf, &sz, f) >= 0) {
> +		dev_t dev;
> +		ino_t inum;
> +		uint64_t addr;
> +		unsigned long long dev_ll, inum_ll;
> +		char *inum_str = NULL, *addr_str = NULL;
> +		char *spec = NULL;
> +		char *eprv = NULL, *emod = NULL, *efun = NULL, *eprb = NULL;
> +		char *prv = NULL, *mod = NULL, *fun = NULL, *prb = NULL;
> +		char *uprobe_name, *uend;
> +		pid_probespec_t psp;
> +
> +#define UPROBE_PREFIX "p:dt_pid/dt_"
> +		if (strncmp(buf, UPROBE_PREFIX, strlen(UPROBE_PREFIX)) != 0)
> +			continue;
> +
> +		switch (sscanf(buf, "p:dt_pid/dt_%llx_%llx_%lx %ms "
> +			       "%m[^= ]=\\1 %m[^= ]=\\2 %m[^= ]=\\3 %m[^= ]=\\4", &dev_ll, &inum_ll,
> +			       &addr, &spec, &eprv, &emod, &efun, &eprb)) {
> +		case 8: /* Includes dtrace probe names: decode them. */
> +			prv = uprobe_decode_name(eprv);
> +			mod = uprobe_decode_name(emod);
> +			fun = uprobe_decode_name(efun);
> +			prb = uprobe_decode_name(eprb);
> +			/* FALLTHRU */
> +		case 4: /* No dtrace probe name. */
> +			dev = dev_ll;
> +			inum = inum_ll;
> +			break;
> +		default:
> +			if ((strlen(buf) > 0) && (buf[strlen(buf)-1] == '\n'))
> +				buf[strlen(buf)-1] = 0;
> +			dt_dprintf("Cannot parse %s as a DTrace uprobe name\n", buf);
> +			goto next;
> +		}
> +
> +		if (asprintf(&inum_str, "%llx", inum_ll) < 0)
> +			goto next;
> +		if (asprintf(&addr_str, "%lx", addr) < 0)
> +			goto next;
> +
> +		uprobe_name = buf + 2;
> +		uend = strchr(uprobe_name, ' ');
> +		if (!uend)
> +			goto next;
> +		*uend = 0;
> +
> +		/*
> +		 * Make the underlying probe, if not already present.
> +		 */
> +		memset(&psp, 0, sizeof(pid_probespec_t));
> +
> +		/*
> +		 * These components are only used for creation of an underlying
> +		 * probe with no overlying counterpart: usually these are those
> +		 * not explicitly listed in the D program, which will never be
> +		 * enabled.  In future this may change.
> +		 */
> +
> +		psp.pps_type = DTPPT_UNDERLYING;
> +		psp.pps_prv = "pid";
> +		psp.pps_mod = inum_str;
> +		psp.pps_fun[0] = '\0';
> +		psp.pps_prb = addr_str;
> +
> +		/*
> +		 * Always used.
> +		 */
> +		psp.pps_dev = dev;
> +		psp.pps_inum = inum;
> +		psp.pps_addr = addr;
> +		psp.pps_uprobe_name = uprobe_name;
> +
> +		/*
> +		 * The filename is only needed when creating a spec for uprobes:
> +		 * but in this case the uprobe already exists.
> +		 */
> +		psp.pps_fn = "";
> +
> +		/*
> +		 * Filter out probes not related to the process of interest.
> +		 */
> +		if (dpr && dpr->dpr_proc) {
> +			assert(MUTEX_HELD(&dpr->dpr_lock));
> +			if (Pinode_to_file_map(dpr->dpr_proc, dev, inum) == NULL)
> +				goto next;
> +			psp.pps_pid = Pgetpid(dpr->dpr_proc);
> +		}
> +
> +		/*
> +		 * If we have a full probe name, extract it and use it to match
> +		 * this underlying probe to an overlying DTrace probe.
> +		 */
> +		if (prv && mod && fun && prb) {
> +			psp.pps_prv = prv;
> +			strcpy_safe(psp.pps_fun, sizeof(psp.pps_fun), fun);
> +			psp.pps_usdt_mod = mod;
> +			psp.pps_usdt_prb = prb;
> +		}
> +
> +		/*
> +		 * Does this match the overlying probe we are meant to be
> +		 * creating?  If so, create an overlying probe and let
> +		 * provide_pid() create the underlying probe for us.  Among
> +		 * other things, this ensures that the PID part of the provider
> +		 * name gets stuck into the overlying probe properly.
> +		 *
> +		 * TODO: wildcard probes get handled here.
> +		 */
> +		if (pdp) {
> +			char de_pid_prov[DTRACE_PROVNAMELEN + 1];
> +
> +			if ((pdp->prv[0] == 0 ||
> +			     strcmp(dt_pid_de_pid(de_pid_prov, pdp->prv),
> +				    psp.pps_prv) == 0) &&
> +			    (pdp->mod[0] == 0 || strcmp(pdp->mod, psp.pps_usdt_mod) == 0) &&
> +			    (pdp->fun[0] == 0 || strcmp(pdp->fun, psp.pps_fun) == 0) &&
> +			    (pdp->prb[0] == 0 || strcmp(pdp->prb, psp.pps_usdt_prb) == 0)) {
> +				psp.pps_type = DTPPT_USDT;
> +				psp.pps_mod = (char *) psp.pps_usdt_mod;
> +				psp.pps_prb = (char *) psp.pps_usdt_prb;
> +			}
> +			else {
> +				dt_dprintf("%s:%s:%s:%s -> %s:%s:%s:%s: match failure\n",
> +					   pdp->prv, pdp->mod, pdp->fun, pdp->prb,
> +					   psp.pps_prv, psp.pps_usdt_mod, psp.pps_fun,
> +					   psp.pps_usdt_prb);
> +				goto next;
> +			}
> +		}
> +
> +		/*
> +		 * Create a probe using psp, if not already present.
> +		 *
> +		 * If we are creating an overlying probe, complain about it if
> +		 * probe creation fails.  Otherwise, this is just an underlying
> +		 * probe: we'll complain later if we use it for anything.
> +		 */
> +
> +		if (pvp->impl->provide_pid(dtp, &psp) < 0 && pdp) {
> +			dt_pid_error(dtp, pcb, dpr, D_PROC_USDT,
> +				     "failed to instantiate probe %s for pid %d: %s",
> +				     pdp->prb, dpr->dpr_pid,
> +				     dtrace_errmsg(dtp, dtrace_errno(dtp)));
> +			ret = -1;
> +		}
> +
> +		if (pdp && dpr) {
> +			/*
> +			 * Put the module name in its canonical form.
> +			 */
> +			dt_pid_fix_mod(NULL, pdp, dtp, dpr->dpr_pid);
> +		}
> +
> +	next:
> +		free(eprv); free(emod); free(efun); free(eprb);
> +		free(prv);  free(mod);  free(fun);  free(prb);
> +		free(addr_str);
> +		free(inum_str);
> +		free(spec);
> +		continue;
> +	}
> +	free(buf);
> +
> +	return ret;
> +}
> +
> +#if 0 /* Almost certainly unnecessary in this form */
>  static int
>  dt_pid_usdt_mapping(void *data, const prmap_t *pmp, const char *oname)
>  {
> @@ -585,10 +814,8 @@ dt_pid_usdt_mapping(void *data, const prmap_t *pmp, const char *oname)
>  	int fd = -1;
>  
>  	/*
> -	 * The symbol ___SUNW_dof is for lazy-loaded DOF sections, and
> -	 * __SUNW_dof is for actively-loaded DOF sections. We try to force
> -	 * in both types of DOF section since the process may not yet have
> -	 * run the code to instantiate these providers.
> +	 * We try to force-load the DOF since the process may not yet have run
> +	 * the code to instantiate these providers.

Why bother updating the comment block if you indicate above that this function
is "Almost certainly unnecessary in this form"?  Just leave it as-is (avoiding
changes that are not relevant now) and we'll update this as needed (if ever).

>  	 */
>  	for (i = 0; i < 2; i++) {
>  		if (dt_Pxlookup_by_name(dpr->dpr_hdl, dpr->dpr_pid, PR_LMID_EVERY,
> @@ -628,32 +855,12 @@ dt_pid_usdt_mapping(void *data, const prmap_t *pmp, const char *oname)
>  
>  	return 0;
>  }
> -
> -static int
> -dt_pid_create_usdt_probes(dtrace_probedesc_t *pdp, dtrace_hdl_t *dtp,
> -    dt_pcb_t *pcb, dt_proc_t *dpr)
> -{
> -	int ret = 0;
> -
> -	assert(MUTEX_HELD(&dpr->dpr_lock));
> -
> -	if (dt_Pobject_iter(dtp, dpr->dpr_pid, dt_pid_usdt_mapping, dpr) != 0) {
> -		ret = -1;
> -		dt_pid_error(dtp, pcb, dpr, D_PROC_USDT,
> -		    "failed to instantiate probes for pid %d: %s",
> -		    dpr->dpr_pid, strerror(errno));
> -	}
> -
> -	/*
> -	 * Put the module name in its canonical form.
> -	 */
> -	dt_pid_fix_mod(pdp, dtp, dpr->dpr_pid);
> -
> -	return ret;
> -}
>  #endif
>  
> -static pid_t
> +/*
> + * Extract the pid from a USDT provider name.

This is used for more than just USDT probes so this comment is wrong.

> + */
> +pid_t
>  dt_pid_get_pid(const dtrace_probedesc_t *pdp, dtrace_hdl_t *dtp, dt_pcb_t *pcb,
>  	       dt_proc_t *dpr)
>  {
> @@ -684,6 +891,23 @@ dt_pid_get_pid(const dtrace_probedesc_t *pdp, dtrace_hdl_t *dtp, dt_pcb_t *pcb,
>  	return pid;
>  }
>  
> +/*
> + * Return a USDT provider name with the pid removed.

This is useful for more than just USDT probes so I think it is better to drop
the "USDT" in this comment.

> + */
> +static char *
> +dt_pid_de_pid(char *buf, const char *prv)
> +{
> +	const char *c;
> +	char *p = buf;
> +
> +	for (c = prv; *c != '\0'; c++) {
> +		if (!isdigit(*c))
> +			*p++ = *c;
> +	}
> +	*p = 0;
> +	return buf;
> +}
> +
>  int
>  dt_pid_create_probes(dtrace_probedesc_t *pdp, dtrace_hdl_t *dtp, dt_pcb_t *pcb)
>  {
> @@ -697,11 +921,10 @@ dt_pid_create_probes(dtrace_probedesc_t *pdp, dtrace_hdl_t *dtp, dt_pcb_t *pcb)
>  	if ((pid = dt_pid_get_pid(pdp, dtp, pcb, NULL)) == -1)
>  		return -1;
>  
> -	snprintf(provname, sizeof(provname), PID_PRVNAME, (int)pid);
> +	snprintf(provname, sizeof(provname), "pid%d", (int)pid);

Why?

>  	if (gmatch(provname, pdp->prv) != 0) {
> -		pid = dt_proc_grab_lock(dtp, pid, DTRACE_PROC_WAITING);
> -		if (pid < 0) {
> +		if (dt_proc_grab_lock(dtp, pid, DTRACE_PROC_WAITING) < 0) {
>  			dt_pid_error(dtp, pcb, NULL, D_PROC_GRAB,
>  			    "failed to grab process %d", (int)pid);
>  			return -1;
> @@ -711,14 +934,6 @@ dt_pid_create_probes(dtrace_probedesc_t *pdp, dtrace_hdl_t *dtp, dt_pcb_t *pcb)
>  		assert(dpr != NULL);
>  
>  		err = dt_pid_create_pid_probes(pdp, dtp, pcb, dpr);
> -		if (err == 0) {
> -			/*
> -			 * Alert other retained enablings which may match
> -			 * against the newly created probes.
> -			 */
> -			dt_ioctl(dtp, DTRACEIOC_ENABLE, NULL);
> -		}
> -
>  		dt_proc_release_unlock(dtp, pid);
>  	}
>  
> @@ -726,8 +941,7 @@ dt_pid_create_probes(dtrace_probedesc_t *pdp, dtrace_hdl_t *dtp, dt_pcb_t *pcb)
>  	 * If it's not strictly a pid provider, we might match a USDT provider.
>  	 */
>  	if (strcmp(provname, pdp->prv) != 0) {
> -		pid = dt_proc_grab_lock(dtp, pid, DTRACE_PROC_WAITING);
> -		if (pid < 0) {
> +		if (dt_proc_grab_lock(dtp, pid, DTRACE_PROC_WAITING) < 0) {
>  			dt_pid_error(dtp, pcb, NULL, D_PROC_GRAB,
>  			    "failed to grab process %d", (int)pid);
>  			return -1;
> @@ -736,16 +950,16 @@ dt_pid_create_probes(dtrace_probedesc_t *pdp, dtrace_hdl_t *dtp, dt_pcb_t *pcb)
>  		dpr = dt_proc_lookup(dtp, pid);
>  		assert(dpr != NULL);
>  
> -#ifdef FIXME
>  		if (!dpr->dpr_usdt) {
> -			err = dt_pid_create_usdt_probes(pdp, dtp, pcb, dpr);
> +			err = dt_pid_create_usdt_probes(dtp, dpr, pdp, pcb);
>  			dpr->dpr_usdt = B_TRUE;
>  		}
> -#endif
>  
>  		dt_proc_release_unlock(dtp, pid);
>  	}
>  
> +	/* (USDT systemwide probing goes here.)  */
> +
>  	return err ? -1 : 0;
>  }
>  
> @@ -756,7 +970,7 @@ dt_pid_create_probes_module(dtrace_hdl_t *dtp, dt_proc_t *dpr)
>  	dt_stmt_t *stp;
>  	dtrace_probedesc_t *pdp;
>  	pid_t pid;
> -	int ret = 0, found = B_FALSE;
> +	int ret = 0;
>  	char provname[DTRACE_PROVNAMELEN];
>  
>  	snprintf(provname, sizeof(provname), "pid%d", (int)dpr->dpr_pid);
> @@ -772,8 +986,6 @@ dt_pid_create_probes_module(dtrace_hdl_t *dtp, dt_proc_t *dpr)
>  			if (pid != dpr->dpr_pid)
>  				continue;
>  
> -			found = B_TRUE;
> -
>  			pd = *pdp;
>  			pd.fun = strdup(pd.fun);	/* we may change it */
>  
> @@ -781,26 +993,22 @@ dt_pid_create_probes_module(dtrace_hdl_t *dtp, dt_proc_t *dpr)
>  			    dt_pid_create_pid_probes(&pd, dtp, NULL, dpr) != 0)
>  				ret = 1;
>  
> -#ifdef FIXME
>  			/*
>  			 * If it's not strictly a pid provider, we might match
>  			 * a USDT provider.
>  			 */
>  			if (strcmp(provname, pdp->prv) != 0 &&
> -			    dt_pid_create_usdt_probes(&pd, dtp, NULL, dpr) != 0)
> +			    dt_pid_create_usdt_probes(dtp, dpr, pdp, NULL) < 0)
>  				ret = 1;
> -#endif
>  
>  			free((char *)pd.fun);
>  		}
>  	}
>  
>  	/*
> -	 * Give DTrace a shot to the ribs to get it to check
> -	 * out the newly created probes.
> +	 * XXX systemwide: rescan for new probes here?  We have to do it
> +	 * at some point, but when?
>  	 */
> -	if (found)
> -		dt_ioctl(dtp, DTRACEIOC_ENABLE, NULL);
>  
>  	return ret;
>  }
> diff --git a/libdtrace/dt_pid.h b/libdtrace/dt_pid.h
> index b5ba9598a425..057684812815 100644
> --- a/libdtrace/dt_pid.h
> +++ b/libdtrace/dt_pid.h
> @@ -1,6 +1,6 @@
>  /*
>   * Oracle Linux DTrace.
> - * Copyright (c) 2006, 2021, Oracle and/or its affiliates. All rights reserved.
> + * Copyright (c) 2006, 2022, Oracle and/or its affiliates. All rights reserved.
>   * Licensed under the Universal Permissive License v 1.0 as shown at
>   * http://oss.oracle.com/licenses/upl.
>   */
> @@ -16,14 +16,11 @@
>  extern "C" {
>  #endif
>  
> -#define	DT_PROC_ERR	(-1)
> -#define	DT_PROC_ALIGN	(-2)
> -
> -#define PID_PRVNAME	"pid%d"

Why?

> -
>  extern int dt_pid_create_probes(dtrace_probedesc_t *, dtrace_hdl_t *,
> -				dt_pcb_t *pcb);
> +				dt_pcb_t *);
>  extern int dt_pid_create_probes_module(dtrace_hdl_t *, dt_proc_t *);
> +extern pid_t dt_pid_get_pid(const dtrace_probedesc_t *, dtrace_hdl_t *, dt_pcb_t *,
> +			    dt_proc_t *);
>  
>  #ifdef	__cplusplus
>  }
> diff --git a/libdtrace/dt_prov_fbt.c b/libdtrace/dt_prov_fbt.c
> index 72c31469df61..db037e4e72d0 100644
> --- a/libdtrace/dt_prov_fbt.c
> +++ b/libdtrace/dt_prov_fbt.c
> @@ -175,7 +175,7 @@ static void trampoline(dt_pcb_t *pcb)
>  		dt_cg_xsetx(dlp, NULL, DT_LBL_NONE, BPF_REG_0, -1);
>  		emit(dlp,  BPF_STORE(BPF_DW, BPF_REG_7, DMST_ARG(0), BPF_REG_0));
>  	} else
> -		dt_cg_tramp_copy_args_from_regs(pcb, BPF_REG_8);
> +		dt_cg_tramp_copy_args_from_regs(pcb, BPF_REG_8, 1);

		dt_cg_tramp_copy_args_from_regs(pcb, BPF_REG_8, PT_REGS_ARGSTKBASE);

>  	dt_cg_tramp_epilogue(pcb);
>  }
>  
> diff --git a/libdtrace/dt_prov_pid.c b/libdtrace/dt_prov_pid.c
> index f747f49741e3..61ed30cf057b 100644
> --- a/libdtrace/dt_prov_pid.c
> +++ b/libdtrace/dt_prov_pid.c
> @@ -6,6 +6,7 @@
>   *
>   * The PID provider for DTrace.
>   */
> +#include <sys/types.h>
>  #include <assert.h>
>  #include <errno.h>
>  #include <string.h>
> @@ -18,22 +19,31 @@
>  #include "dt_provider.h"
>  #include "dt_probe.h"
>  #include "dt_pid.h"
> +#include "dt_string.h"
> +#include "uprobes.h"
>  
>  static const char		prvname[] = "pid";
>  
>  #define UPROBE_EVENTS		TRACEFS "uprobe_events"
>  
> -#define PID_GROUP_FMT		GROUP_FMT "_%lx"
> -#define PID_GROUP_DATA		GROUP_DATA, pp->ino
> -
>  typedef struct pid_probe {
> -	ino_t		ino;
> +	dev_t		dev;
> +	ino_t		inum;
>  	char		*fn;
>  	uint64_t	off;
>  	tp_probe_t	*tp;
>  	dt_list_t	probes;
> +	char		*uprobe_name;
> +	int		dtrace_created;
> +	int		is_return;
> +	int		arg_setup;
>  } pid_probe_t;
>  
> +typedef struct pid_probe_list {
> +	dt_list_t	list;			/* forward/back pointers */
> +	dt_probe_t	*probe;			/* the probe in question */
> +} pid_probe_list_t;
> +
>  static const dtrace_pattr_t	pattr = {
>  { DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
>  { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
> @@ -47,140 +57,310 @@ dt_provimpl_t	dt_pid_proc;
>  static int populate(dtrace_hdl_t *dtp)
>  {
>  	dt_provider_create(dtp, prvname, &dt_pid, &pattr);
> +	/* XXX systemwide probing: dt_pid_create_usdt_probes(dtp, NULL_NULL, NULL); */
>  	return 0;
>  }
>  
> -static void probe_destroy(dtrace_hdl_t *dtp, void *datap)
> +/*
> + * Destroy an underlying probe.
> + */
> +
> +static void probe_udestroy(dtrace_hdl_t *dtp, void *datap)

Why not probe_destroy_underlying() (vs probe_destroy()) similar to
provide_pid_underlying() (vs provide_pid())?  Consistency really makes things
easier.

>  {
> -	pid_probe_t	*pp = datap;
> -	tp_probe_t	*tpp = pp->tp;
> +	pid_probe_t		*pp = datap;
> +	tp_probe_t		*tpp = pp->tp;
> +	pid_probe_list_t	*pop, *pop_next;
>  
>  	dt_tp_destroy(dtp, tpp);
>  	dt_free(dtp, pp->fn);
> +	dt_free(dtp, pp->uprobe_name);
> +	for (pop = dt_list_next(&pp->probes); pop != NULL;
> +	     pop = pop_next) {
> +		pop_next = dt_list_next(pop);
> +		dt_free(dtp, pop);
> +	}
>  	dt_free(dtp, pp);
>  }
>  
> +/*
> + * Destroy an overlying probe.
> + */
> +static void probe_destroy(dtrace_hdl_t *dtp, void *datap)
> +{
> +	pid_probe_list_t	*pup = datap, *pup_next;
> +
> +	for (pup = datap; pup != NULL; pup = pup_next) {
> +		pup_next = dt_list_next(pup);
> +		dt_free(dtp, pup);
> +	}
> +}
> +
> +
> +/*
> + * Look up or create an underlying (real) probe, corresponding directly to a
> + * uprobe.  Since multiple pid and USDT probes may all map onto the same
> + * underlying probe, we may already have one in the system.
> + *
> + * If not found, we create a new probe.
> + */
> +static dt_probe_t *provide_pid_underlying(dtrace_hdl_t *dtp,
> +					  const pid_probespec_t *psp,
> +					  const pid_probespec_t *overlying)
> +{
> +	dtrace_probedesc_t	pd;
> +	dt_probe_t		*prp;
> +	pid_probe_t		*pp;
> +
> +	/*
> +	 * The module for these probes is the mapping inode number, in hex
> +	 * (filled in by the caller): the function is "underlying", and the
> +	 * probe name is the offset.  (This is amenable to change if it turns
> +	 * out to be inconvenient: but it's more or less arbitrary anyway.  It
> +	 * just needs to be something that is determinable both for pid and USDT
> +	 * probes.)
> +	 *
> +	 * Return probes get more info from the overlying probe, since they are
> +	 * in a different namespace to other uprobes and don't need to worry
> +	 * about colocation with usdt probes.
> +	 *
> +	 * XXX does this mean we lump together all probes with a given offset
> +	 * when doing wildcarding?  This is hardly ideal.  Do we want underlying
> +	 * probes to be wildcardable at all?  If we do, how should we structure
> +	 * this stuff?
> +	 */
> +
> +	pd.id = DTRACE_IDNONE;
> +	pd.prv = psp->pps_prv;
> +	pd.mod = psp->pps_mod;
> +	pd.prb = psp->pps_prb;
> +
> +	if (psp && overlying && overlying->pps_type == DTPPT_RETURN)
> +		pd.fun = psp->pps_fun;
> +	else
> +		pd.fun = "underlying";
> +
> +	prp = dt_probe_lookup(dtp, &pd);
> +	if (prp == NULL) {
> +		dt_provider_t *pvp;
> +
> +		/* Get the pid provider. */
> +		pvp = dt_provider_lookup(dtp, prvname);
> +		if (pvp == NULL)
> +			return NULL;
> +
> +		/* Set up the pid probe data. */
> +		pp = dt_zalloc(dtp, sizeof(pid_probe_t));
> +		if (pp == NULL)
> +			return NULL;
> +
> +		if (psp->pps_uprobe_name) {
> +			pp->uprobe_name = strdup(psp->pps_uprobe_name);
> +			if (pp->uprobe_name == NULL)
> +				goto fail_errno;
> +		}
> +
> +		pp->dev = psp->pps_dev;
> +		pp->inum = psp->pps_inum;
> +		pp->fn = strdup(psp->pps_fn);
> +		pp->off = psp->pps_addr;
> +		pp->tp = dt_tp_alloc(dtp);
> +		if (pp->tp == NULL)
> +			goto fail;
> +
> +		prp = dt_probe_insert(dtp, pvp, pd.prv, pd.mod, pd.fun, pd.prb,
> +				      pp);
> +		if (prp == NULL)
> +			goto fail;
> +
> +		if (overlying && overlying->pps_type == DTPPT_RETURN)
> +			pp->is_return = 1;
> +	}
> +	else
> +		pp = prp->prv_data;
> +
> +	/*
> +	 * USDT probes are positioned after argument setup for a function that
> +	 * is never actually called, so the stack is slightly different from the
> +	 * common case of accessing args for the current function.
> +	 */
> +	if (overlying && overlying->pps_type == DTPPT_USDT)
> +		pp->arg_setup = 1;
> +
> +	return prp;
> +fail_errno:
> +	dt_set_errno(dtp, errno);
> +fail:
> +	probe_destroy(dtp, pp);
> +	free(pp->uprobe_name);
> +	return NULL;
> +}
> +
>  static int provide_pid(dtrace_hdl_t *dtp, const pid_probespec_t *psp)
>  {
>  	char			prv[DTRACE_PROVNAMELEN];
> -	char			mod[DTRACE_MODNAMELEN];
> -	char			prb[DTRACE_NAMELEN];
> -	dt_provider_t		*pidpvp;
> +	const char		*mod;
> +	const char		*prb;
> +	char			underlying_mod[DTRACE_MODNAMELEN];
> +	char			underlying_prb[DTRACE_NAMELEN];
>  	dt_provider_t		*pvp;
>  	dtrace_probedesc_t	pd;
>  	pid_probe_t		*pp;
> -	dt_probe_t		*prp;
> -	uint64_t		off = psp->pps_pc - psp->pps_vaddr;
> +	dt_probe_t		*prp, *uprp;
> +	uint64_t		off = psp->pps_addr;
> +	pid_probespec_t		pp_psp;
> +	pid_probe_list_t	*pop, *pup;
> +
> +	if (psp->pps_type == DTPPT_UNDERLYING) {
> +		if (provide_pid_underlying(dtp, psp, NULL))
> +			return 0;
> +		else
> +			return -1;
> +	}
>  
>  	/*
> -	 * First check whether this pid probe already exists.  If so, there is
> -	 * nothing left to do.
> +	 * Get the provider and probe names right.  This is quite elaborate: pid
> +	 * probes use a fixed provider name for underlying probes and pid-suffixed
> +	 * for overlying ones, while USDT uses the name given in the USDT probe
> +	 * for the underlying probe instead, and entry and return probes use
> +	 * fixed probe names.  In any case, the underlying probe has a different
> +	 * naming scheme to the overlying in that the probe name, module, and
> +	 * likely function are all different.  (dt_pid_per_sym and
> +	 * dt_pid_create_usdt_probes are also involved in this dance.)
>  	 */
> -	snprintf(prv, sizeof(prv), PID_PRVNAME, psp->pps_pid);
>  
> +	snprintf(prv, sizeof(prv), "%s%d", psp->pps_prv, psp->pps_pid);
> +	snprintf(underlying_prb, sizeof(underlying_prb), "%lx", off);
> +
> +	mod = psp->pps_mod;
>  	switch (psp->pps_type) {
>  	case DTPPT_ENTRY:
> -		strncpy(prb, "entry", sizeof(prb));
> +		prb = "entry";
>  		break;
>  	case DTPPT_RETURN:
> -		strncpy(prb, "return", sizeof(prb));
> +		prb = "return";
> +		/*
> +		 * Return probes are in their own namespace and cannot collide
> +		 * with other probes: so we can give them a name that reflects
> +		 * their nature, since their nature is known (while other
> +		 * underlying probes might serve both USDT and pid probes at the
> +		 * same time, or pid and glob-offset, etc).
> +		 */
> +		sprintf(underlying_prb, "return");
>  		break;
>  	case DTPPT_OFFSETS:
> -		snprintf(prb, sizeof(prb), "%lx", off);
> +		prb = underlying_prb;
> +		break;
> +	case DTPPT_USDT:
> +		mod = psp->pps_usdt_mod;
> +		prb = psp->pps_usdt_prb;
>  		break;
>  	default:
> -		return 0;
> +		dt_dprintf("pid: unknown probe type %i\n", psp->pps_type);
> +		return -1;
>  	}
>  
>  	pd.id = DTRACE_IDNONE;
>  	pd.prv = prv;
> -	pd.mod = psp->pps_mod;
> +	pd.mod = mod;
>  	pd.fun = psp->pps_fun;
>  	pd.prb = prb;
>  
> -	prp = dt_probe_lookup(dtp, &pd);
> -	if (prp != NULL)
> -		return 1;		/* probe found */
> -
> -	/* Get the main (real) pid provider. */
> -	pidpvp = dt_provider_lookup(dtp, prvname);
> -	if (pidpvp == NULL)
> -		return 0;
> -
>  	/* Get (or create) the provider for the PID of the probe. */
>  	pvp = dt_provider_lookup(dtp, prv);
>  	if (pvp == NULL) {
>  		pvp = dt_provider_create(dtp, prv, &dt_pid_proc, &pattr);
>  		if (pvp == NULL)
> -			return 0;
> +			return -1;
>  	}
>  
>  	/* Mark the provider as a PID provider. */
>  	pvp->pv_flags |= DT_PROVIDER_PID;
>  
> -	/*
> -	 * Fill in the probe description for the main (real) probe.  The
> -	 * module is the inode number (in hex), the function name is as
> -	 * specified for the pid probe, and the probe name is "entry",
> -	 * "return", or the offset into the function (in hex).
> -	 */
> -	snprintf(mod, sizeof(mod), "%lx", psp->pps_ino);
> +	snprintf(underlying_mod, sizeof(underlying_mod), "%lx", psp->pps_inum);
> +
> +	/* Get (or create) the underlying probe.  */
> +	memcpy(&pp_psp, psp, sizeof(pid_probespec_t));
> +	pp_psp.pps_type = DTPPT_UNDERLYING;
> +	pp_psp.pps_mod = underlying_mod;
> +	pp_psp.pps_prb = underlying_prb;
> +	uprp = provide_pid_underlying(dtp, &pp_psp, psp);
> +
> +	if (uprp == NULL)
> +		return -1;
> +
> +	pp = uprp->prv_data;
>  
> -	/*
> -	 * Try to lookup the main (real) probe.  Since multiple pid probes may
> -	 * all map onto the same underlying main (real) probe, we may already
> -	 * have one in the system.
> -	 *
> -	 * If not found, we create a new probe.
> -	 */
> -	pd.id = DTRACE_IDNONE;
> -	pd.prv = prvname;
> -	pd.mod = mod;
> -	pd.fun = psp->pps_fun;
> -	pd.prb = prb;
>  	prp = dt_probe_lookup(dtp, &pd);
> -	if (prp == NULL) {
> -		/* Set up the pid probe data. */
> -		pp = dt_zalloc(dtp, sizeof(pid_probe_t));
> -		if (pp == NULL)
> -			return 0;
> +	if (prp != NULL) {
> +		/*
> +		 * Probe already exists.  If it's already in the underlying
> +		 * probe's probe list, there is nothing left to do.
> +		 */
> +		for (pop = dt_list_next(&pp->probes); pop != NULL;
> +		     pop = dt_list_next(pop)) {
> +			if (pop->probe == prp)
> +				return 0;
> +		}
> +	}
>  
> -		pp->ino = psp->pps_ino;
> -		pp->fn = strdup(psp->pps_fn);
> -		pp->off = off;
> -		pp->tp = dt_tp_alloc(dtp);
> -		if (pp->tp == NULL)
> -			goto fail;
> +	/*
> +	 * Underlying and overlying probe list entries.
> +	 */
> +	pop = dt_zalloc(dtp, sizeof(pid_probe_list_t));
> +	if (pop == NULL)
> +		return -1;
> +
> +	pup = dt_zalloc(dtp, sizeof(pid_probe_list_t));
> +	if (pup == NULL) {
> +		dt_free(dtp, pop);
> +		return -1;
> +	}
>  
> -		prp = dt_probe_insert(dtp, pidpvp, prvname, mod, psp->pps_fun,
> -				      prb, pp);
> -		if (prp == NULL)
> -			goto fail;
> -	} else
> -		pp = prp->prv_data;
> +	/* Add the pid probe, if we need to. */
>  
> -	/* Try to add the pid probe. */
> -	prp = dt_probe_insert(dtp, pvp, prv, psp->pps_mod, psp->pps_fun, prb,
> -			      prp);
> +	pup->probe = uprp;
>  	if (prp == NULL)
> -		goto fail;
> +		prp = dt_probe_insert(dtp, pvp, pd.prv, pd.mod, pd.fun, pd.prb,
> +				      pup);
> +	else
> +		dt_list_append((dt_list_t *) prp->prv_data, pup);
>  
> -	/* Add the pid probe to the list of probes for the main (real) probe. */
> -	dt_list_append(&pp->probes, prp);
> +	if (prp == NULL) {
> +		dt_free(dtp, pop);
> +		dt_free(dtp, pup);
> +		return -1;
> +	}
>  
> -	return 1;
> +	pop->probe = prp;
> +
> +	/*
> +	 * Add the pid probe to the list of probes for the underlying probe.
> +	 */
> +	dt_list_append(&pp->probes, pop);
>  
> -fail:
> -	probe_destroy(dtp, pp);
>  	return 0;
>  }
>  
>  static void enable(dtrace_hdl_t *dtp, dt_probe_t *prp)
>  {
> +	const pid_probe_list_t *pup;
>  	assert(prp->prov->impl == &dt_pid_proc);
>  
> -	/* We need to enable the main (real) probe (if not enabled yet). */
> -	dt_probe_enable(dtp, (dt_probe_t *)prp->prv_data);
> +	/*
> +	 * We need to enable the underlying probes (if not enabled yet).
> +	 */
> +	for (pup = prp->prv_data; pup != NULL; pup = dt_list_next(pup)) {
> +		dt_probe_t *uprp = pup->probe;
> +		dt_probe_enable(dtp, uprp);
> +	}
> +
> +	/*
> +	 * Finally, ensure we're in the list of enablings as well.
> +	 * (This ensures that, among other things, the probes map
> +	 * gains entries for us.)
> +	 */
> +	if (!dt_in_list(&dtp->dt_enablings, prp))
> +		dt_list_append(&dtp->dt_enablings, prp);
>  }
>  
>  /*
> @@ -196,11 +376,11 @@ static void enable(dtrace_hdl_t *dtp, dt_probe_t *prp)
>   */
>  static void trampoline(dt_pcb_t *pcb)
>  {
> -	dt_irlist_t		*dlp = &pcb->pcb_ir;
> -	const dt_probe_t	*prp = pcb->pcb_probe;
> -	const dt_probe_t	*pprp;
> -	const pid_probe_t	*pp = prp->prv_data;
> -	uint_t			lbl_exit = pcb->pcb_exitlbl;
> +	dt_irlist_t			*dlp = &pcb->pcb_ir;
> +	const dt_probe_t		*prp = pcb->pcb_probe;
> +	const pid_probe_t		*pp = prp->prv_data;
> +	const pid_probe_list_t		*pop;
> +	uint_t				lbl_exit = pcb->pcb_exitlbl;
>  
>  	dt_cg_tramp_prologue(pcb);
>  
> @@ -211,10 +391,11 @@ static void trampoline(dt_pcb_t *pcb)
>  	 */
>  
>  	dt_cg_tramp_copy_regs(pcb, BPF_REG_8);
> -	if (strcmp(pcb->pcb_probe->desc->prb, "return") == 0)
> +	if (pp->is_return)
>  		dt_cg_tramp_copy_rval_from_regs(pcb, BPF_REG_8);
>  	else
> -		dt_cg_tramp_copy_args_from_regs(pcb, BPF_REG_8);
> +		dt_cg_tramp_copy_args_from_regs(pcb, BPF_REG_8,
> +						!pp->arg_setup);

		dt_cg_tramp_copy_args_from_regs(pcb, BPF_REG_8,
					pp->arg_setup ? 0 : PT_REGS_ARGSTKBASE);
>  
>  	/*
>  	 * Retrieve the PID of the process that caused the probe to fire.
> @@ -241,12 +422,16 @@ static void trampoline(dt_pcb_t *pcb)
>  	 * are no assignments to %r0 possible in between the conditional
>  	 * statements.
>  	 */
> -	for (pprp = dt_list_next(&pp->probes); pprp != NULL;
> -	     pprp = dt_list_next(pprp)) {
> -		uint_t		lbl_next = dt_irlist_label(dlp);
> -		pid_t		pid = strtoul(pprp->desc->prv + 3, NULL, 10);
> -		char		pn[DTRACE_FULLNAMELEN + 1];
> -		dt_ident_t	*idp;
> +	for (pop = dt_list_next(&pp->probes); pop != NULL;
> +	     pop = dt_list_next(pop)) {
> +		const dt_probe_t	*pprp = pop->probe;
> +		uint_t			lbl_next = dt_irlist_label(dlp);
> +		pid_t			pid;
> +		char			pn[DTRACE_FULLNAMELEN + 1];
> +		dt_ident_t		*idp;
> +
> +		pid = dt_pid_get_pid(pprp->desc, pcb->pcb_hdl, pcb, NULL);
> +		assert(pid != -1);
>  
>  		snprintf(pn, DTRACE_FULLNAMELEN, "%s:%s:%s:%s",
>  			 pprp->desc->prv, pprp->desc->mod, pprp->desc->fun,
> @@ -255,7 +440,7 @@ static void trampoline(dt_pcb_t *pcb)
>  		assert(idp != NULL);
>  		/*
>  		 * Check whether this pid-provider probe serves the current
> -		 * process.  This loop creates a sequence
> +		 * process, and emit a sequence of clauses for it when it does.
>  		 */
>  		emit(dlp,  BPF_BRANCH_IMM(BPF_JNE, BPF_REG_0, pid, lbl_next));
>  		emite(dlp, BPF_STORE_IMM(BPF_W, BPF_REG_7, DMST_PRID, pprp->desc->id), idp);
> @@ -276,35 +461,28 @@ static int attach(dtrace_hdl_t *dtp, const dt_probe_t *prp, int bpf_fd)
>  	if (!dt_tp_is_created(tpp)) {
>  		char	*fn;
>  		FILE	*f;
> -		size_t	len;
> -		int	fd, rc = -1;
> -
> -		/* add the uprobe */
> -		fd = open(UPROBE_EVENTS, O_WRONLY | O_APPEND);
> -		if (fd != -1) {
> -			rc = dprintf(fd,
> -				     "%c:" PID_GROUP_FMT "/%s_%s %s:0x%lx\n",
> -				     prp->desc->prb[0] == 'e' ? 'p' : 'r',
> -				     PID_GROUP_DATA, prp->desc->fun,
> -				     prp->desc->prb, pp->fn, pp->off);
> -			close(fd);
> +		char	*prb;
> +		char	*spec;
> +		int	rc = -1;
> +
> +		if (pp->uprobe_name == NULL) {
> +			if (asprintf(&spec, "%s:0x%lx", pp->fn ? pp->fn : "", pp->off) < 0)
> +				return -ENOENT;
> +
> +			prb = uprobe_create(pp->dev, pp->inum, pp->off, spec,
> +					    pp->is_return);
> +			free(spec);
> +			if (prb == NULL)
> +				return -ENOENT;
> +			pp->uprobe_name = prb;
> +			pp->dtrace_created = 1;
>  		}
> -		if (rc == -1)
> -			return -ENOENT;
>  
>  		/* open format file */
> -		len = snprintf(NULL, 0, "%s" PID_GROUP_FMT "/%s_%s/format",
> -			       EVENTSFS, PID_GROUP_DATA, prp->desc->fun,
> -			       prp->desc->prb) + 1;
> -		fn = dt_alloc(dtp, len);
> -		if (fn == NULL)
> +		if (asprintf(&fn, "%s%s/format", EVENTSFS, pp->uprobe_name) < 0)
>  			return -ENOENT;
> -
> -		snprintf(fn, len, "%s" PID_GROUP_FMT "/%s_%s/format",
> -			 EVENTSFS, PID_GROUP_DATA, prp->desc->fun,
> -			 prp->desc->prb);
>  		f = fopen(fn, "r");
> -		dt_free(dtp, fn);
> +		free(fn);
>  		if (f == NULL)
>  			return -ENOENT;
>  
> @@ -334,10 +512,10 @@ static int probe_info(dtrace_hdl_t *dtp, const dt_probe_t *prp,
>   *
>   * If there is an event FD, we close it.
>   *
> - * We also try to remove any uprobe that may have been created for the probe.
> - * This is harmless for probes that didn't get created.  If the removal fails
> - * for some reason we are out of luck - fortunately it is not harmful to the
> - * system as a whole.
> + * We also try to remove any uprobe that may have been created for the probe
> + * (but only if we created it, not if dtprobed did).  This is harmless for
> + * probes that didn't get created.  If the removal fails for some reason we are
> + * out of luck - fortunately it is not harmful to the system as a whole.
>   */
>  static void detach(dtrace_hdl_t *dtp, const dt_probe_t *prp)
>  {
> @@ -350,15 +528,20 @@ static void detach(dtrace_hdl_t *dtp, const dt_probe_t *prp)
>  
>  	dt_tp_detach(dtp, tpp);
>  
> +	if (!pp->dtrace_created)
> +		return;
> +
>  	fd = open(UPROBE_EVENTS, O_WRONLY | O_APPEND);
>  	if (fd == -1)
>  		return;
>  
> -	dprintf(fd, "-:" PID_GROUP_FMT "/%s_%s\n", PID_GROUP_DATA,
> -		prp->desc->fun, prp->desc->prb);
> +	dprintf(fd, "-:%s", pp->uprobe_name);
>  	close(fd);
>  }
>  
> +/*
> + * Used for underlying probes.
> + */
>  dt_provimpl_t	dt_pid = {
>  	.name		= prvname,
>  	.prog_type	= BPF_PROG_TYPE_KPROBE,
> @@ -368,13 +551,16 @@ dt_provimpl_t	dt_pid = {
>  	.attach		= &attach,
>  	.probe_info	= &probe_info,
>  	.detach		= &detach,
> -	.probe_destroy	= &probe_destroy,
> +	.probe_destroy	= &probe_udestroy,
>  };
>  
> +/*
> + * Used for pid probes for specific processes.
> + */
>  dt_provimpl_t	dt_pid_proc = {
>  	.name		= prvname,
>  	.prog_type	= BPF_PROG_TYPE_KPROBE,
>  	.provide_pid	= &provide_pid,
>  	.enable		= &enable,
> -	.trampoline	= &trampoline,
> +	.probe_destroy	= &probe_destroy,
>  };
> diff --git a/libdtrace/dt_prov_sdt.c b/libdtrace/dt_prov_sdt.c
> index f9fc9f8b4f6a..ca47cf415d0f 100644
> --- a/libdtrace/dt_prov_sdt.c
> +++ b/libdtrace/dt_prov_sdt.c
> @@ -42,6 +42,7 @@ static const char		modname[] = "vmlinux";
>  #define KPROBES			"kprobes"
>  #define SYSCALLS		"syscalls"
>  #define UPROBES			"uprobes"
> +#define PID			"dt_pid"
>  
>  static const dtrace_pattr_t	pattr = {
>  { DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
> @@ -57,14 +58,15 @@ static const dtrace_pattr_t	pattr = {
>   *   - GROUP_FMT (created by DTrace processes)
>   *   - kprobes and uprobes
>   *   - syscalls (handled by a different provider)
> + *   - pid and usdt probes (ditto)
>   */
>  static int populate(dtrace_hdl_t *dtp)
>  {
>  	dt_provider_t	*prv;
>  	FILE		*f;
> -	char		buf[256];
> +	char		*buf = NULL;
>  	char		*p;
> -	int		n = 0;
> +	size_t		n;
>  
>  	prv = dt_provider_create(dtp, prvname, &dt_sdt, &pattr);
>  	if (prv == NULL)
> @@ -74,31 +76,29 @@ static int populate(dtrace_hdl_t *dtp)
>  	if (f == NULL)
>  		return 0;
>  
> -	while (fgets(buf, sizeof(buf), f)) {
> -		int	dummy;
> -		char	str[sizeof(buf)];
> -
> +	while (getline(&buf, &n, f) >= 0) {
>  		p = strchr(buf, '\n');
>  		if (p)
>  			*p = '\0';
>  
>  		p = strchr(buf, ':');
>  		if (p != NULL) {
> -			size_t	len;
> +			int	dummy;
> +			char	*str;
>  
>  			*p++ = '\0';
> -			len = strlen(buf);
>  
> -			if (sscanf(buf, GROUP_FMT, &dummy, str) == 2)
> +			if (sscanf(buf, GROUP_SFMT, &dummy, &str) == 2) {
> +				free(str);
> +				continue;
> +			}
> +			else if (strcmp(buf, KPROBES) == 0)
>  				continue;
> -			else if (len == strlen(KPROBES) &&
> -				 strcmp(buf, KPROBES) == 0)
> +			else if (strcmp(buf, SYSCALLS) == 0)
>  				continue;
> -			else if (len == strlen(SYSCALLS) &&
> -				 strcmp(buf, SYSCALLS) == 0)
> +			else if (strcmp(buf, UPROBES) == 0)
>  				continue;
> -			else if (len == strlen(UPROBES) &&
> -				 strcmp(buf, UPROBES) == 0)
> +			else if (strcmp(buf, PID) == 0)
>  				continue;
>  
>  			if (dt_tp_probe_insert(dtp, prv, prvname, buf, "", p))
> @@ -110,6 +110,7 @@ static int populate(dtrace_hdl_t *dtp)
>  		}
>  	}
>  
> +	free(buf);
>  	fclose(f);
>  
>  	return n;
> diff --git a/libdtrace/dt_provider.h b/libdtrace/dt_provider.h
> index 53193e28889b..1acac6e61aa0 100644
> --- a/libdtrace/dt_provider.h
> +++ b/libdtrace/dt_provider.h
> @@ -25,9 +25,11 @@ extern "C" {
>   * to this format string as needed.
>   *
>   * GROUP_DATA provides the necessary data items to populate the format string
> - * (PID of the dtrace process and the provider name).
> + * (PID of the dtrace process and the provider name).  GROUP_SFMT is like
> + * GROUP_FMT, but for sscanf().
>   */
>  #define GROUP_FMT	"dt_%d_%s"
> +#define GROUP_SFMT	"dt_%d_%ms"
>  #define GROUP_DATA	getpid(), prvname
>  
>  struct dt_probe;
> -- 
> 2.35.1
> 
> 
> _______________________________________________
> DTrace-devel mailing list
> DTrace-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/dtrace-devel



More information about the DTrace-devel mailing list