[DTrace-devel] [PATCH v3 12/20] usdt: daemon

Thu Oct 6 17:13:24 UTC 2022

On 29 Sep 2022, Kris Van Hees stated:

> First off, there are outstanding issues Eugene reported with the installation
> of dtprobed that need resolving before we can do a reviewed-by on this patch.

Fixed. There were more issues than that, actually :( but with those
resolved systemd looks happy with it now:

● dtprobed.service - DTrace USDT probe creation daemon
   Loaded: loaded (/usr/lib/systemd/system/dtprobed.service; static; vendor preset: disabled)
   Active: active (running) since Thu 2022-10-06 13:15:58 BST; 1h 27min ago
     Docs: man:dtprobed(8)
 Main PID: 230968 (dtprobed)
    Tasks: 2 (limit: 50127)
   Memory: 724.0K
   CGroup: /system.slice/dtprobed.service
           └─230968 /usr/sbin/dtprobed -F

> On Thu, Sep 08, 2022 at 01:26:30PM +0100, Nick Alcock via DTrace-devel wrote:
>> This commit adds a daemon, "dtprobed", which usually runs at boot
>> (monitored by systemd, if possible), providing /dev/dtrace/helper using
>> CUSE, accepting DOF from processes doing the usual DTrace ioctl()s to
>> that device, passing the DOF to a child jailed with seccomp() for
>> parsing and accepting structures containing parsed results back, then
>> emitting uprobes from these results before allowing the ioctl()ing.  The
>> uprobes created have stereotyped names that include an encoded
>> representation of the name of the corresponding DTrace USDT probe.  (The
>> name also contains the address and a number of other things, so that
>> probes that appear in multiple places in a process still work.)
>> 
>> (The CUSE device is an "unrestricted ioctl" device, which restricts
>> dtprobed to running only as root, because the ioctl has to pull data --
>> the DOF -- out of arbitrary places in the client memory according to the
>> passed-in structure. Since you need to be root to create uprobes at all
>> this is not any kind of restriction.)
>> 
>> The uprobes code is migrated from DTrace proper: in this commit, only
>> the instance in the DTrace provider is migrated (the pid provider will
>> come in a later commit).
>> 
>> The implications on all this for libproc are interestingly twisted.
>> 
>> dtprobed has to use libproc in order to figure out which mappings to
>> inject uprobess into given the address of the probe, but libproc has
>> over time acquired dependencies on some things inside libdtrace.  This
>> was never intended: those dependencies are now either removed (a couple
>> of accidental calls to dt_dprintf) or migrated to a new utility build
>> library, dtprobed/libcommon-daemon (mostly uprobes.c, which does the
>> actual creation of uprobes, and dt_list).
>
> This does not make sense to me.  For one, you are relocating dt_list to be
> provided by a library that is part of the daemon code tree but if you look
> at the usage of dt_list I see the following:
>
>                libdtrace: 170 instances
>                  libproc:  27 instances
>                 dtprobed:  18 instances
>
> So I'd argue that dt_lsit definitely should be part of libdtrace or at least

That doesn't work, because libproc uses it, and the daemon depends on
that but not on libdtrace. Putting it in the libproc dir is possible,
but... weird. But, I suppose, no weirder than putting it in the daemon
directory!

> of a library at the top level (like libproc) rather than being under the
> daemon code tree.  And I'd say we should do the same with the other pieces
> since they are shared code and the daemon really shouldn't be the central part
> of it all but rather an add-on to make usdt work.
> 
> Also, perhaps the establishing of a new library to make some code available
> for use by the daemon should be its own patch, ahead of introducing the daemon?

Well, yes, except that introducing an empty library is sort of pointless
and confusing.

However, introducing a new library, migrating dt_list into it and doing
everything else needed to decouple libproc from libdtrace so that you
can use it from a separate program... *that* could be a separate commit.

Done, under a new directory libcommon/.

>> Adding even more craziness to this: recent (> 2018) libfuse has a nice
>> logging interface, which if available means that libfuse will log
>> FUSE-side problems into syslog or anywhere else of your choosing: we
>> emit into syslog if -d or -F (debug, foreground) are not specified and
>> systemd is not in use (if systemd is in use, we never daemonize at
>> all). But older libfuse does not provide this, and unfortunately OL8
>> (but not OL7!) has such an older libfuse.  So we add a compatibility
>> wrapper providing a minimal reimplementation of the logging interface if
>> built against such an old libfuse.
>> 
>> The daemon will grow in future releases: we plan a state transfer
>> mechanism to allow DTrace proper access to the DOF and to allow the
>> daemon to block its DOF-submitting processes in ioctl() until interested
>> running DTraces have finished initializing. But what we have here now
>> works for simple cases.
>
> I wouldn't add this.  Who knows what the future path will look like, and even
> for things we have discussed so far, the plans may still change.  No point in
> putting this in a commit message.

OK!

>> @@ -85,6 +85,10 @@ INCLUDEDIR := $(prefix)/include
>>  INSTINCLUDEDIR := $(DESTDIR)$(INCLUDEDIR)
>>  SBINDIR := $(prefix)/sbin
>>  INSTSBINDIR := $(DESTDIR)$(SBINDIR)
>> +UDEVDIR := $(prefix)/lib/udev/rules.d
>> +INSTUDEVDIR := $(DESTDIR)$(SYSUDEVDIR)
>
> SYSUDEVDIR is not defined anywhere.  Is UDEVDIR supposed to be SYSUDEVDIR?

Vice versa :) also spotted by Eugene.

>> diff --git a/Makeconfig b/Makeconfig
>> index 52d72661f383..cc20ef4c296d 100644
>> --- a/Makeconfig
>> +++ b/Makeconfig
>> @@ -72,4 +72,6 @@ $(eval $(call check-symbol-rule,ELF_GETSHDRSTRNDX,elf_getshdrstrndx,elf))
>>  $(eval $(call check-symbol-rule,LIBCTF,ctf_open,ctf))
>>  $(eval $(call check-symbol-rule,STRRSTR,strrstr,c))
>>  $(eval $(call check-symbol-rule,WAITFD,waitfd,c))
>> +$(eval $(call check-symbol-rule,LIBSYSTEMD,sd_notify,systemd))
>> +$(eval $(call check-symbol-rule,FUSE_LOG,fuse_set_log_func,fuse3))
>
> Are the new dependencies documented anywhere?  Should be listed in README.md.

>>  $(eval $(call check-header-symbol-rule,CLOSE_RANGE,close_range(3,~0U,0),c,unistd))
>> diff --git a/Makeoptions b/Makeoptions
>> index 011440a31d83..715d0a2e54b6 100644
>> --- a/Makeoptions
>> +++ b/Makeoptions
>> @@ -1,11 +1,12 @@
>>  # The implementation of the configurable make options.
>>  #
>>  # Oracle Linux DTrace.
>> -# Copyright (c) 2011, Oracle and/or its affiliates. All rights reserved.
>> +# Copyright (c) 2011, 2022, Oracle and/or its affiliates. All rights reserved.
>>  # Licensed under the Universal Permissive License v 1.0 as shown at
>>  # http://oss.oracle.com/licenses/upl.
>>  
>>  debugging ?= no
>> +dof_dbg ?= no
>>  coverage ?= no
>>  verbose ?= no
>>  
>> @@ -14,12 +15,17 @@ help::
>>  	@printf "make debugging=yes [targets]   Disable optimization to make debugger use easier\n" >&2
>>  	@printf "make coverage=yes [targets]    Turn on coverage support in the testsuite\n" >&2
>>  	@printf "make verbose=yes [target]      Enable verbose building\n" >&2
>> +	@printf "make dof_dbg=yes [targets]     Turn on especially noisy DOF parser debugging\n" >&2
>>  	@printf "\n" >&2
>>  
>>  ifneq ($(debugging),no)
>>  override CFLAGS += -O0 -g
>>  endif
>>  
>> +ifneq ($(dof_dbg),no)
>> +override CFLAGS += -DDOF_DEBUG
>> +endif
>> +
>>  ifneq ($(coverage),no)
>>  override CFLAGS += -O0 --coverage
>>  override LDFLAGS += --coverage
>> diff --git a/dtprobed/60-dtprobed.rules b/dtprobed/60-dtprobed.rules
>> new file mode 100644
>> index 000000000000..e0ec7a7c593f
>> --- /dev/null
>> +++ b/dtprobed/60-dtprobed.rules
>> @@ -0,0 +1,4 @@
>> +# Licensed under the Universal Permissive License v 1.0 as shown at
>> +# http://oss.oracle.com/licenses/upl.
>> +
>> +KERNEL=="dtrace/helper", MODE="0666"
>> diff --git a/dtprobed/Build b/dtprobed/Build
>> new file mode 100644
>> index 000000000000..c02e22c1c2e4
>> --- /dev/null
>> +++ b/dtprobed/Build
>> @@ -0,0 +1,45 @@
>> +# Oracle Linux DTrace.
>> +# Copyright (c) 2022, Oracle and/or its affiliates. All rights reserved.
>> +# Licensed under the Universal Permissive License v 1.0 as shown at
>> +# http://oss.oracle.com/licenses/upl.
>> +
>> +CMDS += dtprobed
>> +BUILDLIBS += libcommon-daemon
>> +LIBS += libcommon-daemon
>> +
>> +dtprobed_DIR := $(current-dir)
>> +dtprobed_TARGET = dtprobed
>> +dtprobed_CPPFLAGS := -I. -Idtprobed -Ilibproc -Ilibport
>> +dtprobed_CFLAGS := $(shell pkg-config --cflags fuse3)
>
> You use pkg-config but no build dependency for pkg-config has been added to
> the spec file?
>
> Extra build dependencies should probably be added to README.md as well
> (pkg-config, fuse3, libsystemd).
>
>> +dtprobed_LIBS := -lcommon-daemon -lproc -lcommon-daemon -lport -lelf $(shell pkg-config --libs fuse3)
>> +dtprobed_DEPS := libproc.a libcommon-daemon.a libport.a
>> +dtprobed_SOURCES := dtprobed.c
>> +
>> +libcommon-daemon_TARGET = libcommon-daemon
>> +libcommon-daemon_DIR := $(current-dir)
>> +libcommon-daemon_CPPFLAGS := -I. -Idtprobed -Ilibproc
>> +libcommon-daemon_SOURCES = dof_parser.c dof_parser_host.c uprobes.c dt_list.c
>> +libcommon-daemon_LIBSOURCES = libcommon-daemon
>
> Mentioned above... this should not be here.  It is clearly stuff that is or can
> be used by both libdtrace and the daemon, so it should be at the top level.

This is now in libcommon/ (the -daemon part of the name is gone, since
that was always misleading and is now extra-misleading).

>> +
>> +ifdef HAVE_LIBSYSTEMD
>> +dtprobed_CFLAGS += $(shell pkg-config --cflags libsystemd)
>> +dtprobed_LIBS += $(shell pkg-config --libs libsystemd)
>> +endif
>> +
>> +ifndef HAVE_FUSE_LOG
>> +dtprobed_SOURCES += rpl_fuse_log.c
>> +endif
>
> Why not call it fuse_log.c?  Since it is providing that in lieu of not having
> it in the fuse library, I think there is no need to add a rpl_ prefix (which
> actually had me thinking for a bit what that is supposed to mean anyway).

The rpl_ is a convention derived from Gnulib that basically means "this
is a replacement for some other thing: if the other thing is present,
this is not included at all". I'm specifically trying to *avoid* calling
it fuse_log.[ch], because fuse_log.h is what is present in FUSE itself,
and I'm trying to avoid fighting with include file ordering here. I'm
happy to rename it, but it should be something with a prefix in front of
the fuse_*. dtrace_fuse_*? dtprobed_fuse_*?

>> +
>> +dtprobed.c_CFLAGS := -Wno-pedantic
>> +
>> +install::
>> +	mkdir -p $(INSTUDEVDIR)
>> +	$(call describe-install-target,$(INSTUDEVDIR),60-dtprobed.rules)
>> +	install -m 644 $(dtprobed_DIR)60-dtprobed.rules $(INSTUDEVDIR)
>> +ifdef HAVE_LIBSYSTEMD
>> +	mkdir -p $(INSTSYSTEMDUNITDIR)
>> +	$(call describe-install-target,$(INSTSYTEMDUNITDIR),dtprobed.service)

(this was a typo: fixed)

>> +#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
>> +
>> +size_t			dtrace_dof_maxsize = 256 * 1024 * 1024;
>
> Why retain dtrace_dof_maxsize as variable name when you renamed all other
> dtarce_dof_* as dof_*?
>
> Actually, there is one other: dtrace_dof_sect().
>
> For consistency, use dof_maxsize and dof_sect().

Nicely spotted: fixed.

>> +dof_helper_t *
>> +dof_copyin_helper(int in, int out, int *ok)
>> +{
>> +	dof_helper_t *dh;
>> +	size_t i;
>> +
>> +	/*
>> +	 * The dof_helper_t is easy: fixed-size.
>> +	 */
>
> I would get rid of the comment above - it doesn't really add any value and the
> (short) imlementation of this function makes it rather obvious that we're
> reading a fixed size entity.

Agreed. (This is residue from when this function was part of a bigger
one that did too much.)

>> +
>> +	/*
>> +	 * First get the header, which gives the size of everything else.
>> +	 */
>> +	dh = malloc(sizeof(dof_helper_t));
>> +	if (!dh)
>> +		abort();
>
> Maybe you should at least *try* to report an error about the allocation
> failure?  Perhaps the OOM condition won't allow the error to be reported, but
> at least try?  (More of these are present in the rest of the code.)

I would like to fix this, but it's tricky. We can't allocate when
returning, but that makes it very hard to return anything because all
the output from this code must be instances of probe_creation_info_t
returned from dof_parser_write_one(), which is the very thing we are
reporting errors from. This is in any case in a seccomp jail and will
result in a report from dtprobed of parser subprocess death and restart
(which will probably no longer be OOM and will complete happily, or will
be sufficiently OOM that the parser restart is impossible).

>> +
>> +	memset(dh, 0, sizeof(dof_helper_t));
>
> Why memset() the struct when we are going to read it just below this?  And
> reading an imcomplete struct is an error.

Because there have been ever so many bugs involving uninitalized data in
code running as root, and the code below is a page-long bug cluster with
at least a dozen bugs in it. Better safe than sorry.

>> +
>> +	for (i = 0; i < sizeof(dof_helper_t);) {
>> +		size_t ret;
>> +
>> +		ret = read(in, ((char *) dh) + i, sizeof(dof_helper_t) - i);
>> +
>> +		if (ret < 0) {
>> +			switch (errno) {
>> +			case EINTR:
>> +				continue;
>> +			default:
>> +				goto err;
>> +			}
>> +		}
>> +
>> +		/*
>> +		 * EOF: parsing done, process shutting down or message
>> +		 * truncated.  Fail, in any case.
>> +		 */
>> +		if (ret == 0)
>> +			goto err;
>> +
>> +		i += ret;
>> +	}
>> +
>> +	*ok = 1;
>> +	return dh;
>> +
>> +err:
>> +	*ok = 0;
>> +	free(dh);
>> +	return NULL;
>
> No error reporting?

No point. This is self-healing: this can only go wrong if something has
gone wrong with the communication pipe to the caller, and setting *ok =
0 and returning NULL forces a parser shutdown and reinvocation,
magically repairing the pipe :) if this keeps going wrong, we're
probably out of fds or memory systemwide and much bigger problems will
be observed immediately. (It's also hard to see what we can possibly do:
if the in pipe has failed, the out pipe is probably also not working
well, and that's the only route we have for reporting errors.)

>> +}
>> +
>> +dof_hdr_t *
>> +dof_copyin_dof(int in, int out, int *ok)
>> +{
>> +	struct dof_hdr *dof;
>
> Here (and all throughout the code) you mix TYPE_t and struct TYPE datatype
> names, or simply changed the original code from TYPE_t to struct TYPE.  Please
> stick with the original TYPE_t declaratoins and uses as they were in the
> original code.  While it certainly is a matter of taste and convention, the
> DTrace codebase has favoured TYPE_t and for consistency we should stick with
> that.

Whoops, adjusted back.

This is all because this came from the old kernel module code, which
years ago went through a Coccinelle script to transform foo_t into
struct foo for all those cases where foo was a structure (but only for
those cases, so non-structures retained foo_t). This madness is no
longer important, since we're not kernel code, thank goodness.

>> +	size_t i, sz;
>> +
>> +	*ok = 1;
>> +
>> +	/*
>> +	 * First get the header, which gives the size of everything else.
>> +	 */
>> +	dof = malloc(sizeof(dof_hdr_t));
>> +	if (!dof)
>> +		abort();
>
> No error reporting?

See above :)

>> +
>> +	memset(dof, 0, sizeof(dof_hdr_t));
>
> Same as above (unnecessary).

See above, for exactly the same reasn (bug cluster, making damn sure
this is always initialized and we never send or receive uninitialized
data out of the jail).

(Also, as far as I'm concerned, in code running as root there is no such
thing as unnecessary initialization of uninitialized variables. The cost
is more or less zero in any case: it's in the L1 cache and all modern
CPUs have special hardware to make zero-initialization somewhere between
zero cost and actually *negative* cost -- because if something is
zero-initialized we don't need to fetch its -- garbage -- contents out
of main memory.)

>> +			new_dof = realloc(dof, dof->dofh_loadsz);
>> +			if (!new_dof)
>> +				abort();
>> +
>> +			memset(((char *)new_dof) + i + ret, 0, new_dof->dofh_loadsz - (i + ret));
>
> Same as above (unnecessary).

For the same reason as above: I'm not leaving any uninitialized bytes
here.

>> +			dof = new_dof;
>> +			sz = dof->dofh_loadsz;
>> +		}
>> +
>> +		i += ret;
>> +	}
>> +
>> +	return dof;
>> +
>> +err:
>> +	*ok = 0;
>> +	free(dof);
>> +	return NULL;
>
> No error reporting?

As above.

>> +/*
>> + * Return the dof_sec_t pointer corresponding to a given section index.  If the
>> + * index is not valid, dof_error() is called and NULL is returned.  If a type
>> + * other than DOF_SECT_NONE is specified, the header is checked against this
>> + * type and NULL is returned if the types do not match.
>> + */
>> +static struct dof_sec *dtrace_dof_sect(int out, struct dof_hdr *dof,
>> +				       uint32_t doftype, dof_secidx_t i)
>
> dtrace_dof_sect -> dof_sect
>
> Why did you change 'type' to be 'doftype'?  It is the type of the section we
> are looking for - not some type indicator for a DOF object.  If you want make
> it more clear wht 'type' means, perhaps use 'sectype'?

Adjusted to 'sectype'. 'type' gives me hives as a variable name: type of
what? (It's not very clear, as indicated by the fact that I renamed it
wrong!)

>> +/*
>> + * Apply the relocations from the specified 'sec' (a DOF_SECT_URELHDR) to the
>> + * specified DOF.  At present, this amounts to simply adding 'ubase' to the
>> + * site of any user SETX relocations to account for load object base address.
>> + * In the future, if we need other relocations, this function can be extended.
>> + */
>> +static int
>> +dof_relocate(int out, struct dof_hdr *dof, struct dof_sec *sec, uint64_t ubase)
>> +{
>> +	uintptr_t		daddr = (uintptr_t)dof;
>> +	struct dof_relohdr	*dofr;
>> +	struct dof_sec		*ss, *rs, *ts;
>> +	struct dof_relodesc	*r;
>> +	unsigned int		i, n;
>
> The original code used uint_t which is mostly used throughout the DTrace source
> code so I would prefer not to change it to 'unsigned int' if anything because
> it just adds noise to e.g. looking at a diff with the v1 code that implements
> this.

I think this was probably another Coccinelle transformation before I
picked this code up: uint_t etc are gross pointless obsolete horrors and
the kernel rightly abhors them all. C already has a perfectly good type
for this, with a shorthand alias no less.

> There are a few other occurences of this.

Good! (but adjusted back, with reluctance).

>> +			if (!IS_ALIGNED(taddr, sizeof(uint64_t))) {
>> +				dof_error(out, EINVAL, "misaligned setx relo");
>> +				return -1;
>> +			}
>> +
>> +			/*
>> +			 * XXX is this still necessary?
>
> Just try it and handle accordingly.

Try it on what? I don't know what version of the toolchain this related
to, or even if it was Solaris-only. Only the author of this (i.e. you?)
could tell me :)

>                                      Leaving it for a later time likely means
> it never gets looked at again.  In fact, there is a fair chance that the
> relocation by the runtime loader is now happening on all architectures.  When
> I did the arm64 port of the legacy version of DTrace, arm64 had newer versions
> of the toolchain in comparison to x86_64.

So... tear it out and do an aarch64 test on... the oldest aarch64 distro
we support, which is I guess up-to-date OL8 (do we support OL7 on
aarch64 at all any more? I believe not, but I could be wrong.)

>> +/*
>> + * The dof_hdr_t passed to dtrace_dof_slurp() should be a partially validated

(another dtrace_ remnant: ditched).

>> +static void
>> +emit_tp(int out, uint64_t base, uint64_t offs, int is_enabled)
>> +{
>> +	probe_creation_info_t tp;
>> +
>> +	memset(&tp, 0, sizeof(tp));
>> +
>> +	tp.size = offsetof(struct probe_creation_info, dpi.tracepoint.is_enabled) +
>> +		sizeof(tp.dpi.tracepoint.is_enabled);
>
> What is the point of this construct, given that this is not a use case where
> there is dynamically sized data expected at the end of the struct. The size
> can simply be sizeof(probe_creation_info_t), right?  Yes, that means you make
> the data packet a little bit bigger based on the other struct in the union, but
> that hardly matters, and the benefit is that you do not use this complex way to
> calculate the size of a fixed-sized entity.  It also makes it easier for the
> reader code to have a fixed-sized struct to read (with possible extra data
> tacked onto the end of it).

This conflicts with your own advice elsewhere. If we do this, but we
don't memset(), we pass uninitialized data back over the pipe (an
absolute no-no). I have no objection to just using a sizeof() for all
fixed-size messages otherwise: adjusted (but it makes it more complex,
because now the means to determine the size is inconsistent and you have
to remember to use the offsetof() approach for some messages but not
others).

(Adjusted to use sizeof() in the 50% of cases where that is possible.)

>> +	tp.type = PIT_TRACEPOINT;
>> +	tp.dpi.tracepoint.addr = base + offs;
>> +	tp.dpi.tracepoint.is_enabled = is_enabled;
>
> This (and quite a few more) occurences of how probe_creation_info is used is
> perhaps better provided through a set of macros, one for each type?  That also
> makes maintaining the probe_creation_info mechanism easier.

I have no idea what those macros might look like. There's honestly not
enough code here for such complexity to be worth doing, IMHO. There's
only 15 uses total in this file!

>> +	dof_parser_write_one(out, &tp, tp.size);
>> +
>> +	dt_dbg_dof("        Tracepoint at 0x%lx (0x%llx + 0x%x)%s\n",
>> +		   base + offs, base, offs, is_enabled ? " (is_enabled)" : "");
>> +}
>> +
>> +static int
>> +uint32_cmp(const void *ap, const void *bp)
>> +{
>> +	return (*(const uint32_t *)ap - *(const uint32_t *)bp);
>> +}
>
> Not needed because of my comment below...
[...]
>> +
>> +static void
>> +validate_emit_probe(int out, struct dtrace_helper_probedesc *dhpb)
>> +{
>> +	int		i;
>> +
>> +	/*
>> +	 * The offsets must be unique.
>> +	 */
>> +	qsort(dhpb->dthpb_offs, dhpb->dthpb_noffs, sizeof(uint32_t),
>> +	     uint32_cmp);
>> +	for (i = 1; i < dhpb->dthpb_noffs; i++) {
>> +		if (dhpb->dthpb_base + dhpb->dthpb_offs[i] <=
>> +		    dhpb->dthpb_base + dhpb->dthpb_offs[i - 1]) {
>> +			dof_error(out, EINVAL, "non-unique USDT offsets at %i: %li <= %li",
>> +				  i, dhpb->dthpb_base + dhpb->dthpb_offs[i],
>> +				  dhpb->dthpb_base + dhpb->dthpb_offs[i - 1]);
>> +			return;
>> +		}
>> +	}
>> +
>> +	qsort(dhpb->dthpb_enoffs, dhpb->dthpb_nenoffs, sizeof(uint32_t),
>> +	     uint32_cmp);
>> +	for (i = 1; i < dhpb->dthpb_nenoffs; i++) {
>> +		if (dhpb->dthpb_base + dhpb->dthpb_enoffs[i] <=
>> +		    dhpb->dthpb_base + dhpb->dthpb_enoffs[i - 1]) {
>> +			dof_error(out, EINVAL, "non-unique is-enabled USDT offsets "
>> +				  "at %i: %li <= %li", i,
>> +				  dhpb->dthpb_base + dhpb->dthpb_enoffs[i],
>> +				  dhpb->dthpb_base + dhpb->dthpb_enoffs[i - 1]);
>> +			return;
>> +		}
>> +	}
>> +
>> +	if (dhpb->dthpb_noffs == 0 && dhpb->dthpb_nenoffs == 0) {
>> +		dof_error(out, EINVAL, "USDT probe with zero tracepoints");
>> +		return;
>> +	}
>> +
>> +	/* XXX TODO translated args
>> +	   pp->ftp_nargs = dhpb->dthpb_xargc;
>> +	   pp->ftp_xtypes = dhpb->dthpb_xtypes;
>> +	   pp->ftp_ntypes = dhpb->dthpb_ntypes;
>> +	*/
>> +
>> +	/*
>> +	 * Return info on each tracepoint in turn.
>> +	 */
>> +	for (i = 0; i < dhpb->dthpb_noffs; i++)
>> +		emit_tp(out, dhpb->dthpb_base, dhpb->dthpb_offs[i], 0);
>> +
>> +	/*
>> +	 * Then create a tracepoint for each is-enabled point.
>> +	 *
>> +	 * XXX original code looped over ntps here, which is noffs + enoffs.
>> +	 * This seems surely wrong!
>> +	 */
>> +	for (i = 0; i < dhpb->dthpb_nenoffs; i++)
>> +		emit_tp(out, dhpb->dthpb_base, dhpb->dthpb_enoffs[i], 1);
>> +
>> +	/*
>> +	 * XXX later:
>> +	 * If the arguments are shuffled around we set the argument remapping
>> +	 * table. Later, when the probe fires, we only remap the arguments
>> +	 * if the table is non-NULL.
>> +	 *
>> +	for (i = 0; i < dhpb->dthpb_xargc; i++) {
>> +		if (dhpb->dthpb_args[i] != i) {
>> +			pp->ftp_argmap = dhpb->dthpb_args;
>> +			break;
>> +		}
>> +	} */
>
> It seems easy enough to emit the argument data as well, since it exists, and
> I would expect the DOF parser to do so, and leave it up to the reader to use
> what it needs and discard what it does not need.  And we know we *will* be
> needing it at some point.

I'm doing it later because right now I have no idea what that sort of
thing should look like in the struct probe_creation_info, and I didn't
want to spend time defining it long before we use it. Put off until next
phase :)

>> +}
>
> This validate_emit_probe() function implements functionality that is not at all
> part of the DOF parser but rather comes from the fasttrap provider in the
> legacy version.  I don't think it should be here - this has to do with the
> specific implementation details of the probing mechanism and thus should be
> more appropriately part of the dtprobed proper.  I.e. have the DOF parser emit
> what is in the DOF it processes and have dtprobed proper deal with that data.

So you're suggesting what... that we emit the tracepoints, whatever they
are, and then have dtprobed.c check them for uniqueness?

I tried to do this but it is actually really quite difficult and makes
dtprobed noticeably more complex for more or less no benefit.

As it is, dtprobed's probe processing is basically stream-based: it
accepts a probe followed by a stream of tracepoints from the parser
child and inserts each tracepoint one by one. Because we can rely on
them all to have unique addresses we can trust that they won't collide
with each other, so we can forget all about each one after we insert it.
If we do things the way you suggest, we have to slurp them all in and
store them somewhere just in order to do the uniqueness checking we
could more easily have done right here!

It's much easier just to have the parser child check everything it can:
if we're lucky (i.e. that checking never requires verification of state
in the victim process) we can rely on doing all this verification in the
here, which is *exactly* the sort of thing the child should be doing in
its nice safe jail, and can just stream it out in the fully-privileged
parent.

I'm happy to split this function in two, though: it's doing two
unrelated jobs, validation and emission. That's an unambiguous
improvement.

>> +
>> +static void
>> +helper_provide_one(int out, struct dof_helper *dhp,
>> +		   struct dof_hdr *dof, struct dof_sec *sec)
>> +{
>> +	uintptr_t	daddr = (uintptr_t)dof;
>> +	uint32_t	*off, *enoff;
>> +	char		*strtab;
>> +	unsigned int	i;
>> +	void		*arg;
>> +
>> +	struct dof_sec			*str_sec, *prb_sec, *arg_sec, *off_sec,
>> +					*enoff_sec;
>> +	struct dof_provider		*prov;
>> +	struct dof_probe		*probe;
>> +
>> +	probe_creation_info_t	pci_probes;
>
> This list of variable declarations is a bit of mess in terms of spacing, etc.

Ugh, agreed.

> I think pci_probes is a very cryptic name.  It would imply to me that it is
> somehow an array of probes or something.  The name also makes me think of a
> specific kind of probes (PCI probes) which clearly it is not.  Instead it seems
> to be used as a data item that is used to emit the number of probes in the
> provider?

I'm just using the same sort of (IMHO horrible) variable abbreviation
scheme DTrace uses elsewhere, probe_creation_info -> pci: I assumed
you'd insist on it.

It's no worse than 'pvp' meaning, oh god what does it mean (it's so very
memorable I can't remember it even after seeing it a hundred times, and
pdp and psp and all the other p.p are even worse), 'pointer to
dt_provider_t'. Every time I read that I think 'player versus player'.
But yes 'pci' is rather overloaded as a term :) Maybe we can change both
that and 'pvp' to something actually memorable. Please.

(Adjusted to simply 'probes_msg' and 'probe_msg', to reflect that it is
a message about a probe, not the probe itself.)

>> +	/*
>> +	 * See dtrace_helper_provider_validate().
>
> Which has already been renamed to be helper_provider_validate() so the comment
> needs updating.

Oops! Fixed.

>> +	pci_probes.size = offsetof(struct probe_creation_info, dpi.probes.nprobes) +
>> +		sizeof(pci_probes.dpi.probes.nprobes);
>
> See my comment above in emit_tp().

Adjusted.

>> +		pci_probe = malloc(pci_probe_size);
>> +		if (!pci_probe) {
>> +			dof_error(out, ENOMEM, "Out of memory allocating probe");
>> +			return;
>> +		}
>> +
>> +		memset(pci_probe, 0, pci_probe_size);
>
> Should not be needed.

See above. Intentional belt and braces. This is code running as root, dammit.

>> +
>> +		pci_probe->size = pci_probe_size;
>> +		pci_probe->type = PIT_PROBE;
>> +		pci_probe->dpi.probe.ntp = dhpb.dthpb_noffs + dhpb.dthpb_nenoffs;
>> +		ptr = stpcpy(pci_probe->dpi.probe.mod_func_name, dhpb.dthpb_mod);
>> +		ptr++;
>> +		ptr = stpcpy(ptr, dhpb.dthpb_func);
>> +		ptr++;
>> +		strcpy(ptr, dhpb.dthpb_name);
>> +		dof_parser_write_one(out, pci_probe, pci_probe_size);
>
> Where is the provider name?

It's in there now :) field renamed to probe_name accordingly.

>> +/*
>> + * Result of DOF probe parsing for probe creation. We receive a probes info
>> + * structure, followed by N probe info structures each of which is followed by
>> + * possibly many tracepoint info structures, all tagged.  Things not useful for
>> + * probe creation (like args, translated types, etc) are not returned.
>
> But this is the DOF parser... not the dedicated probe creator.  So, the parser
> should return all relevant DOF data and leave it to the reader (dtprobed for
> now) to determine what it needs and what it does not need.

That means the non-jailed code has to process something as complex as
the original DOF input, making the whole jailing etc redundant. It also
makes it much more complex to define the stuff we're returning, and I
was trying *not* to have to specify an alternate representation for
anything I didn't need at this stage. It's extensible, we can add more
as we need it.

>> + *
>> + * On error, a PIT_ERR structure is returned with an error message.
>> + */
>> +
>> +typedef enum probe_info {
>> +	PIT_PROBES = 0,
>> +	PIT_PROBE = 1,
>> +	PIT_TRACEPOINT = 2,
>> +	PIT_ERR = 3
>> +} probe_info_t;
>
> The naming here is rather odd, I think.  This is not probe info, but rather
> the type of the DOF parser output packets.

Yeah, that's a bad name. Hmm. Maybe dof_parsed_info_t...

>> +
>> +typedef struct probe_creation_info {
>
> Again, an odd choice of name I think.  Probe creation is only one use case of
> DOF parsing.  E.g. in the future we will probably have to re-parse the DOF at
> the DTrace level to determine things like argument mapping.  Maybe a name like
> dof_data_t (and dof_type_t instead of probe_info_t) does a batter job giving a
> name to what this really is?

We'll probably have to rename it later, but right now all it is used for
is probe creation, and even the comment says "result of DOF parsing for
probe creation'. I was *trying* to make it clear that this wasn't a
complete encoding of DOF but rather specifically and only what we were
using right now, and that it could be extended as needed, but clearly
that flew over your head ;)

Renaming it when we are using it for more is one keystroke away, it's
easy...

... but I'm happy to call it dof_parsed_t so we don't need to rename it
in future. (But I would resist trying to add *everything* to it, or
making it something that's not useful in streamable form when creating
probes. Cross-probe consistency checking etc is something that's meant
to be done in the jailed child if at all possible.)

(Adjusted.)

>> +	/*
>> +	 * Size of this instance of this structure.
>> +	 */
>> +	size_t size;
>> +
>> +	probe_info_t type;
>> +
>> +	union dpi {
>
> I have to ask...  what does dpi stand for?  Since this is just a construct to
> encapsulate the different types of dof parser data perhaps a very minimal name
> 'u' would be more more convenient?

Terrible name, I know. 'dtrace probe info' perhaps? I can't even remember. :)

>> +		struct dpi_probes_info {
>> +			/*
>> +			 * Number of probes that follow.
>> +			 */
>> +			size_t nprobes;
>> +		} probes;
>
> So, with this naming we already end up with
>
> 	var.dpi.dpi_probes_info.nprobes
>
> so the whole 'dpi' thing is duplicated (whatever it means) and I do not think
> it adds any clarity.  Instead, something like
>
> 	var.u.nprobes.val
>
> is much more compact and (I think) more clear.

... what on earth does 'u' mean? Honestly given that we know we're using
GCC we might as well use an anonymous union and lose the 'dpi' entirely.
(Dropped. It was only there because in the past I was doing sizeofs of
it, so it needed a name.)

>> +		struct dpi_probe_info {
>> +			/*
>> +			 * Number of tracepoints that follow.
>> +			 */
>> +			size_t ntp;
>> +
>> +			/*
>> +			 * Three \0-separated strings.
>> +			 */
>> +			char mod_func_name[1];
>
> A more appropriate name migh tbe 'desc' since we often refer to the quadruple
> (prov:mod:func:name) as the probe description.

I changed it to probe_name now it's a full quad :) I can rename it to
probedesc, but that feels likely to collide with ever so many different
uses of probedesc to mean actual structures etc.

>> +		} probe;
>
> Simlarly, something like:
>
> 	var.u.probe.ntp
> 	var.u.probe.desc
>
> and so on...

It's var.probe.ntp etc now :)

>> +	/*
>> +	 * On the first read, only read in the size.  Decide how much to read
>> +	 * only after that, both to make sure we don't underread and to make
>> +	 * sure we don't *overread* and concatenate part of another message
>> +	 * onto this one.
>> +	 */
>
> As mentioned above, I think you should simply always write (and read) at least
> sizeof(probe_creation_info_t).  That also makes the code more robust because it
> avoids adding in dependencies on the struct layout (as you do below).

I agree for non-dynamically-sized regions (and this is now done), but
for things like the probe name that are arbitrarily long and also
arbitrarily short, sizing them now becomes more complex and runs the
risk (in the dynamically-sized regions) of exposing uninitialized data
to the caller. (And I tried doing it that way, and it actually makes the
code below no simpler at all, because the thing can still be arbitrarily
long.)

>> +static int debug;
>> +static int foreground;
>> +int _dtrace_debug = 0;				/* For libproc.  */
>
> Why do you have both 'debug' and '_dtrace_debug'?  Since -d initializes both
> to 1, aren't they just the same thing when it comes to dtprobed?  Just use
> _dtrace_debug, since you need it for libproc anyway.

Mostly because I started with 'debug' and didn't think to drop it :) dropped.

>> +	cs = cuse_lowlevel_setup(argc, argv, &ci, &dtprobed_clop,
>> +				 &multithreaded, userdata);
>> +
>> +	if (cs == NULL)
>> +		goto err;
>
> I would inline the perror and exit here.  No point in doing a goto if this is
> the only instance.  And inlining highlights the special case.

Agreed, that's residue from more complex code in the past.

>> +
>> +	if (multithreaded) {
>> +		fprintf(stderr, "CUSE thinks dtprobed is multithreaded!\n");
>> +		fprintf(stderr, "This should never happen.\n");
>> +		errno = EINVAL;
>> +		return NULL;
>
> Why not just use 'goto err;' here?

Because the perror() is inappropriate in this case. Both labels are gone
now anyway :)

>> +static void
>> +create_probe(pid_t pid, probe_creation_info_t *probe, probe_creation_info_t *tp)
>> +{
>> +	const char *mod, *func, *name;
>> +	char *probe_name;
>> +
>> +	if (tp->dpi.tracepoint.is_enabled)
>> +		return;				/* Not yet implemented.  */
>> +
>> +	mod = probe->dpi.probe.mod_func_name;
>> +	func = mod + strlen(mod) + 1;
>> +	name = func + strlen(func) + 1;
>
> Provider name is needed also.

This now looks like

	prv = probe->dpi.probe.probe_name;
	mod = prv + strlen(prv) + 1;
	fun = mod + strlen(mod) + 1;
	prb = fun + strlen(fun) + 1;

>> +	/*
>> +	 * Final call: DOF acquired.  Pass to parser for processing.
>> +	 */
>
> The code that follows should be split out into its own function because it no
> longer has to do with ioctl() interaction but rather with the processing of the
> data that was received.

Good point! Easier to read this way too. (I stuck the new function below
the current one, so you can still read it linearly.)

This let me see that the error handling was crappy and often failed to
report if (upon error) it failed to unblock the caller by EINVALing it.
Fixed that.

>> +	/*
>> +	 * These are "command-line" arguments to FUSE itself: our args are
>> +	 * different.  The double-NULL allows us to add an arg.
>> +	 */
>> +	char *args[] = { argv[0], "-f", "-s", "-o", "allow_other", NULL, NULL };
>> +	int nargs = 5;
>
> How about naming these something like fuse_argv and fuse_argc?  That makes it
> less confusing, I think, since having args, nargs, argv, and argc all present
> here makes it easy to lose track of what is what.

I did that but was rapidly confused because fuse's own examples use
fuse_argv and fuse_argc promiscuously to mean something quite different.
I meant to rename them back once I'd stopped reading that code so much
but forgot. Fixed ;)

>> +		default:
>> +			fprintf(stderr, "Syntax: dtprobed [-F] [-d] [-n devname] [-t timeout]\n");
>> +			exit(1);
>> +		}
>> +	}
>> +
>> +	if (optind < argc) {
>> +		fprintf(stderr, "Syntax: dtprobed [-F] [-d] [-n devname] [-t timeout]\n");
>> +		exit(1);
>> +	}
>
> Surely there is a better way than duplicating the fprintf+exit here?  Make the
> default case abort in way that triggers the error reporting above, maybe?

The traditional way is a function named usage() that does the usage dump
and exit. Adjusted.

>> +	/*
>> +	 * When not foregrounding, daemonize, respond to errors (which have
>> +	 * already been reported, arrange to log to syslog, then report
>> +	 * successful startup down our synchronization pipe (by closing it).
>> +	 */
>
> I do not believe this comment block is accurate.  Some of the actions it
> describes are actually executed even when we run in the foreground, e.g.
> setting the log function.

Ugh. English grammar is awful! :) also this comment clearly got
half-edited and then forgotten about: the brackets aren't even nested!

A better comment is now here:

>> +	if (!foreground) {
>> +		if ((sync_fd = daemonize(0)) < 0) {
>> +			teardown_device();
>> +			exit(2);
>> +		}
>> +	}

	/*
	 * We are now daemonized, if we need to be.  Arrange to log to syslog,
	 * fire off the jailed parser subprocess, then report successful startup
	 * down our synchronization pipe (by closing it) and tell systemd (if
	 * present) that we have started.
	 */

>> +	fuse_set_log_func(log_msg);
>> +
[...]

>> diff --git a/dtprobed/rpl_fuse_log.c b/dtprobed/rpl_fuse_log.c
>> new file mode 100644
>> index 000000000000..801b1fb845dd
>> --- /dev/null
>> +++ b/dtprobed/rpl_fuse_log.c
>
> I would just call this fuse_log.c

Well, something else PREFIX_fuse_log.c is fine: see above. But I don't
know what PREFIX you'd prefer :)

>> +static const char hexdigits[] = "0123456789abcdef";
>> +
>> +/*
>> + * Encode a NAME suitably for representation in a uprobe.  All non-alphanumeric,
>> + * non-_ characters are replaced with __XX where XX is the hex encoding of the
>> + * ASCII code of the byte. __ itself is replaced with ___.  The first letter
>> + * gets a similar transformation applied even to digits, because anything
>> + * resembling consistency in naming rules is obviously just wrong.
>> + */
>
> Why not just do the encoding that DTrace has always done?

... which is? I wasn't aware we had an existing encoding scheme that
would transform probe names into something valid in uprobe arg strings.

I'm happy to use something less general, but I'm honestly not too sure
what things *are* valid in probe names (it seems to be checked in
multiple places which do not align), and this at least isn't going to
trip over something unexpected.

(This has changed now: we add a 'differentiation char' on the end at
encoding time and trim it off at decoding time, so that all arg names
are different even if the probedesc components are named the same
thing.)

>> +		if (asprintf(&name, "dt_pid_%s%llx_%llx_%llx_%s",
>> +			     isret ? "ret_" : "",
>> +			     (unsigned long long) dev,
>> +			     (unsigned long long) ino,
>> +			     (unsigned long long) addr, encoded_name) < 0) {
>> +			free(encoded_name);
>> +			return NULL;
>> +		}
>> +		free(encoded_name);
>
> This has a pretty decent chance of being a name that is simply too long for
> uprobes (per the kernel trace subsystem size checks).  It cannot be longer
> than 64 characters.  At a minimum, why not use a group name that identifies
> the file and then a uprobe name that is the encoded name?  Both group name and
> probe name can be up to 64 characters.
> 
> So, use something like: dt_pid_DEV_INO/ENCODED_NAME
> or (since a provider name is needed): PROVIDER_DEV_INO/ENCODED_NAME

Fixed, in the wonderful way you suggested: a consistent probe name of
dt_pid_[ret_]%llx_%llx_%llx_%s but with the probe name components passed
in as argument names with a last byte to differentiate all arg names
from each other, e.g.

p:dt_pid/dt_2d_231d_7fbe933165fd /tmp/runtest.17112/usdt-dlclose1.123064.9010/livelib.so:0x15fd test_prov1=\1 livelib__2eso2=\2 go3=\3 go4=\4

This gives us nearly 64 bytes per probe name (we lose at lease four
bytes for the =\N and the differentiator byte)

> I am not sure why you add ret_ for a return probe when the uprobe_events file
> will already list that as r: (as opposed to p: for regular probes).

Just to make sure it doesn't match the parsing in dt_pid.c and get
mistaken for a USDT probe, even if a probe is left behind somehow
(probably by a forcible dtrace termination).

>> +	}
>> +
>> +	/* The final uprobe name has "uprobes/" on the front, always. */
>> +
>> +	if (asprintf(&final_name, "uprobes/%s", name) < 0) {
>> +		free(name);
>> +		return NULL;
>> +	}
>> +
>> +	/* Add the uprobe. */
>> +	fd = open(TRACEFS "uprobe_events", O_WRONLY | O_APPEND);
>> +	if (fd != -1) {
>> +		rc = dprintf(fd, "%c:%s %s\n", isret ? 'r' : 'p',
>> +			     name, spec);
>> +		close(fd);
>> +	}
>> +	free(name);
>> +
>> +	if (fd == -1 || rc == -1) {
>> +		free(final_name);
>> +		return NULL;
>> +	}
>
> How about error reporting if the probe could not be created?  Silent failure is
> not a good thing.

Yeah, that's something that's a bit tricky :( I'd need a callback or
something, since this is called both from dtrace and dtprobed and I
suspect they'd want to report failure in different ways... (the
foregrounded daemon would want to report to stderr, but I'm not sure if
dtrace would and the backgrounded daemon needs to syslog it).

So I put that off until phase 2 :) so this didn't take forever. I'll add
a TODO.

>> diff --git a/include/dtrace/pid.h b/include/dtrace/pid.h
>> index 77f07ce38ca7..947f8079c0a8 100644
>> --- a/include/dtrace/pid.h
>> +++ b/include/dtrace/pid.h
>
> I am pretty certain this does not belong in this patch.  Nothing in the patch
> uses this file.  This is userspace stuff.

Yeah, this definitely got mis-squashed-back. Sorry about that.  (This
hunk also stops the rest compiling.)

>> diff --git a/libdtrace/dt_prov_dtrace.c b/libdtrace/dt_prov_dtrace.c
>> index 82f806809fc4..286e0d45c231 100644
>> --- a/libdtrace/dt_prov_dtrace.c
>> +++ b/libdtrace/dt_prov_dtrace.c
>
> I am pretty certain this does not belong in this patch.

Actually this bit does! We're rejigging it to use uprobe_spec_by_addr()
so that the uprobes it creates have a common name shared with all the
others.  Doing this as soon as possible prevents us from ever having any
duplicated uprobe creation code, without stuffing the dt_prov_dtrace
thing into the userspace commit (where it defintely doesn't belong
because that commit is all about dt_prov_pid).

I'm happy to split this out into a patch immediately following the core
daemon stuff, but to me this is conceptually part of the same thing.

-- 
NULL && (void)