[DTrace-devel] [PATCH] Implement the io provider

Fri Dec 1 23:34:04 UTC 2023

On 30 Nov 2023, Kris Van Hees via DTrace-devel told this:

> Signed-off-by: Eugene Loh <eugene.loh at oracle.com>
> Signed-off-by: Kris Van Hees <kris.van.hees at oracle.com>

OK there is no choice, I have to review this one :)

Very nice, insofar as anything with this much kernel-internal grot in it
could be nice. A bunch of comments though.

> diff --git a/libdtrace/dt_bpf_maps.h b/libdtrace/dt_bpf_maps.h
> index 0dd36b16..9a21eddf 100644
> --- a/libdtrace/dt_bpf_maps.h
> +++ b/libdtrace/dt_bpf_maps.h
> @@ -31,6 +31,8 @@ struct dt_bpf_specs {
>  					 * drain this buffer */
>  };
>  
> +#define IO_BIO_SIZ 256

This seems.. odd? Can't we tell what a suitable size is by examining the
CTF? (There's a FIXME lower down about this, but I'm not sure what
'change EDT_NOCTF' implies.)

> +#define IO_BIO_STK 4

This one seems to be unused.

>  typedef struct dt_bpf_cpuinfo	dt_bpf_cpuinfo_t;
>  struct dt_bpf_cpuinfo {
>  	cpuinfo_t	ci;
> @@ -40,6 +42,8 @@ struct dt_bpf_cpuinfo {
>  	uint64_t	lockstat_bfrom;	/* lockstat: block time start */
>  	uint64_t	lockstat_btime;	/* lockstat: block time */
>  	uint64_t	lockstat_stime;	/* lockstat: spin time */
> +	uint64_t	io_bio_ptr_wait; /* io: stored bio pointer */
> +	char		io_bio_fake[IO_BIO_SIZ]; /* io: bio fake struct */

... it looks like this only needs to be as big as the largest offset we
access ourselves, and any attempt to exceed the BIO_SIZ will presumably
be caught by the verifier. I guess that's OK? I don't like the static
sizing and magic numbers, but no doubt, this being BPF, there is no
alternative.

> + * The 'io' SDT provider for DTrace-specific probes.
> + *
> + * These io::: probes mimic the instrumentation in legacy DTrace.
> + * Specifically, all probes have three probe args:
> + *     bufinfo_t *
> + *     devinfo_t *
> + *     fileinfo_t *
> + * but the trampoline really only needs to supply a bio pointer,
> + * from which translators will make the first two args.  The fileinfo_t *
> + * is 0 on Linux.
[...]

Oh nice description! I award you 1 (one) Commenter Gold Star. (If you
get two you can use both of them at once in the /* */ of another
comment.)

> + * For fbt:::entry probes on nfs_ and xfs_ functions, however, get only
> + * a hdr arg.

Grammar: probably drop the 'For'.

> + * Unfortunately, more than one function may be active at any time (on a
> + * CPU).

Presumably for stacked block devices. (Some of them may be split, some
may be passed down unchanged.)

>           So the return function needs to know which bio pointer or fake
> + * bio to use.  These rules are used:
> + *
> + *   - For nfs_ and xfs_ functions, just use the fake bio.
> + *
> + *   - For most other functions, use the bio pointer for that
> + *     function.

I don't really know enough about the bio layer to know if this makes
sense, but if it works, it works, I hope.

> +/* Defined in include/linux/blk_types.h */
> +#define REQ_OP_READ	0
> +#define REQ_OP_WRITE	1

Presumably the same in all current kernels. (Shame we can't pull them in
from there, really.)

> +/* Defined in fs/xfs/xfs_buf.h */
> +#define XBF_WRITE	(1 << 1) /* buffer intended for writing to device */

(Ditto.)

> +static const char	prvname[] = "io";
> +static const char	modname[] = "vmlinux";  // FIXME:  Really?  Or blank?

vmlinux, I'd say. This is core kernel functionality.

> +/*
> + * If the set of functions in the fbt probes changes,
> + * update the list in test/unittest/io/tst.fbt_probes.r.
> + */
> +static probe_dep_t	probes[] = {
> +	{ "wait-start",
> +	  DTRACE_PROBESPEC_NAME,	"fbt::submit_bio_wait:entry" },
> +	{ "wait-start",
> +	  DTRACE_PROBESPEC_NAME,	"rawtp:xfs::xfs_buf_iowait" },
> +	{ "wait-done",
> +	  DTRACE_PROBESPEC_FUNC,	"fbt::submit_bio_wait" },
> +	{ "wait-done",
> +	  DTRACE_PROBESPEC_NAME,	"rawtp:xfs::xfs_buf_iowait_done" },
> +	{ "done",
> +	  DTRACE_PROBESPEC_NAME,	"rawtp:block::block_bio_complete" },
> +	{ "done",
> +	  DTRACE_PROBESPEC_NAME,	"rawtp:block::block_rq_complete" },
> +	{ "done",
> +	  DTRACE_PROBESPEC_NAME,	"rawtp:nfs::nfs_readpage_done" },
> +	{ "done",
> +	  DTRACE_PROBESPEC_NAME,	"rawtp:nfs::nfs_writeback_done" },
> +	{ "start",
> +	  DTRACE_PROBESPEC_NAME,	"rawtp:block::block_bio_queue" },
> +	{ "start",
> +	  DTRACE_PROBESPEC_NAME,	"rawtp:nfs::nfs_initiate_read",
> +	  DT_VERSION_NUMBER(5, 6, 0), },
> +	{ "start",
> +	  DTRACE_PROBESPEC_NAME,	"fbt:nfs:nfs_initiate_read:entry",
> +	  0, DT_VERSION_NUMBER(5, 5, 19) },
> +	{ "start",
> +	  DTRACE_PROBESPEC_NAME,	"rawtp:nfs::nfs_initiate_write",
> +	  DT_VERSION_NUMBER(5, 6, 0), },
> +	{ "start",
> +	  DTRACE_PROBESPEC_NAME,	"fbt:nfs:nfs_initiate_write:entry",
> +	  0, DT_VERSION_NUMBER(5, 5, 19) },

I like this scheme :) this fixes my worries about kernel version changes
breaking things and causing trouble. (I mean, yes, we need to change the
code, which is more invasive than just adjusting a translator, but in
the old days we'd have needed to adjust the position of SDT probes in
the kernel itself, which is *also* a code change. At least they're both
in the same codebase now.)

> +/*
> + * All four probes have three probe args.  The first two will be extracted
> + * by a translator from the (struct bio *) we supply.  The (struct file *)
> + * we supply will be 0 in all cases.

One of these days we should figure out how to do something with
fileinfo. We have a dcache, we should be able to do this...

> +static const dtrace_pattr_t	pattr = {
> +{ DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
> +{ DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
> +{ DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
> +{ DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
> +{ DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
> +};

Evolving and private, really? (I know they all used to be, but shouldn't
we start to try to improve that?)

> +/*
> + * Get a reference to the cpuinfo structure for the current CPU.
> + *
> + * Clobbers %r0 through %r5
> + * Stores pointer to cpuinfo struct in %r6
> + */
> +static void get_cpuinfo(dtrace_hdl_t *dtp, dt_irlist_t *dlp, uint_t exitlbl)
> +{
> +	dt_ident_t	*idp = dt_dlib_get_map(dtp, "cpuinfo");
> +
> +	assert(idp != NULL);
> +	dt_cg_xsetx(dlp, idp, DT_LBL_NONE, BPF_REG_1, idp->di_id);
> +	emit(dlp, BPF_MOV_REG(BPF_REG_2, BPF_REG_FP));
> +	emit(dlp, BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, DT_TRAMP_SP_BASE));
> +	emit(dlp, BPF_STORE_IMM(BPF_DW, BPF_REG_2, 0, 0));
> +	emit(dlp, BPF_CALL_HELPER(BPF_FUNC_map_lookup_elem));
> +	emit(dlp, BPF_BRANCH_IMM(BPF_JEQ, BPF_REG_0, 0, exitlbl));
> +	emit(dlp, BPF_MOV_REG(BPF_REG_6, BPF_REG_0));
> +}

I will admit to not knowing what this is for (or what DT_TRAMP_SP_BASE
is used for), but it appears to be a duplicate of code in
dt_prov_lockstat.c: shouldn't the two be unified?

> +static int BPF_width(size_t sz)
> +{
> +	switch(sz) {
> +	case 1: return BPF_B;
> +	case 2: return BPF_H;
> +	case 4: return BPF_W;
> +	case 8: return BPF_DW;
> +	default: assert(0);
> +	}
> +}

We also have several implementations of this, though it's so small that
maybe it should just be a static function in a header somewhere so it
can be inlined everywhere it's used.

> +/*
> + * Generate BPF instructions to dereference the pointer in %r3.
> + *
> + * We often have to dereference a pointer.  However, the pointer might
> + * not look safe to the BPF verifier.  So we use bpf_probe_read() to
> + * copy to a safe location (use slot 0) and then load from there.

Why is this in provider-specific code at all? It seems generally useful
dt_cg-like stuff. (In fact I'm fairly sure we have similar code in dt_cg
already?)

(The actual code looks sound.)

> +/*
> + * Zero out the entire fake struct bio area.
> + * We assume %r6 already points to the area.
> + */
> +static void io_zero_bio(dtrace_hdl_t *dtp, dt_irlist_t *dlp)
> +{
> +	ctf_file_t *cfp = dtp->dt_shared_ctf;
> +	ctf_id_t type;
> +	size_t sz;
> +
> +	if (!cfp)
> +		longjmp(yypcb->pcb_jmpbuf, EDT_NOCTF);
> +
> +	type = ctf_lookup_by_name(cfp, "struct bio");

We should probably not look in dt_shared_ctf for this, but rather use
dtrace_lookup_by_type(), which will find struct bio even if the
deduplicator chooses to put it somewhere else. (In particular, if any
module anywhere ever defines a *different* struct bio, the dt_shared_ctf
dict will end up with a CTF_K_FORWARD "struct bio" with a size of
-ECTF_INCOMPLETE...

> +	if (type == CTF_ERR)
> +		longjmp(yypcb->pcb_jmpbuf, EDT_NOCTF);

... and this will fire. If you use dtrace_lookup_by_type() and use
"vmlinux" as the object parameter, it will look in vmlinux
preferentially, which should always do the right thing.

> +	sz = ctf_type_size(cfp, type);
> +	if (sz > IO_BIO_SIZ)
> +		longjmp(yypcb->pcb_jmpbuf, EDT_NOCTF);    /* FIXME change EDT_NOCTF */

Hm, a FIXME left behind. Not sure what you want to change it to :)

> +	emit(dlp,  BPF_MOV_REG(BPF_REG_1, BPF_REG_6));
> +	emit(dlp,  BPF_MOV_IMM(BPF_REG_2, sz));
> +	emit(dlp,  BPF_MOV_REG(BPF_REG_3, BPF_REG_9)); /* in trampoline, dctx is in %r9 */
> +	emit(dlp,  BPF_LOAD(BPF_DW, BPF_REG_3, BPF_REG_3, DCTX_STRTAB));
> +	emite(dlp, BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, -1), dt_dlib_get_var(dtp, "ZERO_OFF"));
> +	emit(dlp,  BPF_CALL_HELPER(BPF_FUNC_probe_read));

Hm, does this need to update the zerosize code in
dt_bpf.c:gmap_create_strtab? Since this is a new use for the zero-offset stuff...

> +}
> +
> +/*
> + * For NFS events, we have to construct a fake struct bio, which we have to
> + * populate from the nfs_pgio_header argument the underlying probe provides.
> + */
> +static void io_nfs_args(dtrace_hdl_t *dtp, dt_irlist_t *dlp, uint_t exitlbl,
> +			const char *prb, const char *uprb)

(and despite appearances, this function and io_xfs_args are quite
different, and unifying them seems likely to be pointless.)

> +	off = dt_cg_ctf_offsetof("struct nfs_pgio_header", "inode", &siz, 0);
> +	deref_reg3(dlp, exitlbl, off, siz, BPF_REG_3);
> +
> +	off = dt_cg_ctf_offsetof("struct nfs_inode", "fileid", &siz, 0)
> +	    - dt_cg_ctf_offsetof("struct nfs_inode", "vfs_inode", NULL, 0);

Pardon my stupidity, but what on earth does *subtracting* offsetofs from
each other imply? Comment, perhaps :)

> +	/*
> +	 * bio.bi_iter.bi_sector = xfs_buf_daddr(bp);
> +	 *
> +	 * In fs/xfs/xfs_buf.h, we have
> +	 *
> +	 *     xfs_daddr_t xfs_buf_daddr(struct xfs_buf *bp)
> +	 *     {
> +	 *         return bp->b_maps[0].bm_bn;
> +	 *     }
> +	 *
> +	 * So that gives
> +	 *     bio.bi_iter.bi_sector = bp->b_maps->bm_bn;
> +	 *
> +	 * include/linux/blk_types.h
> +	 *     struct bio {
> +	 *         [...]
> +	 *         struct bvec_iter        bi_iter;
> +	 *         [...]
> +	 *     }
> +	 * include/linux/bvec.h
> +	 *     struct bvec_iter {
> +	 *         sector_t                bi_sector;
> +	 *         [...]
> +	 *     };
> +	 * fs/xfs/xfs_buf.h
> +	 *     struct xfs_buf_map {
> +	 *         xfs_daddr_t             bm_bn;
> +	 *         [...]
> +	 *     };
> +	 *     struct xfs_buf {
> +	 *         [...]
> +	 *         struct xfs_buf_map      *b_maps;
> +	 *         [...]
> +	 *     }
> +	 */

I really wish we didn't have to hardwire so much kernel knowledge in
here, but honestly given that it has to go *somewhere* this seems as
good a place as any, and at least there is a nice comment about it.
(It's a shame we can't have a translator pick this stuff out, though,
and then have the io provider use that translator. Isn't that what
they're for?)

> +	/* Handle the start and done probes (non-XFS, non-NFS). */
> +	if (strcmp(prp->desc->prb, "start") == 0) {
> +		/*
> +		 * Older kernels pass 2 arguments to block_bio_queue, and bio
> +		 * is in arg1.  Newer kernels have bio in arg0 already.
> +		 */
> +		if (uprp->nargc == 2) {

This seems incredibly fragile. When a newer kernel comes out and it
suddenly uses two arguments again, is anyone going to remember this code
exists? I doubt it.

> +	/* Handle the non-XFS wait-done flavor. */
> +	if (strcmp(prp->desc->prb, "wait-done") == 0) {
> +		/*
> +		 * We need instrument submit_bio_wait(struct bio *):

Missing 'to'.

> +done:
> +	/*
> +	 * Note: DTrace does not currently support the use of fileinfo_t with
> +	 * io probes.  In Oracle Linux, there is no information about the file
> +	 * where the I/O request originated at the point where the io probes
> +	 * fire.

I... thought we were trying to make this stuff work with upstream
kernels. The comment should probably just refer to the Linux kernel
rather than any single specific enterprise kernel. (They don't differ
*that* much from each other.)

(unless of course this is something that Oracle Linux really *does* do
differently from everyone else. I doubt it.)

> diff --git a/test/unittest/io/check_io_probe_args.sh b/test/unittest/io/check_io_probe_args.sh
> new file mode 100755
> index 00000000..1c3c88d1
> --- /dev/null
> +++ b/test/unittest/io/check_io_probe_args.sh

Nice!

> +    # Check for a legal set of flags.
> +    {

Heh heh heh I'll have to try this on some of my more elaborate storage
stacks :)

> +#
> +# For each statname, check that the reported major/minor numbers agree with "ls -l".
> +#
> +
> +while read mymajor myminor mystatname; do
> +    read mymajor0 myminor0 <<< $(ls -l /dev | gawk '$NF == "'$mystatname'" { print $(NF-5), $(NF-4) }' | tr ',' ' ')

This is not guaranteed. There can perfectly well be block devices in
subdirectories of /dev which do not appear in /dev itself :/ but we can
probably wait until someone hits this case and then fix it, since most
common cases *do* also put a link in /dev :)

> +# != "nfs"       "<unknown>"     # FIXME: "<unknown>"?  Really?

Hm yeah that seems very odd to me too.

> +gawk 'NF == 23 { print $21, $16 }' $infile | sort | uniq > map-name-to-major.txt
> +nmaps=`cat map-name-to-major.txt | wc -l`
> +nnames=`awk '{print $1}' map-name-to-major.txt | sort | uniq | wc -l`
> +nmajor=`awk '{print $2}' map-name-to-major.txt | sort | uniq | wc -l`

(sort -u?)

> +if [ $nnames -ne $nmaps -o $nmajor -ne $nmaps ]; then
> +    echo "ERROR: name-to-major-number is not a one-to-one mapping"

I honestly have no idea what this is checking. There is absolutely no
guarantee that each name gets a major number of its own, and indeed it
is very common not to:

loom:~# ls -l /dev/mapper
total 0
lrwxrwxrwx 1 root root        8 Nov 30 13:34 backup-plain -> ../dm-12
crw------- 1 root root  10, 236 Jul 17 13:20 control
brw------- 1 root root 252,   5 Jul 17 13:20 main-root
brw------- 1 root root 252,   8 Jul 17 13:20 main-oracrypt
brw------- 1 root root 252,   6 Jul 17 13:20 main-oramail
brw------- 1 root root 252,  10 Jul 17 13:20 main-rescue
brw------- 1 root root 252,   1 Jul 17 13:20 main-swap
brw------- 1 root root 252,   2 Jul 17 13:20 main-vms
[...]
brw-rw---- 1 root disk 8,   0 Jul 17 13:20 /dev/sda
brw-rw---- 1 root disk 8,   1 Jul 17 13:20 /dev/sda1
brw-rw---- 1 root disk 8,   2 Jul 17 13:20 /dev/sda2
brw-rw---- 1 root disk 8,   3 Jul 17 13:20 /dev/sda3
brw-rw---- 1 root disk 8,   4 Jul 17 13:20 /dev/sda4
brw-rw---- 1 root disk 8,  16 Jul 17 13:20 /dev/sdb
brw-rw---- 1 root disk 8,  17 Jul 17 13:20 /dev/sdb1
brw-rw---- 1 root disk 8,  18 Jul 17 13:20 /dev/sdb2
brw-rw---- 1 root disk 8,  19 Jul 17 13:20 /dev/sdb3
brw-rw---- 1 root disk 8,  20 Jul 17 13:20 /dev/sdb4
brw-rw---- 1 root disk 8,  32 Jul 17 13:20 /dev/sdc
brw-rw---- 1 root disk 8,  33 Jul 17 13:20 /dev/sdc1
brw-rw---- 1 root disk 8,  34 Jul 17 13:20 /dev/sdc2
brw-rw---- 1 root disk 8,  35 Jul 17 13:20 /dev/sdc3

etc.

> +    cat map-name-to-major.txt
> +    retval=1
> +fi
> +
> +#
> +# If the name is "nfs", the edev should be 0.  FIXME: is this correct?
> +#

nfs has no associated block device, so yes. (Other network filesystems
are similar.)

> +exit $retval
> diff --git a/test/unittest/io/dump_io_probe_args.d b/test/unittest/io/dump_io_probe_args.d
> new file mode 100644
> index 00000000..fb4c702c
> --- /dev/null
> +++ b/test/unittest/io/dump_io_probe_args.d
> @@ -0,0 +1,47 @@
> +/*
> + * Oracle Linux DTrace.
> + * Copyright (c) 2023, Oracle and/or its affiliates. All rights reserved.
> + * Licensed under the Universal Permissive License v 1.0 as shown at
> + * http://oss.oracle.com/licenses/upl.
> + */
> +/* @@skip: not used directly by the test hardness; called by other scripts */

harness :)

All really nice tests, but...

> +dd if=/dev/zero of=$iodir.img bs=1024 count=$((300*1024)) status=none
> +mkfs.xfs $iodir.img > /dev/null
> +    mkdir $iodir
> +        mount -t xfs -o $fsoptions $iodir.img $iodir
> +            devnam=`losetup -j $iodir.img | awk 'BEGIN { FS = ":" } ; {print $1}'`
> +            statname=`basename $devnam`
> +
> +            dd if=/dev/urandom of=$tempfile count=$filesize bs=1 status=none
> +        $rundt "umount $iodir"                                             -o log.write
> +        mount -t xfs -o $fsoptions $iodir.img $iodir
> +            $rundt "sum $tempfile"                                         -o log.read

Are these -o log.write things *way* out there on the end of the line
editor errors or what? They look almost like they belong on the mount
line above... (and I had to dig around with the cursor to figure out
what line they actually belonged to -- seriously hard to read).

> +myaddr=`awk '$4 == "xfs_end_bio"       {print $1}' /proc/kallmodsyms`

Given that we're phasing this out, is it wise to introduce new tests
that depend on it?

> +mkdir $exdir
> +  # what is the fsid?
> +  exportfs -i -v -o "rw,sync,no_root_squash,insecure,fsid=8434437287" 127.0.0.1:$exdir > /dev/null

I'm not sure why you're hardwiring it. It is a filesystem ID used to
identify the fs, only necessary for filesystems not stored on devices
that don't have uuids in the fs, and for filesystems stored on devices
whose device numbers can change. It's totally unimportant for tests like
this, but it does matter for real use, since if a server goes down and
comes up with a different fsid for an exported filesystem, all clients
using that filesystem will find the mount point has gone stale and they
might be in quite a lot of trouble if, say, it's /home :)

-- 
NULL && (void)