[DTrace-devel] [PATCH] Implement the io provider

Kris Van Hees kris.van.hees at oracle.com
Sat Dec 2 04:27:50 UTC 2023


On Fri, Dec 01, 2023 at 10:36:03PM -0500, Eugene Loh via DTrace-devel wrote:
> On 12/1/23 18:34, Nick Alcock via DTrace-devel wrote:
> 
> > On 30 Nov 2023, Kris Van Hees via DTrace-devel told this:
> > 
> > > Signed-off-by: Eugene Loh <eugene.loh at oracle.com>
> > > Signed-off-by: Kris Van Hees <kris.van.hees at oracle.com>
> > OK there is no choice, I have to review this one :)
> > 
> > Very nice, insofar as anything with this much kernel-internal grot in it
> > could be nice. A bunch of comments though.
> > 
> > > diff --git a/libdtrace/dt_bpf_maps.h b/libdtrace/dt_bpf_maps.h
> > > index 0dd36b16..9a21eddf 100644
> > > --- a/libdtrace/dt_bpf_maps.h
> > > +++ b/libdtrace/dt_bpf_maps.h
> > > @@ -31,6 +31,8 @@ struct dt_bpf_specs {
> > >   					 * drain this buffer */
> > >   };
> > > +#define IO_BIO_SIZ 256
> > This seems.. odd? Can't we tell what a suitable size is by examining the CTF?
> 
> I suspect so.  At this point, it feels like something that was stuck in for
> expedient development but can in fact be resolved.  But maybe I'm forgetting
> something.

More comments on this later.  But it may go away altogether.  I'm looking at
not storing anything in the cpuinfo anymore for the io provider and instead
storing the fake bio in a TLS variable.  That makes things a lot cleaner (and
we have code that already handles things like zeroing out the value when it is
first allocated).

But it is still a WIP - if it works, v2 of this patch will include it.

> >   (There's a FIXME lower down about this, but I'm not sure what
> > 'change EDT_NOCTF' implies.)
> > 
> 
> I think EDT_NOCTF was also for expediency, while a different EDT_* might
> arguably make more sense.
> 
> > > +#define IO_BIO_STK 4
> > This one seems to be unused.
> 
> I agree.  I won't bother explaining the history of what that used to do.

I forgot to remove it.

> > >   typedef struct dt_bpf_cpuinfo	dt_bpf_cpuinfo_t;
> > >   struct dt_bpf_cpuinfo {
> > >   	cpuinfo_t	ci;
> > > @@ -40,6 +42,8 @@ struct dt_bpf_cpuinfo {
> > >   	uint64_t	lockstat_bfrom;	/* lockstat: block time start */
> > >   	uint64_t	lockstat_btime;	/* lockstat: block time */
> > >   	uint64_t	lockstat_stime;	/* lockstat: spin time */
> > > +	uint64_t	io_bio_ptr_wait; /* io: stored bio pointer */

Hah, you failed to notice that io_bio_ptr_wait isn't uses anymore either! :-)

I forgot to remove it.

> > > +	char		io_bio_fake[IO_BIO_SIZ]; /* io: bio fake struct */
> > ... it looks like this only needs to be as big as the largest offset we
> > access ourselves,
> 
> I guess, but...?
> 
> > and any attempt to exceed the BIO_SIZ will presumably
> > be caught by the verifier. I guess that's OK? I don't like the static
> > sizing and magic numbers, but no doubt, this being BPF, there is no
> > alternative.
> 
> There may be an alternative to the static sizing and magic numbers. We
> allocate stuff.  We generate the BPF code.  At this stage, we can probably
> clean messiness like this up.

As mentioned above, I hope to get rid of this altogether.

> > > + * The 'io' SDT provider for DTrace-specific probes.
> > > + *
> > > + * These io::: probes mimic the instrumentation in legacy DTrace.
> > > + * Specifically, all probes have three probe args:
> > > + *     bufinfo_t *
> > > + *     devinfo_t *
> > > + *     fileinfo_t *
> > > + * but the trampoline really only needs to supply a bio pointer,
> > > + * from which translators will make the first two args.  The fileinfo_t *
> > > + * is 0 on Linux.
> > [...]
> > 
> > Oh nice description! I award you 1 (one) Commenter Gold Star. (If you
> > get two you can use both of them at once in the /* */ of another
> > comment.)
> > 
> > > + * For fbt:::entry probes on nfs_ and xfs_ functions, however, get only
> > > + * a hdr arg.
> > Grammar: probably drop the 'For'.
> > 
> > > + * Unfortunately, more than one function may be active at any time (on a
> > > + * CPU).
> > Presumably for stacked block devices. (Some of them may be split, some
> > may be passed down unchanged.)
> > 
> > >            So the return function needs to know which bio pointer or fake
> > > + * bio to use.  These rules are used:
> > > + *
> > > + *   - For nfs_ and xfs_ functions, just use the fake bio.
> > > + *
> > > + *   - For most other functions, use the bio pointer for that
> > > + *     function.
> > I don't really know enough about the bio layer to know if this makes
> > sense, but if it works, it works, I hope.
> > 
> > > +/* Defined in include/linux/blk_types.h */
> > > +#define REQ_OP_READ	0
> > > +#define REQ_OP_WRITE	1
> > Presumably the same in all current kernels. (Shame we can't pull them in
> > from there, really.)
> > 
> > > +/* Defined in fs/xfs/xfs_buf.h */
> > > +#define XBF_WRITE	(1 << 1) /* buffer intended for writing to device */
> > (Ditto.)
> > 
> > > +static const char	prvname[] = "io";
> > > +static const char	modname[] = "vmlinux";  // FIXME:  Really?  Or blank?
> > vmlinux, I'd say. This is core kernel functionality.
> > 
> > > +/*
> > > + * If the set of functions in the fbt probes changes,
> > > + * update the list in test/unittest/io/tst.fbt_probes.r.
> > > + */
> > > +static probe_dep_t	probes[] = {
> > > +	{ "wait-start",
> > > +	  DTRACE_PROBESPEC_NAME,	"fbt::submit_bio_wait:entry" },
> > > +	{ "wait-start",
> > > +	  DTRACE_PROBESPEC_NAME,	"rawtp:xfs::xfs_buf_iowait" },
> > > +	{ "wait-done",
> > > +	  DTRACE_PROBESPEC_FUNC,	"fbt::submit_bio_wait" },
> > > +	{ "wait-done",
> > > +	  DTRACE_PROBESPEC_NAME,	"rawtp:xfs::xfs_buf_iowait_done" },
> > > +	{ "done",
> > > +	  DTRACE_PROBESPEC_NAME,	"rawtp:block::block_bio_complete" },
> > > +	{ "done",
> > > +	  DTRACE_PROBESPEC_NAME,	"rawtp:block::block_rq_complete" },
> > > +	{ "done",
> > > +	  DTRACE_PROBESPEC_NAME,	"rawtp:nfs::nfs_readpage_done" },
> > > +	{ "done",
> > > +	  DTRACE_PROBESPEC_NAME,	"rawtp:nfs::nfs_writeback_done" },
> > > +	{ "start",
> > > +	  DTRACE_PROBESPEC_NAME,	"rawtp:block::block_bio_queue" },
> > > +	{ "start",
> > > +	  DTRACE_PROBESPEC_NAME,	"rawtp:nfs::nfs_initiate_read",
> > > +	  DT_VERSION_NUMBER(5, 6, 0), },
> > > +	{ "start",
> > > +	  DTRACE_PROBESPEC_NAME,	"fbt:nfs:nfs_initiate_read:entry",
> > > +	  0, DT_VERSION_NUMBER(5, 5, 19) },
> > > +	{ "start",
> > > +	  DTRACE_PROBESPEC_NAME,	"rawtp:nfs::nfs_initiate_write",
> > > +	  DT_VERSION_NUMBER(5, 6, 0), },
> > > +	{ "start",
> > > +	  DTRACE_PROBESPEC_NAME,	"fbt:nfs:nfs_initiate_write:entry",
> > > +	  0, DT_VERSION_NUMBER(5, 5, 19) },
> > I like this scheme :) this fixes my worries about kernel version changes
> > breaking things and causing trouble. (I mean, yes, we need to change the
> > code, which is more invasive than just adjusting a translator, but in
> > the old days we'd have needed to adjust the position of SDT probes in
> > the kernel itself, which is *also* a code change. At least they're both
> > in the same codebase now.)
> > 
> > > +/*
> > > + * All four probes have three probe args.  The first two will be extracted
> > > + * by a translator from the (struct bio *) we supply.  The (struct file *)
> > > + * we supply will be 0 in all cases.
> > One of these days we should figure out how to do something with
> > fileinfo. We have a dcache, we should be able to do this...

It is doubtful we can do anything here...  bios do not contain any info AFAIK
that refers back to a dentry.

> > > +static const dtrace_pattr_t	pattr = {
> > > +{ DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
> > > +{ DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
> > > +{ DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
> > > +{ DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
> > > +{ DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
> > > +};
> > Evolving and private, really? (I know they all used to be, but shouldn't
> > we start to try to improve that?)
> > 
> > > +/*
> > > + * Get a reference to the cpuinfo structure for the current CPU.
> > > + *
> > > + * Clobbers %r0 through %r5
> > > + * Stores pointer to cpuinfo struct in %r6
> > > + */
> > > +static void get_cpuinfo(dtrace_hdl_t *dtp, dt_irlist_t *dlp, uint_t exitlbl)
> > > +{
> > > +	dt_ident_t	*idp = dt_dlib_get_map(dtp, "cpuinfo");
> > > +
> > > +	assert(idp != NULL);
> > > +	dt_cg_xsetx(dlp, idp, DT_LBL_NONE, BPF_REG_1, idp->di_id);
> > > +	emit(dlp, BPF_MOV_REG(BPF_REG_2, BPF_REG_FP));
> > > +	emit(dlp, BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, DT_TRAMP_SP_BASE));
> > > +	emit(dlp, BPF_STORE_IMM(BPF_DW, BPF_REG_2, 0, 0));
> > > +	emit(dlp, BPF_CALL_HELPER(BPF_FUNC_map_lookup_elem));
> > > +	emit(dlp, BPF_BRANCH_IMM(BPF_JEQ, BPF_REG_0, 0, exitlbl));
> > > +	emit(dlp, BPF_MOV_REG(BPF_REG_6, BPF_REG_0));
> > > +}
> > I will admit to not knowing what this is for (or what DT_TRAMP_SP_BASE
> > is used for), but it appears to be a duplicate of code in
> > dt_prov_lockstat.c: shouldn't the two be unified?
> 
> For lookup_elem, we need to access the key by reference.  So we need to put
> it somewhere.  The stack -- slot(0) at TRAMP_SP_BASE -- is a safe place for
> this.  Anyhow, yeah, some expedient copy-and-paste here that should arguably
> be cleaned up.

This is expected to go away.

> > > +static int BPF_width(size_t sz)
> > > +{
> > > +	switch(sz) {
> > > +	case 1: return BPF_B;
> > > +	case 2: return BPF_H;
> > > +	case 4: return BPF_W;
> > > +	case 8: return BPF_DW;
> > > +	default: assert(0);
> > > +	}
> > > +}
> > We also have several implementations of this, though it's so small that
> > maybe it should just be a static function in a header somewhere so it
> > can be inlined everywhere it's used.
> 
> Yes.

Indeed, but perhaps after ths patch goes in, and then we'll just clean up all
instances.

> > > +/*
> > > + * Generate BPF instructions to dereference the pointer in %r3.
> > > + *
> > > + * We often have to dereference a pointer.  However, the pointer might
> > > + * not look safe to the BPF verifier.  So we use bpf_probe_read() to
> > > + * copy to a safe location (use slot 0) and then load from there.
> > Why is this in provider-specific code at all? It seems generally useful
> > dt_cg-like stuff. (In fact I'm fairly sure we have similar code in dt_cg
> > already?)
> 
> Yes.  One nuance is that this is in the trampoline.  So it accesses some
> things slightly differently from the way dt_cg does it.  But the providers
> could share code.
> 
> > (The actual code looks sound.)
> > 
> > > +/*
> > > + * Zero out the entire fake struct bio area.
> > > + * We assume %r6 already points to the area.
> > > + */
> > > +static void io_zero_bio(dtrace_hdl_t *dtp, dt_irlist_t *dlp)
> > > +{
> > > +	ctf_file_t *cfp = dtp->dt_shared_ctf;
> > > +	ctf_id_t type;
> > > +	size_t sz;
> > > +
> > > +	if (!cfp)
> > > +		longjmp(yypcb->pcb_jmpbuf, EDT_NOCTF);
> > > +
> > > +	type = ctf_lookup_by_name(cfp, "struct bio");
> > We should probably not look in dt_shared_ctf for this, but rather use
> > dtrace_lookup_by_type(), which will find struct bio even if the
> > deduplicator chooses to put it somewhere else. (In particular, if any
> > module anywhere ever defines a *different* struct bio, the dt_shared_ctf
> > dict will end up with a CTF_K_FORWARD "struct bio" with a size of
> > -ECTF_INCOMPLETE...
> > 
> > > +	if (type == CTF_ERR)
> > > +		longjmp(yypcb->pcb_jmpbuf, EDT_NOCTF);
> > ... and this will fire. If you use dtrace_lookup_by_type() and use
> > "vmlinux" as the object parameter, it will look in vmlinux
> > preferentially, which should always do the right thing.
> > 
> > > +	sz = ctf_type_size(cfp, type);
> > > +	if (sz > IO_BIO_SIZ)
> > > +		longjmp(yypcb->pcb_jmpbuf, EDT_NOCTF);    /* FIXME change EDT_NOCTF */
> > Hm, a FIXME left behind. Not sure what you want to change it to :)
> 
> I also do not know.  Hence, the FIXME is left behind.
> 
> > > +	emit(dlp,  BPF_MOV_REG(BPF_REG_1, BPF_REG_6));
> > > +	emit(dlp,  BPF_MOV_IMM(BPF_REG_2, sz));
> > > +	emit(dlp,  BPF_MOV_REG(BPF_REG_3, BPF_REG_9)); /* in trampoline, dctx is in %r9 */
> > > +	emit(dlp,  BPF_LOAD(BPF_DW, BPF_REG_3, BPF_REG_3, DCTX_STRTAB));
> > > +	emite(dlp, BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, -1), dt_dlib_get_var(dtp, "ZERO_OFF"));
> > > +	emit(dlp,  BPF_CALL_HELPER(BPF_FUNC_probe_read));
> > Hm, does this need to update the zerosize code in
> > dt_bpf.c:gmap_create_strtab? Since this is a new use for the zero-offset stuff...
> 
> I think so.  Yes.
> 
> > > +}
> > > +
> > > +/*
> > > + * For NFS events, we have to construct a fake struct bio, which we have to
> > > + * populate from the nfs_pgio_header argument the underlying probe provides.
> > > + */
> > > +static void io_nfs_args(dtrace_hdl_t *dtp, dt_irlist_t *dlp, uint_t exitlbl,
> > > +			const char *prb, const char *uprb)
> > (and despite appearances, this function and io_xfs_args are quite
> > different, and unifying them seems likely to be pointless.)
> 
> Agreed.
> 
> > > +	off = dt_cg_ctf_offsetof("struct nfs_pgio_header", "inode", &siz, 0);
> > > +	deref_reg3(dlp, exitlbl, off, siz, BPF_REG_3);
> > > +
> > > +	off = dt_cg_ctf_offsetof("struct nfs_inode", "fileid", &siz, 0)
> > > +	    - dt_cg_ctf_offsetof("struct nfs_inode", "vfs_inode", NULL, 0);
> > Pardon my stupidity, but what on earth does *subtracting* offsetofs from
> > each other imply? Comment, perhaps :)
> 
> Comment seems reasonable.
> 
> For this email, meanwhile, the point is that (following the lead of legacy
> DTrace), we want stuff like NFS_FILEID(hdr->inode). Looking in
> include/linux/nfs_fs.h:
> 
>         static inline __u64 NFS_FILEID(const struct inode *inode) { return
> NFS_I(inode)->fileid; }
>         static inline struct nfs_inode *NFS_I(const struct inode *inode) {
> return container_of(inode, struct nfs_inode, vfs_inode); }
> 
> The subtraction comes from the container_of() stuff.  That is, there is some
> struct nfs_inode.  We want the fileid member and we have the vfs_inode
> member.  So we subtract the latter to get to the "container of" address and
> then we add the former.  Like:
> 
>         struct container {
>             member1;
>             member2;
>         }
> 
> If I have the address of container.member1, I subtract an offset to get the
> address of container and then add an offset to get to the address of
> container.member2.
> 
> > > +	/*
> > > +	 * bio.bi_iter.bi_sector = xfs_buf_daddr(bp);
> > > +	 *
> > > +	 * In fs/xfs/xfs_buf.h, we have
> > > +	 *
> > > +	 *     xfs_daddr_t xfs_buf_daddr(struct xfs_buf *bp)
> > > +	 *     {
> > > +	 *         return bp->b_maps[0].bm_bn;
> > > +	 *     }
> > > +	 *
> > > +	 * So that gives
> > > +	 *     bio.bi_iter.bi_sector = bp->b_maps->bm_bn;
> > > +	 *
> > > +	 * include/linux/blk_types.h
> > > +	 *     struct bio {
> > > +	 *         [...]
> > > +	 *         struct bvec_iter        bi_iter;
> > > +	 *         [...]
> > > +	 *     }
> > > +	 * include/linux/bvec.h
> > > +	 *     struct bvec_iter {
> > > +	 *         sector_t                bi_sector;
> > > +	 *         [...]
> > > +	 *     };
> > > +	 * fs/xfs/xfs_buf.h
> > > +	 *     struct xfs_buf_map {
> > > +	 *         xfs_daddr_t             bm_bn;
> > > +	 *         [...]
> > > +	 *     };
> > > +	 *     struct xfs_buf {
> > > +	 *         [...]
> > > +	 *         struct xfs_buf_map      *b_maps;
> > > +	 *         [...]
> > > +	 *     }
> > > +	 */
> > I really wish we didn't have to hardwire so much kernel knowledge in
> > here, but honestly given that it has to go *somewhere* this seems as
> > good a place as any, and at least there is a nice comment about it.
> > (It's a shame we can't have a translator pick this stuff out, though,
> > and then have the io provider use that translator. Isn't that what
> > they're for?)
> > 
> > > +	/* Handle the start and done probes (non-XFS, non-NFS). */
> > > +	if (strcmp(prp->desc->prb, "start") == 0) {
> > > +		/*
> > > +		 * Older kernels pass 2 arguments to block_bio_queue, and bio
> > > +		 * is in arg1.  Newer kernels have bio in arg0 already.
> > > +		 */
> > > +		if (uprp->nargc == 2) {
> > This seems incredibly fragile. When a newer kernel comes out and it
> > suddenly uses two arguments again, is anyone going to remember this code
> > exists? I doubt it.
> > 
> > > +	/* Handle the non-XFS wait-done flavor. */
> > > +	if (strcmp(prp->desc->prb, "wait-done") == 0) {
> > > +		/*
> > > +		 * We need instrument submit_bio_wait(struct bio *):
> > Missing 'to'.
> > 
> > > +done:
> > > +	/*
> > > +	 * Note: DTrace does not currently support the use of fileinfo_t with
> > > +	 * io probes.  In Oracle Linux, there is no information about the file
> > > +	 * where the I/O request originated at the point where the io probes
> > > +	 * fire.
> > I... thought we were trying to make this stuff work with upstream
> > kernels. The comment should probably just refer to the Linux kernel
> > rather than any single specific enterprise kernel. (They don't differ
> > *that* much from each other.)
> > 
> > (unless of course this is something that Oracle Linux really *does* do
> > differently from everyone else. I doubt it.)
> 
> Good point.  That text was copied out of Oracle Linux DTrace documentation.
> 
> > > diff --git a/test/unittest/io/check_io_probe_args.sh b/test/unittest/io/check_io_probe_args.sh
> > > new file mode 100755
> > > index 00000000..1c3c88d1
> > > --- /dev/null
> > > +++ b/test/unittest/io/check_io_probe_args.sh
> > Nice!
> > 
> > > +    # Check for a legal set of flags.
> > > +    {
> > Heh heh heh I'll have to try this on some of my more elaborate storage
> > stacks :)
> > 
> > > +#
> > > +# For each statname, check that the reported major/minor numbers agree with "ls -l".
> > > +#
> > > +
> > > +while read mymajor myminor mystatname; do
> > > +    read mymajor0 myminor0 <<< $(ls -l /dev | gawk '$NF == "'$mystatname'" { print $(NF-5), $(NF-4) }' | tr ',' ' ')
> > This is not guaranteed. There can perfectly well be block devices in
> > subdirectories of /dev which do not appear in /dev itself :/ but we can
> > probably wait until someone hits this case and then fix it, since most
> > common cases *do* also put a link in /dev :)
> > 
> > > +# != "nfs"       "<unknown>"     # FIXME: "<unknown>"?  Really?
> > Hm yeah that seems very odd to me too.
> > 
> > > +gawk 'NF == 23 { print $21, $16 }' $infile | sort | uniq > map-name-to-major.txt
> > > +nmaps=`cat map-name-to-major.txt | wc -l`
> > > +nnames=`awk '{print $1}' map-name-to-major.txt | sort | uniq | wc -l`
> > > +nmajor=`awk '{print $2}' map-name-to-major.txt | sort | uniq | wc -l`
> > (sort -u?)
> > 
> > > +if [ $nnames -ne $nmaps -o $nmajor -ne $nmaps ]; then
> > > +    echo "ERROR: name-to-major-number is not a one-to-one mapping"
> > I honestly have no idea what this is checking. There is absolutely no
> > guarantee that each name gets a major number of its own, and indeed it
> > is very common not to:
> 
> I'm responsible for this stuff and I did not know what I was doing. I just
> tried stuff empirically and added tests for what seemed "reasonable."  You
> need to tell me:  is this check reasonable or unreasonable for this
> particular purpose.
> 
> > loom:~# ls -l /dev/mapper
> > total 0
> > lrwxrwxrwx 1 root root        8 Nov 30 13:34 backup-plain -> ../dm-12
> > crw------- 1 root root  10, 236 Jul 17 13:20 control
> > brw------- 1 root root 252,   5 Jul 17 13:20 main-root
> > brw------- 1 root root 252,   8 Jul 17 13:20 main-oracrypt
> > brw------- 1 root root 252,   6 Jul 17 13:20 main-oramail
> > brw------- 1 root root 252,  10 Jul 17 13:20 main-rescue
> > brw------- 1 root root 252,   1 Jul 17 13:20 main-swap
> > brw------- 1 root root 252,   2 Jul 17 13:20 main-vms
> > [...]
> > brw-rw---- 1 root disk 8,   0 Jul 17 13:20 /dev/sda
> > brw-rw---- 1 root disk 8,   1 Jul 17 13:20 /dev/sda1
> > brw-rw---- 1 root disk 8,   2 Jul 17 13:20 /dev/sda2
> > brw-rw---- 1 root disk 8,   3 Jul 17 13:20 /dev/sda3
> > brw-rw---- 1 root disk 8,   4 Jul 17 13:20 /dev/sda4
> > brw-rw---- 1 root disk 8,  16 Jul 17 13:20 /dev/sdb
> > brw-rw---- 1 root disk 8,  17 Jul 17 13:20 /dev/sdb1
> > brw-rw---- 1 root disk 8,  18 Jul 17 13:20 /dev/sdb2
> > brw-rw---- 1 root disk 8,  19 Jul 17 13:20 /dev/sdb3
> > brw-rw---- 1 root disk 8,  20 Jul 17 13:20 /dev/sdb4
> > brw-rw---- 1 root disk 8,  32 Jul 17 13:20 /dev/sdc
> > brw-rw---- 1 root disk 8,  33 Jul 17 13:20 /dev/sdc1
> > brw-rw---- 1 root disk 8,  34 Jul 17 13:20 /dev/sdc2
> > brw-rw---- 1 root disk 8,  35 Jul 17 13:20 /dev/sdc3
> > 
> > etc.
> > 
> > > +    cat map-name-to-major.txt
> > > +    retval=1
> > > +fi
> > > +
> > > +#
> > > +# If the name is "nfs", the edev should be 0.  FIXME: is this correct?
> > > +#
> > nfs has no associated block device, so yes. (Other network filesystems
> > are similar.)
> > 
> > > +exit $retval
> > > diff --git a/test/unittest/io/dump_io_probe_args.d b/test/unittest/io/dump_io_probe_args.d
> > > new file mode 100644
> > > index 00000000..fb4c702c
> > > --- /dev/null
> > > +++ b/test/unittest/io/dump_io_probe_args.d
> > > @@ -0,0 +1,47 @@
> > > +/*
> > > + * Oracle Linux DTrace.
> > > + * Copyright (c) 2023, Oracle and/or its affiliates. All rights reserved.
> > > + * Licensed under the Universal Permissive License v 1.0 as shown at
> > > + * http://oss.oracle.com/licenses/upl.
> > > + */
> > > +/* @@skip: not used directly by the test hardness; called by other scripts */
> > harness :)
> > 
> > All really nice tests, but...
> > 
> > > +dd if=/dev/zero of=$iodir.img bs=1024 count=$((300*1024)) status=none
> > > +mkfs.xfs $iodir.img > /dev/null
> > > +    mkdir $iodir
> > > +        mount -t xfs -o $fsoptions $iodir.img $iodir
> > > +            devnam=`losetup -j $iodir.img | awk 'BEGIN { FS = ":" } ; {print $1}'`
> > > +            statname=`basename $devnam`
> > > +
> > > +            dd if=/dev/urandom of=$tempfile count=$filesize bs=1 status=none
> > > +        $rundt "umount $iodir"                                             -o log.write
> > > +        mount -t xfs -o $fsoptions $iodir.img $iodir
> > > +            $rundt "sum $tempfile"                                         -o log.read
> > Are these -o log.write things *way* out there on the end of the line
> > editor errors or what? They look almost like they belong on the mount
> > line above... (and I had to dig around with the cursor to figure out
> > what line they actually belonged to -- seriously hard to read).
> 
> All subjective.  The -o stuff can certainly be pulled to the left quite a
> bit if it helps even just one reviewer.
> 
> > > +myaddr=`awk '$4 == "xfs_end_bio"       {print $1}' /proc/kallmodsyms`
> > Given that we're phasing this out, is it wise to introduce new tests
> > that depend on it?
> 
> Maybe not but maybe also not a big deal since a few tests will have to be
> changed anyhow.
> 
> > > +mkdir $exdir
> > > +  # what is the fsid?
> > > +  exportfs -i -v -o "rw,sync,no_root_squash,insecure,fsid=8434437287" 127.0.0.1:$exdir > /dev/null
> > I'm not sure why you're hardwiring it. It is a filesystem ID used to
> > identify the fs, only necessary for filesystems not stored on devices
> > that don't have uuids in the fs, and for filesystems stored on devices
> > whose device numbers can change. It's totally unimportant for tests like
> > this, but it does matter for real use, since if a server goes down and
> > comes up with a different fsid for an exported filesystem, all clients
> > using that filesystem will find the mount point has gone stale and they
> > might be in quite a lot of trouble if, say, it's /home :)
> 
> I'd like to think I'm sometimes a nice guy, but somehow it gives me perverse
> pleasure to point out that the hardwired fsid simply mimics the practice
> introduced by:
> 
> $ git show fc59c0dc9
> commit fc59c0dc9725311ea299aff8bc4e6054fe6d0529
> Author: Nick Alcock <nick.alcock at oracle.com>
> Date:   Wed Aug 2 21:11:34 2017 +0100
> 
>     test/unittest/io/tst.nfs.sh: explicitly set fsid
> 
>     Some filesystem types, like tmpfs, require fsid= when doing NFSv3
>     exports, so set the fsid to an unlikely-to-collide big value.
> 
> diff --git a/test/unittest/io/tst.nfs.sh b/test/unittest/io/tst.nfs.sh
> index de7941bc..31712565 100755
> --- a/test/unittest/io/tst.nfs.sh
> +++ b/test/unittest/io/tst.nfs.sh
> @@ -26,7 +26,7 @@ trap "rm -f $tempfile; umount $clientpath; rmdir
> $clientpath; exportfs -u 127.0.
>  # setup NFS server
>  service nfs start > /dev/null 2>&1
>  mkdir $serverpath
> -exportfs -i -v -o "rw,sync,no_root_squash,insecure" 127.0.0.1:$serverpath >
> /dev/null
> +exportfs -i -v -o "rw,sync,no_root_squash,insecure,fsid=8434437287"
> 127.0.0.1:$serverpath > /dev/null
> 
>  # setup NFS client
>  mkdir $clientpath
> 
> _______________________________________________
> DTrace-devel mailing list
> DTrace-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/dtrace-devel



More information about the DTrace-devel mailing list