[DTrace-devel] [PATCH 3/4] test: fix tst.depth.sh to use proper trigger
Eugene Loh
eugene.loh at oracle.com
Tue Nov 11 22:55:48 UTC 2025
On 11/11/25 17:13, Kris Van Hees wrote:
> On Tue, Nov 11, 2025 at 03:50:37PM -0500, Eugene Loh wrote:
>> On 11/10/25 22:00, Kris Van Hees wrote:
>>
>>> On Mon, Nov 10, 2025 at 07:10:52PM -0500, Eugene Loh wrote:
>>>> I'm against this patch.
>>>>
>>>> The picky stuff is:
>>>> - the Copyright year should be updated
>>> Easy to fix - done.
>>>
>>>> - the timeout probe could replace
>>>> tick-1sec /n++ == 10/
>>>> with simply
>>>> tick-10sec
>>> Arguably, the two are not identical. It depends on why one is used over the
>>> other. Here we could probably go either way, though it might be nice to have
>>> the tick probe firing every second as a way to generate a bit more activity.
>>> Given the frequency of the other probes, not really important though as far as
>>> I can see. Note that I didn't touch the timeout probe primarily because that
>>> was not the purpose of this patch anyway.
>>>
>>>> - the "delay" env var be set for delaydie???
>>> Why? We are simply using delaydie to have a trigger executable to execute.
>>> This is a valid use of it (and it is used that same way in other tests also).
>>>
>>>> - the ustacks for delaydie are perhaps not very interesting
>>> That is not the point of the test, and for that matter, I highly doubt that
>>> ustacks from date would have been any more interesting (aside from the fact
>>> that you'd never know because the test didn't even work with date).
>>>
>>>> A little more serious is that all this patch does is swap one failure mode
>>>> for another. What for? It's XFAIL anyhow, and we do not check that we're
>>>> getting a particular, expected failure.
>>> The difference is that the pre-patch failure was due to a problem with the
>>> actual test. But since it is XFAIL that was reported as an expected failure,
>>> which was entirely bogus. After the patch, it still fails but at least it is
>>> failing for valid reasons. I.e. now we have a failure we should look into.
>>>
>>>> The real problem, meanwhile, is that this patch leads to lots of test
>>>> failures, at least for me. Not every VM, but many. Oh, and I do not mean
>>>> on this test, because this test remains XFAIL anyhow.
>>>>
>>>> Consider a subset of the VMs I run on. I get these failures on the
>>>> test/unittest/ustack/tst.*.d tests. Why on these? Because they run after
>>>> tst.depth.sh.
>>>>
>>>> +------------------ arm OL8 UEK7
>>>> | +---------------- x86 OL9 UEK7
>>>> | | +-------------- arm OL9 UEK7
>>>> | | | +------------ x86 OL9 UEK8
>>>> | | | | +---------- arm OL9 UEK8
>>>> | | | | | +-------- arm OL10 UEK8
>>>> | | | | | |
>>>> v v v v v v
>>>>
>>>> x x x x x x tst.kthread.d FAIL: timed out.
>>>> x x x x tst.uaddr.d FAIL: timed out.
>>>> x x x x tst.uaddr-pid0.d FAIL: timed out.
>>>> x x tst.ufunc.d FAIL: timed out.
>>>> x x x tst.ufunc-pid0.d FAIL: timed out.
>>>> x x tst.umod.d FAIL: timed out.
>>>> x x tst.ustack25_max95_profile.d FAIL: timed out.
>>>> x x tst.ustack25_pid.d FAIL: timed out.
>>>> x x tst.ustack25_profile.d FAIL: timed out.
>>>> x x tst.ustack95_max25_profile.d FAIL: timed out.
>>>> x x tst.ustack95_profile.d FAIL: timed out.
>>>> x x tst.ustack_max25_profile.d FAIL: timed out.
>>>> x x tst.ustack_profile.d FAIL: timed out.
>>>> x tst.usym.d FAIL: timed out.
>>>> x tst.usym-pid0.d FAIL: timed out.
>>>>
>>>> I think the problem is that tst.depth.sh -- with this patch, in any case --
>>>> turns on a huge number of probes. That causes problems for this test, but
>>>> not in the sense of it FAILing since it is marked XFAIL anyhow. (All this
>>>> patch does is change the FAIL from one problem to another, but we still do
>>>> not specify what failure we expect to see nor check that that is what we
>>>> get.)
>>>>
>>>> So, the problem is not with this test, but that the system is left in some
>>>> whacked state due to having attempted so many probes. As a result, the
>>>> process "dtrace -o $file -c delaydie" is still around well after the depth
>>>> test has finished, the load average remains high, the pid0 process does not
>>>> run, and many tests time out. How long (how many seconds or how many tests)
>>>> does this effect last? It depends, as one can see above.
>>>>
>>>> This raises all sorts of other questions as well, such as whether tests in
>>>> our test suite should be allowed to depend on running on a somewhat idle
>>>> system, whether we (dtrace or the test suite) are cleaning up well enough,
>>>> etc.
>>> There is of course the aspect that this patch is part of a series, and the
>>> next patch in the series should resolve the impact issues of this test. If
>>> it makes you more happy, I'd happily move this patch to the end of the series.
>>> I don't see how it matters, because this crazy fallout from tst.depth.sh will
>>> be an issue either way until the per-PID uprohes support is applied.
>> I don't get it.
>>
>> First of all, the crazy fallout results with the patch and it doesn't
>> without the patch. Without the patch the test was basically SKIPing
>> (admittedly as a mistaken side effect of a test defect).
>>
>> Second, the above results, with the crazy fallout, included the entire patch
>> series. That is, the tip of the branch looked like this:
> Well, it was not clear from your email that you were testing this after you
> applied the entire series. That of course changes the situation.
>
> Clearly, there may be more wrong with this test. Or it might be pointing on a
> potential issue where DTrace might leave things hanging in a state that takes
> too long to clean up. Either way, though, the change in this patch is needed
> to make this test even remotely useful.
>
> If the patch causes the test to trigger other tests to fail, then more work
> is needed. I would still recommend that we let this patch go through, as it
> *is* a correct fix for a problem.
>
> And then look at what the new issue is that this test brings to light, and fix
> what is needed to fix that.
How about having the patch also SKIP the test and we simply file a bug
to track the issue. Otherwise, we're simply invalidating a bunch of
other tests (in the process, increasing the run time of the test
suite). The cascade of FAILs is even longer than I indicated for one of
the VMs. The patch is a fix to a problem, but it's only a partial fix
to what's going on here, a partial fix that sabotages quite a few other
tests. The related problems can be tracked much more efficiently and
directly with a bug ticket.
>> 59a49ce9 (tag: 2.0.4) doc: Add blank line before bold text so it is rendered
>> correctly
>> 61e69762 bpf: fix file descriptor leak
>> 20780963 uprobe: remove unnecessary enable_*() functions
>> 170c4cf9 test: fix tst.depth.sh to use proper trigger
>> cf7ae35b uprobe: Implement PID-specific uprobes
>>
>> If there are other bits/patches I should try, let me know.
>>
>>>> On 11/10/25 10:27, Kris Van Hees wrote:
>>>>> The test used 'date' as its trigger but that caused the test to XFAIL
>>>>> for the wrong reason: date does not have symbols and therefore dtrace
>>>>> failed to enable probes for it.
>>>>>
>>>>> Signed-off-by: Kris Van Hees <kris.van.hees at oracle.com>
>>>>> ---
>>>>> test/unittest/ustack/tst.depth.sh | 8 ++++----
>>>>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/test/unittest/ustack/tst.depth.sh b/test/unittest/ustack/tst.depth.sh
>>>>> index dec16a3a..977d1b42 100755
>>>>> --- a/test/unittest/ustack/tst.depth.sh
>>>>> +++ b/test/unittest/ustack/tst.depth.sh
>>>>> @@ -16,7 +16,7 @@ dtrace=$1
>>>>> rm -f $file
>>>>> -$dtrace $dt_flags -o $file -c date -s /dev/stdin <<EOF
>>>>> +$dtrace $dt_flags -o $file -c test/triggers/delaydie -s /dev/stdin <<EOF
>>>>> #pragma D option quiet
>>>>> #pragma D option bufsize=1M
>>>>> @@ -45,7 +45,7 @@ $dtrace $dt_flags -o $file -c date -s /dev/stdin <<EOF
>>>>> EOF
>>>>> status=$?
>>>>> -if [ "$status" -ne 0 ]; then
>>>>> +if [ $status -ne 0 ]; then
>>>>> echo $tst: dtrace failed
>>>>> rm -f $file
>>>>> exit $status
>>>>> @@ -78,8 +78,8 @@ EOF
>>>>> status=$?
>>>>> -count=`wc -l $file | cut -f1 -do`
>>>>> -if [ "$count" -lt 1000 ]; then
>>>>> +count=`wc -l $file | cut -f1 -d' '`
>>>>> +if [ $count -lt 1000 ]; then
>>>>> echo $tst: output was too short
>>>>> status=1
>>>>> fi
More information about the DTrace-devel
mailing list