[DTrace-devel] Dtrace on Linux Port

Mon Mar 5 03:32:58 PST 2018

On 2 Mar 2018, Paul Dagnelie uttered the following:

> Hey Nick,
>
> Thanks for the feedback! I managed to get past that issue (I made a silly
> mistake and only partially finished rebuilding the kernel after switching
> branches a couple times), but now I am actually hitting a crash in
> dwarf2ctf.

Oh dear. Sorry! I'm afraid the only real test harness for dwarf2ctf
before now has been Oracle UEK kernels, so different combinations of
modules might trigger bugs for a while. Untrodden snow...

>            Unfortunately, the core dump is 7.2GB.

Unfortunately dwarf2ctf assembles very large tables in the course of its
work, and coredumps are often not much use because the call stacks are
usually quite boring: it's the parameters those functions were called
with, and the global data they manipulate, that are of interest (it
assembles huge hashtables in one pass, then examines them in the next
pass).

This is why it has a separate tracing facility (add -DDEBUG to
HOSTCFLAGS_dwarf2ctf.o in scripts/dwarf2ctf/Makefile, delete
scripts/dwarf2ctf/dwarf2ctf.o to force rebuilding with -DDEBUG, then
re-make with DWARF2CTF_TRACE set in the environment). This emits
gigabytes of trace output on stderr, but it's much more compressible
than a coredump and will often provide enough info to figure out what is
going wrong on its own. (The key is usually to look at the type ID being
emitted right before the crash and search backwards for previous
references to the same ID. Those references will be the deduplicator
running over the same type in various kernel modules; put all of those
lines together and you have everything dwarf2ctf did with that type, no
matter what object file it came from.)

Because the trace is really, really repetitive, even gigabytes of it
should lzma down to something manageable, and the trace in conjunction
with a few eu-readelf -w calls on the object files that things are going
wrong in almost always highlights the problem quite soon.
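
As a rough illustration of the backwards search described above, here is
a minimal Python sketch. It assumes (and this is only an assumption, as
the real trace format is not shown here) that each trace line mentions
the type ID as a standalone token. The function names are made up for
this example, and it reads an lzma-compressed trace directly rather than
unpacking it to disk first.

```python
import lzma
import re


def lines_mentioning(stream, type_id):
    """Yield (line_number, line) for each line that mentions type_id.

    Assumption: the ID shows up in the trace as a standalone token. The
    actual dwarf2ctf trace format may differ, so adjust the pattern.
    """
    # Avoid matching the ID inside a longer identifier or number.
    pattern = re.compile(r"(?<!\w)" + re.escape(type_id) + r"(?!\w)")
    for number, line in enumerate(stream, start=1):
        if pattern.search(line):
            yield number, line.rstrip("\n")


def open_trace(path):
    """Open a trace as text, decompressing .xz/.lzma files on the fly."""
    if path.endswith((".xz", ".lzma")):
        return lzma.open(path, "rt", errors="replace")
    return open(path, "r", errors="replace")
```

Something like lines_mentioning(open_trace("trace.xz"), "0x1a2") would
then walk every mention of that ID in order; both the file name and the
ID here are placeholders, not values from a real run.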

>                                                   I've taken a quick poke
> around with gdb, and the crash is in ctf_type_resolve. Specifically, the
> instruction that's faulting is `mov %rdi,0x8(%rsp)`, which looks to me like
> it's overflowing a stack. It's possible that means the backtrace I got is
> actually accurate, but I doubt it; after a few reasonable looking entries,
> it quickly starts repeating `#n  0x00007f818c395c7d in ctf_type_align
> (fp=<optimized out>, type=<optimized out>) at libctf/ctf_types.c:408` for
> at least the next 15 thousand entries.

I speculate that a structure has ended up pointing to itself: the
question now is where that structure is and how on earth it ended up in
that situation. This should be impossible (of course). In fact I'm not
immediately sure how we could even *construct* such a structure, since
you can't point at a type that doesn't exist yet...
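
To make the suspected failure mode concrete: a resolver that follows
references from one type to the next never terminates if a type refers
back to itself, which would be consistent with a backtrace repeating the
same frame thousands of times. The sketch below is a toy model in
Python, not libctf's code; the dict-based type table and the names
resolve and TypeCycleError are invented purely for illustration.

```python
class TypeCycleError(Exception):
    """Raised when following type references loops back on itself."""


def resolve(types, type_id):
    """Follow references from type_id until reaching a base type.

    `types` maps a type ID to the ID it refers to, or to None for a base
    type. This is a hypothetical stand-in for a real type table.
    """
    seen = []
    current = type_id
    while types[current] is not None:
        if current in seen:
            # Report the loop instead of chasing it until the stack
            # (or our patience) runs out.
            raise TypeCycleError(seen[seen.index(current):] + [current])
        seen.append(current)
        current = types[current]
    return current
```

With a table such as {1: 2, 2: 3, 3: None, 4: 4}, resolving type 1
reaches type 3, while resolving type 4 raises TypeCycleError carrying
the chain [4, 4] instead of looping forever.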

This is *probably* some object file that's playing preprocessor games
with a structure that we need to blacklist from deduplication: either
that or it's a genuine bug. It should be obvious which it is from the
trace output.