[OracleOSS] [TitleIndex] [WordIndex]

OCFS2/Debugging

IN PROGRESS

DEBUGGING OCFS2 ISSUES

This page explores beyond typical configuration issues.

Debugging Infrastructure

Before delving into the different types of issues, we'll explore the debugging infrastructure in OCFS2/kernel.

OCFS2 is built with static tracing code where each log is controlled by one or more tracing bits. By selectively enabling these bits, one is able to get trace logs for various components in the file system. debugfs.ocfs2 -l lists all the tracing bits.

# debugfs.ocfs2 -l
ENTRY deny
EXIT deny
TCP off
MSG off
SOCKET off
HEARTBEAT off
HB_BIO off
DLMFS off
DLM off
DLM_DOMAIN off
DLM_THREAD off
DLM_MASTER off
DLM_RECOVERY off
AIO off
JOURNAL off
DISK_ALLOC off
SUPER off
FILE_IO off
EXTENT_MAP off
DLM_GLUE off
BH_IO off
UPTODATE off
NAMEI off
INODE off
VOTE off
DCACHE off
CONN off
QUORUM off
EXPORT off
ERROR allow
NOTICE allow
KTHREAD off

The idea here is to enable only the bits that are likely to provide the required information. So if the issue concerns the filesystem, enabling INODE and NAMEI would be a good start. For the dlm, it would be DLM and DLM_THREAD.

The possible actions are deny, allow (enable) and off (disable). deny is used to limit tracing in code paths where more than one trace bit is set. It is best to refer to the code to understand this. mlog() macros is what is logged in the trace. The default trace bit for the component is specified by MLOG_MASK_PREFIX. Any additional bits are specified in the first argument to mlog().

One way to get a trace involves enabling it, sleeping a shortwhile and disabling it. It is important one resets the trace bits to the default setting. Notice that ENTRY and EXIT are disabled as deny (not off) as these are in all trace paths.

# debugfs.ocfs2 -l ENTRY EXIT NAMEI INODE allow; sleep 10; debugfs.ocfs2 -l ENTRY EXIT deny NAMEI INODE off;

If a specific command (mv foo bar) is causing an error and you wish to see where the error is originating in the fs, do:

# debugfs.ocfs2 -l ENTRY EXIT NAMEI INODE allow
# mv foo bar & THAT_PID=$(jobs -p %-)
# echo $THAT_PID
# debugfs.ocfs2 -l ENTRY EXIT deny NAMEI INODE off

Note that as the trace is enabled for all mounts, knowing the pid of the offending process helps in reading the trace.

It is never a good idea to enable all the trace bits. Limit tracing to what you think would be relevant.

Both RHEL4 and SLES9 kernels support netconsole. It is important one sets this up at the get go rather than wait for the kernel oops. The oops traces are very helpful to Kernel developers to diagnose problems and netconsole is an easy way of catching the entire oops. (Images of oops trace on a console is rarely helpful as part of the output invariably scrolls up.)

tcpdump is an useful tool to capture the network traffic. Unlike the filesystem, which persists on the drive, everything the dlm does is in memory and communicated to other nodes via messages. Trapping these messages on all nodes allows us to get a peek into the state of the dlm. When capturing, always capture packets on all nodes.

The following command captures traffic at port 7777 on eth1 interface. It limits the size of the files to 10M and count of such files to 15.

# tcpdump -i eth1 -C 10 -W 15 -s 10000 -Sw /tmp/`hostname -s`_tcpdump.log -ttt 'port 7777' &

If someone is interested, we have the ocfs2 patch for ethereal (now wireshark). The rpm for some distros/arch is available here.

echo t > /proc/sysrq-trigger is a good way to dump the kernel stack. However, this should be avoided, as far as possible, when ocfs2 is active. The problem is that if the heartbeat timeout is set too low, it may reboot the box. However, as most users set the timeout to 60 to 120 secs, it may work but I have not tested it yet.

The fs_locks command in debugfs.ocfs2 prints out all the locks that the filesystem is tracking. This is usually the first thing to look into in hang situations. The scanlocks script dumps the locks from all the mounted volumes. While it lists the busy locks on the terminal, the full list of locks for each mount is stored in /tmp/_fsl_sdX. The bad news is that it does not work on SLES9 but should on SLES10.

Issues

Following are the list of issues encountered by users.

A kernel panics/oopses a machine invariably due to a code bug. In order to resolve the oops, the full kernel version and the complete oops trace is required. The easiest way to catch full oops traces is via netconsole.

If a filesystem operation is hanging, start with fs_locks. It will list all the Busy locks.

Let's look at a sample output:

Lockres: M000000000000000006672078b84822  Mode: Protected Read
Flags: Initialized Attached
RO Holders: 0  EX Holders: 0
Pending Action: None  Pending Unlock Action: None
Requested Mode: Protected Read  Blocking Mode: Invalid

First thing to note is the Lockres, which is the lockname. The dlm identifies resources using locknames. A lockname is a combination of a lock type (S superblock, M metadata, D filedata, R rename, W readwrite), inode number and generation. To get the inode number and generation from lockname, do:

#echo "stat <M000000000000000006672078b84822>" | debugfs.ocfs2 -n /dev/sdX
Inode: 419616   Mode: 0666   Generation: 2025343010 (0x78b84822)
....

To map the lockname to a directory entry, do:

# echo "locate <M000000000000000006672078b84822>" | debugfs.ocfs2 -n /dev/sdX
419616  /linux-2.6.15/arch/i386/kernel/semaphore.c

One could also provide the inode number instead of the lockname.

# echo "locate <419616>" | debugfs.ocfs2 -n /dev/sdX
419616  /linux-2.6.15/arch/i386/kernel/semaphore.c

To get a lockname from a directory entry, do:

# echo "encode /linux-2.6.15/arch/i386/kernel/semaphore.c" | debugfs.ocfs2 -n /dev/sdX
M000000000000000006672078b84822 D000000000000000006672078b84822 W000000000000000006672078b84822

The first is the Metadata lock, then Data lock and last ReadWrite lock for the same resource.

The DLM supports 3 lock modes: NL no lock, PR protected read and EX exclusive.

If you have a dlm hang, the resource to look for would be one with the "Busy" flag set.

The next step would be to query the dlm for the lock resource.

Note: The dlm debugging is still a work in progress.

To do dlm debugging, first one needs to know the dlm domain, which matches the volume UUID.

# echo "stats" | debugfs.ocfs2 -n /dev/sdX | grep UUID: | while read a b ; do echo $b ; done
82DA8137A49A47E4B187F74E09FBBB4B

Then do:

# echo R dlm_domain lockname > /proc/fs/ocfs2_dlm/debug

For example:

# echo R 82DA8137A49A47E4B187F74E09FBBB4B M000000000000000006672078b84822 > /proc/fs/ocfs2_dlm/debug
# dmesg | tail
struct dlm_ctxt: 82DA8137A49A47E4B187F74E09FBBB4B, node=79, key=965960985
  lockres: M000000000000000006672078b84822, owner=75, state=0 last used: 0, on purge list: no
    granted queue:
      type=3, conv=-1, node=79, cookie=11673330234144325711, ast=(empty=y,pend=n), bast=(empty=y,pend=n)
    converting queue:
    blocked queue:

It shows that the lock is mastered by node 75 and that node 79 has been granted a PR lock on the resource.

To get a list of all processes with the names of the functions that the process is sleeping in, do:

# ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN

To get a list of o2net messages that have not been acknowledged, do:

# cat /proc/fs/ocfs2_nodemanager/send_tracking

To get a list of all socket containers, do:

# cat /proc/fs/ocfs2_nodemanager/sock_containers

To get a list of total number of locally and remotely mastered locks, do:

# cat /proc/fs/ocfs2_dlm/*/stat

This is just to give a flavor of dlm debugging.


2011-12-23 01:01