[Ocfs2-users] nfsd hanging with ocfs2 1.4.7...

James Abbott j.abbott at imperial.ac.uk
Fri Mar 25 08:49:02 PDT 2011


Hello,

I've recently set up an ocfs2 volume via a 4Gb/s SAN which is directly
mounted on two CentOS 5.5 machines (2.6.18-194.32.1.el5). Both servers
are exporting the volume via NFSv3 to our HPC cluster. This is to replace
a single NFS server exporting an ext3 volume which was unable to keep up
with our IO requirements. I switched over to using the new ocfs2 volume on
Monday, and it had been performing pretty well overall. This morning,
however, I saw significant load on both NFS servers (load >30, which
is not unheard of since we are running 32 NFS threads per machine), and
attempting to access the shared volume resulted in a hanging
connection.

Logging into the NFS servers showed that the ocfs2 volume could be
accessed fine and was responsive, but the load on the machines was
clearly coming from nfsd. iostat showed no substantial activity
on the ocfs2 volume despite the NFS load. dmesg output on both servers
shows a number of hung task warnings:

Mar 25 12:02:13 bss-adm2 kernel: INFO: task nfsd:996 blocked for more than 120 seconds.
Mar 25 12:02:13 bss-adm2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 12:02:13 bss-adm2 kernel: nfsd          D ffff81041ab1d7e0     0 996      1          1008  1017 (L-TLB)
Mar 25 12:02:13 bss-adm2 kernel:  ffff8102b8e05d00 0000000000000046 0000000000000246 ffffffff889678b9
Mar 25 12:02:13 bss-adm2 kernel:  ffff8103eaa20000 000000000000000a ffff81041262e040 ffff81041ab1d7e0
Mar 25 12:02:13 bss-adm2 kernel:  00022676fafd0a61 00000000000021b6 ffff81041262e228 00000007de477ba0
Mar 25 12:02:13 bss-adm2 kernel: Call Trace:
Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff889678b9>] :ocfs2:ocfs2_cluster_unlock+0x290/0x30d
Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff88979236>] :ocfs2:ocfs2_permission+0x137/0x1a4
Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff8000d9d8>] permission+0x81/0xc8
Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff8882c981>] :nfsd:nfsd_lookup_dentry+0x306/0x418
Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff887ab4b4>] :sunrpc:ip_map_match+0x19/0x30
Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff8882cab5>] :nfsd:nfsd_lookup+0x22/0xb0
Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff887ab59e>] :sunrpc:ip_map_lookup+0xbc/0xc3
Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff8883347d>] :nfsd:nfsd3_proc_lookup+0xc5/0xd2
Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff888281db>] :nfsd:nfsd_dispatch+0xd8/0x1d6
Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff887a8651>] :sunrpc:svc_process+0x454/0x71b
Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff80064644>] __down_read+0x12/0x92
Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff888285a1>] :nfsd:nfsd+0x0/0x2cb
Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff88828746>] :nfsd:nfsd+0x1a5/0x2cb
Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff888285a1>] :nfsd:nfsd+0x0/0x2cb
Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff888285a1>] :nfsd:nfsd+0x0/0x2cb
Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

Although these are obviously nfsd hangs, the fact that they occurred on
both servers at the same time makes me suspect something on the ocfs2
side. It was necessary to shut down nfsd and restart the cluster nodes in
order for them to resume.
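
To compare the warnings across both servers, something like the
following rough Python sketch could tally which function sits at the top
of each hung-task trace. It assumes the syslog format shown above; the
function name summarise_hung_tasks and the two regular expressions are
purely illustrative and are not part of any ocfs2 or kernel tooling.

```python
import re
from collections import Counter

# Hung-task header, e.g.
#   "... kernel: INFO: task nfsd:996 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more than \d+ seconds"
)

# Stack frame, e.g.
#   "... kernel:  [<ffffffff889678b9>] :ocfs2:ocfs2_cluster_unlock+0x290/0x30d"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<sym>\S+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarise_hung_tasks(lines):
    """Count hung-task warnings by (task name, first frame in the trace).

    Hypothetical helper: it only understands the log layout shown above.
    """
    counts = Counter()
    pending = None  # task name still waiting for its first stack frame
    for line in lines:
        header = HUNG_RE.search(line)
        if header:
            pending = header.group("comm")
            continue
        frame = FRAME_RE.search(line)
        if frame and pending is not None:
            counts[(pending, frame.group("sym"))] += 1
            pending = None  # ignore the remaining frames of this trace
    return counts
```

The function accepts any iterable of log lines, so an open file handle
for the syslog file (its location is an assumption and depends on the
syslog configuration) can be passed in directly. Note that the first
frame printed in a trace is not necessarily where the task is actually
blocked, so the tally is only a starting point.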

Being new to ocfs2, I'm not quite sure where to look for clues as to what
caused this. I'm guessing from the ocfs2_cluster_unlock at the top of the
stack trace that this is an o2cb locking issue. The NFS traffic is going
over the same (1Gb) network connections as the o2cb heartbeat, so I'm
wondering if that may have contributed to the problem. I should be able
to add a separate fabric for the o2cb heartbeat if that might be the
cause; however, neither of the servers was fenced.

Anyone have any suggestions?

Many thanks,
James

-- 
Dr. James Abbott
Bioinformatics Software Developer
Imperial College, London


