[Ocfs2-users] nfsd hanging with ocfs2 1.4.7...

James Abbott j.abbott at imperial.ac.uk
Mon Mar 28 00:48:42 PDT 2011


Ah...I had initially mounted the volume with nordirplus, but it looks
like I missed it from the fstab entry. I'll see if that fixes the problem.
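
In case it's useful, the client fstab entries will end up looking
something like this (hostname and paths below are just placeholders):

  nfs-server:/srv/ocfs2   /shared   nfs   rw,hard,intr,nordirplus   0 0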

Thanks,
James

On Fri, Mar 25, 2011 at 08:39:36PM +0000, Sunil Mushran wrote:
> Are you mounting with nordirplus?
> 
> For more details, refer to this email:
> http://oss.oracle.com/pipermail/ocfs2-announce/2008-June/000025.html
> 
> On 03/25/2011 08:49 AM, James Abbott wrote:
> > Hello,
> >
> > I've recently set up an ocfs2 volume via a 4Gb/s SAN which is directly
> > mounted on two CentOS 5.5 machines (2.6.18-194.32.1.el5). Both servers
> > export the volume via NFSv3 to our HPC cluster. This replaces a single
> > NFS server exporting an ext3 volume which was unable to keep up with
> > our IO requirements. I switched over to the new ocfs2 volume on Monday,
> > and it had been performing pretty well overall. This morning, however,
> > I saw significant load on both NFS servers (load > 30, which is not
> > unheard of since we are running 32 NFS threads per machine), but
> > attempting to access the shared volume resulted in a hanging
> > connection.
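> >
> > For reference, the NFS side of each server is pretty vanilla: 32 nfsd
> > threads (RPCNFSDCOUNT=32 in /etc/sysconfig/nfs) and a single
> > /etc/exports entry roughly like the following (the path and client
> > network here are placeholders):
> >
> >   /srv/ocfs2   10.0.0.0/16(rw,no_subtree_check)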
> >
> > Logging into the NFS servers showed that the ocfs2 volume could be
> > accessed fine and was responsive; the load on the machines was clearly
> > coming from nfsd. iostat showed no substantial activity on the ocfs2
> > volume despite the NFS load. The dmesg output on both servers shows a
> > number of hung task warnings:
> >
> > Mar 25 12:02:13 bss-adm2 kernel: INFO: task nfsd:996 blocked for more than 120 seconds.
> > Mar 25 12:02:13 bss-adm2 kernel: "echo 0>  /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Mar 25 12:02:13 bss-adm2 kernel: nfsd          D ffff81041ab1d7e0     0 996      1          1008  1017 (L-TLB)
> > Mar 25 12:02:13 bss-adm2 kernel:  ffff8102b8e05d00 0000000000000046 0000000000000246 ffffffff889678b9
> > Mar 25 12:02:13 bss-adm2 kernel:  ffff8103eaa20000 000000000000000a ffff81041262e040 ffff81041ab1d7e0
> > Mar 25 12:02:13 bss-adm2 kernel:  00022676fafd0a61 00000000000021b6 ffff81041262e228 00000007de477ba0
> > Mar 25 12:02:13 bss-adm2 kernel: Call Trace:
> > Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff889678b9>] :ocfs2:ocfs2_cluster_unlock+0x290/0x30d
> > Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff88979236>] :ocfs2:ocfs2_permission+0x137/0x1a4
> > Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff8000d9d8>] permission+0x81/0xc8
> > Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
> > Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
> > Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff8882c981>] :nfsd:nfsd_lookup_dentry+0x306/0x418
> > Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff887ab4b4>] :sunrpc:ip_map_match+0x19/0x30
> > Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff8882cab5>] :nfsd:nfsd_lookup+0x22/0xb0
> > Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff887ab59e>] :sunrpc:ip_map_lookup+0xbc/0xc3
> > Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff8883347d>] :nfsd:nfsd3_proc_lookup+0xc5/0xd2
> > Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff888281db>] :nfsd:nfsd_dispatch+0xd8/0x1d6
> > Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff887a8651>] :sunrpc:svc_process+0x454/0x71b
> > Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff80064644>] __down_read+0x12/0x92
> > Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff888285a1>] :nfsd:nfsd+0x0/0x2cb
> > Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff88828746>] :nfsd:nfsd+0x1a5/0x2cb
> > Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
> > Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff888285a1>] :nfsd:nfsd+0x0/0x2cb
> > Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff888285a1>] :nfsd:nfsd+0x0/0x2cb
> > Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
> >
> > Although these are obviously nfsd hangs, the fact that they occurred
> > on both servers at the same time makes me suspect something on the
> > ocfs2 side. It was necessary to shut down nfsd and restart the cluster
> > nodes in order for them to resume.
> >
> > Being new to ocfs2, I'm not quite sure where to look for clues as to
> > what caused this. I'm guessing from the ocfs2_cluster_unlock at the top
> > of the stack trace that this is an o2cb locking issue. The NFS traffic
> > is going over the same (1Gb) network connections as the o2cb heartbeat,
> > so I'm wondering if that may have contributed to the problem. I should
> > be able to add a separate fabric for the o2cb heartbeat if that might
> > be the cause, although neither of the servers was fenced.
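> >
> > If a dedicated heartbeat network turns out to be the answer, I assume
> > it's mostly a matter of repointing the ip_address entries in
> > /etc/ocfs2/cluster.conf at interfaces on the new network and then
> > restarting the cluster stack, something along these lines (node names,
> > addresses, port and cluster name below are just placeholders):
> >
> > cluster:
> >         node_count = 2
> >         name = ocfs2
> >
> > node:
> >         ip_port = 7777
> >         ip_address = 10.10.10.1
> >         number = 0
> >         name = bss-adm1
> >         cluster = ocfs2
> >
> > node:
> >         ip_port = 7777
> >         ip_address = 10.10.10.2
> >         number = 1
> >         name = bss-adm2
> >         cluster = ocfs2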
> >
> > Anyone have any suggestions?
> >
> > Many thanks,
> > James
> >
> 

-- 
Dr. James Abbott
Bioinformatics Software Developer
Imperial College, London


