[Ocfs2-users] nfsd hanging with ocfs2 1.4.7...

Sunil Mushran sunil.mushran at oracle.com
Fri Mar 25 13:39:36 PDT 2011


Are you mounting with nordirplus?

For more detail, refer to this email:
http://oss.oracle.com/pipermail/ocfs2-announce/2008-June/000025.html
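
If the HPC clients are not already mounting with it, this can be checked and changed on the client side. A minimal sketch only, not taken from the thread: the server name, export path and mount point are made-up examples, and nordirplus needs a client kernel that supports it.

    # On an NFS client, check the mount options currently in effect:
    nfsstat -m
    grep nfs /proc/mounts

    # Remount the export with READDIRPLUS disabled (NFSv3); an
    # unmount/mount cycle is the safest way to change NFS options:
    umount /mnt/share
    mount -t nfs -o vers=3,nordirplus nfsserver:/export /mnt/share

    # Equivalent /etc/fstab entry:
    # nfsserver:/export  /mnt/share  nfs  vers=3,nordirplus,hard,intr  0 0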

On 03/25/2011 08:49 AM, James Abbott wrote:
> Hello,
>
> I've recently set up an ocfs2 volume via a 4Gb/s SAN which is directly
> mounted on two CentOS 5.5 machines (2.6.18-194.32.1.el5). Both servers
> are exporting the volume via NFS3 to our HPC cluster. This is to replace
> a single NFS server exporting an ext3 volume which was unable to keep up
> with our IO requirements. I switched over to using the new ocfs2 volume on
> Monday, and it had been performing pretty well overall. This morning,
> however, I saw significant loads appearing on both the NFS servers (load
> above 30, which is not unheard of since we are running 32 NFS threads per
> machine), but attempting to access the shared volume resulted in a
> hanging connection.
>
> Logging into the NFS servers showed that the ocfs2 volume could be
> accessed fine and was responsive; however, the load on the machines was
> clearly coming from nfsd. iostat showed there was no substantial activity
> on the ocfs2 volume despite the NFS load. dmesg output on both servers
> shows a number of hung task warnings:
>
> Mar 25 12:02:13 bss-adm2 kernel: INFO: task nfsd:996 blocked for more than 120 seconds.
> Mar 25 12:02:13 bss-adm2 kernel: "echo 0>  /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Mar 25 12:02:13 bss-adm2 kernel: nfsd          D ffff81041ab1d7e0     0 996      1          1008  1017 (L-TLB)
> Mar 25 12:02:13 bss-adm2 kernel:  ffff8102b8e05d00 0000000000000046 0000000000000246 ffffffff889678b9
> Mar 25 12:02:13 bss-adm2 kernel:  ffff8103eaa20000 000000000000000a ffff81041262e040 ffff81041ab1d7e0
> Mar 25 12:02:13 bss-adm2 kernel:  00022676fafd0a61 00000000000021b6 ffff81041262e228 00000007de477ba0
> Mar 25 12:02:13 bss-adm2 kernel: Call Trace:
> Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff889678b9>] :ocfs2:ocfs2_cluster_unlock+0x290/0x30d
> Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff88979236>] :ocfs2:ocfs2_permission+0x137/0x1a4
> Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff8000d9d8>] permission+0x81/0xc8
> Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
> Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
> Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff8882c981>] :nfsd:nfsd_lookup_dentry+0x306/0x418
> Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff887ab4b4>] :sunrpc:ip_map_match+0x19/0x30
> Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff8882cab5>] :nfsd:nfsd_lookup+0x22/0xb0
> Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff887ab59e>] :sunrpc:ip_map_lookup+0xbc/0xc3
> Mar 25 12:02:13 bss-adm2 kernel:  [<ffffffff8883347d>] :nfsd:nfsd3_proc_lookup+0xc5/0xd2
> Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff888281db>] :nfsd:nfsd_dispatch+0xd8/0x1d6
> Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff887a8651>] :sunrpc:svc_process+0x454/0x71b
> Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff80064644>] __down_read+0x12/0x92
> Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff888285a1>] :nfsd:nfsd+0x0/0x2cb
> Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff88828746>] :nfsd:nfsd+0x1a5/0x2cb
> Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
> Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff888285a1>] :nfsd:nfsd+0x0/0x2cb
> Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff888285a1>] :nfsd:nfsd+0x0/0x2cb
> Mar 25 12:02:14 bss-adm2 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
>
> Although these are obviously nfsd hangs, the fact that they occurred on both
> servers at the same time makes me suspect something on the ocfs2 side. It
> was necessary to shut down nfsd and restart the cluster nodes in order
> for them to resume.
>
> Being new to ocfs2, I'm not quite sure where to look for clues as to what
> caused this. I'm guessing from the ocfs2_cluster_unlock at the top of the
> stack trace that this is an o2cb locking issue. The NFS traffic is going
> over the same (1Gb) network connections as the o2cb heartbeat, so I'm
> wondering if that may have contributed to the problem. I should be able
> to add a separate fabric for the o2cb heartbeat if that might be the
> cause, although neither of the servers was fenced.
>
> Anyone have any suggestions?
>
> Many thanks,
> James
>
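
On the separate fabric James mentions: the address each node uses for the o2cb network interconnect is whatever is listed in /etc/ocfs2/cluster.conf, so moving that traffic off the NFS network means pointing those addresses at dedicated interfaces on all nodes. A rough sketch only; the cluster name, node names and addresses below are made-up examples, and each node name has to match that machine's hostname.

    cluster:
            node_count = 2
            name = ocfs2cluster

    node:
            ip_port = 7777
            ip_address = 192.168.100.1
            number = 0
            name = node1
            cluster = ocfs2cluster

    node:
            ip_port = 7777
            ip_address = 192.168.100.2
            number = 1
            name = node2
            cluster = ocfs2cluster

The file has to be identical on both nodes, and the volume has to be unmounted and the cluster stack restarted (e.g. service o2cb restart) before the new interconnect addresses take effect.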



