[Ocfs2-users] OCFS2 + NFS setup deadlocking

Sunil Mushran Sunil.Mushran at oracle.com
Wed May 21 12:11:15 PDT 2008


If the hang you see happens after a node (with a mounted ocfs2 volume) dies,
then it is a known issue. This specific recovery bug was introduced in 1.2.7
and fixed in 1.2.8-2. 1.2.8-SLES-r3074 maps to 1.2.8-1. The fixed build should
be r3080 or later.

If so, upgrade to the latest SLES10 SP1 kernel. This was detected and
fixed a few months ago.
http://oss.oracle.com/pipermail/ocfs2-commits/2008-January/002350.html
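
To confirm which driver build you end up with after the update, something like
the following should do. (The version field in modinfo is an assumption on my
part; the load-time banner, like the ones you pasted below, is always logged.)

# installed module build, if the SLES package exposes a version field
modinfo ocfs2 | grep -i version

# banner printed when the module loads
dmesg | grep 'OCFS2 1.2'

Anything reporting r3080 or later has the recovery fix.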

But is that the real issue? You don't mention a server going down in your
original problem report, only during testing. Does a server go down during
regular operation too?

One change I would recommend: your network idle timeout is too low. We have
since increased the default to 30 secs.
http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT
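
With a recent enough ocfs2-tools, the timeouts live in /etc/sysconfig/o2cb and
can also be set interactively via '/etc/init.d/o2cb configure'. A rough sketch,
assuming your tools version already exposes these variables (the exact names
are documented in the FAQ above; your 1.2.3 tools may predate them):

# /etc/sysconfig/o2cb -- keep the values identical on both nodes
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
O2CB_HEARTBEAT_THRESHOLD=301
O2CB_IDLE_TIMEOUT_MS=30000    # network idle timeout: 30 secs, not the 10 in your log

After changing them, unmount the volumes and restart the cluster stack
(something like '/etc/init.d/o2cb restart') on both nodes, since the timeouts
have to match cluster-wide.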

Sunil


Sérgio Surkamp wrote:
> Hi all,
>
> We set up an OCFS2 cluster on our storage and exported it over NFS to 
> other network servers. It was working fine, but suddenly all NFS clients 
> locked up, and they only recovered after we rebooted every server (including 
> the OCFS2 servers). It seems that under heavy load the OCFS2+NFS setup 
> deadlocks.
>
> Setup:
> * 1 Dell Storage AX100
> * 2 Dell servers, running SuSE 10 SP1 x86_64 and attached to the storage 
> using QLogic fibre channel HBAs
> * 4 Dell servers, running FreeBSD and accessing the shared storage by NFS
>
> The FreeBSD servers are connected in two groups: two of them mount the suse #1 
> nfsd and two mount the suse #2 nfsd, to split the load. Network interfaces 
> are connected to a dedicated gigabit switch carrying the NFS and 
> OCFS2 (heartbeat/sync messages) traffic.
>
> Without NFS it seems to work fine. We stressed the filesystem using 
> 'iozone' many times on both servers at the same time and it worked as expected.
>
> During deadlock recovery, we rebooted the slave OCFS2 server (suse01) 
> first and checked 'dmesg' on the master:
>
> o2net: connection to node suse01 (num 1) at 192.168.0.1:7777 has been 
> idle for 10.0 seconds, shutting it down.
> (0,0):o2net_idle_timer:1434 here are some times that might help debug 
> the situation: (tmr 1211375306.9290 now 1211375316.11998 dr 
> 1211375306.9272 adv 1211375306.9313:1211375306.9314 func (300d6acb:502) 
> 1211374816.37752:1211374816.37756)
> o2net: no longer connected to node suse01 (num 1) at 192.168.0.1:7777
> (15331,4):dlm_get_lock_resource:932 
> F59B45831EEA41F384BADE6C4B7A932B:M000000000000000000001ba9d5b7e0: at 
> least one node (1) torecover before lock mastery can begin
> (5313,4):dlm_get_lock_resource:932 
> F59B45831EEA41F384BADE6C4B7A932B:$RECOVERY: at least one node (1) 
> torecover before lock mastery can begin
> (5313,4):dlm_get_lock_resource:966 F59B45831EEA41F384BADE6C4B7A932B: 
> recovery map is not empty, but must master $RECOVERY lock now
> (15331,4):ocfs2_replay_journal:1173 Recovering node 1 from slot 1 on 
> device (8,17)
> kjournald starting.  Commit interval 5 seconds
> o2net: accepted connection from node suse01 (num 1) at 192.168.0.1:7777
> ocfs2_dlm: Node 1 joins domain F59B45831EEA41F384BADE6C4B7A932B
> ocfs2_dlm: Nodes in domain ("F59B45831EEA41F384BADE6C4B7A932B"): 0 1
>
> It seems to me that something is deadlocking in the DLM resource manager. I 
> used debugfs.ocfs2 to list the active locks, and many of them have 
> "Blocking Mode" and/or "Requested Mode" marked as "Invalid". Could that be 
> one of the problems? Why is there an Invalid Blocking Mode for 
> DLM locks? Is it just a pre-allocated empty lock?
>
> System configuration:
> --> o2cb:
> # O2CB_ENABLED: 'true' means to load the driver on boot.
> O2CB_ENABLED=true
>
> # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
> O2CB_BOOTCLUSTER=ocfs2
>
> # TIMEOUT - 600s
> O2CB_HEARTBEAT_THRESHOLD=301
>
> --> cluster.conf:
> node:
>          ip_port = 7777
>          ip_address = 192.168.0.10
>          number = 0
>          name = suse02
>          cluster = ocfs2
>
> node:
>          ip_port = 7777
>          ip_address = 192.168.0.1
>          number = 1
>          name = suse01
>          cluster = ocfs2
>
> cluster:
>          node_count = 2
>          name = ocfs2
>
> FreeBSD setup:
> * Default NFS Client configuration.
> * nfslocking daemon disabled.
> * NFS not soft mounted.
>
> SuSE package versions:
> ocfs2-tools-1.2.3-0.7
> ocfs2console-1.2.3-0.7
> nfs-utils-1.0.7-36.26
> nfsidmap-0.12-16.17
>
> OCFS2 kernel driver version:
> OCFS2 1.2.8-SLES-r3074 Fri Jan  4 23:47:26 UTC 2008 (build sles)
> OCFS2 Node Manager 1.2.8-SLES-r3074 Fri Jan  4 23:47:26 UTC 2008 (build 
> sles)
> OCFS2 DLM 1.2.8-SLES-r3074 Fri Jan  4 23:47:26 UTC 2008 (build sles)
> OCFS2 DLMFS 1.2.8-SLES-r3074 Fri Jan  4 23:47:26 UTC 2008 (build sles)
>
> Any tips on what is going on?
>
> Thanks for any help.
>
> Regards,
>   
