[Ocfs2-users] OCFS2 + NFS setup deadlocking

Sérgio Surkamp sergio at gruposinternet.com.br
Wed May 21 11:10:34 PDT 2008


Hi all,

We set up an OCFS2 cluster on our storage and exported it over NFS to 
other servers on the network. It was working fine, but it suddenly 
locked up all NFS clients, and they recovered only after we rebooted 
all servers (including the OCFS2 servers). It seems that under heavy 
load, the OCFS2+NFS combination deadlocks.

Setup:
* 1 Dell Storage AX100
* 2 Dell servers, running SuSE 10-sp1 x86_64 and attached to the 
storage through QLogic Fibre Channel HBAs
* 4 Dell servers, running FreeBSD and accessing the shared storage over NFS

The FreeBSD servers are split into 2 groups to spread the load: 2 of 
them mount the NFS export from suse #1 and 2 mount it from suse #2. The 
network interfaces are connected to a gigabit network whose switch is 
dedicated to NFS and OCFS2 (heartbeat/sync messages) traffic.

Without NFS, it seems to work fine. We stressed the filesystem with 
'iozone' many times on both servers at the same time and it worked as 
expected.

During deadlock recovery, we rebooted the slave OCFS2 server (suse01) 
first and checked 'dmesg' on the master:

o2net: connection to node suse01 (num 1) at 192.168.0.1:7777 has been 
idle for 10.0 seconds, shutting it down.
(0,0):o2net_idle_timer:1434 here are some times that might help debug 
the situation: (tmr 1211375306.9290 now 1211375316.11998 dr 
1211375306.9272 adv 1211375306.9313:1211375306.9314 func (300d6acb:502) 
1211374816.37752:1211374816.37756)
o2net: no longer connected to node suse01 (num 1) at 192.168.0.1:7777
(15331,4):dlm_get_lock_resource:932 
F59B45831EEA41F384BADE6C4B7A932B:M000000000000000000001ba9d5b7e0: at 
least one node (1) torecover before lock mastery can begin
(5313,4):dlm_get_lock_resource:932 
F59B45831EEA41F384BADE6C4B7A932B:$RECOVERY: at least one node (1) 
torecover before lock mastery can begin
(5313,4):dlm_get_lock_resource:966 F59B45831EEA41F384BADE6C4B7A932B: 
recovery map is not empty, but must master $RECOVERY lock now
(15331,4):ocfs2_replay_journal:1173 Recovering node 1 from slot 1 on 
device (8,17)
kjournald starting.  Commit interval 5 seconds
o2net: accepted connection from node suse01 (num 1) at 192.168.0.1:7777
ocfs2_dlm: Node 1 joins domain F59B45831EEA41F384BADE6C4B7A932B
ocfs2_dlm: Nodes in domain ("F59B45831EEA41F384BADE6C4B7A932B"): 0 1
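
As a cross-check on the o2net_idle_timer line above, here is a minimal 
Python sketch that computes the time between the "tmr" and "now" stamps. 
It assumes each stamp is printed as seconds, a dot, and an unpadded 
microsecond count (so ".9290" means 9290 microseconds, not 0.9290 s); 
that reading is an assumption about the log format, and the helper names 
are made up for illustration.

```python
import re

# The o2net_idle_timer message from the dmesg output, joined onto one line.
LINE = (
    "(0,0):o2net_idle_timer:1434 here are some times that might help debug "
    "the situation: (tmr 1211375306.9290 now 1211375316.11998 dr "
    "1211375306.9272 adv 1211375306.9313:1211375306.9314 func (300d6acb:502) "
    "1211374816.37752:1211374816.37756)"
)


def to_usec(stamp):
    """Convert 'seconds.microseconds' to integer microseconds.

    Assumption: the part after the dot is an unpadded microsecond count,
    so it must not be parsed as a decimal fraction.
    """
    sec, usec = stamp.split(".")
    return int(sec) * 1_000_000 + int(usec)


def idle_seconds(line):
    """Return the seconds elapsed between the 'tmr' and 'now' stamps."""
    tmr = re.search(r"\btmr (\d+\.\d+)", line).group(1)
    now = re.search(r"\bnow (\d+\.\d+)", line).group(1)
    return (to_usec(now) - to_usec(tmr)) / 1_000_000


print(idle_seconds(LINE))
```

Under that assumption the gap comes out at just over 10 seconds, which 
agrees with the "idle for 10.0 seconds" message printed right before it.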

It seems to me that something is deadlocking in the DLM resource 
manager. I used debugfs.ocfs2 to list the active locks, and many of 
them have "Blocking Mode" and/or "Requested Mode" marked as "Invalid". 
Could that be one of the problems? Why would a DLM lock have an Invalid 
Blocking Mode? Is it just a pre-allocated empty lock?

System configuration:
--> o2cb:
# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=ocfs2

# TIMEOUT - 600s
O2CB_HEARTBEAT_THRESHOLD=301
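
The "600s" in the comment above follows from the threshold if the disk 
heartbeat runs once every 2 seconds and a node is declared dead after 
(threshold - 1) missed iterations. A small Python sketch of that 
arithmetic; the 2-second interval is an assumption stated here rather 
than something shown in the config, and the function names are 
illustrative only.

```python
# Assumed seconds between disk heartbeat iterations (not read from the config).
HEARTBEAT_INTERVAL_S = 2


def heartbeat_timeout_seconds(threshold):
    """Seconds of missed disk heartbeats before a node is considered dead."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_S


def threshold_for_timeout(timeout_s):
    """Inverse: the O2CB_HEARTBEAT_THRESHOLD value for a desired timeout."""
    return timeout_s // HEARTBEAT_INTERVAL_S + 1


print(heartbeat_timeout_seconds(301))
print(threshold_for_timeout(600))
```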

--> cluster.conf:
node:
         ip_port = 7777
         ip_address = 192.168.0.10
         number = 0
         name = suse02
         cluster = ocfs2

node:
         ip_port = 7777
         ip_address = 192.168.0.1
         number = 1
         name = suse01
         cluster = ocfs2

cluster:
         node_count = 2
         name = ocfs2
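
To rule out a simple mismatch in the file above, here is a rough Python 
sketch that parses the stanzas and checks that node_count matches the 
number of node entries and that node numbers are unique. The parser is a 
simplified reading of the layout shown here (a "type:" header followed 
by "key = value" lines), not the o2cb tools' own parser, and the 
function names are hypothetical.

```python
CLUSTER_CONF = """\
node:
         ip_port = 7777
         ip_address = 192.168.0.10
         number = 0
         name = suse02
         cluster = ocfs2

node:
         ip_port = 7777
         ip_address = 192.168.0.1
         number = 1
         name = suse01
         cluster = ocfs2

cluster:
         node_count = 2
         name = ocfs2
"""


def parse_stanzas(text):
    """Split the text into dicts, one per 'node:' or 'cluster:' stanza."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            # A stanza header such as 'node:' starts a new dict.
            current = {"_type": line[:-1]}
            stanzas.append(current)
        elif "=" in line and current is not None:
            key, value = (part.strip() for part in line.split("=", 1))
            current[key] = value
    return stanzas


def check(stanzas):
    """Return a list of human-readable problems (empty if none found)."""
    nodes = [s for s in stanzas if s["_type"] == "node"]
    clusters = [s for s in stanzas if s["_type"] == "cluster"]
    problems = []
    for cluster in clusters:
        members = [n for n in nodes if n.get("cluster") == cluster["name"]]
        if int(cluster["node_count"]) != len(members):
            problems.append(
                "cluster %s: node_count=%s but %d node stanzas"
                % (cluster["name"], cluster["node_count"], len(members))
            )
    numbers = [n["number"] for n in nodes]
    if len(set(numbers)) != len(numbers):
        problems.append("duplicate node numbers")
    return problems


print(check(parse_stanzas(CLUSTER_CONF)))
```

For the configuration as posted, the check finds no problems.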

FreeBSD setup:
* Default NFS client configuration.
* NFS locking daemon disabled.
* NFS mounts are hard (not soft-mounted).

SuSE package versions:
ocfs2-tools-1.2.3-0.7
ocfs2console-1.2.3-0.7
nfs-utils-1.0.7-36.26
nfsidmap-0.12-16.17

OCFS2 kernel driver version:
OCFS2 1.2.8-SLES-r3074 Fri Jan  4 23:47:26 UTC 2008 (build sles)
OCFS2 Node Manager 1.2.8-SLES-r3074 Fri Jan  4 23:47:26 UTC 2008 (build 
sles)
OCFS2 DLM 1.2.8-SLES-r3074 Fri Jan  4 23:47:26 UTC 2008 (build sles)
OCFS2 DLMFS 1.2.8-SLES-r3074 Fri Jan  4 23:47:26 UTC 2008 (build sles)

Any tips on what might be going on?

Thanks for any help.

Regards,
-- 
   .:''''':.
.:'        `     Sérgio Surkamp | Gerente de Rede
::    ........   sergio at gruposinternet.com.br
`:.        .:'
   `:,   ,.:'     *Grupos Internet S.A.*
     `: :'        R. Laulo Linhares, 2123 Torre B - Sala 201
      : :         Trindade - Florianópolis - SC
      :.'
      ::          +55 48 3234-9109
      :
      '           http://www.gruposinternet.com.br



