[Ocfs2-users] OCFS2 + NFS setup deadlocking
Sérgio Surkamp
sergio at gruposinternet.com.br
Wed May 21 11:10:34 PDT 2008
Hi all,
We set up an OCFS2 cluster on our storage and exported it over NFS to
other network servers. It was working fine, but suddenly it locked up
all NFS clients, and the only way to recover was to reboot all servers
(including the OCFS2 servers). It seems that under heavy load the
OCFS2+NFS combination deadlocks.
Setup:
* 1 Dell Storage AX100
* 2 Dell servers, running SuSE 10-sp1 x86_64 and attached to storage
using fibre channel qlogic HBA
* 4 Dell servers, running FreeBSD and accessing the shared storage by NFS
The FreeBSD servers are connected in two groups: two of them mount from
the suse #1 nfsd and two from the suse #2 nfsd, to split the load. The
network interfaces are connected to a gigabit network with a dedicated
switch for NFS and OCFS2 (heartbeat/sync messages) traffic.
Without NFS, everything seems to work fine. We stressed the filesystem
with 'iozone' many times on both servers at the same time, and it worked
as expected.
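For reference, the stress runs were along these lines (the file size, record size, and mount point below are illustrative, not the exact parameters we used):

```shell
# Run concurrently on both SuSE nodes against the shared OCFS2 mount.
# -a: full automatic mode; -s/-r: file and record size; -f: test file path.
iozone -a -s 1g -r 128k -f /mnt/ocfs2/iozone.tmp
```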
During deadlock recovery, we rebooted the slave OCFS2 server (suse01)
first and checked 'dmesg' on the master:
o2net: connection to node suse01 (num 1) at 192.168.0.1:7777 has been
idle for 10.0 seconds, shutting it down.
(0,0):o2net_idle_timer:1434 here are some times that might help debug
the situation: (tmr 1211375306.9290 now 1211375316.11998 dr
1211375306.9272 adv 1211375306.9313:1211375306.9314 func (300d6acb:502)
1211374816.37752:1211374816.37756)
o2net: no longer connected to node suse01 (num 1) at 192.168.0.1:7777
(15331,4):dlm_get_lock_resource:932
F59B45831EEA41F384BADE6C4B7A932B:M000000000000000000001ba9d5b7e0: at
least one node (1) to recover before lock mastery can begin
(5313,4):dlm_get_lock_resource:932
F59B45831EEA41F384BADE6C4B7A932B:$RECOVERY: at least one node (1)
to recover before lock mastery can begin
(5313,4):dlm_get_lock_resource:966 F59B45831EEA41F384BADE6C4B7A932B:
recovery map is not empty, but must master $RECOVERY lock now
(15331,4):ocfs2_replay_journal:1173 Recovering node 1 from slot 1 on
device (8,17)
kjournald starting. Commit interval 5 seconds
o2net: accepted connection from node suse01 (num 1) at 192.168.0.1:7777
ocfs2_dlm: Node 1 joins domain F59B45831EEA41F384BADE6C4B7A932B
ocfs2_dlm: Nodes in domain ("F59B45831EEA41F384BADE6C4B7A932B"): 0 1
It seems to me that something is deadlocking in the DLM resource
manager. I used debugfs.ocfs2 to list the active locks, and many of
them have "Blocking Mode" and/or "Requested Mode" marked as "Invalid".
Could that be part of the problem? Why would a DLM lock have an Invalid
Blocking Mode? Is it just a pre-allocated empty lock?
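For completeness, this is roughly how the locks were inspected (the device path is an example, not the literal one from our setup):

```shell
# debugfs must be mounted to read live lock state.
mount -t debugfs debugfs /sys/kernel/debug
# -R runs a single debugfs.ocfs2 command non-interactively;
# fs_locks lists the active cluster locks with their modes.
debugfs.ocfs2 -R "fs_locks" /dev/sda1
```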
System configuration:
--> o2cb:
# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true
# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=ocfs2
# TIMEOUT - 600s
O2CB_HEARTBEAT_THRESHOLD=301
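If I understand the o2cb documentation correctly, the disk heartbeat timeout is (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, since the heartbeat fires every 2 seconds; that is how 301 maps to the 600 s noted above:

```shell
# (threshold - 1) heartbeat intervals * 2 seconds per interval
O2CB_HEARTBEAT_THRESHOLD=301
echo "timeout: $(( (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 ))s"   # prints "timeout: 600s"
```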
--> cluster.conf:
node:
ip_port = 7777
ip_address = 192.168.0.10
number = 0
name = suse02
cluster = ocfs2
node:
ip_port = 7777
ip_address = 192.168.0.1
number = 1
name = suse01
cluster = ocfs2
cluster:
node_count = 2
name = ocfs2
FreeBSD setup:
* Default NFS Client configuration.
* nfslocking daemon disabled.
* NFS not soft mounted.
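Concretely, the client mounts look roughly like this (hostname and paths are examples):

```shell
# Hard TCP mount, interruptible; rpc.lockd/rpc.statd are not running
# on the client, so no NLM locking traffic goes over the wire.
mount_nfs -o tcp,hard,intr suse01:/export/ocfs2 /mnt/storage
```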
SuSE package versions:
ocfs2-tools-1.2.3-0.7
ocfs2console-1.2.3-0.7
nfs-utils-1.0.7-36.26
nfsidmap-0.12-16.17
OCFS2 kernel driver version:
OCFS2 1.2.8-SLES-r3074 Fri Jan 4 23:47:26 UTC 2008 (build sles)
OCFS2 Node Manager 1.2.8-SLES-r3074 Fri Jan 4 23:47:26 UTC 2008 (build
sles)
OCFS2 DLM 1.2.8-SLES-r3074 Fri Jan 4 23:47:26 UTC 2008 (build sles)
OCFS2 DLMFS 1.2.8-SLES-r3074 Fri Jan 4 23:47:26 UTC 2008 (build sles)
Any tips on what is going on?
Thanks for any help.
Regards,
--
.:''''':.
.:' ` Sérgio Surkamp | Gerente de Rede
:: ........ sergio at gruposinternet.com.br
`:. .:'
`:, ,.:' *Grupos Internet S.A.*
`: :' R. Laulo Linhares, 2123 Torre B - Sala 201
: : Trindade - Florianópolis - SC
:.'
:: +55 48 3234-9109
:
' http://www.gruposinternet.com.br