[Ocfs2-users] error on a 12 PCs testbed

Wed Jun 28 06:15:26 CDT 2006

[reposted as I omitted the attachment]

dear all,
I'm evaluating ocfs2 as the fs of choice for my setup in a university lab.
all the PCs access an Infortrend A16F-R1211 storage system via an FC SAN
based on QLogic boards.

today was the first day of heavy use on our cluster, and we hit a problem.
I'll try to tell the story as far as I have been able to reconstruct it.

main question: how can I recover from this situation? I can't umount
the partition because umount freezes...
I'm now trying to shut down the involved hosts and then run fsck.ocfs2.

all PCs mount 4 ocfs2 partitions:
LABEL=disk1             /storage/disk1          ocfs2    noauto,_netdev  0 0
LABEL=disk2             /storage/disk2          ocfs2    noauto,_netdev  0 0
LABEL=disk3             /storage/disk3          ocfs2    noauto,_netdev  0 0
LABEL=disk4             /storage/disk4          ocfs2    noauto,_netdev  0 0

theboss is the front-end PC, rack[1-8] are calculation PCs

rack1 and rack2 were running jobs that did heavy I/O on big data
sets to/from /storage/disk1.

during the day, I experienced a slowdown on theboss, probably related to
the heavy I/O done by rack[12].

rack3...rack8 did not use the ocfs2 partitions.

then we had some problems on rack9 and/or rack10, and both of them were
subsequently reset; this may be unrelated to ocfs2.

I collected data from a bunch of PCs with the command:
egrep 'Jun 26.*(dlm|o2|ocfs).*' /var/log/messages
the output is attached, together with the cluster.conf
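
for anyone who wants to apply the same filter to the attached logs without egrep, here is a minimal Python sketch of it. The regular expression is copied from the command above; the function name filter_ocfs2_lines is only an illustrative choice.

```python
import re

# Same pattern as the egrep command: lines dated Jun 26 that
# mention dlm, o2 or ocfs anywhere after the date.
PATTERN = re.compile(r"Jun 26.*(dlm|o2|ocfs).*")


def filter_ocfs2_lines(lines):
    """Return the lines that the egrep command would have printed.

    `lines` can be any iterable of strings, e.g. an open file object.
    The name of this helper is hypothetical, not part of any tool.
    """
    return [line for line in lines if PATTERN.search(line)]
```

it can be called on an open file, e.g. filter_ocfs2_lines(open("/var/log/messages")), assuming the log is readable by the user running it.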

particularly informative is bugreport/ocfs_theboss.ape.log:

Jun 26 18:20:19 theboss kernel: (17499,2):dlm_send_remote_convert_request:393 ERROR: status = -107
Jun 26 18:20:19 theboss kernel: (17499,2):dlm_wait_for_node_death:285 5AFE69831DFC414A90CEA2B8718644C4: waiting 5000ms for notification of death of node 10
Jun 26 18:20:23 theboss kernel: (2458,0):o2net_set_nn_state:415 accepted connection from node rack9 (num 10) at 10.0.2.29:7777
Jun 26 18:20:24 theboss kernel: (17499,2):dlm_send_remote_convert_request:393 ERROR: status = -92
Jun 26 18:20:24 theboss kernel: (17499,2):dlm_wait_for_node_death:285 5AFE69831DFC414A90CEA2B8718644C4: waiting 5000ms for notification of death of node 10
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:377 Nodes in my domain ("5AFE69831DFC414A90CEA2B8718644C4"):
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381  node 1
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381  node 2
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381  node 3
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381  node 4
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381  node 5
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381  node 6
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381  node 7
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381  node 8
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381  node 9
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381  node 10
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381  node 11
Jun 26 18:20:27 theboss kernel: (2458,0):dlm_assert_master_handler:1599 ERROR: assert_master from 9, but current owner is 10! (S000000000000000000000200000000)
Jun 26 18:20:27 theboss kernel: (2458,0):dlm_assert_master_handler:1691 ERROR: Bad message received from another node.  Dumping state and killing the other node now!  This node is OK and can continue.
Jun 26 18:20:27 theboss kernel: (2458,0):dlm_dump_lock_resources:125 struct dlm_ctxt: 5AFE69831DFC414A90CEA2B8718644C4, node=3, key=821947029
Jun 26 18:20:27 theboss kernel: (2458,0):dlm_print_one_lock_resource:52 lockres: M00000000000000379e03e2b2592ded, owner=3, state=0
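
to make the "status = -N" values in the excerpt easier to read, here is a minimal Python sketch that pulls them out of the log text and maps them to errno names. It assumes the statuses are negated Linux errno values, which I have not confirmed in the ocfs2 source; the two names in the table are the standard Linux ones for 92 and 107, and the helper name decode_statuses is hypothetical.

```python
import re

# Standard Linux errno numbers for the two values seen in the excerpt.
# Assumption: the dlm "status" field is a negated errno.
LINUX_ERRNO = {
    92: ("ENOPROTOOPT", "Protocol not available"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
}

# Matches e.g. "dlm_send_remote_convert_request:393 ERROR: status = -107"
STATUS_RE = re.compile(r"(\w+):\d+ ERROR: status = -(\d+)")


def decode_statuses(log_text):
    """Return (function, errno_number, errno_name) per 'status = -N' entry.

    Values not present in LINUX_ERRNO are reported as "unknown".
    """
    results = []
    for func, num in STATUS_RE.findall(log_text):
        name, _description = LINUX_ERRNO.get(int(num), ("unknown", ""))
        results.append((func, int(num), name))
    return results
```

if the assumption holds, the first error (-107, ENOTCONN) would be consistent with theboss having lost its connection to node 10 at that moment, but that is only my reading of the log.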

a tgz with logs and cluster.conf is at
http://apegate.roma1.infn.it/~rossetti/apeNEXT/bugreport.tgz
-- 
davide.rossetti at gmail.com ICQ:290677265 SKYPE:d.rossetti