[Ocfs2-users] error on a 12 PCs testbed
davide rossetti
davide.rossetti at gmail.com
Wed Jun 28 06:15:26 CDT 2006
[reposted as I omitted the attachment]
dear all,
I'm evaluating ocfs2 as the fs of choice for my setup at a university lab.
all the PCs access an Infortrend A16F-R1211 storage system via an FC SAN
based on QLogic boards.
today was the first day of heavy use on our cluster, and we ran into a problem.
below is the story as far as I have been able to reconstruct it.
main question: how can I recover from this situation? I can't umount
the partition because umount freezes...
my current plan is to shut down the hosts involved and then run fsck.ocfs2.
all PCs mount 4 ocfs2 partitions:
LABEL=disk1 /storage/disk1 ocfs2 noauto,_netdev 0 0
LABEL=disk2 /storage/disk2 ocfs2 noauto,_netdev 0 0
LABEL=disk3 /storage/disk3 ocfs2 noauto,_netdev 0 0
LABEL=disk4 /storage/disk4 ocfs2 noauto,_netdev 0 0
theboss is the front-end PC, rack[1-8] are calculation PCs
rack1 and rack2 were running jobs which required I/O of big data
sets to/from /storage/disk1.
during the day, I experienced a slowdown of theboss, probably related to
the heavy I/O done by rack[12]
rack3...rack8 did not use the ocfs2 partitions
then we had some problems on rack9 and/or rack10, and both of them were
subsequently reset; this may be unrelated to ocfs2
I collected data from a bunch of PCs via the command:
egrep 'Jun 26.*(dlm|o2|ocfs).*' /var/log/messages
the results are attached, together with the cluster.conf
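for anyone who wants to run the same filter on several hosts from a script, here is a minimal Python sketch of that egrep. the pattern is copied from the command above; the function names filter_lines and filter_file are only illustrative, and reading /var/log/messages usually needs root.

```python
import re

# Same pattern as the egrep command: a "Jun 26" line that mentions
# dlm, o2 (o2net, o2hb, ...) or ocfs somewhere after the date.
PATTERN = re.compile(r"Jun 26.*(dlm|o2|ocfs).*")


def filter_lines(lines):
    """Return the lines that the egrep command would have printed."""
    return [line for line in lines if PATTERN.search(line)]


def filter_file(path="/var/log/messages"):
    """Apply the filter to a syslog file (hypothetical helper)."""
    # errors="replace" avoids a crash on stray non-UTF-8 bytes in the log.
    with open(path, errors="replace") as fh:
        return filter_lines(fh)
```

calling filter_file() on each node and saving the returned lines should give the same content as the per-host files in the attachment.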
particularly informative is bugreport/ocfs_theboss.ape.log
Jun 26 18:20:19 theboss kernel: (17499,2):dlm_send_remote_convert_request:393 ERROR: status = -107
Jun 26 18:20:19 theboss kernel: (17499,2):dlm_wait_for_node_death:285 5AFE69831DFC414A90CEA2B8718644C4: waiting 5000ms for notification of death of node 10
Jun 26 18:20:23 theboss kernel: (2458,0):o2net_set_nn_state:415 accepted connection from node rack9 (num 10) at 10.0.2.29:7777
Jun 26 18:20:24 theboss kernel: (17499,2):dlm_send_remote_convert_request:393 ERROR: status = -92
Jun 26 18:20:24 theboss kernel: (17499,2):dlm_wait_for_node_death:285 5AFE69831DFC414A90CEA2B8718644C4: waiting 5000ms for notification of death of node 10
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:377 Nodes in my domain ("5AFE69831DFC414A90CEA2B8718644C4"):
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 1
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 2
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 3
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 4
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 5
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 6
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 7
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 8
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 9
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 10
Jun 26 18:20:27 theboss kernel: (2458,0):__dlm_print_nodes:381 node 11
Jun 26 18:20:27 theboss kernel: (2458,0):dlm_assert_master_handler:1599 ERROR: assert_master from 9, but current owner is 10! (S000000000000000000000200000000)
Jun 26 18:20:27 theboss kernel: (2458,0):dlm_assert_master_handler:1691 ERROR: Bad message received from another node. Dumping state and killing the other node now! This node is OK and can continue.
Jun 26 18:20:27 theboss kernel: (2458,0):dlm_dump_lock_resources:125 struct dlm_ctxt: 5AFE69831DFC414A90CEA2B8718644C4, node=3, key=821947029
Jun 26 18:20:27 theboss kernel: (2458,0):dlm_print_one_lock_resource:52 lockres: M00000000000000379e03e2b2592ded, owner=3, state=0
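
to make excerpts like the one above easier to read, here is a rough Python sketch that splits one log entry into fields and names the status codes. it rests on several assumptions: each entry has been joined back into a single line, the two numbers in parentheses are process id and CPU, and the negative status values are plain Linux errno codes (92 and 107 are taken from the generic Linux errno table). parse_entry and the field names are made up for illustration.

```python
import re

# Linux errno numbers (asm-generic/errno.h) for the two status values
# that appear in the log. Assumption: "status = -N" is a negated errno.
LINUX_ERRNO = {
    92: ("ENOPROTOOPT", "Protocol not available"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
}

# One syslog entry: timestamp, host, then "(pid,cpu):function:line message".
# Assumption: the pair in parentheses is (process id, CPU number).
ENTRY = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) kernel:\s*"
    r"\((?P<pid>\d+),(?P<cpu>\d+)\):(?P<func>\w+):(?P<line>\d+)\s*(?P<msg>.*)$"
)
STATUS = re.compile(r"status = -(\d+)")


def parse_entry(text):
    """Return a dict of fields for one joined log line, or None if no match."""
    match = ENTRY.match(text.strip())
    if match is None:
        return None
    entry = match.groupdict()
    status = STATUS.search(entry["msg"])
    if status:
        code = int(status.group(1))
        # Unknown codes are kept as their number rather than guessed.
        name, text_desc = LINUX_ERRNO.get(code, (str(code), "unknown"))
        entry["errno"] = name
        entry["errno_text"] = text_desc
    return entry
```

under those assumptions the two convert-request errors read as ENOTCONN followed by ENOPROTOOPT, both while theboss was waiting for notification of the death of node 10.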
a tgz with logs and cluster.conf is at
http://apegate.roma1.infn.it/~rossetti/apeNEXT/bugreport.tgz
--
davide.rossetti at gmail.com ICQ:290677265 SKYPE:d.rossetti