[Ocfs2-users] RE: Access to OCFS2 volume paused when a node crashes

Tue Oct 9 07:12:18 PDT 2007

Many thanks Marcos.

Kind regards

Paul Fretter

From: Marcos E. Matsunaga [mailto:Marcos.Matsunaga at oracle.com] 
Sent: 09 October 2007 13:31
To: paul fretter (TOC)
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] RE: Access to OCFS2 volume paused when a node crashes

You may want to try to increase the network timeout. You will have to do it on all nodes.

See the FAQ at http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT with special attention to items #104 and #105.
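
As a rough sketch of what increasing the network timeout can look like: on OCFS2 releases that expose the o2net timeouts as O2CB_* variables, they are normally kept in /etc/sysconfig/o2cb. The file path and the values below are illustrative assumptions, not settings taken from this thread; the 10.0 second idle message in the log suggests the idle timeout currently in effect is 10000 ms. Check the FAQ for the values suited to your release.

```shell
# Sketch of the network-timeout entries in /etc/sysconfig/o2cb.
# All values are assumed for illustration only; use the ones the FAQ
# gives for your OCFS2 release.

# Time (ms) a connection may stay idle before o2net shuts it down.
# The log above reports "idle for 10.0 seconds", i.e. 10000 ms in effect.
O2CB_IDLE_TIMEOUT_MS=30000

# Delay (ms) before a keepalive packet is sent on an idle connection.
# It should stay well below the idle timeout.
O2CB_KEEPALIVE_DELAY_MS=2000

# Minimum delay (ms) between connection attempts.
O2CB_RECONNECT_DELAY_MS=2000
```

The same values would have to be set on every node, with the cluster stack restarted on each one, since the nodes need to agree on these timeouts.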

Regards,

Marcos Eduardo Matsunaga

Oracle USA
Linux Engineering

paul fretter (TOC) wrote: 

To clarify,

The host "node1" is the OCFS node 0 in the config file.

The log entries are from another system in the cluster.

Kind regards
Paul

	-----Original Message-----
	From: paul fretter (TOC)
	Sent: 09 October 2007 11:41
	To: ocfs2-users at oss.oracle.com
	Subject: Access to OCFS2 volume paused when a node crashes

	There is a node (node1) on our cluster that for some reason hangs every
	now and again, but it seems that when it happens it also pauses access
	to the OCFS2 volume for the other nodes.

	We are running the latest version of OCFS2 and the tools, on RHEL4
	(x86_64) with kernel 2.6.9-42.  All nodes are connected by
	fibrechannel to a common LUN for data sharing.

	I guess there may be something I can do with configuring timeouts
	etc(?), but I thought I'd check with this list first.  Here is the
	relevant info from /var/log/messages:

	Oct  9 11:24:41 jic55124 kernel: o2net: connection to node node1 (num
	0) at 10.10.10.1:7777 has been idle for 10.0 seconds, shutting it
	down.
	Oct  9 11:24:41 jic55124 kernel: (0,1):o2net_idle_timer:1418 here are
	some times that might help debug the situation: (tmr 1191925471.993435
	now 1191925481.994292 dr 1191925471.993425 adv
	1191925471.993436:1191925471.993437 func (98e2d068:507)
	1191924562.14841:1191924562.14844)
	Oct  9 11:24:41 jic55124 kernel: o2net: no longer connected to node
	node1 (num 0) at 10.10.10.1:7777
	Oct  9 11:24:41 jic55124 kernel: (727,3):dlm_do_master_request:1418
	ERROR: link to 0 went down!
	Oct  9 11:24:41 jic55124 kernel: (727,3):dlm_get_lock_resource:995
	ERROR: status = -112
	[root at jic55124 ~]# tail /var/log/messages
	Oct  9 11:28:48 jic55124 kernel: (856,2):dlm_get_lock_resource:995
	ERROR: status = -107
	Oct  9 11:28:48 jic55124 kernel: (856,2):dlm_do_master_request:1418
	ERROR: link to 0 went down!
	Oct  9 11:28:48 jic55124 kernel: (856,2):dlm_get_lock_resource:995
	ERROR: status = -107
	Oct  9 11:33:42 jic55124 kernel: (865,0):dlm_get_lock_resource:921
	6B13C23CB44C4D888150894FE4D35D4E:M000000000000000000007571339968: at
	least one node (0) torecover before lock mastery can begin
	Oct  9 11:33:42 jic55124 kernel: (3765,1):ocfs2_dlm_eviction_cb:119
	device (8,80): dlm has evicted node 0
	Oct  9 11:33:43 jic55124 kernel: (865,0):dlm_get_lock_resource:976
	6B13C23CB44C4D888150894FE4D35D4E:M000000000000000000007571339968: at
	least one node (0) torecover before lock mastery can begin
	Oct  9 11:33:46 jic55124 kernel: (727,3):dlm_restart_lock_mastery:1301
	ERROR: node down! 0
	Oct  9 11:33:46 jic55124 kernel: (727,3):dlm_wait_for_lock_mastery:1118
	ERROR: status = -11
	Oct  9 11:33:48 jic55124 kernel: (865,1):ocfs2_replay_journal:1167
	Recovering node 0 from slot 5 on device (8,80)
	Oct  9 11:33:50 jic55124 kernel: kjournald starting.  Commit interval 5
	seconds

	Many thanks
	Paul Fretter

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users at oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
