[Ocfs2-users] Production node reboots

neil campbell campbell.neil at hotmail.com
Sun Nov 21 18:14:10 PST 2010


Hi all,

I have started experiencing node reboots in our 16 node cluster.
This is Red Hat 4 update 8 (kernel 2.6.89.0.26.ELsmp).

cat /proc/fs/ocfs2/version

  OCFS2 1.2.9 Mon Jun 21 20:03:07 PDT 2010 (build 5e8325ec7f66b5189c65c7a8710fe8cb)

/etc/init.d/o2cb status

  Module "configfs": Loaded
  Filesystem "configfs": Mounted
  Module "ocfs2_nodemanager": Loaded
  Module "ocfs2_dlm": Loaded
  Module "ocfs2_dlmfs": Loaded
  Filesystem "ocfs2_dlmfs": Mounted
  Checking O2CB cluster UATocfs2: Online
    Heartbeat dead threshold: 61
    Network idle timeout: 30000
    Network keepalive delay: 2000
    Network reconnect delay: 2000
  Checking O2CB heartbeat: Active

Most recent reboot happened early yesterday morning, and resulted in 5 nodes rebooting over the space of just over an hour.
/var/log/messages from the first node to reboot shows the following

  Nov 21 02:05:09 dcapp08 kernel: o2net: connection to node dcapp02 (num 1) at 10.255.255.2:10007 has been idle for 30.0 seconds, shutting it down.
  Nov 21 02:05:09 dcapp08 kernel: (0,0):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265479.449900 now 1290265509.449647 dr 1290265479.426428 adv 1290265479.44
  9905:1290265479.449905 func (d399da91:500) 1290265475.684306:1290265475.700022)
  Nov 21 02:05:09 dcapp08 kernel: o2net: no longer connected to node dcapp02 (num 1) at 10.255.255.2:10007
  Nov 21 02:05:09 dcapp08 kernel: (21573,1):dlm_do_master_request:1418 ERROR: link to 1 went down!
  Nov 21 02:05:09 dcapp08 kernel: (21573,1):dlm_get_lock_resource:995 ERROR: status = -112
  Nov 21 02:05:16 dcapp08 kernel: o2net: connected to node dc-c02b02-app02 (num 1) at 10.255.255.2:10007
  Nov 21 02:03:02 dcapp08 syslogd 1.4.1: restart.

On one of the other nodes, there is additional information in messages logfile

Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcdb03 (num 10) at 10.255.255.11:10007 has been idle for 30.0 seconds, shutting it down.
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265477.576542 now 1290265507.576428 dr 1290265479.576562 adv 129026547
7.576547:1290265477.576548 func (d399da91:502) 1290260960.121797:1290260960.121799)
Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcapp03 (num 2) at 10.255.255.3:10007 has been idle for 30.0 seconds, shutting it down.
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265477.606847 now 1290265507.607428 dr 1290265507.607353 adv 129026547
7.606850:1290265477.606850 func (d399da91:504) 1290265452.506254:1290265452.506310)
Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcapp04 (num 3) at 10.255.255.4:10007 has been idle for 30.0 seconds, shutting it down.
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265477.987792 now 1290265507.988438 dr 1290265498.890055 adv 129026547
7.987792:1290265477.987793 func (d399da91:500) 1290265475.987731:1290265475.987739)
Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcapp01 (num 0) at 10.255.255.1:10007 has been idle for 30.0 seconds, shutting it down.
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265478.881795 now 1290265508.881459 dr 1290265480.889380 adv 129026547
8.889657:1290265478.889659 func (d399da91:500) 1290265478.881795:1290265478.889640)
Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcapp08 (num 7) at 10.255.255.8:10007 has been idle for 30.0 seconds, shutting it down.
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265479.425622 now 1290265509.425473 dr 1290265481.449378 adv 129026547
9.425678:1290265479.425679 func (d399da91:500) 1290265479.425623:1290265479.425673)
Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcdb02 (num 9) at 10.255.255.10:10007 has been idle for 30.0 seconds, shutting it down.
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265479.439787 now 1290265509.439473 dr 1290265481.439720 adv 129026547
9.439838:1290265479.439839 func (d399da91:500) 1290265479.439788:1290265479.439834)
Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcapp07 (num 6) at 10.255.255.7:10007 has been idle for 30.0 seconds, shutting it down.
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265479.469144 now 1290265509.469475 dr 1290265509.450361 adv 129026547
9.473102:1290265479.473104 func (d399da91:502) 1290265479.469146:1290265479.473087)
Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcb07 (num 14) at 10.255.255.15:10007 has been idle for 30.0 seconds, shutting it down.
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265479.473112 now 1290265509.473475 dr 1290265485.575784 adv 129026547
9.480961:1290265479.480963 func (d399da91:500) 1290265479.473113:1290265479.480947)
Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcdb04 (num 11) at 10.255.255.12:10007 has been idle for 30.0 seconds, shutting it down.
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265479.480970 now 1290265509.481477 dr 1290265481.481060 adv 129026547
9.481016:1290265479.481016 func (d399da91:500) 1290265479.480970:1290265479.481012)
Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcdb08 (num 15) at 10.255.255.16:10007 has been idle for 30.0 seconds, shutting it down.
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265479.485633 now 1290265509.485475 dr 1290265481.492977 adv 129026547
9.493484:1290265479.493487 func (d399da91:500) 1290265479.485633:1290265479.493467)
Nov 21 02:05:16 dcapp02 kernel: o2net: connection to node dcdb01 (num 8) at 10.255.255.9:10007 has been idle for 30.0 seconds, shutting it down.
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290265479.493493 now 1290265509.493475 dr 1290265509.484478 adv 129026547
9.493488:1290265479.484610 func (d399da91:502) 1290265479.493495:1290265479.484606)
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_send_tcp_msg:840 ERROR: sendmsg returned -32 instead of 24
Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcdb01 (num 8) at 10.255.255.9:10007
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_send_tcp_msg:840 ERROR: sendmsg returned -32 instead of 24
Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcapp07 (num 6) at 10.255.255.7:10007
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_sendpage:874 ERROR: sendpage of size 24 to node dcdb01 (num 8) at 10.255.255.9:10007 failed with -32
Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcapp08 (num 7) at 10.255.255.8:10007
Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcdb03 (num 10) at 10.255.255.11:10007
Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcapp03 (num 2) at 10.255.255.3:10007
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_send_tcp_msg:840 ERROR: sendmsg returned -32 instead of 24
Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcapp04 (num 3) at 10.255.255.4:10007
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_sendpage:874 ERROR: sendpage of size 24 to node dcapp04 (num 3) at 10.255.255.4:10007 failed with -32
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_sendpage:874 ERROR: sendpage of size 24 to node dcapp01 (num 0) at 10.255.255.1:10007 failed with -32
Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcapp01 (num 0) at 10.255.255.1:10007
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_sendpage:874 ERROR: sendpage of size 24 to node dcapp01 (num 0) at 10.255.255.1:10007 failed with -32
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_sendpage:874 ERROR: sendpage of size 24 to node dcapp08 (num 7) at 10.255.255.8:10007 failed with -32
Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcdb02 (num 9) at 10.255.255.10:10007
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_sendpage:874 ERROR: sendpage of size 24 to node dcapp07 (num 6) at 10.255.255.7:10007 failed with -32
Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcb07 (num 14) at 10.255.255.15:10007
Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcdb04 (num 11) at 10.255.255.12:10007
Nov 21 02:05:16 dcapp02 kernel: o2net: no longer connected to node dcdb08 (num 15) at 10.255.255.16:10007
Nov 21 02:05:16 dcapp02 kernel: (31112,2):o2net_sendpage:874 ERROR: sendpage of size 24 to node dcdb01 (num 8) at 10.255.255.9:10007 failed with -32
Nov 21 02:05:16 dcapp02 kernel: (32175,2):dlm_send_remote_convert_request:398 ERROR: status = -107
Nov 21 02:05:16 dcapp02 kernel: (32175,2):dlm_wait_for_node_death:365 C5C06C9B675D41B99B60DE2EB28CE0F7: waiting 5000ms for notification of death of node 3
Nov 21 02:05:16 dcapp02 kernel: (15445,2):dlm_do_master_request:1418 ERROR: link to 0 went down!
Nov 21 02:05:16 dcapp02 kernel: (15445,2):dlm_get_lock_resource:995 ERROR: status = -112
Nov 21 02:05:16 dcapp02 kernel: (15445,2):dlm_do_master_request:1418 ERROR: link to 2 went down!
Nov 21 02:05:16 dcapp02 kernel: (15445,2):dlm_get_lock_resource:995 ERROR: status = -107
Nov 21 02:05:16 dcapp02 kernel: (15445,2):dlm_do_master_request:1418 ERROR: link to 3 went down!
Nov 21 02:05:16 dcapp02 kernel: (15445,2):dlm_get_lock_resource:995 ERROR: status = -107
Nov 21 02:05:17 dcapp02 kernel: (15445,2):dlm_do_master_request:1418 ERROR: link to 6 went down!
Nov 21 02:05:17 dcapp02 kernel: (15445,2):dlm_get_lock_resource:995 ERROR: status = -107
Nov 21 02:05:17 dcapp02 kernel: (15445,2):dlm_do_master_request:1418 ERROR: link to 7 went down!
Nov 21 02:05:17 dcapp02 kernel: (15445,2):dlm_get_lock_resource:995 ERROR: status = -107
Nov 21 02:05:17 dcapp02 kernel: (15445,2):dlm_do_master_request:1418 ERROR: link to 8 went down!
Nov 21 02:05:17 dcapp02 kernel: (15445,2):dlm_get_lock_resource:995 ERROR: status = -107
Nov 21 02:05:17 dcapp02 kernel: (15445,2):dlm_do_master_request:1418 ERROR: link to 9 went down!
Nov 21 02:05:17 dcapp02 kernel: (15445,2):dlm_get_lock_resource:995 ERROR: status = -107
Nov 21 02:05:17 dcapp02 kernel: o2net: accepted connection from node dcdb03 (num 10) at 10.255.255.11:10007
Nov 21 02:05:17 dcapp02 kernel: o2net: accepted connection from node dcapp03 (num 2) at 10.255.255.3:10007
Nov 21 02:05:17 dcapp02 kernel: o2net: accepted connection from node dcapp04 (num 3) at 10.255.255.4:10007
Nov 21 02:05:17 dcapp02 kernel: o2net: accepted connection from node dcdb02 (num 9) at 10.255.255.10:10007
Nov 21 02:05:17 dcapp02 kernel: o2net: accepted connection from node dcapp08 (num 7) at 10.255.255.8:10007
Nov 21 02:05:17 dcapp02 kernel: o2net: accepted connection from node dcapp07 (num 6) at 10.255.255.7:10007
Nov 21 02:05:17 dcapp02 kernel: o2net: accepted connection from node dcb07 (num 14) at 10.255.255.15:10007
Nov 21 02:05:17 dcapp02 kernel: o2net: accepted connection from node dcdb04 (num 11) at 10.255.255.12:10007
Nov 21 02:05:17 dcapp02 kernel: o2net: accepted connection from node dcdb01 (num 8) at 10.255.255.9:10007
Nov 21 02:05:17 dcapp02 kernel: o2net: accepted connection from node dcdb08 (num 15) at 10.255.255.16:10007
Nov 21 02:05:17 dcapp02 kernel: o2net: connected to node dcapp01 (num 0) at 10.255.255.1:10007
Nov 21 02:58:30 dcapp02 kernel: o2net: connection to node dcapp08 (num 7) at 10.255.255.8:10007 has been idle for 30.0 seconds, shutting it down.
Nov 21 02:58:30 dcapp02 kernel: (0,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1290268680.817422 now 1290268710.817946 dr 1290268680.817412 adv 1290268680.81
7427:1290268680.817428 func (d399da91:500) 1290268678.817606:1290268678.817620)
Nov 21 02:58:30 dcapp02 kernel: o2net: no longer connected to node dcapp08 (num 7) at 10.255.255.8:10007
Nov 21 02:59:00 dcapp02 kernel: (31112,2):o2net_connect_expired:1585 ERROR: no connection established with node 7 after 30.0 seconds, giving up and returning errors.
Nov 21 03:00:00 dcapp02 last message repeated 2 times
Nov 21 03:00:06 dcapp02 kernel: (31112,2):ocfs2_dlm_eviction_cb:119 device (120,1): dlm has evicted node 7
Nov 21 03:00:11 dcapp02 kernel: (15445,0):dlm_restart_lock_mastery:1301 ERROR: node down! 7
Nov 21 03:00:11 dcapp02 kernel: (15445,0):dlm_wait_for_lock_mastery:1118 ERROR: status = -11
Nov 21 03:00:12 dcapp02 kernel: (15445,0):dlm_get_lock_resource:976 C5C06C9B675D41B99B60DE2EB28CE0F7:M0000000000000000a8e4377b2ead66: at least one node (7) torecover before lock mastery can begin
Nov 21 03:00:45 dcapp02 last message repeated 3 times
Nov 21 03:03:29 dcapp02 kernel: o2net: accepted connection from node dcapp08 (num 7) at 10.255.255.8:10007
Nov 21 03:03:30 dcapp02 kernel: (15445,0):dlm_get_lock_resource:976 C5C06C9B675D41B99B60DE2EB28CE0F7:M0000000000000000a8e4377b2ead66: at least one node (7) torecover before lock mastery can begin
Nov 21 03:04:03 dcapp02 last message repeated 3 times
Nov 21 03:04:58 dcapp02 last message repeated 5 times
Nov 21 03:05:05 dcapp02 kernel: o2net: no longer connected to node dcapp08 (num 7) at 10.255.255.8:10007
Nov 21 03:05:09 dcapp02 kernel: (15445,0):dlm_get_lock_resource:976 C5C06C9B675D41B99B60DE2EB28CE0F7:M0000000000000000a8e4377b2ead66: at least one node (7) torecover before lock mastery can begin
Nov 21 03:05:20 dcapp02 kernel: (15445,0):dlm_get_lock_resource:976 C5C06C9B675D41B99B60DE2EB28CE0F7:M0000000000000000a8e4377b2ead66: at least one node (7) torecover before lock mastery can begin
Nov 21 03:05:21 dcapp02 kernel: o2net: accepted connection from node dcapp08 (num 7) at 10.255.255.8:10007
Nov 21 03:05:31 dcapp02 kernel: (15445,0):dlm_get_lock_resource:976 C5C06C9B675D41B99B60DE2EB28CE0F7:M0000000000000000a8e4377b2ead66: at least one node (7) torecover before lock mastery can begin
Nov 21 03:06:04 dcapp02 last message repeated 3 times
Nov 21 03:06:48 dcapp02 last message repeated 4 times
Nov 21 03:06:55 dcapp02 kernel: o2net: no longer connected to node dcapp08 (num 7) at 10.255.255.8:10007
Nov 21 03:06:59 dcapp02 kernel: (15445,0):dlm_get_lock_resource:976 C5C06C9B675D41B99B60DE2EB28CE0F7:M0000000000000000a8e4377b2ead66: at least one node (7) torecover before lock mastery can begin
Nov 21 03:07:32 dcapp02 last message repeated 3 times


I notice that the Idle Timeout is set to 30 seconds, but as we are using bonded interface perhaps I should increase to 60 seconds?

Looking at sar report, I couldn't see anything strange about the network over the reboot period.

I wonder if anyone has any suggestions on troubleshooting this, as this is the second time in 3 days I have had this issue.

Many thanks
Neil
 		 	   		  


More information about the Ocfs2-users mailing list