[Ocfs2-users] frequent production node reboots

Mon Oct 27 08:47:01 PDT 2008

Hi All,

We have been having frequent node reboots in our 4-node production RAC cluster.
We are running 10.2.0.2 Clusterware and a 10.2.0.2 RAC database (using ASM for the database files).
We are using OCFS2 for the cluster files.

cat /proc/fs/ocfs2/version
OCFS2 1.2.8 Tue Jan 22 12:00:30 PST 2008 (build 9c7ae8bb50ef6d8791df2912775adcc5)

/etc/init.d/o2cb status
Module "configfs": Loaded
Filesystem "configfs": Mounted
Module "ocfs2_nodemanager": Loaded
Module "ocfs2_dlm": Loaded
Module "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
  Heartbeat dead threshold: 61
  Network idle timeout: 60000
  Network keepalive delay: 2000
  Network reconnect delay: 2000
Checking O2CB heartbeat: Active
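
For reference, the timeouts above can be converted to seconds. This is only a rough sketch: it assumes the commonly documented OCFS2 rule that the disk heartbeat timeout is (threshold - 1) * 2 seconds, and that the network values are in milliseconds. The function names are made up for illustration.

```python
def heartbeat_timeout_seconds(dead_threshold, interval_seconds=2):
    """Disk heartbeat timeout, assuming the (threshold - 1) * 2 s rule."""
    return (dead_threshold - 1) * interval_seconds


def ms_to_seconds(milliseconds):
    """Convert an o2cb network setting from milliseconds to seconds."""
    return milliseconds / 1000


if __name__ == "__main__":
    # Values copied from the "/etc/init.d/o2cb status" output above.
    print("heartbeat timeout (s):", heartbeat_timeout_seconds(61))
    print("network idle timeout (s):", ms_to_seconds(60000))
    print("keepalive delay (s):", ms_to_seconds(2000))
    print("reconnect delay (s):", ms_to_seconds(2000))
```

Under these assumptions, the disk heartbeat timeout would be 120 seconds and the network idle timeout 60 seconds, which matches the "idle for 60.0 seconds" messages below.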

The most recent reboot happened this morning around 9:20 am.
/var/log/messages on the node that rebooted (db3):
Oct 27 09:23:56 db3 kernel: o2net: connection to node db2 (num 2) at 10.10.100.52:7777 has been idle for 60.0 seconds, shutting it down.
Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1225117376.63515 now 1225117436.56120 dr 1225117378.63161 adv 1225117376.63652:1225117376.63654 func (f6ed8616:500) 1225117376.63517:1225117376.63644)
Oct 27 09:23:56 db3 kernel: o2net: connection to node db0 (num 0) at 10.10.100.50:7777 has been idle for 60.0 seconds, shutting it down.
Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1225117376.279971 now 1225117436.272115 dr 1225117436.272042 adv 1225117376.279977:1225117376.279978 func (f6ed8616:504) 1225117356.455405:1225117356.455412)
Oct 27 09:23:56 db3 kernel: o2net: connection to node db1 (num 1) at 10.10.100.51:7777 has been idle for 60.0 seconds, shutting it down.
Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1225117376.281033 now 1225117436.274121 dr 1225117436.273839 adv 1225117376.281030:1225117376.280783 func (f6ed8616:502) 1225117376.281035:1225117376.280774)
Oct 27 09:23:59 db3 kernel: (13428,2):o2net_send_tcp_msg:841 ERROR: sendmsg returned -32 instead of 24
Oct 27 09:23:59 db3 kernel: o2net: no longer connected to node db1 (num 1) at 10.10.100.51:7777
Oct 27 09:23:59 db3 kernel: o2net: no longer connected to node db2 (num 2) at 10.10.100.52:7777
Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: sendpage of size 24 to node db0 (num 0) at 10.10.100.50:7777 failed with -32
Oct 27 09:23:59 db3 kernel: o2net: no longer connected to node db0 (num 0) at 10.10.100.50:7777
Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: sendpage of size 24 to node db0 (num 0) at 10.10.100.50:7777 failed with -32
Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: sendpage of size 24 to node db1 (num 1) at 10.10.100.51:7777 failed with -32
Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: sendpage of size 24 to node db1 (num 1) at 10.10.100.51:7777 failed with -32
Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418 ERROR: link to 0 went down!
Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR: status = -112
Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418 ERROR: link to 1 went down!
Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR: status = -107
Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418 ERROR: link to 2 went down!
Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR: status = -107
Oct 27 09:28:27 db3 syslogd 1.4.1: restart.
Oct 27 09:28:27 db3 syslog: syslogd startup succeeded
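
The negative numbers in the sendmsg/sendpage and dlm errors above look like negated Linux errno values. A small sketch to decode them follows; the table is hard-coded with the standard Linux values for just these three codes, and the helper name `decode_status` is only illustrative.

```python
# Linux errno values for the three codes that appear in the log above.
LINUX_ERRNO = {
    32: ("EPIPE", "Broken pipe"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    112: ("EHOSTDOWN", "Host is down"),
}


def decode_status(status):
    """Map a negative kernel status such as -32 to its errno name and text."""
    name, text = LINUX_ERRNO[abs(status)]
    return f"{name} ({text})"


if __name__ == "__main__":
    for status in (-32, -112, -107):
        print(status, "=>", decode_status(status))
```

Read this way, all three codes describe a connection that is already gone, so they are probably a consequence of the idle-timeout shutdown at 09:23:56 rather than a separate problem.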

From the logs, it looks like the node that reboots cannot communicate with the other nodes for more than 60 seconds and therefore reboots itself.
Initially the idle timeout was set to 30 seconds, but since we are using a bonded interface, we increased it to 60 seconds. The nodes are still rebooting.
We also upgraded the NIC drivers to the latest version to rule out the driver as the cause.
We are not sure what else could be causing the frequent reboots. Please help us debug the situation, and let me know if more information is needed.
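
In case it helps anyone reading the timer lines, here is a rough sketch that pulls the tmr/now/dr values out of an o2net_idle_timer message and computes how long the connection was idle. It assumes the values are printed as seconds.microseconds without zero padding (the differing digit counts in the log suggest this), that tmr is when the idle timer was last armed, and that now is when it fired. The sample is the db0 line with the mail wrapping joined back together.

```python
import re

# Matches "label sec.usec" pairs such as "tmr 1225117376.279971".
FIELD = re.compile(r"\b(tmr|now|dr)\s+(\d+)\.(\d+)")


def parse_idle_times(line):
    """Return {label: microseconds since the epoch} for tmr, now and dr."""
    return {
        m.group(1): int(m.group(2)) * 1_000_000 + int(m.group(3))
        for m in FIELD.finditer(line)
    }


def idle_seconds(line):
    """Seconds between arming the idle timer (tmr) and its firing (now)."""
    times = parse_idle_times(line)
    return (times["now"] - times["tmr"]) / 1_000_000


SAMPLE = (
    "(tmr 1225117376.279971 now 1225117436.272115 dr 1225117436.272042 "
    "adv 1225117376.279977:1225117376.279978 "
    "func (f6ed8616:504) 1225117356.455405:1225117356.455412)"
)

if __name__ == "__main__":
    print("idle for about", round(idle_seconds(SAMPLE), 3), "seconds")
```

Treating the part after the dot as a decimal fraction instead of a microsecond count would skew values like 1225117376.63515, so the sketch keeps everything in integer microseconds until the final division.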

 Regards,

Saranya Sivakumar
