[Ocfs2-users] frequent production node reboots

Mon Oct 27 09:50:17 PDT 2008

Do upgrade to ocfs2 1.2.9-1. It fixes oss bugzilla#919, a bug that could
be causing the timeouts. The symptom of that bug is o2net spinning at
100% CPU shortly before the timeout/fence.
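
To check whether a node is still on a release older than 1.2.9, the line
from /proc/fs/ocfs2/version can be compared against that version. A minimal
Python sketch, assuming the line always starts with "OCFS2 x.y.z" as in the
output quoted below; the function names are illustrative and not part of
any OCFS2 tool:

```python
import re

FIXED_IN = (1, 2, 9)  # release named above as carrying the bugzilla#919 fix


def parse_ocfs2_version(line):
    """Return (major, minor, patch) from an 'OCFS2 x.y.z ...' version line."""
    match = re.match(r"OCFS2 (\d+)\.(\d+)\.(\d+)", line.strip())
    if match is None:
        raise ValueError("unrecognized version line: %r" % line)
    return tuple(int(part) for part in match.groups())


def needs_upgrade(line, fixed_in=FIXED_IN):
    """True when the running module is older than the fixed release."""
    return parse_ocfs2_version(line) < fixed_in


if __name__ == "__main__":
    # On a live node, read this line from /proc/fs/ocfs2/version instead.
    sample = ("OCFS2 1.2.8 Tue Jan 22 12:00:30 PST 2008 "
              "(build 9c7ae8bb50ef6d8791df2912775adcc5)")
    print(needs_upgrade(sample))
```

Run on every node; a mixed cluster would still have nodes exposed to the bug.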

Saranya Sivakumar wrote:
> Hi All,
>
> We have been having frequent node reboots in our 4 node production RAC 
> cluster.
> We are using 10.2.0.2 Clusterware + 10.2.0.2 RAC Database (using ASM 
> for database files).
> We are using OCFS2 for the cluster files.
>
> cat /proc/fs/ocfs2/version
> OCFS2 1.2.8 Tue Jan 22 12:00:30 PST 2008 (build 
> 9c7ae8bb50ef6d8791df2912775adcc5)
>
> /etc/init.d/o2cb status
> Module "configfs": Loaded
> Filesystem "configfs": Mounted
> Module "ocfs2_nodemanager": Loaded
> Module "ocfs2_dlm": Loaded
> Module "ocfs2_dlmfs": Loaded
> Filesystem "ocfs2_dlmfs": Mounted
> Checking O2CB cluster ocfs2: Online
>   Heartbeat dead threshold: 61
>   Network idle timeout: 60000
>   Network keepalive delay: 2000
>   Network reconnect delay: 2000
> Checking O2CB heartbeat: Active
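
For reference, the settings above can be converted to seconds. This is a
small sketch, assuming the commonly documented relation for the disk
heartbeat, timeout = (threshold - 1) * 2 seconds, and assuming the network
values are in milliseconds (which matches the "idle for 60.0 seconds"
messages in the log); the helper names are made up for illustration:

```python
HEARTBEAT_INTERVAL_S = 2  # assumed o2hb disk heartbeat interval, in seconds


def heartbeat_timeout_s(dead_threshold):
    """Seconds of missed disk heartbeats before a node is declared dead."""
    return (dead_threshold - 1) * HEARTBEAT_INTERVAL_S


def ms_to_s(value_ms):
    """Convert an o2net setting from milliseconds to seconds."""
    return value_ms / 1000.0


if __name__ == "__main__":
    print(heartbeat_timeout_s(61))  # 120
    print(ms_to_s(60000))           # 60.0
```

Under these assumptions the disk heartbeat allows 120 seconds, so the
60 second network idle timeout is the one that trips first.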
>
> Most recent reboot happened this morning around 9:20 am.
> /var/log/messages on the node that rebooted (db3)
> Oct 27 09:23:56 db3 kernel: o2net: connection to node db2 (num 2) at 
> 10.10.100.52:7777 has been idle for 60.0 seconds, shutting it down.
> Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are 
> some times that might help debug the situation: (tmr 1225117376.63515 
> now 1225117436.56120 dr 1225117378.63161 
> adv 1225117376.63652:1225117376.63654 
> func (f6ed8616:500) 1225117376.63517:1225117376.63644)
> Oct 27 09:23:56 db3 kernel: o2net: connection to node db0 (num 0) at 
> 10.10.100.50:7777 has been idle for 60.0 seconds, shutting it down.
> Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are 
> some times that might help debug the situation: (tmr 1225117376.279971 
> now 1225117436.272115 dr 1225117436.272042 
> adv 1225117376.279977:1225117376.279978 
> func (f6ed8616:504) 1225117356.455405:1225117356.455412)
> Oct 27 09:23:56 db3 kernel: o2net: connection to node db1 (num 1) at 
> 10.10.100.51:7777 has been idle for 60.0 seconds, shutting it down.
> Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are 
> some times that might help debug the situation: (tmr 1225117376.281033 
> now 1225117436.274121 dr 1225117436.273839 
> adv 1225117376.281030:1225117376.280783 
> func (f6ed8616:502) 1225117376.281035:1225117376.280774)
> Oct 27 09:23:59 db3 kernel: (13428,2):o2net_send_tcp_msg:841 ERROR: 
> sendmsg returned -32 instead of 24
> Oct 27 09:23:59 db3 kernel: o2net: no longer connected to node db1 
> (num 1) at 10.10.100.51:7777
> Oct 27 09:23:59 db3 kernel: o2net: no longer connected to node db2 
> (num 2) at 10.10.100.52:7777
> Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: 
> sendpage of size 24 to node db0 (num 0) at 10.10.100.50:7777 failed 
> with -32
> Oct 27 09:23:59 db3 kernel: o2net: no longer connected to node db0 
> (num 0) at 10.10.100.50:7777
> Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: 
> sendpage of size 24 to node db0 (num 0) at 10.10.100.50:7777 failed 
> with -32
> Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: 
> sendpage of size 24 to node db1 (num 1) at 10.10.100.51:7777 failed 
> with -32
> Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: 
> sendpage of size 24 to node db1 (num 1) at 10.10.100.51:7777 failed 
> with -32
> Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418 
> ERROR: link to 0 went down!
> Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR: 
> status = -112
> Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418 
> ERROR: link to 1 went down!
> Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR: 
> status = -107
> Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418 
> ERROR: link to 2 went down!
> Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR: 
> status = -107
> Oct 27 09:28:27 db3 syslogd 1.4.1: restart.
> Oct 27 09:28:27 db3 syslog: syslogd startup succeeded
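
The "here are some times" lines can be turned into gaps to see how long
each connection really sat idle. A rough Python sketch, assuming each
timestamp is "<seconds>.<microseconds>" with the microsecond part not
zero-padded, that "tmr" is when the idle timer was last armed, and that
"dr" is when the socket last signaled incoming data. Those readings of the
field names are assumptions, and idle_gaps is a made-up helper:

```python
import re

# One timestamp: seconds, a dot, then an unpadded microsecond count (assumed).
STAMP = r"(\d+)\.(\d+)"
TIMES_RE = re.compile(r"tmr " + STAMP + r" now " + STAMP + r" dr " + STAMP)


def to_usec(sec, usec):
    """Combine the two printed fields into integer microseconds."""
    return int(sec) * 1000000 + int(usec)


def idle_gaps(line):
    """Return (now - tmr, now - dr) in seconds for one o2net_idle_timer line."""
    match = TIMES_RE.search(line)
    if match is None:
        raise ValueError("no tmr/now/dr timestamps found")
    g = match.groups()
    tmr = to_usec(g[0], g[1])
    now = to_usec(g[2], g[3])
    dr = to_usec(g[4], g[5])
    return (now - tmr) / 1e6, (now - dr) / 1e6


if __name__ == "__main__":
    # Values for node db0 from the log above, with the line wrap removed.
    line = ("(tmr 1225117376.279971 now 1225117436.272115 "
            "dr 1225117436.272042 adv 1225117376.279977:1225117376.279978")
    since_timer, since_data = idle_gaps(line)
    print("since timer: %.6f s, since data ready: %.6f s"
          % (since_timer, since_data))
```

If that reading of the fields is right, a now - tmr gap of about 60 seconds
next to a very small now - dr gap would mean data was still arriving on the
socket while the idle timer was not being reset.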
>
> From the logs, it looks like the node that reboots is not able to 
> communicate with the other nodes for more than 60 seconds and thus 
> reboots itself.
> Initially the idle timeout was set to 30 seconds, but since we are 
> using a bonded interface, we increased it to 60 seconds, and the 
> nodes are still rebooting.
> We also upgraded the NIC drivers to the latest version to make sure 
> that the driver is not causing such issues.
> Now we are not sure what else could be causing the frequent reboots. 
> Please help us to debug the situation. Please let me know if more 
> information is needed.
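
The negative numbers in the log (-32, -112, -107) are kernel errno values.
A short Python lookup for just those three, using the Linux errno numbers;
the table is hard-coded because errno numbering differs between platforms,
and describe is an illustrative helper, not an existing tool:

```python
# Linux errno numbers for the codes that appear in the quoted log.
LINUX_ERRNO = {
    32: ("EPIPE", "Broken pipe"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    112: ("EHOSTDOWN", "Host is down"),
}


def describe(status):
    """Map a negative kernel status such as -32 to its errno name and text."""
    name, text = LINUX_ERRNO.get(-status, ("UNKNOWN", "not in this table"))
    return "%d: %s (%s)" % (status, name, text)


if __name__ == "__main__":
    for status in (-32, -112, -107):
        print(describe(status))
```

All three describe a connection that is already gone, so they look like
consequences of the idle timeout shutting the sockets down rather than a
separate failure.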
>
>  
> Regards,
>
> Saranya Sivakumar