<html><head><style type="text/css"><!-- DIV {margin:0px;} --></style></head><body><div style="font-family:times new roman,new york,times,serif;font-size:10pt">Hi,<br><br>We upgraded to ocfs2 1.2.9-1 and that seems to have fixed our reboot issue.<br>But since the upgrade, we are noticing that the load averages on all our RAC nodes are gradually increasing every day (cumulatively).<br>That is, if the load average was 5 yesterday, it goes up to 6 today, and it continues to climb like that.<br><br>We rebooted the database nodes over the weekend, and the load averages returned to normal.<br>But since the reboot, the loads are gradually increasing again.<br><br>We can't see any change in DB activity that would account for the load average increasing like this.<br>And we haven't made any change to the system other than the ocfs2 upgrade.<br>We have been seeing this behavior since the night we upgraded ocfs2 (for the past 2 weeks).<br><div>Please
advise.<br> </div>Regards,<br><div>Saranya Sivakumar<div><br></div><div style="font-family: times new roman,new york,times,serif; font-size: 10pt;"><br><div style="font-family: arial,helvetica,sans-serif; font-size: 13px;"><font size="2" face="Tahoma"><hr size="1"><b><span style="font-weight: bold;">From:</span></b> Sunil Mushran <sunil.mushran@oracle.com><br><b><span style="font-weight: bold;">To:</span></b> Saranya Sivakumar <sarlavk@yahoo.com><br><b><span style="font-weight: bold;">Cc:</span></b> ocfs2-users@oss.oracle.com<br><b><span style="font-weight: bold;">Sent:</span></b> Monday, October 27, 2008 11:50:17 AM<br><b><span style="font-weight: bold;">Subject:</span></b> Re: [Ocfs2-users] frequent production node reboots<br></font><br>
Do upgrade to ocfs2 1.2.9-1. It has a fix for oss bugzilla#919 that could<br>be causing the timeouts. The symptom for that issue is o2net spinning at<br>100% shortly before the timeout/fence.<br><br><br>Saranya Sivakumar wrote:<br>> Hi All,<br>><br>> We have been having frequent node reboots in our 4 node production RAC <br>> cluster.<br>> We are using 10.2.0.2 Clusterware + 10.2.0.2 RAC Database(using ASM <br>> for database files).<br>> We are using OCFS2 for the cluster files.<br>><br>> cat /proc/fs/ocfs2/version<br>> OCFS2 1.2.8 Tue Jan 22 12:00:30 PST 2008 (build <br>> 9c7ae8bb50ef6d8791df2912775adcc5)<br>><br>> /etc/init.d/o2cb status<br>> Module "configfs": Loaded<br>> Filesystem "configfs": Mounted<br>> Module "ocfs2_nodemanager": Loaded<br>> Module "ocfs2_dlm": Loaded<br>> Module "ocfs2_dlmfs": Loaded<br>> Filesystem "ocfs2_dlmfs": Mounted<br>> Checking O2CB cluster ocfs2:
Online<br>> Heartbeat dead threshold: 61<br>> Network idle timeout: 60000<br>> Network keepalive delay: 2000<br>> Network reconnect delay: 2000<br>> Checking O2CB heartbeat: Active<br>><br>> Most recent reboot happened this morning around 9:20 am.<br>> /var/log/messages on the node that rebooted (db3)<br>> Oct 27 09:23:56 db3 kernel: o2net: connection to node db2 (num 2) at <br>> 10.10.100.52:7777 has been idle for 60.0 seconds, shutting it down.<br>> Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are <br>> some times that might help debug the situation: (tmr 1225117376.63515 <br>> now 1225117436.561<br>> 20 dr 1225117378.63161 adv 1225117376.63652:1225117376.63654 func <br>> (f6ed8616:500) 1225117376.63517:1225117376.63644)<br>> Oct 27 09:23:56 db3 kernel: o2net: connection to node db0 (num 0) at <br>> 10.10.100.50:7777 has been idle for 60.0 seconds,
shutting it down.<br>> Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are <br>> some times that might help debug the situation: (tmr 1225117376.279971 <br>> now 1225117436.27<br>> 2115 dr 1225117436.272042 adv 1225117376.279977:1225117376.279978 func <br>> (f6ed8616:504) 1225117356.455405:1225117356.455412)<br>> Oct 27 09:23:56 db3 kernel: o2net: connection to node db1 (num 1) at <br>> 10.10.100.51:7777 has been idle for 60.0 seconds, shutting it down.<br>> Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are <br>> some times that might help debug the situation: (tmr 1225117376.281033 <br>> now 1225117436.27<br>> 4121 dr 1225117436.273839 adv 1225117376.281030:1225117376.280783 func <br>> (f6ed8616:502) 1225117376.281035:1225117376.280774)<br>> Oct 27 09:23:59 db3 kernel: (13428,2):o2net_send_tcp_msg:841 ERROR: <br>> sendmsg returned -32 instead of 24<br>> Oct 27 09:23:59 db3
kernel: o2net: no longer connected to node db1 <br>> (num 1) at 10.10.100.51:7777<br>> Oct 27 09:23:59 db3 kernel: o2net: no longer connected to node db2 <br>> (num 2) at 10.10.100.52:7777<br>> Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: <br>> sendpage of size 24 to node db0 (num 0) at 10.10.100.50:7777 failed <br>> with -32<br>> Oct 27 09:23:59 db3 kernel: o2net: no longer connected to node db0 <br>> (num 0) at 10.10.100.50:7777<br>> Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: <br>> sendpage of size 24 to node db0 (num 0) at 10.10.100.50:7777 failed <br>> with -32<br>> Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: <br>> sendpage of size 24 to node db1 (num 1) at 10.10.100.51:7777 failed <br>> with -32<br>> Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: <br>> sendpage of size 24 to node db1 (num 1) at 10.10.100.51:7777 failed
<br>> with -32<br>> Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418 <br>> ERROR: link to 0 went down!<br>> Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR: <br>> status = -112<br>> Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418 <br>> ERROR: link to 1 went down!<br>> Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR: <br>> status = -107<br>> Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418 <br>> ERROR: link to 2 went down!<br>> Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR: <br>> status = -107<br>> Oct 27 09:28:27 db3 syslogd 1.4.1: restart.<br>> Oct 27 09:28:27 db3 syslog: syslogd startup succeeded<br>><br>> From the logs, it looks like the node that reboots is not able to <br>> communicate with the other nodes for more than 60 seconds and thus <br>> reboots itself.<br>>
Initially the Idle Timeout was set to 30 seconds, but since we are <br>> using bonded interface, we increased it to 60 seconds, and still the <br>> nodes are rebooting.<br>> We also upgraded the NIC drivers to the latest version to make sure <br>> that the driver is not causing such issues.<br>> Now we are not sure what else could be causing the frequent reboots. <br>> Please help us to debug the situation. Please let me know if more <br>> information is needed.<br>><br>> <br>> Regards,<br>><br>> Saranya Sivakumar<br><br><br>_______________________________________________<br>Ocfs2-users mailing list<br><a ymailto="mailto:Ocfs2-users@oss.oracle.com" href="mailto:Ocfs2-users@oss.oracle.com">Ocfs2-users@oss.oracle.com</a><br><a href="http://oss.oracle.com/mailman/listinfo/ocfs2-users" target="_blank">http://oss.oracle.com/mailman/listinfo/ocfs2-users</a><br></div></div></div></div><br>
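<p>For readers following the load-average question above: one way to separate a genuine day-over-day climb from normal workload noise is to log periodic samples of <code>/proc/loadavg</code> and chart them. A minimal sketch (the log path and sampling approach are illustrative, not something mentioned in this thread; run it from cron or a loop on each node):</p>

```shell
# Append a timestamped sample of the 1/5/15-minute load averages to a log,
# so a gradual cumulative climb like the one described above can be charted
# over days. The log path is a hypothetical example.
log=/tmp/loadavg.log

# /proc/loadavg has five fields: 1min 5min 15min runnable/total last_pid
read one five fifteen runq last_pid < /proc/loadavg

printf '%s %s %s %s %s\n' "$(date +%FT%T)" "$one" "$five" "$fifteen" "$runq" >> "$log"
```

On Linux, a rising load average with low CPU use often points at processes stuck in uninterruptible (D-state) I/O waits, which count toward load; comparing these samples against per-process state (e.g. from <code>ps</code>) would be a reasonable next step.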
</body></html>