[Ocfs2-users] load averages going up since upgrade to 1.2.9-1

Sunil Mushran sunil.mushran at oracle.com
Tue Nov 18 13:43:12 PST 2008


Ping Oracle Support.

Saranya Sivakumar wrote:
> Hi,
>
> We upgraded to ocfs2 1.2.9-1 and that seems to have fixed our reboot 
> issue.
> But after the upgrade, we are noticing that the load averages on all 
> our RAC nodes are gradually increasing every day (cumulatively).
> That is, if the load average was 5 yesterday, it goes up to 6 today, and 
> it continues to increase like that.
>
> We rebooted the database nodes over the weekend, and it looked like 
> the load averages became normal.
> But since the reboot, again the loads are gradually increasing.
>
> We can't see any change in db activity that would be contributing to the 
> load average increasing like this.
> And we haven't made any changes to the system other than the ocfs2 upgrade.
> We have been noticing this behavior since the night we upgraded ocfs2 
> (for the past 2 weeks).
> Please advise.
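A load average that climbs steadily without matching CPU or db activity often comes from processes stuck in uninterruptible sleep (state D), which count toward the load average without consuming CPU; cluster-stack threads blocked on I/O or the DLM are a common source. A quick check, assuming a procps-style `ps`:

```shell
#!/bin/sh
# Count processes in uninterruptible sleep (state D). These inflate the
# load average without using CPU time.
echo "D-state processes: $(ps -eo stat= | awk '$1 ~ /^D/' | wc -l)"

# List them with their kernel wait channel so stuck cluster threads
# (e.g. o2net, dlm_thread, o2hb) stand out.
ps -eo pid,stat,wchan:30,comm | awk 'NR == 1 || $2 ~ /^D/'
```

If the D-state count rises in step with the load average, the shared wait channel usually points at the subsystem the processes are stuck in.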
>  
> Regards,
> Saranya Sivakumar
>
>
> ------------------------------------------------------------------------
> *From:* Sunil Mushran <sunil.mushran at oracle.com>
> *To:* Saranya Sivakumar <sarlavk at yahoo.com>
> *Cc:* ocfs2-users at oss.oracle.com
> *Sent:* Monday, October 27, 2008 11:50:17 AM
> *Subject:* Re: [Ocfs2-users] frequent production node reboots
>
> Do upgrade to ocfs2 1.2.9-1. It has a fix for oss bugzilla#919 that could
> be causing the timeouts. The symptom for that issue is o2net spinning at
> 100% shortly before the timeout/fence.
>
>
> Saranya Sivakumar wrote:
> > Hi All,
> >
> > We have been having frequent node reboots in our 4 node production RAC
> > cluster.
> > We are using 10.2.0.2 Clusterware + 10.2.0.2 RAC Database(using ASM
> > for database files).
> > We are using OCFS2 for the cluster files.
> >
> > cat /proc/fs/ocfs2/version
> > OCFS2 1.2.8 Tue Jan 22 12:00:30 PST 2008 (build
> > 9c7ae8bb50ef6d8791df2912775adcc5)
> >
> > /etc/init.d/o2cb status
> > Module "configfs": Loaded
> > Filesystem "configfs": Mounted
> > Module "ocfs2_nodemanager": Loaded
> > Module "ocfs2_dlm": Loaded
> > Module "ocfs2_dlmfs": Loaded
> > Filesystem "ocfs2_dlmfs": Mounted
> > Checking O2CB cluster ocfs2: Online
> >  Heartbeat dead threshold: 61
> >  Network idle timeout: 60000
> >  Network keepalive delay: 2000
> >  Network reconnect delay: 2000
> > Checking O2CB heartbeat: Active
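For reference, the timeouts shown above are normally set in /etc/sysconfig/o2cb (the path used by the 1.2.x init scripts; some distributions use /etc/default/o2cb). A fragment matching this status output; the values must be identical on every node and only take effect after the cluster stack is restarted:

```shell
# /etc/sysconfig/o2cb -- values matching the status output above
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
O2CB_HEARTBEAT_THRESHOLD=61     # node declared dead after ~(61-1)*2 = 120s
O2CB_IDLE_TIMEOUT_MS=60000      # network idle timeout: 60s
O2CB_KEEPALIVE_DELAY_MS=2000    # keepalive probe every 2s
O2CB_RECONNECT_DELAY_MS=2000    # reconnect attempt every 2s
```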
> >
> > Most recent reboot happened this morning around 9:20 am.
> > /var/log/messages on the node that rebooted (db3)
> > Oct 27 09:23:56 db3 kernel: o2net: connection to node db2 (num 2) at
> > 10.10.100.52:7777 has been idle for 60.0 seconds, shutting it down.
> > Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are
> > some times that might help debug the situation: (tmr 1225117376.63515
> > now 1225117436.56120 dr 1225117378.63161 adv
> > 1225117376.63652:1225117376.63654 func
> > (f6ed8616:500) 1225117376.63517:1225117376.63644)
> > Oct 27 09:23:56 db3 kernel: o2net: connection to node db0 (num 0) at
> > 10.10.100.50:7777 has been idle for 60.0 seconds, shutting it down.
> > Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are
> > some times that might help debug the situation: (tmr 1225117376.279971
> > now 1225117436.272115 dr 1225117436.272042 adv
> > 1225117376.279977:1225117376.279978 func
> > (f6ed8616:504) 1225117356.455405:1225117356.455412)
> > Oct 27 09:23:56 db3 kernel: o2net: connection to node db1 (num 1) at
> > 10.10.100.51:7777 has been idle for 60.0 seconds, shutting it down.
> > Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are
> > some times that might help debug the situation: (tmr 1225117376.281033
> > now 1225117436.274121 dr 1225117436.273839 adv
> > 1225117376.281030:1225117376.280783 func
> > (f6ed8616:502) 1225117376.281035:1225117376.280774)
> > Oct 27 09:23:59 db3 kernel: (13428,2):o2net_send_tcp_msg:841 ERROR:
> > sendmsg returned -32 instead of 24
> > Oct 27 09:23:59 db3 kernel: o2net: no longer connected to node db1
> > (num 1) at 10.10.100.51:7777
> > Oct 27 09:23:59 db3 kernel: o2net: no longer connected to node db2
> > (num 2) at 10.10.100.52:7777
> > Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR:
> > sendpage of size 24 to node db0 (num 0) at 10.10.100.50:7777 failed
> > with -32
> > Oct 27 09:23:59 db3 kernel: o2net: no longer connected to node db0
> > (num 0) at 10.10.100.50:7777
> > Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR:
> > sendpage of size 24 to node db0 (num 0) at 10.10.100.50:7777 failed
> > with -32
> > Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR:
> > sendpage of size 24 to node db1 (num 1) at 10.10.100.51:7777 failed
> > with -32
> > Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR:
> > sendpage of size 24 to node db1 (num 1) at 10.10.100.51:7777 failed
> > with -32
> > Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418
> > ERROR: link to 0 went down!
> > Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR:
> > status = -112
> > Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418
> > ERROR: link to 1 went down!
> > Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR:
> > status = -107
> > Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418
> > ERROR: link to 2 went down!
> > Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR:
> > status = -107
> > Oct 27 09:28:27 db3 syslogd 1.4.1: restart.
> > Oct 27 09:28:27 db3 syslog: syslogd startup succeeded
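The negative values in those ERROR lines are kernel errno codes, and decoding them confirms the picture: the sends failed because the sockets were already gone. A quick lookup (using python3 for the errno table):

```shell
#!/bin/sh
# Decode the errno values from the log: sendmsg/sendpage returned -32,
# dlm_get_lock_resource reported -107 and -112.
for e in 32 107 112; do
    python3 -c "import errno, os; print(-$e, errno.errorcode[$e], '-', os.strerror($e))"
done
```

which reports EPIPE (broken pipe), ENOTCONN (transport endpoint is not connected), and EHOSTDOWN (host is down).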
> >
> > From the logs, it looks like the node that reboots is not able to
> > communicate with the other nodes for more than 60 seconds and thus
> > reboots itself.
> > Initially the idle timeout was set to 30 seconds, but since we are
> > using a bonded interface, we increased it to 60 seconds, and the
> > nodes are still rebooting.
> > We also upgraded the NIC drivers to the latest version to make sure
> > that the driver is not causing such issues.
> > Now we are not sure what else could be causing the frequent reboots.
> > Please help us to debug the situation. Please let me know if more
> > information is needed.
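Since the common thread is the interconnect going quiet, it may be worth watching the bonded link itself between fences. A sketch; the bond name `bond0` is an assumption, so adjust it to whichever interface carries the 10.10.100.x network:

```shell
#!/bin/sh
# Bonding failover history: a non-zero, growing Link Failure Count means
# the bond is flapping underneath o2net.
bond=bond0
[ -r "/proc/net/bonding/$bond" ] && \
    grep -E 'MII Status|Currently Active Slave|Link Failure Count' \
        "/proc/net/bonding/$bond"

# Per-interface RX/TX drop counters from /proc/net/dev; steadily rising
# drops can explain idle-timeout fences even with a 60s timeout.
awk -F'[: ]+' 'NR > 2 {printf "%s rx_drop=%s tx_drop=%s\n", $2, $6, $14}' \
    /proc/net/dev
```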
> >
> > 
> > Regards,
> >
> > Saranya Sivakumar
> >
> >
> > ------------------------------------------------------------------------
> >
> > _______________________________________________
> > Ocfs2-users mailing list
> > Ocfs2-users at oss.oracle.com <mailto:Ocfs2-users at oss.oracle.com>
> > http://oss.oracle.com/mailman/listinfo/ocfs2-users
>
>
>



