[Ocfs2-users] load averages going up since upgrade to 1.2.9-1

Tue Nov 18 12:52:33 PST 2008

Hi,

We upgraded to ocfs2 1.2.9-1, and that seems to have fixed our reboot issue.
But since the upgrade, the load averages on all our RAC nodes have been increasing gradually and cumulatively every day.
That is, if the load average was 5 yesterday, it is 6 today, and it keeps climbing like that.

We rebooted the database nodes over the weekend, and the load averages returned to normal.
But since the reboot, the loads have been gradually increasing again.

We can't see any change in db activity that would cause the load average to increase like this.
And we haven't made any change to the system other than the ocfs2 upgrade.
We have been seeing this behavior since the night we upgraded ocfs2 (for the past 2 weeks).
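
To put numbers on the day-over-day trend described above, here is a minimal Python sketch. It assumes samples of /proc/loadavg are collected periodically (for example from cron) and tagged with the date. The function names and the sample values in the usage note are hypothetical, not taken from our systems.

```python
from collections import defaultdict
from statistics import mean


def parse_loadavg(text):
    """Parse one line in /proc/loadavg format into (1-min, 5-min, 15-min) averages."""
    one, five, fifteen = (float(x) for x in text.split()[:3])
    return one, five, fifteen


def daily_means(samples):
    """Group (day_label, loadavg_line) pairs by day and average the 1-minute load."""
    by_day = defaultdict(list)
    for day, line in samples:
        by_day[day].append(parse_loadavg(line)[0])
    return {day: mean(vals) for day, vals in by_day.items()}


def is_rising(per_day):
    """Return True if the per-day mean strictly increases in sorted-label order."""
    vals = [per_day[d] for d in sorted(per_day)]
    return all(a < b for a, b in zip(vals, vals[1:]))
```

For example, feeding it `("2008-11-16", "5.00 4.90 4.80 2/300 1234")` and `("2008-11-17", "6.10 6.00 5.90 3/310 2345")` gives per-day means of 5.0 and 6.1, which `is_rising` reports as an upward trend.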

Please advise.

Regards,

Saranya Sivakumar

________________________________
From: Sunil Mushran <sunil.mushran at oracle.com>
To: Saranya Sivakumar <sarlavk at yahoo.com>
Cc: ocfs2-users at oss.oracle.com
Sent: Monday, October 27, 2008 11:50:17 AM
Subject: Re: [Ocfs2-users] frequent production node reboots

Do upgrade to ocfs2 1.2.9-1. It has a fix for oss bugzilla#919 that could
be causing the timeouts. The symptom for that issue is o2net spinning at
100% shortly before the timeout/fence.

Saranya Sivakumar wrote:
> Hi All,
>
> We have been having frequent node reboots in our 4 node production RAC 
> cluster.
> We are using 10.2.0.2 Clusterware + 10.2.0.2 RAC Database(using ASM 
> for database files).
> We are using OCFS2 for the cluster files.
>
> cat /proc/fs/ocfs2/version
> OCFS2 1.2.8 Tue Jan 22 12:00:30 PST 2008 (build 9c7ae8bb50ef6d8791df2912775adcc5)
>
> /etc/init.d/o2cb status
> Module "configfs": Loaded
> Filesystem "configfs": Mounted
> Module "ocfs2_nodemanager": Loaded
> Module "ocfs2_dlm": Loaded
> Module "ocfs2_dlmfs": Loaded
> Filesystem "ocfs2_dlmfs": Mounted
> Checking O2CB cluster ocfs2: Online
>   Heartbeat dead threshold: 61
>   Network idle timeout: 60000
>   Network keepalive delay: 2000
>   Network reconnect delay: 2000
> Checking O2CB heartbeat: Active
>
> Most recent reboot happened this morning around 9:20 am.
> /var/log/messages on the node that rebooted (db3)
> Oct 27 09:23:56 db3 kernel: o2net: connection to node db2 (num 2) at 10.10.100.52:7777 has been idle for 60.0 seconds, shutting it down.
> Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1225117376.63515 now 1225117436.56120 dr 1225117378.63161 adv 1225117376.63652:1225117376.63654 func (f6ed8616:500) 1225117376.63517:1225117376.63644)
> Oct 27 09:23:56 db3 kernel: o2net: connection to node db0 (num 0) at 10.10.100.50:7777 has been idle for 60.0 seconds, shutting it down.
> Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1225117376.279971 now 1225117436.272115 dr 1225117436.272042 adv 1225117376.279977:1225117376.279978 func (f6ed8616:504) 1225117356.455405:1225117356.455412)
> Oct 27 09:23:56 db3 kernel: o2net: connection to node db1 (num 1) at 10.10.100.51:7777 has been idle for 60.0 seconds, shutting it down.
> Oct 27 09:23:56 db3 kernel: (13428,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1225117376.281033 now 1225117436.274121 dr 1225117436.273839 adv 1225117376.281030:1225117376.280783 func (f6ed8616:502) 1225117376.281035:1225117376.280774)
> Oct 27 09:23:59 db3 kernel: (13428,2):o2net_send_tcp_msg:841 ERROR: sendmsg returned -32 instead of 24
> Oct 27 09:23:59 db3 kernel: o2net: no longer connected to node db1 (num 1) at 10.10.100.51:7777
> Oct 27 09:23:59 db3 kernel: o2net: no longer connected to node db2 (num 2) at 10.10.100.52:7777
> Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: sendpage of size 24 to node db0 (num 0) at 10.10.100.50:7777 failed with -32
> Oct 27 09:23:59 db3 kernel: o2net: no longer connected to node db0 (num 0) at 10.10.100.50:7777
> Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: sendpage of size 24 to node db0 (num 0) at 10.10.100.50:7777 failed with -32
> Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: sendpage of size 24 to node db1 (num 1) at 10.10.100.51:7777 failed with -32
> Oct 27 09:23:59 db3 kernel: (13428,2):o2net_sendpage:875 ERROR: sendpage of size 24 to node db1 (num 1) at 10.10.100.51:7777 failed with -32
> Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418 ERROR: link to 0 went down!
> Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR: status = -112
> Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418 ERROR: link to 1 went down!
> Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR: status = -107
> Oct 27 09:23:59 db3 kernel: (14918,3):dlm_do_master_request:1418 ERROR: link to 2 went down!
> Oct 27 09:23:59 db3 kernel: (14918,3):dlm_get_lock_resource:995 ERROR: status = -107
> Oct 27 09:28:27 db3 syslogd 1.4.1: restart.
> Oct 27 09:28:27 db3 syslog: syslogd startup succeeded
>
> From the logs, it looks like the node that reboots is not able to 
> communicate with the other nodes for more than 60 seconds and thus 
> reboots itself.
> Initially the Idle Timeout was set to 30 seconds, but since we are 
> using bonded interface, we increased it to 60 seconds, and still the 
> nodes are rebooting.
> We also upgraded the NIC drivers to the latest version to make sure 
> that the driver is not causing such issues.
> Now we are not sure what else could be causing the frequent reboots. 
> Please help us to debug the situation. Please let me know if more 
> information is needed.
>
>  
> Regards,
>
> Saranya Sivakumar
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
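
The o2net_idle_timer lines quoted above can be checked against the 60-second network idle timeout with a minimal Python sketch. It rests on two assumptions that the thread does not state: that `tmr` is when the idle timer was armed and `now` is when it fired, and that the digits after the dot are a microsecond count printed without zero padding. The function name is hypothetical, and it expects one log entry with its mail-wrapped pieces rejoined into a single line.

```python
import re

# Matches the "tmr" and "now" fields, e.g. "tmr 1225117376.63515".
_FIELD = re.compile(r"\b(tmr|now)\s+(\d+)\.(\d+)")


def idle_microseconds(line):
    """Return (now - tmr) in microseconds for one o2net_idle_timer log line."""
    stamps = {}
    for name, sec, usec in _FIELD.findall(line):
        # Assumption: the fractional field is an unpadded microsecond count,
        # so ".63515" means 63515 microseconds, not 0.63515 seconds.
        stamps[name] = int(sec) * 1_000_000 + int(usec)
    return stamps["now"] - stamps["tmr"]
```

Under those assumptions, the entry for node db2 works out to 59,992,605 microseconds, which agrees with the "idle for 60.0 seconds" message logged just before it.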

