[Ocfs2-users] Unstable Cluster

Fri Dec 9 14:47:35 PST 2011

Thanks for the info Werner.
I already have the heartbeat threshold increased to 61, which helped tremendously with reboot issues in the beginning.
I'll go ahead and increase the others to see if that helps.
I can't imagine why it should be so troublesome to keep this cluster stable.
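
For reference, a minimal sketch of what the values Werner quotes work out to in wall-clock time. It assumes the usual o2cb rule that the disk heartbeat timeout is (threshold - 1) * 2 seconds, and that the settings live in the o2cb configuration file (for example /etc/sysconfig/o2cb on SLES). The helper names parse_settings and heartbeat_timeout_seconds are made up for illustration.

```python
# Values as quoted in this thread; on a real node they would be read
# from the o2cb configuration file instead of this inline string.
O2CB_SETTINGS = """
O2CB_HEARTBEAT_THRESHOLD=61
O2CB_IDLE_TIMEOUT_MS=60000
O2CB_KEEPALIVE_DELAY_MS=10000
O2CB_RECONNECT_DELAY_MS=10000
"""


def parse_settings(text):
    """Parse KEY=VALUE lines into a dict of integers (hypothetical helper)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = int(value.strip())
    return settings


def heartbeat_timeout_seconds(threshold, interval_s=2):
    """Assumed rule: one disk heartbeat every 2 s, timeout = (threshold - 1) * 2."""
    return (threshold - 1) * interval_s


settings = parse_settings(O2CB_SETTINGS)
hb_timeout = heartbeat_timeout_seconds(settings["O2CB_HEARTBEAT_THRESHOLD"])
idle_timeout = settings["O2CB_IDLE_TIMEOUT_MS"] / 1000

print(f"disk heartbeat timeout: {hb_timeout} s")
print(f"network idle timeout:   {idle_timeout:.0f} s")
```

Under that assumption a threshold of 61 means a node is only declared dead after about two minutes of missed disk heartbeats, and the network idle timeout of 60000 ms is one minute.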

I am seeing output drops incrementing on every GigE interface on the switch that has a server hanging off of it, though.
That is certainly interesting, but I'm not really sure why it's happening.
I have an isolated gigE switch just for iSCSI and all the servers have 2 interfaces, 1 dedicated just to iSCSI.
All the servers and the RAID are on the same switch. The servers are all connected via GigE, and the RAID
is connected via 10GigE. I'm seeing an unusually high number of output drops on the switch for all the servers, though.
I'm not sure what that is about, because there are no errors whatsoever on the 10GigE switch and there is certainly not
more than a GigE worth of traffic coming from any particular server.
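
One possible cause worth checking (an assumption, not something confirmed in this thread) is the speed mismatch itself: data leaving the RAID at 10GigE can arrive at a server's GigE port faster than that port can send it, so the switch has to buffer the difference and drops frames once the port buffer is full, even when the average rate is well below 1 Gb/s. The sketch below only does the arithmetic; the 256 KiB burst size and the function name are hypothetical.

```python
def buffer_needed_bytes(burst_bytes, ingress_bps, egress_bps):
    """Bytes that pile up on an egress port while one burst arrives.

    Assumes the burst arrives back to back at the ingress line rate and
    the egress port drains continuously at its own line rate.
    """
    if ingress_bps <= egress_bps:
        return 0.0
    burst_duration_s = burst_bytes * 8 / ingress_bps
    drained_bytes = egress_bps * burst_duration_s / 8
    return burst_bytes - drained_bytes


burst = 256 * 1024  # hypothetical read burst from the RAID, in bytes
needed = buffer_needed_bytes(burst, ingress_bps=10e9, egress_bps=1e9)
print(f"buffer needed for a {burst} byte burst: {needed:.0f} bytes")
```

With a 10:1 speed ratio, roughly 90% of each burst has to sit in the switch buffer for that port. If the per-port buffer is smaller than that, output drops would increment on the GigE ports while the 10GigE side shows no errors.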

On Dec 9, 2011, at 9:13 AM, Werner Flamme wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Tony Rios [09.12.2011 09:38]:
>> To add to my previous message,
>> 
>> After waiting for some time, I try to bring the mount point up again
>> and get these messages, which make it look like it is going to work:
>> 
>> [  127.520176] (mount.ocfs2,1388,0):dlm_join_domain:1857 Timed out joining dlm domain A3AA504BE42E4D3D8A15248D8FCD49BB after 94000 msecs
>> [  127.543603] ocfs2: Unmounting device (8,32) on (node 0)
>> [  127.780023] o2net: no longer connected to node pedge38 (num 4) at 10.88.0.38:7777
>> [  745.068033] o2dlm: Nodes in domain A3AA504BE42E4D3D8A15248D8FCD49BB: 1
>> [  745.119791] ocfs2: Mounting device (8,32) on (node 1, slot 1) with ordered data mode.
>> [  745.141503] (ocfs2rec,2060,0):ocfs2_replay_journal:1601 Recovering node 3 from slot 0 on device (8,32)
>> [  757.582921] o2net: accepted connection from node pedge38 (num 4) at 10.88.0.38:7777
>> [  758.424804] o2net: accepted connection from node pedge36 (num 3) at 10.88.0.36:7777
>> 
>> 
>> Then I issue a df -k to see the magic, it locks up, I wait, and
>> eventually there is a kernel panic.
>> 
>> Of course none of this is actually sending over netconsole to the
>> remote logging server.
>> 
>> Tony
> 
> Hi Tony,
> 
> since it is a timeout: what values have you configured? We had
> timeouts when the load went up on the boxes, resulting in occasional
> reboots every now and then. Plus, iSCSI data must pass over the same
> net, which might lead to line congestion.
> 
> My values are:
> O2CB_HEARTBEAT_THRESHOLD=61
> O2CB_IDLE_TIMEOUT_MS=60000
> O2CB_KEEPALIVE_DELAY_MS=10000
> O2CB_RECONNECT_DELAY_MS=10000
> 
> And since we use these settings, there was no OCFS2 timeout. Yet...
> 
> Oh, and we use SLES 11 SP 1 inside ESXi VMs, with Raw Disk access to
> the RAID.
> 
> HTH
> Werner
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v2.0.18 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
> 
> iEYEARECAAYFAk7iQb4ACgkQk33Krq8b42N8QQCdGPlOqM28Wl8/fKP/yBDbRjRd
> 6A0AmgLdoyotAAvTc/N4szS0r0thlI1U
> =Piu2
> -----END PGP SIGNATURE-----