[Ocfs2-users] Unstable Cluster

Wed Dec 21 01:47:21 PST 2011

As an update on this issue: it turns out that I was using the same Ethernet interface for both iSCSI traffic and the OCFS2 cluster interconnect, and it just couldn't keep up.

Raising the timeouts and moving to separate GigE ports helped tremendously.
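
For anyone comparing numbers, here is a rough Python sketch of what the values Werner posted (quoted below) work out to in seconds. It assumes the commonly documented o2cb rule that a node is declared dead after (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds of missed disk heartbeats. The function names and the dictionary layout are made up for illustration and are not part of any OCFS2 tool.

```python
# Assumption: the o2cb disk heartbeat fires every 2 seconds, so the dead
# threshold is (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds.
HEARTBEAT_INTERVAL_S = 2


def heartbeat_timeout_seconds(threshold: int) -> int:
    """Seconds of missed disk heartbeats before a node is declared dead."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_S


# Values copied from Werner's message; the keys are the o2cb variable names.
WERNER_SETTINGS = {
    "O2CB_HEARTBEAT_THRESHOLD": 61,
    "O2CB_IDLE_TIMEOUT_MS": 60000,
    "O2CB_KEEPALIVE_DELAY_MS": 10000,
    "O2CB_RECONNECT_DELAY_MS": 10000,
}


def summarize(settings: dict) -> dict:
    """Convert the raw o2cb values into seconds (hypothetical helper)."""
    return {
        "disk_heartbeat_timeout_s": heartbeat_timeout_seconds(
            settings["O2CB_HEARTBEAT_THRESHOLD"]
        ),
        "network_idle_timeout_s": settings["O2CB_IDLE_TIMEOUT_MS"] / 1000,
        "keepalive_delay_s": settings["O2CB_KEEPALIVE_DELAY_MS"] / 1000,
        "reconnect_delay_s": settings["O2CB_RECONNECT_DELAY_MS"] / 1000,
    }


if __name__ == "__main__":
    for name, seconds in summarize(WERNER_SETTINGS).items():
        print(f"{name}: {seconds}")
```

Under that assumption, a threshold of 61 gives a 120 second disk heartbeat timeout, alongside a 60 second network idle timeout.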

The next change that seems to have made a huge difference was enabling jumbo frames all the way through the iSCSI network. It would seem that the iSCSI overhead was heavy enough that each standard-size packet wasn't carrying much data.
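
To put rough numbers on the jumbo frame change, here is a back-of-the-envelope sketch in Python. The MTU values (1500 and 9000) are assumptions, since I have not listed the exact settings here. It also assumes plain IPv4 and TCP headers with no options, and it ignores iSCSI PDU headers, which are charged per PDU rather than per frame. Treat it as an illustration of packet counts, not as a measurement of this network.

```python
import math

IP_TCP_HEADERS = 40    # 20-byte IPv4 header + 20-byte TCP header, no options
ETHERNET_FRAMING = 38  # preamble 8 + header 14 + FCS 4 + inter-frame gap 12


def tcp_payload_per_frame(mtu: int) -> int:
    """Bytes of TCP payload that fit in one frame at the given MTU."""
    return mtu - IP_TCP_HEADERS


def wire_efficiency(mtu: int) -> float:
    """Fraction of on-the-wire bytes that is TCP payload."""
    return tcp_payload_per_frame(mtu) / (mtu + ETHERNET_FRAMING)


def frames_needed(data_bytes: int, mtu: int) -> int:
    """Number of full-size frames needed to carry data_bytes of payload."""
    return math.ceil(data_bytes / tcp_payload_per_frame(mtu))


if __name__ == "__main__":
    one_mib = 1024 * 1024
    for mtu in (1500, 9000):  # assumed standard and jumbo MTU values
        print(
            f"MTU {mtu}: {frames_needed(one_mib, mtu)} frames per MiB, "
            f"{wire_efficiency(mtu):.1%} payload on the wire"
        )
```

With these assumptions, moving 1 MiB takes 719 frames at MTU 1500 but only 118 at MTU 9000. The byte-level efficiency gain is modest, so the roughly sixfold drop in packets that every NIC, switch, and host has to handle is probably the bigger factor.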

Now, after several weeks of hell, we should (fingers crossed) be back in good shape.

Thanks for everyone's input here. I definitely have a much better understanding of the hows and whys of OCFS2 now than I did before this exercise.

Tony

On Dec 9, 2011, at 9:13 AM, Werner Flamme <werner.flamme at ufz.de> wrote:

> Tony Rios [09.12.2011 09:38]:
>> To add to my previous message,
>> 
>> After waiting some time, I try to bring the mount point up again and
>> get these messages, which make it look like it is going to work:
>> 
>> [  127.520176] (mount.ocfs2,1388,0):dlm_join_domain:1857 Timed out joining dlm domain A3AA504BE42E4D3D8A15248D8FCD49BB after 94000 msecs
>> [  127.543603] ocfs2: Unmounting device (8,32) on (node 0)
>> [  127.780023] o2net: no longer connected to node pedge38 (num 4) at 10.88.0.38:7777
>> [  745.068033] o2dlm: Nodes in domain A3AA504BE42E4D3D8A15248D8FCD49BB: 1
>> [  745.119791] ocfs2: Mounting device (8,32) on (node 1, slot 1) with ordered data mode.
>> [  745.141503] (ocfs2rec,2060,0):ocfs2_replay_journal:1601 Recovering node 3 from slot 0 on device (8,32)
>> [  757.582921] o2net: accepted connection from node pedge38 (num 4) at 10.88.0.38:7777
>> [  758.424804] o2net: accepted connection from node pedge36 (num 3) at 10.88.0.36:7777
>> 
>> 
>> Then I issue a df -k to see the magic, it locks up, I wait, and
>> eventually there is a kernel panic.
>> 
>> Of course none of this is actually being sent over netconsole to the
>> remote logging server.
>> 
>> Tony
> 
> Hi Tony,
> 
> Since it is a timeout: what values have you configured? We had
> timeouts when the load went up on the boxes, resulting in occasional
> reboots. Plus, iSCSI data must pass over the same network, which
> might lead to link congestion.
> 
> My values are:
> O2CB_HEARTBEAT_THRESHOLD=61
> O2CB_IDLE_TIMEOUT_MS=60000
> O2CB_KEEPALIVE_DELAY_MS=10000
> O2CB_RECONNECT_DELAY_MS=10000
> 
> Since we started using these settings, we have had no OCFS2 timeouts. Yet...
> 
> Oh, and we use SLES 11 SP 1 inside ESXi VMs, with Raw Disk access to
> the RAID.
> 
> HTH
> Werner
> 