[Ocfs2-users] disk heartbeat timeout poll

Wed Oct 11 19:12:14 PDT 2006

> 1. What is your disk heartbeat timeout? If you are unsure,
> "cat /etc/sysconfig/o2cb".
31
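
For reference, a small sketch of what this number means in seconds. It
assumes the 31 above is the O2CB_HEARTBEAT_THRESHOLD value from
/etc/sysconfig/o2cb (the message does not say so explicitly) and uses the
relation given in the OCFS2 documentation, threshold = timeout / 2 + 1, with
one heartbeat every 2 seconds. The function names are illustrative only.

```python
import math

# Seconds between disk heartbeat writes, per the OCFS2 documentation.
HEARTBEAT_INTERVAL_SECS = 2


def threshold_to_timeout_secs(threshold):
    """Convert an O2CB_HEARTBEAT_THRESHOLD value to a timeout in seconds."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def timeout_secs_to_threshold(timeout_secs):
    """Smallest threshold whose timeout is at least timeout_secs."""
    if timeout_secs <= 0:
        raise ValueError("timeout must be positive")
    return math.ceil(timeout_secs / HEARTBEAT_INTERVAL_SECS) + 1


if __name__ == "__main__":
    print(threshold_to_timeout_secs(31))
    print(timeout_secs_to_threshold(60))
```

Under that assumption, a value of 31 corresponds to a 60-second disk
heartbeat timeout.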

>
> 2. What is your shared disk setup like? Fiber Channel, iscsi, AoE, etc.
> Provide as much detail as you can.
iSCSI on a NetApp cluster, software initiator. Tested on Fibre Channel as
well.

System is SLES9 SP3

>
> 3. Are you using some sort of multipathing? If so, provide details.
Embedded iSCSI multi port support. Can test on FC and system multipath.

>
> 4. What is the cluster used for? Oracle database, mailserver, etc.
Oracle - archive logs and backups ONLY. Other cluster (testing) - application
binaries and configurations.

>
> 5. How many nodes in your cluster?
3 (2 RAC + 1 backup server)
2

>
> 6. Any other relevant information?
SAN convergence time is:
- On NetApp - 1 minute
- On Ethernet - 50 seconds
- On Fibre Channel network - 1 minute (timeouts on HDS Solaris multipath, for
example)

Network switch reboot time - about 40 seconds.
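
A rough way to compare these durations with a disk heartbeat timeout is
sketched below. The durations are the ones listed above; the 60-second timeout
in the example is an assumption (it is what a threshold of 31 would give under
the documented (threshold - 1) * 2 rule). Treating an outage that lasts exactly
as long as the timeout as "at risk" is also an assumption, not documented
OCFS2 behaviour.

```python
# Outage durations in seconds, taken from the figures in this message.
OUTAGES_SECS = {
    "NetApp convergence": 60,
    "Ethernet convergence": 50,
    "Fibre Channel convergence": 60,
    "network switch reboot": 40,
}


def outages_at_risk(disk_timeout_secs, outages=OUTAGES_SECS):
    """Return names of outages lasting at least as long as the timeout."""
    return sorted(
        name for name, secs in outages.items() if secs >= disk_timeout_secs
    )


if __name__ == "__main__":
    # 60 is an assumed timeout; substitute the real value for your cluster.
    for name in outages_at_risk(60):
        print(name)
```

With a 60-second timeout, both 1-minute convergence times sit right at the
limit, while the Ethernet convergence and the switch reboot stay under it.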

Events:
- rebooting one server - no problems.
- power outage (10 seconds) on the network switches, which caused both
interfaces to go down - all servers in all clusters rebooted (by OCFSv2, 1 by
Oracle CSS).
- problems noticed:
  * when I used the cluster for document storage (I tested it), CPU usage was
high during heavy IO operations; after testing I decided to use a heartbeat
cluster + ReiserFS instead.
  * when my Oracle server locked up memory (on a spinlock) so that the system
froze for 30 seconds, it resulted in a damaged OCFS (1 time - fatal, and 1
time - repairable).
  * since we began to use OCFSv2 for low-IO file systems only, no big problems
except fencing even when the system has no pending IO on it.

Wishes:
- clustered lvm2 (not evms - evms is too complicated and adds really heavy
overhead for 90% of tasks);
- online resize (at least if we have 1 node left in the system);
- multi-interface heartbeat;
- self-fencing ONLY if the system has pending IO (configurable);
- if the OCFSv2 cluster sees that ALL servers around cannot run the heartbeat
(disk IO delay), there is no need to self-fence any of them until at least one
can run the heartbeat on disk again. For now, if all servers lose access to
the disk, they all (except 1) reboot; in reality, if they can see each other,
they don't need to reboot, because they can classify the failure as GLOBAL;
- emergency local mount mode.
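
To make the two fencing wishes concrete, here is a minimal sketch of the
decision they describe. It is purely hypothetical: NodeView, should_self_fence
and the fence_only_with_pending_io switch are names invented for illustration
and do not correspond to any existing OCFS2 or o2cb interface.

```python
from dataclasses import dataclass


@dataclass
class NodeView:
    """What the local node knows about one peer (hypothetical structure)."""
    disk_heartbeat_ok: bool   # peer is still writing its disk heartbeat
    network_reachable: bool   # peer still answers over the network


def should_self_fence(local_disk_ok, local_pending_io, peers,
                      fence_only_with_pending_io=True):
    """Decide whether the local node should self-fence under the wished policy."""
    # A node that can still heartbeat on disk has no reason to fence.
    if local_disk_ok:
        return False
    # Wish: self-fence ONLY if there is pending IO (configurable).
    if fence_only_with_pending_io and not local_pending_io:
        return False
    # Wish: if every peer is visible over the network and none of them can
    # heartbeat on disk either, classify the failure as GLOBAL and wait.
    global_failure = bool(peers) and all(
        p.network_reachable and not p.disk_heartbeat_ok for p in peers
    )
    if global_failure:
        return False
    return True
```

For example, a node that has lost the disk, has pending IO, and sees one peer
still heartbeating on disk would fence itself; the same node would wait if all
reachable peers had lost the disk too.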

>
> Again, feel free to mail me directly.
>
> Thanks
> Sunil