Hmm, here is an example. Re: [Ocfs2-users] Also just a comment to theOracle guys

Alexei_Roudnev Alexei_Roudnev at exigengroup.com
Sun Feb 11 01:52:58 PST 2007


Absolutely. I know how redo and RAC interact; you are absolutely correct.

Sometimes CSSD reboots one node and that's all - good luck. Sometimes OCFS reboots one node and CSSD reboots another
node - bad luck. That's why it is important not to mix different cluster managers on the same servers, or at least to
let them interact and reach the same decision about _who is master today_ (that is, who will survive a split-brain
situation).

RAC is a somewhat simpler case because Oracle is usually the primary service - so if it decides to reboot, that is a
reasonable decision. OCFSv2 is another story - sometimes it is a _secondary_ service (for example, it is used for
backups only), and if it is secondary then it had better stop working than reboot.

It reveals 2 big problems (both Oracle and OCFSv2 are affected):
- A single-interface heartbeat is not reliable. You CAN NOT build a reliable cluster on a single heartbeat channel.
Classical clusters (Veritas VCS) use 2 - 4 different heartbeat media (we use 2 independent Ethernet hubs and 2 generic
Ethernet LANs in Veritas, 2 Ethernets + 1 serial in Linux clusters, 2 Ethernets + 1 serial in a Cisco PIX cluster, and
so on). Neither OCFS nor Oracle RAC can use more than one (to be precise, you can configure a few interfaces for the
RAC interconnect in the SPFile, but it will not affect CSSD). In addition, the OCFS defaults are very strange and
unrealistic - Ethernet and FC can not guarantee heartbeat times better than 1 minute, on average (I mean, in case of
any network reconfiguration the heartbeat will experience a 30 - 50 second delay), so if you configure a 12 second
timeout /the default in OCFSv2/ you are naive, to say the least.

- Too easy _self fencing_. Once again, if an OCFS node loses its connection to the disk, it should not self-fence - it
can send data to other nodes (or request it from other nodes), it can unmount the file system and try to remount it,
it can release control and resume operations. Immediate fencing is necessary in SOME cases but not in all. If the FS
has no pending operations, then fencing by reboot makes little difference compared with just a _remount_. It's not as
simple as I explain here, but the truth is that fencing decisions are not flexible enough and decrease reliability
dramatically (I posted a list of scenarios when fencing should not happen).
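
To make the first point concrete, here is a minimal Python sketch of a liveness check over several heartbeat channels:
the peer is declared dead only when every channel has been silent for longer than the timeout. The class name, the
channel names and the injected clock are all assumptions for illustration; the 40 second default simply mirrors the
timeout mentioned below for the heartbeat clusters. It is not how OCFSv2 or CSSD is implemented.

```python
import time


class PeerMonitor:
    """Tracks the last heartbeat seen from one peer on each channel (hypothetical helper)."""

    def __init__(self, channels, timeout_s=40.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        now = clock()
        # Assume every channel was alive at start-up.
        self.last_seen = {name: now for name in channels}

    def heartbeat(self, channel):
        """Record a heartbeat received on the given channel."""
        self.last_seen[channel] = self.clock()

    def silent_channels(self):
        """Names of channels that have been quiet for longer than the timeout."""
        now = self.clock()
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )

    def peer_is_dead(self):
        """The peer is declared dead only if ALL channels are silent."""
        return len(self.silent_channels()) == len(self.last_seen)
```

With, say, `PeerMonitor(["eth0", "eth1", "serial0"])`, a short outage on one Ethernet link only shows up in
`silent_channels()`; `peer_is_dead()` stays false as long as any one channel still delivers heartbeats.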

In addition, I have noticed other problems with OCFSv2 too (such as excessive CPU usage in some cases).

I use OCFSv2, even in production. But I take it with a grain of salt, have a backup plan for _how to run without it_,
and don't use it for heavily loaded file systems with millions of files (I use heartbeat, reiserfs and APC switch
fencing - and 3 independent heartbeats, with a 40 second timeout). So far I have had one glitch with OCFSv2 (when it
remounted read-only on one node) and that's all - no other problems in production (OCFSv2 is used during starts/stops
only, so it is safe). But I run stress tests in the lab, I am running it on the lab clusters now (including RAC), and
the conclusion is simple - as a cluster, it is not reliable; as a file system, it may have hidden bugs, so be extra
careful with it.

PS. The good news - it improves every month. Some problems are already in the past.

PPS. All these lab reboots have been caused by extremely heavy load or by hardware failures (simulated or real). It
works better in real life. But my experience tells me that if I can break something in the lab in 3 days, it's a
matter of a few months before it breaks in production.

  ----- Original Message ----- 
  From: Luis Freitas
  To: Ocfs2-users at oss.oracle.com
  Sent: Saturday, February 10, 2007 4:52 PM
  Subject: Re: Hmm,here is an example. Re: [Ocfs2-users] Also just a comment to theOracle guys


  Alexei,

      Actually your log seems to show that CSSD (Oracle CRS) rebooted the node before OCFS2 got a chance to do it.

      On a RAC cluster, if the interconnect is interrupted, all the nodes hang until split-brain resolution is
complete and the recovery of all the crashed nodes is completed. This is needed because every read of an Oracle data
block needs a ping to the other nodes.

      The view of the data must be consistent: when one node reads a particular data block, the Oracle database first
pings the other nodes to ensure that none of them holds a modified copy of the block that has not yet been flushed to
disk. Another node may even forward a reply with the block, avoiding the disk access (Cache Fusion).

      When a split brain occurs, the blocks not yet flushed to disk are lost, and they are rebuilt using the redo
threads of the particular nodes that crashed. During this interval all the database instances "freeze", since before
the node recovery is complete there is no way to guarantee that a block read from disk has not been altered on the
crashed node.

      So the fencing is needed even if there is no disk activity, as the entire cluster "hangs" the moment the
interconnect is down. And the timeout for the fencing must be as small as possible to prevent a long cluster
reconfiguration delay. Of course the timeout must be tuned so as to be larger than Ethernet switch failovers, or
storage controller or disk multipath failovers. Or, if possible, the failover times should be reduced.
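
The tuning rule in the paragraph above comes down to simple arithmetic, sketched here in Python: the fencing timeout
has to exceed the slowest failover it must ride through, plus some slack. The function name, the 5 second margin and
all the failover figures are hypothetical values for illustration, not measurements or Oracle recommendations.

```python
def min_fencing_timeout(failover_times_s, margin_s=5.0):
    """Shortest fencing timeout that still outlasts the slowest failover.

    failover_times_s: mapping of component name -> worst-case failover time (s).
    margin_s: extra slack added on top (an assumed value, tune per site).
    """
    if not failover_times_s:
        raise ValueError("need at least one failover time")
    return max(failover_times_s.values()) + margin_s


# Hypothetical worst-case figures, for illustration only.
example = {
    "ethernet_switch": 30.0,
    "storage_controller": 20.0,
    "disk_multipath": 45.0,
}
print(min_fencing_timeout(example))  # 50.0
```

Reducing the slowest failover time is the only way to bring the result down, which is the alternative named at the
end of the paragraph.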

     Now, on the other hand, I too am having problems with OCFS2. It seems much less robust than ASM and the previous
version, OCFS, especially under heavy disk activity. But I do expect these problems to get solved in the near future,
as the 2.4 kernel VM problems were.

  Regards,
  Luis

  Alexei_Roudnev <Alexei_Roudnev at exigengroup.com> wrote:
    Additional info - the node had NO active OCFSv2 operations (OCFSv2 is used for backups only, and from another node
only). So, if the system just SUSPENDED all FS operations and tried to rejoin the cluster, it all could have worked
(moreover, the connection to the disk system was intact, so it could have closed the file system gracefully).

    It reveals 3 problems at once:
    - single heartbeat link (instead of multiple links);
    - timeout too short (Ethernet can't guarantee 10 seconds; it can guarantee 1 minute at best);
    - fencing even if the system is passive and can remount / reconnect instead of rebooting.
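
The third point can be sketched as a small decision function in Python. This is a hypothetical policy, not OCFSv2
behaviour: the names, the three inputs and in particular the rule "fence whenever operations are pending" are
simplifying assumptions made for the example. It only shows that a passive node has milder options than a reboot.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    SUSPEND_AND_REJOIN = "suspend FS operations and try to rejoin the cluster"
    UNMOUNT_AND_REMOUNT = "unmount cleanly, then try to remount"
    FENCE = "fence (reboot)"


def choose_action(heartbeat_lost, disk_reachable, pending_operations):
    """Pick the mildest recovery step for a node (hypothetical policy)."""
    if not heartbeat_lost and disk_reachable:
        return Action.CONTINUE
    if pending_operations > 0:
        # Assumption: with unfinished operations we play safe and fence.
        return Action.FENCE
    if disk_reachable:
        # Passive node, storage intact: stop FS activity and rejoin.
        return Action.SUSPEND_AND_REJOIN
    # Passive node, storage path lost: nothing to flush, so drop the mount.
    return Action.UNMOUNT_AND_REMOUNT
```

For the lab case described above (heartbeat lost, disk connection intact, no OCFSv2 activity),
`choose_action(True, True, 0)` returns `Action.SUSPEND_AND_REJOIN` rather than `Action.FENCE`.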

    All we did in the lab was _disconnect 1 of the trunks between switches for a few seconds, then plug it back into
the socket_. No other application failed (including the heartbeat clusters). The database cluster was not doing
anything on OCFS at the time of the failure (not even backups).

    I will try heartbeat between loopback interfaces (and the OCFS protocol) next time (I am just curious whether it
can provide 10 seconds for network reconfiguration).

    ...
    Feb  1 12:19:13 testrac12 kernel: o2net: connection to node testrac11 (num 0) at 10.254.32.111:7777 has been idle
for 10 seconds, shutting it down.
    Feb  1 12:19:13 testrac12 kernel: (13,3):o2net_idle_timer:1310 here are some times that might help debug the
situation: (tmr 1170361135.521061 now 1170361145.520476 dr 1170361141.852795 adv 1170361135.521063:1170361135.521064
func (c4378452:505) 1170361067.762941:1170361067.762967)
    Feb  1 12:19:13 testrac12 kernel: o2net: no longer connected to node testrac11 (num 0) at 10.254.32.111:7777
    Feb  1 12:19:13 testrac12 kernel: (1855,3):dlm_send_remote_convert_request:398 ERROR: status = -107
    Feb  1 12:19:13 testrac12 kernel: (1855,3):dlm_wait_for_node_death:371 5AECFF0BBCF74F069A3B8FF79F09FB5A: waiting
5000ms for notification of death of node 0
    Feb  1 12:19:13 testrac12 kernel: (1855,1):dlm_send_remote_convert_request:398 ERROR: status = -107
    Feb  1 12:19:13 testrac12 kernel: (1855,1):dlm_wait_for_node_death:371 5AECFF0BBCF74F069A3B8FF79F09FB5A: waiting
5000ms for notification of death of node 0
    Feb  1 12:22:22 testrac12 kernel: (1855,2):dlm_send_remote_convert_request:398 ERROR: status = -107
    Feb  1 12:22:22 testrac12 kernel: (1855,2):dlm_wait_for_node_death:371 5AECFF0BBCF74F069A3B8FF79F09FB5A: waiting
5000ms for notification of death of node 0
    Feb  1 12:22:27 testrac12 kernel: (13,3):o2quo_make_decision:144 ERROR: fencing this node because it is connected to
a half-quorum of 1 out of 2 nodes which doesn't include the lowest active node 0
    Feb  1 12:22:27 testrac12 kernel: (13,3):o2hb_stop_all_regions:1889 ERROR: stopping heartbeat on all active regions.
    Feb  1 12:22:27 testrac12 kernel: Kernel panic: ocfs2 is very sorry to be fencing this system by panicing
    Feb  1 12:22:27 testrac12 kernel:
    Feb  1 12:22:28 testrac12 su: pam_unix2: session finished for user oracle, service su
    Feb  1 12:22:29 testrac12 logger: Oracle CSSD failure.  Rebooting for cluster integrity.
    Feb  1 12:22:32 testrac12 su: pam_unix2: session finished for user oracle, service su
    ...
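
The o2quo_make_decision message in the log ("a half-quorum of 1 out of 2 nodes which doesn't include the lowest active
node 0") suggests a tie-break rule that can be sketched in Python as follows. This is a reading of the log message,
not a transcription of the o2quo source; the function name and the set-based inputs are assumptions, and so is the
guess that testrac12 is node number 1 (the log only states that testrac11 is node 0).

```python
def must_fence(heartbeating_nodes, connected_nodes):
    """Return True if this node should fence itself (sketch of the rule in the log).

    heartbeating_nodes: node numbers still heartbeating, this node included.
    connected_nodes: node numbers reachable over the network, this node included.
    """
    total = len(heartbeating_nodes)
    reachable = len(connected_nodes & heartbeating_nodes)
    if 2 * reachable > total:
        return False  # clear majority: survive
    if 2 * reachable == total:
        # Exact half: only the half that holds the lowest node number survives.
        return min(heartbeating_nodes) not in connected_nodes
    return True  # minority: fence


# The situation in the log, assuming testrac12 is node 1:
print(must_fence({0, 1}, {1}))  # True
```

Under this rule a two-node cluster that loses its interconnect always keeps node 0 and fences the other node, whatever
either node was doing at the time.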
    _______________________________________________
    Ocfs2-users mailing list
    Ocfs2-users at oss.oracle.com
    http://oss.oracle.com/mailman/listinfo/ocfs2-users



