Hmm, here is an example. Re: [Ocfs2-users] Also just a comment to the Oracle guys

Alexei_Roudnev Alexei_Roudnev at exigengroup.com
Mon Feb 12 09:58:01 PST 2007


There is one more problem with OCFSv2.

If the disk system experiences a temporary (!) outage (such as a SAN system
restart), then all nodes experience an I/O delay. That doesn't cause trouble
for most standalone servers (because their timeouts are usually longer), but
with OCFSv2 it usually causes fencing of all the nodes once they detect the
disk heartbeat timeout. OCFSv2 should recognize this specific (and very
common) case (no node can run the disk heartbeat) and just suspend operation
and wait until at least one node can proceed, before fencing anything.
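
As a rough illustration (the parameter name is the one used by the o2cb init
scripts, but defaults and exact behaviour differ between OCFSv2 1.2.x
releases, so treat the numbers as assumptions), the disk heartbeat window can
at least be stretched so that a short SAN restart does not immediately fence
every node:

    # /etc/sysconfig/o2cb  (values illustrative only)
    # A node self-fences after roughly (O2CB_HEARTBEAT_THRESHOLD - 1) * 2
    # seconds of missed disk heartbeats; 31 gives about 60 seconds instead
    # of the ~12 second default.
    O2CB_ENABLED=true
    O2CB_BOOTCLUSTER=ocfs2
    O2CB_HEARTBEAT_THRESHOLD=31

    # apply on every node, then restart the cluster stack
    # (or re-run "/etc/init.d/o2cb configure", which prompts for it)
    /etc/init.d/o2cb restart

Of course this only papers over the problem - the real fix would be the
suspend-and-wait behaviour described above.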

And yes, an option specifying _what to do in case of failure_ could be very
useful - it would really open the road to using OCFSv2 with RAC (today it is
dangerous, because OCFS may panic a node even when the data on that OCFS
volume is not critical).
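
Something as simple as a mount option would do. Purely hypothetical syntax
(no such option exists in OCFSv2 today), modeled on the UFS knob quoted
further down in this thread - the device and mount point are made up too:

    # hypothetical - "onerror=" is NOT a real OCFSv2 mount option
    mount -t ocfs2 -o onerror=umount /dev/sdb1 /u02/backup

On a volume that only holds backups, a forced unmount (or a read-only
remount) would be far less disruptive than panicking the node.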

Btw, I never tested what happens if a RAC cluster experiences a problem with
a single tablespace on a single node - does it shut the node down, take the
tablespace offline on all nodes, or crash?

----- Original Message ----- 
From: "Andy Phillips" <andrew.phillips at betfair.com>
To: "Luis Freitas" <lfreitas34 at yahoo.com>
Cc: <Ocfs2-users at oss.oracle.com>
Sent: Monday, February 12, 2007 4:15 AM
Subject: Re: Hmm, here is an example. Re: [Ocfs2-users] Also just a
comment to the Oracle guys


> As a thought, borrowing from Solaris,
>
> How about a mount option?
>
> The following text is from the mount_ufs manpage. That probably
> belongs to someone, but I took this from a random web page.
>
> onerror = action    This option specifies the action that UFS should
> take to recover from an internal inconsistency on a file system.
> Specify action as panic, lock, or umount. These values cause a forced
> system shutdown, a file system lock to be applied to the file system,
> or the file system to be forcibly unmounted, respectively. The
> default is panic.
>
> That preserves existing behaviour, but allows people running RAC and
> OCFS2 to avoid multiple hits if RAC and OCFS2 reboot different nodes.
>
> Andy
>
> On Sun, 2007-02-11 at 08:44 -0800, Luis Freitas wrote:
> > Alexei,
> >
> >      I think you have a point too, maybe OCFS2 could behave like
> > NetApp, and simply hang when there is a problem, leaving the fencing
> > to CRS or whatever other clusterware is in use.
> >
> >      Anyone from Oracle got an opinion on this?
> >
> > Regards,
> > Luis
> >
> > Alexei_Roudnev <Alexei_Roudnev at exigengroup.com> wrote:
> >         Absolutely. I know how redo and RAC interact; you are
> >         absolutely correct.
> >
> >         Sometimes CSSD reboots one node and that's all - good luck.
> >         Sometimes OCFS reboots one node and CSSD reboots another node
> >         - bad luck. That's why it is important not to mix different
> >         cluster managers on the same servers, or at least to let them
> >         interact and reach the same decision about _who is master
> >         today_ (i.e. who will survive a split-brain situation).
> >
> >         RAC is the simpler case because Oracle is usually the primary
> >         service - so if it decides to reboot, that's a reasonable
> >         decision. OCFSv2 is another story - sometimes it is a
> >         _secondary_ service (for example, it is used for backups
> >         only), and if it is secondary it had better stop working than
> >         reboot the node.
> >
> >         It reveals 2 big problems (both Oracle and OCFSv2 are
> >         affected):
> >         - A single-interface heartbeat is not reliable. You CANNOT
> >         build a reliable cluster on a single heartbeat channel.
> >         Classical clusters (Veritas VCS) use 2 - 4 different
> >         heartbeat media (we use 2 independent Ethernet hubs and 2
> >         generic Ethernet LANs with Veritas, 2 Ethernets + 1 serial
> >         link in Linux clusters, 2 Ethernets + 1 serial link in a
> >         Cisco PIX cluster, and so on). Neither OCFS nor Oracle RAC
> >         can use more than one (to be precise, you can configure
> >         several interfaces for the RAC interconnect in the SPFILE,
> >         but that will not affect CSSD). In addition, the OCFS
> >         defaults are very strange and unrealistic - Ethernet and FC
> >         cannot guarantee heartbeat times better than about 1 minute
> >         on average (I mean that after any network reconfiguration
> >         the heartbeat will see a 30 - 50 second delay), so if you
> >         configure a 12 second timeout (the OCFSv2 default) you are
> >         at least naive. (A sketch of the relevant timeout knobs
> >         follows this list.)
> >
> >         - Self-fencing is too easy. Again, if an OCFS node loses its
> >         connection to the disk, it should not self-fence immediately
> >         - it can send data to the other nodes (or request it from
> >         them), it can unmount the file system and try to remount it,
> >         it can release control and resume operations later.
> >         Immediate fencing is necessary in SOME cases, but not in
> >         all. If the FS has no pending operations, then fencing by
> >         reboot makes little difference compared with a simple
> >         _remount_. It is not as simple as I explain here, but the
> >         truth is that the fencing decisions are not flexible enough
> >         and decrease reliability dramatically (I posted a list of
> >         scenarios in which fencing should not happen).
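> >
> >         A sketch of those knobs, assuming a release where the
> >         network timeouts are configurable at all (older 1.2.x
> >         kernels hardcode them) - variable names and defaults should
> >         be checked against your ocfs2-tools version:
> >
> >             # /etc/sysconfig/o2cb  (illustrative values)
> >             O2CB_HEARTBEAT_THRESHOLD=31    # disk heartbeat, ~60 s
> >             O2CB_IDLE_TIMEOUT_MS=60000     # o2net idle timeout
> >             O2CB_KEEPALIVE_DELAY_MS=2000   # o2net keepalive probes
> >             O2CB_RECONNECT_DELAY_MS=2000   # o2net reconnect retry
> >
> >         All nodes must use the same values, and the cluster stack
> >         has to be restarted for them to take effect.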
> >
> >         In addition, I have noticed other problems with OCFSv2 too
> >         (such as excessive CPU usage in some cases).
> >
> >         I use OCFSv2, even in production. But I do it with a grain
> >         of salt, have a backup plan for _how to run without it_, and
> >         don't use it for heavily loaded file systems with millions
> >         of files (there I use heartbeat, reiserfs and APC switch
> >         fencing - with 3 independent heartbeats and a 40 second
> >         timeout). So far I have had one glitch with OCFSv2 (when it
> >         remounted read only on one node) and that's all - no other
> >         problems in production (OCFSv2 is used during starts/stops
> >         only, so it is safe). But I run stress tests in the lab, I
> >         am running it in the lab clusters now (including RAC), and
> >         the conclusion is simple - as a cluster, it is not reliable;
> >         as a file system, it may have hidden bugs, so be extra
> >         careful with it.
> >
> >         PS. Good point - it improves every month. Some problems are in
> >         the past already.
> >
> >         PPS. All these lab reboots were caused by extremely heavy
> >         load or by hardware failures (simulated or real). It works
> >         better in real life. But my experience tells me that if I
> >         can break something in the lab in 3 days, it is a matter of
> >         a few months before it breaks in production.
> >
> >                 ----- Original Message ----- 
> >                 From: Luis Freitas
> >                 To: Ocfs2-users at oss.oracle.com
> >                 Sent: Saturday, February 10, 2007 4:52 PM
> >                 Subject: Re: Hmm, here is an example. Re: [Ocfs2-users]
> >                 Also just a comment to the Oracle guys
> >
> >
> >                 Alexei,
> >
> >                     Actually your log seems to show that CSSD (Oracle
> >                 CRS) rebooted the node before OCFS2 got a chance to do
> >                 it.
> >
> >                     On a RAC cluster, if the interconnect is
> >                 interrupted, all the nodes hang until split brain
> >                 resolution is complete and the recovery of all the
> >                 crashed nodes has finished. This is needed because
> >                 every read of an Oracle data block needs a ping to
> >                 the other nodes.
> >
> >                     The view of the data must be consistent: when one
> >                 node reads a particular data block, the Oracle
> >                 Database first pings the other nodes to ensure that
> >                 they have not modified the block without yet having
> >                 flushed it to disk. Another node may even forward a
> >                 reply with the block itself, avoiding the disk access
> >                 (Cache Fusion).
> >
> >                     When a split brain occurs, the blocks not yet
> >                 flushed to disk are lost, and they are rebuilt
> >                 using the redo threads of the particular nodes that
> >                 crashed. During this interval all the database
> >                 instances "freeze", since until node recovery is
> >                 complete there is no way to guarantee that a
> >                 block read from disk has not been altered on a
> >                 crashed node.
> >
> >                     So the fencing is needed even if there is no disk
> >                 activity, as the entire cluster hangs the
> >                 moment the interconnect is down. And the timeout for
> >                 the fencing must be as small as possible to prevent a
> >                 long cluster reconfiguration delay. Of course the
> >                 timeout must be tuned to be larger than Ethernet
> >                 switch failovers, and storage controller or disk
> >                 multipath failovers. Or, if possible, the failover
> >                 times should be reduced.
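> >
> >                     As a back-of-the-envelope example (numbers purely
> >                 illustrative): if a spanning-tree reconvergence can
> >                 take up to ~50 seconds and a multipath failover up to
> >                 ~30 seconds, the fence timeout has to exceed
> >                 max(50, 30) seconds plus some margin - say on the
> >                 order of 60 seconds - whereas a 10-12 second default
> >                 will fire during perfectly routine maintenance
> >                 events.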
> >
> >                    Now, on the other hand, I too am having problems
> >                 with OCFS2. It seems much less robust than ASM and
> >                 the previous version, OCFS, especially under heavy
> >                 disk activity. But I do expect these problems to get
> >                 solved in the near future, as the 2.4 kernel VM
> >                 problems were.
> >
> >                 Regards,
> >                 Luis
> >
> >                 Alexei_Roudnev <Alexei_Roudnev at exigengroup.com> wrote:
> >                         Additional info - the node had no active
> >                         OCFSv2 operations at all (OCFSv2 is used for
> >                         backups only, and only from another node).
> >                         So, if the system had just SUSPENDED all FS
> >                         operations and tried to rejoin the cluster,
> >                         it all could have worked (moreover, the
> >                         connection to the disk system was intact, so
> >                         it could have closed the file system
> >                         gracefully).
> >
> >                         It reveals 3 problems at once:
> >                         - a single heartbeat link (instead of
> >                         multiple links);
> >                         - a timeout that is too short (Ethernet
> >                         can't guarantee 10 seconds, it can guarantee
> >                         about 1 minute minimum);
> >                         - fencing even when the system is passive
> >                         and could remount / reconnect instead of
> >                         rebooting.
> >
> >                         All we did in the lab was _disconnect one of
> >                         the trunks between the switches for a few
> >                         seconds, then plug it back into the socket_.
> >                         No other application failed (including the
> >                         heartbeat clusters). The database cluster was
> >                         not doing anything on OCFS at the time of the
> >                         failure (not even backups).
> >
> >                         I will try heartbeat between loopback
> >                         interfaces (and the OCFS protocol) next time
> >                         (I am just curious whether it can provide 10
> >                         seconds for network reconfiguration).
> >
> >                         ...
> >                         Feb  1 12:19:13 testrac12 kernel: o2net:
> >                         connection to node testrac11 (num 0) at
> >                         10.254.32.111:7777 has been idle for 10
> >                         seconds, shutting it down.
> >                         Feb  1 12:19:13 testrac12 kernel:
> >                         (13,3):o2net_idle_timer:1310 here are some
> >                         times that might help debug the situation:
> >                         (tmr 1170361135.521061 now 1170361145.520476
> >                         dr 1170361141.852795 adv
> >                         1170361135.521063:1170361135.521064 func
> >                         (c4378452:505)
> >                         1170361067.762941:1170361067.762967)
> >                         Feb  1 12:19:13 testrac12 kernel: o2net: no
> >                         longer connected to node testrac11 (num 0) at
> >                         10.254.32.111:7777
> >                         Feb  1 12:19:13 testrac12 kernel:
> >                         (1855,3):dlm_send_remote_convert_request:398
> >                         ERROR: status = -107
> >                         Feb  1 12:19:13 testrac12 kernel:
> >                         (1855,3):dlm_wait_for_node_death:371
> >                         5AECFF0BBCF74F069A3B8FF79F09FB5A: waiting
> >                         5000ms for notification of death of node 0
> >                         Feb  1 12:19:13 testrac12 kernel:
> >                         (1855,1):dlm_send_remote_convert_request:398
> >                         ERROR: status = -107
> >                         Feb  1 12:19:13 testrac12 kernel:
> >                         (1855,1):dlm_wait_for_node_death:371
> >                         5AECFF0BBCF74F069A3B8FF79F09FB5A: waiting
> >                         5000ms for notification of death of node 0
> >                         Feb  1 12:22:22 testrac12 kernel:
> >                         (1855,2):dlm_send_remote_convert_request:398
> >                         ERROR: status = -107
> >                         Feb  1 12:22:22 testrac12 kernel:
> >                         (1855,2):dlm_wait_for_node_death:371
> >                         5AECFF0BBCF74F069A3B8FF79F09FB5A: waiting
> >                         5000ms for notification of death of node 0
> >                         Feb  1 12:22:27 testrac12 kernel:
> >                         (13,3):o2quo_make_decision:144 ERROR: fencing
> >                         this node because it is connected to a
> >                         half-quorum of 1 out of 2 nodes which doesn't
> >                         include the lowest active node 0
> >                         Feb  1 12:22:27 testrac12 kernel:
> >                         (13,3):o2hb_stop_all_regions:1889 ERROR:
> >                         stopping heartbeat on all active regions.
> >                         Feb  1 12:22:27 testrac12 kernel: Kernel
> >                         panic: ocfs2 is very sorry to be fencing this
> >                         system by panicing
> >                         Feb  1 12:22:27 testrac12 kernel:
> >                         Feb  1 12:22:28 testrac12 su: pam_unix2:
> >                         session finished for user oracle, service su
> >                         Feb  1 12:22:29 testrac12 logger: Oracle CSSD
> >                         failure.  Rebooting for cluster integrity.
> >                         Feb  1 12:22:32 testrac12 su: pam_unix2:
> >                         session finished for user oracle, service su
> >                         ...
> >
> -- 
> Andy Phillips
> Systems Architecture Manager, Betfair.com
>
> Office: 0208 8348436
>
> Betfair Ltd|Winslow Road|Hammersmith Embankment|London|W69HP
> Company No. 5140986
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>




More information about the Ocfs2-users mailing list