[Ocfs2-users] 2 OCFS2 clusters that affect each other

Sunil Mushran Sunil.Mushran at oracle.com
Thu Feb 15 12:04:25 PST 2007


Do you have the full oops trace?

Nathan Ehresman wrote:
> I have a strange OCFS2 problem that has been plaguing me.  I have 2 
> separate OCFS2 clusters, each consisting of 3 machines.  One is an 
> Oracle RAC, the other is used as a shared DocumentRoot for a web 
> cluster.  All 6 machines are in an IBM Bladecenter and thus are nearly 
> identical hardware and use the same ethernet switch and FC switch.  
> All 6 machines connect to the same SAN but mount completely different 
> partitions (LVMed).  The 3 RAC nodes are running RHEL kernel 
> 2.6.9-34.0.2.ELsmp and the 3 web heads are running kernel 
> 2.6.9-42.0.3.  All 6 machines are running OCFS2 1.2.4.  Also, all 6 
> nodes have their O2CB_HEARTBEAT_THRESHOLD set to 31, since the 
> timeout on my HBAs appears to be set to 60 seconds.
>
> Every once in a while if two of the web heads are powered on at the 
> same time and begin to mount the shared OCFS2 partition, one of my 
> Oracle nodes will complain that OCFS2 is self-fencing and will then 
> reboot (thanks to the hangcheck timer).  It is always the 2nd 
> node in the RAC cluster that does this while nodes 1 and 3 stay up 
> just fine.  I have the following stack trace taken from a netdump of 
> the kernel on RAC node 2 when it goes down, but I am not familiar 
> enough with OCFS2 internals to read it.  Can anybody read this and 
> give me any insight into what might be causing this problem?
>
>
>  [<c0129a20>] check_timer_failed+0x3c/0x58
>  [<c0129c7d>] del_timer+0x12/0x65
>  [<f88f326b>] qla2x00_done+0x2c6/0x37a [qla2xxx]
>  [<f88fe7f6>] qla2300_intr_handler+0x25a/0x267 [qla2xxx]
>  [<c0107472>] handle_IRQ_event+0x25/0x4f
>  [<c01079d2>] do_IRQ+0x11c/0x1ae
>  =======================
>  [<c02d304c>] common_interrupt+0x18/0x20
>  [<f8c9007b>] ocfs2_do_truncate+0x37a/0xb84 [ocfs2]
>  [<c02d122b>] _spin_lock+0x27/0x34
>  [<f8c9700c>] ocfs2_cluster_lock+0xf2/0x894 [ocfs2]
>  [<f8c96ea1>] ocfs2_status_completion_cb+0x0/0xa [ocfs2]
>  [<f8c99444>] ocfs2_meta_lock_full+0x1e7/0x57e [ocfs2]
>  [<c016e4c0>] dput+0x34/0x1a7
>  [<c01668c8>] link_path_walk+0x94/0xbe
>  [<c01672e3>] open_namei+0x99/0x579
>  [<f8ca7625>] ocfs2_inode_revalidate+0x11a/0x1f9 [ocfs2]
>  [<f8ca3808>] ocfs2_getattr+0x0/0x14d [ocfs2]
>  [<f8ca386b>] ocfs2_getattr+0x63/0x14d [ocfs2]
>  [<f8ca3808>] ocfs2_getattr+0x0/0x14d [ocfs2]
>  [<c0161fa2>] vfs_getattr+0x35/0x88
>  [<c016201d>] vfs_stat+0x28/0x3a
>  [<c01672e3>] open_namei+0x99/0x579
>  [<c015990b>] filp_open+0x66/0x70
>  [<c0162612>] sys_stat64+0xf/0x23
>  [<c02d0ca2>] __cond_resched+0x14/0x39
>  [<c01c23c2>] direct_strncpy_from_user+0x3e/0x5d
>  [<c0159c7f>] sys_open+0x6a/0x7d
>  [<c02d268f>] syscall_call+0x7/0xb
>
>
> Thanks,
>
> Nathan
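
For what it's worth, the threshold quoted above does line up with the
HBA timeout: with the default 2-second o2hb heartbeat interval, a node
is fenced after roughly (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, so
a threshold of 31 corresponds to 60 seconds.  A minimal sketch of that
arithmetic (assumed defaults, not taken from this cluster's config):

  # o2hb writes a heartbeat every 2 seconds by default; a node is
  # considered dead after (O2CB_HEARTBEAT_THRESHOLD - 1) missed writes.
  HEARTBEAT_INTERVAL_SECS = 2        # o2cb default (assumption)
  O2CB_HEARTBEAT_THRESHOLD = 31      # value reported above

  fence_timeout = (O2CB_HEARTBEAT_THRESHOLD - 1) * HEARTBEAT_INTERVAL_SECS
  print(fence_timeout)               # 60 -> matches the 60-second HBA timeout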


