[Ocfs2-users] Self-fencing issues (RHEL4)

Sunil Mushran Sunil.Mushran at oracle.com
Tue Apr 18 19:38:59 CDT 2006


There should be more messages on the netdump server.
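The netdump-captured console log is the next thing to check. A minimal sketch, assuming the stock RHEL4 netdump-server layout where each client's output lands in a per-IP directory under /var/crash (verify the actual path in your server's netdump configuration):

```shell
# Assumption: /var/crash is the netdump-server capture directory;
# each crashing client gets its own subdirectory containing a "log"
# file with the full console output (and a vmcore if capture finished).
CRASH_DIR=/var/crash
# Show any captured logs, newest first; print a note if none exist.
ls -lt "$CRASH_DIR"/*/log 2>/dev/null || echo "no netdump logs under $CRASH_DIR"
```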

Yegor Gorshkov wrote:
> Hi.
>
> I'm running RHEL4 for my test system, Adaptec Firewire controllers,
> Maxtor One Touch III shared disk (see the details below),
> 100Mb/s dedicated interconnect. It panics under no load roughly every
> 20 minutes (error message from netconsole is attached below).
>
> Any clues?
>
> Yegor
>
> ---
> [root@rac1 ~]# cat /proc/fs/ocfs2/version
> OCFS2 1.2.0 Tue Mar  7 15:51:20 PST 2006 (build 
> db06cd9cd891710e73c5d89a6b4d8812)
> [root@rac1 ~]#
> ---
> [root@rac1 ~]# lspci
> 06:02.0 FireWire (IEEE 1394): Agere Systems FW323 (rev 61)
> [root@rac1 ~]#
> ---
> [root@rac1 ~]# cat /boot/grub/menu.lst
> default=0
> timeout=3
> title Red Hat Enterprise Linux AS (2.6.9-34.ELsmp)
>          root (hd0,0)
>          kernel /vmlinuz-2.6.9-34.ELsmp ro root=/dev/VolGroup00/LogVol00 
> elevator=deadline vga=0xf05
>          initrd /initrd-2.6.9-34.ELsmp.img
> title Red Hat Enterprise Linux AS-up (2.6.9-34.EL)
>          root (hd0,0)
>          kernel /vmlinuz-2.6.9-34.EL ro root=/dev/VolGroup00/LogVol00 
> elevator=deadline vga=0xf05
>          initrd /initrd-2.6.9-34.EL.img
> [root@rac1 ~]#
> ---
> [root@rac1 ~]# cat /etc/sysconfig/o2cb
> O2CB_ENABLED=true
> O2CB_BOOTCLUSTER=ocfs2
> O2CB_HEARTBEAT_THRESHOLD=16
> [root@rac1 ~]#
> ---
>
>
> Crash message:
> ---
> Apr 18 15:54:43 rac1/rac1 (2858,1):o2net_set_nn_state:426 accepted 
> connection from node rac2 (num 1) at 10.0.1.2:7777
> Apr 18 15:54:47 rac1/rac1 (2858,1):__dlm_print_nodes:384 Nodes in my 
> domain ("CA641AEC0417495BA7302FC14F6F99B7"):
> Apr 18 15:54:47 rac1/rac1 (2858,1):__dlm_print_nodes:388  node 0
> Apr 18 15:54:47 rac1/rac1 (2858,1):__dlm_print_nodes:388  node 1
> Apr 18 15:54:51 rac1/rac1 (2858,1):__dlm_print_nodes:384 Nodes in my 
> domain ("8BD4774D69C44FDC8FD8EC5E13EA9996"):
> Apr 18 15:54:51 rac1/rac1 (2858,1):__dlm_print_nodes:388  node 0
> Apr 18 15:54:51 rac1/rac1 (2858,1):__dlm_print_nodes:388  node 1
> Apr 18 15:56:43 rac2/rac2 (3,0):o2hb_write_timeout:164 ERROR: Heartbeat 
> write timeout to device sda5 after 30000 milliseconds
> Apr 18 15:56:43 rac2/rac2 (3,0):o2hb_stop_all_regions:1727 ERROR: 
> stopping heartbeat on all active regions.
> Apr 18 15:56:43 rac2/rac2 Kernel panic - not syncing: ocfs2 is very 
> sorry to be fencing this system by panicing
> Apr 18 15:56:43 rac2/rac2
> Apr 18 15:56:45 rac1/rac1 (2903,0):o2net_set_nn_state:411 no longer 
> connected to node rac2 (num 1) at 10.0.1.2:7777
> Apr 18 15:56:45 rac1/rac1 (2897,1):dlm_send_proxy_ast_msg:448 ERROR: 
> status = -107
> Apr 18 15:56:45 rac1/rac1 (2897,1):dlm_flush_asts:556 ERROR: status = -107
> Apr 18 15:56:46 rac1/rac1 (19545,0):ocfs2_replay_journal:1172 Recovering 
> node 1 from slot 1 on device (8,41)
> Apr 18 15:56:46 rac1/rac1 (19544,0):ocfs2_replay_journal:1172 Recovering 
> node 1 from slot 1 on device (8,37)
> Apr 18 15:56:51 rac2/rac2 <0>Rebooting in 60 
> seconds..<5>(3,0):o2net_idle_timer:1310 connection to node rac1 (num 0) 
> at 10.0.1.1:7777 has been idle for 10 seconds, shutting it down.
> Apr 18 15:56:51 rac2/rac2 (3,0):o2net_idle_timer:1321 here are some 
> times that might help debug the situation: (tmr 1145401001.986417 now 
> 1145401011.984614 dr 1145401005.947636 adv 
> 1145401001.986418:1145401001.986418 func (f7672ffa:505) 
> 1145400976.990658:1145400976.990672)
> ---
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>   
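For what it's worth, the "30000 milliseconds" in the o2hb_write_timeout line follows directly from O2CB_HEARTBEAT_THRESHOLD=16 in the posted config: in OCFS2 1.2 the disk heartbeat timeout is (threshold - 1) * 2 seconds. A sketch of the arithmetic, plus the usual workaround for slow storage such as a FireWire disk (assumptions: the threshold must be identical on every node, and the stack is restarted via the o2cb init script after editing /etc/sysconfig/o2cb):

```shell
# o2cb disk-heartbeat timeout = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds.
THRESHOLD=16
TIMEOUT_MS=$(( (THRESHOLD - 1) * 2 * 1000 ))
echo "current timeout: ${TIMEOUT_MS} ms"      # 30000 ms, matching the log

# Raising the threshold tolerates slower heartbeat writes, e.g. 31 -> 60 s:
NEW_THRESHOLD=31
NEW_TIMEOUT_MS=$(( (NEW_THRESHOLD - 1) * 2 * 1000 ))
echo "proposed timeout: ${NEW_TIMEOUT_MS} ms"
# Then, on every node: set O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb
# and restart the cluster stack (e.g. "service o2cb restart").
```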


