[Ocfs2-users] Self-fencing issues (RHEL4)

Yegor Gorshkov oracle-dba at tonesoft.com
Tue Apr 18 18:38:58 CDT 2006


Hi.

I'm running RHEL4 on my test system, with Adaptec FireWire controllers,
a Maxtor One Touch III shared disk (details below), and a dedicated
100Mb/s interconnect. It panics under no load about every 20 minutes
(netconsole error message below).

Any clues?

Yegor

---
[root@rac1 ~]# cat /proc/fs/ocfs2/version
OCFS2 1.2.0 Tue Mar  7 15:51:20 PST 2006 (build db06cd9cd891710e73c5d89a6b4d8812)
[root@rac1 ~]#
---
[root@rac1 ~]# lspci
06:02.0 FireWire (IEEE 1394): Agere Systems FW323 (rev 61)
[root@rac1 ~]#
---
[root@rac1 ~]# cat /boot/grub/menu.lst
default=0
timeout=3
title Red Hat Enterprise Linux AS (2.6.9-34.ELsmp)
         root (hd0,0)
         kernel /vmlinuz-2.6.9-34.ELsmp ro root=/dev/VolGroup00/LogVol00 elevator=deadline vga=0xf05
         initrd /initrd-2.6.9-34.ELsmp.img
title Red Hat Enterprise Linux AS-up (2.6.9-34.EL)
         root (hd0,0)
         kernel /vmlinuz-2.6.9-34.EL ro root=/dev/VolGroup00/LogVol00 elevator=deadline vga=0xf05
         initrd /initrd-2.6.9-34.EL.img
[root@rac1 ~]#
---
[root@rac1 ~]# cat /etc/sysconfig/o2cb
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
O2CB_HEARTBEAT_THRESHOLD=16
[root@rac1 ~]#
---
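
If I read the OCFS2 FAQ correctly, O2CB_HEARTBEAT_THRESHOLD maps to the
disk heartbeat timeout as (threshold - 1) * 2 seconds. A quick sketch of
that arithmetic; the 2-second heartbeat interval and both function names
are my assumptions for illustration, not anything taken from the o2cb
tools:

```python
# Assumption: o2hb writes a heartbeat every 2 seconds and the node fences
# itself once (threshold - 1) intervals pass without a successful write.
HEARTBEAT_INTERVAL_SECS = 2


def heartbeat_timeout_ms(threshold: int) -> int:
    """Disk heartbeat write timeout, in milliseconds, for a threshold."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS * 1000


def threshold_for_timeout(timeout_secs: int) -> int:
    """Threshold needed to tolerate timeout_secs of stalled disk I/O."""
    return timeout_secs // HEARTBEAT_INTERVAL_SECS + 1


print(heartbeat_timeout_ms(16))   # 30000
print(threshold_for_timeout(60))  # 31
```

With the threshold of 16 configured above, this gives 30000 ms, which is
the figure reported in the heartbeat write timeout error in the crash
message.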


Crash message:
---
Apr 18 15:54:43 rac1/rac1 (2858,1):o2net_set_nn_state:426 accepted connection from node rac2 (num 1) at 10.0.1.2:7777
Apr 18 15:54:47 rac1/rac1 (2858,1):__dlm_print_nodes:384 Nodes in my domain ("CA641AEC0417495BA7302FC14F6F99B7"):
Apr 18 15:54:47 rac1/rac1 (2858,1):__dlm_print_nodes:388  node 0
Apr 18 15:54:47 rac1/rac1 (2858,1):__dlm_print_nodes:388  node 1
Apr 18 15:54:51 rac1/rac1 (2858,1):__dlm_print_nodes:384 Nodes in my domain ("8BD4774D69C44FDC8FD8EC5E13EA9996"):
Apr 18 15:54:51 rac1/rac1 (2858,1):__dlm_print_nodes:388  node 0
Apr 18 15:54:51 rac1/rac1 (2858,1):__dlm_print_nodes:388  node 1
Apr 18 15:56:43 rac2/rac2 (3,0):o2hb_write_timeout:164 ERROR: Heartbeat write timeout to device sda5 after 30000 milliseconds
Apr 18 15:56:43 rac2/rac2 (3,0):o2hb_stop_all_regions:1727 ERROR: stopping heartbeat on all active regions.
Apr 18 15:56:43 rac2/rac2 Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing
Apr 18 15:56:43 rac2/rac2
Apr 18 15:56:45 rac1/rac1 (2903,0):o2net_set_nn_state:411 no longer connected to node rac2 (num 1) at 10.0.1.2:7777
Apr 18 15:56:45 rac1/rac1 (2897,1):dlm_send_proxy_ast_msg:448 ERROR: status = -107
Apr 18 15:56:45 rac1/rac1 (2897,1):dlm_flush_asts:556 ERROR: status = -107
Apr 18 15:56:46 rac1/rac1 (19545,0):ocfs2_replay_journal:1172 Recovering node 1 from slot 1 on device (8,41)
Apr 18 15:56:46 rac1/rac1 (19544,0):ocfs2_replay_journal:1172 Recovering node 1 from slot 1 on device (8,37)
Apr 18 15:56:51 rac2/rac2 <0>Rebooting in 60 seconds..<5>(3,0):o2net_idle_timer:1310 connection to node rac1 (num 0) at 10.0.1.1:7777 has been idle for 10 seconds, shutting it down.
Apr 18 15:56:51 rac2/rac2 (3,0):o2net_idle_timer:1321 here are some times that might help debug the situation: (tmr 1145401001.986417 now 1145401011.984614 dr 1145401005.947636 adv 1145401001.986418:1145401001.986418 func (f7672ffa:505) 1145400976.990658:1145400976.990672)
---
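
The timestamps in the last o2net_idle_timer line look like epoch seconds,
so the gap between "tmr" and "now" can be checked against the reported
10-second idle period. A small sketch; my reading of "tmr" as the moment
the idle timer was armed is an assumption, since the log does not say
what the fields mean:

```python
from decimal import Decimal

# Timestamps copied from the o2net_idle_timer line above (epoch seconds).
# Assumption: "tmr" is when the idle timer was last armed and "now" is
# when it fired; Decimal avoids float rounding on the microseconds.
tmr = Decimal("1145401001.986417")
now = Decimal("1145401011.984614")

idle_secs = now - tmr
print(idle_secs)  # 9.998197
```

That is roughly 10 seconds, consistent with the "has been idle for 10
seconds" message logged by rac2.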


