[Ocfs2-users] kernel panic - not syncing

Andy Phillips andrew.phillips at betfair.com
Mon Jan 22 10:48:19 PST 2007


With respect, Sunil,

the problems I see normally go like this:

- o2net timeout - socket closes.
Aug  2 19:06:27 fred kernel: o2net: connection to node barney (num 0) at
172.16.6.10:7777 has been idle for 10 seconds, shutting it down.
Aug  2 19:06:27 fred kernel: (0,7):o2net_idle_timer:1309 here are some
times that might help debug the situation: (tmr 1154545576.798263 now

- Upper layers realise they have no connection, and panic the box.
  
Aug  2 19:06:27 fred kernel: o2net: no longer connected to node barney
(num 0) at 172.16.6.10:7777
Aug  2 19:08:33 fred kernel: (25,7):o2quo_make_decision:143 ERROR:
fencing this node because it is connected to
a half-quorum of 1 out of 2 nodes which doesn't include the lowest
active node 0

Irrespective of that, the o2net message observed comes from the o2net
idle timeout (O2NET_IDLE_TIMEOUT_SECS in the header quoted below), not
the o2cb heartbeat. The code that is probably giving you that error
message is in the function o2net_idle_timer, which is referenced in your
error message, in ocfs2-1.2.3/fs/ocfs2/cluster/tcp.c:

        printk(KERN_INFO "o2net: connection to " SC_NODEF_FMT " has been idle for 10 "
             "seconds, shutting it down.\n", SC_NODEF_ARGS(sc));
        mlog(ML_NOTICE, "here are some times that might help debug the "
             "situation: (tmr %ld.%ld now %ld.%ld dr %ld.%ld adv "
             "%ld.%ld:%ld.%ld func (%08x:%u) %ld.%ld:%ld.%ld)\n",
             sc->sc_tv_timer.tv_sec, sc->sc_tv_timer.tv_usec,
             now.tv_sec, now.tv_usec,
             sc->sc_tv_data_ready.tv_sec, sc->sc_tv_data_ready.tv_usec,
             sc->sc_tv_advance_start.tv_sec, sc->sc_tv_advance_start.tv_usec,
             sc->sc_tv_advance_stop.tv_sec, sc->sc_tv_advance_stop.tv_usec,
             sc->sc_msg_key, sc->sc_msg_type,
             sc->sc_tv_func_start.tv_sec, sc->sc_tv_func_start.tv_usec,
             sc->sc_tv_func_stop.tv_sec, sc->sc_tv_func_stop.tv_usec);
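
As a rough aid for reading those "tmr ... now ..." values, here is a
minimal sketch in C that turns two of the printed timestamps into an
elapsed time. It is not part of ocfs2: the names log_time,
parse_log_time and elapsed_usec are made up for illustration, and it
assumes that "tmr" is the time the idle timer was last armed. The one
fact it relies on from the excerpt above is that the values are printed
with "%ld.%ld", so the microsecond field is not zero-padded.

```c
#include <assert.h>
#include <stdlib.h>

/* One "sec.usec" pair as printed by the "%ld.%ld" format (hypothetical type). */
struct log_time {
    long sec;
    long usec;
};

/*
 * Parse a "sec.usec" token copied from the log line.
 * Because the field is printed with "%ld.%ld", "1169487957.71650" means
 * 71650 microseconds (0.071650 s), not 0.71650 s.
 * Returns 0 on success, -1 if the token is malformed.
 */
static int parse_log_time(const char *s, struct log_time *out)
{
    char *end;

    assert(s != NULL && out != NULL);

    out->sec = strtol(s, &end, 10);
    if (end == s || *end != '.')
        return -1;

    s = end + 1;
    out->usec = strtol(s, &end, 10);
    if (end == s || out->usec < 0 || out->usec >= 1000000)
        return -1;

    return 0;
}

/* Elapsed time from a to b, in microseconds. */
static long long elapsed_usec(const struct log_time *a,
                              const struct log_time *b)
{
    return (long long)(b->sec - a->sec) * 1000000LL + (b->usec - a->usec);
}
```

Fed the values from the original report quoted below (tmr
1169487957.71650, now 1169487967.69569), elapsed_usec gives 9997919
microseconds, i.e. just under 10 seconds, which is what you would expect
if the 10 second idle timeout is what fired.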

The original post only included that error message, but the other error
messages usually follow. If I'm wrong, please email me directly and help
sort out my understanding.

Andy

On Mon, 2007-01-22 at 10:38 -0800, Sunil Mushran wrote:
> An o2net timeout cannot cause the o2hb panic. The two are totally
> different. From the outputs, I would guess o2hb is timing out but
> I cannot say for sure until I see the full logs.
> 
> Andy Phillips wrote:
> > It's worth pointing out that the o2net idle timer is triggering on the 
> > network heartbeat, which is 10 seconds, in the current 1.2.x series.
> >
> >
> > O2CB_HEARTBEAT_THRESHOLD has no effect on this, because it's another part
> > of the code which causes the problem.
> >
> > See ocfs2-1.2.3/fs/ocfs2/cluster/tcp_internal.h
> > #define O2NET_IDLE_TIMEOUT_SECS         10
> >
> > Andy
> >
> >
> > On Mon, 2007-01-22 at 09:29 -0800, Srinivas Eeda wrote:
> >   
> > >> The problem appears to be that I/O is taking longer than the effective O2CB_HEARTBEAT_THRESHOLD allows. Your configured value "31" doesn't seem to be in effect.
> >>
> > >> Index 6: took 1995 ms to do msleep
> >> Index 17: took 1996 ms to do msleep
> >> Index 22: took 10001 ms to do waiting for read completion.
> >>
> >> Can you please cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold and verify. 
> >>
> >> Thanks,
> >> --Srini.
> >>
> >>
> >>
> >>
> >> Consulente3 wrote:
> >>     
> >>> Hi all, 
> >>>
> > >>> My test environment consists of 2 servers running CentOS 4.4.
> > >>> The nodes export storage with aoe6-43 + vblade-14.
> >>>
> >>> kernel-2.6.9-42.0.3.EL
> >>> ocfs2-tools-1.2.2-1
> >>> ocfs2console-1.2.2-1
> >>> ocfs2-2.6.9-42.0.3.EL-1.2.3-1
> >>>
> >>> /dev/etherd/e2.0 on /ocfs2 type ocfs2 (rw,_netdev,heartbeat=local)
> >>> /dev/etherd/e3.0 on /ocfs2_nfs type ocfs2 (rw,_netdev,heartbeat=local)
> >>>
> >>> Device                FS     Nodes
> >>> /dev/etherd/e2.0      ocfs2  ocfs2, becks
> >>> /dev/etherd/e3.0      ocfs2  ocfs2, becks
> >>>
> >>> Device                FS     UUID                                  Label
> >>> /dev/etherd/e2.0      ocfs2  b24cc18d-af89-4980-a75e-a87530b1b878  test1
> >>> /dev/etherd/e3.0      ocfs2  101a92fd-b83b-4294-8bfc-fbaa069c3239  nfs4
> >>>
> >>> O2CB_HEARTBEAT_THRESHOLD=31
> >>>
> > >>> When I run a stress test:
> >>>
> >>> Index 4: took 0 ms to do checking slots
> >>> Index 5: took 2 ms to do waiting for write completion
> >>> Index 6: took 1995 ms to do msleep
> >>> Index 7: took 0 ms to do allocating bios for read
> >>> Index 8: took 0 ms to do bio alloc read
> >>> Index 9: took 0 ms to do bio add page read
> >>> Index 10: took 0 ms to do submit_bio for read
> >>> Index 11: took 2 ms to do waiting for read completion
> >>> Index 12: took 0 ms to do bio alloc write
> >>> Index 13: took 0 ms to do bio add page write
> >>> Index 14: took 0 ms to do submit_bio for write
> >>> Index 15: took 0 ms to do checking slots
> >>> Index 16: took 1 ms to do waiting for write completion
> >>> Index 17: took 1996 ms to do msleep
> >>> Index 18: took 0 ms to do allocating bios for read
> > >>> Index 19: took 0 ms to do bio alloc read
> >>> Index 20: took 0 ms to do bio add page read
> >>> Index 21: took 0 ms to do submit_bio for read
> >>> Index 22: took 10001 ms to do waiting for read completion
> >>> (3,0):o2hb_stop_all_regions:1908 ERROR: stopping heartbeat on all active
> >>> regions.
> >>> Kernel panic - not syncing: ocfs2 is very sorry to be fencing this
> >>> system by panicing
> >>>
> >>>
> >>> <6>o2net: connection to node ocfs2 (num 2) at 10.1.7.107:777 has been
> >>> idle for 10 seconds, shutting it down
> >>> (3,0): o2net_idle_timer:1309 here are some times that might help debug
> >>> the situation:
> >>> (tmr: 1169487957.71650 now 1169487967.69569 dr 1169487962.88883 adv
> >>> 1169487957.71671:1159487957.71674
> >>> func 83bce37b2:505) 1169487901.984644:1169487901.984676)
> >>>
> > >>> The kernel panic always occurs on the same node, and the other node
> > >>> keeps responding.
> >>>
> >>> thanks!
> >>>                                                                  
> >>>
> >>> _______________________________________________
> >>> Ocfs2-users mailing list
> >>> Ocfs2-users at oss.oracle.com
> >>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
> >>>   
> >>>       
-- 
Andy Phillips
Systems Architecture Manager, Betfair.com

Office: 0208 8348436

Betfair Ltd|Winslow Road|Hammersmith Embankment|London|W69HP 
Company No. 5140986 