[Ocfs2-users] kernel panic - not syncing

Sunil Mushran Sunil.Mushran at oracle.com
Mon Jan 22 11:21:27 PST 2007


I understand that. But that's not what the user experienced in this case.
One node ran into the o2hb timeout (and panic) that caused the o2net
message on the other node.

These are two separate issues. FWIW, I am trying to get the o2net config
backported to the 1.2 tree.
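
For reference, the two timeouts are sized very differently. The disk
heartbeat threshold is counted in roughly 2-second iterations, so the
effective dead time is about (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds,
while the o2net idle timeout is a flat 10 seconds in 1.2.x. A minimal
sketch of that arithmetic (the 2000 ms interval matches
O2HB_REGION_TIMEOUT_MS in the 1.2 heartbeat code; the helper itself is
illustrative, not code from the tree):

#include <stdio.h>

/* o2hb writes/reads its heartbeat slot roughly every 2000 ms */
#define O2HB_REGION_TIMEOUT_MS 2000

/* approximate seconds before a node is declared dead */
static unsigned int hb_dead_seconds(unsigned int threshold)
{
        return (threshold - 1) * O2HB_REGION_TIMEOUT_MS / 1000;
}

int main(void)
{
        /* the threshold of 31 used below works out to ~60 seconds */
        printf("%u seconds\n", hb_dead_seconds(31));
        return 0;
}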

Andy Phillips wrote:
> With respect Sunil,
>
>  the observed problems I see normally go like this:
>
> - o2net timeout - socket closes.
> Aug  2 19:06:27 fred kernel: o2net: connection to node barney (num 0) at
> 172.16.6.10:7777 has been idle for 10 seconds, shutting it down.
> Aug  2 19:06:27 fred kernel: (0,7):o2net_idle_timer:1309 here are some
> times that might help debug the situation: (tmr 1154545576.798263 now
>
> - Upper layers realise they have no connection, and panic the box.
>   
> Aug  2 19:06:27 fred kernel: o2net: no longer connected to node barney
> (num 0) at 172.16.6.10:7777
> Aug  2 19:08:33 fred kernel: (25,7):o2quo_make_decision:143 ERROR:
> fencing this node because it is connected to
> a half-quorum of 1 out of 2 nodes which doesn't include the lowest
> active node 0
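>
> That o2quo message is the standard two-node tie-break: when a node can
> only reach half of the live cluster, it survives only if that half
> includes the lowest-numbered active node. A rough sketch of the rule as
> I understand it, not code lifted from quorum.c:
>
> /* illustrative only: mirrors the fencing decision logged above */
> static int should_fence(int connected, int heartbeating,
>                         int lowest_node_connected)
> {
>         /* fewer than half the live nodes reachable: fence */
>         if (2 * connected < heartbeating)
>                 return 1;
>         /* exactly half: only the half holding the lowest node survives */
>         if (2 * connected == heartbeating && !lowest_node_connected)
>                 return 1;
>         return 0;
> }
>
> In the log above, the node sees 1 of 2 nodes and node 0 is not
> reachable, hence the fence.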
>
> Irrespective of that, the o2net message observed comes about due to the
> value of O2NET_IDLE_TIMEOUT_SECS, not the o2cb heartbeat. The code that
> is probably giving you that error message is below.
>
> The function o2net_idle_timer, which is referenced in your error
> message, is in ocfs2-1.2.3/fs/ocfs2/cluster/tcp.c:
>
> printk(KERN_INFO "o2net: connection to " SC_NODEF_FMT " has been idle "
>        "for 10 seconds, shutting it down.\n", SC_NODEF_ARGS(sc));
> mlog(ML_NOTICE, "here are some times that might help debug the "
>      "situation: (tmr %ld.%ld now %ld.%ld dr %ld.%ld adv "
>      "%ld.%ld:%ld.%ld func (%08x:%u) %ld.%ld:%ld.%ld)\n",
>      sc->sc_tv_timer.tv_sec, sc->sc_tv_timer.tv_usec,
>      now.tv_sec, now.tv_usec,
>      sc->sc_tv_data_ready.tv_sec, sc->sc_tv_data_ready.tv_usec,
>      sc->sc_tv_advance_start.tv_sec, sc->sc_tv_advance_start.tv_usec,
>      sc->sc_tv_advance_stop.tv_sec, sc->sc_tv_advance_stop.tv_usec,
>      sc->sc_msg_key, sc->sc_msg_type,
>      sc->sc_tv_func_start.tv_sec, sc->sc_tv_func_start.tv_usec,
>      sc->sc_tv_func_stop.tv_sec, sc->sc_tv_func_stop.tv_usec);
>
> The original post only included that error message, but the other error
> messages usually follow. If I'm wrong, please email me directly and help
> sort out my understanding.
>
> Andy
>
> On Mon, 2007-01-22 at 10:38 -0800, Sunil Mushran wrote:
>   
>> o2net timeout cannot cause the o2hb panic. The two are totally
>> different. From the outputs, I would guess o2hb is timing out, but
>> I cannot say for sure until I see the full logs.
>>
>> Andy Phillips wrote:
>>     
>>> It's worth pointing out that the o2net idle timer is triggering on the
>>> network heartbeat, which is 10 seconds in the current 1.2.x series.
>>>
>>>
>>> O2CB_HEARTBEAT_THRESHOLD has no effect on this, because a different
>>> part of the code produces that message.
>>>
>>> see ocfs2-1.2.3/fs/ocfs2/cluster/tcp_internal.h
>>> #define O2NET_IDLE_TIMEOUT_SECS         10
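>>>
>>> That constant feeds the idle timer directly; from memory the rearm in
>>> tcp.c looks roughly like this (paraphrased, so treat the exact shape
>>> as approximate):
>>>
>>> static void o2net_sc_reset_idle_timer(struct o2net_sock_container *sc)
>>> {
>>>         /* record when the timer was (re)armed, for the debug dump */
>>>         do_gettimeofday(&sc->sc_tv_timer);
>>>         /* fire o2net_idle_timer() if nothing arrives for 10 seconds */
>>>         mod_timer(&sc->sc_idle_timeout,
>>>                   jiffies + (O2NET_IDLE_TIMEOUT_SECS * HZ));
>>> }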
>>>
>>> Andy
>>>
>>>
>>> On Mon, 2007-01-22 at 09:29 -0800, Srinivas Eeda wrote:
>>>   
>>>       
>>>> The problem appears to be that I/O is taking more time than the effective O2CB_HEARTBEAT_THRESHOLD allows. Your configured value of "31" doesn't seem to be taking effect.
>>>>
>>>> Index 6: took 1995 ms to do msleep
>>>> Index 17: took 1996 ms to do msleep
>>>> Index 22: took 10001 ms to do waiting for read completion
>>>>
>>>> Can you please cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold and verify?
>>>>
>>>> Thanks,
>>>> --Srini.
>>>>
>>>>
>>>>
>>>>
>>>> Consulente3 wrote:
>>>>     
>>>>         
>>>>> Hi all,
>>>>>
>>>>> my test environment is composed of 2 servers running CentOS 4.4;
>>>>> the nodes export the storage with aoe6-43 + vblade-14
>>>>>
>>>>> kernel-2.6.9-42.0.3.EL
>>>>> ocfs2-tools-1.2.2-1
>>>>> ocfs2console-1.2.2-1
>>>>> ocfs2-2.6.9-42.0.3.EL-1.2.3-1
>>>>>
>>>>> /dev/etherd/e2.0 on /ocfs2 type ocfs2 (rw,_netdev,heartbeat=local)
>>>>> /dev/etherd/e3.0 on /ocfs2_nfs type ocfs2 (rw,_netdev,heartbeat=local)
>>>>>
>>>>> Device                FS     Nodes
>>>>> /dev/etherd/e2.0      ocfs2  ocfs2, becks
>>>>> /dev/etherd/e3.0      ocfs2  ocfs2, becks
>>>>>
>>>>> Device                FS     UUID                                  Label
>>>>> /dev/etherd/e2.0      ocfs2  b24cc18d-af89-4980-a75e-a87530b1b878  test1
>>>>> /dev/etherd/e3.0      ocfs2  101a92fd-b83b-4294-8bfc-fbaa069c3239  nfs4
>>>>>
>>>>> O2CB_HEARTBEAT_THRESHOLD=31
>>>>>
>>>>> when I try to run a stress test:
>>>>>
>>>>> Index 4: took 0 ms to do checking slots
>>>>> Index 5: took 2 ms to do waiting for write completion
>>>>> Index 6: took 1995 ms to do msleep
>>>>> Index 7: took 0 ms to do allocating bios for read
>>>>> Index 8: took 0 ms to do bio alloc read
>>>>> Index 9: took 0 ms to do bio add page read
>>>>> Index 10: took 0 ms to do submit_bio for read
>>>>> Index 11: took 2 ms to do waiting for read completion
>>>>> Index 12: took 0 ms to do bio alloc write
>>>>> Index 13: took 0 ms to do bio add page write
>>>>> Index 14: took 0 ms to do submit_bio for write
>>>>> Index 15: took 0 ms to do checking slots
>>>>> Index 16: took 1 ms to do waiting for write completion
>>>>> Index 17: took 1996 ms to do msleep
>>>>> Index 18: took 0 ms to do allocating bios for read
>>>>> Index 19: took 0 ms to do bio alloc read
>>>>> Index 20: took 0 ms to do bio add page read
>>>>> Index 21: took 0 ms to do submit_bio for read
>>>>> Index 22: took 10001 ms to do waiting for read completion
>>>>> (3,0):o2hb_stop_all_regions:1908 ERROR: stopping heartbeat on all active
>>>>> regions.
>>>>> Kernel panic - not syncing: ocfs2 is very sorry to be fencing this
>>>>> system by panicing
>>>>>
>>>>>
>>>>> <6>o2net: connection to node ocfs2 (num 2) at 10.1.7.107:777 has been
>>>>> idle for 10 seconds, shutting it down
>>>>> (3,0): o2net_idle_timer:1309 here are some times that might help debug
>>>>> the situation:
>>>>> (tmr: 1169487957.71650 now 1169487967.69569 dr 1169487962.88883 adv
>>>>> 1169487957.71671:1159487957.71674
>>>>> func 83bce37b2:505) 1169487901.984644:1169487901.984676)
>>>>>
>>>>> the kernel panic always occurs on the same node, and the other node
>>>>> keeps responding
>>>>>
>>>>> thanks!
>>>>>                                                                  
>>>>>


