[Ocfs2-users] Journal replay after crash, kernel BUG at fs/ocfs2/journal.c:1700!, 2.6.36

Ronald Moesbergen intercommit at gmail.com
Fri Oct 29 02:40:10 PDT 2010


2010/10/29 Tao Ma <tao.ma at oracle.com>:
> Hi Ronald,

Hi Tao,

Thanks for looking into this.

> On 10/29/2010 05:12 PM, Ronald Moesbergen wrote:
>>
>> Hello,
>>
>> I was testing kernel 2.6.36 (vanilla mainline) and encountered the
>> following BUG():
>>
>> [157756.266000] o2net: no longer connected to node app01 (num 0) at
>> 10.2.25.13:7777
>> [157756.266077] (o2hb-5FA56B1D0A,2908,0):o2dlm_eviction_cb:267 o2dlm
>> has evicted node 0 from group 5FA56B1D0A9249099CE58C82CFEC873A
>> [157756.274443] (ocfs2rec,14060,0):dlm_get_lock_resource:836
>> 5FA56B1D0A9249099CE58C82CFEC873A:M00000000000000000000186ba2b09b: at
>> least one node (0) to recover before lock mastery can begin
>> [157757.275776] (ocfs2rec,14060,0):dlm_get_lock_resource:890
>> 5FA56B1D0A9249099CE58C82CFEC873A:M00000000000000000000186ba2b09b: at
>> least one node (0) to recover before lock mastery can begin
>> [157760.774045] (dlm_reco_thread,2920,2):dlm_get_lock_resource:836
>> 5FA56B1D0A9249099CE58C82CFEC873A:$RECOVERY: at least one node (0) to
>> recover before lock mastery can begin
>> [157760.774124] (dlm_reco_thread,2920,2):dlm_get_lock_resource:870
>> 5FA56B1D0A9249099CE58C82CFEC873A: recovery map is not empty, but must
>> master $RECOVERY lock now
>> [157760.774205] (dlm_reco_thread,2920,2):dlm_do_recovery:523 (2920)
>> Node 1 is the Recovery Master for the Dead Node 0 for Domain
>> 5FA56B1D0A9249099CE58C82CFEC873A
>> [157768.261818] (ocfs2rec,14060,0):ocfs2_replay_journal:1605
>> Recovering node 0 from slot 0 on device (8,32)
>> [157772.850182] ------------[ cut here ]------------
>> [157772.850211] kernel BUG at fs/ocfs2/journal.c:1700!
>
> Strange. The BUG line is
> BUG_ON(osb->node_num == node_num);
> and it trips when the node being recovered has the same node number as
> the local node, i.e. two nodes in the cluster appear to share a node number.
>
> So could you please grab the mount info from the system logs of the two nodes?
> The message looks like:
>
> Oct 27 16:24:21 ocfs2-test2 kernel: ocfs2: Mounting device (8,8) on (node 2,
> slot 0) with ordered data mode.
>
> It tells us which node and slot the volume used.

Sure, here you go:

For Node '0':
Oct 27 13:23:41 app01 kernel: [  198.205657] ocfs2: Mounting device
(8,32) on (node 0, slot 0) with writeback data mode.
Oct 27 13:43:09 app01 kernel: [   67.202551] ocfs2: Mounting device
(8,32) on (node 0, slot 0) with writeback data mode.
Oct 27 14:04:14 app01 kernel: [ 1330.011053] ocfs2: Mounting device
(8,32) on (node 0, slot 0) with writeback data mode.
Oct 27 14:08:04 app01 kernel: [ 1559.934303] ocfs2: Mounting device
(8,32) on (node 0, slot 0) with writeback data mode.
Oct 27 14:36:27 app01 kernel: [   66.360001] ocfs2: Mounting device
(8,32) on (node 0, slot 0) with writeback data mode.
Oct 29 10:48:41 app01 kernel: [  216.426770] ocfs2: Mounting device
(8,32) on (node 0, slot 0) with writeback data mode.
Oct 29 11:00:00 app01 kernel: [  894.551493] ocfs2: Mounting device
(8,32) on (node 0, slot 0) with writeback data mode.
Oct 29 11:22:34 app01 kernel: [   67.169230] ocfs2: Mounting device
(8,32) on (node 0, slot 0) with writeback data mode.

For Node '1':
Oct 27 13:54:20 app02 kernel: [   97.953174] ocfs2: Mounting device
(8,32) on (node 1, slot 1) with writeback data mode.
Oct 27 14:08:16 app02 kernel: [  933.019800] ocfs2: Mounting device
(8,32) on (node 1, slot 1) with writeback data mode.
Oct 27 14:31:06 app02 kernel: [   67.006843] ocfs2: Mounting device
(8,32) on (node 1, slot 1) with writeback data mode.

So no surprises there I guess.
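
For reference, the check that fired is the one quoted above. Roughly, this
recovery path is only asked to replay a dead node's journal (a node replays
its own journal at mount time through a separate path), so matching node
numbers mean the cluster thinks the local node is the dead node. A minimal
sketch of that invariant (the function name and surrounding structure here
are illustrative assumptions, not a copy of fs/ocfs2/journal.c):

/*
 * Illustrative sketch, not the actual kernel code: osb->node_num is the
 * local node's number, node_num is the dead node whose journal the
 * recovery thread was asked to replay.  A node never recovers itself
 * through this path, so equal numbers indicate an inconsistent cluster
 * view and the kernel BUGs out.
 */
static int recover_node_sketch(struct ocfs2_super *osb,
                               int node_num, int slot_num)
{
        BUG_ON(osb->node_num == node_num);  /* journal.c:1700 in 2.6.36 */

        /* ... locate the dead node's journal in slot slot_num and replay it ... */
        return 0;
}

Given the mount messages above (app01 mounted as node 0, slot 0; app02 as
node 1, slot 1), the node numbers are distinct, so it is not obvious how
that check could have matched here.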

Regards,
Ronald.
