[Ocfs2-users] Journal replay after crash, kernel BUG at fs/ocfs2/journal.c:1700!, 2.6.36

Mon Nov 1 01:25:54 PDT 2010

2010/11/1 Tao Ma <tao.ma at oracle.com>:
> Hi Ronald,
>
> On 10/29/2010 06:03 PM, Ronald Moesbergen wrote:
>>
>> 2010/10/29 Ronald Moesbergen<intercommit at gmail.com>:
>>>
>>> 2010/10/29 Tao Ma<tao.ma at oracle.com>:
>>>>
>>>> Hi Ronald,
>>>
>>> Hi Tao,
>>>
>>> Thanks for looking into this.
>>>
>>>> On 10/29/2010 05:12 PM, Ronald Moesbergen wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I was testing kernel 2.6.36 (vanilla mainline) and encountered the
>>>>> following BUG():
>>>>>
>>>>> [157756.266000] o2net: no longer connected to node app01 (num 0) at
>>>>> 10.2.25.13:7777
>>>>> [157756.266077] (o2hb-5FA56B1D0A,2908,0):o2dlm_eviction_cb:267 o2dlm
>>>>> has evicted node 0 from group 5FA56B1D0A9249099CE58C82CFEC873A
>>>>> [157756.274443] (ocfs2rec,14060,0):dlm_get_lock_resource:836
>>>>> 5FA56B1D0A9249099CE58C82CFEC873A:M00000000000000000000186ba2b09b: at
>>>>> least one node (0) to recover before lock mastery can begin
>>>>> [157757.275776] (ocfs2rec,14060,0):dlm_get_lock_resource:890
>>>>> 5FA56B1D0A9249099CE58C82CFEC873A:M00000000000000000000186ba2b09b: at
>>>>> least one node (0) to recover before lock mastery can begin
>>>>> [157760.774045] (dlm_reco_thread,2920,2):dlm_get_lock_resource:836
>>>>> 5FA56B1D0A9249099CE58C82CFEC873A:$RECOVERY: at least one node (0) to
>>>>> recover before lock mastery can begin
>>>>> [157760.774124] (dlm_reco_thread,2920,2):dlm_get_lock_resource:870
>>>>> 5FA56B1D0A9249099CE58C82CFEC873A: recovery map is not empty, but must
>>>>> master $RECOVERY lock now
>>>>> [157760.774205] (dlm_reco_thread,2920,2):dlm_do_recovery:523 (2920)
>>>>> Node 1 is the Recovery Master for the Dead Node 0 for Domain
>>>>> 5FA56B1D0A9249099CE58C82CFEC873A
>>>>> [157768.261818] (ocfs2rec,14060,0):ocfs2_replay_journal:1605
>>>>> Recovering node 0 from slot 0 on device (8,32)
>>>>> [157772.850182] ------------[ cut here ]------------
>>>>> [157772.850211] kernel BUG at fs/ocfs2/journal.c:1700!
>>>>
>>>> Strange. The bug line is
>>>> BUG_ON(osb->node_num == node_num);
>>>> and it fires when the node being recovered has the same node number
>>>> as the local node.
>>
>> I just tried to reproduce it and succeeded. Here's what I did:
>> - Unmount the filesystem on node app02
>> - Shut down the o2cb services on app02
>> - Do a halt -f on app01, which still has the OCFS2 volume mounted
>> - Start o2cb services on app02
>> - Mount the OCFS2 filesystem -> BUG
>>
>> Works every time. So one of the two variables checked in that BUG_ON
>> statement must not be set correctly somewhere.
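
To make the condition under discussion concrete, here is a minimal
user-space sketch of the comparison quoted above. It is not the kernel
code: struct model_osb and recovery_would_bug() are hypothetical names
invented for illustration, and only the field name node_num comes from
the quoted BUG_ON line.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-mount state; only the field that
 * appears in the quoted BUG_ON line is modelled here. */
struct model_osb {
    unsigned int node_num;  /* node number of the node doing the recovery */
};

/* Returns true when the quoted condition (osb->node_num == node_num)
 * holds, i.e. when the node to recover is the local node itself. */
static bool recovery_would_bug(const struct model_osb *osb,
                               unsigned int node_num)
{
    return osb->node_num == node_num;
}
```

Going by the log, the recovering node is node 1 and the dead node is
node 0, so recovery_would_bug() would be expected to return false for
those values. That the BUG fires anyway suggests one of the two values
is not what the log implies at the time of the check.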
>
> I have tried several times in my local test env, but with no luck so far.
> And it seems quite strange to me, at least from the code.
>
> So could you please file a bug at oss.oracle.com/bugzilla so that it is
> easier to track and discuss? Many thanks.

Ok, it's filed as:
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1296

I've tried to reproduce on another cluster and there I don't see the
bug either, so it must be something specific to this setup.

Regards,
Ronald.