[Ocfs2-users] Journal replay after crash, kernel BUG at fs/ocfs2/journal.c:1700!, 2.6.36

Fri Oct 29 04:49:41 PDT 2010

Ronald Moesbergen wrote:
> 2010/10/29 Ronald Moesbergen <intercommit at gmail.com>:
>> 2010/10/29 Tao Ma <tao.ma at oracle.com>:
>>> Hi Ronald,
>> Hi Tao,
>>
>> Thanks for looking into this.
>>
>>     
>>> On 10/29/2010 05:12 PM, Ronald Moesbergen wrote:
>>>       
>>>> Hello,
>>>>
>>>> I was testing kernel 2.6.36 (vanilla mainline) and encountered the
>>>> following BUG():
>>>>
>>>> [157756.266000] o2net: no longer connected to node app01 (num 0) at
>>>> 10.2.25.13:7777
>>>> [157756.266077] (o2hb-5FA56B1D0A,2908,0):o2dlm_eviction_cb:267 o2dlm
>>>> has evicted node 0 from group 5FA56B1D0A9249099CE58C82CFEC873A
>>>> [157756.274443] (ocfs2rec,14060,0):dlm_get_lock_resource:836
>>>> 5FA56B1D0A9249099CE58C82CFEC873A:M00000000000000000000186ba2b09b: at
>>>> least one node (0) to recover before lock mastery can begin
>>>> [157757.275776] (ocfs2rec,14060,0):dlm_get_lock_resource:890
>>>> 5FA56B1D0A9249099CE58C82CFEC873A:M00000000000000000000186ba2b09b: at
>>>> least one node (0) to recover before lock mastery can begin
>>>> [157760.774045] (dlm_reco_thread,2920,2):dlm_get_lock_resource:836
>>>> 5FA56B1D0A9249099CE58C82CFEC873A:$RECOVERY: at least one node (0) to
>>>> recover before lock mastery can begin
>>>> [157760.774124] (dlm_reco_thread,2920,2):dlm_get_lock_resource:870
>>>> 5FA56B1D0A9249099CE58C82CFEC873A: recovery map is not empty, but must
>>>> master $RECOVERY lock now
>>>> [157760.774205] (dlm_reco_thread,2920,2):dlm_do_recovery:523 (2920)
>>>> Node 1 is the Recovery Master for the Dead Node 0 for Domain
>>>> 5FA56B1D0A9249099CE58C82CFEC873A
>>>> [157768.261818] (ocfs2rec,14060,0):ocfs2_replay_journal:1605
>>>> Recovering node 0 from slot 0 on device (8,32)
>>>> [157772.850182] ------------[ cut here ]------------
>>>> [157772.850211] kernel BUG at fs/ocfs2/journal.c:1700!
>>>>         
>>> Strange. The BUG line is
>>> BUG_ON(osb->node_num == node_num);
>>> and it fires when the node being recovered has the same node number
>>> as the local node.
>>>       
>
> I just tried to reproduce it and succeeded. Here's what I did:
> - Unmount the filesystem on node app02
> - Shut down the o2cb services on app02
> - Run halt -f on app01 while it still has the OCFS2 volume mounted
> - Start the o2cb services on app02
> - Mount the OCFS2 filesystem on app02 -> BUG
>   
Thanks for the test. I will look into it.

Regards,
Tao
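
For readers following the thread, the invariant behind the quoted
BUG_ON() can be illustrated with a small userspace sketch. This is not
the kernel code: struct toy_osb and toy_can_recover() are hypothetical
names made up for illustration, and only the comparison itself is taken
from the line quoted above. The sketch returns an error where the kernel
would BUG().

```c
#include <stdio.h>

/* Hypothetical stand-in for the per-mount state. Only the field name
 * node_num mirrors the expression quoted in the thread. */
struct toy_osb {
    int node_num;   /* node number of the node performing recovery */
};

/* Returns 0 if recovery of dead_node_num may proceed, -1 if the local
 * node would be asked to recover itself. The kernel check quoted in the
 * thread uses BUG_ON() for this condition instead of returning. */
int toy_can_recover(const struct toy_osb *osb, int dead_node_num)
{
    if (osb->node_num == dead_node_num) {
        fprintf(stderr, "refusing: node %d asked to recover itself\n",
                osb->node_num);
        return -1;
    }
    return 0;
}
```

With node_num set to 1, toy_can_recover(&osb, 0) returns 0, which
matches the first log where node 1 recovers dead node 0. A call with
the same number on both sides returns -1, which is the condition the
BUG_ON() catches. Why the two numbers end up equal in the reproduction
is not established in this thread.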