[Ocfs2-users] Journal replay after crash, kernel BUG at fs/ocfs2/journal.c:1700!, 2.6.36

Tao Ma tao.ma at oracle.com
Sun Oct 31 23:04:06 PDT 2010


Hi Ronald,

On 10/29/2010 06:03 PM, Ronald Moesbergen wrote:
> 2010/10/29 Ronald Moesbergen<intercommit at gmail.com>:
>> 2010/10/29 Tao Ma<tao.ma at oracle.com>:
>>> Hi Ronald,
>>
>> Hi Tao,
>>
>> Thanks for looking into this.
>>
>>> On 10/29/2010 05:12 PM, Ronald Moesbergen wrote:
>>>>
>>>> Hello,
>>>>
>>>> I was testing kernel 2.6.36 (vanilla mainline) and encountered the
>>>> following BUG():
>>>>
>>>> [157756.266000] o2net: no longer connected to node app01 (num 0) at
>>>> 10.2.25.13:7777
>>>> [157756.266077] (o2hb-5FA56B1D0A,2908,0):o2dlm_eviction_cb:267 o2dlm
>>>> has evicted node 0 from group 5FA56B1D0A9249099CE58C82CFEC873A
>>>> [157756.274443] (ocfs2rec,14060,0):dlm_get_lock_resource:836
>>>> 5FA56B1D0A9249099CE58C82CFEC873A:M00000000000000000000186ba2b09b: at
>>>> least one node (0) to recover before lock mastery can begin
>>>> [157757.275776] (ocfs2rec,14060,0):dlm_get_lock_resource:890
>>>> 5FA56B1D0A9249099CE58C82CFEC873A:M00000000000000000000186ba2b09b: at
>>>> least one node (0) to recover before lock mastery can begin
>>>> [157760.774045] (dlm_reco_thread,2920,2):dlm_get_lock_resource:836
>>>> 5FA56B1D0A9249099CE58C82CFEC873A:$RECOVERY: at least one node (0) to
>>>> recover before lock mastery can begin
>>>> [157760.774124] (dlm_reco_thread,2920,2):dlm_get_lock_resource:870
>>>> 5FA56B1D0A9249099CE58C82CFEC873A: recovery map is not empty, but must
>>>> master $RECOVERY lock now
>>>> [157760.774205] (dlm_reco_thread,2920,2):dlm_do_recovery:523 (2920)
>>>> Node 1 is the Recovery Master for the Dead Node 0 for Domain
>>>> 5FA56B1D0A9249099CE58C82CFEC873A
>>>> [157768.261818] (ocfs2rec,14060,0):ocfs2_replay_journal:1605
>>>> Recovering node 0 from slot 0 on device (8,32)
>>>> [157772.850182] ------------[ cut here ]------------
>>>> [157772.850211] kernel BUG at fs/ocfs2/journal.c:1700!
>>>
>>> Strange. The bug line is
>>> BUG_ON(osb->node_num == node_num);
>>> and it fires when the node performing recovery finds its own node
>>> number as the number of the node it is asked to recover.
>
> I just tried to reproduce it and succeeded. Here's what I did:
> - unmount the filesystem on node app02
> - shutdown the o2cb services on app02
> - Do a halt -f on app01, which still has the OCFS2 volume mounted.
> - Start o2cb services on app02
> - Mount the OCFS2 filesystem ->  BUG
>
> This works every time. So one of the two variables checked in that
> BUG_ON statement must not be set correctly somewhere.
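For reference, the reproduction steps above as commands. This is an ops sketch, not a standalone script: the node names, device, and mount point come from this thread's setup and will differ on other installs.

```sh
# On app02: cleanly leave the cluster
umount /mnt/ocfs2            # mount point is illustrative
/etc/init.d/o2cb stop

# On app01: hard-kill the node while it still has the volume mounted
halt -f

# Back on app02: rejoin the cluster and mount, which triggers
# journal replay of app01's slot -> kernel BUG at journal.c:1700
/etc/init.d/o2cb start
mount -t ocfs2 /dev/sdc /mnt/ocfs2   # device is illustrative
```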
I have tried several times in my local test environment, but with no 
luck so far. It also seems quite strange to me, at least from reading 
the code.

Could you please file a bug at oss.oracle.com/bugzilla so that it is 
easier to track and discuss? Many thanks.

Regards,
Tao
