[Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition

Sunil Mushran sunil.mushran at oracle.com
Mon Nov 24 16:17:22 PST 2008


This is strange. There is something weird going on in your setup.
When heartbeat starts, it generates a random generation number
that remains constant as long as the heartbeat region is alive.

The log file shows that the hb gen is not constant for node 0,
meaning some other node is running a heartbeat thread while claiming
to be node 0. Node 1's hb gen remains the same.

What is strange is that there are 8 different hb generation numbers.
That explains the error message saying a different node is heartbeating
in that slot. Note that one generation, 44588ab5b5bb4ddc, alternates
with seven others that each appear only once.
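
You can confirm the count from the attached log. A rough one-liner,
assuming the debugfs output was saved to /tmp/hb.out as in my earlier
mail (the generation is the 4th field of each slot line):

$ awk '$1 == "0:" { print $4 }' /tmp/hb.out | sort -u | wc -l

That prints 8 for the log below; the same command with "1:" prints 1.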

However, you have to find the stray o2hb threads that have not been
stopped. They could be running on node 0 itself, or on some other node
that you had set up as node 0 before. I am guessing here. Go through
each node that is physically connected to the storage.

You can start by running "ps aux | grep o2hb" to see how many of those
threads are running on your nodes. It should be 1 per mount per node.
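
If you have ssh between the boxes, a loop like the following saves a
few round trips. The hostnames are placeholders; substitute the nodes
that are actually cabled to the storage. The [o]2hb pattern keeps grep
from matching itself.

$ for h in node0 node1 node2; do echo "== $h =="; \
      ssh $h 'ps aux | grep "[o]2hb"'; done

Any node reporting more o2hb threads than it has ocfs2 mounts is a
suspect.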

Sunil


Fri Nov 21 13:28:56 CST 2008
        node: node              seq       generation checksum
           0:    0 0000000049270bf7 44588ab5b5bb4ddc e9c92a6c
           1:    1 0000000049266fa3 bc1f697ec3c0e60b a7c6f487
Fri Nov 21 13:29:04 CST 2008
        node: node              seq       generation checksum
           0:    0 0000000049270c00 9bf2a853e46fba5f b3736682
           1:    1 0000000049266fa3 bc1f697ec3c0e60b a7c6f487
Fri Nov 21 13:29:05 CST 2008
        node: node              seq       generation checksum
           0:    0 0000000049270c01 44588ab5b5bb4ddc 49371c11
           1:    1 0000000049266fa3 bc1f697ec3c0e60b a7c6f487
Fri Nov 21 13:29:06 CST 2008
        node: node              seq       generation checksum
           0:    0 0000000049270c02 3e47b90030af5dfd 7786e320
           1:    1 0000000049266fa3 bc1f697ec3c0e60b a7c6f487
Fri Nov 21 13:29:07 CST 2008
        node: node              seq       generation checksum
           0:    0 0000000049270c03 44588ab5b5bb4ddc 69d2aa41
           1:    1 0000000049266fa3 bc1f697ec3c0e60b a7c6f487
Fri Nov 21 13:29:08 CST 2008
        node: node              seq       generation checksum
           0:    0 0000000049270c04 58b983437004059c 79b4a671
           1:    1 0000000049266fa3 bc1f697ec3c0e60b a7c6f487
Fri Nov 21 13:29:10 CST 2008
        node: node              seq       generation checksum
           0:    0 0000000049270c05 44588ab5b5bb4ddc 08fc70b1
           1:    1 0000000049266fa3 bc1f697ec3c0e60b a7c6f487
Fri Nov 21 13:29:11 CST 2008
        node: node              seq       generation checksum
           0:    0 0000000049270c07 e6b901ad46180797 adc0a96e
           1:    1 0000000049266fa3 bc1f697ec3c0e60b a7c6f487
Fri Nov 21 13:29:12 CST 2008
        node: node              seq       generation checksum
           0:    0 0000000049270c07 44588ab5b5bb4ddc 2819c6e1
           1:    1 0000000049266fa3 bc1f697ec3c0e60b a7c6f487
Fri Nov 21 13:29:22 CST 2008
        node: node              seq       generation checksum
           0:    0 0000000049270c12 f87033cdd779681d e63030fb
           1:    1 0000000049266fa3 bc1f697ec3c0e60b a7c6f487
Fri Nov 21 13:29:23 CST 2008
        node: node              seq       generation checksum
           0:    0 0000000049270c13 44588ab5b5bb4ddc b58e1e80
           1:    1 0000000049266fa3 bc1f697ec3c0e60b a7c6f487
Fri Nov 21 13:29:24 CST 2008
        node: node              seq       generation checksum
           0:    0 0000000049270c14 bf0c82508b3f190d 93d006f3
           1:    1 0000000049266fa3 bc1f697ec3c0e60b a7c6f487
Fri Nov 21 13:29:25 CST 2008
        node: node              seq       generation checksum
           0:    0 0000000049270c15 44588ab5b5bb4ddc d4a0c470
           1:    1 0000000049266fa3 bc1f697ec3c0e60b a7c6f487
Fri Nov 21 13:29:27 CST 2008
        node: node              seq       generation checksum
           0:    0 0000000049270c16 2535047a999203a9 5e896498
           1:    1 0000000049266fa3 bc1f697ec3c0e60b a7c6f487

Daniel Keisling wrote:
> Sorry for the delay in getting back to you.
>
> I never catch a core during the segfault of the umount.
>
> I tried to delete the heartbeat again and the command completed
> successfully, but the messages are still appearing in syslog.  A
> subsequent issue of the command brings:
>
> [root@ausracdbd01 ~]# ocfs2_hb_ctl -K -d /dev/dm-30 o2cb
> ocfs2_hb_ctl: Unable to access cluster service while stopping heartbeat 
>
>
> Please see attached for the log you requested.
>
> Daniel
>
>   
>> -----Original Message-----
>> From: Sunil Mushran [mailto:sunil.mushran at oracle.com] 
>> Sent: Thursday, October 30, 2008 5:42 PM
>> To: Daniel Keisling
>> Cc: ocfs2-users at oss.oracle.com
>> Subject: Re: [Ocfs2-users] Another node is heartbeating in 
>> our slot! errors with LUN removal/addition
>>
>> So manually stopping the heartbeat worked.
>>
>> Did you catch a coredump for the segfault during umount?
>>
>> There is a small difference in the stop heartbeat that is called
>> as part of umount and the one called by hand, but I have not been
>> able to figure out the source of the segfault.
>>
>> The above coredump will help.
>>
>> The other thing you could do is run the following command when you
>> see the "another node heartbeating..." message.
>> $ for i in `seq 30`; do date >> /tmp/hb.out; \
>>     debugfs.ocfs2 -R "hb" /dev/dmX >> /tmp/hb.out; sleep 1; done
>>
>> Replace the device name with the one that is in the logs. Email me
>> the output.
>>
>> Sunil
>>     
