[Ocfs2-users] Failover testing problem and a heartbeat question

Sunil Mushran sunil.mushran at oracle.com
Wed May 26 14:21:49 PDT 2010


On 05/26/2010 01:39 PM, Daniel McDonald wrote:
>
>> ocfs2 does not reset without a log message. Do you have netconsole
>> setup? Messages logged a tick before reset can only be captured by
>> netconsole/kdump etc.
>>      
> Unfortunately, no. Here are the two lines in /var/log/messages prior to the
> unintended reboot, followed by syslog restarting:
>
> May 25 22:26:03 ST2540_X4450_1 kernel: ocfs2_dlm: Nodes in domain ("7CCC109F8F16433DB7DB79526A29375A"): 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
> May 26 04:05:27 ST2540_X4450_1 init: Trying to re-exec init
> May 26 11:49:31 ST2540_X4450_1 syslogd 1.4.1: restart.
>
> At approximately 11:46:34, a fibre cable was intentionally pulled out of the
> SAN. Prior to that, all 15 OCFS2 nodes were performing I/O operations
> on OCFS2 volumes on the SAN. Eight or so nodes fenced, but two simply
> rebooted.
>
> Any ideas? I'm curious whether you believe this reboot could be attributed
> to OCFS2 or to a separate issue. I was surprised to see some
> machines with fencing messages and then these two without.
>
> FYI, the same test, when performed with a disk heartbeat threshold of 61,
> did not result in any nodes dropping off.
>    

The only way to know for sure is to get the netconsole logs. That is,
if ocfs2 caused the reboot, netconsole will capture the message it logs
just before the reset. If nothing is captured, the cause is likely
something else. If you are going to use this in a production
environment, you should seriously consider setting up an old box as a
netconsole server to capture the logs.
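
For illustration only, here is a rough sketch of what the receiving
side on such a box could look like. netconsole sends kernel messages as
plain UDP datagrams, and 6666 is its default target port. The function
names and the log file name below are hypothetical, and the sending
nodes still need netconsole configured separately to point at this
machine. Most setups would use syslog or netcat instead; this is just
to show how little is involved.

```python
import socket


def format_line(sender_ip, payload):
    """Prefix a received datagram with the sender's IP, one line per packet."""
    return sender_ip.encode("ascii") + b" " + payload.rstrip(b"\n") + b"\n"


def serve(sock, log, max_packets=None):
    """Write every datagram received on a bound UDP socket to `log`.

    `log` is any binary file-like object. Each line is flushed right
    away, since the message of interest may be the last one a node
    sends before it resets. Returns the number of packets handled.
    """
    received = 0
    while max_packets is None or received < max_packets:
        data, (sender_ip, _sender_port) = sock.recvfrom(65535)
        log.write(format_line(sender_ip, data))
        log.flush()
        received += 1
    return received


def main(log_path="netconsole.log", port=6666):
    """Listen forever. log_path is a hypothetical name; 6666 is the
    netconsole default target port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    try:
        with open(log_path, "ab") as log:
            serve(sock, log)
    finally:
        sock.close()
```

Calling main() on the logging box would append one line per message to
netconsole.log, each tagged with the IP of the node that sent it, so
the last lines from a node that reset can be found by its address.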


