[Ocfs2-users] Failover testing problem and a heartbeat question

Daniel McDonald wasade at gmail.com
Fri May 28 10:38:23 PDT 2010


Sunil,

I set up netconsole and verified that the machines are fencing. Thank you for your assistance.
-Daniel

On May 26, 2010, at 3:21 PM, Sunil Mushran wrote:

> On 05/26/2010 01:39 PM, Daniel McDonald wrote:
>> 
>>> ocfs2 does not reset without a log message. Do you have netconsole
>>> set up? Messages logged a tick before reset can only be captured by
>>> netconsole/kdump etc.
>>>     
>> Unfortunately no. Here are the two lines in /var/log/messages prior to the
>> unintended reboot, followed by syslog restarting:
>> 
>> May 25 22:26:03 ST2540_X4450_1 kernel: ocfs2_dlm: Nodes in domain ("7CCC109F8F16433DB7DB79526A29375A"): 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
>> May 26 04:05:27 ST2540_X4450_1 init: Trying to re-exec init
>> May 26 11:49:31 ST2540_X4450_1 syslogd 1.4.1: restart.
>> 
>> At approximately 11:46:34 a fibre cable was intentionally pulled out of the
>> SAN. Prior to that, all 15 OCFS2 nodes were performing I/O operations
>> with OCFS2 volumes on the SAN. 8 or so nodes fenced, but two simply
>> rebooted.
>> 
>> Any ideas? I'm curious whether you believe this reboot could be attributed
>> to OCFS2 or possibly a separate issue. I was surprised to see some
>> machines with fencing messages and then these two without.
>> 
>> FYI, the same test, when performed with a disk heartbeat threshold of 61,
>> did not result in any nodes dropping off.
>>   
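For context on the threshold of 61 mentioned above, here is a rough sketch of how a disk heartbeat threshold maps to a timeout. It assumes the relationship usually given in the OCFS2 documentation, timeout in seconds = (threshold - 1) * 2; the function name is made up for illustration, and the formula should be checked against the o2cb version in use.

```shell
#!/bin/sh
# Sketch: convert an O2CB disk heartbeat threshold to an approximate timeout.
# Assumed relationship (verify for your o2cb version):
#   timeout_seconds = (threshold - 1) * 2
threshold_to_seconds() {
    echo $(( ($1 - 1) * 2 ))
}

TIMEOUT=$(threshold_to_seconds 61)
echo "threshold 61 -> about ${TIMEOUT} seconds of missed disk heartbeats tolerated"
```

Under that assumption, a threshold of 61 gives roughly two minutes, which may simply have been long enough for I/O to resume after the cable pull.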
> 
> The only way to know for sure is to get the netconsole logs. As in,
> if ocfs2 is the cause of the reboot, then netconsole will capture it.
> If not, then it is likely something else. If you are going to use this in
> a prod environment, you should seriously consider setting up an
> old box as a netconsole server to capture the logs.
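For anyone following the netconsole suggestion above, a minimal sketch of composing the module parameter on a sending node follows. The parameter layout is the one described in the kernel's netconsole documentation; every address, port, MAC, and interface name here is a placeholder, and the helper function is hypothetical.

```shell
#!/bin/sh
# Sketch: build the netconsole module parameter for a node that sends logs.
# Layout per the kernel netconsole documentation:
#   netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
# All values passed below are placeholders, not taken from the thread.
netconsole_param() {
    # $1 src port, $2 src ip, $3 device, $4 target port, $5 target ip, $6 target MAC
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

PARAM=$(netconsole_param 6665 192.168.1.10 eth0 6666 192.168.1.20 00:11:22:33:44:55)

# Print the command instead of running it; loading the module requires root.
echo "modprobe netconsole $PARAM"
```

On the receiving box, a UDP listener such as `nc -u -l 6666` (or `nc -u -l -p 6666`, depending on the netcat variant) writing to a file is typically enough to capture the messages.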



