[Ocfs2-users] Catatonic nodes under SLES10

Sunil Mushran Sunil.Mushran at oracle.com
Mon Apr 9 15:18:10 PDT 2007


For io fencing to be graceful, one requires better hardware. Read: expensive.
As in, switches where one can choke off all the I/Os to the storage from
a specific node.

Read the following for a discussion on forced umounts. In short, not
possible as yet.
http://lwn.net/Articles/192632/

Read-only does not work with respect to io fencing. As in, ro only stops
any new userspace writes but cannot stop pending writes, and writes could
be lodged in any io layer. A reboot is the cheapest way to avoid corruption.
(While a reboot is painful, it is much less painful than a corrupted fs.)
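
If the concern is a node that hangs rather than rebooting after it fences
(the SLES10 behavior described below), the /proc knobs Alexei mentions are
the relevant ones: kernel.panic makes the box reboot that many seconds
after a panic, and kernel.panic_on_oops turns an oops into a panic so the
node actually resets instead of sitting half-dead. A sketch, with
illustrative values only:

    # reboot 30 seconds after a panic instead of hanging at the console
    echo 30 > /proc/sys/kernel/panic
    # escalate an oops to a panic so the auto-reboot above kicks in
    echo 1 > /proc/sys/kernel/panic_on_oops

    # to persist across reboots, e.g. in /etc/sysctl.conf:
    #   kernel.panic = 30
    #   kernel.panic_on_oops = 1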

With 1.2.5 you should be able to increase the network timeouts and
hopefully avoid the problem.
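
For reference, a sketch of what that looks like with the 1.2.5 tools. The
values below are only examples, not recommendations, and every node in the
cluster must be configured identically:

    # /etc/sysconfig/o2cb
    O2CB_HEARTBEAT_THRESHOLD=31     # iterations; ~(31-1)*2s = 60s of missed disk heartbeats
    O2CB_IDLE_TIMEOUT_MS=30000      # network idle time before a peer is declared dead
    O2CB_KEEPALIVE_DELAY_MS=2000
    O2CB_RECONNECT_DELAY_MS=2000

    # or set them interactively, then restart the stack on each node
    # (with the ocfs2 volumes umounted):
    /etc/init.d/o2cb configure
    /etc/init.d/o2cb restart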

David Miller wrote:
> Alexei_Roudnev wrote:
>> Did you check
>>
>>  /proc/sys/kernel/panic  /proc/sys/kernel/panic_on_oops
>>
>> system variables?
>>   
>
> No.  Maybe I'm missing something here.
>
> Are you saying that a panic/freeze/reboot is the expected/desirable 
> behavior?  That nothing more graceful could be done, like to just 
> dismount the ocfs2 file systems, or force them to a read-only mount or 
> something like that?  We have to reload the kernel?
>
> Thanks,
>
> --- David
>
>> ----- Original Message ----- From: "David Miller" <syslog at d.sparks.net>
>> To: <ocfs2-users at oss.oracle.com>
>> Sent: Monday, April 02, 2007 9:01 AM
>> Subject: [Ocfs2-users] Catatonic nodes under SLES10
>>   
>
> [snip]
>
>> Both servers will be connected to a dual-host external RAID system.  
>> I've set up ocfs2 on a couple of test systems and everything appears 
>> to work fine.
>>
>> Until, that is, one of the systems loses network connectivity.
>>
>> When the systems can't talk to each other anymore, but the disk 
>> heartbeat is still alive, the higher-numbered node goes catatonic.  
>> Under SLES 9 it fenced itself off with a kernel panic; under 10 it 
>> simply stops responding to network or console.  A power cycle is 
>> required to bring it back up.
>>
>> The desired behavior would be for the higher-numbered node to lose 
>> access to the ocfs2 file system(s).  I don't really care whether it 
>> would simply time out a la stale NFS mounts, or immediately error out, 
>> like access to non-existent files.
>>
>>   
>
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users


