[Ocfs-users] machine is freezing with ocfs 1.0.14

Sandhya Suman sandhya.suman at patni.com
Sun Feb 19 11:21:11 CST 2006


Hi,

I am using ocfs 1.0.14 on two of my cluster nodes. RAC is not yet
installed. The Linux kernel I am using is the AS3 U5 one, on an AMD64
processor. On both nodes I have mounted an ocfs filesystem on a common LU,
which is shared by both nodes of the cluster. It works fine until I do
something abnormal.

While doing I/O from both nodes on the same LU, I disconnected the FC on
both nodes at the same time, so both nodes were completely disconnected
from storage.
I then found the I/O on both nodes in a hung state. After some time, the
I/O on one machine returns an error and that machine goes back to a normal
state. But on the other machine the I/O freezes, and any subsequent I/O on
the same LU also freezes. The io_wait % goes to 100%, and keventd and other
ocfs processes are found in "D" state with "io_sched" status in the
"ps -elf" output.

The device I have mounted is not the standard SD device. I have created my
own wrapper device driver, and my device is just a wrapper around the SD
device. It does extra processing for error handling.

What I suspect is this: when an I/O comes to my customized device and
there is a path disconnection, the driver schedules a task for error
processing before returning. I do this scheduling with the kernel API
schedule_task().

When I read the ocfs design specification, I found that ocfs also
schedules tasks for the DLM (distributed lock manager). In case of a path
disconnection, when the DLM schedules a task, i.e. for reading the vote
request and publish sectors, that task sends an I/O to the disk. The task
is executing in interrupt context. And in case of a path error (FC
disconnection), my driver schedules another task for error processing.

As per my knowledge of the 2.4 Linux kernel, scheduling a process in
interrupt context is not safe; that is, scheduling from an already
scheduled task.
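
To make the suspected interaction concrete, here is a minimal user-space
sketch in Python. It is not kernel code and not taken from ocfs or from my
driver. It assumes that tasks queued with schedule_task() are all drained
by one worker thread (keventd), and that the I/O issued by the first task
cannot complete until the second, nested task has run. The names
dlm_task, error_handler_task and run_single_worker_demo are hypothetical,
and the timeout exists only so the demo terminates instead of hanging.

```python
import queue
import threading


def run_single_worker_demo(timeout=0.5):
    """Model one worker thread (a stand-in for keventd) draining a task queue.

    Returns True if the nested task ran while the first task was waiting,
    False if the wait timed out.
    """
    tasks = queue.Queue()
    results = {}

    def worker():
        # The only consumer of the queue: tasks run strictly one at a time.
        while True:
            task = tasks.get()
            if task is None:  # sentinel: stop the worker
                break
            task()

    def error_handler_task(done):
        # Hypothetical stand-in for the wrapper driver's error processing.
        done.set()

    def dlm_task():
        # Hypothetical stand-in for a DLM task that issues I/O and waits.
        done = threading.Event()
        # The path is down, so error processing is queued on the same worker.
        tasks.put(lambda: error_handler_task(done))
        # The I/O can only finish after that processing has run, but the
        # worker is busy executing this very function.
        results["io_completed"] = done.wait(timeout)

    thread = threading.Thread(target=worker)
    thread.start()
    tasks.put(dlm_task)
    tasks.put(None)
    thread.join()
    return results["io_completed"]


if __name__ == "__main__":
    print("I/O completed while waiting:", run_single_worker_demo())
```

In this model the wait always times out, because the only thread that
could run the error handler is the one that is waiting for it. Without a
timeout the first task would sleep forever, which would look like a
worker stuck in "D" state. Whether this is what actually happens on the
frozen node is only my assumption.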

But the phenomenon occurs on only one node. Furthermore, it does not
occur when the "comm_voting" parameter is set to 0 (i.e. when the disk DLM
is used). It occurs only when "comm_voting" is set to 1, i.e. when the
network DLM is used.

Can anybody throw some light on the root cause of the problem?

Any help in this regard will be very much appreciated.

Thanks in advance,
Sandhya
