[Ocfs-devel] machine is freezing with ocfs 1.0.14

sandhya Suman sandhya.suman at gmail.com
Sun Feb 19 11:09:45 CST 2006


Hi,

I am using ocfs 1.0.14 on two cluster nodes. RAC is not yet installed.
The Linux kernel I am using is from AS3 U5, on an AMD64 processor. On both
nodes I have mounted an ocfs filesystem on a common LU, which is shared by
both nodes of the cluster. It works fine until I do something abnormal.

While doing I/O from both nodes on the same LU, I disconnected the FC on both
nodes at the same time, so both nodes were completely disconnected from
storage.
The I/O on both nodes then hung. After some time, the I/O on one machine
returned an error and that machine went back to a normal state. On the other
machine, however, the I/O froze, and any subsequent I/O on the same LU also
froze. The io_wait % went to 100%, and keventd and other ocfs processes were
in "D" state with "io_sched" status in the "ps -elf" output.

The device I mounted is not the standard SD device. I have written my own
wrapper device driver, and my device is just a wrapper around the SD device
that does extra processing for error handling.

The problem I suspect is this: when an I/O comes to my customized device and
there is a path disconnection, the driver schedules a task for error
processing before returning. I do the scheduling with the kernel API
schedule_task().

When I read the ocfs design specification, I found that ocfs also schedules
tasks for the DLM (distributed lock manager). In case of a path disconnection,
when the DLM schedules a task, e.g. for reading the vote request and publish
sectors, it sends an I/O to the disk. The task executes in interrupt context.
On a path error (FC disconnection), my driver then schedules another task for
error processing.

As far as I know from the 2.4 Linux kernel, scheduling a task in interrupt
context is not safe, that is, scheduling from an already scheduled task.

But the phenomenon occurs on only one node. Furthermore, it does not occur
when the "comm_voting" parameter is set to 0 (i.e. when the disk DLM is
used). It occurs only when "comm_voting" is set to 1, i.e. when the network
DLM is used.

Can anybody throw any light on the root cause of the problem?

Any help in this regard will be very much appreciated.

Thanks in advance,
Sandhya