Hi,<br>
<br>
I am using OCFS 1.0.14 on two of my cluster nodes. RAC is not yet
installed. The Linux release I am using is AS3 U5 on AMD64 processors. On both
nodes I have mounted an OCFS filesystem on a common LU, which is shared by both
nodes of the cluster. It works fine until I do something abnormal.<br>
<br>
While doing I/O from both nodes on the same LU, I disconnected the FC on both
nodes at the same time, so both nodes were completely disconnected from storage.<br>
The I/O on both nodes then hangs. After some time, the I/O on one machine
returns an error and that machine goes back to a normal state. On the other
machine, however, the I/O freezes, any subsequent I/O on the same LU also
freezes, io_wait goes to 100%, and keventd and other ocfs threads are found in
"D" state with "io_sched" status in the "ps -elf" output.<br>
<br>
The device I mounted is not the standard SD device. I created my own wrapper
device driver, and my device is just a wrapper around the SD device that does
extra processing for error handling.<br>
<br>
The problem I suspect is this: when an I/O comes to my customized device and
there is a path disconnection, the driver schedules a task for error processing
before returning. I do this scheduling with the kernel API
schedule_task().<br>
<br>
When I read the OCFS design specification, I found that OCFS also schedules
tasks for the DLM (distributed lock manager). On a path disconnection, when the
DLM schedules a task, e.g. for reading the vote request and publish sectors, that
task sends an I/O to the disk. The task is executing in interrupt context. On a
path error (FC disconnection), my driver then schedules another task for error
processing.<br>
<br>
As far as I know about the 2.4 Linux kernel, scheduling in interrupt context is
not safe, and that is what seems to happen here: a task is scheduled from an
already scheduled task.<br>
<br>
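To make the pattern I suspect more concrete, here is a minimal user-space sketch in Python. It is not kernel code and not taken from ocfs or my driver. SingleWorker and simulate() are hypothetical names, and the idea that the outer task waits for something that only the nested task can deliver is my assumption, not something I have confirmed in the ocfs source. One worker thread stands in for keventd, and a task running on it schedules a second task on the same queue and then waits for it.<br>
<br>
```python
import queue
import threading


class SingleWorker:
    """User-space stand-in for a single keventd-style thread draining a task queue."""

    def __init__(self):
        self.tasks = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def schedule_task(self, fn):
        # Loosely modelled on schedule_task(): queue the work and return at once.
        self.tasks.put(fn)

    def _run(self):
        while True:
            fn = self.tasks.get()
            if fn is None:  # sentinel used only to end the simulation
                break
            fn()

    def stop(self):
        self.tasks.put(None)
        self.thread.join()


def simulate(wait_seconds=0.2):
    worker = SingleWorker()
    log = []
    error_handled = threading.Event()
    outer_done = threading.Event()

    def error_task():
        # Stands in for the error processing queued by the wrapper driver.
        log.append("error task ran")
        error_handled.set()

    def outer_task():
        # Stands in for a task that issues I/O while already running on the worker.
        # The "driver" sees the path failure and defers error handling...
        worker.schedule_task(error_task)
        # ...but the outer task now waits on the very thread that must run it.
        if error_handled.wait(timeout=wait_seconds):
            log.append("outer task: I/O completed")
        else:
            log.append("outer task: wait timed out")
        outer_done.set()

    worker.schedule_task(outer_task)
    outer_done.wait()
    worker.stop()
    return log
```
<br>
Calling simulate() returns ["outer task: wait timed out", "error task ran"]: the nested task can only run after the outer task gives up. The timeout exists only so the sketch terminates. Without it the worker would block forever, which looks similar to keventd sitting in "D" state on my node.<br>
<br>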
But the phenomenon occurs on only one node. Furthermore, it does not occur when
the "comm_voting" parameter is set to 0 (i.e. when disk DLM is used). It occurs
only when "comm_voting" is set to 1, i.e. when network DLM is used.<br>
<br>
Can anybody shed any light on the root cause of the problem?<br>
<br>
Any help in this regard will be very much
appreciated.<br>
<br>
Thanks in advance,<br>
Sandhya <br>