[Ocfs-users] machine is freezing with ocfs 1.0.14
Sunil Mushran
Sunil.Mushran at oracle.com
Mon Feb 20 19:39:45 CST 2006
Have you tried looking at the kernel stack traces for the D state
processes?
# echo t >/proc/sysrq-trigger
That may provide a clue as to where the processes are "hanging".
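
A minimal sketch of how that check could be scripted, assuming a
procps-style ps and a root shell. The function names (filter_d_state,
list_d_state, dump_task_stacks) are made up for illustration and are not
part of ocfs or any standard tool. The sysrq dump goes to the kernel log,
and the magic SysRq key has to be enabled (/proc/sys/kernel/sysrq) for the
trigger to do anything.

```shell
#!/bin/sh
# Hypothetical helpers for locating tasks stuck in uninterruptible sleep.

# Keep the header line (if present) and every row whose STAT column
# starts with "D". Expects "PID STAT WCHAN COMMAND" rows on stdin.
filter_d_state() {
    awk 'NR == 1 && $1 == "PID" { print; next } $2 ~ /^D/ { print }'
}

# List D state processes together with the kernel function they sleep in.
list_d_state() {
    ps -eo pid,stat,wchan,comm | filter_d_state
}

# Ask the kernel to dump every task's stack, then show the end of the log.
# Must be run as root. The dump can be large, so the syslog file
# (often /var/log/messages) may hold a more complete copy than dmesg.
dump_task_stacks() {
    echo t > /proc/sysrq-trigger && dmesg | tail -n 200
}
```

Running list_d_state first shows which tasks to look for in the trace
(keventd, in the report below); dump_task_stacks then produces the stacks
for those PIDs.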
Sandhya Suman wrote:
> Hi,
>
> I am using ocfs 1.0.14 on two of my cluster nodes. RAC is not yet
> installed. The Linux kernel I am using is AS3 U5 on an AMD64 processor. On
> both nodes I have mounted an ocfs filesystem on a common LU. The LU is
> shared by both nodes of the cluster. It works fine until I do some
> abnormal task.
>
> While doing I/O from both nodes on the same LU, I disconnected the FC on
> both nodes at the same time. Both nodes are then completely disconnected
> from storage.
> I find the I/O on both nodes in a hung state. After some time, the I/O
> on one machine returns an error and that machine is back in a normal
> state. But on the other machine the I/O freezes, and any successive I/O
> on the same LU freezes as well. The io_wait % goes to 100%, and keventd
> and other ocfs processes are found in "D" state with "io_sched" status in
> the "ps -elf" output.
>
> The device I have mounted is not the standard SD device. I have created
> my own wrapper device driver, and my device is just a wrapper device for
> the SD device. It does extra processing for error handling.
>
> The problem I suspect is this: when an I/O comes to my customized device
> and there is a path disconnection, the driver schedules a task for error
> processing before returning. I do the scheduling using the kernel API
> schedule_task().
>
> When I read the ocfs design specification, I found that ocfs also
> schedules a task for DLM (distributed lock manager) processing. In case of
> a path disconnection, when the DLM schedules a task, i.e. for reading the
> vote request and publish sectors, it sends an I/O to the disk. The task is
> executing in interrupt context. And in case of a path error (FC
> disconnection), my driver schedules another task for error processing.
>
> As per my knowledge of the 2.4 Linux kernel, scheduling a process in
> interrupt context is not safe. That is, scheduling from an already
> scheduled task.
>
> But the phenomenon occurs only on one node. Furthermore, the phenomenon
> does not occur when the "comm_voting" parameter is set to 0 (i.e. Disk DLM
> is used). It occurs only when "comm_voting" is set to 1, i.e. when Network
> DLM is used.
>
> Can anybody throw any light on the root cause of the problem?
>
> Any help in this regard will be very much appreciated.
>
> Thanks in Advance
> Sandhya
>
>
> _______________________________________________
> Ocfs-users mailing list
> Ocfs-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs-users
>