[Ocfs2-users] (no subject)

Sunil Mushran sunil.mushran at oracle.com
Tue Sep 2 11:57:49 PDT 2008


The process won't be a zombie. It will be a D state process. The
thing to see is where it is stuck. The ps below can help with that.

Break the lock where? Note that, unlike NFS, in a cluster file system
each node reads and writes directly from/to the device. That means we
cannot support a lock-leasing scheme that allows a server to revoke
the lock of a holder; allowing that would lead to corruption.
The easier route is to kill the process or reset the node (if the
process is in D state). Yes, the DLM will recover the lock and the
fs will replay the journal. Bottom line: while resetting a node sounds
harsh, it is the quickest way to recover in such a situation.

The wiki was written for 1.2, in which reading the DLM lock state
is clunky. The 1.4 user's guide shows how to do the same in the
new release. We have added the -B option to fs_locks to limit the
output to the relevant lock resources, and also added dlm_locks to dump
all or a specific dlm lockres.
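
For example, a quick sketch (exact option syntax may vary by release,
so check debugfs.ocfs2(8); /dev/sdX and <lockname> are placeholders):

  # limit fs_locks output to the lock resources that are currently busy
  debugfs.ocfs2 -R "fs_locks -B" /dev/sdX

  # dump all dlm lock resources, or a single lockres by name
  debugfs.ocfs2 -R "dlm_locks" /dev/sdX
  debugfs.ocfs2 -R "dlm_locks <lockname>" /dev/sdX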

Now, while getting the info and deciphering it is easy, deciding the
best course of action is not. It requires an understanding of how
the kernel operates. So, when you encounter such a condition, file
a bugzilla with all of the following info.

1. /var/log/messages
2. ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN > /tmp/ps.out
3. debugfs.ocfs2 -R "stats" /dev/sdX >/tmp/stats_sdX.out
4. debugfs.ocfs2 -R "fs_locks" /dev/sdX >/tmp/fs_sdX.out
5. debugfs.ocfs2 -R "dlm_locks" /dev/sdX > /tmp/dlm_sdX.out
6. cat /sys/kernel/debug/o2dlm/<domain>/dlm_state > /tmp/state_sdX.out
7. cat /sys/kernel/debug/o2net/send_tracking > /tmp/tracking.out
8. cat /sys/kernel/debug/o2net/sock_containers > /tmp/sockets.out

Collect this from all nodes in the cluster. Also, stats, fs_locks and dlm_locks
should be gathered for all ocfs2 mounts, and dlm_state for all domains. A rough
collection script is sketched below.
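
A minimal sketch of such a script, assuming debugfs is mounted at
/sys/kernel/debug and with /dev/sdX, /dev/sdY standing in for your
actual ocfs2 devices; adjust the device list and output paths:

  #!/bin/sh
  # Sketch only: run on each node; DEVICES and the output directory are assumptions.
  OUT=/tmp/ocfs2-debug-$(hostname)
  mkdir -p "$OUT"

  cp /var/log/messages "$OUT/messages"
  ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN > "$OUT/ps.out"

  # One set of dumps per ocfs2 device/mount; /dev/sdX and /dev/sdY are placeholders.
  DEVICES="/dev/sdX /dev/sdY"
  for dev in $DEVICES; do
      name=$(basename "$dev")
      debugfs.ocfs2 -R "stats"     "$dev" > "$OUT/stats_$name.out"
      debugfs.ocfs2 -R "fs_locks"  "$dev" > "$OUT/fs_$name.out"
      debugfs.ocfs2 -R "dlm_locks" "$dev" > "$OUT/dlm_$name.out"
  done

  # One dlm_state per domain; each domain appears as a directory under o2dlm.
  for dom in /sys/kernel/debug/o2dlm/*; do
      [ -d "$dom" ] || continue
      cat "$dom/dlm_state" > "$OUT/state_$(basename "$dom").out"
  done

  cat /sys/kernel/debug/o2net/send_tracking   > "$OUT/tracking.out"
  cat /sys/kernel/debug/o2net/sock_containers > "$OUT/sockets.out"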

This will give us enough info to see how we can improve the fs.

Sunil

Andrew Phillips wrote:
> Sunil,
>
>  Thanks for the response. During this, I spent a lot of time looking at this
> page:
>   http://oss.oracle.com/osswiki/OCFS2/Debugging
>
>  which is where Google told me to go for "ocfs2 lock debug". A
> short note saying that that information is old or applies to 1.2
> would be helpful, along with a pointer to the 1.4 user guide.
>
>   Having read the 1.4 guide, there are a few more things to try.
>
>   The guidance seems to be to kill the process that's holding the
> locks. If the process holding the lock is a zombie, that becomes
> a bit hard to do. Is there any way of reaching into ocfs2 and
> telling it to break the lock manually? We'll accept the consequences.
>
>    Or alternatively, if you've rebooted the system that holds the lock,
> would the other nodes reclaim the locks it held and carry on as normal?
>
>    Andy
>
>
>
> -----Original Message-----
> From: Sunil Mushran [mailto:sunil.mushran at oracle.com]
> Sent: Tue 02/09/2008 05:21
> To: Andrew Phillips
> Cc: ocfs2-users at oss.oracle.com; atp at tradefair.com
> Subject: Re: [Ocfs2-users] (no subject)
>  
> So in 1.4, we have a much improved debugging infrastructure for
> such issues. Check out the write-up on DLM debugging in the 1.4
> user's guide, in the chapter titled Notes.
>
> In short, you have correctly identified the lock resource. But we
> need to go a step further, get the info from the DLM, and see
> which node is holding onto the lock and why.
>
> Read the write-up, and if you have any questions, ping me.
>
> Sunil
>
> Andrew Phillips wrote:
>   
>> Hello,
>>
>>  We just experienced a hang that looks superficially very similar to 
>> http://www.mail-archive.com/ocfs2-users@oss.oracle.com/msg02359.html
>>
>>  There are 3 nodes in the cluster, running ocfs2-1.4.1 on RHEL 5.2. Versions and
>> uname output are in the attached text file, which also includes fs_locks dumps
>> and various other diagnostics.
>>
>> The lock-up happened when we were restarting a Java application that
>> was writing to the /journal directory, which was being read by another Java app
>> on a second node. Restarting the machine that the
>> JVM was running on did not help, indicating a locking issue.
>>
>> An ls of the directory hangs on the machine that was writing.
>> An ls on the machine that was reading initially worked. An rm command
>> on the reader then caused that to lock up as well.
>>
>> Here's an extract showing what they're waiting on.
>>
>>  2222 D    bash            ocfs2_wait_for_mask
>>  2282 Zl   java <defunct>  exit
>>  2567 Zl   java <defunct>  exit
>>  2736 D    ls              ocfs2_wait_for_mask
>>  2770 D    ls              ocfs2_wait_for_mask
>>
>> Andy
>>
>>  
>>
>>
>>
>
>
>



