[Ocfs2-users] fcntl exclusive lock implementation in ocfs2

Jeff Fookson jfookson at as.arizona.edu
Wed Apr 4 17:02:50 PDT 2007


I am currently testing ocfs2 for a two-node cluster that will run the
Cyrus imapd, and I am having problems that seem to be related to the
software occasionally blocking for a long time while waiting to get a
write lock via the 'fcntl' system call. I am aware that the current
ocfs2 supports neither a writable mmap nor a cluster-aware flock, so in
my tests all writes go to only one node of the cluster, and Cyrus is
configured so that none of the requisite databases needs a writable
'mmap' (i.e. all databases are skiplist, not Berkeley DB).
I am using drbd to make the disks on the two nodes behave as a shared
resource; as permitted by drbd version 8, the disks on both nodes are
drbd primaries and are mounted on their respective machines. I am
testing by having modest-size mail messages delivered to just one of
the machines at a rate of 1/sec. The system will run fine in this mode,
sometimes for days, but then gets hopelessly wedged, with many 'lmtpd'
processes waiting to get exclusive locks on the various Cyrus
databases. As the system approaches this deadlock condition, 'strace'
shows many seconds being spent in 'fcntl' waiting for the lock, and the
load average skyrockets because of all the waiting 'lmtpd' processes.
Since mail is being delivered at an essentially constant rate and there
is no other activity on the systems, I'm confused as to why the
machines often run for extended periods before suddenly getting into
this pathological state.
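
To make the symptom concrete, here is a minimal sketch of the kind of
blocking write-lock request described above, with a timer around it so
the wait can be logged. This is not the Cyrus code: it is Python, the
helper name timed_exclusive_lock is made up for illustration, and it
assumes that a whole-file exclusive lock is close enough to what
'lmtpd' requests. Python's fcntl.lockf() with LOCK_EX issues a blocking
fcntl lock request, the same call family that 'strace' shows.

```python
import fcntl
import os
import tempfile
import time


def timed_exclusive_lock(path):
    """Block until an exclusive fcntl lock on `path` is granted.

    Hypothetical helper for illustration only. Returns a tuple
    (fd, seconds_waited). The caller releases the lock with
    fcntl.lockf(fd, fcntl.LOCK_UN) and then closes fd.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    start = time.monotonic()
    # Without LOCK_NB this call blocks until the whole-file write lock
    # is granted, which is where the long waits would show up.
    fcntl.lockf(fd, fcntl.LOCK_EX)
    return fd, time.monotonic() - start


if __name__ == "__main__":
    # A temporary file stands in for a Cyrus database on the ocfs2 mount.
    with tempfile.NamedTemporaryFile() as tmp:
        fd, waited = timed_exclusive_lock(tmp.name)
        print("lock granted after %.6f seconds" % waited)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        os.close(fd)
```

Running something like this in a loop against a file on the ocfs2
volume, next to the mail delivery test, might show whether the lock
wait grows gradually or jumps all at once.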

I realize that my setup involves several complex layers (the full
storage stack is md -> drbd -> lvm -> ocfs2 -> Cyrus imapd), so I will
also consult the drbd and Cyrus mailing lists. Still, I'm hoping that
someone on this list has some insight into how fcntl-based locking is
implemented under ocfs2 that might point to what is causing the
deadlock after many days of running well.

Both machines are running CentOS 4.4 with a 2.6.19 kernel; the ocfs2
code is the version included in the kernel sources; drbd is version 8.0
and Cyrus is version 2.3.8.

Thank you for any thoughts on this matter.

Jeff Fookson

-- 
Jeffrey E. Fookson, PhD			Phone: (520) 621 3091
Support Systems Analyst, Principal	jfookson at as.arizona.edu
Steward Observatory
University of Arizona



