[Ocfs-users] Lock contention issue with ocfs

Jeremy Schneider jer1887 at asugroup.com
Thu Mar 11 13:24:45 CST 2004


FYI, I downloaded ocfs 1.0.10 from oss.oracle.com and tried it, but
couldn't even create a filesystem successfully: mkfs.ocfs completes,
yet fsck.ocfs immediately rejects the volume header.  (?!)

[root@dc1node1 /]# mkfs.ocfs -V
mkfs.ocfs 1.0.10-PROD1 Fri Mar  5 14:35:32 PST 2004 (build 902cb33b89695a48f0dd6517b713f949)
[root@dc1node1 /]# mkfs.ocfs -b 128 -F -g 0 -L dc1:/u03 -m /u03 -p 755 -u 0 /dev/sda
Cleared volume header sectors
Cleared node config sectors
Cleared publish sectors
Cleared vote sectors
Cleared bitmap sectors
Cleared data block
Wrote volume header
[root@dc1node1 /]# fsck.ocfs /dev/sda
fsck.ocfs 1.0.10-PROD1 Fri Mar  5 14:35:41 PST 2004 (build b5602eb387c7409e9f814faf1d363b5b)
Checking Volume Header...
ERROR: structure failed verification, fsck.c, 384
ocfs_vol_disk_hdr
=================================
minor_version: 2
major_version: 1
signature: OracleCFS
mount_point: /u03
serial_num: 0
device_size: 10737418240
start_off: 0
bitmap_off: 56320
publ_off: 23552
vote_off: 39936
root_bitmap_off: 0
data_start_off: 1368064
root_bitmap_size: 0
root_off: <INVALID VALUE> 0
root_size: 0
cluster_size: 131072
num_nodes: 32
num_clusters: 81905
dir_node_size: 0
file_node_size: 0
internal_off: <INVALID VALUE> 0
node_cfg_off: 4096
node_cfg_size: 17408
new_cfg_off: 21504
prot_bits: -rwxr-xr-x
uid: 0 (root)
gid: 0 (root)
excl_mount: OCFS_INVALID_NODE_NUM

ERROR: Volume header bad. Exiting, fsck.c, 669
/dev/sda: 2 errors, 0 objects, 0/81905 blocks
[root@dc1node1 /]#
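
For anyone comparing dumps like the one above across nodes or tool
versions, here is a minimal sketch that pulls the "name: value" fields
out of fsck.ocfs-style output and lists the ones tagged
<INVALID VALUE>. The function name parse_header_dump and the SAMPLE
text are illustrative only (SAMPLE is a few lines copied from the dump
above); this is not part of the ocfs tools, and it only reports what
fsck.ocfs already printed.

```python
import re

# Field lines in the dump look like "name: value" with a lowercase name.
FIELD_RE = re.compile(r"^([a-z_]+):\s*(.*)$")
INVALID_TAG = "<INVALID VALUE>"


def parse_header_dump(text):
    """Return (fields, invalid_names) parsed from a header dump."""
    fields, invalid = {}, []
    for line in text.splitlines():
        m = FIELD_RE.match(line.strip())
        if not m:
            continue  # banners, separators, prompts, and ERROR lines
        name, value = m.group(1), m.group(2).strip()
        if value.startswith(INVALID_TAG):
            invalid.append(name)
            value = value[len(INVALID_TAG):].strip()
        fields[name] = value
    return fields, invalid


# A few lines copied from the dump above, used only as sample input.
SAMPLE = """\
root_bitmap_size: 0
root_off: <INVALID VALUE> 0
root_size: 0
cluster_size: 131072
internal_off: <INVALID VALUE> 0
node_cfg_off: 4096
"""

if __name__ == "__main__":
    fields, invalid = parse_header_dump(SAMPLE)
    print("invalid fields:", ", ".join(invalid))
```

On the sample lines this reports root_off and internal_off, the same
two fields behind the "2 errors" count in the fsck.ocfs output.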



>>> Sunil Mushran <Sunil.Mushran at oracle.com> 03/10/2004 5:49:58 PM >>>
I hope that when you were reading the dirnode, etc. using debugocfs,
you were accessing the volume via the raw device. If you weren't, do
so. This is important because that's the only way to ensure direct I/O.
Otherwise, you will be reading potentially stale data from the buffer
cache.

Coming to the issue at hand: ls does not take an EXCLUSIVE_LOCK, and
all EXCLUSIVE_LOCKs are released when the operation is over. So I am
not sure what is happening. Using debugocfs correctly should help us
understand the problem.

Also, whenever you do your file operations (cat etc.), ensure those ops
are O_DIRECT. I am not sure why buffered I/O would cause this problem,
but do not do buffered operations. Also note that ocfs does not support
shared mmap.
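
As a rough illustration of what an O_DIRECT read involves, here is a
minimal Python sketch, assuming Linux and a 512-byte alignment
requirement (some devices need a larger block size). The helper names
aligned_span and read_direct are made up for this example and are not
part of ocfs or its tools. O_DIRECT generally requires the file offset,
the transfer length, and the buffer address to be block-aligned, so the
sketch widens the requested range to block boundaries and reads into a
page-aligned anonymous mmap.

```python
import mmap
import os


def aligned_span(offset, length, block=512):
    """Widen (offset, length) outward to multiples of `block`."""
    start = (offset // block) * block
    end = -(-(offset + length) // block) * block  # round up
    return start, end - start


def read_direct(path, offset, length, block=512):
    """Read up to `length` bytes at `offset` with O_DIRECT (Linux only)."""
    if length <= 0:
        return b""
    start, span = aligned_span(offset, length, block)
    buf = mmap.mmap(-1, span)  # anonymous mappings are page-aligned
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        # A single call may return fewer bytes than requested (short read).
        got = os.preadv(fd, [buf], start)
    finally:
        os.close(fd)
    skip = offset - start
    data = buf[skip:min(skip + length, got)]
    buf.close()
    return data
```

A call such as read_direct("/u03/somefile", 0, 4096) (a hypothetical
path) would then bypass the buffer cache, provided the filesystem
accepts O_DIRECT on that file.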

If you download the 1.0.10 tools, you will not need to manually map the
raw device. The tools do that automatically.

So, upgrade to the 1.0.10 module and tools, and see if you can reproduce
the problem.

Jeremy Schneider wrote:

>another note:
>
>after I delete the file I created that caused the
>OCFS_DLM_EXCLUSIVE_LOCK to be held, the lock doesn't seem to actually
>be released (according to debugocfs) until the other node attempts to
>read the DirNode.  (e.g. /bin/ls or something)
>
>Jeremy
>
>>>> "Jeremy Schneider" <jer1887 at asugroup.com> 03/10/2004 4:55:56 PM >>>
>I am still having this weird problem with nodes hanging while I'm
>running OCFS.  I'm using OCFS 1.0.9-12 and RHAS 2.1
>
>I've been working on tracking it down and here's what I've got so far:
>1. I create a file from node 0.  This succeeds; I can /bin/cat the
>file, append, edit, or whatever.
>2. From node 1, I do an operation that accesses the DirNode (e.g.
>/bin/ls)
>3. Node 0 immediately acquires an OCFS_DLM_EXCLUSIVE_LOCK on the
>DirNode itself (although I seem to still be able to *read* the
>DirNode from node 1)
>4. I attempt to create a file from node 1...  node 1 hangs, waiting
>for the exclusive lock on the DirNode to be released.
>*** node 1 is now completely dysfunctional.  OCFS is hung.
>5. I delete the file I created in step 1 (from node 0)
>6. The OCFS_DLM_EXCLUSIVE_LOCK is released.
>7. node 1 resumes, and creates a file
>
>8. I access the DirNode from node 0
>9. Node 1 immediately acquires an OCFS_DLM_EXCLUSIVE_LOCK on the
>DirNode itself...  the whole process repeats, but with the nodes
>reversed.
>
>This looks a lot like a bug to me.  I've had a case open with Oracle
>Support for it since the end of Feb, but at the moment BDE is too busy
>investigating some message about the local hard drive controller to
>consider that it might be a bug (and honestly, it probably doesn't
>involve my local hard drive controller).
>
>Anyone have any suggestions?
>
>Jeremy
>Lansing, MI


 
<<<<...>>>>

