[Ocfs-users] Lock contention issue with ocfs

Jeremy Schneider jer1887 at asugroup.com
Thu Mar 11 14:43:04 CST 2004


You're right.  Sorry, that was a little dense on my part.  It will be
nice, of course, when fsck just says "this volume has not been mounted
yet"...  but it is perfectly functional the way it is.  :)

Jeremy


>>> Sunil Mushran <Sunil.Mushran at oracle.com> 03/11/2004 1:37:32 PM >>>
It did create the filesystem. fsck is failing because the volume has never
been mounted on any node. On the very first mount, we create the system
files, which fsck does not find. Yes, we should create these system files
in mkfs; it's on our to-do list. Meanwhile, the next release of fsck will
not fail. :-)
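
Until then, the workaround is just to mount the volume once so the system
files get created. Roughly (a sketch, assuming the ocfs module is already
loaded, e.g. via load_ocfs, and /u03 is the mount point you gave mkfs):

    mkdir -p /u03
    # the first mount creates the system files
    mount -t ocfs /dev/sda /u03
    umount /u03
    # fsck should now get past the header check
    fsck.ocfs /dev/sda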

Jeremy Schneider wrote:

>FYI, I downloaded ocfs 1.0.10 from oss.oracle.com and tried it... 
>couldn't even successfully create a filesystem.  (?!)
>
>[root at dc1node1 /]# mkfs.ocfs -V
>mkfs.ocfs 1.0.10-PROD1 Fri Mar  5 14:35:32 PST 2004 (build
>902cb33b89695a48f0dd6517b713f949)
>[root at dc1node1 /]# mkfs.ocfs -b 128 -F -g 0 -L dc1:/u03 -m /u03 -p 755
>-u 0 /dev/sda
>Cleared volume header sectors
>Cleared node config sectors
>Cleared publish sectors
>Cleared vote sectors
>Cleared bitmap sectors
>Cleared data block
>Wrote volume header
>[root at dc1node1 /]# fsck.ocfs /dev/sda
>fsck.ocfs 1.0.10-PROD1 Fri Mar  5 14:35:41 PST 2004 (build
>b5602eb387c7409e9f814faf1d363b5b)
>Checking Volume Header...
>ERROR: structure failed verification, fsck.c, 384
>ocfs_vol_disk_hdr
>=================================
>minor_version: 2
>major_version: 1
>signature: OracleCFS
>mount_point: /u03
>serial_num: 0
>device_size: 10737418240
>start_off: 0
>bitmap_off: 56320
>publ_off: 23552
>vote_off: 39936
>root_bitmap_off: 0
>data_start_off: 1368064
>root_bitmap_size: 0
>root_off: <INVALID VALUE> 0
>root_size: 0
>cluster_size: 131072
>num_nodes: 32
>num_clusters: 81905
>dir_node_size: 0
>file_node_size: 0
>internal_off: <INVALID VALUE> 0
>node_cfg_off: 4096
>node_cfg_size: 17408
>new_cfg_off: 21504
>prot_bits: -rwxr-xr-x
>uid: 0 (root)
>gid: 0 (root)
>excl_mount: OCFS_INVALID_NODE_NUM
>
>ERROR: Volume header bad. Exiting, fsck.c, 669
>/dev/sda: 2 errors, 0 objects, 0/81905 blocks
>[root at dc1node1 /]#
>
>
>
>  
>
>>>>Sunil Mushran <Sunil.Mushran at oracle.com> 03/10/2004 5:49:58 PM >>>
>I hope that when you were reading the dirnode, etc. using debugocfs,
>you were accessing the volume via the raw device. If you weren't, do so.
>This is important because that's the only way to ensure directio;
>otherwise you will be reading potentially stale data from the buffer cache.
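>
>For reference, binding a raw device by hand looks roughly like this
>(the device names below are just examples):
>
>    # bind a free raw character device to the shared disk
>    raw /dev/raw/raw1 /dev/sda
>    # verify the binding
>    raw -qa
>
>and then point debugocfs at /dev/raw/raw1 rather than /dev/sda, so the
>reads bypass the buffer cache.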
>
>Coming to the issue at hand: ls does not take an EXCLUSIVE_LOCK,
>and all EXCLUSIVE_LOCKs are released when the operation is over.
>So I am not sure what is happening. Using debugocfs correctly should
>help us understand the problem.
>
>Also, whenever you do your file operations (cat etc.), ensure those ops
>are O_DIRECT. Now, I am not sure why this would cause a problem, but do
>not do buffered operations. ocfs does not support shared mmap.
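>
>A quick way to test that from the shell (just a sketch: iflag=direct
>needs a newer GNU dd than RHAS 2.1 ships, and the path below is only an
>example) is to read the file with direct I/O instead of a buffered cat:
>
>    # O_DIRECT read of the test file; the data just goes to /dev/null
>    dd if=/u03/testfile of=/dev/null iflag=direct bs=512k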
>
>If you download the 1.0.10 tools, you will not need to manually map the
>raw device; the tools do that automatically.
>
>So, upgrade to the 1.0.10 module and tools, and see if you can reproduce
>the problem.
>
>Jeremy Schneider wrote:
>
>  
>
>>another note:
>>
>>after I delete the file I created that caused the
>>OCFS_DLM_EXCLUSIVE_LOCK to be held, the lock doesn't seem to actually be
>>released (according to debugocfs) until the other node attempts to read
>>the DirNode.  (e.g. /bin/ls or something)
>>
>>Jeremy
>>
>>>>>"Jeremy Schneider" <jer1887 at asugroup.com> 03/10/2004 4:55:56 PM >>>
>>
>>I am still having this weird problem with nodes hanging while I'm
>>running OCFS.  I'm using OCFS 1.0.9-12 and RHAS 2.1
>>
>>I've been working on tracking it down and here's what I've got so far
>>(a rough shell version of the sequence is sketched after the list):
>>
>>1. I create a file from node 0.  This succeeds; I can /bin/cat the
>>file, append, edit, or whatever.
>>2. From node 1, I do an operation that accesses the DirNode (e.g.
>>/bin/ls)
>>3. Node 0 immediately acquires an OCFS_DLM_EXCLUSIVE_LOCK on the DirNode
>>itself (although I seem to still be able to *read* the DirNode from
>>node 1).
>>4. I attempt to create a file from node 1...  node 1 hangs, waiting for
>>the exclusive lock on the DirNode to be released.
>>*** node 1 is now completely dysfunctional.  OCFS is hung.
>>5. I delete the file I created in step 1 (from node 0).
>>6. The OCFS_DLM_EXCLUSIVE_LOCK is released.
>>7. Node 1 resumes and creates a file.
>>
>>8. I access the DirNode from node 0.
>>9. Node 1 immediately acquires an OCFS_DLM_EXCLUSIVE_LOCK on the DirNode
>>itself...  the whole process repeats, but with the nodes reversed.
>>
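>>Roughly, as shell commands (the prompts just show which node; the paths
>>and file names are only placeholders):
>>
>>    node0# touch /u03/testfile    # step 1: create a file on node 0
>>    node1# ls /u03                # step 2: node 1 reads the DirNode
>>    node1# touch /u03/otherfile   # step 4: hangs here
>>    node0# rm /u03/testfile       # step 5: node 1 unblocks and creates its file
>>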
>>This looks a lot like a bug to me.  I've had a case open with Oracle
>>Support for it since the end of Feb, but at the moment BDE is too busy
>>investigating some message about the local hard drive controller to
>>consider that it might be a bug (and honestly, it probably doesn't
>>involve my local hard drive controller).
>>
>>Anyone have any suggestions?
>>
>>Jeremy
>>Lansing, MI
>>    
>>
>
>
> 
><<<<...>>>>



