[Ocfs-users] Lock contention issue with ocfs

Villalovos, John L john.l.villalovos at intel.com
Thu Mar 11 10:35:27 CST 2004


> FYI, I downloaded ocfs 1.0.10 from oss.oracle.com and tried it... 
> couldn't even successfully create a filesystem.  (?!)

That is because you must mount it at least once before the file system
is completely created.
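
A minimal sketch of that workflow, reusing the mkfs.ocfs options from
your transcript below (untested outline; adjust the device and mount
point to your setup):

    mkfs.ocfs -b 128 -F -g 0 -L dc1:/u03 -m /u03 -p 755 -u 0 /dev/sda
    mount -t ocfs /dev/sda /u03    # first mount completes the on-disk initialization
    umount /u03
    fsck.ocfs /dev/sda             # the header should now pass verification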

There is code in the OCFS module which does some filesystem
initialization on the first mount.

I believe that this is going to be transitioned out of the OCFS2 module
and put into mkfs.  I am not sure how this will affect OCFS1.

John


> [root@dc1node1 /]# mkfs.ocfs -V
> mkfs.ocfs 1.0.10-PROD1 Fri Mar  5 14:35:32 PST 2004 (build
> 902cb33b89695a48f0dd6517b713f949)
> [root@dc1node1 /]# mkfs.ocfs -b 128 -F -g 0 -L dc1:/u03 -m /u03 -p 755 -u 0 /dev/sda
> Cleared volume header sectors
> Cleared node config sectors
> Cleared publish sectors
> Cleared vote sectors
> Cleared bitmap sectors
> Cleared data block
> Wrote volume header
> [root@dc1node1 /]# fsck.ocfs /dev/sda
> fsck.ocfs 1.0.10-PROD1 Fri Mar  5 14:35:41 PST 2004 (build
> b5602eb387c7409e9f814faf1d363b5b)
> Checking Volume Header...
> ERROR: structure failed verification, fsck.c, 384
> ocfs_vol_disk_hdr
> =================================
> minor_version: 2
> major_version: 1
> signature: OracleCFS
> mount_point: /u03
> serial_num: 0
> device_size: 10737418240
> start_off: 0
> bitmap_off: 56320
> publ_off: 23552
> vote_off: 39936
> root_bitmap_off: 0
> data_start_off: 1368064
> root_bitmap_size: 0
> root_off: <INVALID VALUE> 0
> root_size: 0
> cluster_size: 131072
> num_nodes: 32
> num_clusters: 81905
> dir_node_size: 0
> file_node_size: 0
> internal_off: <INVALID VALUE> 0
> node_cfg_off: 4096
> node_cfg_size: 17408
> new_cfg_off: 21504
> prot_bits: -rwxr-xr-x
> uid: 0 (root)
> gid: 0 (root)
> excl_mount: OCFS_INVALID_NODE_NUM
> 
> ERROR: Volume header bad. Exiting, fsck.c, 669
> /dev/sda: 2 errors, 0 objects, 0/81905 blocks
> [root@dc1node1 /]#
> 
> 
> 
> >>> Sunil Mushran <Sunil.Mushran at oracle.com> 03/10/2004 5:49:58 PM >>>
> I hope that when you were reading the dirnode, etc. using debugocfs,
> you were accessing the volume via the raw device. If you weren't, do so.
> This is important because that's the only way to ensure directio; otherwise
> you will be reading potentially stale data from the buffer cache.
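> 
> For example (the raw device number here is arbitrary, and the exact
> debugocfs invocation may differ by version, so check its usage output):
> 
>     raw /dev/raw/raw1 /dev/sda     # bind a raw device over the shared volume
>     raw -qa                        # confirm the binding
>     debugocfs /dev/raw/raw1        # reads now bypass the buffer cache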
> 
> Coming to the issue at hand: ls does not take an EXCLUSIVE_LOCK,
> and all EXCLUSIVE_LOCKs are released when the operation is over.
> So I am not sure what is happening. Using debugocfs correctly should
> help us understand the problem.
> 
> Also, whenever you do your file operations (cat, etc.), ensure those
> ops are o_direct. I am not sure why buffering would cause this problem,
> but do not do buffered operations; ocfs does not support shared mmap.
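> 
> One way to sanity-check an O_DIRECT read from the shell, assuming a
> coreutils dd new enough to support the direct flag (the file name is
> just an example; older systems need a small O_DIRECT test program
> instead):
> 
>     dd if=/u03/somefile of=/dev/null bs=4096 iflag=direct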
> 
> If you download the 1.0.10 tools, you will not need to manually map
> the raw device. The tools do that automatically.
> 
> So, upgrade to the 1.0.10 module and tools, and see if you can
> reproduce the problem.
> 
> Jeremy Schneider wrote:
> 
> >another note:
> >
> >after I delete the file I created that caused the
> >OCFS_DLM_EXCLUSIVE_LOCK to be held, the lock doesn't seem to actually
> >be released (according to debugocfs) until the other node attempts to
> >read the DirNode (e.g. /bin/ls or something).
> >
> >Jeremy
> >
> >>>> "Jeremy Schneider" <jer1887 at asugroup.com> 03/10/2004 4:55:56 PM >>>>
> >
> >I am still having this weird problem with nodes hanging while I'm
> >running OCFS. I'm using OCFS 1.0.9-12 and RHAS 2.1.
> >
> >I've been working on tracking it down, and here's what I've got so far
> >(a shell sketch of these steps follows the list):
> >1. I create a file from node 0. This succeeds; I can /bin/cat the
> >file, append, edit, or whatever.
> >2. From node 1, I do an operation that accesses the DirNode (e.g.
> >/bin/ls).
> >3. Node 0 immediately acquires an OCFS_DLM_EXCLUSIVE_LOCK on the
> >DirNode itself (although I seem to still be able to *read* the
> >DirNode from node 1).
> >4. I attempt to create a file from node 1... node 1 hangs, waiting
> >for the exclusive lock on the DirNode to be released.
> >*** node 1 is now completely dysfunctional. OCFS is hung.
> >5. I delete the file I created in step 1 (from node 0).
> >6. The OCFS_DLM_EXCLUSIVE_LOCK is released.
> >7. Node 1 resumes, and creates a file.
> >
> >8. I access the DirNode from node 0.
> >9. Node 1 immediately acquires an OCFS_DLM_EXCLUSIVE_LOCK on the
> >DirNode itself... the whole process repeats, but with the nodes
> >reversed.
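> >
> >In shell terms, the sequence is roughly (mount point and file names
> >are illustrative):
> >
> >    node0$ echo test > /u03/file1   # step 1: create a file on node 0
> >    node1$ ls /u03                  # step 2: node 1 reads the DirNode
> >    node1$ touch /u03/file2         # step 4: node 1 hangs on the DirNode lock
> >    node0$ rm /u03/file1            # step 5: lock released, node 1 resumes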
> >
> >This looks a lot like a bug to me. I've had a case open with Oracle
> >Support for it since the end of Feb, but at the moment BDE is too busy
> >investigating some message about the local hard drive controller to
> >consider that it might be a bug (and honestly, it probably doesn't
> >involve my local hard drive controller).
> >
> >Anyone have any suggestions?
> >
> >Jeremy
> >Lansing, MI
> 
> 

