[Ocfs-users] A couple more minor questions about OCFS and RHEL3

Wim Coekaerts wim.coekaerts at oracle.com
Sat Mar 6 22:35:52 CST 2004


heh...

>  Our cluster has been stable since we installed RAC, but a few minor issues
> have me concerned.  First, our storage array seems to maintain continuous
> low-level activity even when the database is shut down.  The CPUs spend a
> modest amount of time in iowait state while this is going on.  I figure this
> might be related to the I/O fencing and inter-node communication features of
> OCFS, but I want to verify that this is expected.

OCFS does about 1 KB write and 32 KB read per second per mounted volume.
Even if nothing else is going on, each node writes its heartbeat and
reads everyone else's (32 sectors' worth).
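As a rough sanity check, that idle traffic scales with the number of mounted volumes. A minimal sketch of the arithmetic, using the per-volume figures above (the volume count here is hypothetical):

```shell
#!/bin/sh
# Estimate OCFS idle heartbeat traffic from the figures quoted above:
# roughly 1 KB written and 32 KB read per second, per mounted volume.
VOLUMES=4                        # hypothetical number of mounted OCFS volumes
WRITE_KB=$(( VOLUMES * 1 ))
READ_KB=$(( VOLUMES * 32 ))
echo "expected idle I/O: ${WRITE_KB} KB/s written, ${READ_KB} KB/s read"
```

Comparing that estimate against iostat output on an otherwise idle node should tell you whether the background activity you see is just the heartbeat.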

>  Next, I saw a Metalink thread which suggests that async I/O is not
> supported on OCFS with RHAS 2.1.  It doesn't say anything about RHEL3.
> We've been using async in our testing with no problems so far, and plan to
> use it in production unless Oracle feels the combination is not yet
> trustworthy.

Well - tough one. It works, but the big issue is that your redo logfiles
need to be contiguous on disk, otherwise you might see failures; the
exact same goes for RHEL3 as for RHAS 2.1. You can check by running
debugocfs, e.g. for /ocfs/log/foo1.dbf:
debugocfs -f /log/foo1.dbf /dev/sdXXX
That will show how many offsets there are in the extents (there should
only be one). If it's more than one, dd the file over with a very large
blocksize and see if it ends up being one contiguous file.
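The dd step above can be sketched like this. This is a minimal, self-contained illustration on a throwaway file in /tmp; the real run would target the redo log on the OCFS volume, and you'd re-run debugocfs afterwards to confirm a single extent (the 64M blocksize is illustrative, not prescriptive):

```shell
#!/bin/sh
# Re-copy a file in one pass with a large blocksize, so the filesystem
# gets a chance to allocate the new copy as one contiguous extent.
# /tmp/foo1.dbf stands in for the real redo log on the OCFS volume.
SRC=/tmp/foo1.dbf
printf 'redo log contents' > "$SRC"           # stand-in data
dd if="$SRC" of="${SRC}.contig" bs=64M 2>/dev/null
cmp -s "$SRC" "${SRC}.contig" && echo "copy verified"
```

After swapping the new copy into place, the debugocfs check from above is how you verify it really came out as one extent.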

If you do that, everything should work. However, there just hasn't been
enough real testing with aio; we need to gather more evidence.

The reason the logfiles are annoying is the way aio is implemented and
how we call it: it cannot handle short I/Os or non-contiguous aio
submits.

>  The last issue is that sometimes we see messages such as the following
> from dmesg:
> 
> (11637) ERROR: status = -16, Common/ocfsgendlm.c, 1220
> (11637) ERROR: status = -16, Common/ocfsgendlm.c, 1285
> (11637) ERROR: status = -16, Common/ocfsgendlm.c, 1586
> (11637) ERROR: status = -16, Common/ocfsgencreate.c, 1027
> (11637) ERROR: status = -16, Common/ocfsgencreate.c, 1770
> (12717) ERROR: status = -16, Common/ocfsgendlm.c, 1220
> (12717) ERROR: status = -16, Common/ocfsgendlm.c, 1285
> (12717) ERROR: status = -16, Common/ocfsgendlm.c, 1586
> (12717) ERROR: status = -16, Common/ocfsgencreate.c, 1027
> (12717) ERROR: status = -16, Common/ocfsgencreate.c, 1770
> (12717) ERROR: status = -16, Common/ocfsgendlm.c, 1220
> (12717) ERROR: status = -16, Common/ocfsgendlm.c, 1285
> (12717) ERROR: status = -16, Common/ocfsgendlm.c, 1586
> (12717) ERROR: status = -16, Common/ocfsgencreate.c, 1027
> (12717) ERROR: status = -16, Common/ocfsgencreate.c, 1770
> 
>  I think these mostly come up around boot time, so maybe they're related to
> mounting cluster filesystems when the other node is down.  The messages do
> not come continuously, and the systems behave properly, so I'm just trying
> to make sure that this isn't the sign of some subtle error.

Hmm, I'd have to look at the code for this one. Status -16 is EBUSY; it
sounds like the DLM trying to get access to a file that's in use.
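For reference, the -16 in those messages is a negated errno. A quick way to map it back to its symbolic name (the header path is the usual Linux location and may differ on other systems):

```shell
#!/bin/sh
# "status = -16" is a negated errno: 16 is EBUSY ("Device or resource
# busy"), which fits a DLM lock request finding the file already in use.
STATUS=-16
ERRNO=$(( -STATUS ))
echo "errno $ERRNO"
# Best-effort lookup of the symbolic name; the path is the common Linux one.
grep -w "$ERRNO" /usr/include/asm-generic/errno-base.h 2>/dev/null || true
```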

You know, when things are serious you really ought to call support;
don't rely on this mailing list for production problems ;) mileage may
vary ;)


