[Ocfs2-users] Recommendations for OCFS2 issues

Tue May 6 09:27:57 PDT 2014

We have been using OCFS2 for a couple of years and have had a number of
issues pop up. Some of them seem resolved, but we are still concerned
because the systems still seem a bit fragile.

Several times we have had various OCFS2 volumes become unresponsive or
slow.  We have also run into the "wants too many credits" error a few
times, which seems to have been fixed by increasing the journal size on
the volume causing the issue.  At 256MB I may have made the journals
bigger than they really need to be, but I want to avoid the credits
problem.  The slowness/unresponsiveness issues seem to have been
solved by increasing the cluster size (especially on largish volumes).
But there are still a few concerns.

The major concern is that when a volume becomes unresponsive, it causes
a cascade effect: servers that simply have that volume NFS mounted,
but are not using it, will have problems because commands like df will
hang on that volume.  I know that the NFS server is trying to return the
current free space for the volume, but cannot get it because the volume
is unresponsive.  However, I think it would be better if a cached
version of the free space could be returned instead when the volume is
unresponsive.
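
To illustrate the cached-free-space idea, here is a minimal Python
sketch. It is not something the NFS server or OCFS2 does; it is a
user-space wrapper that asks for the free space with a time limit and
falls back to the last good answer. The function name free_bytes, the
cache, and the 5-second default are all made up for the example.

```python
import os
import threading
import time

_cache = {}  # path -> (timestamp, free_bytes) from the last good answer


def free_bytes(path, timeout=5.0, statvfs=os.statvfs):
    """Return (free_bytes, age_seconds) for the filesystem at `path`.

    age_seconds is 0.0 for a fresh answer. If statvfs() does not answer
    within `timeout`, the last cached value is returned with its age.
    Raises TimeoutError when there is no cached value to fall back on.
    """
    result = {}

    def worker():
        try:
            st = statvfs(path)
            result["free"] = st.f_bavail * st.f_frsize
        except OSError as exc:
            result["error"] = exc

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)  # stop waiting; the worker itself may stay blocked

    if "error" in result:
        raise result["error"]
    if "free" in result:
        _cache[path] = (time.time(), result["free"])
        return result["free"], 0.0
    if path in _cache:
        stamp, free = _cache[path]
        return free, time.time() - stamp
    raise TimeoutError(f"statvfs({path!r}) timed out, nothing cached")
```

A caller would use it as free, age = free_bytes("/mnt/vol1"), where the
mount point is only an example. One caveat of this approach: a worker
thread blocked on a hung mount stays blocked, so repeated calls against
a dead volume leave stuck threads behind.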

When a server does hang a volume (probably locks), what is the best
procedure to find the server that is causing the issue and the root
cause of the problem?  I have the scanlocks scripts, and have gotten
better at determining which server is the problem and, to some extent,
the program or directory, but to me it still is not an exact science.
Are there any suggestions about the best way to do this?  Ideally, it
would be nice if I could get the systems to detect this on their own and
either fence themselves or reboot.
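
The self-detection idea could look roughly like the Python sketch
below: probe each mount with a time limit, count consecutive failures,
and call an action once a threshold is reached. Everything here is
hypothetical (the names probe and HangWatchdog, the thresholds, and the
on_hang callback); what the callback does, such as fencing or
rebooting, is left to the site and is not an OCFS2 feature.

```python
import os
import threading
import time


def probe(path, timeout, statvfs=os.statvfs):
    """Return True if statvfs(path) answers within `timeout` seconds."""
    done = threading.Event()

    def worker():
        try:
            statvfs(path)
            done.set()
        except OSError:
            pass  # an error is treated the same as no answer

    threading.Thread(target=worker, daemon=True).start()
    return done.wait(timeout)


class HangWatchdog:
    """Call on_hang(path) after max_failures consecutive failed probes."""

    def __init__(self, paths, on_hang, max_failures=3, timeout=10.0,
                 probe_fn=probe):
        self.paths = list(paths)
        self.on_hang = on_hang
        self.max_failures = max_failures
        self.timeout = timeout
        self.probe_fn = probe_fn
        self.failures = {p: 0 for p in self.paths}

    def check_once(self):
        for path in self.paths:
            if self.probe_fn(path, self.timeout):
                self.failures[path] = 0  # any good answer resets the count
            else:
                self.failures[path] += 1
                if self.failures[path] >= self.max_failures:
                    self.on_hang(path)

    def run(self, interval=30.0):
        while True:
            self.check_once()
            time.sleep(interval)
```

Requiring several consecutive failures is meant to avoid reacting to a
single slow response. This only notices that a mount has stopped
answering on the local node; it does not say which node holds the lock.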

Any help would be appreciated.

Thanks,

Andy