[Ocfs2-users] ocfs2 hangs during webserver usage

Mon Jan 26 12:31:18 PST 2009

Well, it does pretty much make the system (or at least anything doing 
I/O to the volume) unresponsive, but it typically recovers after 10-15 
seconds.  I guess that is considered a "temporary slowdown" 
rather than a hang?

Yes, the log files are being written to the OCFS2 volume, and are 
actually being written to by both nodes in parallel.  I did much 
testing on this before going into production and never saw any 
problems or slowdowns, even on much less powerful systems.  And, as I 
mentioned, there were no problems on these systems for over a month 
in production (same load all along).

I do wonder if the 1.4 release would be any better for my situation, 
and would like to put it on my test environment first, of 
course.  However, I am making use of the CDSL feature that was 
removed after 1.2.x, so I will have to figure out some way to 
accomplish the desired configuration without CDSLs before I can 
upgrade.

The problem is continuing, and getting really annoying as it's 
tripping up our monitoring system like crazy.  Is there anything else 
I can try to get more details about what is going on to help find a 
solution?  Are there any parameters that could be tweaked to account 
for the fact that there is a steady stream of small writes from all 
nodes to this volume?

At 04:28 PM 1/23/2009, Sunil Mushran wrote:
>The two issues are different. For starters, the issue in bugzilla#882
>was a hang. Not a temporary slowdown. And the stack showed it was related
>to flock() that ocfs2 1.2 did not support. Well, not cluster-aware flock().
>Support for clustered flock() was added in Kernel 2.6.25-ish / ocfs2 1.4.
>
>Are the apache logs also hitting the iscsi storage? If so, one explanation
>could be that the log flush is saturating the network. That would cause
>the ioutil to jump higher affecting the httpd ios.
>
>
>David Johle wrote:
>>[snip]
>>The cluster has been in a production environment for about 1.5 
>>months now, and just in the past week it has started to have 
>>problems.  The user experience is an occasional lag of 5 to 15 
>>seconds, after which everything appears normal.  Digging deeper 
>>into the problem I had narrowed it to an I/O issue, and iostat 
>>shows near 100% utilization on said device during the lag.  Once it 
>>clears the utilization is back down to a consistent 0-5% 
>>average.  Also, when the lag is happening, a process listing shows 
>>the affected processes in the D state.
>>[snip]
>>    PID STAT COMMAND         WIDE-WCHAN-COLUMN
>>   8511 D    cronolog        ocfs2_wait_for_status_completion
>>   8510 D    cronolog        ocfs2_wait_for_status_completion
>>[snip]
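
For anyone wanting to capture the same kind of process listing 
automatically while a lag is in progress, here is a minimal Python 
sketch.  It is an illustration only, not part of any OCFS2 tooling: 
the function names d_state_rows and snapshot are hypothetical, and it 
assumes a procps-style ps that accepts "-eo pid,stat,comm,wchan:40".  
It also assumes command names contain no spaces.

```python
import subprocess


def d_state_rows(ps_text):
    """Return (pid, stat, command, wchan) tuples for every process whose
    STAT column starts with 'D' (uninterruptible sleep, usually I/O wait)."""
    rows = []
    for line in ps_text.splitlines():
        # Split into at most four fields: pid, stat, command, wchan.
        fields = line.split(None, 3)
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip the header line and anything malformed
        pid, stat, comm = fields[0], fields[1], fields[2]
        wchan = fields[3].strip() if len(fields) > 3 else ""
        if stat.startswith("D"):
            rows.append((int(pid), stat, comm, wchan))
    return rows


def snapshot():
    """Run ps once and return the D-state rows (assumes procps ps)."""
    out = subprocess.run(
        ["ps", "-eo", "pid,stat,comm,wchan:40"],
        capture_output=True, text=True, check=True,
    ).stdout
    return d_state_rows(out)
```

Calling snapshot() in a loop every second or so, and logging the 
result with a timestamp to a local (non-OCFS2) disk, would show which 
kernel wait channel the stuck processes are sitting in each time the 
utilization spikes.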