[Ocfs2-users] ocfs2 hangs during webserver usage

Mon Jan 26 14:17:40 PST 2009

Well, something must have changed. Is the logfile growing in size?
Is the volume close to full? Are there a lot of files in a directory?
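
Those three questions can be checked by hand with df, ls and du. As a rough sketch only, the same checks might look like this in Python. The function names here are made up for illustration and are not part of any ocfs2 tooling. Note that walking a large tree adds I/O of its own, so it is best run outside of a slowdown.

```python
import os
import shutil


def volume_usage_percent(path):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def largest_directories(root, top=5):
    """Return (entry_count, directory) for the `top` fullest directories under `root`."""
    counts = []
    for dirpath, dirnames, filenames in os.walk(root):
        counts.append((len(dirnames) + len(filenames), dirpath))
    counts.sort(reverse=True)  # highest entry count first
    return counts[:top]


def file_size_bytes(path):
    """Return the current size of a file, e.g. a log that may be growing."""
    return os.stat(path).st_size
```

For example, calling volume_usage_percent() and largest_directories() on the mount point of the volume, and file_size_bytes() on the active log file a few minutes apart, would answer all three questions.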

If the ioutil is high, the io subsystem is being saturated. That's a
given. The question is what is saturating it. If the webserver load has
not changed, then the fs itself must be contributing to the load.

Run strace -T on one of the processes (strace -T -p <pid>). It shows the
time spent in each syscall, which should narrow down the scope of the
problem.
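
With strace -T, each completed syscall line ends with the time spent in that call in angle brackets, e.g. <0.000045>. A minimal sketch for picking the slow calls out of a saved trace could look like the following. The function name and the 1 second default threshold are assumptions for illustration, and the pattern only handles plain lines plus the "[pid N]" and bare pid prefixes that strace prints when following multiple processes.

```python
import re

# Matches a completed syscall line ending in "<seconds>", as printed by
# strace -T, with an optional "[pid N] " or "N " prefix in front of it.
LINE_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(.*<(\d+\.\d+)>\s*$")


def slow_syscalls(lines, threshold=1.0):
    """Return (seconds, syscall, line) for calls taking at least `threshold` seconds, slowest first."""
    hits = []
    for line in lines:
        text = line.strip()
        match = LINE_RE.match(text)
        if match is None:
            continue  # signals, unfinished/resumed calls, exit messages
        seconds = float(match.group(2))
        if seconds >= threshold:
            hits.append((seconds, match.group(1), text))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits
```

Something like "strace -T -p <pid> -o trace.txt" during a slowdown, followed by slow_syscalls(open("trace.txt")), would then list the calls that account for the multi-second stalls.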

Sunil

David Johle wrote:
> Well it does pretty much make the system (or at least anything doing 
> I/O to the volume) unresponsive, but it does recover after 10-15 
> seconds typically.  I guess that is considered a "temporary slowdown" 
> rather than a hang?
>
> Yes, the log files are being written to the OCFS2 volume, and are 
> actually being written to by both nodes in parallel.  I did much 
> testing on this before going into production and never saw any 
> problems or slowdowns, even on much less powerful systems.  And, as I 
> mentioned, there were no problems on these systems for over a month in 
> production (same load all along).
>
> I do wonder if the 1.4 release would be any better for my situation, 
> and would like to put it on my test environment first of course.  
> However, I do have an issue in that I am making use of the CDSL 
> feature that was removed after 1.2.x, and thus I will have to figure 
> some way to accomplish the desired configuration w/o them before I can 
> upgrade.
>
> The problem is continuing, and getting really annoying as it's
> tripping up our monitoring system like crazy.  Is there anything else
> I can try doing to get more details about what is going on to help
> find a solution?  Any parameters that could be tweaked to account for
> the fact that there is a steady stream of small writes from all nodes
> to this volume?
>
>
>
> At 04:28 PM 1/23/2009, Sunil Mushran wrote:
>> The two issues are different. For starters, the issue in bugzilla#882
>> was a hang. Not a temporary slowdown. And the stack showed it was
>> related to flock() that ocfs2 1.2 did not support. Well, not
>> cluster-aware flock(). Support for clustered flock() was added in
>> Kernel 2.6.25-ish / ocfs2 1.4.
>>
>> Are the apache logs also hitting the iscsi storage? If so, one
>> explanation could be that the log flush is saturating the network.
>> That would cause the ioutil to jump higher, affecting the httpd ios.
>>
>>
>> David Johle wrote:
>>> [snip]
>>> The cluster has been in a production environment for about 1.5
>>> months now, and just in the past week it has started to have
>>> problems.  The user experience is an occasional lag of 5 to 15
>>> seconds, after which everything appears normal.  Digging deeper into 
>>> the problem I had narrowed it to an I/O issue, and iostat shows near 
>>> 100% utilization on said device during the lag.  Once it clears the 
>>> utilization is back down to a consistent 0-5% average.  Also, when 
>>> the lag is happening, a process listing shows the affected processes 
>>> in the D state.
>>> [snip]
>>>    PID STAT COMMAND         WIDE-WCHAN-COLUMN
>>>   8511 D    cronolog        ocfs2_wait_for_status_completion
>>>   8510 D    cronolog        ocfs2_wait_for_status_completion
>>> [snip]