[Ocfs2-users] ocfs2 hangs during webserver usage
Sunil Mushran
sunil.mushran at oracle.com
Mon Jan 26 14:17:40 PST 2009
Well, something must have changed. Is the logfile growing in size?
Is the volume close to full? Are there a lot of files in a directory?
If the I/O utilization is high, the I/O subsystem is being saturated. That's a
given. The question is what is causing it. If the webserver load has not
changed, then the fs itself must be contributing to the load.
Run strace -T on one of the affected processes. That should narrow down the
scope of the problem.
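
For what it's worth, here is a rough sketch of how the strace output could
be sifted once it has been captured. It assumes the trace was written to a
file with something like "strace -T -p <pid> -o <file>" (the file name in the
usage note is made up), and that each completed call ends with the elapsed
time in angle brackets, which is what -T appends. The function name and the
one-second threshold are illustrative choices, not anything ocfs2-specific.

```python
import re

# With -T, strace appends the time spent in each syscall as "<seconds>"
# at the end of the line. An optional leading pid (as written by -f)
# is tolerated. Lines that do not fit this shape, such as
# "<unfinished ...>" or "+++ exited +++", are skipped.
LINE_RE = re.compile(
    r'^(?:\[pid\s+\d+\]\s+|\d+\s+)?'   # optional pid prefix
    r'(?P<name>\w+)\('                 # syscall name and opening paren
    r'.*<(?P<secs>\d+\.\d+)>\s*$'      # elapsed time at end of line
)

def slow_syscalls(lines, threshold=1.0):
    """Return (syscall, seconds, raw_line) for calls taking >= threshold seconds."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue
        secs = float(m.group("secs"))
        if secs >= threshold:
            hits.append((m.group("name"), secs, line.rstrip()))
    # Slowest calls first, since those are the ones worth reading in full.
    hits.sort(key=lambda h: h[1], reverse=True)
    return hits
```

Usage would be along the lines of slow_syscalls(open("cronolog.strace")),
then looking at which calls (write, open, fsync, ...) and which file
descriptors show up in the multi-second entries.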
Sunil
David Johle wrote:
> Well it does pretty much make the system (or at least anything doing
> I/O to the volume) unresponsive, but it does recover after 10-15
> seconds typically. I guess that is considered a "temporary slowdown"
> rather than a hang?
>
> Yes, the log files are being written to the OCFS2 volume, and are
> actually being written to by both nodes in parallel. I did much
> testing on this before going into production and never saw any
> problems or slowdowns, even on much less powerful systems. And, as I
> mentioned, there were no problems on these systems for over a month in
> production (same load all along).
>
> I do wonder if the 1.4 release would be any better for my situation,
> and would like to put it on my test environment first of course.
> However, I do have an issue in that I am making use of the CDSL
> feature that was removed after 1.2.x, and thus I will have to figure
> some way to accomplish the desired configuration w/o them before I can
> upgrade.
>
> The problem is continuing, and getting really annoying as it's
> tripping up our monitoring system like crazy. Is there anything else
> I can try to get more details about what is going on to help find a
> solution? Any parameters that could be tweaked to account for the
> fact that there is a steady stream of small writes from all nodes to
> this volume?
>
>
>
> At 04:28 PM 1/23/2009, Sunil Mushran wrote:
>> The two issues are different. For starters, the issue in bugzilla#882
>> was a hang. Not a temporary slowdown. And the stack showed it was related
>> to flock() that ocfs2 1.2 did not support. Well, not cluster-aware flock().
>> Support for clustered flock() was added in Kernel 2.6.25-ish / ocfs2 1.4.
>>
>> Are the apache logs also hitting the iSCSI storage? If so, one explanation
>> could be that the log flush is saturating the network. That would cause
>> the I/O utilization to jump higher, affecting the httpd I/Os.
>>
>>
>> David Johle wrote:
>>> [snip]
>>> The cluster has been in a production environment for about 1.5
>>> months now, and just in the past week it has started to have
>>> problems. The user experience is an occasional lag of 5 to 15
>>> seconds, after which everything appears normal. Digging deeper into
>>> the problem I had narrowed it to an I/O issue, and iostat shows near
>>> 100% utilization on said device during the lag. Once it clears the
>>> utilization is back down to a consistent 0-5% average. Also, when
>>> the lag is happening, a process listing shows the affected processes
>>> in the D state.
>>> [snip]
>>> PID STAT COMMAND WIDE-WCHAN-COLUMN
>>> 8511 D cronolog ocfs2_wait_for_status_completion
>>> 8510 D cronolog ocfs2_wait_for_status_completion
>>> [snip]
>
>
>