[Ocfs2-users] ocfs2 hangs during webserver usage
David Johle
djohle at industrialinfo.com
Mon Jan 26 12:31:18 PST 2009
Well, it does pretty much make the system (or at least anything doing
I/O to the volume) unresponsive, but it does recover after 10-15
seconds typically. I guess that is considered a "temporary slowdown"
rather than a hang?
Yes, the log files are being written to the OCFS2 volume, and are
actually being written to by both nodes in parallel. I did much
testing on this before going into production and never saw any
problems or slowdowns, even on much less powerful systems. And, as I
mentioned, there were no problems on these systems for over a month
in production (same load all along).
I do wonder if the 1.4 release would be any better for my situation,
and would like to put it on my test environment first of
course. However, I do have an issue in that I am making use of the
CDSL feature that was removed after 1.2.x, and thus I will have to
figure some way to accomplish the desired configuration w/o them
before I can upgrade.
The problem is continuing, and getting really annoying as it's
tripping up our monitoring system like crazy. Is there anything else
I can try to get more details about what is going on and help find a
solution? Are there any parameters that could be tweaked to account
for the fact that there is a steady stream of small writes from all
nodes to this volume?
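
One way to gather more detail during a lag is to record which processes are in the D state and what kernel function they are waiting in, similar to the ps listing quoted further down. The following is only a minimal sketch, assuming a Linux /proc filesystem where /proc/<pid>/stat and /proc/<pid>/wchan are readable (wchan may be empty or "0" without root privileges). The function names parse_stat_line, blocked_processes and watch are made up for illustration and are not part of any OCFS2 tool.

```python
import os
import time


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # parentheses, so split around the LAST ')' rather than on whitespace.
    left, _, right = line.rpartition(")")
    pid_str, _, comm = left.partition("(")
    state = right.split()[0]
    return int(pid_str), comm, state


def blocked_processes(proc_root="/proc"):
    """List (pid, comm, wchan) for every process in uninterruptible sleep."""
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or stat was unreadable
        if state != "D":
            continue
        try:
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip() or "-"
        except OSError:
            wchan = "?"  # wchan not available to this user
        results.append((pid, comm, wchan))
    return sorted(results)


def watch(iterations=10, interval=1.0):
    """Print a timestamped line per blocked process, once per interval."""
    for _ in range(iterations):
        stamp = time.strftime("%H:%M:%S")
        for pid, comm, wchan in blocked_processes():
            print(stamp, pid, "D", comm, wchan)
        time.sleep(interval)
```

Running watch() on each node while the lag is happening, and comparing the timestamps with the iostat output, would show whether the same wait function appears every time and whether both nodes stall together.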
At 04:28 PM 1/23/2009, Sunil Mushran wrote:
>The two issues are different. For starters, the issue in bugzilla#882
>was a hang. Not a temporary slowdown. And the stack showed it was related
>to flock() that ocfs2 1.2 did not support. Well, not cluster-aware flock().
>Support for clustered flock() was added in Kernel 2.6.25-ish / ocfs2 1.4.
>
>Are the apache logs also hitting the iscsi storage? If so, one explanation
>could be that the log flush is saturating the network. That would cause
>the I/O utilization to jump higher, affecting the httpd I/Os.
>
>
>David Johle wrote:
>>[snip]
>>The cluster has been in a production environment for about 1.5
>>months now, and just in the past week it has started to have
>>problems. The user experience is an occasional lag of 5 to 15
>>seconds, after which everything appears normal. Digging deeper
>>into the problem I had narrowed it to an I/O issue, and iostat
>>shows near 100% utilization on said device during the lag. Once it
>>clears the utilization is back down to a consistent 0-5%
>>average. Also, when the lag is happening, a process listing shows
>>the affected processes in the D state.
>>[snip]
>>  PID STAT COMMAND  WIDE-WCHAN-COLUMN
>> 8511 D    cronolog ocfs2_wait_for_status_completion
>> 8510 D    cronolog ocfs2_wait_for_status_completion
>>[snip]