[Ocfs2-users] ocfs2 hangs during webserver usage
David Johle
djohle at industrialinfo.com
Thu Jan 22 16:35:16 PST 2009
I have been running a 2 node cluster of web servers handling around
800,000 - 1,200,000 requests on a typical weekday. The systems are
each dual quad core xeon w/4GB ram, and generally maintain a fairly
low load average. The servers are sharing a device via iSCSI SAN for
the OCFS2 volume. They are running RHEL5 U2 (2.6.18-92.1.10.el5) and
OCFS2 1.2.9-1 binaries of the matching kernel version.
The cluster has been in a production environment for about 1.5 months
now, and just in the past last week it has started to have
problems. The user experience is an occasional lag of 5 to 15
seconds, after which everything appears normal. Digging deeper into
the problem I had narrowed it to an I/O issue, and iostat shows near
100% utilization on said device during the lag. Once it clears the
utilization is back down to a consistent 0-5% average. Also, when
the lag is happening, a process listing shows the affected processes
in the D state.
After some reading of the mailing list archives and much google
searching, I found this old thread which seems quite similar:
http://oss.oracle.com/pipermail/ocfs2-users/2007-May/001643.html
And the corresponding bug from this same author:
http://oss.oracle.com/bugzilla/show_bug.cgi?id=882
I see it has been a long time since this bug was last updated, but
there seems to have been no [documented] resolution.
I wrote a small script which hits a very simple web page at 10 second
intervals. If the server responds in less than 1 second it outputs a
dot; if the page takes more than 1 second to load it outputs a line
indicating how long it took, which node processed it, and a
timestamp. Here is some sample output over the course of about an hour:
................
Real time of 1 for wolf at Thu Jan 22 16:35:32 CST 2009
............................................................
Real time of 6 for cow at Thu Jan 22 16:45:52 CST 2009
.....
Real time of 7 for cow at Thu Jan 22 16:46:59 CST 2009
.................................................................................
Real time of 12 for wolf at Thu Jan 22 17:00:54 CST 2009
...................................
Real time of 3 for wolf at Thu Jan 22 17:06:58 CST 2009
....................
Real time of 13 for cow at Thu Jan 22 17:10:42 CST 2009
...
Real time of 11 for cow at Thu Jan 22 17:11:33 CST 2009
.......................
Real time of 9 for cow at Thu Jan 22 17:15:43 CST 2009
.......................................
Real time of 2 for wolf at Thu Jan 22 17:22:27 CST 2009
..
Real time of 8 for wolf at Thu Jan 22 17:23:05 CST 2009
.............................
Real time of 10 for wolf at Thu Jan 22 17:28:17 CST 2009
....................................
Real time of 14 for wolf at Thu Jan 22 17:34:42 CST 2009
There are no real patterns here, but the output is interesting
nonetheless -- it does show that it is intermittent, but quite frequent.
I tried using the "ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN"
command to see what processes were hanging on each system during the
time of the lags. Like Stephan's output, I see a few processes in
the D (uninterruptable sleep) state, which makes sense if they are
performing I/O. And also like Stephan, one node shows the "-"
output, and the other shows "ocfs2_wait_for_status_completion" as the function.
node cow:
PID STAT COMMAND WIDE-WCHAN-COLUMN
8511 D cronolog ocfs2_wait_for_status_completion
8510 D cronolog ocfs2_wait_for_status_completion
node wolf:
PID STAT COMMAND WIDE-WCHAN-COLUMN
9439 D cronolog -
In my case, however, it was not the httpd process that was listed,
but rather cronolog (which is what I use with apache for automatic
daily log rotations). I assume that in his case it the http was
directly writing to log files at the time of the hangs. Also, mine
typically was only a single process on each system in the D state,
only sometimes more.
Another thing I tried was to shut down httpd on one of the nodes to
see if the lags went away -- the result was unchanged, still had
similar lags at similar intervals. This time, the "ps" output on the
non-participating node was clean, but the other had multiple
processes, and not all web server related:
node wolf:
PID STAT COMMAND WIDE-WCHAN-COLUMN
9436 D cronolog -
9438 D cronolog -
9439 D cronolog -
9448 D cronolog -
9457 D cronolog -
31201 D httpd -
31202 D httpd -
31710 D vsftpd sync_buffer
So apparently this problem can happen between processes of the same
node, not just between nodes.
Watching a "top" command on each box during the lags only shows CPU
usage on iowait, no significant user/system CPU.
More information about the Ocfs2-users
mailing list