[Ocfs2-users] ocfs2 hangs during webserver usage

Thu Jan 22 16:35:16 PST 2009

I have been running a 2 node cluster of web servers handling around 
800,000 - 1,200,000 requests on a typical weekday.  The systems are 
each dual quad core xeon w/4GB ram, and generally maintain a fairly 
low load average.  The servers are sharing a device via iSCSI SAN for 
the OCFS2 volume.  They are running RHEL5 U2 (2.6.18-92.1.10.el5) and 
OCFS2 1.2.9-1 binaries of the matching kernel version.

The cluster has been in a production environment for about 1.5 months 
now, and just in the past last week it has started to have 
problems.  The user experience is an occasional lag of 5 to 15 
seconds, after which everything appears normal.  Digging deeper into 
the problem I had narrowed it to an I/O issue, and iostat shows near 
100% utilization on said device during the lag.  Once it clears the 
utilization is back down to a consistent 0-5% average.  Also, when 
the lag is happening, a process listing shows the affected processes 
in the D state.

After some reading of the mailing list archives and much google 
searching, I found this old thread which seems quite similar:
http://oss.oracle.com/pipermail/ocfs2-users/2007-May/001643.html

And the corresponding bug from this same author:
http://oss.oracle.com/bugzilla/show_bug.cgi?id=882

I see it has been a long time since this bug was last updated, but 
there seems to have been no [documented] resolution.

I wrote a small script which hits a very simple web page at 10 second 
intervals.  If the server responds in less than 1 second it outputs a 
dot; if the page takes more than 1 second to load it outputs a line 
indicating how long it took, which node processed it, and a 
timestamp.  Here is some sample output over the course of about an hour:

................
Real time of 1 for wolf at Thu Jan 22 16:35:32 CST 2009
............................................................
Real time of 6 for cow at Thu Jan 22 16:45:52 CST 2009
.....
Real time of 7 for cow at Thu Jan 22 16:46:59 CST 2009
.................................................................................
Real time of 12 for wolf at Thu Jan 22 17:00:54 CST 2009
...................................
Real time of 3 for wolf at Thu Jan 22 17:06:58 CST 2009
....................
Real time of 13 for cow at Thu Jan 22 17:10:42 CST 2009
...
Real time of 11 for cow at Thu Jan 22 17:11:33 CST 2009
.......................
Real time of 9 for cow at Thu Jan 22 17:15:43 CST 2009
.......................................
Real time of 2 for wolf at Thu Jan 22 17:22:27 CST 2009
..
Real time of 8 for wolf at Thu Jan 22 17:23:05 CST 2009
.............................
Real time of 10 for wolf at Thu Jan 22 17:28:17 CST 2009
....................................
Real time of 14 for wolf at Thu Jan 22 17:34:42 CST 2009

There are no real patterns here, but the output is interesting 
nonetheless -- it does show that it is intermittent, but quite frequent.

I tried using the "ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN" 
command to see what processes were hanging on each system during the 
time of the lags.  Like Stephan's output, I see a few processes in 
the D (uninterruptable sleep) state, which makes sense if they are 
performing I/O.  And also like Stephan, one node shows the "-" 
output, and the other shows "ocfs2_wait_for_status_completion" as the function.

node cow:
   PID STAT COMMAND         WIDE-WCHAN-COLUMN
  8511 D    cronolog        ocfs2_wait_for_status_completion
  8510 D    cronolog        ocfs2_wait_for_status_completion

node wolf:
   PID STAT COMMAND         WIDE-WCHAN-COLUMN
  9439 D    cronolog        -

In my case, however, it was not the httpd process that was listed, 
but rather cronolog (which is what I use with apache for automatic 
daily log rotations).  I assume that in his case it the http was 
directly writing to log files at the time of the hangs.  Also, mine 
typically was only a single process on each system in the D state, 
only sometimes more.

Another thing I tried was to shut down httpd on one of the nodes to 
see if the lags went away -- the result was unchanged, still had 
similar lags at similar intervals.  This time, the "ps" output on the 
non-participating node was clean, but the other had multiple 
processes, and not all web server related:

node wolf:
   PID STAT COMMAND         WIDE-WCHAN-COLUMN
  9436 D    cronolog        -
  9438 D    cronolog        -
  9439 D    cronolog        -
  9448 D    cronolog        -
  9457 D    cronolog        -
31201 D    httpd           -
31202 D    httpd           -
31710 D    vsftpd          sync_buffer

So apparently this problem can happen between processes of the same 
node, not just between nodes.

Watching a "top" command on each box during the lags only shows CPU 
usage on iowait, no significant user/system CPU.