[Ocfs2-users] ocfs2 hangs during webserver usage

Sunil Mushran sunil.mushran at oracle.com
Fri Jan 23 14:28:04 PST 2009


The two issues are different. For starters, the issue in bugzilla#882
was a hang, not a temporary slowdown. And the stack showed it was related
to flock(), which ocfs2 1.2 did not support -- well, not cluster-aware flock().
Support for clustered flock() was added in kernel 2.6.25-ish / ocfs2 1.4.
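For reference, the syscall at issue is plain flock(2). A minimal sketch (hypothetical lock-file path, not taken from the bug) shows the call that, on ocfs2 1.4+ with clustered flock, becomes a cluster-wide lock rather than a node-local one:

```python
import fcntl
import os
import tempfile

def try_flock(path):
    """Attempt a non-blocking exclusive flock(2) on `path`; True on success.

    On ocfs2 1.2 this lock is only node-local; on ocfs2 1.4 / kernel
    ~2.6.25+ the same call goes through the DLM, so a holder on
    *another node* would also make it fail (or, without LOCK_NB, block).
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False
    finally:
        os.close(fd)  # closing the fd also releases the lock

if __name__ == "__main__":
    # Hypothetical path, used only for this demo.
    print(try_flock(os.path.join(tempfile.gettempdir(), "flock-demo.lock")))
```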

Are the apache logs also hitting the iscsi storage? If so, one explanation
could be that the log flush is saturating the network. That would cause
the I/O utilization to spike, affecting the httpd I/Os.
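One rough way to watch for such a spike is that the %util figure iostat reports is derived from the "milliseconds spent doing I/O" counter in /proc/diskstats, so it can be sampled directly. A hedged sketch (the device name "sda" is an assumption, not from the thread):

```python
import time

def io_ms(device, diskstats_text):
    """Extract ms-spent-doing-I/O (13th field) for `device`
    from the text of /proc/diskstats."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise KeyError(device)

def util_percent(device="sda", interval=1.0):
    """Approximate iostat's %util by sampling the counter twice."""
    with open("/proc/diskstats") as f:
        before = io_ms(device, f.read())
    time.sleep(interval)
    with open("/proc/diskstats") as f:
        after = io_ms(device, f.read())
    return 100.0 * (after - before) / (interval * 1000.0)
```

Logging that value alongside the lag timestamps would show whether the log flush and the stalls line up.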

Sunil

David Johle wrote:
> I have been running a 2-node cluster of web servers handling around 
> 800,000 - 1,200,000 requests on a typical weekday.  The systems are 
> each dual quad-core Xeons with 4GB RAM, and generally maintain a fairly 
> low load average.  The servers share a device via an iSCSI SAN for 
> the OCFS2 volume.  They are running RHEL5 U2 (2.6.18-92.1.10.el5) and 
> OCFS2 1.2.9-1 binaries built for the matching kernel version.
>
> The cluster has been in a production environment for about 1.5 months 
> now, and just in the past week it has started to have 
> problems.  The user experience is an occasional lag of 5 to 15 
> seconds, after which everything appears normal.  Digging deeper, I 
> narrowed it down to an I/O issue: iostat shows near 
> 100% utilization on said device during the lag.  Once it clears, 
> utilization drops back to a consistent 0-5% average.  Also, when 
> the lag is happening, a process listing shows the affected processes 
> in the D state.
>
> After some reading of the mailing list archives and much google 
> searching, I found this old thread which seems quite similar:
> http://oss.oracle.com/pipermail/ocfs2-users/2007-May/001643.html
>
> And the corresponding bug from this same author:
> http://oss.oracle.com/bugzilla/show_bug.cgi?id=882
>
> I see it has been a long time since this bug was last updated, but 
> there seems to have been no [documented] resolution.
>
> I wrote a small script which hits a very simple web page at 10 second 
> intervals.  If the server responds in less than 1 second it outputs a 
> dot; if the page takes more than 1 second to load it outputs a line 
> indicating how long it took, which node processed it, and a 
> timestamp.  Here is some sample output over the course of about an hour:
>
> ................
> Real time of 1 for wolf at Thu Jan 22 16:35:32 CST 2009
> ............................................................
> Real time of 6 for cow at Thu Jan 22 16:45:52 CST 2009
> .....
> Real time of 7 for cow at Thu Jan 22 16:46:59 CST 2009
> .................................................................................
> Real time of 12 for wolf at Thu Jan 22 17:00:54 CST 2009
> ...................................
> Real time of 3 for wolf at Thu Jan 22 17:06:58 CST 2009
> ....................
> Real time of 13 for cow at Thu Jan 22 17:10:42 CST 2009
> ...
> Real time of 11 for cow at Thu Jan 22 17:11:33 CST 2009
> .......................
> Real time of 9 for cow at Thu Jan 22 17:15:43 CST 2009
> .......................................
> Real time of 2 for wolf at Thu Jan 22 17:22:27 CST 2009
> ..
> Real time of 8 for wolf at Thu Jan 22 17:23:05 CST 2009
> .............................
> Real time of 10 for wolf at Thu Jan 22 17:28:17 CST 2009
> ....................................
> Real time of 14 for wolf at Thu Jan 22 17:34:42 CST 2009
>
>
> There are no real patterns here, but the output is interesting 
> nonetheless -- it does show that it is intermittent, but quite frequent.
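For anyone wanting to reproduce that probe, a minimal sketch of such a script (the original was not posted; the URL and node name here are placeholders) could look like:

```python
import time
import urllib.request

URL = "http://example.com/simple.html"  # placeholder for the "very simple web page"

def report(elapsed, node, stamp):
    """Return '.' for a sub-second response, else a 'Real time of ...' line."""
    if elapsed < 1:
        return "."
    return "\nReal time of %d for %s at %s" % (elapsed, node, stamp)

def probe(node="wolf", interval=10):
    """Hit URL every `interval` seconds, printing dots and lag lines forever."""
    while True:
        start = time.time()
        urllib.request.urlopen(URL).read()
        print(report(time.time() - start, node, time.ctime()),
              end="", flush=True)
        time.sleep(interval)
```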
>
> I tried using the "ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN" 
> command to see what processes were hanging on each system during the 
> time of the lags.  Like Stephan's output, I see a few processes in 
> the D (uninterruptible sleep) state, which makes sense if they are 
> performing I/O.  And also like Stephan, one node shows "-" as the 
> output, while the other shows "ocfs2_wait_for_status_completion" as the function.
>
> node cow:
>    PID STAT COMMAND         WIDE-WCHAN-COLUMN
>   8511 D    cronolog        ocfs2_wait_for_status_completion
>   8510 D    cronolog        ocfs2_wait_for_status_completion
>
> node wolf:
>    PID STAT COMMAND         WIDE-WCHAN-COLUMN
>   9439 D    cronolog        -
>
> In my case, however, it was not the httpd process that was listed, 
> but rather cronolog (which I use with apache for automatic 
> daily log rotation).  I assume that in his case httpd was 
> writing directly to the log files at the time of the hangs.  Also, I 
> typically saw only a single process on each system in the D state, 
> only sometimes more.
>
> Another thing I tried was shutting down httpd on one of the nodes to 
> see if the lags went away -- the result was unchanged: still 
> similar lags at similar intervals.  This time, the "ps" output on the 
> non-participating node was clean, but the other had multiple 
> processes in D, and not all of them web server related:
>
> node wolf:
>    PID STAT COMMAND         WIDE-WCHAN-COLUMN
>   9436 D    cronolog        -
>   9438 D    cronolog        -
>   9439 D    cronolog        -
>   9448 D    cronolog        -
>   9457 D    cronolog        -
> 31201 D    httpd           -
> 31202 D    httpd           -
> 31710 D    vsftpd          sync_buffer
>
> So apparently this problem can happen between processes of the same 
> node, not just between nodes.
>
> Watching "top" on each box during the lags shows CPU time only in 
> iowait, with no significant user/system CPU.
>
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>   



