[Ocfs2-users] ocfs2 hangs during webserver usage

David Johle djohle at industrialinfo.com
Mon Jan 26 16:07:16 PST 2009


Logfiles are rotated daily and are pretty consistent in size 
(smaller on weekends, of course).

The 200GB volume is only 42% used, so there are definitely no free 
space issues.  All the logs are rotated and aged, so they are 
compressed and eventually removed over time.  As a result the worst 
case is a directory holding only 361 files, so not many at all.

Something I don't think I mentioned before: the IOPS and bytes/sec 
remain fairly low, especially compared to what I can achieve in 
benchmarks on these systems (think hundreds vs. thousands).

Also of note: this still happens during off-peak times when there is 
very little traffic on the site, and consequently not a whole lot of 
logging going on.  It is definitely more frequent during times of 
higher activity, but that is probably just because there is 
statistically a greater chance of contention.

I spent some time watching two simultaneous strace -T outputs, one on 
each node, for the two cronolog processes that share a single file on 
the ocfs2 volume (I chose the busiest of all the logs).  Almost the 
entire time they had no problems with all kinds of interleaved 
writes.  Eventually I caught it in the act: one process showed 
"ocfs2_wait_for_status_completion" and the other "-" in the WCHAN 
output.  They both stalled on a write() at the same time, and the 
strace looked similar to this:

read(0, "216.255.104.61 - - [26/Jan/2009:"..., 65536) = 258 <0.031799>
write(3, "216.255.104.61 - - [26/Jan/2009:"..., 258) = 258 <0.005505>
read(0, "216.255.104.61 - - [26/Jan/2009:"..., 65536) = 262 <0.017983>
write(3, "216.255.104.61 - - [26/Jan/2009:"..., 262) = 262 <0.000012>
read(0, "216.255.104.61 - - [26/Jan/2009:"..., 65536) = 263 <0.054034>
write(3, "216.255.104.61 - - [26/Jan/2009:"..., 263) = 263 <0.005155>
read(0, "66.166.185.247 - - [26/Jan/2009:"..., 65536) = 335 <0.003183>
write(3, "66.166.185.247 - - [26/Jan/2009:"..., 335) = 335 
<16.875438>         <------- >16s!!!
read(0, "216.255.104.61 - - [26/Jan/2009:"..., 65536) = 13770 <0.000023>
write(3, "216.255.104.61 - - [26/Jan/2009:"..., 13770) = 13770 <0.015753>
read(0, "216.255.104.61 - - [26/Jan/2009:"..., 65536) = 303 <0.000006>
write(3, "216.255.104.61 - - [26/Jan/2009:"..., 303) = 303 <0.013217>
read(0, "65.215.68.222 - - [26/Jan/2009:1"..., 65536) = 162 <0.000008>
write(3, "65.215.68.222 - - [26/Jan/2009:1"..., 162) = 162 <0.011767>


Here is corresponding iostat output for this (cleaned for 
readability), 1s sample interval:

    r/s      w/s   rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
  16.00    40.00   61.50   156.50     7.79     0.03    0.57   0.39   2.20
  20.00    17.00   80.00    68.00     8.00     0.21    5.81   0.65   2.40
   9.90     0.99   37.13     0.50     6.91     0.24   21.73   8.64   9.41
   0.00     7.00    0.00    28.00     8.00     0.03    4.00   0.57   0.40
   1.00     1.00    1.50     0.50     2.00     0.00    0.50   0.50   0.10
   0.00     0.00    0.00     0.00     0.00     0.00    0.00   0.00   0.00
   1.00    14.00    1.50    52.50     7.20     0.01    0.80   0.13   0.20
   1.00     0.00    4.00     0.00     8.00     0.01   11.00  11.00   1.10
   1.00    52.00    1.50   204.50     7.77     0.10    1.87   0.26   1.40
   3.00    14.00   12.00    56.00     8.00     0.01    0.41   0.24   0.40
   4.00    19.00   13.50    72.50     7.48     0.02    0.70   0.35   0.80
  43.56     0.99  174.26     3.96     8.00     0.56   12.62  12.62  56.24      <-- Lag begins
  91.00     1.00  361.50     0.50     7.87     0.99   10.62  10.72  98.60
  89.00     0.00  356.00     0.00     8.00     0.98   11.11  11.02  98.10
  90.00     1.00  357.50     0.50     7.87     0.98   10.74  10.75  97.80
  90.00     1.00  360.00     4.00     8.00     1.00   10.67  10.78  98.10
  77.00     1.00  305.50     0.50     7.85     0.99   12.97  12.64  98.60
  94.00     0.00  376.00     0.00     8.00     0.99   10.61  10.51  98.80
  88.00    16.00  349.50    60.50     7.88     1.02    9.78   9.39  97.70
  94.00     0.00  376.00     0.00     8.00     1.08   11.52  10.40  97.80
  92.00     1.00  365.50     0.50     7.87     0.97   10.43  10.47  97.40
  62.38     0.00  249.50     0.00     8.00     0.98   15.67  15.68  97.82
  52.00     1.00  205.50     0.50     7.77     0.99   18.77  18.70  99.10
  88.00    10.00  352.00    40.00     8.00     1.02   10.37  10.09  98.90
  91.00     1.00  361.50     0.50     7.87     0.98   10.74  10.66  98.10
  87.00     0.00  348.00     0.00     8.00     0.98   11.28  11.32  98.50
  93.00     1.00  369.50     0.50     7.87     0.98   10.49  10.48  98.50
  94.00     0.00  376.00     0.00     8.00     0.98   10.40  10.40  97.80
160.00   523.00  637.50  2088.50     7.98     0.96    1.40   0.84  57.10      <-- Lag ends
242.00   205.00  968.00   820.00     8.00     4.35    9.72   0.36  15.90
  22.00    95.00   85.50   376.50     7.90     0.11    0.92   0.53   6.20
   5.94    18.81   23.76    75.25     8.00     0.02    0.72   0.28   0.69
   1.00     1.00    1.50     0.50     2.00     0.00    0.50   0.50   0.10
   9.00    39.00   36.00   156.00     8.00     0.05    1.12   0.35   1.70
  33.00   107.00  129.50   424.50     7.91     0.11    0.76   0.42   5.90
  36.00   111.00  144.00   444.00     8.00     0.08    0.56   0.41   6.00
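
(The columns above are iostat's extended device report sampled at a 
1-second interval; something like the following, with /dev/sdX 
standing in for the SAN LUN backing the ocfs2 volume, produces them:)

$ iostat -x -k 1 /dev/sdX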


For comparison (not necessarily a fair one, but it gets the point 
across), here is the I/O the volume can achieve:

$ time ( /bin/dd if=/san/ocfs2vol/junk of=/dev/null bs=4k count=300000 ; sync )
300000+0 records in
300000+0 records out
1228800000 bytes (1.2 GB) copied, 11.4136 seconds, 108 MB/s

real    0m11.519s
user    0m0.037s
sys     0m1.890s
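
(If anyone repeats the read test, it's worth dropping the page cache 
first so dd actually hits the SAN rather than cached data; e.g., as 
root:)

# sync ; echo 3 > /proc/sys/vm/drop_caches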


And the iostat during the dd run:

      r/s   w/s     rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
16218.00 16.00  64872.00    64.00     8.00  1491.02   78.76   0.04  63.70
28676.00  1.00 114701.50     0.50     8.00  2670.45   96.35   0.03  99.10
26627.00 19.00 106508.00    76.00     8.00  2609.60  101.79   0.04  99.10
29208.00 24.00 116829.50    92.50     8.00  2709.55   90.97   0.03  99.30
24583.00  1.00  98332.00     4.00     8.00  2594.72  101.48   0.04  99.60
26625.00 18.00 106497.50    68.50     8.00  2577.44  100.00   0.04  99.00
28672.00  0.00 114688.00     0.00     8.00  2704.34   87.83   0.03  97.50
28389.11  0.99 113553.96     0.50     8.00  2376.68   91.66   0.03  97.52
21944.00 11.00  87776.00    44.00     8.00  3109.33  126.39   0.05  98.90
28842.00 48.00 115365.50   188.50     8.00  2475.97   94.92   0.03 100.10
22804.00 63.00  91216.00   252.00     8.00  2572.85  116.42   0.04 100.10
20941.00 21.00  83761.50    80.50     8.00  1940.30   93.02   0.04  87.60


Writing to a file on the ocfs2 disk is not quite as fast, of course:

$ time ( /bin/dd if=/dev/zero of=/san/ocfs2vol/junk bs=4k count=300000 ; sync )
300000+0 records in
300000+0 records out
1228800000 bytes (1.2 GB) copied, 46.7036 seconds, 26.3 MB/s

real    0m49.566s
user    0m0.053s
sys     0m42.226s


The iostat for writing is harder to show because some of the I/O 
happens during the dd and some when sync is called, so it's not as 
contiguous.  Here's a sample from part of the sync phase, though:

   r/s      w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
  4.00  6225.00    13.50 24896.50     8.00   292.49   55.00   0.04  27.30
  9.00  8325.00    36.00 33300.00     8.00   365.88   36.88   0.03  27.60
 18.00  6206.00    69.50 24820.50     8.00   336.86   63.52   0.05  30.00
  3.96  7192.08    15.84 28768.32     8.00   245.81   33.17   0.03  24.06
  7.00  7297.00    25.50 29184.50     8.00   390.72   54.48   0.04  31.40
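
(One way to avoid the split between the dd phase and the sync phase 
would be to have dd flush the file itself, so the whole transfer shows 
up as one contiguous stretch in iostat -- assuming the dd on these 
boxes supports conv=fsync; I haven't re-run the numbers that way:)

$ time /bin/dd if=/dev/zero of=/san/ocfs2vol/junk bs=4k count=300000 conv=fsync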



I also have an ext3 filesystem on one of the servers, on the same 
iSCSI SAN, so here are similar timing trials contrasting ext3 with 
ocfs2 on the same system:

Reading:

$ time ( /bin/dd if=/san/ext3vol/junk of=/dev/null bs=4k count=300000 ; sync )
300000+0 records in
300000+0 records out
1228800000 bytes (1.2 GB) copied, 9.44083 seconds, 130 MB/s

real    0m9.452s
user    0m0.037s
sys     0m1.800s


Writing:

$ time ( /bin/dd if=/dev/zero of=/san/ext3vol/junk bs=4k count=300000 ; sync )
300000+0 records in
300000+0 records out
1228800000 bytes (1.2 GB) copied, 4.32389 seconds, 284 MB/s

real    0m9.766s
user    0m0.045s
sys     0m3.335s

So that's about 125 MB/s average write once the sync time is factored 
in (1228800000 bytes / 9.766 s ≈ 125.8 MB/s), versus roughly 24.8 MB/s 
(1228800000 bytes / 49.566 s) for the same dd+sync on the ocfs2 volume.





At 04:17 PM 1/26/2009, Sunil Mushran wrote:
>Well, something must have changed. Is the logfile growing in size?
>Is the volume close to full? Are there a lot of files in a directory?
>
>If the ioutil is high, the io subsystem is being saturated. That's a
>given. Question is, who is the culprit. If the webserver load has not
>changed, then the fs itself must be contributing to the load.
>
>strace -T one of the processes. That should narrow down the scope
>of the problem.
>
>David Johle wrote:
>>Well it does pretty much make the system (or at least anything 
>>doing I/O to the volume) unresponsive, but it does recover after 
>>10-15 seconds typically.  I guess that is considered a "temporary 
>>slowdown" rather than a hang?
>>
>>Yes, the log files are being written to the OCFS2 volume, and are 
>>actually being written to by both nodes in parallel.  I did much 
>>testing on this before going into production and never saw any 
>>problems or slowdowns, even on much less powerful systems.  And, as 
>>I mentioned, there were no problems on these systems for over a 
>>month in production (same load all along).
>>
>>I do wonder if the 1.4 release would be any better for my 
>>situation, and would like to put it on my test environment first of course.
>>However, I do have an issue in that I am making use of the CDSL 
>>feature that was removed after 1.2.x, and thus I will have to 
>>figure some way to accomplish the desired configuration w/o them 
>>before I can upgrade.
>>
>>The problem is continuing, and getting really annoying as it's 
>>tripping up our monitoring system like crazy.  Is there anything 
>>else I can try to get more details about what is going on to help 
>>find a solution?  Any parameters that could be tweaked to account 
>>for the fact that there is a steady stream of small writes from all 
>>nodes to this volume?
>>[snip]



