[Ocfs2-users] ocfs2 hangs during webserver usage

Sunil Mushran sunil.mushran at oracle.com
Mon Jan 26 17:04:22 PST 2009


The high util coincides with higher read activity. See rkB/s. While the
numbers are small, they could still be driving up the util because the
IOs are scattered all over the disk. If that is the case, we need to
identify the process.

Do you have blktrace installed? If not, install it.

Then run,
$ btrace -a read /dev/sdX1

This will give you the view from the block layer. Now when you hit that
slowdown window, the blktrace output will show you the process doing the
reads. The 5th column is the pid. See man pages for blktrace, btrace and
blkparse for more.
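
If you want a quick per-process tally rather than eyeballing the raw
stream, something along these lines should also work (just a sketch; the
awk field positions assume blkparse's default output format, and
"slowdown" is an arbitrary output name):

$ blktrace -a read -d /dev/sdX1 -o slowdown
  (let it run through the slow window, then stop it with Ctrl-C)
$ blkparse -i slowdown | awk '$6 == "Q" { print $5, $NF }' | sort | uniq -c | sort -rn | head

That prints, for the capture period, how many read requests each
pid/process queued, busiest at the top.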

BTW, the fact that one node shows the writing process waiting at
"ocfs2_wait_for_status_completion" is to be expected. It is waiting for
the other node to release the lock, and that node in turn is waiting on
the io to complete before it can release the lock.
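
If you want to watch that interplay live on both nodes, something like
this should do it (again just a sketch; the wchan column width is
arbitrary and the [c]ronolog pattern is only there to stop grep from
matching itself):

$ watch -n1 'ps axo pid,stat,wchan:40,cmd | grep [c]ronolog'

When the stall hits, the writer on one node should show
ocfs2_wait_for_status_completion in the WCHAN column, matching what you
saw with strace.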

David Johle wrote:
> Logfiles are rotated daily, and are pretty consistent in size (less on 
> weekends of course).
>
> The 200GB volume is only 42% used, so definitely no free space 
> issues.  I have all the logs rotating & aging, so they are compressed 
> and eventually removed over time.  That leaves the worst-case directory 
> with only 361 files, so not many at all.
>
> Something I don't think I mentioned before is that the IOPS and 
> bytes/sec remain fairly low, especially compared to what I can 
> achieve in benchmarks on these systems (think 100s vs 1000s).
>
> Also of note is the fact that this will still happen during off-peak 
> times when there is very little traffic on the site, and as a result 
> not a whole lot of logging going on.  I will say that it is definitely 
> more frequent during times of higher activity, and that is probably 
> just because there is statistically a greater chance of contention.
>
> I spent some time watching two simultaneous strace -T outputs, one on 
> each node for the two cronologs that share a single file on the 
> ocfs2 volume (I chose the busiest of all the logs).  Almost the entire 
> time they had no problems with all kinds of interleaved writes.  
> Eventually I caught it in the act: one showed 
> "ocfs2_wait_for_status_completion" and the other "-" in the WCHAN 
> output.  They both stalled on a write() at the same time, and the 
> strace output looked similar to this:
>
> read(0, "216.255.104.61 - - [26/Jan/2009:"..., 65536) = 258 <0.031799>
> write(3, "216.255.104.61 - - [26/Jan/2009:"..., 258) = 258 <0.005505>
> read(0, "216.255.104.61 - - [26/Jan/2009:"..., 65536) = 262 <0.017983>
> write(3, "216.255.104.61 - - [26/Jan/2009:"..., 262) = 262 <0.000012>
> read(0, "216.255.104.61 - - [26/Jan/2009:"..., 65536) = 263 <0.054034>
> write(3, "216.255.104.61 - - [26/Jan/2009:"..., 263) = 263 <0.005155>
> read(0, "66.166.185.247 - - [26/Jan/2009:"..., 65536) = 335 <0.003183>
> write(3, "66.166.185.247 - - [26/Jan/2009:"..., 335) = 335 <16.875438>    <------- >16s!!!
> read(0, "216.255.104.61 - - [26/Jan/2009:"..., 65536) = 13770 <0.000023>
> write(3, "216.255.104.61 - - [26/Jan/2009:"..., 13770) = 13770 <0.015753>
> read(0, "216.255.104.61 - - [26/Jan/2009:"..., 65536) = 303 <0.000006>
> write(3, "216.255.104.61 - - [26/Jan/2009:"..., 303) = 303 <0.013217>
> read(0, "65.215.68.222 - - [26/Jan/2009:1"..., 65536) = 162 <0.000008>
> write(3, "65.215.68.222 - - [26/Jan/2009:1"..., 162) = 162 <0.011767>
>
>
> Here is the corresponding iostat output (cleaned up for readability), 
> with a 1s sample interval:
>
>    r/s      w/s   rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
>  16.00    40.00   61.50   156.50     7.79     0.03    0.57   0.39   2.20
>  20.00    17.00   80.00    68.00     8.00     0.21    5.81   0.65   2.40
>   9.90     0.99   37.13     0.50     6.91     0.24   21.73   8.64   9.41
>   0.00     7.00    0.00    28.00     8.00     0.03    4.00   0.57   0.40
>   1.00     1.00    1.50     0.50     2.00     0.00    0.50   0.50   0.10
>   0.00     0.00    0.00     0.00     0.00     0.00    0.00   0.00   0.00
>   1.00    14.00    1.50    52.50     7.20     0.01    0.80   0.13   0.20
>   1.00     0.00    4.00     0.00     8.00     0.01   11.00  11.00   1.10
>   1.00    52.00    1.50   204.50     7.77     0.10    1.87   0.26   1.40
>   3.00    14.00   12.00    56.00     8.00     0.01    0.41   0.24   0.40
>   4.00    19.00   13.50    72.50     7.48     0.02    0.70   0.35   0.80
>  43.56     0.99  174.26     3.96     8.00     0.56   12.62  12.62  56.24      <-- Lag begins
>  91.00     1.00  361.50     0.50     7.87     0.99   10.62  10.72  98.60
>  89.00     0.00  356.00     0.00     8.00     0.98   11.11  11.02  98.10
>  90.00     1.00  357.50     0.50     7.87     0.98   10.74  10.75  97.80
>  90.00     1.00  360.00     4.00     8.00     1.00   10.67  10.78  98.10
>  77.00     1.00  305.50     0.50     7.85     0.99   12.97  12.64  98.60
>  94.00     0.00  376.00     0.00     8.00     0.99   10.61  10.51  98.80
>  88.00    16.00  349.50    60.50     7.88     1.02    9.78   9.39  97.70
>  94.00     0.00  376.00     0.00     8.00     1.08   11.52  10.40  97.80
>  92.00     1.00  365.50     0.50     7.87     0.97   10.43  10.47  97.40
>  62.38     0.00  249.50     0.00     8.00     0.98   15.67  15.68  97.82
>  52.00     1.00  205.50     0.50     7.77     0.99   18.77  18.70  99.10
>  88.00    10.00  352.00    40.00     8.00     1.02   10.37  10.09  98.90
>  91.00     1.00  361.50     0.50     7.87     0.98   10.74  10.66  98.10
>  87.00     0.00  348.00     0.00     8.00     0.98   11.28  11.32  98.50
>  93.00     1.00  369.50     0.50     7.87     0.98   10.49  10.48  98.50
>  94.00     0.00  376.00     0.00     8.00     0.98   10.40  10.40  97.80
> 160.00   523.00  637.50  2088.50     7.98     0.96    1.40   0.84  57.10      <-- Lag ends
> 242.00   205.00  968.00   820.00     8.00     4.35    9.72   0.36  15.90
>  22.00    95.00   85.50   376.50     7.90     0.11    0.92   0.53   6.20
>   5.94    18.81   23.76    75.25     8.00     0.02    0.72   0.28   0.69
>   1.00     1.00    1.50     0.50     2.00     0.00    0.50   0.50   0.10
>   9.00    39.00   36.00   156.00     8.00     0.05    1.12   0.35   1.70
>  33.00   107.00  129.50   424.50     7.91     0.11    0.76   0.42   5.90
>  36.00   111.00  144.00   444.00     8.00     0.08    0.56   0.41   6.00
>
>
> For comparison (not necessarily a fair one, but it gets the point 
> across), here is the read I/O this volume can achieve:
>
> $ time ( /bin/dd if=/san/ocfs2vol/junk of=/dev/null bs=4k count=300000 ; sync )
> 300000+0 records in
> 300000+0 records out
> 1228800000 bytes (1.2 GB) copied, 11.4136 seconds, 108 MB/s
>
> real    0m11.519s
> user    0m0.037s
> sys     0m1.890s
>
>
> And the iostat during the dd run:
>
>      r/s   w/s     rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> 16218.00 16.00  64872.00    64.00     8.00  1491.02   78.76   0.04  63.70
> 28676.00  1.00 114701.50     0.50     8.00  2670.45   96.35   0.03  99.10
> 26627.00 19.00 106508.00    76.00     8.00  2609.60  101.79   0.04  99.10
> 29208.00 24.00 116829.50    92.50     8.00  2709.55   90.97   0.03  99.30
> 24583.00  1.00  98332.00     4.00     8.00  2594.72  101.48   0.04  99.60
> 26625.00 18.00 106497.50    68.50     8.00  2577.44  100.00   0.04  99.00
> 28672.00  0.00 114688.00     0.00     8.00  2704.34   87.83   0.03  97.50
> 28389.11  0.99 113553.96     0.50     8.00  2376.68   91.66   0.03  97.52
> 21944.00 11.00  87776.00    44.00     8.00  3109.33  126.39   0.05  98.90
> 28842.00 48.00 115365.50   188.50     8.00  2475.97   94.92   0.03 100.10
> 22804.00 63.00  91216.00   252.00     8.00  2572.85  116.42   0.04 100.10
> 20941.00 21.00  83761.50    80.50     8.00  1940.30   93.02   0.04  87.60
>
>
> Writing to a file on the ocfs2 volume is not quite as fast, of course:
>
> $ time ( /bin/dd if=/dev/zero of=/san/ocfs2vol/junk bs=4k count=300000 ; sync )
> 300000+0 records in
> 300000+0 records out
> 1228800000 bytes (1.2 GB) copied, 46.7036 seconds, 26.3 MB/s
>
> real    0m49.566s
> user    0m0.053s
> sys     0m42.226s
>
>
> The iostat for writing is harder to show because some of it happens 
> during the dd and some when sync is called, so it's not as contiguous.  
> Here's a sample from part of the sync phase, though:
>
>   r/s      w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
>  4.00  6225.00    13.50 24896.50     8.00   292.49   55.00   0.04  27.30
>  9.00  8325.00    36.00 33300.00     8.00   365.88   36.88   0.03  27.60
> 18.00  6206.00    69.50 24820.50     8.00   336.86   63.52   0.05  30.00
>  3.96  7192.08    15.84 28768.32     8.00   245.81   33.17   0.03  24.06
>  7.00  7297.00    25.50 29184.50     8.00   390.72   54.48   0.04  31.40
>
>
>
> One of the servers also has an ext3 filesystem on the same iSCSI SAN, 
> so I have included similar timing trials to contrast ext3 with ocfs2 
> on the same system:
>
> Reading:
>
> $ time ( /bin/dd if=/san/ext3vol/junk of=/dev/null bs=4k count=300000 ; sync )
> 300000+0 records in
> 300000+0 records out
> 1228800000 bytes (1.2 GB) copied, 9.44083 seconds, 130 MB/s
>
> real    0m9.452s
> user    0m0.037s
> sys     0m1.800s
>
>
> Writing:
>
> $ time ( /bin/dd if=/dev/zero of=/san/ext3vol/junk bs=4k count=300000 ; sync )
> 300000+0 records in
> 300000+0 records out
> 1228800000 bytes (1.2 GB) copied, 4.32389 seconds, 284 MB/s
>
> real    0m9.766s
> user    0m0.045s
> sys     0m3.335s
>
> So that's about 125 MB/s average write throughput, factoring in the sync time.
>
>
>
>
>
> At 04:17 PM 1/26/2009, Sunil Mushran wrote:
>> Well, something must have changed. Is the logfile growing in size?
>> Is the volume close to full? Are there a lot of files in a directory?
>>
>> If the io util is high, the io subsystem is being saturated. That's a
>> given. The question is: who is the culprit? If the webserver load has not
>> changed, then the fs itself must be contributing to the load.
>>
>> strace -T one of the processes. That should narrow down the scope
>> of the problem.
>>
>> David Johle wrote:
>>> Well, it does pretty much make the system (or at least anything doing 
>>> I/O to the volume) unresponsive, but it typically recovers after 10-15 
>>> seconds.  I guess that is considered a "temporary slowdown" rather 
>>> than a hang?
>>>
>>> Yes, the log files are being written to the OCFS2 volume, and are 
>>> actually being written to by both nodes in parallel.  I did much 
>>> testing on this before going into production and never saw any 
>>> problems or slowdowns, even on much less powerful systems.  And, as 
>>> I mentioned, there were no problems on these systems for over a 
>>> month in production (same load all along).
>>>
>>> I do wonder if the 1.4 release would be any better for my situation, 
>>> and I would like to put it on my test environment first, of course.
>>> However, I do have an issue in that I am making use of the CDSL 
>>> feature that was removed after 1.2.x, so I will have to figure out 
>>> some way to accomplish the desired configuration without it before I 
>>> can upgrade.
>>>
>>> The problem is continuing, and getting really annoying as it's 
>>> tripping up our monitoring system like crazy.  Is there anything 
>>> else I can try to get more details about what is going on and help 
>>> find a solution?  Any parameters that could be tweaked to account 
>>> for the fact that there is a steady stream of small writes from all 
>>> nodes to this volume?
>>> [snip]
>



