[Ocfs2-users] ocfs2 hangs during webserver usage
David Johle
djohle at industrialinfo.com
Mon Jan 26 16:07:16 PST 2009
Logfiles are rotated daily, and are pretty consistent in size (less
on weekends of course).
The 200GB volume is only 42% used, so there are definitely no free
space issues. I have all the logs rotating & aging, so they are
compressed and eventually removed over time. As a result, the worst
case directory has only 361 files, so not many at all.
Something I don't think I mentioned before is that the IOPS and
bytes/sec remain fairly low, especially compared to what I can
achieve in benchmarks on these systems (think 100s vs 1000s).
Also of note is the fact that this will still happen during off-peak
times when there is very little traffic on the site, and as a result
not a whole lot of logging going on. I will say that it is
definitely more frequent during times of higher activity, and that is
probably just because there is statistically a greater chance of contention.
I spent some time watching two simultaneous strace -T outputs, one on
each node, for the two cronologs that share a single file on the
ocfs2 volume (I chose the busiest of all the logs). For almost the
entire time they had no problems with all kinds of interleaved
writes. Eventually I caught it in the act: one process in
"ocfs2_wait_for_status_completion" and the other in "-", as per the
WCHAN output. They both delayed on a write() at the same time, and
the strace looked similar to this:
read(0, "216.255.104.61 - - [26/Jan/2009:"..., 65536) = 258 <0.031799>
write(3, "216.255.104.61 - - [26/Jan/2009:"..., 258) = 258 <0.005505>
read(0, "216.255.104.61 - - [26/Jan/2009:"..., 65536) = 262 <0.017983>
write(3, "216.255.104.61 - - [26/Jan/2009:"..., 262) = 262 <0.000012>
read(0, "216.255.104.61 - - [26/Jan/2009:"..., 65536) = 263 <0.054034>
write(3, "216.255.104.61 - - [26/Jan/2009:"..., 263) = 263 <0.005155>
read(0, "66.166.185.247 - - [26/Jan/2009:"..., 65536) = 335 <0.003183>
write(3, "66.166.185.247 - - [26/Jan/2009:"..., 335) = 335 <16.875438>  <------- >16s!!!
read(0, "216.255.104.61 - - [26/Jan/2009:"..., 65536) = 13770 <0.000023>
write(3, "216.255.104.61 - - [26/Jan/2009:"..., 13770) = 13770 <0.015753>
read(0, "216.255.104.61 - - [26/Jan/2009:"..., 65536) = 303 <0.000006>
write(3, "216.255.104.61 - - [26/Jan/2009:"..., 303) = 303 <0.013217>
read(0, "65.215.68.222 - - [26/Jan/2009:1"..., 65536) = 162 <0.000008>
write(3, "65.215.68.222 - - [26/Jan/2009:1"..., 162) = 162 <0.011767>
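A quick way to find outliers like that in a saved capture is to filter
on the -T timing field at the end of each line (strace.log here is
just a hypothetical file name for the captured output):

```shell
# Print any write() from an "strace -T" capture that took longer
# than 1 second; the timing is the <...> field at the end of the line.
awk -F'[<>]' '/^write\(/ && $2 + 0 > 1.0 { print }' strace.log
```

This assumes the traced data itself contains no angle brackets, which
holds for these access-log lines.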
Here is the corresponding iostat output for this (cleaned up for
readability), at a 1s sample interval:
r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
16.00 40.00 61.50 156.50 7.79 0.03 0.57 0.39 2.20
20.00 17.00 80.00 68.00 8.00 0.21 5.81 0.65 2.40
9.90 0.99 37.13 0.50 6.91 0.24 21.73 8.64 9.41
0.00 7.00 0.00 28.00 8.00 0.03 4.00 0.57 0.40
1.00 1.00 1.50 0.50 2.00 0.00 0.50 0.50 0.10
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1.00 14.00 1.50 52.50 7.20 0.01 0.80 0.13 0.20
1.00 0.00 4.00 0.00 8.00 0.01 11.00 11.00 1.10
1.00 52.00 1.50 204.50 7.77 0.10 1.87 0.26 1.40
3.00 14.00 12.00 56.00 8.00 0.01 0.41 0.24 0.40
4.00 19.00 13.50 72.50 7.48 0.02 0.70 0.35 0.80
43.56 0.99 174.26 3.96 8.00 0.56 12.62 12.62 56.24  <-- Lag begins
91.00 1.00 361.50 0.50 7.87 0.99 10.62 10.72 98.60
89.00 0.00 356.00 0.00 8.00 0.98 11.11 11.02 98.10
90.00 1.00 357.50 0.50 7.87 0.98 10.74 10.75 97.80
90.00 1.00 360.00 4.00 8.00 1.00 10.67 10.78 98.10
77.00 1.00 305.50 0.50 7.85 0.99 12.97 12.64 98.60
94.00 0.00 376.00 0.00 8.00 0.99 10.61 10.51 98.80
88.00 16.00 349.50 60.50 7.88 1.02 9.78 9.39 97.70
94.00 0.00 376.00 0.00 8.00 1.08 11.52 10.40 97.80
92.00 1.00 365.50 0.50 7.87 0.97 10.43 10.47 97.40
62.38 0.00 249.50 0.00 8.00 0.98 15.67 15.68 97.82
52.00 1.00 205.50 0.50 7.77 0.99 18.77 18.70 99.10
88.00 10.00 352.00 40.00 8.00 1.02 10.37 10.09 98.90
91.00 1.00 361.50 0.50 7.87 0.98 10.74 10.66 98.10
87.00 0.00 348.00 0.00 8.00 0.98 11.28 11.32 98.50
93.00 1.00 369.50 0.50 7.87 0.98 10.49 10.48 98.50
94.00 0.00 376.00 0.00 8.00 0.98 10.40 10.40 97.80
160.00 523.00 637.50 2088.50 7.98 0.96 1.40 0.84 57.10  <-- Lag ends
242.00 205.00 968.00 820.00 8.00 4.35 9.72 0.36 15.90
22.00 95.00 85.50 376.50 7.90 0.11 0.92 0.53 6.20
5.94 18.81 23.76 75.25 8.00 0.02 0.72 0.28 0.69
1.00 1.00 1.50 0.50 2.00 0.00 0.50 0.50 0.10
9.00 39.00 36.00 156.00 8.00 0.05 1.12 0.35 1.70
33.00 107.00 129.50 424.50 7.91 0.11 0.76 0.42 5.90
36.00 111.00 144.00 444.00 8.00 0.08 0.56 0.41 6.00
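One way to spot those saturation windows without eyeballing the table
is to flag samples where %util (the last column above) crosses a
threshold; iostat.log is a hypothetical file holding the cleaned
columns:

```shell
# Flag any 1 s sample whose %util (9th column in the cleaned output)
# is above 90 -- these correspond to the "lag" window above.
awk 'NF >= 9 && $9 + 0 > 90 { print "sample " NR ": util=" $9 "%" }' iostat.log
```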
For comparison (not necessarily a good one, but it gets the point
across), here is the I/O it can achieve:
$ time ( /bin/dd if=/san/ocfs2vol/junk of=/dev/null bs=4k count=300000 ; sync )
300000+0 records in
300000+0 records out
1228800000 bytes (1.2 GB) copied, 11.4136 seconds, 108 MB/s
real 0m11.519s
user 0m0.037s
sys 0m1.890s
And the iostat during the dd run:
r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
16218.00 16.00 64872.00 64.00 8.00 1491.02 78.76 0.04 63.70
28676.00 1.00 114701.50 0.50 8.00 2670.45 96.35 0.03 99.10
26627.00 19.00 106508.00 76.00 8.00 2609.60 101.79 0.04 99.10
29208.00 24.00 116829.50 92.50 8.00 2709.55 90.97 0.03 99.30
24583.00 1.00 98332.00 4.00 8.00 2594.72 101.48 0.04 99.60
26625.00 18.00 106497.50 68.50 8.00 2577.44 100.00 0.04 99.00
28672.00 0.00 114688.00 0.00 8.00 2704.34 87.83 0.03 97.50
28389.11 0.99 113553.96 0.50 8.00 2376.68 91.66 0.03 97.52
21944.00 11.00 87776.00 44.00 8.00 3109.33 126.39 0.05 98.90
28842.00 48.00 115365.50 188.50 8.00 2475.97 94.92 0.03 100.10
22804.00 63.00 91216.00 252.00 8.00 2572.85 116.42 0.04 100.10
20941.00 21.00 83761.50 80.50 8.00 1940.30 93.02 0.04 87.60
Writing to a file on the ocfs2 disk is not quite as fast of course:
$ time ( /bin/dd if=/dev/zero of=/san/ocfs2vol/junk bs=4k count=300000 ; sync )
300000+0 records in
300000+0 records out
1228800000 bytes (1.2 GB) copied, 46.7036 seconds, 26.3 MB/s
real 0m49.566s
user 0m0.053s
sys 0m42.226s
The iostat for writing is harder to show because some of it happens
during the dd and some when sync is called, so it's not as
contiguous. Here's a sample from part of the sync phase, though:
r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
4.00 6225.00 13.50 24896.50 8.00 292.49 55.00 0.04 27.30
9.00 8325.00 36.00 33300.00 8.00 365.88 36.88 0.03 27.60
18.00 6206.00 69.50 24820.50 8.00 336.86 63.52 0.05 30.00
3.96 7192.08 15.84 28768.32 8.00 245.81 33.17 0.03 24.06
7.00 7297.00 25.50 29184.50 8.00 390.72 54.48 0.04 31.40
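As an aside, GNU dd can fold the flush time into its own measurement
via conv=fsync, which avoids the separate sync step (a much smaller
run with a placeholder path, just to illustrate the flag):

```shell
# conv=fsync makes dd fsync() the output file before reporting its
# rate, so the flush cost is included in the timing it prints.
dd if=/dev/zero of=/tmp/ddtest.bin bs=4k count=1000 conv=fsync
rm -f /tmp/ddtest.bin
```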
I also have an ext3 filesystem on one of the servers, also on the
same iSCSI SAN, so I have included some similar time trials to
contrast ext3 with ocfs2 on the same system:
Reading:
$ time ( /bin/dd if=/san/ext3vol/junk of=/dev/null bs=4k count=300000 ; sync )
300000+0 records in
300000+0 records out
1228800000 bytes (1.2 GB) copied, 9.44083 seconds, 130 MB/s
real 0m9.452s
user 0m0.037s
sys 0m1.800s
Writing:
$ time ( /bin/dd if=/dev/zero of=/san/ext3vol/junk bs=4k count=300000 ; sync )
300000+0 records in
300000+0 records out
1228800000 bytes (1.2 GB) copied, 4.32389 seconds, 284 MB/s
real 0m9.766s
user 0m0.045s
sys 0m3.335s
So that's about 125MB/s average write factoring in the sync time.
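That figure follows directly from the byte count and the wall-clock
time (MB here meaning 10^6 bytes, as dd reports it):

```shell
# 1228800000 bytes over the 9.766 s "real" time, sync included:
awk 'BEGIN { printf "%.1f MB/s\n", 1228800000 / 9.766 / 1e6 }'
# prints 125.8 MB/s
```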
At 04:17 PM 1/26/2009, Sunil Mushran wrote:
>Well, something must have changed. Is the logfile growing in size?
>Is the volume close to full? Are there a lot of files in a directory?
>
>If the ioutil is high, the io subsystem is being saturated. That's a
>given. Question is, who is the culprit. If the webserver load has not
>changed, then the fs itself must be contributing to the load.
>
>strace -T one of the processes. That should narrow down the scope
>of the problem.
>
>David Johle wrote:
>>Well it does pretty much make the system (or at least anything
>>doing I/O to the volume) unresponsive, but it does recover after
>>10-15 seconds typically. I guess that is considered a "temporary
>>slowdown" rather than a hang?
>>
>>Yes, the log files are being written to the OCFS2 volume, and are
>>actually being written to by both nodes in parallel. I did much
>>testing on this before going into production and never saw any
>>problems or slowdowns, even on much less powerful systems. And, as
>>I mentioned, there were no problems on these systems for over a
>>month in production (same load all along).
>>
>>I do wonder if the 1.4 release would be any better for my
>>situation, and would like to put it on my test environment first of course.
>>However, I do have an issue in that I am making use of the CDSL
>>feature that was removed after 1.2.x, and thus I will have to
>>figure some way to accomplish the desired configuration w/o them
>>before I can upgrade.
>>
>>The problem is continuing, and getting really annoying as it's
>>tripping up our monitoring system like crazy. Is there anything
>>else I try doing to get more details about what is going on to help
>>find a solution? Any parameters that could be tweaked to account
>>for the fact that there is a steady stream of small writes from all
>>nodes to this volume?
>>[snip]