[Ocfs2-users] Extremely poor write performance, but read appears to be okay

Wed Dec 8 17:07:19 PST 2010

Hello,

I'm writing from the other side of the world from where my systems are,
so details are coming in slowly. We have a 6TB OCFS2 volume shared
across 20 or so nodes, all running OEL 5.4 with ocfs2-1.4.4. The system
has worked fairly well for the last 6-8 months, but something has
happened over the last few weeks that has driven write performance
nearly to a halt. I'm not sure how to proceed, and a very poor internet
connection is hindering me further. I've verified that the disk array
is in good health. I'm seeing a few odd kernel log messages; an example
follows below. I have not been able to check all nodes due to limited
time and the slow connection at my present location. Any assistance
would be greatly appreciated. I should be able to provide log files in
about 12 hours. At the moment, load averages on each node are 0.00 to
0.09.

Here is a test write and associated iostat -xm 5 output. Previously I
was obtaining > 90MB/s:

$ dd if=/dev/zero of=/home/testdump count=1000 bs=1024k

...and associated iostat output:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.10    0.00    0.43   12.25    0.00   87.22

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     1.80  0.00  8.40     0.00     0.04     9.71     0.01    0.64   0.05   0.04
sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3              0.00     1.80  0.00  8.40     0.00     0.04     9.71     0.01    0.64   0.05   0.04
sdc               0.00     0.00 115.80  0.60     0.46     0.00     8.04     0.99    8.48   8.47  98.54
sdc1              0.00     0.00 115.80  0.60     0.46     0.00     8.04     0.99    8.48   8.47  98.54

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.07    0.00    0.55   12.25    0.00   87.13

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.40  0.00  0.80     0.00     0.00    12.00     0.00    2.00   1.25   0.10
sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3              0.00     0.40  0.00  0.80     0.00     0.00    12.00     0.00    2.00   1.25   0.10
sdc               0.00     0.00 112.80  0.40     0.44     0.00     8.03     0.98    8.68   8.69  98.38
sdc1              0.00     0.00 112.80  0.40     0.44     0.00     8.03     0.98    8.68   8.69  98.38
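
For nodes where dd's own throughput summary is not captured, the write test above can be approximated with a short script. This is only a sketch: the function name `timed_write` and its parameters are made up for illustration, and unlike the plain dd invocation it calls fsync before stopping the clock, so the figure includes the time for data to reach the device.

```python
import os
import time


def timed_write(path, count=1000, block_size=1024 * 1024):
    """Write `count` zero-filled blocks to `path`; return throughput in MB/s.

    Mirrors `dd if=/dev/zero of=path count=1000 bs=1024k`, except that the
    file is fsync'ed before timing stops (an assumption of this sketch).
    """
    block = bytes(block_size)  # one block of zeros
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())
    # Guard against a zero interval on very small writes.
    elapsed = max(time.monotonic() - start, 1e-9)
    return count * block_size / (1024 * 1024) / elapsed
```

Calling `timed_write("/home/testdump")` would repeat the 1000 x 1 MiB test; a smaller `count` keeps the run short on a volume that is already struggling.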

Here is a test read and associated iostat output. I'm intentionally
reading from a different test file to avoid caching effects:

$ dd if=/home/someothertestdump of=/dev/null bs=1024k

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.10    0.00    3.60   10.85    0.00   85.45

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     3.79  0.00  1.40     0.00     0.02    29.71     0.00    1.29   0.43   0.06
sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3              0.00     3.79  0.00  1.40     0.00     0.02    29.71     0.00    1.29   0.43   0.06
sdc               7.98     0.20 813.17  1.00   102.50     0.00   257.84     1.92    2.34   1.19  96.71
sdc1              7.98     0.20 813.17  1.00   102.50     0.00   257.84     1.92    2.34   1.19  96.67

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.07    0.00    3.67   10.22    0.00   86.03

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.20  0.00  0.40     0.00     0.00    12.00     0.00    0.50   0.50   0.02
sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3              0.00     0.20  0.00  0.40     0.00     0.00    12.00     0.00    0.50   0.50   0.02
sdc               6.60     0.20 829.00  1.00   104.28     0.00   257.32     1.90    2.31   1.17  97.28
sdc1              6.60     0.20 829.00  1.00   104.28     0.00   257.32     1.90    2.31   1.17  97.28
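
As a rough cross-check of the sdc figures in the two tests, the request rate multiplied by the average request size should reproduce the reported throughput. The sketch below assumes that avgrq-sz is given in 512-byte sectors (as in the sysstat version that prints this column); the helper names are made up for illustration.

```python
SECTOR_BYTES = 512  # assumption: avgrq-sz is reported in 512-byte sectors


def request_kib(avgrq_sz):
    """Average request size in KiB from iostat's avgrq-sz column."""
    return avgrq_sz * SECTOR_BYTES / 1024


def implied_mb_s(r_s, w_s, avgrq_sz):
    """Throughput implied by request rate times average request size."""
    return (r_s + w_s) * avgrq_sz * SECTOR_BYTES / (1024 * 1024)


# sdc rows taken from the first iostat sample of each test
write_test = {"r_s": 115.80, "w_s": 0.60, "avgrq_sz": 8.04}
read_test = {"r_s": 813.17, "w_s": 1.00, "avgrq_sz": 257.84}

for name, row in (("write test", write_test), ("read test", read_test)):
    print(name,
          round(request_kib(row["avgrq_sz"]), 2), "KiB/request,",
          round(implied_mb_s(**row), 2), "MB/s")
```

With these inputs the write test works out to about 4 KiB per request and 0.46 MB/s, nearly all of it reads, while the read test works out to about 129 KiB per request and 102.5 MB/s. Both match the rMB/s columns, so during the dd write sdc is close to 100% utilised by small reads rather than by writes.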

I'm seeing a few weird kernel messages, such as:

Dec  7 14:07:50 growler kernel: (dlm_wq,4793,4):dlm_deref_lockres_worker:2344 ERROR: 84B7C6421A6C4280AB87F569035C5368:O0000000000000016296ce900000000: node 14 trying to drop ref but it is already dropped!
Dec  7 14:07:50 growler kernel: lockres: O0000000000000016296ce900000000, owner=0, state=0
Dec  7 14:07:50 growler kernel:   last used: 0, refcnt: 6, on purge list: no
Dec  7 14:07:50 growler kernel:   on dirty list: no, on reco list: no, migrating pending: no
Dec  7 14:07:50 growler kernel:   inflight locks: 0, asts reserved: 0
Dec  7 14:07:50 growler kernel:   refmap nodes: [ 21 ], inflight=0
Dec  7 14:07:50 growler kernel:   granted queue:
Dec  7 14:07:50 growler kernel:     type=3, conv=-1, node=21, cookie=21:213370, ref=2, ast=(empty=y,pend=n), bast=(empty=y,pend=n), pending=(conv=n,lock=n,cancel=n,unlock=n)
Dec  7 14:07:50 growler kernel:   converting queue:
Dec  7 14:07:50 growler kernel:   blocked queue:
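
Since not all nodes have been checked yet, a small script can tally how often this message appears per node and lock resource once the log files are available. This is a sketch: it assumes each message sits on a single unwrapped log line in the format shown above, and the group names (`domain`, `lockres`, `node`) and the function `count_deref_errors` are labels chosen here, not anything defined by OCFS2.

```python
import re
from collections import Counter

# Matches the "already dropped" message in the format shown in the log above.
DEREF_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\skernel:\s*"
    r"\(dlm_wq,\d+,\d+\):dlm_deref_lockres_worker:\d+\sERROR:\s*"
    r"(?P<domain>[0-9A-Fa-f]+):(?P<lockres>\S+):\s*node\s+(?P<node>\d+)\s"
    r"trying to drop ref but it is already dropped!"
)


def count_deref_errors(lines):
    """Count matching messages, keyed by (reporting host, node, lockres)."""
    counts = Counter()
    for line in lines:
        m = DEREF_RE.match(line)
        if m:
            counts[(m["host"], int(m["node"]), m["lockres"])] += 1
    return counts


SAMPLE = (
    "Dec  7 14:07:50 growler kernel: "
    "(dlm_wq,4793,4):dlm_deref_lockres_worker:2344 ERROR: "
    "84B7C6421A6C4280AB87F569035C5368:O0000000000000016296ce900000000: "
    "node 14 trying to drop ref but it is already dropped!"
)

print(count_deref_errors([SAMPLE]))
```

Feeding it the lines of each node's kernel log would show whether the errors cluster on one node or one lock resource.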

Here is df output:

root@growler:~$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda3            245695888  29469416 203544360  13% /
/dev/sda1               101086     15133     80734  16% /boot
tmpfs                 33005580         0  33005580   0% /dev/shm
/dev/sdc1            5857428444 5234400436 623028008  90% /home
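
For reference, the Use% column can be reproduced from the Used and Available columns, which also gives the remaining space on /home in more readable units. A minimal sketch, assuming df rounds the percentage up (as GNU df does); the function names are made up for illustration.

```python
import math


def use_percent(used_kb, avail_kb):
    """Use% as df reports it: used / (used + available), rounded up."""
    total = used_kb + avail_kb
    return math.ceil(100 * used_kb / total) if total else 0


def kb_to_gib(kb):
    """Convert 1K-blocks to GiB."""
    return kb / (1024 * 1024)


# /dev/sdc1 (/home) from the df output above
home_used, home_avail = 5234400436, 623028008
print(use_percent(home_used, home_avail), "% used,",
      round(kb_to_gib(home_avail), 1), "GiB free")
```

For /home this gives 90% used (89.4% before rounding) with roughly 594 GiB still available.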

Thanks,
-Daniel