[Ocfs-users] poor write performance or locking issues with ocfs2

Tue May 6 09:02:53 PDT 2014

Hello all,

I'm having serious trouble with my ocfs2 environment. The cluster filesystem worked fine for about 3-6 weeks after the initial setup, but performance issues have been occurring for the past week. I've searched Google and this mailing list at length but wasn't able to find a solution. I've found a lot of posts describing the "same" problem, but none with the magic answer  :-)

First, the environment:
- HP 3par SAN, 2 TB LUN (no SAN storage related performance problems - already checked)
- qlogic HBA (4 path), round robin with multipath
- kernel 3.2.0-4-amd64
- 5 cluster nodes
- ocfs2 version 1.6.4
- 480 million inodes in use, iUse% = 92
- OCFS was made with: "mkfs.ocfs2 -b 4k -C 4k -N 8 -L myocfs -T mail --fs-feature-level=max-features --fs-features=indexed-dirs /dev/mapper/myocfs"
- OCFS is mounted with: "_netdev,noatime,data=writeback,nouser_xattr"; I also tried "_netdev,noatime,data=writeback,nouser_xattr,commit=60,localalloc=16", which I found on this great list, but this hasn't solved the issues. I also tried without data= and commit=...
- Apache 2 webserver with PHP on 2 nodes, NGINX and FTP on the other nodes (nginx only reads data; FTP and PHP also write). I estimate that about 80% of the accesses are reads.
- The filesystem was extended online twice after the initial setup.
- sysctl.conf parameters are set (for the webserver):
--
net.ipv4.ip_nonlocal_bind=1
net.ipv4.tcp_fin_timeout=10
net.ipv4.ip_local_port_range=1024 65535
vm.swappiness=10
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
--

Now, the problem:
The cluster runs well, but several times a day the system load climbs from ~0-1 to 40, 500, even 2000! CPU usage is fine and there is free RAM, so no problems there.
"ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN | grep D" shows me some apache processes in state "D", but with nothing filled in under "WIDE-WCHAN-COLUMN". Here's an example output:
--
3176 D<   o2hb-6F81EC9057 -
 3392 D    jbd2/dm-1-41    -
 3393 D    ocfs2cmt        -
17221 D    apache2         -
18424 D    kworker/8:3     -
18453 D    apache2         -
...
--
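
A side note on the filter used above: a plain "grep D" also matches any line that merely contains a capital D somewhere (for example in the command name), so matching on the STAT column alone is a little more precise. The following is only a minimal sketch, assuming a procps-style ps and awk are available; the function name filter_dstate is made up for this example.

```shell
#!/bin/sh
# filter_dstate: read "PID STAT COMM WCHAN" lines on stdin and keep only
# those whose STAT field starts with "D" (uninterruptible sleep).
# The function name is hypothetical, chosen for this example.
filter_dstate() {
    awk '$2 ~ /^D/'
}

# Usage on a live system (the empty "=" headers suppress the header line):
#   ps -e -o pid= -o stat= -o comm= -o wchan= | filter_dstate
```

Fed the sample output above, this would keep lines such as the ocfs2cmt and apache2 entries and drop any process whose name just happens to contain a "D".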

Some output of /proc/pid/stack:
--
[<ffffffff81051d5f>] process_timeout+0x0/0x5
[<ffffffff810528be>] msleep_interruptible+0x1a/0x37
[<ffffffffa0311903>] o2hb_thread+0x17f/0x2df [ocfs2_nodemanager]
[<ffffffffa0311784>] o2hb_thread+0x0/0x2df [ocfs2_nodemanager]
[<ffffffff8105f681>] kthread+0x76/0x7e
[<ffffffff81356ef4>] kernel_thread_helper+0x4/0x10
[<ffffffff8105f60b>] kthread+0x0/0x7e
[<ffffffff81356ef0>] kernel_thread_helper+0x0/0x10
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffa015f007>] jbd2_journal_commit_transaction+0x1a6/0x10bf [jbd2]
[<ffffffff8100d02f>] load_TLS+0x7/0xa
[<ffffffff8100d69e>] __switch_to+0x133/0x258
[<ffffffff81039ac2>] finish_task_switch+0x88/0xb9
[<ffffffff81071011>] arch_local_irq_save+0x11/0x17
[<ffffffff8105fcd3>] autoremove_wake_function+0x0/0x2a
[<ffffffffa0163156>] kjournald2+0xc0/0x20a [jbd2]
[<ffffffff8105fcd3>] autoremove_wake_function+0x0/0x2a
[<ffffffffa0163096>] kjournald2+0x0/0x20a [jbd2]
[<ffffffff8105f681>] kthread+0x76/0x7e
[<ffffffff81356ef4>] kernel_thread_helper+0x4/0x10
[<ffffffff8105f60b>] kthread+0x0/0x7e
[<ffffffff81356ef0>] kernel_thread_helper+0x0/0x10
[<ffffffffffffffff>] 0xffffffffffffffff

root@server:~# cat /proc/3393/stack
[<ffffffff8100d02f>] load_TLS+0x7/0xa
[<ffffffff811b42e3>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffffa0546841>] ocfs2_commit_thread+0xf1/0x3a5 [ocfs2]
[<ffffffff8105fcd3>] autoremove_wake_function+0x0/0x2a
[<ffffffffa0546750>] ocfs2_commit_thread+0x0/0x3a5 [ocfs2]
[<ffffffff8105f681>] kthread+0x76/0x7e
[<ffffffff81356ef4>] kernel_thread_helper+0x4/0x10
[<ffffffff8105f60b>] kthread+0x0/0x7e
[<ffffffff81356ef0>] kernel_thread_helper+0x0/0x10
[<ffffffffffffffff>] 0xffffffffffffffff
--
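
To gather stacks like the ones above for every blocked process in one go, a small loop over /proc/<pid>/stack can help. This is a sketch under assumptions, not a tested tool: the function name dump_stacks and its optional procroot argument are invented for illustration, and reading /proc/<pid>/stack normally requires root.

```shell
#!/bin/sh
# dump_stacks [PROCROOT]
# For each PID read from stdin (first field of each line), print the
# kernel stack found in PROCROOT/<pid>/stack. PROCROOT defaults to /proc;
# the argument exists only so the function can be tried on a test directory.
dump_stacks() {
    procroot=${1:-/proc}
    while read -r pid _rest; do
        [ -n "$pid" ] || continue
        stackfile="$procroot/$pid/stack"
        echo "== PID $pid =="
        if [ -r "$stackfile" ]; then
            cat "$stackfile"
        else
            # Process already gone, or not running as root.
            echo "(stack not readable)"
        fi
    done
}

# Usage on a live system (as root):
#   ps -e -o pid= -o stat= | awk '$2 ~ /^D/ {print $1}' | dump_stacks
```

Collecting all stacks at the same moment makes it easier to see which tasks are waiting on the same lock or journal commit.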

While the load grows, these processes stay in that state. When I run "iotop -o" I see one apache process with IO at ~99%. DISK Write is 0 K/s, DISK Read is ~100 - 200 K/s. Checking that process ID with lsof, I get the following (example):
--
apache2   64039         www-data    6w     FIFO                0,8      0t0   60701189 pipe
apache2   64039         www-data    7u      REG              254,3        0    1308167 /tmp/ZCUDX8xreu (deleted)
apache2   64039         www-data    8u     0000                0,9        0       3545 anon_inode
apache2   64039         www-data    9u     IPv6           62815839      0t0        TCP *** (CLOSE_WAIT)
apache2   64039         www-data   10u     IPv4           62820353      0t0        TCP *** (CLOSE_WAIT)
apache2   64039         www-data   11u     IPv4           62820355      0t0        TCP *** (ESTABLISHED)
apache2   64039         www-data   12u     IPv4           62810900      0t0        TCP *** (ESTABLISHED)
apache2   64039         www-data   13r      REG              254,3  1679422    1308180 /tmp/phprWvjyj
apache2   64039         www-data   14w      REG              254,1   315392  517499656 /var/www/myocfs/images/original/0a6adf0421891131f30120d4235fbb08.jpg
--
The last line differs. It is always of type "w", but not always the same path. When this problem occurs, the load grows on all the other servers too (many visitors...), and all webserver processes are in "D" state. No IO is seen with iotop on the other nodes. In the shell, I can run "ls /var/www/myocfs/images/original/0a6adf0421891131f30120d4235fbb08.jpg" fine on the "active" server. The same command on the other nodes will hang / wait. An ls of another file in another path works sometimes... After some minutes (3 - 15) all nodes can access the file, the load drops back, and IO is normal again. But this isn't reproducible.
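
Since the stalls cannot be triggered on demand, one way to capture them is to log the D-state processes automatically whenever the load is high. The snippet below is only a sketch: the function name load_exceeds, the threshold of 20, the 10-second interval and the log path are all assumptions picked for illustration.

```shell
#!/bin/sh
# load_exceeds THRESHOLD [LOADAVG_FILE]
# Exit status 0 if the 1-minute load average (first field of LOADAVG_FILE,
# default /proc/loadavg) is greater than THRESHOLD, non-zero otherwise.
load_exceeds() {
    threshold=$1
    file=${2:-/proc/loadavg}
    awk -v t="$threshold" 'NR == 1 { found = ($1 > t) } END { exit !found }' "$file"
}

# Usage: append a timestamp and the D-state processes to a log file
# while the load is above 20 (all values here are example choices):
#   while sleep 10; do
#       if load_exceeds 20; then
#           date
#           ps -e -o pid= -o stat= -o comm= -o wchan= | awk '$2 ~ /^D/'
#       fi
#   done >> /var/tmp/dstate.log
```

Running this on every node would show which node first had blocked writers when an incident starts.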

debugfs.ocfs2 with fs_locks doesn't work, due to a mismatch between the kernel and ocfs2-tools (I think).
debugfs.ocfs2 -R "fs_locks" /dev/mapper/myocfs
--> Debug string proto 3 found, but 2 is the highest I understand.

- Why has this problem only been occurring for ~1 week? Why did ocfs2 run fine for weeks before that? (I've also rebooted the servers ...)
- What's the issue? Locking? How can I solve this? Or is the IO too slow?
- What causes the issue? Writes? Deletes? Why do some writes work well?

Can anybody help me with this issue? I have no further ideas :-(

Kind regards,
Karl-Heinz