[Ocfs2-users] Hangups

Wed Nov 16 03:36:37 PST 2011

Hi all!

I have a four-node OCFS2-kluster running RHEL 5.7 (x86_64) on ProLiants DL360-G5's
with 6 GB RAM each. Two webservers running Apache and two MySQL-servers. I've
shut down OCFS2 on the DB-servers since it's never really been used there, so it's only
live on the webservers.

OCFS2 is a 14-disk RAID-10 connected via dedicated QLogic iSCSI-NICs via a Gigabit
switch. The serves are connected to the same switch on a separate VLAN for the
heartbeat. They also have dedicated Gbps-NICs for frontend and backend traffic.

The problem is that last couple of weeks the write-performance has intermittently
slowed down to a crawl and right now the system is pretty much unusable. httpd has
its DocumentRoot on /san and these processes often go into D-state. "iostat -dmx 1"
often reports ~99 %util and writing anything hangs disturbingly often. My biggest
problem right now is that I've got very little to work with (that is no juicy kernel-panics
and such) and the only OCFS-related stuff in the logs look like this:

---CUT---
Nov 14 14:13:49 web02 kernel: INFO: task httpd:3959 blocked for more than 120 seconds.
Nov 14 14:13:49 web02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 14 14:13:49 web02 kernel: httpd         D ffff810001004420     0  3959   3149          3962 
3954 (NOTLB)
Nov 14 14:13:49 web02 kernel:  ffff81018c407ea8 0000000000000082 ffff81019d600140 ffffffff8875b54b
Nov 14 14:13:49 web02 kernel:  ffff810184210000 000000000000000a ffff81018eb957a0 ffffffff80314b60
Nov 14 14:13:49 web02 kernel:  000000b90056ee85 000000000018d3ad ffff81018eb95988 0000000000000003
Nov 14 14:13:49 web02 kernel: Call Trace:
Nov 14 14:13:49 web02 kernel:  [<ffffffff8875b54b>] :ocfs2:ocfs2_inode_lock_full+0x5fb/0xfe2
Nov 14 14:13:49 web02 kernel:  [<ffffffff8001b035>] cp_new_stat+0xe5/0xfd
Nov 14 14:13:49 web02 kernel:  [<ffffffff80063c53>] __mutex_lock_slowpath+0x60/0x9b
Nov 14 14:13:49 web02 kernel:  [<ffffffff80063c9d>] .text.lock.mutex+0xf/0x14
Nov 14 14:13:49 web02 kernel:  [<ffffffff80013f1b>] generic_file_llseek+0x2a/0x8b
Nov 14 14:13:49 web02 kernel:  [<ffffffff80025788>] sys_lseek+0x40/0x60
Nov 14 14:13:49 web02 kernel:  [<ffffffff8005d116>] system_call+0x7e/0x83
---CUT---

Also, I see these kind of errors regarding eth4 (heartbeat-network):
# grep 'e1000e.*Detected' /var/log/messages
Nov 13 13:25:00 web02 kernel: e1000e 0000:0b:00.0: eth4: Detected Hardware Unit Hang:
Nov 14 09:10:55 web02 kernel: e1000e 0000:0b:00.0: eth4: Detected Hardware Unit Hang:
Nov 14 21:54:49 web02 kernel: e1000e 0000:0b:00.0: eth4: Detected Hardware Unit Hang:
Nov 14 21:58:35 web02 kernel: e1000e 0000:0b:00.0: eth4: Detected Hardware Unit Hang:
Nov 15 01:42:33 web02 kernel: e1000e 0000:0b:00.0: eth4: Detected Hardware Unit Hang:
Nov 15 13:34:01 web02 kernel: e1000e 0000:0b:00.0: eth4: Detected Hardware Unit Hang:
Nov 15 20:55:24 web02 kernel: e1000e 0000:0b:00.0: eth4: Detected Hardware Unit Hang:
Nov 16 02:42:55 web02 kernel: e1000e 0000:0b:00.0: eth4: Detected Hardware Unit Hang:

However, they happen relatively seldom and I've tried flood-pinging hosts over that VLAN and
I've never experienced timeouts nor packet-loss so really don't think that's the issue.

I managed to catch an strace running a dd to the drive and it hanged during the 8th 10MB-write for
approx 1 minute and 30 seconds. Things are looking surprisingly cheerful right now as only maybe
1 out of 10 tries hangs like this.

---CUT---
# strace dd if=/dev/zero of=/san/testfile bs=1M count=10
...
open("/san/testfile", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 1
...
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576<APPROX 1,5
MIN DELAY>) = 1048576
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
close(0)                                = 0
close(1)                                = 0
...
write(2, "10+0 records in\n10+0 records out"..., 3310+0 records in
10+0 records out
) = 33
write(2, "10485760 bytes (10 MB) copied", 2910485760 bytes (10 MB) copied) = 29
write(2, ", 86.8347 seconds, 121 kB/s\n", 28, 86.8347 seconds, 121 kB/s
) = 28
...
---CUT---

I've made sure the disk-enclosure is OK, running latest firmware and experiencing no errors,
no other NIC-related problems other than the one mentioned above (eth4).

I realize that there's not much pointing towards OCFS having problems but I cannot for the life
of me comprehend the nature of this problem. If nothing else, I'd appreciate any help in determining
that OCFS can be eliminated from the equation! I've added some more information below. Please
let me know if you  need any additional information!

Thanks in advance,
-- Andy Herrero

# uname -a
Linux web01 2.6.18-274.7.1.el5 #1 SMP Mon Oct 17 11:57:14 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

# rpm -qa | grep ocfs
ocfs2-tools-1.4.4-1.el5
ocfs2-2.6.18-238.9.1.el5-1.4.7-1.el5
ocfs2-2.6.18-274.el5-1.4.7-1.el5
ocfs2-2.6.18-274.7.1.el5-1.4.7-1.el5

# cat /etc/ocfs2/cluster.conf
cluster:
                name = ocfs2
                node_count = 4

node:
                name = web01
                cluster = ocfs2
                number = 0
                ip_address = 10.60.2.11
                ip_port = 7777

node:
                name = web02
                cluster = ocfs2
                number = 1
                ip_address = 10.60.2.12
                ip_port = 7777

node:
                name = db01
                cluster = ocfs2
                number = 2
                ip_address = 10.60.2.13
                ip_port = 7777

node:
                name = db02
                cluster = ocfs2
                number = 3
                ip_address = 10.60.2.14
                ip_port = 7777

# cat /etc/sysconfig/o2cb
#
# This is a configuration file for automatic startup of the O2CB
# driver.  It is generated by running /etc/init.d/o2cb configure.
# On Debian based systems the preferred method is running
# 'dpkg-reconfigure ocfs2-tools'.
#

# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_STACK: The name of the cluster stack backing O2CB.
O2CB_STACK=o2cb

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=ocfs2

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=61

# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
O2CB_IDLE_TIMEOUT_MS=

# O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
O2CB_KEEPALIVE_DELAY_MS=

# O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
O2CB_RECONNECT_DELAY_MS=

# mounted.ocfs2 -f
Device                FS     Nodes
/dev/sda1             ocfs2  web01, web02

# df -h /dev/sda1
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             6.4T  5.4T 1001G  85% /san

# df -i /dev/sda1
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sda1            1708654583 1446362212 262292371   85% /san

# debugfs.ocfs2 -R "ls -l //" /dev/sda1
        6               drwxr-xr-x   7     0     0            4096 17-Dec-2008 16:17 .
        6               drwxr-xr-x   7     0     0            4096 17-Dec-2008 16:17 ..
        7               -rw-r--r--   1     0     0               0 17-Dec-2008 16:17 bad_blocks
        8               -rw-r--r--   1     0     0          831488 17-Dec-2008 16:17 global_inode_alloc
        9               -rw-r--r--   1     0     0            4096 17-Dec-2008 16:47 slot_map
        10              -rw-r--r--   1     0     0         1048576 17-Dec-2008 16:17 heartbeat
        11              -rw-r--r--   1     0     0   6998649171968 17-Dec-2008 16:17 global_bitmap
        12              drwxr-xr-x   2     0     0          221184 16-Nov-2011 11:08 orphan_dir:0000
        13              drwxr-xr-x   2     0     0           32768 16-Nov-2011 12:01 orphan_dir:0001
        14              drwxr-xr-x   2     0     0           45056 14-Nov-2011 13:45 orphan_dir:0002
        15              drwxr-xr-x   2     0     0           73728 14-Nov-2011 17:07 orphan_dir:0003
        16              drwxr-xr-x   2     0     0            4096 17-Dec-2008 16:17 orphan_dir:0004
        17              -rw-r--r--   1     0     0       310378496 17-Dec-2008 16:17 extent_alloc:0000
        18              -rw-r--r--   1     0     0        62914560 17-Dec-2008 16:17 extent_alloc:0001
        19              -rw-r--r--   1     0     0       243269632 17-Dec-2008 16:17 extent_alloc:0002
        20              -rw-r--r--   1     0     0       146800640 17-Dec-2008 16:17 extent_alloc:0003
        21              -rw-r--r--   1     0     0               0 17-Dec-2008 16:17 extent_alloc:0004
        22              -rw-r--r--   1     0     0      1044381696 17-Dec-2008 16:17 inode_alloc:0000
        23              -rw-r--r--   1     0     0       792723456 17-Dec-2008 16:17 inode_alloc:0001
        24              -rw-r--r--   1     0     0      1132462080 17-Dec-2008 16:17 inode_alloc:0002
        25              -rw-r--r--   1     0     0      1040187392 17-Dec-2008 16:17 inode_alloc:0003
        26              -rw-r--r--   1     0     0               0 17-Dec-2008 16:17 inode_alloc:0004
        27              -rw-r--r--   1     0     0       268435456 17-Dec-2008 16:40 journal:0000
        28              -rw-r--r--   1     0     0       268435456 17-Dec-2008 16:40 journal:0001
        29              -rw-r--r--   1     0     0       268435456 17-Dec-2008 16:40 journal:0002
        30              -rw-r--r--   1     0     0       268435456 17-Dec-2008 16:40 journal:0003
        31              -rw-r--r--   1     0     0       268435456 17-Dec-2008 16:40 journal:0004
        32              -rw-r--r--   1     0     0               0 17-Dec-2008 16:17 local_alloc:0000
        33              -rw-r--r--   1     0     0               0 17-Dec-2008 16:17 local_alloc:0001
        34              -rw-r--r--   1     0     0               0 17-Dec-2008 16:17 local_alloc:0002
        35              -rw-r--r--   1     0     0               0 17-Dec-2008 16:17 local_alloc:0003
        36              -rw-r--r--   1     0     0               0 17-Dec-2008 16:17 local_alloc:0004
        37              -rw-r--r--   1     0     0               0 17-Dec-2008 16:17 truncate_log:0000
        38              -rw-r--r--   1     0     0               0 17-Dec-2008 16:17 truncate_log:0001
        39              -rw-r--r--   1     0     0               0 17-Dec-2008 16:17 truncate_log:0002
        40              -rw-r--r--   1     0     0               0 17-Dec-2008 16:17 truncate_log:0003
        41              -rw-r--r--   1     0     0               0 17-Dec-2008 16:17 truncate_log:0004

# hdparm -tT /dev/sda1

/dev/sda1:
 Timing cached reads:   22396 MB in  2.00 seconds = 11225.69 MB/sec
 Timing buffered disk reads:  238 MB in  3.00 seconds =  79.32 MB/sec