[Ocfs2-users] OCFS2

Andrew.MORLEY at sungard.com
Thu Apr 24 09:06:53 PDT 2014


Hi,

I have an issue with OCFS2 and I am not quite sure where the problem is; I would be grateful for any feedback. It looks like a multipath issue, but I have redundant links, so I am not sure why OCFS2 would barf and bring the server down.

I have a set of production servers that have started showing the same error.
I am not aware of any changes within the infrastructure.

The setup is:

4x EqualLogic PS6100X arrays.
Numerous Dell R610 servers, all with multiple iSCSI interfaces.


This has happened on 3 different servers in the last week, causing the servers to hang.



I have checked all switches and logs and can see no flapping interfaces. I can see the iSCSI initiator making logout and login requests during this time period.
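
For reference, the host-side checks I ran looked roughly like this (a sketch from memory; exact log locations depend on the syslog setup):

iscsiadm -m session -P 1                         # list active iSCSI sessions and their connection state
multipath -ll                                    # show multipath topology and per-path status
grep -E 'multipathd|iscsid' /var/log/messages    # pull the multipathd/iscsid messages for the incident window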

I see in the logs:

Apr 22 15:53:09 servername multipathd: eql-0-8a0906-2d6a4c605-13244eee0b250b79_a: Entering recovery mode: max_retries=5
Apr 22 15:53:09 servername multipathd: 8:176: mark as failed
Apr 22 15:53:09 servername multipathd: 8:16: mark as failed
Apr 22 15:53:09 servername multipathd: 8:48: mark as failed
Apr 22 15:53:09 servername multipathd: 8:64: mark as failed
Apr 22 15:53:09 servername multipathd: 8:128: mark as failed
Apr 22 15:53:09 servername multipathd: 8:160: mark as failed
Apr 22 15:53:09 servername multipathd: eql-0-8a0906-2d6a4c605-13244eee0b250b79_a: Entering recovery mode: max_retries=5
Apr 22 15:53:09 servername multipathd: 8:176: mark as failed
Apr 22 15:53:09 servername multipathd: 8:16: mark as failed
Apr 22 15:53:09 servername multipathd: 8:48: mark as failed
Apr 22 15:53:09 servername multipathd: 8:64: mark as failed
Apr 22 15:53:09 servername multipathd: 8:128: mark as failed
Apr 22 15:53:09 servername multipathd: 8:160: mark as failed
Apr 22 15:53:11 servername kernel: (kmpathd/6,2888,6):o2hb_bio_end_io:241 ERROR: IO Error -5
Apr 22 15:53:11 servername kernel: Buffer I/O error on device dm-7, logical block 480
Apr 22 15:53:11 servername kernel: lost page write due to I/O error on dm-7
Apr 22 15:53:11 servername kernel: scsi 114:0:0:0: rejecting I/O to dead device
Apr 22 15:53:11 servername kernel: device-mapper: multipath: Failing path 8:176.
Apr 22 15:53:11 servername kernel: (o2hb-1B3B9BEE63,4754,7):o2hb_do_disk_heartbeat:772 ERROR: status = -5
Apr 22 15:53:11 servername multipathd: dm-4: add map (uevent)
Apr 22 15:53:11 servername kernel: scsi 115:0:0:0: rejecting I/O to dead device
Apr 22 15:53:11 servername kernel: device-mapper: multipath: Failing path 8:16.
Apr 22 15:53:11 servername multipathd: dm-4: devmap already registered
Apr 22 15:53:11 servername multipathd: dm-4: add map (uevent)
Apr 22 15:53:11 servername multipathd: dm-4: devmap already registered
Apr 22 15:53:11 servername multipathd: dm-3: add map (uevent)
Apr 22 15:53:11 servername kernel: scsi 110:0:0:0: rejecting I/O to dead device
Apr 22 15:53:11 servername kernel: device-mapper: multipath: Failing path 8:48.


Apr 22 15:53:17 servername multipathd: asvolume: load table [0 629145600 multipath 0 0 1 1 round-robin 0 6 1 8:32 10 8:80 10 8:96 10 8:112 10 8:144 10 8:16 10]
Apr 22 15:53:17 servername multipathd: dm-2: add map (uevent)
Apr 22 15:53:17 servername multipathd: dm-2: devmap already registered
Apr 22 15:53:17 servername multipathd: dm-8: add map (uevent)
Apr 22 15:53:17 servername iscsid: Connection117:0 to [target: iqn.2001-05.com.equallogic:0-8a0906-2d6a4c605-13244eee0b250b79-as14volumeocfs2, portal: 192.168.5.100,3260] through [iface: eql.eth2_2] is operational now
Apr 22 15:53:22 servername multipathd: dm-3: add map (uevent)
Apr 22 15:53:22 servername multipathd: dm-3: devmap already registered
Apr 22 15:53:22 servername multipathd: dm-4: add map (uevent)
Apr 22 15:53:22 servername multipathd: dm-4: devmap already registered
Apr 22 15:53:22 servername multipathd: dm-5: add map (uevent)
Apr 22 15:53:22 servername multipathd: dm-5: devmap already registered
Apr 22 15:53:22 servername multipathd: dm-9: add map (uevent)
Apr 22 15:53:22 servername multipathd: dm-9: devmap already registered
Apr 22 15:53:22 servername kernel: get_page_tbl ctx=0xffff810623d041c0 (253:6): bits=2, mask=0x3, num=20480, max=20480


Then OCFS2 itself starts erroring (status = -5 is -EIO, i.e. the I/O error from the failed paths propagating up to the filesystem):


Apr 22 15:53:23 servername kernel: (ocfs2cmt,4773,6):ocfs2_commit_cache:191 ERROR: status = -5
Apr 22 15:53:23 servername kernel: (ocfs2cmt,4773,6):ocfs2_commit_thread:1799 ERROR: status = -5
Apr 22 15:53:23 servername kernel: (ocfs2cmt,4773,6):ocfs2_commit_cache:191 ERROR: status = -5





Then the console output degenerates into interleaved printk fragments from the ocfs2cmt thread (the <3> pieces are KERN_ERR log-level prefixes from messages colliding mid-line):

Apr 22 15:53:23 servername kernel: s2cmt,4773,6):ocfs2<3>(ocfs2c<3>(ocfs2cmt,4773,6):ocfs2_commit_cache:191 ERROR: status = -5

Apr 22 15:53:23 servername kernel: (ocfs2cmt,4773,6):ocfs2_commi<3>(ocfs2cm<<3>(<3>(ocfs2cmt,4773,6):ocfs2_commit_cache:191 ERROR: status = -5

Apr 22 15:53:23 servername kernel: (ocfs2<3>(ocfs2cmt,47<3>(ocf<3>(ocfs2cmt,47<3>(ocfs<3>(ocfs2cmt,4<3>(ocf<3>(ocfs<3>(ocf<3>
(ocfs2cm<3>(o<3>(ocfs2cm<3>(ocf<3>(ocfs2cmt<3>(o<3>(ocfs2cmt<3>(ocfs2cm<3>(ocfs2c<3><3>(ocfs2<3>(oc<3>(ocfs2cmt,<3>(ocf<3>(oc
fs2cmt,47<3>(ocf<3>(ocfs2cmt,47<3>(ocfs<3>(ocfs2c<3>(o<3>(ocfs2c<3>(oc<3>(ocfs2cmt,47<3>(o<3>(ocfs2cmt,477<3>(ocfs<3>(ocfs2c<
3>(ocf<3>(ocfs2cmt<3>(<3>(ocfs2cmt,4773<3>(oc<3>(ocfs2cmt,<3>(oc<3>(ocfs2cmt<3>(ocfs<3>(ocfs2cm<3>(oc<3>(ocfs<3>(oc<3>(ocf<3>
(ocfs2cmt,<3>(oc<3>(ocfs2cmt<3>(ocfs2<3>(ocfs2<3>(<3>(ocfs2cmt,4773,<3>(oc<3>(ocfs2cmt,4773,<3>(ocfs<3>(ocfs2cmt<3>(oc<3>(ocf
s2cmt,477<3>(ocf<3>(ocfs2cmt,477<3>(<3>(ocfs2cmt,<3>(oc<3>(ocfs2cmt,<3>(o<3>(ocfs2cmt<3>(ocfs<3>(ocfs2c<3>(ocf<3>(ocfs2cmt<3>
(ocfs<3>(ocfs2c<3>(ocf<3>(ocfs2cmt<3>(<3>(ocfs2<3>(ocf<3>(ocfs2cmt<3>(oc<3>(ocfs2cmt<3>(oc<3>(ocfs<3>(ocfs2<3>(ocfs2c<3>(o<3>
(ocfs2cmt,4<3>(ocf<3>(ocfs2<3>(oc<3>(ocfs2cm<3>(oc<3>(ocfs2cmt<3>(oc<3>(ocfs2cmt<3>(ocfs<3>(ocfs2cmt,<3>(ocfs<3>(ocfs2c<3>(oc
fs2<3>(ocfs2c<3>(ocfs2c<3>(ocf

Apr 22 15:53:23 servername kernel: 2cmt,4773,6):<3>(ocf<3>(ocfs2cmt,<3>(ocfs2<3>(ocfs2cmt,<3>(ocfs<3>(ocfs2cmt<3>(ocf<3>(ocfs
2cmt,47<3>(ocf<3>(ocfs2cmt,47<3>(ocfs<3>(ocfs2cmt,<3>(o<3>(ocfs2cmt,4<3>(ocf<3>(ocfs2cmt<3>(ocf<3>(ocfs2cmt<3>(ocf<3>(ocfs2cm
t,<3>(ocf<3>(ocfs2cmt<3>(ocfs2<3>(ocfs2cmt<3>(<3>(ocfs2cm<3>(ocfs<3>(ocfs2cmt<3>(ocfs2<3>(ocfs2cmt<3>(oc<3>(ocfs2cmt<3>(ocfs<
3>(ocfs2<3>(ocf<3>(ocfs2cmt,4773,<3>(oc<3>(ocfs2cm<3>(ocfs2<3>(ocfs2cm<3>(oc<3>(ocfs2cmt,4773,6):<3>(<3>(ocfs2cmt<3>(oc<3>(oc
fs2cm<3>(ocfs2<3>(ocfs2cmt<3>(o<3>(ocfs2cmt<3>(ocf<3>(ocfs2c<3>(ocfs2c<3>(ocfs2cmt,<3>(oc<3>(ocfs2c<3>(ocfs2cm<3>(ocfs2cmt<3>
(o<3>(ocfs2cmt<3>(o<3>(ocfs2cm<3><3>(ocfs2cmt<3>(ocfs2c<3>(ocfs2cmt,<3>(o<3>(ocfs2cmt<3>(ocf<3>(ocfs2cmt<3>(ocf<3>(ocfs2cmt<3
>(o<3>(ocfs2<3>(oc<3>(ocfs2cmt,47<3>(oc<3>(ocfs2cmt,4773,6<3>(o<3>(ocfs2cm<3>(ocf<3>(ocfs2<3>(o<3>(ocfs2<3>(<3>(ocfs2cm<3>(oc
<3>(ocfs<3>(ocfs2c<3>(ocfs2cmt<3>(o<3>(ocfs2cm<3>(ocf<3>(ocfs2cmt<3><3>(ocfs2cmt,<3>(o<3>(ocfs2cmt,4<3>(oc<3>(ocfs2c<3>(o<3>(
ocfs2cmt,<3>(o<3>(ocfs2cmt<3>(

This repeats thousands of times, bringing the server to a halt.


cat /etc/multipath.conf

blacklist {
        devnode "^sd[a]$"
}

## Use user friendly names, instead of using WWIDs as names.
defaults {
        user_friendly_names yes
}
multipaths {
        multipath {
                wwid                    36090a058604c6a2d790b250bee4exxxx
                alias                   asvolume
                path_grouping_policy    multibus
                #path_checker            readsector0
                path_selector           "round-robin 0"
                failback                immediate
                rr_weight               priorities
                rr_min_io               10
                no_path_retry           5
        }
}
devices {
        device {
                vendor                  "EQLOGIC"
                product                 "100E-00"
                path_grouping_policy    multibus
                getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
                #features               "1 queue_if_no_path"
                path_checker            readsector0
                path_selector           "round-robin 0"
                failback                immediate
                rr_min_io               10
                rr_weight               priorities
        }
}
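
My understanding (from the multipath.conf man page, so treat this as an assumption) is that no_path_retry 5 means that once every path in the map has failed, multipathd keeps I/O queued for only 5 checker intervals and then fails everything back to the filesystem with -EIO, which would match the o2hb errors above. The queueing alternative would be roughly a one-line change:

multipath {
        wwid                    36090a058604c6a2d790b250bee4exxxx
        alias                   asvolume
        ...
        no_path_retry           queue   # queue I/O indefinitely while no paths are up
}

though I am not sure whether indefinite queueing is sane alongside the o2cb disk heartbeat, since a node that queues past the dead threshold will presumably get fenced anyway. Comments welcome.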

cat /etc/ocfs2/cluster.conf



node:
        ip_port = 8888
        ip_address = x.x.x.x
        number = 9
        name = servername
        cluster = ocfs
node:
        ip_port = 8888
        ip_address = x.x.x.x
        number = 109
        name = servername1
        cluster = ocfs

[more node entries elided]

cluster:
        node_count = 22
        name = ocfs

Cluster consists of 14 nodes.

/etc/init.d/o2cb status


Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs: Online
Heartbeat dead threshold = 61
  Network idle timeout: 30000
  Network keepalive delay: 2000
  Network reconnect delay: 2000
Checking O2CB heartbeat: Active
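
If I read the O2CB documentation right, a dead threshold of 61 means a node is only deemed dead after (61 - 1) * 2 = 120 seconds of missed disk heartbeats. By contrast, no_path_retry 5 with the default polling_interval of 5 seconds (an assumption; I have not overridden it) gives up after roughly 5 * 5 = 25 seconds of all paths being down. So the heartbeat thread receives -EIO long before the 120-second fencing window expires, which seems consistent with o2hb_do_disk_heartbeat logging status = -5 rather than the node simply being fenced.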






Server and package information.

cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.10 (Tikanga)

rpm -qa | grep multipath
device-mapper-multipath-0.4.7-59.el5


rpm -qa | grep ocfs2

ocfs2-2.6.18-371.3.1.el5-1.4.10-1.el5
ocfs2-tools-1.4.4-1.el5
ocfs2console-1.4.4-1.el5

rpm -qa | grep kernel

kernel-2.6.18-371.3.1.el5

modinfo ocfs2

filename:       /lib/modules/2.6.18-371.3.1.el5/kernel/fs/ocfs2/ocfs2.ko
license:        GPL
author:         Oracle
version:        1.4.10
description:    OCFS2 1.4.10 Thu Dec  5 16:38:36 PST 2013 (build b703e5e0906b370c876b657dabe8d4c8)
srcversion:     41115DB9EFDAA5735C18810
depends:        ocfs2_dlm,jbd,ocfs2_nodemanager
vermagic:       2.6.18-371.3.1.el5 SMP mod_unload gcc-4.1


