[Ocfs2-users] OCFS2

Andrew.MORLEY at sungard.com Andrew.MORLEY at sungard.com
Tue Apr 29 01:14:54 PDT 2014


Hi Marty,

Thanks for taking a look at the issue; I thought I would provide a little more information.

The iSCSI traffic runs on a flat class C network isolated for iSCSI. All server nodes have two NICs, each connected through a separate switch in active/active, so connectivity to the volume is always available (no flapping interfaces).

No nodes/servers have failed at the same time, so to my mind that rules out the infrastructure (SANs and switches).
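
For reference, this is roughly how I have been checking path and session state on each node (the map name "asvolume" is just the alias from my multipath.conf, adjust as needed):

# show the multipath map and the state of every path under it
multipath -ll asvolume

# list the active iSCSI sessions, one per initiator interface/portal
iscsiadm -m session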
 
In all cases ocfs2cmt has brought the server to a halt with thousands of messages per second. Would this be normal behaviour on a failed path?
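
As a stopgap while we investigate, I am thinking of lowering the console log level so the flood of KERN_ERR messages at least does not make the console unusable (a sketch only, it obviously does not address the underlying I/O failure):

# only emergency-level kernel messages reach the console; errors still go to syslog
dmesg -n 1

# equivalent via sysctl (console_loglevel is the first value)
sysctl -w kernel.printk="1 4 1 7"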

Nothing of interest before Apr 22 15:53:09.

Thanks
And

-----Original Message-----
From: Marty Sweet [mailto:msweet.dev at gmail.com] 
Sent: 25 April 2014 19:19
To: Morley, Andrew
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] OCFS2

Thanks for the information,

At first glance I would say this is not an OCFS2 issue. It would appear that all your iSCSI targets are going offline at the same time, causing multipath to fail the device - which OCFS2 should not be expected to deal with gracefully (it is most likely to fence to ensure data integrity).
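
As a rough sanity check, it is worth comparing how long multipath will queue I/O once all paths are gone against how long o2hb tolerates missed disk heartbeats before fencing. The numbers below are only a sketch based on the config you posted and the usual defaults:

# multipath: no_path_retry 5 with the default 5s polling_interval means
# queueing stops and I/O gets -EIO roughly 5 * 5 = 25 seconds after the
# last path fails
grep -E 'no_path_retry|polling_interval' /etc/multipath.conf

# o2cb: a heartbeat dead threshold of 61 is roughly (61 - 1) * 2 = 120
# seconds of missed disk heartbeats before a node is declared dead
grep O2CB_HEARTBEAT_THRESHOLD /etc/sysconfig/o2cb
/etc/init.d/o2cb status

A brief path flap should therefore sit well inside the heartbeat window. If the -5 errors appear only a couple of seconds after the paths are marked failed, as your log suggests, it would be worth checking whether I/O is actually being queued for those retries at all.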

That being said, we are happy to help you with this issue.

Do all the servers have this problem at the same time? If so, it is likely that this is more a problem with the iSCSI target or the network leading to it (can you provide a topology diagram?).

Once all the paths disappear they are most likely not recovering. How many messages are there before Apr 22 15:53:09?
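
Something along these lines would give a quick count (adjust the log file path and the date to match your systems; this is just a sketch):

# count multipathd messages logged on Apr 22 before 15:53:09
awk '$1 == "Apr" && $2 == 22 && $3 < "15:53:09" && /multipathd/' /var/log/messages | wc -l

# and find the first time a map entered recovery mode that day
grep -n 'Entering recovery mode' /var/log/messages | head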


Marty



On 24 April 2014 17:06,  <Andrew.MORLEY at sungard.com> wrote:
> Hi,
>
>
>
> I have an issue with ocfs2 and I am not quite sure where the problem
> is. I would be grateful for any feedback. The issue looks like a
> multipath issue; however, I have redundant links, so I am not quite
> sure why ocfs2 would barf and bring the server down.
>
>
>
> I have a set of production servers that have started showing the same error.
>
> I am not aware of any changes within the infrastructure.
>
>
>
> The setup is:
>
>
>
> 4 x EqualLogic PS6100X arrays.
>
> Lots of Dell R610 servers, all with multiple iSCSI interfaces.
>
>
>
>
>
> This has happened on 3 different servers in the last week, causing the 
> servers to hang.
>
>
>
>
>
>
>
> I have checked all switches and logs and can see no flapping
> interfaces. I can see the iSCSI initiator making logout and login
> requests during this time period.
>
>
>
> I see the following in the logs:
>
>
>
> Apr 22 15:53:09 servername multipathd: eql-0-8a0906-2d6a4c605-13244eee0b250b79_a: Entering recovery mode: max_retries=5
> Apr 22 15:53:09 servername multipathd: 8:176: mark as failed
> Apr 22 15:53:09 servername multipathd: 8:16: mark as failed
> Apr 22 15:53:09 servername multipathd: 8:48: mark as failed
> Apr 22 15:53:09 servername multipathd: 8:64: mark as failed
> Apr 22 15:53:09 servername multipathd: 8:128: mark as failed
> Apr 22 15:53:09 servername multipathd: 8:160: mark as failed
> Apr 22 15:53:09 servername multipathd: eql-0-8a0906-2d6a4c605-13244eee0b250b79_a: Entering recovery mode: max_retries=5
> Apr 22 15:53:09 servername multipathd: 8:176: mark as failed
> Apr 22 15:53:09 servername multipathd: 8:16: mark as failed
> Apr 22 15:53:09 servername multipathd: 8:48: mark as failed
> Apr 22 15:53:09 servername multipathd: 8:64: mark as failed
> Apr 22 15:53:09 servername multipathd: 8:128: mark as failed
> Apr 22 15:53:09 servername multipathd: 8:160: mark as failed
> Apr 22 15:53:11 servername kernel: (kmpathd/6,2888,6):o2hb_bio_end_io:241 ERROR: IO Error -5
> Apr 22 15:53:11 servername kernel: Buffer I/O error on device dm-7, logical block 480
> Apr 22 15:53:11 servername kernel: lost page write due to I/O error on dm-7
> Apr 22 15:53:11 servername kernel: scsi 114:0:0:0: rejecting I/O to dead device
> Apr 22 15:53:11 servername kernel: device-mapper: multipath: Failing path 8:176.
> Apr 22 15:53:11 servername kernel: (o2hb-1B3B9BEE63,4754,7):o2hb_do_disk_heartbeat:772 ERROR: status = -5
> Apr 22 15:53:11 servername multipathd: dm-4: add map (uevent)
> Apr 22 15:53:11 servername kernel: scsi 115:0:0:0: rejecting I/O to dead device
> Apr 22 15:53:11 servername kernel: device-mapper: multipath: Failing path 8:16.
> Apr 22 15:53:11 servername multipathd: dm-4: devmap already registered
> Apr 22 15:53:11 servername multipathd: dm-4: add map (uevent)
> Apr 22 15:53:11 servername multipathd: dm-4: devmap already registered
> Apr 22 15:53:11 servername multipathd: dm-3: add map (uevent)
> Apr 22 15:53:11 servername kernel: scsi 110:0:0:0: rejecting I/O to dead device
> Apr 22 15:53:11 servername kernel: device-mapper: multipath: Failing path 8:48.
>
>
>
>
>
> Apr 22 15:53:17 servername multipathd: asvolume: load table [0 629145600 multipath 0 0 1 1 round-robin 0 6 1 8:32 10 8:80 10 8:96 10 8:112 10 8:144 10 8:16 10]
> Apr 22 15:53:17 servername multipathd: dm-2: add map (uevent)
> Apr 22 15:53:17 servername multipathd: dm-2: devmap already registered
> Apr 22 15:53:17 servername multipathd: dm-8: add map (uevent)
> Apr 22 15:53:17 servername iscsid: Connection117:0 to [target: iqn.2001-05.com.equallogic:0-8a0906-2d6a4c605-13244eee0b250b79-as14volumeocfs2, portal: 192.168.5.100,3260] through [iface: eql.eth2_2] is operational now
> Apr 22 15:53:22 servername multipathd: dm-3: add map (uevent)
> Apr 22 15:53:22 servername multipathd: dm-3: devmap already registered
> Apr 22 15:53:22 servername multipathd: dm-4: add map (uevent)
> Apr 22 15:53:22 servername multipathd: dm-4: devmap already registered
> Apr 22 15:53:22 servername multipathd: dm-5: add map (uevent)
> Apr 22 15:53:22 servername multipathd: dm-5: devmap already registered
> Apr 22 15:53:22 servername multipathd: dm-9: add map (uevent)
> Apr 22 15:53:22 servername multipathd: dm-9: devmap already registered
> Apr 22 15:53:22 servername kernel: get_page_tbl ctx=0xffff810623d041c0 (253:6): bits=2, mask=0x3, num=20480, max=20480
>
>
>
>
>
> Then ocfs2 has an issue:
>
>
>
>
>
> Apr 22 15:53:23 servername kernel: (ocfs2cmt,4773,6):ocfs2_commit_cache:191 ERROR: status = -5
> Apr 22 15:53:23 servername kernel: (ocfs2cmt,4773,6):ocfs2_commit_thread:1799 ERROR: status = -5
> Apr 22 15:53:23 servername kernel: (ocfs2cmt,4773,6):ocfs2_commit_cache:191 ERROR: status = -5
>
>
>
>
>
>
>
>
>
>
>
> Then the messages start to interleave and become garbled:
>
>
>
> Apr 22 15:53:23 servername kernel: s2cmt,4773,6):ocfs2<3>(ocfs2c<3>(ocfs2cmt,4773,6):ocfs2_commit_cache:191 ERROR: status = -5
>
>
>
> Apr 22 15:53:23 servername kernel: (ocfs2cmt,4773,6):ocfs2_commi<3>(ocfs2cm<<3>(<3>(ocfs2cmt,4773,6):ocfs2_commit_cache:191 ERROR: status = -5
>
>
>
> Apr 22 15:53:23 servername kernel:
> (ocfs2<3>(ocfs2cmt,47<3>(ocf<3>(ocfs2cmt,47<3>(ocfs<3>(ocfs2cmt,4<3>(o
> cf<3>(ocfs<3>(ocf<3>
>
> (ocfs2cm<3>(o<3>(ocfs2cm<3>(ocf<3>(ocfs2cmt<3>(o<3>(ocfs2cmt<3>(ocfs2c
> m<3>(ocfs2c<3><3>(ocfs2<3>(oc<3>(ocfs2cmt,<3>(ocf<3>(oc
>
> fs2cmt,47<3>(ocf<3>(ocfs2cmt,47<3>(ocfs<3>(ocfs2c<3>(o<3>(ocfs2c<3>(oc
> <3>(ocfs2cmt,47<3>(o<3>(ocfs2cmt,477<3>(ocfs<3>(ocfs2c<
>
> 3>(ocf<3>(ocfs2cmt<3>(<3>(ocfs2cmt,4773<3>(oc<3>(ocfs2cmt,<3>(oc<3>(oc
> 3>fs2cmt<3>(ocfs<3>(ocfs2cm<3>(oc<3>(ocfs<3>(oc<3>(ocf<3>
>
> (ocfs2cmt,<3>(oc<3>(ocfs2cmt<3>(ocfs2<3>(ocfs2<3>(<3>(ocfs2cmt,4773,<3
> >(oc<3>(ocfs2cmt,4773,<3>(ocfs<3>(ocfs2cmt<3>(oc<3>(ocf
>
> s2cmt,477<3>(ocf<3>(ocfs2cmt,477<3>(<3>(ocfs2cmt,<3>(oc<3>(ocfs2cmt,<3
> >(o<3>(ocfs2cmt<3>(ocfs<3>(ocfs2c<3>(ocf<3>(ocfs2cmt<3>
>
> (ocfs<3>(ocfs2c<3>(ocf<3>(ocfs2cmt<3>(<3>(ocfs2<3>(ocf<3>(ocfs2cmt<3>(
> oc<3>(ocfs2cmt<3>(oc<3>(ocfs<3>(ocfs2<3>(ocfs2c<3>(o<3>
>
> (ocfs2cmt,4<3>(ocf<3>(ocfs2<3>(oc<3>(ocfs2cm<3>(oc<3>(ocfs2cmt<3>(oc<3
> >(ocfs2cmt<3>(ocfs<3>(ocfs2cmt,<3>(ocfs<3>(ocfs2c<3>(oc
>
> fs2<3>(ocfs2c<3>(ocfs2c<3>(ocf
>
>
>
> Apr 22 15:53:23 servername kernel:
> 2cmt,4773,6):<3>(ocf<3>(ocfs2cmt,<3>(ocfs2<3>(ocfs2cmt,<3>(ocfs<3>(ocf
> s2cmt<3>(ocf<3>(ocfs
>
> 2cmt,47<3>(ocf<3>(ocfs2cmt,47<3>(ocfs<3>(ocfs2cmt,<3>(o<3>(ocfs2cmt,4<
> 3>(ocf<3>(ocfs2cmt<3>(ocf<3>(ocfs2cmt<3>(ocf<3>(ocfs2cm
>
> t,<3>(ocf<3>(ocfs2cmt<3>(ocfs2<3>(ocfs2cmt<3>(<3>(ocfs2cm<3>(ocfs<3>(o
> cfs2cmt<3>(ocfs2<3>(ocfs2cmt<3>(oc<3>(ocfs2cmt<3>(ocfs<
>
> 3>(ocfs2<3>(ocf<3>(ocfs2cmt,4773,<3>(oc<3>(ocfs2cm<3>(ocfs2<3>(ocfs2cm
> 3><3>(oc<3>(ocfs2cmt,4773,6):<3>(<3>(ocfs2cmt<3>(oc<3>(oc
>
> fs2cm<3>(ocfs2<3>(ocfs2cmt<3>(o<3>(ocfs2cmt<3>(ocf<3>(ocfs2c<3>(ocfs2c
> <3>(ocfs2cmt,<3>(oc<3>(ocfs2c<3>(ocfs2cm<3>(ocfs2cmt<3>
>
> (o<3>(ocfs2cmt<3>(o<3>(ocfs2cm<3><3>(ocfs2cmt<3>(ocfs2c<3>(ocfs2cmt,<3
> >(o<3>(ocfs2cmt<3>(ocf<3>(ocfs2cmt<3>(ocf<3>(ocfs2cmt<3
>
>>(o<3>(ocfs2<3>(oc<3>(ocfs2cmt,47<3>(oc<3>(ocfs2cmt,4773,6<3>(o<3>(ocfs
>>2cm<3>(ocf<3>(ocfs2<3>(o<3>(ocfs2<3>(<3>(ocfs2cm<3>(oc
>
> <3>(ocfs<3>(ocfs2c<3>(ocfs2cmt<3>(o<3>(ocfs2cm<3>(ocf<3>(ocfs2cmt<3><3
> >(ocfs2cmt,<3>(o<3>(ocfs2cmt,4<3>(oc<3>(ocfs2c<3>(o<3>(
>
> ocfs2cmt,<3>(o<3>(ocfs2cmt<3>(
>
>
>
> This repeated thousands of times, bringing the server to a halt.
>
>
>
>
>
> cat /etc/multipath.conf
>
>
>
> blacklist {
>
>         devnode "^sd[a]$"
>
> }
>
>
>
> ## Use user friendly names, instead of using WWIDs as names.
>
> defaults {
>
>         user_friendly_names yes
>
> }
>
> multipaths {
>
>         multipath {
>
>                 wwid                    36090a058604c6a2d790b250bee4exxxx
>
>                 alias                   asvolume
>
>                 path_grouping_policy    multibus
>
>                 #path_checker            readsector0
>
>                 path_selector           "round-robin 0"
>
>                 failback                immediate
>
>                 rr_weight               priorities
>
>                 rr_min_io               10
>
>                 no_path_retry           5
>
>         }
>
> }
>
> devices {
>
>         device {
>
>                 vendor                  "EQLOGIC"
>
>                 product                 "100E-00"
>
>                 path_grouping_policy    multibus
>
>                 getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
>
>                 #features               "1 queue_if_no_path"
>
>                 path_checker            readsector0
>
>                 path_selector           "round-robin 0"
>
>                 failback                immediate
>
>                 rr_min_io               10
>
>                 rr_weight               priorities
>
>         }
>
> }
>
>
>
> cat /etc/ocfs2/cluster.conf
>
>
>
>
>
>
>
> node:
>
>         ip_port = 8888
>
>         ip_address = x.x.x.x
>
>         number = 9
>
>         name = servername
>
>         cluster = ocfs
>
> node:
>
>         ip_port = 8888
>
>         ip_address = x.x.x.x
>
>         number = 109
>
>         name = servername1
>
>         cluster = ocfs
>
>
>
> more nodes in here
>
>
>
> cluster:
>
>         node_count = 22
>
>         name = ocfs
>
>
>
> Cluster consists of 14 nodes.
>
>
>
> /etc/init.d/o2cb status
>
>
>
>
>
> Driver for "configfs": Loaded
>
> Filesystem "configfs": Mounted
>
> Driver for "ocfs2_dlmfs": Loaded
>
> Filesystem "ocfs2_dlmfs": Mounted
>
> Checking O2CB cluster ocfs: Online
>
> Heartbeat dead threshold = 61
>
>   Network idle timeout: 30000
>
>   Network keepalive delay: 2000
>
>   Network reconnect delay: 2000
>
> Checking O2CB heartbeat: Active
>
>
>
>
>
>
>
>
>
>
>
>
>
> Server and package information.
>
>
>
> cat /etc/redhat-release
>
> Red Hat Enterprise Linux Server release 5.10 (Tikanga)
>
>
>
> rpm -qa | grep multipath
>
> device-mapper-multipath-0.4.7-59.el5
>
>
>
>
>
> rpm -qa | grep ocfs2
>
>
>
> ocfs2-2.6.18-371.3.1.el5-1.4.10-1.el5
>
> ocfs2-tools-1.4.4-1.el5
>
> ocfs2console-1.4.4-1.el5
>
>
>
> rpm -qa | grep kernel
>
>
>
> kernel-2.6.18-371.3.1.el5
>
>
>
> modinfo ocfs2
>
>
>
> filename:       /lib/modules/2.6.18-371.3.1.el5/kernel/fs/ocfs2/ocfs2.ko
>
> license:        GPL
>
> author:         Oracle
>
> version:        1.4.10
>
> description:    OCFS2 1.4.10 Thu Dec  5 16:38:36 PST 2013 (build b703e5e0906b370c876b657dabe8d4c8)
>
> srcversion:     41115DB9EFDAA5735C18810
>
> depends:        ocfs2_dlm,jbd,ocfs2_nodemanager
>
> vermagic:       2.6.18-371.3.1.el5 SMP mod_unload gcc-4.1
>
>
>
>
>
>
>
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-users


