[Ocfs2-users] ocfs2 fencing with multipath and dual channel HBA

Mon Jun 8 02:06:39 PDT 2009

Hi Tao and Sunil,
thank you VERY much for you great help! I will reboot node 1 on sunday
and this should fix my problem (hope so).

Great filesystem and great support - keep it up!

I have one last question that is a little OT:

We tried to use ocfs2 with Vserver clustered with Heartbeat. But
Vservers need barrier=1. That did not work on our shared storage with
ocfs2 but I guess this is no ocfs2 problem it is a device mapper problem
because we need to use multipath and LVM2, isn't it?
Device mapper does not support barriers, doesn't it?

So the only way we could do it is to disable multipathing for one shared
storage LUN and mount only one path!?

Is there any better solution?

Best regards,
Florian

> 
> Hi Florian,
> 
> florian.engelmann at bt.com wrote:
> > Hi Tao,
> >
> >> Hi Florian,
> >>
> >> florian.engelmann at bt.com wrote:
> >>> Hi Tao,
> >>>
> >>>> Hi florian,
> >>>>
> >>>> florian.engelmann at bt.com wrote:
> >>>>>> Florian,
> >>>>>> the problem here seems to be with network. The nodes are
running
> >>> into
> >>>>>> network heartbeat timeout and hence second node is getting
> > fenced.
> >>> Do
> >>>>>> you see o2net thread consuming 100% cpu on any node? if not
then
> >>>>>> probably check your network
> >>>>>> thanks,
> >>>>>> --Srini
> >>>>> I forgot to post my /etc/ocfs2/cluster.conf
> >>>>> node:
> >>>>>         ip_port = 7777
> >>>>>         ip_address = 192.168.0.101
> >>>>>         number = 0
> >>>>>         name = defr1elcbtd01
> >>>>>         cluster = ocfs2
> >>>>>
> >>>>> node:
> >>>>>         ip_port = 7777
> >>>>>         ip_address = 192.168.0.102
> >>>>>         number = 1
> >>>>>         name = defr1elcbtd02
> >>>>>         cluster = ocfs2
> >>>>>
> >>>>> cluster:
> >>>>>         node_count = 2
> >>>>>         name = ocfs2
> >>>>>
> >>>>>
> >>>>> 192.168.0.10x is eth3 on both nodes and connected with a cross
> > over
> >>>>> cable. No active network component is involved here.
> >>>>>
> >>>>> defr1elcbtd02:~# traceroute 192.168.0.101
> >>>>> traceroute to 192.168.0.101 (192.168.0.101), 30 hops max, 52
byte
> >>>>> packets
> >>>>>  1  node1 (192.168.0.101)  0.220 ms  0.142 ms  0.223 ms
> >>>>> defr1elcbtd02:~#
> >>>>>
> >>>>> The error message looks like a network problem but why should
> > there
> >>> be a
> >>>>> network problem if I shutdown a FC port?! I testet it about 20
> > times
> >>> and
> >>>>> got about 16 kernel panics starting with the same error message:
> >>>>>
> >>>>> kernel: o2net: no longer connected to node defr1elcbtd01 (num 0)
> > at
> >>>>> 192.168.0.101:7777
> >>>> It isn't an error message, just a status report that we can't
> > connect
> >>> to
> >>>> that node now. That node may be rebooted or something else, but
> > this
> >>>> node don't know, and it only knows the connection is down.
> >>> But node defr1elcbtd01 was never down and also the network link
> > (eth3)
> >>> wasn't down. I was able to ping from each node to the other.
> >>> Node 1 is hosting all services and never was faulted while I was
> >>> testing.
> >>>
> >>> All I have to do to panic node 2 is to disable one of two fibre
> > channel
> >>> ports or pull one fibre channel cable or delete node 2 from the
> > cisco
> >>> SAN zoning.
> >>> If I apply one of those 3 "errors" I get the message about o2net
is
> > no
> >>> longer connected to node 1 and 60 seconds later the 2nd node
panics
> >>> because of ocfs2 fencing (but this happens only in about 80% of
> > cases -
> >>> in the other 20% of cases o2net does not disconnect and there are
no
> >>> messages about ocfs2 at all - like it should be...).
> >>> Everything else is working fine in these 60 seconds. The
filesystem
> > is
> >>> still writable from both nodes and both nodes can ping each other
> > (via
> >>> the cluster interconnect).
> >> I just checked your log. The error why node 2 get the message is
that
> >> node 1 get the message that node 2 stopped disk heartbeat for quite
a
> >> long time so it stop the connection intentionally. So node 2 get
this
> >> message.
> >>
> >> See the log in node 1:
> >> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224
node
> > 1,
> >> 1 total
> >> Jun  8 09:46:26 defr1elcbtd01 kernel:
(3804,0):o2net_set_nn_state:382
> >> node 1 sc: ffff81007ddf4400 -> 0000000000000000, valid 1 -> 0, err
0
> > ->
> >> -107
> >> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_conn_err:296
node
> >> 1, 1 total
> >> Jun  8 09:46:26 defr1elcbtd01 kernel: o2net: no longer connected to
> > node
> >> defr1elcbtd02 (num 1) at 192.168.0.102:7777
> >>
> >> And I guess the reason why you see this log sometimes(80%) is that
the
> >> time interval. You know ocfs2 disk heartbeat try every 2 secs so
> >> sometimes node 2 panic before node 1 call o2quo_hb_down and
sometimes
> >> node2 panic after node 1 call o2quo_hb_down(which will put
something
> >> like "no longer..." in node 2's log).
> >>
> >> So would you please give your timeout configuration(o2cb)?
> >
> > These are my setting on node 2:
> >
> > O2CB_HEARTBEAT_THRESHOLD=61
> > O2CB_IDLE_TIMEOUT_MS=60000
> > O2CB_KEEPALIVE_DELAY_MS=4000
> > O2CB_RECONNECT_DELAY_MS=4000
> ocfs2 can't allow 2 nodes have different timeouts. So if node 1 and 2
> don't have the same configuration, node 2 won't be allowed to join the
> domain and mount the same volume.
> yeah, this parameter looks much better.
> See
>
http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.2/ocfs2_faq.h
tm
> l#TIMEOUT
> For details.
> 
> I just went through the whole thread, and the panic because of the
> storage failure is a deliberate behavior since with no disk access, no
> need to survive. See
>
http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.2/ocfs2_faq.h
tm
> l#QUORUM
> 
> Regards,
> Tao
> >
> > On node 1 I still got the old setting because there was no downtime
to
> > reboot this system till today. Is there any way to change the values
> > without a reboot? The system is mission critical and I can only
reboot
> > on Sundays.
> >
> > Settings on node 1 are the default setting that came with the Debian
> > package. I changed them to fit the settings of node 2 after the next
> > reboot - so I can only guess they were:
> >
> > O2CB_HEARTBEAT_THRESHOLD=7
> > O2CB_IDLE_TIMEOUT_MS=12000
> > O2CB_KEEPALIVE_DELAY_MS=2000
> > O2CB_RECONNECT_DELAY_MS=2000
> 
> 
> >
> > Regards,
> > Florian
> >
> >
> >
> >> Regards,
> >> Tao
> >>
> >>
> >>
> >>> Here are the logs with debug logging:
> >>>
> >>> Node 2:
> >>>
> >>> Jun  8 09:46:11 defr1elcbtd02 kernel: qla2xxx 0000:04:00.0: LOOP
> > DOWN
> >>> detected (2).
> >>> Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] put
> >>> Jun  8 09:46:11 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452
[sc
> >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] data_ready hit
> >>> Jun  8 09:46:11 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
> >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] get
> >>> Jun  8 09:46:11 defr1elcbtd02 kernel:
(3463,0):o2net_advance_rx:1129
> > [sc
> >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] receiving
> >>> Jun  8 09:46:11 defr1elcbtd02 kernel:
(3463,0):o2net_advance_rx:1170
> >>> [mag 64088 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at
> > page_off
> >>> 24
> >>> Jun  8 09:46:11 defr1elcbtd02 kernel:
> >>> (3463,0):o2net_process_message:1015 [mag 64088 len 0 typ 0 stat 0
> >>> sys_stat 0 key 00000000 num 0] processing message
> >>> Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc
> >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 24] get
> >>> Jun  8 09:46:11 defr1elcbtd02 kernel:
(3463,0):o2net_advance_rx:1196
> > [sc
> >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] ret = 1
> >>> Jun  8 09:46:11 defr1elcbtd02 kernel:
(3463,0):o2net_advance_rx:1129
> > [sc
> >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] receiving
> >>> Jun  8 09:46:11 defr1elcbtd02 kernel:
(3463,0):o2net_advance_rx:1196
> > [sc
> >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] ret = -11
> >>> Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] put
> >>> Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] put
> >>> Jun  8 09:46:16 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452
[sc
> >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] data_ready hit
> >>> Jun  8 09:46:16 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
> >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] get
> >>> Jun  8 09:46:16 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452
[sc
> >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] data_ready hit
> >>> Jun  8 09:46:16 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
> >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] get
> >>> Jun  8 09:46:16 defr1elcbtd02 kernel: (0,0):sc_put:289 [sc
> >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] put
> >>> Jun  8 09:46:16 defr1elcbtd02 kernel:
(3463,0):o2net_advance_rx:1129
> > [sc
> >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] receiving
> >>> Jun  8 09:46:16 defr1elcbtd02 kernel:
(3463,0):o2net_advance_rx:1170
> >>> [mag 64088 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at
> > page_off
> >>> 24
> >>> Jun  8 09:46:16 defr1elcbtd02 kernel:
> >>> (3463,0):o2net_process_message:1015 [mag 64088 len 0 typ 0 stat 0
> >>> sys_stat 0 key 00000000 num 0] processing message
> >>> Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc
> >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 24] get
> >>> Jun  8 09:46:16 defr1elcbtd02 kernel:
(3463,0):o2net_advance_rx:1196
> > [sc
> >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] ret = 1
> >>> Jun  8 09:46:16 defr1elcbtd02 kernel:
(3463,0):o2net_advance_rx:1129
> > [sc
> >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] receiving
> >>> Jun  8 09:46:16 defr1elcbtd02 kernel:
(3463,0):o2net_advance_rx:1170
> >>> [mag 64087 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at
> > page_off
> >>> 24
> >>> Jun  8 09:46:16 defr1elcbtd02 kernel:
> >>> (3463,0):o2net_process_message:1015 [mag 64087 len 0 typ 0 stat 0
> >>> sys_stat 0 key 00000000 num 0] processing message
> >>> Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 24] put
> >>> Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc
> >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 24] get
> >>> Jun  8 09:46:16 defr1elcbtd02 kernel:
(3463,0):o2net_advance_rx:1196
> > [sc
> >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] ret = 1
> >>> Jun  8 09:46:16 defr1elcbtd02 kernel:
(3463,0):o2net_advance_rx:1129
> > [sc
> >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] receiving
> >>> Jun  8 09:46:16 defr1elcbtd02 kernel:
(3463,0):o2net_advance_rx:1196
> > [sc
> >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] ret = -11
> >>> Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] put
> >>> Jun  8 09:46:21 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] put
> >>> Jun  8 09:46:21 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452
[sc
> >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] data_ready hit
> >>> Jun  8 09:46:21 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
> >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] get
> >>> Jun  8 09:46:21 defr1elcbtd02 kernel:
(3463,0):o2net_advance_rx:1129
> > [sc
> >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] receiving
> >>> Jun  8 09:46:21 defr1elcbtd02 kernel:
(3463,0):o2net_advance_rx:1170
> >>> [mag 64088 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at
> > page_off
> >>> 24
> >>> Jun  8 09:46:21 defr1elcbtd02 kernel:
> >>> (3463,0):o2net_process_message:1015 [mag 64088 len 0 typ 0 stat 0
> >>> sys_stat 0 key 00000000 num 0] processing message
> >>> Jun  8 09:46:21 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc
> >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 24] get
> >>> Jun  8 09:46:21 defr1elcbtd02 kernel:
(3463,0):o2net_advance_rx:1196
> > [sc
> >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] ret = 1
> >>> Jun  8 09:46:21 defr1elcbtd02 kernel:
(3463,0):o2net_advance_rx:1129
> > [sc
> >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] receiving
> >>> Jun  8 09:46:21 defr1elcbtd02 kernel:
(3463,0):o2net_advance_rx:1196
> > [sc
> >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] ret = -11
> >>> Jun  8 09:46:21 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] put
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel: (0,0):o2net_state_change:477
> > [sc
> >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] state_change to 8
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
> >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] get
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452
[sc
> >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] data_ready hit
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
> >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] get
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel:
(3463,0):o2net_shutdown_sc:572
> > [sc
> >>> ffff81007c2f0800 refs 5 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] shutting down
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> >>> ffff81007c2f0800 refs 5 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] put
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] put
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel:
> > (3463,0):o2net_set_nn_state:382
> >>> node 0 sc: ffff81007c2f0800 -> 0000000000000000, valid 1 -> 0, err
0
> > ->
> >>> -107
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2quo_conn_err:296
> > node
> >>> 0, 1 total
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2quo_set_hold:162
> > node
> >>> 0, 1 total
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel: o2net: no longer connected
to
> > node
> >>> defr1elcbtd01 (num 0) at 192.168.0.101:7777
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel:
> >>> (3463,0):o2net_complete_nodes_nsw:255 completed 0 messages for
node
> > 0
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel:
> > (3463,0):o2net_set_nn_state:431
> >>> queueing conn attempt in 0 jiffies
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc
> >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] get
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] put
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] put
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel:
(3463,0):o2net_advance_rx:1129
> > [sc
> >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] receiving
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel:
(3463,0):o2net_advance_rx:1196
> > [sc
> >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] ret = 0
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel:
> > (3463,0):o2net_rx_until_empty:1215
> >>> [sc ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] saw error 0, closing
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] put
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel:
(3463,0):o2net_shutdown_sc:572
> > [sc
> >>> ffff81007c2f0800 refs 1 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] shutting down
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> >>> ffff81007c2f0800 refs 1 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] put
> >>> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_kref_release:274
> > [sc
> >>> ffff81007c2f0800 refs 1 sock ffff8100694138c0 node 0 page
> >>> ffff81007fafbb00 pg_off 0] releasing
> >>> Jun  8 09:46:46 defr1elcbtd02 kernel: scsi 2:0:1:1: SCSI error:
> > return
> >>> code = 0x00010000
> >>> Jun  8 09:46:46 defr1elcbtd02 kernel: end_request: I/O error, dev
> > sdd,
> >>> sector 1672
> >>> Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath:
> > Failing
> >>> path 8:48.
> >>> Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath:
> > Failing
> >>> path 8:16.
> >>> Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath
emc:
> > long
> >>> trespass command will be send
> >>> Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath
emc:
> >>> honor reservation bit will not be set (default)
> >>> Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: ioctl: error
> > adding
> >>> target to table
> >>> Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath
emc:
> > long
> >>> trespass command will be send
> >>> Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath
emc:
> >>> honor reservation bit will not be set (default)
> >>> Jun  8 09:46:48 defr1elcbtd02 kernel: device-mapper: multipath
emc:
> >>> emc_pg_init: sending switch-over command
> >>>
> >>>
> >>>
> >>>
> >>> Node 1:
> >>> Jun  8 09:46:21 defr1elcbtd01 kernel:
(16528,0):o2net_data_ready:452
> > [sc
> >>> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page
> >>> ffff81007f75b3e8 pg_off 0] data_ready hit
> >>> Jun  8 09:46:21 defr1elcbtd01 kernel: (16528,0):sc_get:294 [sc
> >>> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page
> >>> ffff81007f75b3e8 pg_off 0] get
> >>> Jun  8 09:46:21 defr1elcbtd01 kernel:
(3505,0):o2net_advance_rx:1129
> > [sc
> >>> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> >>> ffff81007f75b3e8 pg_off 0] receiving
> >>> Jun  8 09:46:21 defr1elcbtd01 kernel:
(3505,0):o2net_advance_rx:1170
> >>> [mag 64087 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at
> > page_off
> >>> 24
> >>> Jun  8 09:46:21 defr1elcbtd01 kernel:
> >>> (3505,0):o2net_process_message:1015 [mag 64087 len 0 typ 0 stat 0
> >>> sys_stat 0 key 00000000 num 0] processing message
> >>> Jun  8 09:46:21 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> >>> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> >>> ffff81007f75b3e8 pg_off 24] put
> >>> Jun  8 09:46:21 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc
> >>> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page
> >>> ffff81007f75b3e8 pg_off 24] get
> >>> Jun  8 09:46:21 defr1elcbtd01 kernel:
(3505,0):o2net_advance_rx:1196
> > [sc
> >>> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> >>> ffff81007f75b3e8 pg_off 0] ret = 1
> >>> Jun  8 09:46:21 defr1elcbtd01 kernel:
(3505,0):o2net_advance_rx:1129
> > [sc
> >>> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> >>> ffff81007f75b3e8 pg_off 0] receiving
> >>> Jun  8 09:46:21 defr1elcbtd01 kernel:
(3505,0):o2net_advance_rx:1196
> > [sc
> >>> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> >>> ffff81007f75b3e8 pg_off 0] ret = -11
> >>> Jun  8 09:46:21 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> >>> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> >>> ffff81007f75b3e8 pg_off 0] put
> >>> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224
> > node 1,
> >>> 1 total
> >>> Jun  8 09:46:26 defr1elcbtd01 kernel:
> > (3804,0):o2net_set_nn_state:382
> >>> node 1 sc: ffff81007ddf4400 -> 0000000000000000, valid 1 -> 0, err
0
> > ->
> >>> -107
> >>> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_conn_err:296
> > node
> >>> 1, 1 total
> >>> Jun  8 09:46:26 defr1elcbtd01 kernel: o2net: no longer connected
to
> > node
> >>> defr1elcbtd02 (num 1) at 192.168.0.102:7777
> >>> Jun  8 09:46:26 defr1elcbtd01 kernel:
> >>> (3804,0):o2net_complete_nodes_nsw:255 completed 0 messages for
node
> > 1
> >>> Jun  8 09:46:26 defr1elcbtd01 kernel:
> > (3804,0):o2net_set_nn_state:431
> >>> queueing conn attempt in 0 jiffies
> >>> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):sc_get:294 [sc
> >>> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page
> >>> ffff81007f75b3e8 pg_off 0] get
> >>> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):sc_put:289 [sc
> >>> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> >>> ffff81007f75b3e8 pg_off 0] put
> >>> Jun  8 09:46:26 defr1elcbtd01 kernel:
(3505,0):o2net_shutdown_sc:572
> > [sc
> >>> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page
> >>> ffff81007f75b3e8 pg_off 0] shutting down
> >>> Jun  8 09:46:26 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> >>> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page
> >>> ffff81007f75b3e8 pg_off 0] put
> >>> Jun  8 09:46:26 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> >>> ffff81007ddf4400 refs 2 sock ffff810002977d00 node 1 page
> >>> ffff81007f75b3e8 pg_off 0] put
> >>> Jun  8 09:46:26 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> >>> ffff81007ddf4400 refs 1 sock ffff810002977d00 node 1 page
> >>> ffff81007f75b3e8 pg_off 0] put
> >>> Jun  8 09:46:26 defr1elcbtd01 kernel: (3505,0):sc_kref_release:274
> > [sc
> >>> ffff81007ddf4400 refs 1 sock ffff810002977d00 node 1 page
> >>> ffff81007f75b3e8 pg_off 0] releasing
> >>> Jun  8 09:46:26 defr1elcbtd01 kernel:
> >>> (16541,0):ocfs2_replay_journal:1184 Recovering node 1 from slot 0
on
> >>> device (253,7)
> >>> Jun  8 09:46:28 defr1elcbtd01 kernel: kjournald starting.  Commit
> >>> interval 5 seconds
> >>> Jun  8 09:46:48 defr1elcbtd01 kernel: Please double check your
> >>> configuration values for 'O2CB_HEARTBEAT_THRESHOLD'
> >>> Jun  8 09:46:48 defr1elcbtd01 kernel: (3804,0):o2quo_hb_up:199
node
> > 1, 2
> >>> total
> >>> Jun  8 09:46:48 defr1elcbtd01 kernel: (3804,0):o2quo_set_hold:162
> > node
> >>> 1, 1 total
> >>> Jun  8 09:46:48 defr1elcbtd01 kernel:
> > (3804,0):o2net_set_nn_state:382
> >>> node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err
> > -107
> >>> -> 0
> >>> Jun  8 09:46:48 defr1elcbtd01 kernel:
> > (3804,0):o2net_set_nn_state:431
> >>> queueing conn attempt in 0 jiffies
> >>> Jun  8 09:46:58 defr1elcbtd01 kernel:
> > (3505,0):o2net_set_nn_state:382
> >>> node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err
0
> > ->
> >>> -107
> >>> Jun  8 09:46:58 defr1elcbtd01 kernel: (3505,0):o2quo_conn_err:296
> > node
> >>> 1, 1 total
> >>> Jun  8 09:46:58 defr1elcbtd01 kernel:
> > (3505,0):o2net_set_nn_state:431
> >>> queueing conn attempt in 0 jiffies
> >>> Jun  8 09:47:16 defr1elcbtd01 kernel:
(3505,0):o2quo_hb_still_up:242
> >>> node 1
> >>> Jun  8 09:47:16 defr1elcbtd01 kernel:
(3505,0):o2quo_clear_hold:171
> > node
> >>> 1, 0 total
> >>> Jun  8 09:47:16 defr1elcbtd01 kernel:
(8,0):o2quo_make_decision:105
> >>> heartbeating: 2, connected: 1, lowest: 0 (reachable)
> >>> Jun  8 09:47:44 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224
> > node 1,
> >>> 1 total
> >>> Jun  8 09:47:44 defr1elcbtd01 kernel:
> > (3804,0):o2net_set_nn_state:382
> >>> node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err
> > -107
> >>> -> -107
> >>> Jun  8 09:47:44 defr1elcbtd01 kernel:
> > (3804,0):o2net_set_nn_state:431
> >>> queueing conn attempt in 0 jiffies
> >>> Jun  8 09:48:33 defr1elcbtd01 kernel: e1000: eth3: e1000_watchdog:
> > NIC
> >>> Link is Down
> >>> Jun  8 09:48:45 defr1elcbtd01 kernel: e1000: eth3: e1000_watchdog:
> > NIC
> >>> Link is Up 1000 Mbps Full Duplex
> >>> Jun  8 09:50:05 defr1elcbtd01 kernel: e1000: eth3: e1000_watchdog:
> > NIC
> >>> Link is Down
> >>> Jun  8 09:50:08 defr1elcbtd01 kernel: e1000: eth3: e1000_watchdog:
> > NIC
> >>> Link is Up 1000 Mbps Full Duplex
> >>> Jun  8 09:50:20 defr1elcbtd01 kernel: Please double check your
> >>> configuration values for 'O2CB_HEARTBEAT_THRESHOLD'
> >>> Jun  8 09:50:20 defr1elcbtd01 kernel: (3804,0):o2quo_hb_up:199
node
> > 1, 2
> >>> total
> >>> Jun  8 09:50:20 defr1elcbtd01 kernel: (3804,0):o2quo_set_hold:162
> > node
> >>> 1, 1 total
> >>> Jun  8 09:50:20 defr1elcbtd01 kernel:
> > (3804,0):o2net_set_nn_state:382
> >>> node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err
> > -107
> >>> -> 0
> >>> Jun  8 09:50:20 defr1elcbtd01 kernel:
> > (3804,0):o2net_set_nn_state:431
> >>> queueing conn attempt in 0 jiffies
> >>> Jun  8 09:50:22 defr1elcbtd01 kernel:
> > (0,0):o2net_listen_data_ready:1670
> >>> bytes: 0
> >>> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_alloc:320 [sc
> >>> ffff81007dddc400 refs 1 sock 0000000000000000 node 1 page
> >>> 0000000000000000 pg_off 0] alloced
> >>> Jun  8 09:50:22 defr1elcbtd01 kernel:
> > (3505,0):o2net_set_nn_state:382
> >>> node 1 sc: 0000000000000000 -> ffff81007dddc400, valid 0 -> 0, err
0
> > ->
> >>> 0
> >>> Jun  8 09:50:22 defr1elcbtd01 kernel:
> > (3505,0):o2net_set_nn_state:431
> >>> queueing conn attempt in 0 jiffies
> >>> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc
> >>> ffff81007dddc400 refs 1 sock ffff81005232e700 node 1 page
> >>> ffff81007f272590 pg_off 0] get
> >>> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc
> >>> ffff81007dddc400 refs 2 sock ffff81005232e700 node 1 page
> >>> ffff81007f272590 pg_off 0] get
> >>> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc
> >>> ffff81007dddc400 refs 3 sock ffff81005232e700 node 1 page
> >>> ffff81007f272590 pg_off 0] get
> >>> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> >>> ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page
> >>> ffff81007f272590 pg_off 0] put
> >>> Jun  8 09:50:22 defr1elcbtd01 kernel:
(3505,0):o2net_advance_rx:1129
> > [sc
> >>> ffff81007dddc400 refs 3 sock ffff81005232e700 node 1 page
> >>> ffff81007f272590 pg_off 0] receiving
> >>> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc
> >>> ffff81007dddc400 refs 3 sock ffff81005232e700 node 1 page
> >>> ffff81007f272590 pg_off 16] get
> >>> Jun  8 09:50:22 defr1elcbtd01 kernel:
> > (3505,0):o2net_set_nn_state:382
> >>> node 1 sc: ffff81007dddc400 -> ffff81007dddc400, valid 0 -> 1, err
0
> > ->
> >>> 0
> >>> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2quo_conn_up:267
> > node 1,
> >>> 2 total
> >>> Jun  8 09:50:22 defr1elcbtd01 kernel:
(3505,0):o2quo_clear_hold:171
> > node
> >>> 1, 0 total
> >>> Jun  8 09:50:22 defr1elcbtd01 kernel: o2net: accepted connection
> > from
> >>> node defr1elcbtd02 (num 1) at 192.168.0.102:7777
> >>> Jun  8 09:50:22 defr1elcbtd01 kernel:
(3505,0):o2net_advance_rx:1196
> > [sc
> >>> ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page
> >>> ffff81007f272590 pg_off 0] ret = 16
> >>> Jun  8 09:50:22 defr1elcbtd01 kernel:
(3505,0):o2net_advance_rx:1129
> > [sc
> >>> ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page
> >>> ffff81007f272590 pg_off 0] receiving
> >>> Jun  8 09:50:22 defr1elcbtd01 kernel:
(3505,0):o2net_advance_rx:1196
> > [sc
> >>> ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page
> >>> ffff81007f272590 pg_off 0] ret = -11
> >>> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> >>> ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page
> >>> ffff81007f272590 pg_off 0] put
> >>>
> >>>> If there is some problem in FC port, the node will reboot and the
> >>> other
> >>>> node will get this message.
> >>>>> The cluster is running fine if there is no problem with the SAN
> >>>>> connection.
> >>>>>
> >>>>> How to enable verbose logging with ofcs2?
> >>>> debugfs.ocfs2 -l will show current logging status.
> >>>>
> >>>> If you want to enable some log, use e.g
> >>>> debugfs.ocfs2  -l DISK_ALLOC allow
> >>>> and
> >>>> debugfs.ocfs2 -l DISK_ALLOC off
> >>>> will turn it off.
> >>>>
> >>>> Regards,
> >>>> Tao
> >>>>> Regards,
> >>>>> Florian
> >>>>>
> >>>>>> florian.engelmann at bt.com wrote:
> >>>>>>> Hello,
> >>>>>>> our Debian etch cluster nodes are panicing because of ocfs2
> >>> fencing
> >>>>> if
> >>>>>>> one SAN path fails.
> >>>>>>>
> >>>>>>> modinfo ocfs2
> >>>>>>> filename:
> >>> /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/ocfs2.ko
> >>>>>>> author:         Oracle
> >>>>>>> license:        GPL
> >>>>>>> description:    OCFS2 1.3.3
> >>>>>>> version:        1.3.3
> >>>>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
> >>>>>>> depends:        ocfs2_dlm,ocfs2_nodemanager,jbd
> >>>>>>> srcversion:     0798424846E68F10172C203
> >>>>>>>
> >>>>>>> modinfo ocfs2_dlmfs
> >>>>>>> filename:
> >>>>>>> /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/dlm/ocfs2_dlmfs.ko
> >>>>>>> author:         Oracle
> >>>>>>> license:        GPL
> >>>>>>> description:    OCFS2 DLMFS 1.3.3
> >>>>>>> version:        1.3.3
> >>>>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
> >>>>>>> depends:        ocfs2_dlm,ocfs2_nodemanager
> >>>>>>> srcversion:     E3780E12396118282B3C1AD
> >>>>>>>
> >>>>>>> defr1elcbtd02:~# modinfo ocfs2_dlm
> >>>>>>> filename:
> >>>>>>> /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/dlm/ocfs2_dlm.ko
> >>>>>>> author:         Oracle
> >>>>>>> license:        GPL
> >>>>>>> description:    OCFS2 DLM 1.3.3
> >>>>>>> version:        1.3.3
> >>>>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
> >>>>>>> depends:        ocfs2_nodemanager
> >>>>>>> srcversion:     7DC395EA08AE4CE826C5B92
> >>>>>>>
> >>>>>>> modinfo ocfs2_nodemanager
> >>>>>>> filename:
> >>>>>>>
> >
/lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/cluster/ocfs2_nodemanager.ko
> >>>>>>> author:         Oracle
> >>>>>>> license:        GPL
> >>>>>>> description:    OCFS2 Node Manager 1.3.3
> >>>>>>> version:        1.3.3
> >>>>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
> >>>>>>> depends:        configfs
> >>>>>>> srcversion:     C4C9871302E1910B78DAE40
> >>>>>>>
> >>>>>>> modinfo qla2xxx
> >>>>>>> filename:
> >>>>>>>
> > /lib/modules/2.6.18-6-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko
> >>>>>>> author:         QLogic Corporation
> >>>>>>> description:    QLogic Fibre Channel HBA Driver
> >>>>>>> license:        GPL
> >>>>>>> version:        8.01.07-k1
> >>>>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
> >>>>>>> depends:        scsi_mod,scsi_transport_fc,firmware_class
> >>>>>>> alias:          pci:v00001077d00002100sv*sd*bc*sc*i*
> >>>>>>> alias:          pci:v00001077d00002200sv*sd*bc*sc*i*
> >>>>>>> alias:          pci:v00001077d00002300sv*sd*bc*sc*i*
> >>>>>>> alias:          pci:v00001077d00002312sv*sd*bc*sc*i*
> >>>>>>> alias:          pci:v00001077d00002322sv*sd*bc*sc*i*
> >>>>>>> alias:          pci:v00001077d00006312sv*sd*bc*sc*i*
> >>>>>>> alias:          pci:v00001077d00006322sv*sd*bc*sc*i*
> >>>>>>> alias:          pci:v00001077d00002422sv*sd*bc*sc*i*
> >>>>>>> alias:          pci:v00001077d00002432sv*sd*bc*sc*i*
> >>>>>>> alias:          pci:v00001077d00005422sv*sd*bc*sc*i*
> >>>>>>> alias:          pci:v00001077d00005432sv*sd*bc*sc*i*
> >>>>>>> srcversion:     B8E1608E257391DCAFD9224
> >>>>>>> parm:           ql2xfdmienable:Enables FDMI registratons
Default
> >>> is
> >>>>> 0 -
> >>>>>>> no FDMI. 1 - perfom FDMI. (int)
> >>>>>>> parm:           extended_error_logging:Option to enable
extended
> >>>>> error
> >>>>>>> logging, Default is 0 - no logging. 1 - log errors. (int)
> >>>>>>> parm:           ql2xallocfwdump:Option to enable allocation of
> >>>>> memory
> >>>>>>> for a firmware dump during HBA initialization.  Memory
> > allocation
> >>>>>>> requirements vary by ISP type.  Default is 1 - allocate
memory.
> >>>>> (int)
> >>>>>>> parm:           ql2xloginretrycount:Specify an alternate value
> > for
> >>>>> the
> >>>>>>> NVRAM login retry count. (int)
> >>>>>>> parm:           ql2xplogiabsentdevice:Option to enable PLOGI
to
> >>>>> devices
> >>>>>>> that are not present after a Fabric scan.  This is needed for
> >>>>> several
> >>>>>>> broken switches. Default is 0 - no PLOGI. 1 - perfom PLOGI.
> > (int)
> >>>>>>> parm:           qlport_down_retry:Maximum number of command
> >>> retries
> >>>>> to a
> >>>>>>> port that returns a PORT-DOWN status. (int)
> >>>>>>> parm:           ql2xlogintimeout:Login timeout value in
seconds.
> >>>>> (int)
> >>>>>>> modinfo dm_multipath
> >>>>>>> filename:
> >>>>>>> /lib/modules/2.6.18-6-amd64/kernel/drivers/md/dm-multipath.ko
> >>>>>>> description:    device-mapper multipath target
> >>>>>>> author:         Sistina Software <dm-devel at redhat.com>
> >>>>>>> license:        GPL
> >>>>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
> >>>>>>> depends:        dm-mod
> >>>>>>>
> >>>>>>> modinfo dm_mod
> >>>>>>> filename:
> >>>>> /lib/modules/2.6.18-6-amd64/kernel/drivers/md/dm-mod.ko
> >>>>>>> description:    device-mapper driver
> >>>>>>> author:         Joe Thornber <dm-devel at redhat.com>
> >>>>>>> license:        GPL
> >>>>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
> >>>>>>> depends:
> >>>>>>> parm:           major:The major number of the device mapper
> > (uint)
> >>>>>>> modinfo dm_round_robin
> >>>>>>> filename:
> >>>>>>>
/lib/modules/2.6.18-6-amd64/kernel/drivers/md/dm-round-robin.ko
> >>>>>>> description:    device-mapper round-robin multipath path
> > selector
> >>>>>>> author:         Sistina Software <dm-devel at redhat.com>
> >>>>>>> license:        GPL
> >>>>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
> >>>>>>> depends:        dm-multipath
> >>>>>>>
> >>>>>>> There is no self compiled software just the official
repository
> >>> was
> >>>>>>> used.
> >>>>>>> The nodes are connected to our two independent SANs. The
storage
> >>>>> systems
> >>>>>>> are EMC Clariion CX3-20f and EMC Clariion CX500.
> >>>>>>>
> >>>>>>> multipath.conf:
> >>>>>>> defaults {
> >>>>>>>         rr_min_io                       1000
> >>>>>>>         polling_interval                2
> >>>>>>>         no_path_retry                   5
> >>>>>>>         user_friendly_names             yes
> >>>>>>> }
> >>>>>>>
> >>>>>>> blacklist {
> >>>>>>>         devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
> >>>>>>>         devnode "^hd[a-z][[0-9]*]"
> >>>>>>>         devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
> >>>>>>>         device {
> >>>>>>>                 vendor "DGC"
> >>>>>>>                 product "LUNZ" # EMC Management LUN
> >>>>>>>         }
> >>>>>>>         device {
> >>>>>>>                 vendor "ATA"  #We do not need mutlipathing for
> >>> local
> >>>>>>> drives
> >>>>>>>                 product "*"
> >>>>>>>         }
> >>>>>>>         device {
> >>>>>>>                 vendor "AMI" # No multipathing for SUN Virtual
> >>>>> devices
> >>>>>>>                 product "*"
> >>>>>>>         }
> >>>>>>>         device {
> >>>>>>>                 vendor "HITACHI" # No multipathing for local
> > scsi
> >>>>> disks
> >>>>>>>                 product "H101414SCSUN146G"
> >>>>>>>         }
> >>>>>>> }
> >>>>>>>
> >>>>>>> devices {
> >>>>>>>         ## Device attributes for EMC CLARiiON
> >>>>>>>         device {
> >>>>>>>                 vendor                  "DGC"
> >>>>>>>                 product                 "*"
> >>>>>>>                 path_grouping_policy    group_by_prio
> >>>>>>>                 getuid_callout          "/sbin/scsi_id -g -u
-s
> >>>>>>> /block/%n"
> >>>>>>>                 prio_callout            "/sbin/mpath_prio_emc
> >>>>> /dev/%n"
> >>>>>>>                 hardware_handler        "1 emc"
> >>>>>>>                 features                "1 queue_if_no_path"
> >>>>>>>                 no_path_retry           fail
> >>>>>>>                 path_checker            emc_clariion
> >>>>>>>                 path_selector           "round-robin 0"
> >>>>>>>                 failback                immediate
> >>>>>>>                 user_friendly_names     yes
> >>>>>>>         }
> >>>>>>> }
> >>>>>>>
> >>>>>>> multipaths {
> >>>>>>>         multipath {
> >>>>>>>                 wwid
> >>>>>>> 3600601603ac511001c7c92fec775dd11
> >>>>>>>                 alias                   stosan01_lun070
> >>>>>>>         }
> >>>>>>> }
> >>>>>>>
> >>>>>>> multipath -ll:
> >>>>>>> stosan01_lun070 (3600601603ac511001c7c92fec775dd11) dm-7
> > DGC,RAID
> >>> 5
> >>>>>>> [size=133G][features=0][hwhandler=1 emc]
> >>>>>>> \_ round-robin 0 [prio=2][active]
> >>>>>>>  \_ 0:0:1:1 sdd 8:48  [active][ready]
> >>>>>>>  \_ 1:0:1:1 sdh 8:112 [active][ready]
> >>>>>>> \_ round-robin 0 [prio=0][enabled]
> >>>>>>>  \_ 0:0:0:1 sdb 8:16  [active][ready]
> >>>>>>>  \_ 1:0:0:1 sdf 8:80  [active][ready]
> >>>>>>>
> >>>>>>>
> >>>>>>> As we use lvm2 we added /dev/sd* to the filter:
> >>>>>>> filter = [ "r|/dev/cdrom|", "r|/dev/sd.*|" ]
> >>>>>>>
> >>>>>>> Here is what happened and what we did to reconstruct the
> > situation
> >>>>> to
> >>>>>>> find a solution:
> >>>>>>>
> >>>>>>> On 02.06.2009 we did something wrong with the zoning on one of
> > our
> >>>>> two
> >>>>>>> SANs and all servers (about 40) lost one path to the SAN. Only
> > two
> >>>>>>> servers crashed. Those two are our Debian etch heartbeat
cluster
> >>>>>>> described above.
> >>>>>>> The console showed a kernel panic because of ocfs2 was fencing
> >>> both
> >>>>>>> nodes.
> >>>>>>>
> >>>>>>> This was the message:
> >>>>>>> O2hb_write_timeout: 165 ERROR: Heartbeat write timeout to
device
> >>>>> dm-7
> >>>>>>> after 12000 milliseconds
> >>>>>>>
> >>>>>>> So we decided to change the o2cb settings to:
> >>>>>>> O2CB_HEARTBEAT_THRESHOLD=31
> >>>>>>> O2CB_IDLE_TIMEOUT_MS=30000
> >>>>>>> O2CB_KEEPALIVE_DELAY_MS=2000
> >>>>>>> O2CB_RECONNECT_DELAY_MS=2000
> >>>>>>>
> >>>>>>> We switched all cluster resources to the 1st node to test the
> > new
> >>>>>>> settings on the second node. We removed the 2nd node from the
> >>> zoning
> >>>>> (we
> >>>>>>> also tested shutting down the port with the same result) and
got
> > a
> >>>>>>> different error but still ending up with a kernel panic:
> >>>>>>>
> >>>>>>> Jun  4 16:41:05 defr1elcbtd02 kernel: o2net: no longer
connected
> >>> to
> >>>>> node
> >>>>>>> defr1elcbtd01 (num 0) at 192.168.0.101:7777
> >>>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel:  rport-0:0-0: blocked FC
> >>>>> remote
> >>>>>>> port time out: removing target and saving binding
> >>>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel:  rport-0:0-1: blocked FC
> >>>>> remote
> >>>>>>> port time out: removing target and saving binding
> >>>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: sd 0:0:1:1: SCSI error:
> >>> return
> >>>>>>> code = 0x00010000
> >>>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: end_request: I/O error,
> > dev
> >>>>> sdd,
> >>>>>>> sector 1672
> >>>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper:
multipath:
> >>>>> Failing
> >>>>>>> path 8:48.
> >>>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper:
multipath:
> >>>>> Failing
> >>>>>>> path 8:16.
> >>>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: scsi 0:0:1:1: rejecting
> > I/O
> >>> to
> >>>>>>> device being removed
> >>>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath
> >>> emc:
> >>>>> long
> >>>>>>> trespass command will be send
> >>>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath
> >>> emc:
> >>>>>>> honor reservation bit will not be set (default)
> >>>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: table:
> > 253:7:
> >>>>>>> multipath: error getting device
> >>>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: ioctl:
> > error
> >>>>> adding
> >>>>>>> target to table
> >>>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath
> >>> emc:
> >>>>> long
> >>>>>>> trespass command will be send
> >>>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath
> >>> emc:
> >>>>>>> honor reservation bit will not be set (default)
> >>>>>>> Jun  4 16:41:29 defr1elcbtd02 kernel: device-mapper: multipath
> >>> emc:
> >>>>>>> emc_pg_init: sending switch-over command
> >>>>>>> Jun  4 16:42:01 defr1elcbtd02 kernel:
> >>>>>>> (10751,1):dlm_send_remote_convert_request:395 ERROR: status =
> > -107
> >>>>>>> Jun  4 16:42:01 defr1elcbtd02 kernel:
> >>>>>>> (10751,1):dlm_wait_for_node_death:374
> >>>>> 5EE89BC01EFC405E9197C198DEEAE678:
> >>>>>>> waiting 5000ms for notification of death of node 0
> >>>>>>> Jun  4 16:42:07 defr1elcbtd02 kernel:
> >>>>>>> (10751,1):dlm_send_remote_convert_request:395 ERROR: status =
> > -107
> >>>>>>> Jun  4 16:42:07 defr1elcbtd02 kernel:
> >>>>>>> (10751,1):dlm_wait_for_node_death:374
> >>>>> 5EE89BC01EFC405E9197C198DEEAE678:
> >>>>>>> waiting 5000ms for notification of death of node 0
> >>>>>>> [...]
> >>>>>>> After 60 seconds:
> >>>>>>>
> >>>>>>> (8,0): o2quo_make_decision:143 ERROR: fending this node
because
> > it
> >>>>> is
> >>>>>>> connected to a half-quorum of 1 out of 2 nodes which doesn't
> >>> include
> >>>>> the
> >>>>>>> lowest active node 0
> >>>>>>>
> >>>>>>>
> >>>>>>> multipath -ll changed to:
> >>>>>>> stosan01_lun070 (3600601603ac511001c7c92fec775dd11) dm-7
> > DGC,RAID
> >>> 5
> >>>>>>> [size=133G][features=0][hwhandler=1 emc]
> >>>>>>> \_ round-robin 0 [prio=1][active]
> >>>>>>>  \_ 0:0:1:1 sdd 8:48  [active][ready]
> >>>>>>> \_ round-robin 0 [prio=0][enabled]
> >>>>>>>  \_ 0:0:0:1 sdb 8:16  [active][ready]
> >>>>>>>
> >>>>>>> The ocfs2 filesystem is still mounted an writable. Even if I
> >>> enable
> >>>>> the
> >>>>>>> zoneing (or the FC port) again within the 60 seconds ocfs2
does
> >>> not
> >>>>>>> reconnect to node 1 and panics the kernel after 60 seconds
while
> >>>>>>> multipath -ll shows both path again.
> >>>>>>>
> >>>>>>> I do not understand at all what the Ethernet heartbeat
> > connection
> >>> of
> >>>>>>> ocfs2 has to do with the SAN connection.
> >>>>>>>
> >>>>>>> The strangest thing at all is - this does not happen always!
> > After
> >>>>> some
> >>>>>>> reboots the system keeps running stable even if I shutdown a
FC
> >>> port
> >>>>> and
> >>>>>>> enable it again many times. There is no constant behaviour...
It
> >>>>> happens
> >>>>>>> most of the times, but at about 10% it does not happen and
> >>>>> everything is
> >>>>>>> working as intended.
> >>>>>>>
> >>>>>>> Any explanations or ideas what causes this behaviour?
> >>>>>>>
> >>>>>>> I will test this on Debian lenny to see if the Debian version
> >>> makes
> >>>>> a
> >>>>>>> difference.
> >>>>>>>
> >>>>>>> Best regards,
> >>>>>>> Florian
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> Ocfs2-users mailing list
> >>>>>>> Ocfs2-users at oss.oracle.com
> >>>>>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
> >>>>>>>
> >>>>> _______________________________________________
> >>>>> Ocfs2-users mailing list
> >>>>> Ocfs2-users at oss.oracle.com
> >>>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users