[Ocfs2-users] ocfs2 fencing with multipath and dual channel HBA

Mon Jun 8 01:37:06 PDT 2009

Hi Tao,

> 
> Hi Florian,
> 
> florian.engelmann at bt.com wrote:
> > Hi Tao,
> >
> >> Hi florian,
> >>
> >> florian.engelmann at bt.com wrote:
> >>>> Florian,
> >>>> the problem here seems to be with network. The nodes are running
> > into
> >>>> network heartbeat timeout and hence second node is getting
fenced.
> > Do
> >>>> you see o2net thread consuming 100% cpu on any node? if not then
> >>>> probably check your network
> >>>> thanks,
> >>>> --Srini
> >>> I forgot to post my /etc/ocfs2/cluster.conf
> >>> node:
> >>>         ip_port = 7777
> >>>         ip_address = 192.168.0.101
> >>>         number = 0
> >>>         name = defr1elcbtd01
> >>>         cluster = ocfs2
> >>>
> >>> node:
> >>>         ip_port = 7777
> >>>         ip_address = 192.168.0.102
> >>>         number = 1
> >>>         name = defr1elcbtd02
> >>>         cluster = ocfs2
> >>>
> >>> cluster:
> >>>         node_count = 2
> >>>         name = ocfs2
> >>>
> >>>
> >>> 192.168.0.10x is eth3 on both nodes and connected with a cross
over
> >>> cable. No active network component is involved here.
> >>>
> >>> defr1elcbtd02:~# traceroute 192.168.0.101
> >>> traceroute to 192.168.0.101 (192.168.0.101), 30 hops max, 52 byte
> >>> packets
> >>>  1  node1 (192.168.0.101)  0.220 ms  0.142 ms  0.223 ms
> >>> defr1elcbtd02:~#
> >>>
> >>> The error message looks like a network problem but why should
there
> > be a
> >>> network problem if I shutdown a FC port?! I testet it about 20
times
> > and
> >>> got about 16 kernel panics starting with the same error message:
> >>>
> >>> kernel: o2net: no longer connected to node defr1elcbtd01 (num 0)
at
> >>> 192.168.0.101:7777
> >> It isn't an error message, just a status report that we can't
connect
> > to
> >> that node now. That node may be rebooted or something else, but
this
> >> node don't know, and it only knows the connection is down.
> >
> > But node defr1elcbtd01 was never down and also the network link
(eth3)
> > wasn't down. I was able to ping from each node to the other.
> > Node 1 is hosting all services and never was faulted while I was
> > testing.
> >
> > All I have to do to panic node 2 is to disable one of two fibre
channel
> > ports or pull one fibre channel cable or delete node 2 from the
cisco
> > SAN zoning.
> > If I apply one of those 3 "errors" I get the message about o2net is
no
> > longer connected to node 1 and 60 seconds later the 2nd node panics
> > because of ocfs2 fencing (but this happens only in about 80% of
cases -
> > in the other 20% of cases o2net does not disconnect and there are no
> > messages about ocfs2 at all - like it should be...).
> > Everything else is working fine in these 60 seconds. The filesystem
is
> > still writable from both nodes and both nodes can ping each other
(via
> > the cluster interconnect).
> I just checked your log. The error why node 2 get the message is that
> node 1 get the message that node 2 stopped disk heartbeat for quite a
> long time so it stop the connection intentionally. So node 2 get this
> message.
> 
> See the log in node 1:
> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224 node
1,
> 1 total
> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2net_set_nn_state:382
> node 1 sc: ffff81007ddf4400 -> 0000000000000000, valid 1 -> 0, err 0
->
> -107
> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_conn_err:296 node
> 1, 1 total
> Jun  8 09:46:26 defr1elcbtd01 kernel: o2net: no longer connected to
node
> defr1elcbtd02 (num 1) at 192.168.0.102:7777
> 
> And I guess the reason why you see this log sometimes(80%) is that the
> time interval. You know ocfs2 disk heartbeat try every 2 secs so
> sometimes node 2 panic before node 1 call o2quo_hb_down and sometimes
> node2 panic after node 1 call o2quo_hb_down(which will put something
> like "no longer..." in node 2's log).
> 
> So would you please give your timeout configuration(o2cb)?

These are my setting on node 2:

O2CB_HEARTBEAT_THRESHOLD=61
O2CB_IDLE_TIMEOUT_MS=60000
O2CB_KEEPALIVE_DELAY_MS=4000
O2CB_RECONNECT_DELAY_MS=4000

On node 1 I still got the old setting because there was no downtime to
reboot this system till today. Is there any way to change the values
without a reboot? The system is mission critical and I can only reboot
on Sundays.

Settings on node 1 are the default setting that came with the Debian
package. I changed them to fit the settings of node 2 after the next
reboot - so I can only guess they were:

O2CB_HEARTBEAT_THRESHOLD=7
O2CB_IDLE_TIMEOUT_MS=12000
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000

Regards,
Florian

> 
> Regards,
> Tao
> 
> 
> 
> >
> > Here are the logs with debug logging:
> >
> > Node 2:
> >
> > Jun  8 09:46:11 defr1elcbtd02 kernel: qla2xxx 0000:04:00.0: LOOP
DOWN
> > detected (2).
> > Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> > ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] put
> > Jun  8 09:46:11 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc
> > ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] data_ready hit
> > Jun  8 09:46:11 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
> > ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] get
> > Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129
[sc
> > ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] receiving
> > Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1170
> > [mag 64088 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at
page_off
> > 24
> > Jun  8 09:46:11 defr1elcbtd02 kernel:
> > (3463,0):o2net_process_message:1015 [mag 64088 len 0 typ 0 stat 0
> > sys_stat 0 key 00000000 num 0] processing message
> > Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc
> > ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 24] get
> > Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196
[sc
> > ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] ret = 1
> > Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129
[sc
> > ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] receiving
> > Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196
[sc
> > ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] ret = -11
> > Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> > ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] put
> > Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> > ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] put
> > Jun  8 09:46:16 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc
> > ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] data_ready hit
> > Jun  8 09:46:16 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
> > ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] get
> > Jun  8 09:46:16 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc
> > ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] data_ready hit
> > Jun  8 09:46:16 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
> > ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] get
> > Jun  8 09:46:16 defr1elcbtd02 kernel: (0,0):sc_put:289 [sc
> > ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] put
> > Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129
[sc
> > ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] receiving
> > Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1170
> > [mag 64088 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at
page_off
> > 24
> > Jun  8 09:46:16 defr1elcbtd02 kernel:
> > (3463,0):o2net_process_message:1015 [mag 64088 len 0 typ 0 stat 0
> > sys_stat 0 key 00000000 num 0] processing message
> > Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc
> > ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 24] get
> > Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196
[sc
> > ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] ret = 1
> > Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129
[sc
> > ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] receiving
> > Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1170
> > [mag 64087 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at
page_off
> > 24
> > Jun  8 09:46:16 defr1elcbtd02 kernel:
> > (3463,0):o2net_process_message:1015 [mag 64087 len 0 typ 0 stat 0
> > sys_stat 0 key 00000000 num 0] processing message
> > Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> > ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 24] put
> > Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc
> > ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 24] get
> > Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196
[sc
> > ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] ret = 1
> > Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129
[sc
> > ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] receiving
> > Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196
[sc
> > ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] ret = -11
> > Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> > ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] put
> > Jun  8 09:46:21 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> > ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] put
> > Jun  8 09:46:21 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc
> > ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] data_ready hit
> > Jun  8 09:46:21 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
> > ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] get
> > Jun  8 09:46:21 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129
[sc
> > ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] receiving
> > Jun  8 09:46:21 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1170
> > [mag 64088 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at
page_off
> > 24
> > Jun  8 09:46:21 defr1elcbtd02 kernel:
> > (3463,0):o2net_process_message:1015 [mag 64088 len 0 typ 0 stat 0
> > sys_stat 0 key 00000000 num 0] processing message
> > Jun  8 09:46:21 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc
> > ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 24] get
> > Jun  8 09:46:21 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196
[sc
> > ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] ret = 1
> > Jun  8 09:46:21 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129
[sc
> > ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] receiving
> > Jun  8 09:46:21 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196
[sc
> > ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] ret = -11
> > Jun  8 09:46:21 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> > ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] put
> > Jun  8 09:46:26 defr1elcbtd02 kernel: (0,0):o2net_state_change:477
[sc
> > ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] state_change to 8
> > Jun  8 09:46:26 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
> > ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] get
> > Jun  8 09:46:26 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc
> > ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] data_ready hit
> > Jun  8 09:46:26 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
> > ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] get
> > Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2net_shutdown_sc:572
[sc
> > ffff81007c2f0800 refs 5 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] shutting down
> > Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> > ffff81007c2f0800 refs 5 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] put
> > Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> > ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] put
> > Jun  8 09:46:26 defr1elcbtd02 kernel:
(3463,0):o2net_set_nn_state:382
> > node 0 sc: ffff81007c2f0800 -> 0000000000000000, valid 1 -> 0, err 0
->
> > -107
> > Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2quo_conn_err:296
node
> > 0, 1 total
> > Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2quo_set_hold:162
node
> > 0, 1 total
> > Jun  8 09:46:26 defr1elcbtd02 kernel: o2net: no longer connected to
node
> > defr1elcbtd01 (num 0) at 192.168.0.101:7777
> > Jun  8 09:46:26 defr1elcbtd02 kernel:
> > (3463,0):o2net_complete_nodes_nsw:255 completed 0 messages for node
0
> > Jun  8 09:46:26 defr1elcbtd02 kernel:
(3463,0):o2net_set_nn_state:431
> > queueing conn attempt in 0 jiffies
> > Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc
> > ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] get
> > Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> > ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] put
> > Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> > ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] put
> > Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129
[sc
> > ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] receiving
> > Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196
[sc
> > ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] ret = 0
> > Jun  8 09:46:26 defr1elcbtd02 kernel:
(3463,0):o2net_rx_until_empty:1215
> > [sc ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] saw error 0, closing
> > Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> > ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] put
> > Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2net_shutdown_sc:572
[sc
> > ffff81007c2f0800 refs 1 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] shutting down
> > Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> > ffff81007c2f0800 refs 1 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] put
> > Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_kref_release:274
[sc
> > ffff81007c2f0800 refs 1 sock ffff8100694138c0 node 0 page
> > ffff81007fafbb00 pg_off 0] releasing
> > Jun  8 09:46:46 defr1elcbtd02 kernel: scsi 2:0:1:1: SCSI error:
return
> > code = 0x00010000
> > Jun  8 09:46:46 defr1elcbtd02 kernel: end_request: I/O error, dev
sdd,
> > sector 1672
> > Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath:
Failing
> > path 8:48.
> > Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath:
Failing
> > path 8:16.
> > Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath emc:
long
> > trespass command will be send
> > Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath emc:
> > honor reservation bit will not be set (default)
> > Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: ioctl: error
adding
> > target to table
> > Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath emc:
long
> > trespass command will be send
> > Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath emc:
> > honor reservation bit will not be set (default)
> > Jun  8 09:46:48 defr1elcbtd02 kernel: device-mapper: multipath emc:
> > emc_pg_init: sending switch-over command
> >
> >
> >
> >
> > Node 1:
> > Jun  8 09:46:21 defr1elcbtd01 kernel: (16528,0):o2net_data_ready:452
[sc
> > ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page
> > ffff81007f75b3e8 pg_off 0] data_ready hit
> > Jun  8 09:46:21 defr1elcbtd01 kernel: (16528,0):sc_get:294 [sc
> > ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page
> > ffff81007f75b3e8 pg_off 0] get
> > Jun  8 09:46:21 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1129
[sc
> > ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> > ffff81007f75b3e8 pg_off 0] receiving
> > Jun  8 09:46:21 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1170
> > [mag 64087 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at
page_off
> > 24
> > Jun  8 09:46:21 defr1elcbtd01 kernel:
> > (3505,0):o2net_process_message:1015 [mag 64087 len 0 typ 0 stat 0
> > sys_stat 0 key 00000000 num 0] processing message
> > Jun  8 09:46:21 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> > ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> > ffff81007f75b3e8 pg_off 24] put
> > Jun  8 09:46:21 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc
> > ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page
> > ffff81007f75b3e8 pg_off 24] get
> > Jun  8 09:46:21 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1196
[sc
> > ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> > ffff81007f75b3e8 pg_off 0] ret = 1
> > Jun  8 09:46:21 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1129
[sc
> > ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> > ffff81007f75b3e8 pg_off 0] receiving
> > Jun  8 09:46:21 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1196
[sc
> > ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> > ffff81007f75b3e8 pg_off 0] ret = -11
> > Jun  8 09:46:21 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> > ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> > ffff81007f75b3e8 pg_off 0] put
> > Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224
node 1,
> > 1 total
> > Jun  8 09:46:26 defr1elcbtd01 kernel:
(3804,0):o2net_set_nn_state:382
> > node 1 sc: ffff81007ddf4400 -> 0000000000000000, valid 1 -> 0, err 0
->
> > -107
> > Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_conn_err:296
node
> > 1, 1 total
> > Jun  8 09:46:26 defr1elcbtd01 kernel: o2net: no longer connected to
node
> > defr1elcbtd02 (num 1) at 192.168.0.102:7777
> > Jun  8 09:46:26 defr1elcbtd01 kernel:
> > (3804,0):o2net_complete_nodes_nsw:255 completed 0 messages for node
1
> > Jun  8 09:46:26 defr1elcbtd01 kernel:
(3804,0):o2net_set_nn_state:431
> > queueing conn attempt in 0 jiffies
> > Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):sc_get:294 [sc
> > ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page
> > ffff81007f75b3e8 pg_off 0] get
> > Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):sc_put:289 [sc
> > ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> > ffff81007f75b3e8 pg_off 0] put
> > Jun  8 09:46:26 defr1elcbtd01 kernel: (3505,0):o2net_shutdown_sc:572
[sc
> > ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page
> > ffff81007f75b3e8 pg_off 0] shutting down
> > Jun  8 09:46:26 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> > ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page
> > ffff81007f75b3e8 pg_off 0] put
> > Jun  8 09:46:26 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> > ffff81007ddf4400 refs 2 sock ffff810002977d00 node 1 page
> > ffff81007f75b3e8 pg_off 0] put
> > Jun  8 09:46:26 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> > ffff81007ddf4400 refs 1 sock ffff810002977d00 node 1 page
> > ffff81007f75b3e8 pg_off 0] put
> > Jun  8 09:46:26 defr1elcbtd01 kernel: (3505,0):sc_kref_release:274
[sc
> > ffff81007ddf4400 refs 1 sock ffff810002977d00 node 1 page
> > ffff81007f75b3e8 pg_off 0] releasing
> > Jun  8 09:46:26 defr1elcbtd01 kernel:
> > (16541,0):ocfs2_replay_journal:1184 Recovering node 1 from slot 0 on
> > device (253,7)
> > Jun  8 09:46:28 defr1elcbtd01 kernel: kjournald starting.  Commit
> > interval 5 seconds
> > Jun  8 09:46:48 defr1elcbtd01 kernel: Please double check your
> > configuration values for 'O2CB_HEARTBEAT_THRESHOLD'
> > Jun  8 09:46:48 defr1elcbtd01 kernel: (3804,0):o2quo_hb_up:199 node
1, 2
> > total
> > Jun  8 09:46:48 defr1elcbtd01 kernel: (3804,0):o2quo_set_hold:162
node
> > 1, 1 total
> > Jun  8 09:46:48 defr1elcbtd01 kernel:
(3804,0):o2net_set_nn_state:382
> > node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err
-107
> > -> 0
> > Jun  8 09:46:48 defr1elcbtd01 kernel:
(3804,0):o2net_set_nn_state:431
> > queueing conn attempt in 0 jiffies
> > Jun  8 09:46:58 defr1elcbtd01 kernel:
(3505,0):o2net_set_nn_state:382
> > node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err 0
->
> > -107
> > Jun  8 09:46:58 defr1elcbtd01 kernel: (3505,0):o2quo_conn_err:296
node
> > 1, 1 total
> > Jun  8 09:46:58 defr1elcbtd01 kernel:
(3505,0):o2net_set_nn_state:431
> > queueing conn attempt in 0 jiffies
> > Jun  8 09:47:16 defr1elcbtd01 kernel: (3505,0):o2quo_hb_still_up:242
> > node 1
> > Jun  8 09:47:16 defr1elcbtd01 kernel: (3505,0):o2quo_clear_hold:171
node
> > 1, 0 total
> > Jun  8 09:47:16 defr1elcbtd01 kernel: (8,0):o2quo_make_decision:105
> > heartbeating: 2, connected: 1, lowest: 0 (reachable)
> > Jun  8 09:47:44 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224
node 1,
> > 1 total
> > Jun  8 09:47:44 defr1elcbtd01 kernel:
(3804,0):o2net_set_nn_state:382
> > node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err
-107
> > -> -107
> > Jun  8 09:47:44 defr1elcbtd01 kernel:
(3804,0):o2net_set_nn_state:431
> > queueing conn attempt in 0 jiffies
> > Jun  8 09:48:33 defr1elcbtd01 kernel: e1000: eth3: e1000_watchdog:
NIC
> > Link is Down
> > Jun  8 09:48:45 defr1elcbtd01 kernel: e1000: eth3: e1000_watchdog:
NIC
> > Link is Up 1000 Mbps Full Duplex
> > Jun  8 09:50:05 defr1elcbtd01 kernel: e1000: eth3: e1000_watchdog:
NIC
> > Link is Down
> > Jun  8 09:50:08 defr1elcbtd01 kernel: e1000: eth3: e1000_watchdog:
NIC
> > Link is Up 1000 Mbps Full Duplex
> > Jun  8 09:50:20 defr1elcbtd01 kernel: Please double check your
> > configuration values for 'O2CB_HEARTBEAT_THRESHOLD'
> > Jun  8 09:50:20 defr1elcbtd01 kernel: (3804,0):o2quo_hb_up:199 node
1, 2
> > total
> > Jun  8 09:50:20 defr1elcbtd01 kernel: (3804,0):o2quo_set_hold:162
node
> > 1, 1 total
> > Jun  8 09:50:20 defr1elcbtd01 kernel:
(3804,0):o2net_set_nn_state:382
> > node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err
-107
> > -> 0
> > Jun  8 09:50:20 defr1elcbtd01 kernel:
(3804,0):o2net_set_nn_state:431
> > queueing conn attempt in 0 jiffies
> > Jun  8 09:50:22 defr1elcbtd01 kernel:
(0,0):o2net_listen_data_ready:1670
> > bytes: 0
> > Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_alloc:320 [sc
> > ffff81007dddc400 refs 1 sock 0000000000000000 node 1 page
> > 0000000000000000 pg_off 0] alloced
> > Jun  8 09:50:22 defr1elcbtd01 kernel:
(3505,0):o2net_set_nn_state:382
> > node 1 sc: 0000000000000000 -> ffff81007dddc400, valid 0 -> 0, err 0
->
> > 0
> > Jun  8 09:50:22 defr1elcbtd01 kernel:
(3505,0):o2net_set_nn_state:431
> > queueing conn attempt in 0 jiffies
> > Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc
> > ffff81007dddc400 refs 1 sock ffff81005232e700 node 1 page
> > ffff81007f272590 pg_off 0] get
> > Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc
> > ffff81007dddc400 refs 2 sock ffff81005232e700 node 1 page
> > ffff81007f272590 pg_off 0] get
> > Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc
> > ffff81007dddc400 refs 3 sock ffff81005232e700 node 1 page
> > ffff81007f272590 pg_off 0] get
> > Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> > ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page
> > ffff81007f272590 pg_off 0] put
> > Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1129
[sc
> > ffff81007dddc400 refs 3 sock ffff81005232e700 node 1 page
> > ffff81007f272590 pg_off 0] receiving
> > Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc
> > ffff81007dddc400 refs 3 sock ffff81005232e700 node 1 page
> > ffff81007f272590 pg_off 16] get
> > Jun  8 09:50:22 defr1elcbtd01 kernel:
(3505,0):o2net_set_nn_state:382
> > node 1 sc: ffff81007dddc400 -> ffff81007dddc400, valid 0 -> 1, err 0
->
> > 0
> > Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2quo_conn_up:267
node 1,
> > 2 total
> > Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2quo_clear_hold:171
node
> > 1, 0 total
> > Jun  8 09:50:22 defr1elcbtd01 kernel: o2net: accepted connection
from
> > node defr1elcbtd02 (num 1) at 192.168.0.102:7777
> > Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1196
[sc
> > ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page
> > ffff81007f272590 pg_off 0] ret = 16
> > Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1129
[sc
> > ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page
> > ffff81007f272590 pg_off 0] receiving
> > Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1196
[sc
> > ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page
> > ffff81007f272590 pg_off 0] ret = -11
> > Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> > ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page
> > ffff81007f272590 pg_off 0] put
> >
> >> If there is some problem in FC port, the node will reboot and the
> > other
> >> node will get this message.
> >>> The cluster is running fine if there is no problem with the SAN
> >>> connection.
> >>>
> >>> How to enable verbose logging with ofcs2?
> >> debugfs.ocfs2 -l will show current logging status.
> >>
> >> If you want to enable some log, use e.g
> >> debugfs.ocfs2  -l DISK_ALLOC allow
> >> and
> >> debugfs.ocfs2 -l DISK_ALLOC off
> >> will turn it off.
> >>
> >> Regards,
> >> Tao
> >>> Regards,
> >>> Florian
> >>>
> >>>> florian.engelmann at bt.com wrote:
> >>>>> Hello,
> >>>>> our Debian etch cluster nodes are panicing because of ocfs2
> > fencing
> >>> if
> >>>>> one SAN path fails.
> >>>>>
> >>>>> modinfo ocfs2
> >>>>> filename:
> > /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/ocfs2.ko
> >>>>> author:         Oracle
> >>>>> license:        GPL
> >>>>> description:    OCFS2 1.3.3
> >>>>> version:        1.3.3
> >>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
> >>>>> depends:        ocfs2_dlm,ocfs2_nodemanager,jbd
> >>>>> srcversion:     0798424846E68F10172C203
> >>>>>
> >>>>> modinfo ocfs2_dlmfs
> >>>>> filename:
> >>>>> /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/dlm/ocfs2_dlmfs.ko
> >>>>> author:         Oracle
> >>>>> license:        GPL
> >>>>> description:    OCFS2 DLMFS 1.3.3
> >>>>> version:        1.3.3
> >>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
> >>>>> depends:        ocfs2_dlm,ocfs2_nodemanager
> >>>>> srcversion:     E3780E12396118282B3C1AD
> >>>>>
> >>>>> defr1elcbtd02:~# modinfo ocfs2_dlm
> >>>>> filename:
> >>>>> /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/dlm/ocfs2_dlm.ko
> >>>>> author:         Oracle
> >>>>> license:        GPL
> >>>>> description:    OCFS2 DLM 1.3.3
> >>>>> version:        1.3.3
> >>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
> >>>>> depends:        ocfs2_nodemanager
> >>>>> srcversion:     7DC395EA08AE4CE826C5B92
> >>>>>
> >>>>> modinfo ocfs2_nodemanager
> >>>>> filename:
> >>>>>
> >
/lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/cluster/ocfs2_nodemanager.ko
> >>>>> author:         Oracle
> >>>>> license:        GPL
> >>>>> description:    OCFS2 Node Manager 1.3.3
> >>>>> version:        1.3.3
> >>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
> >>>>> depends:        configfs
> >>>>> srcversion:     C4C9871302E1910B78DAE40
> >>>>>
> >>>>> modinfo qla2xxx
> >>>>> filename:
> >>>>>
/lib/modules/2.6.18-6-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko
> >>>>> author:         QLogic Corporation
> >>>>> description:    QLogic Fibre Channel HBA Driver
> >>>>> license:        GPL
> >>>>> version:        8.01.07-k1
> >>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
> >>>>> depends:        scsi_mod,scsi_transport_fc,firmware_class
> >>>>> alias:          pci:v00001077d00002100sv*sd*bc*sc*i*
> >>>>> alias:          pci:v00001077d00002200sv*sd*bc*sc*i*
> >>>>> alias:          pci:v00001077d00002300sv*sd*bc*sc*i*
> >>>>> alias:          pci:v00001077d00002312sv*sd*bc*sc*i*
> >>>>> alias:          pci:v00001077d00002322sv*sd*bc*sc*i*
> >>>>> alias:          pci:v00001077d00006312sv*sd*bc*sc*i*
> >>>>> alias:          pci:v00001077d00006322sv*sd*bc*sc*i*
> >>>>> alias:          pci:v00001077d00002422sv*sd*bc*sc*i*
> >>>>> alias:          pci:v00001077d00002432sv*sd*bc*sc*i*
> >>>>> alias:          pci:v00001077d00005422sv*sd*bc*sc*i*
> >>>>> alias:          pci:v00001077d00005432sv*sd*bc*sc*i*
> >>>>> srcversion:     B8E1608E257391DCAFD9224
> >>>>> parm:           ql2xfdmienable:Enables FDMI registratons Default
> > is
> >>> 0 -
> >>>>> no FDMI. 1 - perfom FDMI. (int)
> >>>>> parm:           extended_error_logging:Option to enable extended
> >>> error
> >>>>> logging, Default is 0 - no logging. 1 - log errors. (int)
> >>>>> parm:           ql2xallocfwdump:Option to enable allocation of
> >>> memory
> >>>>> for a firmware dump during HBA initialization.  Memory
allocation
> >>>>> requirements vary by ISP type.  Default is 1 - allocate memory.
> >>> (int)
> >>>>> parm:           ql2xloginretrycount:Specify an alternate value
for
> >>> the
> >>>>> NVRAM login retry count. (int)
> >>>>> parm:           ql2xplogiabsentdevice:Option to enable PLOGI to
> >>> devices
> >>>>> that are not present after a Fabric scan.  This is needed for
> >>> several
> >>>>> broken switches. Default is 0 - no PLOGI. 1 - perfom PLOGI.
(int)
> >>>>> parm:           qlport_down_retry:Maximum number of command
> > retries
> >>> to a
> >>>>> port that returns a PORT-DOWN status. (int)
> >>>>> parm:           ql2xlogintimeout:Login timeout value in seconds.
> >>> (int)
> >>>>> modinfo dm_multipath
> >>>>> filename:
> >>>>> /lib/modules/2.6.18-6-amd64/kernel/drivers/md/dm-multipath.ko
> >>>>> description:    device-mapper multipath target
> >>>>> author:         Sistina Software <dm-devel at redhat.com>
> >>>>> license:        GPL
> >>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
> >>>>> depends:        dm-mod
> >>>>>
> >>>>> modinfo dm_mod
> >>>>> filename:
> >>> /lib/modules/2.6.18-6-amd64/kernel/drivers/md/dm-mod.ko
> >>>>> description:    device-mapper driver
> >>>>> author:         Joe Thornber <dm-devel at redhat.com>
> >>>>> license:        GPL
> >>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
> >>>>> depends:
> >>>>> parm:           major:The major number of the device mapper
(uint)
> >>>>>
> >>>>> modinfo dm_round_robin
> >>>>> filename:
> >>>>> /lib/modules/2.6.18-6-amd64/kernel/drivers/md/dm-round-robin.ko
> >>>>> description:    device-mapper round-robin multipath path
selector
> >>>>> author:         Sistina Software <dm-devel at redhat.com>
> >>>>> license:        GPL
> >>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
> >>>>> depends:        dm-multipath
> >>>>>
> >>>>> There is no self compiled software just the official repository
> > was
> >>>>> used.
> >>>>> The nodes are connected to our two independent SANs. The storage
> >>> systems
> >>>>> are EMC Clariion CX3-20f and EMC Clariion CX500.
> >>>>>
> >>>>> multipath.conf:
> >>>>> defaults {
> >>>>>         rr_min_io                       1000
> >>>>>         polling_interval                2
> >>>>>         no_path_retry                   5
> >>>>>         user_friendly_names             yes
> >>>>> }
> >>>>>
> >>>>> blacklist {
> >>>>>         devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
> >>>>>         devnode "^hd[a-z][[0-9]*]"
> >>>>>         devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
> >>>>>         device {
> >>>>>                 vendor "DGC"
> >>>>>                 product "LUNZ" # EMC Management LUN
> >>>>>         }
> >>>>>         device {
> >>>>>                 vendor "ATA"  #We do not need mutlipathing for
> > local
> >>>>> drives
> >>>>>                 product "*"
> >>>>>         }
> >>>>>         device {
> >>>>>                 vendor "AMI" # No multipathing for SUN Virtual
> >>> devices
> >>>>>                 product "*"
> >>>>>         }
> >>>>>         device {
> >>>>>                 vendor "HITACHI" # No multipathing for local
scsi
> >>> disks
> >>>>>                 product "H101414SCSUN146G"
> >>>>>         }
> >>>>> }
> >>>>>
> >>>>> devices {
> >>>>>         ## Device attributes for EMC CLARiiON
> >>>>>         device {
> >>>>>                 vendor                  "DGC"
> >>>>>                 product                 "*"
> >>>>>                 path_grouping_policy    group_by_prio
> >>>>>                 getuid_callout          "/sbin/scsi_id -g -u -s
> >>>>> /block/%n"
> >>>>>                 prio_callout            "/sbin/mpath_prio_emc
> >>> /dev/%n"
> >>>>>                 hardware_handler        "1 emc"
> >>>>>                 features                "1 queue_if_no_path"
> >>>>>                 no_path_retry           fail
> >>>>>                 path_checker            emc_clariion
> >>>>>                 path_selector           "round-robin 0"
> >>>>>                 failback                immediate
> >>>>>                 user_friendly_names     yes
> >>>>>         }
> >>>>> }
> >>>>>
> >>>>> multipaths {
> >>>>>         multipath {
> >>>>>                 wwid
> >>>>> 3600601603ac511001c7c92fec775dd11
> >>>>>                 alias                   stosan01_lun070
> >>>>>         }
> >>>>> }
> >>>>>
> >>>>> multipath -ll:
> >>>>> stosan01_lun070 (3600601603ac511001c7c92fec775dd11) dm-7
DGC,RAID
> > 5
> >>>>> [size=133G][features=0][hwhandler=1 emc]
> >>>>> \_ round-robin 0 [prio=2][active]
> >>>>>  \_ 0:0:1:1 sdd 8:48  [active][ready]
> >>>>>  \_ 1:0:1:1 sdh 8:112 [active][ready]
> >>>>> \_ round-robin 0 [prio=0][enabled]
> >>>>>  \_ 0:0:0:1 sdb 8:16  [active][ready]
> >>>>>  \_ 1:0:0:1 sdf 8:80  [active][ready]
> >>>>>
> >>>>>
> >>>>> As we use lvm2 we added /dev/sd* to the filter:
> >>>>> filter = [ "r|/dev/cdrom|", "r|/dev/sd.*|" ]
> >>>>>
> >>>>> Here is what happened and what we did to reconstruct the
situation
> >>> to
> >>>>> find a solution:
> >>>>>
> >>>>> On 02.06.2009 we did something wrong with the zoning on one of
our
> >>> two
> >>>>> SANs and all servers (about 40) lost one path to the SAN. Only
two
> >>>>> servers crashed. Those two are our Debian etch heartbeat cluster
> >>>>> described above.
> >>>>> The console showed a kernel panic because of ocfs2 was fencing
> > both
> >>>>> nodes.
> >>>>>
> >>>>> This was the message:
> >>>>> O2hb_write_timeout: 165 ERROR: Heartbeat write timeout to device
> >>> dm-7
> >>>>> after 12000 milliseconds
> >>>>>
> >>>>> So we decided to change the o2cb settings to:
> >>>>> O2CB_HEARTBEAT_THRESHOLD=31
> >>>>> O2CB_IDLE_TIMEOUT_MS=30000
> >>>>> O2CB_KEEPALIVE_DELAY_MS=2000
> >>>>> O2CB_RECONNECT_DELAY_MS=2000
> >>>>>
> >>>>> We switched all cluster resources to the 1st node to test the
new
> >>>>> settings on the second node. We removed the 2nd node from the
> > zoning
> >>> (we
> >>>>> also tested shutting down the port with the same result) and got
a
> >>>>> different error but still ending up with a kernel panic:
> >>>>>
> >>>>> Jun  4 16:41:05 defr1elcbtd02 kernel: o2net: no longer connected
> > to
> >>> node
> >>>>> defr1elcbtd01 (num 0) at 192.168.0.101:7777
> >>>>> Jun  4 16:41:27 defr1elcbtd02 kernel:  rport-0:0-0: blocked FC
> >>> remote
> >>>>> port time out: removing target and saving binding
> >>>>> Jun  4 16:41:27 defr1elcbtd02 kernel:  rport-0:0-1: blocked FC
> >>> remote
> >>>>> port time out: removing target and saving binding
> >>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: sd 0:0:1:1: SCSI error:
> > return
> >>>>> code = 0x00010000
> >>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: end_request: I/O error,
dev
> >>> sdd,
> >>>>> sector 1672
> >>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath:
> >>> Failing
> >>>>> path 8:48.
> >>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath:
> >>> Failing
> >>>>> path 8:16.
> >>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: scsi 0:0:1:1: rejecting
I/O
> > to
> >>>>> device being removed
> >>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath
> > emc:
> >>> long
> >>>>> trespass command will be send
> >>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath
> > emc:
> >>>>> honor reservation bit will not be set (default)
> >>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: table:
253:7:
> >>>>> multipath: error getting device
> >>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: ioctl:
error
> >>> adding
> >>>>> target to table
> >>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath
> > emc:
> >>> long
> >>>>> trespass command will be send
> >>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath
> > emc:
> >>>>> honor reservation bit will not be set (default)
> >>>>> Jun  4 16:41:29 defr1elcbtd02 kernel: device-mapper: multipath
> > emc:
> >>>>> emc_pg_init: sending switch-over command
> >>>>> Jun  4 16:42:01 defr1elcbtd02 kernel:
> >>>>> (10751,1):dlm_send_remote_convert_request:395 ERROR: status =
-107
> >>>>> Jun  4 16:42:01 defr1elcbtd02 kernel:
> >>>>> (10751,1):dlm_wait_for_node_death:374
> >>> 5EE89BC01EFC405E9197C198DEEAE678:
> >>>>> waiting 5000ms for notification of death of node 0
> >>>>> Jun  4 16:42:07 defr1elcbtd02 kernel:
> >>>>> (10751,1):dlm_send_remote_convert_request:395 ERROR: status =
-107
> >>>>> Jun  4 16:42:07 defr1elcbtd02 kernel:
> >>>>> (10751,1):dlm_wait_for_node_death:374
> >>> 5EE89BC01EFC405E9197C198DEEAE678:
> >>>>> waiting 5000ms for notification of death of node 0
> >>>>> [...]
> >>>>> After 60 seconds:
> >>>>>
> >>>>> (8,0): o2quo_make_decision:143 ERROR: fending this node because
it
> >>> is
> >>>>> connected to a half-quorum of 1 out of 2 nodes which doesn't
> > include
> >>> the
> >>>>> lowest active node 0
> >>>>>
> >>>>>
> >>>>> multipath -ll changed to:
> >>>>> stosan01_lun070 (3600601603ac511001c7c92fec775dd11) dm-7
DGC,RAID
> > 5
> >>>>> [size=133G][features=0][hwhandler=1 emc]
> >>>>> \_ round-robin 0 [prio=1][active]
> >>>>>  \_ 0:0:1:1 sdd 8:48  [active][ready]
> >>>>> \_ round-robin 0 [prio=0][enabled]
> >>>>>  \_ 0:0:0:1 sdb 8:16  [active][ready]
> >>>>>
> >>>>> The ocfs2 filesystem is still mounted an writable. Even if I
> > enable
> >>> the
> >>>>> zoneing (or the FC port) again within the 60 seconds ocfs2 does
> > not
> >>>>> reconnect to node 1 and panics the kernel after 60 seconds while
> >>>>> multipath -ll shows both path again.
> >>>>>
> >>>>> I do not understand at all what the Ethernet heartbeat
connection
> > of
> >>>>> ocfs2 has to do with the SAN connection.
> >>>>>
> >>>>> The strangest thing at all is - this does not happen always!
After
> >>> some
> >>>>> reboots the system keeps running stable even if I shutdown a FC
> > port
> >>> and
> >>>>> enable it again many times. There is no constant behaviour... It
> >>> happens
> >>>>> most of the times, but at about 10% it does not happen and
> >>> everything is
> >>>>> working as intended.
> >>>>>
> >>>>> Any explanations or ideas what causes this behaviour?
> >>>>>
> >>>>> I will test this on Debian lenny to see if the Debian version
> > makes
> >>> a
> >>>>> difference.
> >>>>>
> >>>>> Best regards,
> >>>>> Florian
> >>>>>
> >>>>> _______________________________________________
> >>>>> Ocfs2-users mailing list
> >>>>> Ocfs2-users at oss.oracle.com
> >>>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
> >>>>>
> >>> _______________________________________________
> >>> Ocfs2-users mailing list
> >>> Ocfs2-users at oss.oracle.com
> >>> http://oss.oracle.com/mailman/listinfo/ocfs2-users