[Ocfs2-users] ocfs2 fencing with multipath and dual channel HBA

Tao Ma tao.ma at oracle.com
Mon Jun 8 01:20:07 PDT 2009


Hi Florian,

florian.engelmann at bt.com wrote:
> Hi Tao,
> 
>> Hi florian,
>>
>> florian.engelmann at bt.com wrote:
>>>> Florian,
>>>> the problem here seems to be with network. The nodes are running
> into
>>>> network heartbeat timeout and hence second node is getting fenced.
> Do
>>>> you see o2net thread consuming 100% cpu on any node? if not then
>>>> probably check your network
>>>> thanks,
>>>> --Srini
>>> I forgot to post my /etc/ocfs2/cluster.conf
>>> node:
>>>         ip_port = 7777
>>>         ip_address = 192.168.0.101
>>>         number = 0
>>>         name = defr1elcbtd01
>>>         cluster = ocfs2
>>>
>>> node:
>>>         ip_port = 7777
>>>         ip_address = 192.168.0.102
>>>         number = 1
>>>         name = defr1elcbtd02
>>>         cluster = ocfs2
>>>
>>> cluster:
>>>         node_count = 2
>>>         name = ocfs2
>>>
>>>
>>> 192.168.0.10x is eth3 on both nodes and connected with a cross over
>>> cable. No active network component is involved here.
>>>
>>> defr1elcbtd02:~# traceroute 192.168.0.101
>>> traceroute to 192.168.0.101 (192.168.0.101), 30 hops max, 52 byte
>>> packets
>>>  1  node1 (192.168.0.101)  0.220 ms  0.142 ms  0.223 ms
>>> defr1elcbtd02:~#
>>>
>>> The error message looks like a network problem but why should there
> be a
>>> network problem if I shutdown a FC port?! I testet it about 20 times
> and
>>> got about 16 kernel panics starting with the same error message:
>>>
>>> kernel: o2net: no longer connected to node defr1elcbtd01 (num 0) at
>>> 192.168.0.101:7777
>> It isn't an error message, just a status report that we can't connect
> to
>> that node now. That node may be rebooted or something else, but this
>> node don't know, and it only knows the connection is down.
> 
> But node defr1elcbtd01 was never down and also the network link (eth3)
> wasn't down. I was able to ping from each node to the other.
> Node 1 is hosting all services and never was faulted while I was
> testing.
> 
> All I have to do to panic node 2 is to disable one of two fibre channel
> ports or pull one fibre channel cable or delete node 2 from the cisco
> SAN zoning.
> If I apply one of those 3 "errors" I get the message about o2net is no
> longer connected to node 1 and 60 seconds later the 2nd node panics
> because of ocfs2 fencing (but this happens only in about 80% of cases -
> in the other 20% of cases o2net does not disconnect and there are no
> messages about ocfs2 at all - like it should be...).
> Everything else is working fine in these 60 seconds. The filesystem is
> still writable from both nodes and both nodes can ping each other (via
> the cluster interconnect).
I just checked your log. The error why node 2 get the message is that 
node 1 get the message that node 2 stopped disk heartbeat for quite a 
long time so it stop the connection intentionally. So node 2 get this 
message.

See the log in node 1:
Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224 node 1,
1 total
Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2net_set_nn_state:382
node 1 sc: ffff81007ddf4400 -> 0000000000000000, valid 1 -> 0, err 0 ->
-107
Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_conn_err:296 node
1, 1 total
Jun  8 09:46:26 defr1elcbtd01 kernel: o2net: no longer connected to node
defr1elcbtd02 (num 1) at 192.168.0.102:7777

And I guess the reason why you see this log sometimes(80%) is that the 
time interval. You know ocfs2 disk heartbeat try every 2 secs so 
sometimes node 2 panic before node 1 call o2quo_hb_down and sometimes 
node2 panic after node 1 call o2quo_hb_down(which will put something 
like "no longer..." in node 2's log).

So would you please give your timeout configuration(o2cb)?

Regards,
Tao



> 
> Here are the logs with debug logging:
> 
> Node 2:
> 
> Jun  8 09:46:11 defr1elcbtd02 kernel: qla2xxx 0000:04:00.0: LOOP DOWN
> detected (2).
> Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] put
> Jun  8 09:46:11 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc
> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] data_ready hit
> Jun  8 09:46:11 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] get
> Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 [sc
> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] receiving
> Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1170
> [mag 64088 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at page_off
> 24
> Jun  8 09:46:11 defr1elcbtd02 kernel:
> (3463,0):o2net_process_message:1015 [mag 64088 len 0 typ 0 stat 0
> sys_stat 0 key 00000000 num 0] processing message
> Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc
> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 24] get
> Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 [sc
> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] ret = 1
> Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 [sc
> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] receiving
> Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 [sc
> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] ret = -11
> Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] put
> Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] put
> Jun  8 09:46:16 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc
> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] data_ready hit
> Jun  8 09:46:16 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] get
> Jun  8 09:46:16 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc
> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] data_ready hit
> Jun  8 09:46:16 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] get
> Jun  8 09:46:16 defr1elcbtd02 kernel: (0,0):sc_put:289 [sc
> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] put
> Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 [sc
> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] receiving
> Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1170
> [mag 64088 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at page_off
> 24
> Jun  8 09:46:16 defr1elcbtd02 kernel:
> (3463,0):o2net_process_message:1015 [mag 64088 len 0 typ 0 stat 0
> sys_stat 0 key 00000000 num 0] processing message
> Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc
> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 24] get
> Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 [sc
> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] ret = 1
> Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 [sc
> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] receiving
> Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1170
> [mag 64087 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at page_off
> 24
> Jun  8 09:46:16 defr1elcbtd02 kernel:
> (3463,0):o2net_process_message:1015 [mag 64087 len 0 typ 0 stat 0
> sys_stat 0 key 00000000 num 0] processing message
> Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 24] put
> Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc
> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 24] get
> Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 [sc
> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] ret = 1
> Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 [sc
> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] receiving
> Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 [sc
> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] ret = -11
> Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] put
> Jun  8 09:46:21 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] put
> Jun  8 09:46:21 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc
> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] data_ready hit
> Jun  8 09:46:21 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] get
> Jun  8 09:46:21 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 [sc
> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] receiving
> Jun  8 09:46:21 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1170
> [mag 64088 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at page_off
> 24
> Jun  8 09:46:21 defr1elcbtd02 kernel:
> (3463,0):o2net_process_message:1015 [mag 64088 len 0 typ 0 stat 0
> sys_stat 0 key 00000000 num 0] processing message
> Jun  8 09:46:21 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc
> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 24] get
> Jun  8 09:46:21 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 [sc
> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] ret = 1
> Jun  8 09:46:21 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 [sc
> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] receiving
> Jun  8 09:46:21 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 [sc
> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] ret = -11
> Jun  8 09:46:21 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] put
> Jun  8 09:46:26 defr1elcbtd02 kernel: (0,0):o2net_state_change:477 [sc
> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] state_change to 8
> Jun  8 09:46:26 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] get
> Jun  8 09:46:26 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc
> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] data_ready hit
> Jun  8 09:46:26 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] get
> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2net_shutdown_sc:572 [sc
> ffff81007c2f0800 refs 5 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] shutting down
> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> ffff81007c2f0800 refs 5 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] put
> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] put
> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2net_set_nn_state:382
> node 0 sc: ffff81007c2f0800 -> 0000000000000000, valid 1 -> 0, err 0 ->
> -107
> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2quo_conn_err:296 node
> 0, 1 total
> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2quo_set_hold:162 node
> 0, 1 total
> Jun  8 09:46:26 defr1elcbtd02 kernel: o2net: no longer connected to node
> defr1elcbtd01 (num 0) at 192.168.0.101:7777
> Jun  8 09:46:26 defr1elcbtd02 kernel:
> (3463,0):o2net_complete_nodes_nsw:255 completed 0 messages for node 0
> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2net_set_nn_state:431
> queueing conn attempt in 0 jiffies
> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc
> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] get
> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] put
> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] put
> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 [sc
> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] receiving
> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 [sc
> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] ret = 0
> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2net_rx_until_empty:1215
> [sc ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] saw error 0, closing
> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] put
> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2net_shutdown_sc:572 [sc
> ffff81007c2f0800 refs 1 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] shutting down
> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> ffff81007c2f0800 refs 1 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] put
> Jun  8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_kref_release:274 [sc
> ffff81007c2f0800 refs 1 sock ffff8100694138c0 node 0 page
> ffff81007fafbb00 pg_off 0] releasing
> Jun  8 09:46:46 defr1elcbtd02 kernel: scsi 2:0:1:1: SCSI error: return
> code = 0x00010000
> Jun  8 09:46:46 defr1elcbtd02 kernel: end_request: I/O error, dev sdd,
> sector 1672
> Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath: Failing
> path 8:48.
> Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath: Failing
> path 8:16.
> Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath emc: long
> trespass command will be send
> Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath emc:
> honor reservation bit will not be set (default)
> Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: ioctl: error adding
> target to table
> Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath emc: long
> trespass command will be send
> Jun  8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath emc:
> honor reservation bit will not be set (default)
> Jun  8 09:46:48 defr1elcbtd02 kernel: device-mapper: multipath emc:
> emc_pg_init: sending switch-over command
> 
> 
> 
> 
> Node 1:
> Jun  8 09:46:21 defr1elcbtd01 kernel: (16528,0):o2net_data_ready:452 [sc
> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page
> ffff81007f75b3e8 pg_off 0] data_ready hit
> Jun  8 09:46:21 defr1elcbtd01 kernel: (16528,0):sc_get:294 [sc
> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page
> ffff81007f75b3e8 pg_off 0] get
> Jun  8 09:46:21 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1129 [sc
> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> ffff81007f75b3e8 pg_off 0] receiving
> Jun  8 09:46:21 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1170
> [mag 64087 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at page_off
> 24
> Jun  8 09:46:21 defr1elcbtd01 kernel:
> (3505,0):o2net_process_message:1015 [mag 64087 len 0 typ 0 stat 0
> sys_stat 0 key 00000000 num 0] processing message
> Jun  8 09:46:21 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> ffff81007f75b3e8 pg_off 24] put
> Jun  8 09:46:21 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc
> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page
> ffff81007f75b3e8 pg_off 24] get
> Jun  8 09:46:21 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1196 [sc
> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> ffff81007f75b3e8 pg_off 0] ret = 1
> Jun  8 09:46:21 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1129 [sc
> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> ffff81007f75b3e8 pg_off 0] receiving
> Jun  8 09:46:21 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1196 [sc
> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> ffff81007f75b3e8 pg_off 0] ret = -11
> Jun  8 09:46:21 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> ffff81007f75b3e8 pg_off 0] put
> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224 node 1,
> 1 total
> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2net_set_nn_state:382
> node 1 sc: ffff81007ddf4400 -> 0000000000000000, valid 1 -> 0, err 0 ->
> -107
> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_conn_err:296 node
> 1, 1 total
> Jun  8 09:46:26 defr1elcbtd01 kernel: o2net: no longer connected to node
> defr1elcbtd02 (num 1) at 192.168.0.102:7777
> Jun  8 09:46:26 defr1elcbtd01 kernel:
> (3804,0):o2net_complete_nodes_nsw:255 completed 0 messages for node 1
> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2net_set_nn_state:431
> queueing conn attempt in 0 jiffies
> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):sc_get:294 [sc
> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page
> ffff81007f75b3e8 pg_off 0] get
> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):sc_put:289 [sc
> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page
> ffff81007f75b3e8 pg_off 0] put
> Jun  8 09:46:26 defr1elcbtd01 kernel: (3505,0):o2net_shutdown_sc:572 [sc
> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page
> ffff81007f75b3e8 pg_off 0] shutting down
> Jun  8 09:46:26 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page
> ffff81007f75b3e8 pg_off 0] put
> Jun  8 09:46:26 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> ffff81007ddf4400 refs 2 sock ffff810002977d00 node 1 page
> ffff81007f75b3e8 pg_off 0] put
> Jun  8 09:46:26 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> ffff81007ddf4400 refs 1 sock ffff810002977d00 node 1 page
> ffff81007f75b3e8 pg_off 0] put
> Jun  8 09:46:26 defr1elcbtd01 kernel: (3505,0):sc_kref_release:274 [sc
> ffff81007ddf4400 refs 1 sock ffff810002977d00 node 1 page
> ffff81007f75b3e8 pg_off 0] releasing
> Jun  8 09:46:26 defr1elcbtd01 kernel:
> (16541,0):ocfs2_replay_journal:1184 Recovering node 1 from slot 0 on
> device (253,7)
> Jun  8 09:46:28 defr1elcbtd01 kernel: kjournald starting.  Commit
> interval 5 seconds
> Jun  8 09:46:48 defr1elcbtd01 kernel: Please double check your
> configuration values for 'O2CB_HEARTBEAT_THRESHOLD'
> Jun  8 09:46:48 defr1elcbtd01 kernel: (3804,0):o2quo_hb_up:199 node 1, 2
> total
> Jun  8 09:46:48 defr1elcbtd01 kernel: (3804,0):o2quo_set_hold:162 node
> 1, 1 total
> Jun  8 09:46:48 defr1elcbtd01 kernel: (3804,0):o2net_set_nn_state:382
> node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err -107
> -> 0
> Jun  8 09:46:48 defr1elcbtd01 kernel: (3804,0):o2net_set_nn_state:431
> queueing conn attempt in 0 jiffies
> Jun  8 09:46:58 defr1elcbtd01 kernel: (3505,0):o2net_set_nn_state:382
> node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err 0 ->
> -107
> Jun  8 09:46:58 defr1elcbtd01 kernel: (3505,0):o2quo_conn_err:296 node
> 1, 1 total
> Jun  8 09:46:58 defr1elcbtd01 kernel: (3505,0):o2net_set_nn_state:431
> queueing conn attempt in 0 jiffies
> Jun  8 09:47:16 defr1elcbtd01 kernel: (3505,0):o2quo_hb_still_up:242
> node 1
> Jun  8 09:47:16 defr1elcbtd01 kernel: (3505,0):o2quo_clear_hold:171 node
> 1, 0 total
> Jun  8 09:47:16 defr1elcbtd01 kernel: (8,0):o2quo_make_decision:105
> heartbeating: 2, connected: 1, lowest: 0 (reachable)
> Jun  8 09:47:44 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224 node 1,
> 1 total
> Jun  8 09:47:44 defr1elcbtd01 kernel: (3804,0):o2net_set_nn_state:382
> node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err -107
> -> -107
> Jun  8 09:47:44 defr1elcbtd01 kernel: (3804,0):o2net_set_nn_state:431
> queueing conn attempt in 0 jiffies
> Jun  8 09:48:33 defr1elcbtd01 kernel: e1000: eth3: e1000_watchdog: NIC
> Link is Down
> Jun  8 09:48:45 defr1elcbtd01 kernel: e1000: eth3: e1000_watchdog: NIC
> Link is Up 1000 Mbps Full Duplex
> Jun  8 09:50:05 defr1elcbtd01 kernel: e1000: eth3: e1000_watchdog: NIC
> Link is Down
> Jun  8 09:50:08 defr1elcbtd01 kernel: e1000: eth3: e1000_watchdog: NIC
> Link is Up 1000 Mbps Full Duplex
> Jun  8 09:50:20 defr1elcbtd01 kernel: Please double check your
> configuration values for 'O2CB_HEARTBEAT_THRESHOLD'
> Jun  8 09:50:20 defr1elcbtd01 kernel: (3804,0):o2quo_hb_up:199 node 1, 2
> total
> Jun  8 09:50:20 defr1elcbtd01 kernel: (3804,0):o2quo_set_hold:162 node
> 1, 1 total
> Jun  8 09:50:20 defr1elcbtd01 kernel: (3804,0):o2net_set_nn_state:382
> node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err -107
> -> 0
> Jun  8 09:50:20 defr1elcbtd01 kernel: (3804,0):o2net_set_nn_state:431
> queueing conn attempt in 0 jiffies
> Jun  8 09:50:22 defr1elcbtd01 kernel: (0,0):o2net_listen_data_ready:1670
> bytes: 0
> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_alloc:320 [sc
> ffff81007dddc400 refs 1 sock 0000000000000000 node 1 page
> 0000000000000000 pg_off 0] alloced
> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2net_set_nn_state:382
> node 1 sc: 0000000000000000 -> ffff81007dddc400, valid 0 -> 0, err 0 ->
> 0
> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2net_set_nn_state:431
> queueing conn attempt in 0 jiffies
> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc
> ffff81007dddc400 refs 1 sock ffff81005232e700 node 1 page
> ffff81007f272590 pg_off 0] get
> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc
> ffff81007dddc400 refs 2 sock ffff81005232e700 node 1 page
> ffff81007f272590 pg_off 0] get
> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc
> ffff81007dddc400 refs 3 sock ffff81005232e700 node 1 page
> ffff81007f272590 pg_off 0] get
> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page
> ffff81007f272590 pg_off 0] put
> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1129 [sc
> ffff81007dddc400 refs 3 sock ffff81005232e700 node 1 page
> ffff81007f272590 pg_off 0] receiving
> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc
> ffff81007dddc400 refs 3 sock ffff81005232e700 node 1 page
> ffff81007f272590 pg_off 16] get
> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2net_set_nn_state:382
> node 1 sc: ffff81007dddc400 -> ffff81007dddc400, valid 0 -> 1, err 0 ->
> 0
> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2quo_conn_up:267 node 1,
> 2 total
> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2quo_clear_hold:171 node
> 1, 0 total
> Jun  8 09:50:22 defr1elcbtd01 kernel: o2net: accepted connection from
> node defr1elcbtd02 (num 1) at 192.168.0.102:7777
> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1196 [sc
> ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page
> ffff81007f272590 pg_off 0] ret = 16
> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1129 [sc
> ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page
> ffff81007f272590 pg_off 0] receiving
> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1196 [sc
> ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page
> ffff81007f272590 pg_off 0] ret = -11
> Jun  8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc
> ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page
> ffff81007f272590 pg_off 0] put
> 
>> If there is some problem in FC port, the node will reboot and the
> other
>> node will get this message.
>>> The cluster is running fine if there is no problem with the SAN
>>> connection.
>>>
>>> How to enable verbose logging with ofcs2?
>> debugfs.ocfs2 -l will show current logging status.
>>
>> If you want to enable some log, use e.g
>> debugfs.ocfs2  -l DISK_ALLOC allow
>> and
>> debugfs.ocfs2 -l DISK_ALLOC off
>> will turn it off.
>>
>> Regards,
>> Tao
>>> Regards,
>>> Florian
>>>
>>>> florian.engelmann at bt.com wrote:
>>>>> Hello,
>>>>> our Debian etch cluster nodes are panicing because of ocfs2
> fencing
>>> if
>>>>> one SAN path fails.
>>>>>
>>>>> modinfo ocfs2
>>>>> filename:
> /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/ocfs2.ko
>>>>> author:         Oracle
>>>>> license:        GPL
>>>>> description:    OCFS2 1.3.3
>>>>> version:        1.3.3
>>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>> depends:        ocfs2_dlm,ocfs2_nodemanager,jbd
>>>>> srcversion:     0798424846E68F10172C203
>>>>>
>>>>> modinfo ocfs2_dlmfs
>>>>> filename:
>>>>> /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/dlm/ocfs2_dlmfs.ko
>>>>> author:         Oracle
>>>>> license:        GPL
>>>>> description:    OCFS2 DLMFS 1.3.3
>>>>> version:        1.3.3
>>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>> depends:        ocfs2_dlm,ocfs2_nodemanager
>>>>> srcversion:     E3780E12396118282B3C1AD
>>>>>
>>>>> defr1elcbtd02:~# modinfo ocfs2_dlm
>>>>> filename:
>>>>> /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/dlm/ocfs2_dlm.ko
>>>>> author:         Oracle
>>>>> license:        GPL
>>>>> description:    OCFS2 DLM 1.3.3
>>>>> version:        1.3.3
>>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>> depends:        ocfs2_nodemanager
>>>>> srcversion:     7DC395EA08AE4CE826C5B92
>>>>>
>>>>> modinfo ocfs2_nodemanager
>>>>> filename:
>>>>>
> /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/cluster/ocfs2_nodemanager.ko
>>>>> author:         Oracle
>>>>> license:        GPL
>>>>> description:    OCFS2 Node Manager 1.3.3
>>>>> version:        1.3.3
>>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>> depends:        configfs
>>>>> srcversion:     C4C9871302E1910B78DAE40
>>>>>
>>>>> modinfo qla2xxx
>>>>> filename:
>>>>> /lib/modules/2.6.18-6-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko
>>>>> author:         QLogic Corporation
>>>>> description:    QLogic Fibre Channel HBA Driver
>>>>> license:        GPL
>>>>> version:        8.01.07-k1
>>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>> depends:        scsi_mod,scsi_transport_fc,firmware_class
>>>>> alias:          pci:v00001077d00002100sv*sd*bc*sc*i*
>>>>> alias:          pci:v00001077d00002200sv*sd*bc*sc*i*
>>>>> alias:          pci:v00001077d00002300sv*sd*bc*sc*i*
>>>>> alias:          pci:v00001077d00002312sv*sd*bc*sc*i*
>>>>> alias:          pci:v00001077d00002322sv*sd*bc*sc*i*
>>>>> alias:          pci:v00001077d00006312sv*sd*bc*sc*i*
>>>>> alias:          pci:v00001077d00006322sv*sd*bc*sc*i*
>>>>> alias:          pci:v00001077d00002422sv*sd*bc*sc*i*
>>>>> alias:          pci:v00001077d00002432sv*sd*bc*sc*i*
>>>>> alias:          pci:v00001077d00005422sv*sd*bc*sc*i*
>>>>> alias:          pci:v00001077d00005432sv*sd*bc*sc*i*
>>>>> srcversion:     B8E1608E257391DCAFD9224
>>>>> parm:           ql2xfdmienable:Enables FDMI registratons Default
> is
>>> 0 -
>>>>> no FDMI. 1 - perfom FDMI. (int)
>>>>> parm:           extended_error_logging:Option to enable extended
>>> error
>>>>> logging, Default is 0 - no logging. 1 - log errors. (int)
>>>>> parm:           ql2xallocfwdump:Option to enable allocation of
>>> memory
>>>>> for a firmware dump during HBA initialization.  Memory allocation
>>>>> requirements vary by ISP type.  Default is 1 - allocate memory.
>>> (int)
>>>>> parm:           ql2xloginretrycount:Specify an alternate value for
>>> the
>>>>> NVRAM login retry count. (int)
>>>>> parm:           ql2xplogiabsentdevice:Option to enable PLOGI to
>>> devices
>>>>> that are not present after a Fabric scan.  This is needed for
>>> several
>>>>> broken switches. Default is 0 - no PLOGI. 1 - perfom PLOGI. (int)
>>>>> parm:           qlport_down_retry:Maximum number of command
> retries
>>> to a
>>>>> port that returns a PORT-DOWN status. (int)
>>>>> parm:           ql2xlogintimeout:Login timeout value in seconds.
>>> (int)
>>>>> modinfo dm_multipath
>>>>> filename:
>>>>> /lib/modules/2.6.18-6-amd64/kernel/drivers/md/dm-multipath.ko
>>>>> description:    device-mapper multipath target
>>>>> author:         Sistina Software <dm-devel at redhat.com>
>>>>> license:        GPL
>>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>> depends:        dm-mod
>>>>>
>>>>> modinfo dm_mod
>>>>> filename:
>>> /lib/modules/2.6.18-6-amd64/kernel/drivers/md/dm-mod.ko
>>>>> description:    device-mapper driver
>>>>> author:         Joe Thornber <dm-devel at redhat.com>
>>>>> license:        GPL
>>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>> depends:
>>>>> parm:           major:The major number of the device mapper (uint)
>>>>>
>>>>> modinfo dm_round_robin
>>>>> filename:
>>>>> /lib/modules/2.6.18-6-amd64/kernel/drivers/md/dm-round-robin.ko
>>>>> description:    device-mapper round-robin multipath path selector
>>>>> author:         Sistina Software <dm-devel at redhat.com>
>>>>> license:        GPL
>>>>> vermagic:       2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>> depends:        dm-multipath
>>>>>
>>>>> There is no self compiled software just the official repository
> was
>>>>> used.
>>>>> The nodes are connected to our two independent SANs. The storage
>>> systems
>>>>> are EMC Clariion CX3-20f and EMC Clariion CX500.
>>>>>
>>>>> multipath.conf:
>>>>> defaults {
>>>>>         rr_min_io                       1000
>>>>>         polling_interval                2
>>>>>         no_path_retry                   5
>>>>>         user_friendly_names             yes
>>>>> }
>>>>>
>>>>> blacklist {
>>>>>         devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
>>>>>         devnode "^hd[a-z][[0-9]*]"
>>>>>         devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
>>>>>         device {
>>>>>                 vendor "DGC"
>>>>>                 product "LUNZ" # EMC Management LUN
>>>>>         }
>>>>>         device {
>>>>>                 vendor "ATA"  #We do not need mutlipathing for
> local
>>>>> drives
>>>>>                 product "*"
>>>>>         }
>>>>>         device {
>>>>>                 vendor "AMI" # No multipathing for SUN Virtual
>>> devices
>>>>>                 product "*"
>>>>>         }
>>>>>         device {
>>>>>                 vendor "HITACHI" # No multipathing for local scsi
>>> disks
>>>>>                 product "H101414SCSUN146G"
>>>>>         }
>>>>> }
>>>>>
>>>>> devices {
>>>>>         ## Device attributes for EMC CLARiiON
>>>>>         device {
>>>>>                 vendor                  "DGC"
>>>>>                 product                 "*"
>>>>>                 path_grouping_policy    group_by_prio
>>>>>                 getuid_callout          "/sbin/scsi_id -g -u -s
>>>>> /block/%n"
>>>>>                 prio_callout            "/sbin/mpath_prio_emc
>>> /dev/%n"
>>>>>                 hardware_handler        "1 emc"
>>>>>                 features                "1 queue_if_no_path"
>>>>>                 no_path_retry           fail
>>>>>                 path_checker            emc_clariion
>>>>>                 path_selector           "round-robin 0"
>>>>>                 failback                immediate
>>>>>                 user_friendly_names     yes
>>>>>         }
>>>>> }
>>>>>
>>>>> multipaths {
>>>>>         multipath {
>>>>>                 wwid
>>>>> 3600601603ac511001c7c92fec775dd11
>>>>>                 alias                   stosan01_lun070
>>>>>         }
>>>>> }
>>>>>
>>>>> multipath -ll:
>>>>> stosan01_lun070 (3600601603ac511001c7c92fec775dd11) dm-7 DGC,RAID
> 5
>>>>> [size=133G][features=0][hwhandler=1 emc]
>>>>> \_ round-robin 0 [prio=2][active]
>>>>>  \_ 0:0:1:1 sdd 8:48  [active][ready]
>>>>>  \_ 1:0:1:1 sdh 8:112 [active][ready]
>>>>> \_ round-robin 0 [prio=0][enabled]
>>>>>  \_ 0:0:0:1 sdb 8:16  [active][ready]
>>>>>  \_ 1:0:0:1 sdf 8:80  [active][ready]
>>>>>
>>>>>
>>>>> As we use lvm2 we added /dev/sd* to the filter:
>>>>> filter = [ "r|/dev/cdrom|", "r|/dev/sd.*|" ]
>>>>>
>>>>> Here is what happened and what we did to reconstruct the situation
>>> to
>>>>> find a solution:
>>>>>
>>>>> On 02.06.2009 we did something wrong with the zoning on one of our
>>> two
>>>>> SANs and all servers (about 40) lost one path to the SAN. Only two
>>>>> servers crashed. Those two are our Debian etch heartbeat cluster
>>>>> described above.
>>>>> The console showed a kernel panic because of ocfs2 was fencing
> both
>>>>> nodes.
>>>>>
>>>>> This was the message:
>>>>> O2hb_write_timeout: 165 ERROR: Heartbeat write timeout to device
>>> dm-7
>>>>> after 12000 milliseconds
>>>>>
>>>>> So we decided to change the o2cb settings to:
>>>>> O2CB_HEARTBEAT_THRESHOLD=31
>>>>> O2CB_IDLE_TIMEOUT_MS=30000
>>>>> O2CB_KEEPALIVE_DELAY_MS=2000
>>>>> O2CB_RECONNECT_DELAY_MS=2000
>>>>>
>>>>> We switched all cluster resources to the 1st node to test the new
>>>>> settings on the second node. We removed the 2nd node from the
> zoning
>>> (we
>>>>> also tested shutting down the port with the same result) and got a
>>>>> different error but still ending up with a kernel panic:
>>>>>
>>>>> Jun  4 16:41:05 defr1elcbtd02 kernel: o2net: no longer connected
> to
>>> node
>>>>> defr1elcbtd01 (num 0) at 192.168.0.101:7777
>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel:  rport-0:0-0: blocked FC
>>> remote
>>>>> port time out: removing target and saving binding
>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel:  rport-0:0-1: blocked FC
>>> remote
>>>>> port time out: removing target and saving binding
>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: sd 0:0:1:1: SCSI error:
> return
>>>>> code = 0x00010000
>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: end_request: I/O error, dev
>>> sdd,
>>>>> sector 1672
>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath:
>>> Failing
>>>>> path 8:48.
>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath:
>>> Failing
>>>>> path 8:16.
>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: scsi 0:0:1:1: rejecting I/O
> to
>>>>> device being removed
>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath
> emc:
>>> long
>>>>> trespass command will be send
>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath
> emc:
>>>>> honor reservation bit will not be set (default)
>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: table: 253:7:
>>>>> multipath: error getting device
>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: ioctl: error
>>> adding
>>>>> target to table
>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath
> emc:
>>> long
>>>>> trespass command will be send
>>>>> Jun  4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath
> emc:
>>>>> honor reservation bit will not be set (default)
>>>>> Jun  4 16:41:29 defr1elcbtd02 kernel: device-mapper: multipath
> emc:
>>>>> emc_pg_init: sending switch-over command
>>>>> Jun  4 16:42:01 defr1elcbtd02 kernel:
>>>>> (10751,1):dlm_send_remote_convert_request:395 ERROR: status = -107
>>>>> Jun  4 16:42:01 defr1elcbtd02 kernel:
>>>>> (10751,1):dlm_wait_for_node_death:374
>>> 5EE89BC01EFC405E9197C198DEEAE678:
>>>>> waiting 5000ms for notification of death of node 0
>>>>> Jun  4 16:42:07 defr1elcbtd02 kernel:
>>>>> (10751,1):dlm_send_remote_convert_request:395 ERROR: status = -107
>>>>> Jun  4 16:42:07 defr1elcbtd02 kernel:
>>>>> (10751,1):dlm_wait_for_node_death:374
>>> 5EE89BC01EFC405E9197C198DEEAE678:
>>>>> waiting 5000ms for notification of death of node 0
>>>>> [...]
>>>>> After 60 seconds:
>>>>>
>>>>> (8,0): o2quo_make_decision:143 ERROR: fending this node because it
>>> is
>>>>> connected to a half-quorum of 1 out of 2 nodes which doesn't
> include
>>> the
>>>>> lowest active node 0
>>>>>
>>>>>
>>>>> multipath -ll changed to:
>>>>> stosan01_lun070 (3600601603ac511001c7c92fec775dd11) dm-7 DGC,RAID
> 5
>>>>> [size=133G][features=0][hwhandler=1 emc]
>>>>> \_ round-robin 0 [prio=1][active]
>>>>>  \_ 0:0:1:1 sdd 8:48  [active][ready]
>>>>> \_ round-robin 0 [prio=0][enabled]
>>>>>  \_ 0:0:0:1 sdb 8:16  [active][ready]
>>>>>
>>>>> The ocfs2 filesystem is still mounted an writable. Even if I
> enable
>>> the
>>>>> zoneing (or the FC port) again within the 60 seconds ocfs2 does
> not
>>>>> reconnect to node 1 and panics the kernel after 60 seconds while
>>>>> multipath -ll shows both path again.
>>>>>
>>>>> I do not understand at all what the Ethernet heartbeat connection
> of
>>>>> ocfs2 has to do with the SAN connection.
>>>>>
>>>>> The strangest thing at all is - this does not happen always! After
>>> some
>>>>> reboots the system keeps running stable even if I shutdown a FC
> port
>>> and
>>>>> enable it again many times. There is no constant behaviour... It
>>> happens
>>>>> most of the times, but at about 10% it does not happen and
>>> everything is
>>>>> working as intended.
>>>>>
>>>>> Any explanations or ideas what causes this behaviour?
>>>>>
>>>>> I will test this on Debian lenny to see if the Debian version
> makes
>>> a
>>>>> difference.
>>>>>
>>>>> Best regards,
>>>>> Florian
>>>>>
>>>>> _______________________________________________
>>>>> Ocfs2-users mailing list
>>>>> Ocfs2-users at oss.oracle.com
>>>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>>>>
>>> _______________________________________________
>>> Ocfs2-users mailing list
>>> Ocfs2-users at oss.oracle.com
>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users



More information about the Ocfs2-users mailing list