[Ocfs2-users] disable heartbeat nic caused ocfs2 errors

Sunil Mushran sunil.mushran at oracle.com
Mon Sep 12 11:01:49 PDT 2011


ocfs2 uses a disk heartbeat to detect node liveness and a network
heartbeat to detect link liveness. Both must be working for the cluster
to function. If the network link between two nodes breaks, one of the
two nodes is fenced.
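
To make the fencing rule concrete, here is a rough Python sketch of how
a node could decide whether it keeps quorum. The tie-break for an even
node count (the half that includes the lowest-numbered node survives) is
my understanding of the o2cb quorum logic and is an assumption here, as
are the function and parameter names; treat it as an illustration, not
the kernel code.

```python
def has_quorum(my_node, heartbeating, connected):
    """Return True if my_node should stay up, False if it should fence itself.

    heartbeating: node numbers seen alive via the disk heartbeat
    connected:    node numbers this node can reach over the network
    (Both names are hypothetical; the tie-break rule is an assumption.)
    """
    live = set(heartbeating) | {my_node}
    reachable = (set(connected) & live) | {my_node}
    half = len(live) // 2

    if len(live) % 2 == 1:
        # Odd number of live nodes: need a strict majority.
        return len(reachable) > half

    # Even number of live nodes: a strict majority wins outright;
    # an exact half wins only if it holds the lowest-numbered node.
    if len(reachable) > half:
        return True
    return len(reachable) == half and min(live) in reachable


# Two-node cluster, network link down, both still heartbeating on disk:
print(has_quorum(0, {0, 1}, set()))  # node 0 keeps quorum
print(has_quorum(1, {0, 1}, set()))  # node 1 loses quorum and is fenced
```

Under these assumptions, a two-node cluster that loses its interconnect
always fences the higher-numbered node.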

The stack trace below indicates that the two nodes are not able to
communicate. Both nodes are waiting on the quorum to fence one of them.
It appears you have raised the disk heartbeat timeout above 2 minutes.
I would imagine one of the nodes reset once that timeout expired.
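
As a quick check of the arithmetic: the disk heartbeat timeout follows
from O2CB_HEARTBEAT_THRESHOLD, with one heartbeat every 2 seconds. The
formula below is the one I recall from the OCFS2 FAQ, so treat it as an
assumption and verify it against your version. The helper names are
made up for this example.

```python
HEARTBEAT_INTERVAL_SECS = 2  # assumed disk heartbeat interval


def disk_heartbeat_timeout_secs(threshold):
    """Timeout in seconds for a given O2CB_HEARTBEAT_THRESHOLD value."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def threshold_for_timeout(timeout_secs):
    """O2CB_HEARTBEAT_THRESHOLD value needed for a given timeout."""
    return timeout_secs // HEARTBEAT_INTERVAL_SECS + 1


print(disk_heartbeat_timeout_secs(31))  # 60
print(threshold_for_timeout(120))       # 61
```

So a timeout above 2 minutes would correspond to a threshold above 61,
if the formula holds for your release.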

On 09/10/2011 08:54 PM, Hai Tao wrote:
> Is the ocfs2 heartbeat sent over the network, or is it just written to a file on the shared disk?
>
> If the heartbeat is lost, what should happen? What if only one node is writing and the other is idle? Will it still cause any file system issue?
>
>
> Thanks.
> Hai Tao
>
>
> ------------------------------------------------------------------------
> From: taoh666 at hotmail.com
> To: ocfs2-users at oss.oracle.com
> Date: Sat, 10 Sep 2011 00:50:23 -0700
> Subject: [Ocfs2-users] disable heartbeat nic caused ocfs2 errors
>
> I have a two-node ocfs2 cluster, and I disabled the heartbeat NIC with "ifdown eth1". I got the following weird logs on both nodes:
>
> Sep  7 10:45:49 dbtest-01 kernel: o2net: connection to node dbtest-02 (num 1) at 10.194.59.65:7777 has been idle for 30.0 seconds, shutting it down.
> Sep  7 10:45:49 dbtest-01 kernel: (swapper,0,3):o2net_idle_timer:1503 here are some times that might help debug the situation: (tmr 1315417519.185025 now 1315417549.183798 dr 1315417519.185016 adv 1315417519.185032:1315417519.185032 func (b9bb7168:504) 1315417518.872227:1315417518.872268)
> Sep  7 10:45:49 dbtest-01 kernel: o2net: no longer connected to node dbtest-02 (num 1) at 10.194.59.65:7777
> Sep  7 10:45:49 dbtest-01 kernel: (dlm_thread,3781,2):dlm_send_proxy_ast_msg:457 ERROR: status = -112
> Sep  7 10:45:49 dbtest-01 kernel: (oracle,26129,1):dlm_do_master_request:1334 ERROR: link to 1 went down!
> Sep  7 10:45:49 dbtest-01 kernel: (oracle,26129,1):dlm_get_lock_resource:917 ERROR: status = -112
> Sep  7 10:45:49 dbtest-01 kernel: (dlm_thread,4256,1):dlm_send_proxy_ast_msg:457 ERROR: status = -112
> Sep  7 10:45:49 dbtest-01 kernel: (dlm_thread,4256,1):dlm_flush_asts:604 ERROR: status = -112
> Sep  7 10:45:49 dbtest-01 kernel: (dlm_thread,3781,2):dlm_flush_asts:604 ERROR: status = -112
> Sep  7 10:46:19 dbtest-01 kernel: (o2net,3736,3):o2net_connect_expired:1664 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
> Sep  7 10:46:19 dbtest-01 kernel: o2net: accepted connection from node dbtest-02 (num 1) at 10.194.59.65:7777
> Sep  7 10:48:37 dbtest-01 kernel: INFO: task events/0:10 blocked for more than 120 seconds.
> Sep  7 10:48:37 dbtest-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep  7 10:48:37 dbtest-01 kernel: events/0      D ffff810001004420     0    10      1            11     9 (L-TLB)
> Sep  7 10:48:37 dbtest-01 kernel:  ffff81083ffedc80 0000000000000046 ffffffff80333680 0000000000000001
> Sep  7 10:48:37 dbtest-01 kernel:  0000000000000400 000000000000000a ffff81083ffe1820 ffffffff80309b60
> Sep  7 10:48:37 dbtest-01 kernel:  0030b62498ce7b3f 000000000000416b ffff81083ffe1a08 0000000000000000
> Sep  7 10:48:37 dbtest-01 kernel: Call Trace:
> Sep  7 10:48:37 dbtest-01 kernel: Call Trace:
> Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff80064167>] wait_for_completion+0x79/0xa2
> Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff8008e16d>] default_wake_function+0x0/0xe
> Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff884e64b7>] :ocfs2:ocfs2_wait_for_mask+0xd/0x19
> Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff884e78d8>] :ocfs2:ocfs2_cluster_lock+0x9ae/0x9d3
> Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff885013e5>] :ocfs2:ocfs2_orphan_scan_work+0x0/0x83
> Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff884ed1e4>] :ocfs2:ocfs2_orphan_scan_lock+0x55/0x84
> Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff884fc59b>] :ocfs2:ocfs2_queue_orphan_scan+0x32/0x147
> Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff885013ff>] :ocfs2:ocfs2_orphan_scan_work+0x1a/0x83
> Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff8004dc37>] run_workqueue+0x94/0xe4
> Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff8004a472>] worker_thread+0x0/0x122
> Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff8004a562>] worker_thread+0xf0/0x122
> Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff8008e16d>] default_wake_function+0x0/0xe
> Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff80032bdc>] kthread+0xfe/0x132
> Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff8005efb1>] child_rip+0xa/0x11
> Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff80032ade>] kthread+0x0/0x132
> Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff8005efa7>] child_rip+0x0/0x11
> Sep  7 10:48:37 dbtest-01 kernel:
>
> Does anyone know why this happened?
>
> Thanks.
>
> _______________________________________________ Ocfs2-users mailing list Ocfs2-users at oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
