[Ocfs2-users] disable heartbeat nic caused ocfs2 errors

Hai Tao taoh666 at hotmail.com
Sat Sep 10 20:54:47 PDT 2011


Is the OCFS2 heartbeat sent over the network, or does it just update a file on the shared disk?
 
If the heartbeat is lost, what should happen? What if only one node is writing and the other is idle? Will that still cause file system issues?


Thanks.
 
Hai Tao
 




From: taoh666 at hotmail.com
To: ocfs2-users at oss.oracle.com
Date: Sat, 10 Sep 2011 00:50:23 -0700
Subject: [Ocfs2-users] disable heartbeat nic caused ocfs2 errors





I have a two-node OCFS2 cluster, and I disabled the heartbeat NIC with "ifdown eth1". I got the following weird logs on both nodes:
 
Sep  7 10:45:49 dbtest-01 kernel: o2net: connection to node dbtest-02 (num 1) at 10.194.59.65:7777 has been idle for 30.0 seconds, shutting it down.
Sep  7 10:45:49 dbtest-01 kernel: (swapper,0,3):o2net_idle_timer:1503 here are some times that might help debug the situation: (tmr 1315417519.185025 now 1315417549.183798 dr 1315417519.185016 adv 1315417519.185032:1315417519.185032 func (b9bb7168:504) 1315417518.872227:1315417518.872268)
Sep  7 10:45:49 dbtest-01 kernel: o2net: no longer connected to node dbtest-02 (num 1) at 10.194.59.65:7777
Sep  7 10:45:49 dbtest-01 kernel: (dlm_thread,3781,2):dlm_send_proxy_ast_msg:457 ERROR: status = -112
Sep  7 10:45:49 dbtest-01 kernel: (oracle,26129,1):dlm_do_master_request:1334 ERROR: link to 1 went down!
Sep  7 10:45:49 dbtest-01 kernel: (oracle,26129,1):dlm_get_lock_resource:917 ERROR: status = -112
Sep  7 10:45:49 dbtest-01 kernel: (dlm_thread,4256,1):dlm_send_proxy_ast_msg:457 ERROR: status = -112
Sep  7 10:45:49 dbtest-01 kernel: (dlm_thread,4256,1):dlm_flush_asts:604 ERROR: status = -112
Sep  7 10:45:49 dbtest-01 kernel: (dlm_thread,3781,2):dlm_flush_asts:604 ERROR: status = -112
Sep  7 10:46:19 dbtest-01 kernel: (o2net,3736,3):o2net_connect_expired:1664 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
Sep  7 10:46:19 dbtest-01 kernel: o2net: accepted connection from node dbtest-02 (num 1) at 10.194.59.65:7777
Sep  7 10:48:37 dbtest-01 kernel: INFO: task events/0:10 blocked for more than 120 seconds.
Sep  7 10:48:37 dbtest-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  7 10:48:37 dbtest-01 kernel: events/0      D ffff810001004420     0    10      1            11     9 (L-TLB)
Sep  7 10:48:37 dbtest-01 kernel:  ffff81083ffedc80 0000000000000046 ffffffff80333680 0000000000000001
Sep  7 10:48:37 dbtest-01 kernel:  0000000000000400 000000000000000a ffff81083ffe1820 ffffffff80309b60
Sep  7 10:48:37 dbtest-01 kernel:  0030b62498ce7b3f 000000000000416b ffff81083ffe1a08 0000000000000000
Sep  7 10:48:37 dbtest-01 kernel: Call Trace:
Sep  7 10:48:37 dbtest-01 kernel: Call Trace:
Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff80064167>] wait_for_completion+0x79/0xa2
Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff8008e16d>] default_wake_function+0x0/0xe
Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff884e64b7>] :ocfs2:ocfs2_wait_for_mask+0xd/0x19
Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff884e78d8>] :ocfs2:ocfs2_cluster_lock+0x9ae/0x9d3
Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff885013e5>] :ocfs2:ocfs2_orphan_scan_work+0x0/0x83
Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff884ed1e4>] :ocfs2:ocfs2_orphan_scan_lock+0x55/0x84
Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff884fc59b>] :ocfs2:ocfs2_queue_orphan_scan+0x32/0x147
Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff885013ff>] :ocfs2:ocfs2_orphan_scan_work+0x1a/0x83
Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff8004dc37>] run_workqueue+0x94/0xe4
Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff8004a472>] worker_thread+0x0/0x122
Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff8004a562>] worker_thread+0xf0/0x122
Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff8008e16d>] default_wake_function+0x0/0xe
Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff80032bdc>] kthread+0xfe/0x132
Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff8005efb1>] child_rip+0xa/0x11
Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff80032ade>] kthread+0x0/0x132
Sep  7 10:48:37 dbtest-01 kernel:  [<ffffffff8005efa7>] child_rip+0x0/0x11
Sep  7 10:48:37 dbtest-01 kernel:
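
For anyone reading the o2net_idle_timer line above, here is a rough Python sketch that pulls out the "tmr" and "now" timestamps and prints the gap between them. It assumes "tmr" is when the idle timer was last armed and "now" is when it fired; that reading is my assumption, not something the log states. The helper names (idle_seconds, errno_name) are made up for this example. The errno table is also an assumption: it uses Linux errno numbering, where 112 is EHOSTDOWN, to decode the "status = -112" lines.

```python
import re
from decimal import Decimal

# o2net_idle_timer line copied from the dbtest-01 log above.
LINE = (
    "(swapper,0,3):o2net_idle_timer:1503 here are some times that might help "
    "debug the situation: (tmr 1315417519.185025 now 1315417549.183798 "
    "dr 1315417519.185016 adv 1315417519.185032:1315417519.185032 "
    "func (b9bb7168:504) 1315417518.872227:1315417518.872268)"
)

# Assumption: Linux errno numbering; only the value seen in the log is listed.
LINUX_ERRNO = {112: "EHOSTDOWN"}


def idle_seconds(line):
    """Return (now - tmr) in seconds as a Decimal, or None if not found."""
    match = re.search(r"tmr (\d+\.\d+) now (\d+\.\d+)", line)
    if match is None:
        return None
    # Decimal avoids float rounding on large epoch timestamps.
    tmr, now = Decimal(match.group(1)), Decimal(match.group(2))
    return now - tmr


def errno_name(status):
    """Map a negative kernel status such as -112 to a symbolic name."""
    return LINUX_ERRNO.get(-status, "unknown")


if __name__ == "__main__":
    print("idle for", idle_seconds(LINE), "seconds")  # idle for 29.998773 seconds
    print("status -112 is", errno_name(-112))         # status -112 is EHOSTDOWN
```

The gap comes out just under 30 seconds, which matches the "has been idle for 30.0 seconds" message on the first log line.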

Does anyone know why this happened?
 
Thanks.

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users at oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users