[Ocfs2-users] Ocfs2 clients hang

gjprabu gjprabu at zohocorp.com
Tue Dec 22 18:45:01 PST 2015


Hi Joseph,



        I have enabled the log masks you requested. Will the DLM log now capture enough to analyze this further? Also, do we need to change any network-side settings to allow the maximum number of packets?



debugfs.ocfs2 -l


DLM allow

MSG off

TCP off

CONN off

VOTE off

DLM_DOMAIN off

HB_BIO off

BASTS allow

DLMFS off

ERROR allow

DLM_MASTER off

KTHREAD off

NOTICE allow

QUORUM off

SOCKET off

DLM_GLUE off

DLM_THREAD off

DLM_RECOVERY allow

HEARTBEAT off

CLUSTER off
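
A listing like the one above can be checked mechanically before trying to reproduce the hang. The following is only a minimal sketch: it assumes the `debugfs.ocfs2 -l` output is available as text with one "NAME state" pair per line, as shown in this thread, and the helper names `parse_masks` and `missing_masks` are hypothetical, not part of ocfs2-tools.

```python
# Masks requested in this thread: DLM, BASTS and DLM_RECOVERY.
REQUIRED = ("DLM", "BASTS", "DLM_RECOVERY")


def parse_masks(listing):
    """Map each log mask name to its state, given `debugfs.ocfs2 -l` style text."""
    masks = {}
    for line in listing.splitlines():
        parts = line.split()
        # Only accept lines of the form "NAME state"; skip blanks and anything else.
        if len(parts) == 2:
            name, state = parts
            masks[name] = state
    return masks


def missing_masks(listing, required=REQUIRED):
    """Return the required masks that are not currently set to 'allow'."""
    masks = parse_masks(listing)
    return [name for name in required if masks.get(name) != "allow"]


# Abbreviated sample taken from the listing in this message.
sample = """DLM allow
MSG off
BASTS allow
ERROR allow
DLM_RECOVERY allow
"""

print(missing_masks(sample))
```

An empty result means every requested mask is switched on; any name returned still needs to be enabled with `debugfs.ocfs2 -l <NAME> allow`, following the form of the example given later in this thread.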



Regards

Prabu





---- On Wed, 23 Dec 2015 07:51:38 +0530 Joseph Qi <joseph.qi at huawei.com> wrote ---- 




Please also switch on BASTS and DLM_RECOVERY. 

 

On 2015/12/23 10:11, gjprabu wrote: 

> Hi Joseph, 

> 

> Our current setup has the settings below, and DLM is now allowed (DLM allow). Do you suggest any other options to get more logs? 

> 

> debugfs.ocfs2 -l 

> DLM off ( DLM allow) 

> MSG off 

> TCP off 

> CONN off 

> VOTE off 

> DLM_DOMAIN off 

> HB_BIO off 

> BASTS off 

> DLMFS off 

> ERROR allow 

> DLM_MASTER off 

> KTHREAD off 

> NOTICE allow 

> QUORUM off 

> SOCKET off 

> DLM_GLUE off 

> DLM_THREAD off 

> DLM_RECOVERY off 

> HEARTBEAT off 

> CLUSTER off 

> 

> Regards 

> Prabu 


> 

> 

> 

> ---- On Wed, 23 Dec 2015 07:30:54 +0530 Joseph Qi <joseph.qi at huawei.com> wrote ---- 

> 

> So you mean the four nodes were manually rebooted? If so, you must 

> analyze the messages logged before you rebooted. 

> If there are not enough messages, you can switch on more of them. IMO, 

> most hang problems are caused by DLM bugs, so I suggest switching on the 

> DLM-related logs and reproducing the hang. 

> You can use debugfs.ocfs2 -l to show all message switches and switch on 

> the ones you want. For example, 

> # debugfs.ocfs2 -l DLM allow 

> 

> Thanks, 

> Joseph 

> 

> On 2015/12/22 21:47, gjprabu wrote: 

> > Hi Joseph, 

> > 

> > We are frequently facing an ocfs2 server hang problem: suddenly 4 nodes go into a hung state, all except 1 node. After a reboot everything returns to normal. This behavior has happened many times. Is there any way to debug and fix this issue? 

> > 

> > Regards 

> > Prabu 

> > 

> > 

> > ---- On Tue, 22 Dec 2015 16:30:52 +0530 Joseph Qi <joseph.qi at huawei.com> wrote ---- 

> > 

> > Hi Prabu, 

> > From the log you provided, I can only see that node 5 disconnected from 

> > nodes 2, 3, 1 and 4. It seems that something went wrong on the four 

> > nodes, and node 5 did recovery for them. After that, the four nodes 

> > joined again. 

> > 

> > On 2015/12/22 16:23, gjprabu wrote: 

> > > Hi, 

> > > 

> > > Could anybody please help me with this issue? 

> > > 

> > > Regards 

> > > Prabu 

> > > 

> > > ---- On Mon, 21 Dec 2015 15:16:49 +0530 gjprabu <gjprabu at zohocorp.com> wrote ---- 

> > > 

> > > Dear Team, 

> > > 

> > > Ocfs2 clients often hang and become unusable. Please find the logs below. If you can provide a solution, it will be highly appreciated. 

> > > 

> > > 

> > > [3659684.042530] o2dlm: Node 4 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes 

> > > 

> > > [3992993.101490] (kworker/u192:1,63211,24):dlm_create_lock_handler:515 ERROR: dlm status = DLM_IVLOCKID 

> > > [3993002.193285] (kworker/u192:1,63211,24):dlm_deref_lockres_handler:2267 ERROR: A895BC216BE641A8A7E20AA89D57E051:M0000000000000062d2dcd000000000: bad lockres name 

> > > [3993032.457220] (kworker/u192:0,67418,11):dlm_do_assert_master:1680 ERROR: Error -112 when sending message 502 (key 0xc3460ae7) to node 2 

> > > [3993062.547989] (kworker/u192:0,67418,11):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 2 

> > > [3993064.860776] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 2 

> > > [3993064.860804] o2cb: o2dlm has evicted node 2 from domain A895BC216BE641A8A7E20AA89D57E051 

> > > [3993073.280062] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 2 

> > > [3993094.623695] (dlm_thread,46268,8):dlm_send_proxy_ast_msg:484 ERROR: A895BC216BE641A8A7E20AA89D57E051: res S000000000000000000000200000000, error -112 send AST to node 4 

> > > [3993094.624281] (dlm_thread,46268,8):dlm_flush_asts:605 ERROR: status = -112 

> > > [3993094.687668] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -112 when sending message 502 (key 0xc3460ae7) to node 3 

> > > [3993094.815662] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 ERROR: Error -112 when sending message 514 (key 0xc3460ae7) to node 1 

> > > [3993094.816118] (dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -112 

> > > [3993124.778525] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 ERROR: Error -107 when sending message 514 (key 0xc3460ae7) to node 3 

> > > [3993124.779032] (dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -107 

> > > [3993133.332516] o2cb: o2dlm has evicted node 3 from domain A895BC216BE641A8A7E20AA89D57E051 

> > > [3993139.915122] o2cb: o2dlm has evicted node 1 from domain A895BC216BE641A8A7E20AA89D57E051 

> > > [3993147.071956] o2cb: o2dlm has evicted node 4 from domain A895BC216BE641A8A7E20AA89D57E051 

> > > [3993147.071968] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 ERROR: Error -107 when sending message 514 (key 0xc3460ae7) to node 4 

> > > [3993147.071975] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4 

> > > [3993147.071997] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4 

> > > [3993147.072001] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4 

> > > [3993147.072005] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4 

> > > [3993147.072009] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4 

> > > [3993147.075019] (dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -107 

> > > [3993147.075353] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 1 went down! 

> > > [3993147.075701] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107 

> > > [3993147.076001] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 3 went down! 

> > > [3993147.076329] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107 

> > > [3993147.076634] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 4 went down! 

> > > [3993147.076968] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107 

> > > [3993147.077275] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1236 ERROR: node down! 1 

> > > [3993147.077591] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1229 node 3 up while restarting 

> > > [3993147.077594] (dlm_reco_thread,46269,7):dlm_wait_for_lock_mastery:1053 ERROR: status = -11 

> > > [3993155.171570] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 3 went down! 

> > > [3993155.171874] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107 

> > > [3993155.172150] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 4 went down! 

> > > [3993155.172446] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107 

> > > [3993155.172719] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1236 ERROR: node down! 3 

> > > [3993155.173001] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1229 node 4 up while restarting 

> > > [3993155.173003] (dlm_reco_thread,46269,7):dlm_wait_for_lock_mastery:1053 ERROR: status = -11 

> > > [3993155.173283] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 4 went down! 

> > > [3993155.173581] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107 

> > > [3993155.173858] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1236 ERROR: node down! 4 

> > > [3993155.174135] (dlm_reco_thread,46269,7):dlm_wait_for_lock_mastery:1053 ERROR: status = -11 

> > > [3993155.174458] o2dlm: Node 5 (me) is the Recovery Master for the dead node 2 in domain A895BC216BE641A8A7E20AA89D57E051 

> > > [3993158.361220] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051 

> > > [3993158.361228] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1 

> > > [3993158.361305] o2dlm: Node 5 (me) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051 

> > > [3993161.833543] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051 

> > > [3993161.833551] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 3 

> > > [3993161.833620] o2dlm: Node 5 (me) is the Recovery Master for the dead node 3 in domain A895BC216BE641A8A7E20AA89D57E051 

> > > [3993165.188817] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051 

> > > [3993165.188826] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 4 

> > > [3993165.188907] o2dlm: Node 5 (me) is the Recovery Master for the dead node 4 in domain A895BC216BE641A8A7E20AA89D57E051 

> > > [3993168.551610] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051 

> > > 

> > > [3996486.869628] o2dlm: Node 4 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 4 5 ) 2 nodes 

> > > [3996778.703664] o2dlm: Node 4 leaves domain A895BC216BE641A8A7E20AA89D57E051 ( 5 ) 1 nodes 

> > > [3997012.295536] o2dlm: Node 2 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 2 5 ) 2 nodes 

> > > [3997099.498157] o2dlm: Node 4 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 2 4 5 ) 3 nodes 

> > > [3997783.633140] o2dlm: Node 1 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 4 5 ) 4 nodes 

> > > [3997864.039868] o2dlm: Node 3 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes 

> > > 

> > > Regards 

> > > Prabu 


> > > 

> > > 

> > > 

> > > 

> > > 

> > > _______________________________________________ 

> > > Ocfs2-users mailing list 

> > > Ocfs2-users at oss.oracle.com 

> > > https://oss.oracle.com/mailman/listinfo/ocfs2-users 

> > > 

> > 

> > 

> > 

> 

> 

> 
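
The kernel log quoted in this thread reports failures only as negative error numbers (-112, -107, -11) and as a series of eviction messages. The sketch below shows one way to tally them when reading such a log. It is not an ocfs2 tool: the function name `summarize` is hypothetical, the error-number table is hard-coded with the standard Linux values for just the three codes seen here, and the regular expressions assume only the message shapes that appear in the quoted log.

```python
import re
from collections import Counter

# Standard Linux errno values for the three codes that appear in the quoted log.
LINUX_ERRNO = {
    11: "EAGAIN",      # resource temporarily unavailable, try again
    107: "ENOTCONN",   # transport endpoint is not connected
    112: "EHOSTDOWN",  # host is down
}

# Matches "Error -112", "error -112" and "status = -107" as seen in the log.
ERR_RE = re.compile(r"(?:[Ee]rror|status =)\s+(-\d+)")
# Matches "o2dlm has evicted node 2 from domain ...".
EVICT_RE = re.compile(r"o2dlm has evicted node (\d+)")


def summarize(log_text):
    """Count error codes by name and list evicted nodes in log order."""
    errors = Counter()
    evicted = []
    for line in log_text.splitlines():
        m = ERR_RE.search(line)
        if m:
            code = -int(m.group(1))
            # Fall back to the bare number for codes not in the small table.
            errors[LINUX_ERRNO.get(code, str(code))] += 1
        m = EVICT_RE.search(line)
        if m:
            evicted.append(int(m.group(1)))
    return errors, evicted


# Four lines copied from the log quoted in this thread.
sample = """\
[3993032.457220] (kworker/u192:0,67418,11):dlm_do_assert_master:1680 ERROR: Error -112 when sending message 502 (key 0xc3460ae7) to node 2
[3993062.547989] (kworker/u192:0,67418,11):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 2
[3993064.860804] o2cb: o2dlm has evicted node 2 from domain A895BC216BE641A8A7E20AA89D57E051
[3993155.174135] (dlm_reco_thread,46269,7):dlm_wait_for_lock_mastery:1053 ERROR: status = -11
"""

errors, evicted = summarize(sample)
print(dict(errors), evicted)
```

Read this way, the -112 and -107 errors show node 5 failing to reach its peers over the network before evicting them, which is consistent with the reading given earlier in the thread that node 5 lost its connections to the other four nodes and then ran recovery for each of them.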

 

 





