[Ocfs2-devel] A o2cb DLM problem

Gang He ghe at suse.com
Thu Oct 12 00:14:42 PDT 2017


Hello Junxiao,


Thanks for the quick reply; the information is very helpful.

-Gang


>>> 
> On 10/12/2017 02:37 PM, Gang He wrote:
>> Hello list,
>> 
>> We got an o2cb DLM problem from a customer who is using the o2cb stack for
>> the OCFS2 file system on SLES12SP1 (3.12.49-11-default).
>> The problem description is as follows:
>> 
>> The customer has a three-node Oracle RAC cluster:
>> gal7gblr2084
>> gal7gblr2085
>> gal7gblr2086
>> 
>> On each node they have configured two OCFS2 resources as file systems. The
>> two nodes gal7gblr2085 and gal7gblr2086 hung and went into a loop trying to
>> kill each other, and the customer wants a root cause analysis.
>> Anyway, all I see in the logs are these messages flooding /var/log/messages:
>> 
>> 2017-10-05T06:50:25.980773+01:00 gal7gblr2085 kernel: [16874541.314199] o2net: Connection to node gal7gblr2086 (num 2) at 10.233.217.12:7777 has been idle for 30.5 secs, shutting it down.
> This looks like an old kernel. Shutting down the connection on idle timeout
> can lose DLM messages, which may cause a hang. Please apply the
> following 3 patches.
> 
> 8c7b638cece1 ocfs2: quorum: add a log for node not fenced
> 8e9801dfe37c ocfs2: o2net: set tcp user timeout to max value
> c43c363def04 ocfs2: o2net: don't shutdown connection when idle timeout
> 
> Thanks,
> Junxiao.
>> 2017-10-05T06:50:37.456786+01:00 gal7gblr2085 kernel: [16874552.778726] o2net: No longer connected to node gal7gblr2086 (num 2) at 10.233.217.12:7777
>> 2017-10-05T06:50:45.176798+01:00 gal7gblr2085 kernel: [16874560.487834] (kworker/u64:1,13245,10):dlm_send_remote_convert_request:392 ERROR: Error -107 when sending message 504 (key 0x4a68dd81) to node 2
>> 2017-10-05T06:50:45.176812+01:00 gal7gblr2085 kernel: [16874560.487838] o2dlm: Waiting on the death of node 2 in domain 18AE08328428452BA610E7BDE26F5246
>> 2017-10-05T06:50:50.284796+01:00 gal7gblr2085 kernel: [16874565.589996] (kworker/u64:1,13245,10):dlm_send_remote_convert_request:392 ERROR: Error -107 when sending message 504 (key 0x4a68dd81) to node 2
>> 2017-10-05T06:50:50.284811+01:00 gal7gblr2085 kernel: [16874565.590000] o2dlm: Waiting on the death of node 2 in domain 18AE08328428452BA610E7BDE26F5246
>> 2017-10-05T06:50:55.400808+01:00 gal7gblr2085 kernel: [16874570.700448] (kworker/u64:1,13245,10):dlm_send_remote_convert_request:392 ERROR: Error -107 when sending message 504 (key 0x4a68dd81) to node 2
>> 2017-10-05T06:50:55.400824+01:00 gal7gblr2085 kernel: [16874570.700452] o2dlm: Waiting on the death of node 2 in domain 18AE08328428452BA610E7BDE26F5246
>> 2017-10-05T06:51:00.512766+01:00 gal7gblr2085 kernel: [16874575.808944] (kworker/u64:1,13245,26):dlm_send_remote_convert_request:392 ERROR: Error -107 when sending message 504 (key 0x4a68dd81) to node 2
>> 2017-10-05T06:51:00.512783+01:00 gal7gblr2085 kernel: [16874575.808948] o2dlm: Waiting on the death of node 2 in domain 18AE08328428452BA610E7BDE26F5246
>> 2017-10-05T06:51:02.456785+01:00 gal7gblr2085 kernel: [16874577.749286] (ora_diag_rcp2,24339,0):dlm_do_master_request:1344 ERROR: link to 2 went down!
>> 2017-10-05T06:51:02.456797+01:00 gal7gblr2085 kernel: [16874577.749289] (ora_diag_rcp2,24339,0):dlm_get_lock_resource:929 ERROR: status = -107
>> 2017-10-05T06:51:05.632955+01:00 gal7gblr2085 kernel: [16874580.920124] (kworker/u64:1,13245,26):dlm_send_remote_convert_request:392 ERROR: Error -107 when sending message 504 (key 0x4a68dd81) to node 2
>> 2017-10-05T06:51:05.632973+01:00 gal7gblr2085 kernel: [16874580.920132] o2dlm: Waiting on the death of node 2 in domain 18AE08328428452BA610E7BDE26F5246
>> 2017-10-05T06:51:07.976787+01:00 gal7gblr2085 kernel: [16874583.262561] o2net: No connection established with node 2 after 30.0 seconds, giving up.
>> 2017-10-05T10:03:38.439542+01:00 gal7gblr2084 kernel: [1911889.097543] (mdb_psp0_-mgmtd,21126,0):dlm_send_remote_unlock_request:358 ERROR: Error -107 when sending message 506 (key 0x4a68dd81) to node 1
>> 2017-10-05T10:03:38.439543+01:00 gal7gblr2084 kernel: [1911889.097547] (mdb_psp0_-mgmtd,21126,0):dlm_send_remote_unlock_request:358 ERROR: Error -107 when sending message 506 (key 0x4a68dd81) to node 1
>> 
>> 
>> Have you encountered this problem when using the o2cb stack? We mainly
>> focus on the pcmk stack, but I still want to help this customer find the
>> root cause.
>> 
>> 
>> Thanks
>> Gang


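For anyone triaging a similar flood of messages: the "Error -107" repeated in the quoted log corresponds to the Linux errno ENOTCONN ("Transport endpoint is not connected"), which fits the o2net connection having been shut down. Below is a minimal Python sketch that counts such DLM send failures per function and target node. It is not part of the original thread; the helper name summarize, the regular expression, and the assumption that each log record sits on a single unwrapped line (as in /var/log/messages) are all illustrative.

```python
import re
from collections import Counter

# Linux errno value seen in the quoted log; hard-coded so the sketch does
# not depend on the platform it is run on. (Assumption: logs come from Linux.)
LINUX_ERRNO = {107: "ENOTCONN"}

# Hypothetical pattern for the "Error -N when sending message ..." records.
SEND_ERROR = re.compile(
    r"(?P<func>dlm_\w+):\d+ ERROR: Error -(?P<err>\d+) "
    r"when sending message (?P<msg>\d+) \(key 0x[0-9a-fA-F]+\) "
    r"to node (?P<node>\d+)"
)


def summarize(lines):
    """Count DLM send failures keyed by (function, errno name, target node)."""
    counts = Counter()
    for line in lines:
        match = SEND_ERROR.search(line)
        if match is None:
            continue  # not a DLM send-failure record
        code = int(match.group("err"))
        name = LINUX_ERRNO.get(code, "errno %d" % code)
        counts[(match.group("func"), name, int(match.group("node")))] += 1
    return counts


# Three records taken from the log above, each on one line.
SAMPLE = [
    "2017-10-05T06:50:45.176798+01:00 gal7gblr2085 kernel: [16874560.487834] "
    "(kworker/u64:1,13245,10):dlm_send_remote_convert_request:392 ERROR: "
    "Error -107 when sending message 504 (key 0x4a68dd81) to node 2",
    "2017-10-05T06:50:45.176812+01:00 gal7gblr2085 kernel: [16874560.487838] "
    "o2dlm: Waiting on the death of node 2 in domain "
    "18AE08328428452BA610E7BDE26F5246",
    "2017-10-05T10:03:38.439542+01:00 gal7gblr2084 kernel: [1911889.097543] "
    "(mdb_psp0_-mgmtd,21126,0):dlm_send_remote_unlock_request:358 ERROR: "
    "Error -107 when sending message 506 (key 0x4a68dd81) to node 1",
]

if __name__ == "__main__":
    for key, count in sorted(summarize(SAMPLE).items()):
        print(key, count)
```

In practice one would feed it the real file, for example summarize(open("/var/log/messages")), on each node and compare which peers the failures point at.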
