[Ocfs2-devel] A o2cb DLM problem

Gang He ghe at suse.com
Wed Oct 11 23:37:41 PDT 2017


Hello list,

We got an o2cb DLM problem report from a customer who is using the o2cb stack for the OCFS2 file system on SLES12SP1 (kernel 3.12.49-11-default).
The problem description is as follows:

The customer has a three-node Oracle RAC cluster:
gal7gblr2084
gal7gblr2085
gal7gblr2086

On each node, they have configured two OCFS2 resources as file systems. Two of the nodes, gal7gblr2085 and gal7gblr2086, hung and went into a loop of killing each other, and the customer wants a root cause analysis.
All I see in the logs are the following messages flooding /var/log/messages:

2017-10-05T06:50:25.980773+01:00 gal7gblr2085 kernel: [16874541.314199] o2net: Connection to node gal7gblr2086 (num 2) at 10.233.217.12:7777 has been idle for 30.5 secs, shutting it down.
2017-10-05T06:50:37.456786+01:00 gal7gblr2085 kernel: [16874552.778726] o2net: No longer connected to node gal7gblr2086 (num 2) at 10.233.217.12:7777
2017-10-05T06:50:45.176798+01:00 gal7gblr2085 kernel: [16874560.487834] (kworker/u64:1,13245,10):dlm_send_remote_convert_request:392 ERROR: Error -107 when sending message 504 (key 0x4a68dd81) to node 2
2017-10-05T06:50:45.176812+01:00 gal7gblr2085 kernel: [16874560.487838] o2dlm: Waiting on the death of node 2 in domain 18AE08328428452BA610E7BDE26F5246
2017-10-05T06:50:50.284796+01:00 gal7gblr2085 kernel: [16874565.589996] (kworker/u64:1,13245,10):dlm_send_remote_convert_request:392 ERROR: Error -107 when sending message 504 (key 0x4a68dd81) to node 2
2017-10-05T06:50:50.284811+01:00 gal7gblr2085 kernel: [16874565.590000] o2dlm: Waiting on the death of node 2 in domain 18AE08328428452BA610E7BDE26F5246
2017-10-05T06:50:55.400808+01:00 gal7gblr2085 kernel: [16874570.700448] (kworker/u64:1,13245,10):dlm_send_remote_convert_request:392 ERROR: Error -107 when sending message 504 (key 0x4a68dd81) to node 2
2017-10-05T06:50:55.400824+01:00 gal7gblr2085 kernel: [16874570.700452] o2dlm: Waiting on the death of node 2 in domain 18AE08328428452BA610E7BDE26F5246
2017-10-05T06:51:00.512766+01:00 gal7gblr2085 kernel: [16874575.808944] (kworker/u64:1,13245,26):dlm_send_remote_convert_request:392 ERROR: Error -107 when sending message 504 (key 0x4a68dd81) to node 2
2017-10-05T06:51:00.512783+01:00 gal7gblr2085 kernel: [16874575.808948] o2dlm: Waiting on the death of node 2 in domain 18AE08328428452BA610E7BDE26F5246
2017-10-05T06:51:02.456785+01:00 gal7gblr2085 kernel: [16874577.749286] (ora_diag_rcp2,24339,0):dlm_do_master_request:1344 ERROR: link to 2 went down!
2017-10-05T06:51:02.456797+01:00 gal7gblr2085 kernel: [16874577.749289] (ora_diag_rcp2,24339,0):dlm_get_lock_resource:929 ERROR: status = -107
2017-10-05T06:51:05.632955+01:00 gal7gblr2085 kernel: [16874580.920124] (kworker/u64:1,13245,26):dlm_send_remote_convert_request:392 ERROR: Error -107 when sending message 504 (key 0x4a68dd81) to node 2
2017-10-05T06:51:05.632973+01:00 gal7gblr2085 kernel: [16874580.920132] o2dlm: Waiting on the death of node 2 in domain 18AE08328428452BA610E7BDE26F5246
2017-10-05T06:51:07.976787+01:00 gal7gblr2085 kernel: [16874583.262561] o2net: No connection established with node 2 after 30.0 seconds, giving up.
2017-10-05T10:03:38.439542+01:00 gal7gblr2084 kernel: [1911889.097543] (mdb_psp0_-mgmtd,21126,0):dlm_send_remote_unlock_request:358 ERROR: Error -107 when sending message 506 (key 0x4a68dd81) to node 1
2017-10-05T10:03:38.439543+01:00 gal7gblr2084 kernel: [1911889.097547] (mdb_psp0_-mgmtd,21126,0):dlm_send_remote_unlock_request:358 ERROR: Error -107 when sending message 506 (key 0x4a68dd81) to node 1
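The "Error -107" in these lines is a negated errno value. On Linux, 107 is ENOTCONN ("Transport endpoint is not connected"), which is consistent with the preceding o2net messages reporting that the connection to node 2 was shut down. The sketch below is a quick way to decode such lines. It assumes a Linux host, because errno numbering is platform-specific, and the helper name `decode_dlm_error` is hypothetical rather than part of any OCFS2 tooling:

```python
import errno
import os
import re

# Matches the "Error -NNN when sending message MMM" pattern seen in the o2dlm log lines.
LOG_RE = re.compile(r"Error (-\d+) when sending message (\d+)")


def decode_dlm_error(line):
    """Return (errno name, strerror text, message id) for a matching line, else None.

    Assumes a Linux host: errno numbers differ on other platforms.
    """
    m = LOG_RE.search(line)
    if m is None:
        return None
    code = -int(m.group(1))  # the kernel logs the negated errno value
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code), int(m.group(2))


if __name__ == "__main__":
    sample = ("(kworker/u64:1,13245,10):dlm_send_remote_convert_request:392 "
              "ERROR: Error -107 when sending message 504 (key 0x4a68dd81) to node 2")
    print(decode_dlm_error(sample))
```

The same helper also applies to the dlm_send_remote_unlock_request lines from gal7gblr2084, which report the same -107 for message 506.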


Have you encountered such a problem when using the o2cb stack? We mainly focus on the pcmk stack, but I would still like to help this customer find the root cause.


Thanks
Gang








