[Ocfs2-devel] [PATCH] ocfs2: re-queue AST or BAST if sending is failed to improve the reliability
Joseph Qi
jiangqi903 at gmail.com
Tue Aug 22 18:06:52 PDT 2017
Hi Mark,
On 17/8/23 04:49, Mark Fasheh wrote:
> On Tue, Aug 8, 2017 at 5:56 AM, Changwei Ge <ge.changwei at h3c.com> wrote:
>>>> It will improve the reliability a lot.
>>> Can you detail your testing? Code-wise this looks fine to me but as
>>> you note, this is a pretty hard to hit corner case so it'd be nice to
>>> hear that you were able to exercise it.
>>>
>>> Thanks,
>>> --Mark
>> Hi Mark,
>>
>> My test is quite simple to perform.
>> The test environment includes 7 hosts. The Ethernet devices on 6 of them
>> are brought down and then up repeatedly.
>> After several rounds of down and up, some file operations hang.
>>
>> Running the debugfs.ocfs2 tool on NODE 2, which was the owner of
>> lock resource 'O000000000000000011150300000000',
>> showed:
>>
>> debugfs: dlm_locks O000000000000000011150300000000
>> Lockres: O000000000000000011150300000000 Owner: 2 State: 0x0
>> Last Used: 0 ASTs Reserved: 0 Inflight: 0 Migration Pending: No
>> Refs: 4 Locks: 2 On Lists: None
>> Reference Map: 3
>> Lock-Queue  Node  Level  Conv  Cookie  Refs  AST  BAST  Pending-Action
>> Granted     2     PR     -1    2:53    2     No   No    None
>> Granted     3     PR     -1    3:48    2     No   No    None
>>
>> That meant NODE 2 had granted the lock to NODE 3 and the AST had been
>> sent to NODE 3.
>>
>> Meanwhile, running the debugfs.ocfs2 tool on NODE 3
>> showed:
>> debugfs: dlm_locks O000000000000000011150300000000
>> Lockres: O000000000000000011150300000000 Owner: 2 State: 0x0
>> Last Used: 0 ASTs Reserved: 0 Inflight: 0 Migration Pending: No
>> Refs: 3 Locks: 1 On Lists: None
>> Reference Map:
>> Lock-Queue  Node  Level  Conv  Cookie  Refs  AST  BAST  Pending-Action
>> Blocked     3     PR     -1    3:48    2     No   No    None
>>
>> That meant NODE 3 never received the AST needed to move its local lock
>> from the blocked list to the granted list.
>>
>> This outcome makes sense, since sending the AST failed, as can be
>> seen in the kernel log.
>>
>> As for BAST, it is more or less the same.
>>
>> Thanks
>> Changwei
>
>
> Thanks for the testing details. I think you got Andrew's e-mail wrong
> so I'm CC'ing him now. It might be a good idea to re-send the patch
> with the right CC's - add some of your testing details to the log.
IMO, a network error does not guarantee that the target node hasn't
received the message. A complete message round includes:
1. sending the message to the target node;
2. getting the response from the target node.
So if the network error happens in phase 2, re-queuing the message will
cause the ast/bast to be sent twice. I'm afraid this cannot be handled
currently.
Please correct me if I have misunderstood.
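To make the concern concrete, here is a minimal Python sketch of the two
phases. It is only a toy model, not the o2dlm code: Receiver, send() and
send_with_requeue() are hypothetical names, and the receiver is assumed to
have no duplicate detection. The sender sees the same error whether the
request or the response is lost, so re-queuing on error can deliver the
same message twice.

```python
from dataclasses import dataclass, field


class NetworkError(Exception):
    """Raised to the sender when either phase of the round trip fails."""


@dataclass
class Receiver:
    """Hypothetical target node that acts on every copy it receives."""
    delivered: list = field(default_factory=list)

    def handle(self, cookie):
        # Assumption: no duplicate check on the receiving side.
        self.delivered.append(cookie)


def send(receiver, cookie, fail_phase=None):
    """One message round; fail_phase is None, 1 (request lost) or 2 (response lost)."""
    if fail_phase == 1:
        raise NetworkError("request lost")    # target never saw the message
    receiver.handle(cookie)                   # target processed the message
    if fail_phase == 2:
        raise NetworkError("response lost")   # sender still sees an error


def send_with_requeue(receiver, cookie, failures):
    """Re-queue on any error; failures gives the fail_phase of each attempt."""
    for fail_phase in failures:
        try:
            send(receiver, cookie, fail_phase)
            return
        except NetworkError:
            continue  # sender cannot tell phase 1 from phase 2, so it retries
    raise NetworkError("gave up after all attempts failed")


# Error in phase 1, then success: the message is delivered once.
lost_request = Receiver()
send_with_requeue(lost_request, "3:48", [1, None])

# Error in phase 2, then success: the message is delivered twice.
lost_response = Receiver()
send_with_requeue(lost_response, "3:48", [2, None])
```

In this model, avoiding the second delivery would need either a way for the
sender to distinguish the two phases or a receiver that recognizes a repeated
message; whether either is practical in the real code is a separate question.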
Thanks,
Joseph
> You're free to use my
>
> Reviewed-by: Mark Fasheh <mfasheh at versity.com>
>
> as well.
>
> Thanks again,
> --Mark
>