[Ocfs2-devel] [PATCH] ocfs2: re-queue AST or BAST if sending is failed to improve the reliability

Joseph Qi jiangqi903 at gmail.com
Tue Aug 22 20:34:35 PDT 2017



On 17/8/23 10:23, Junxiao Bi wrote:
> On 08/10/2017 06:49 PM, Changwei Ge wrote:
>> Hi Joseph,
>>
>>
>> On 2017/8/10 17:53, Joseph Qi wrote:
>>> Hi Changwei,
>>>
>>> On 17/8/9 23:24, ge changwei wrote:
>>>> Hi
>>>>
>>>>
>>>> On 2017/8/9 7:32 PM, Joseph Qi wrote:
>>>>> Hi,
>>>>>
>>>>> On 17/8/7 15:13, Changwei Ge wrote:
>>>>>> Hi,
>>>>>>
>>>>>> In the current code, while flushing ASTs, we don't handle the case where
>>>>>> sending an AST or BAST fails.
>>>>>> But it is indeed possible for an AST or BAST to be lost due to some kind of
>>>>>> network fault.
>>>>>>
>>>>> Could you please describe this issue more clearly? It is better to analyze
>>>>> the issue along with the error messages and the status of the related nodes.
>>>>> IMO, if the network is down, one of the two nodes will be fenced. So what's
>>>>> your case here?
>>>>>
>>>>> Thanks,
>>>>> Joseph
>>>> I have posted the status of the related lock resource in my preceding email.
>>>> Please check it out.
>>>>
>>>> Moreover, the network is not down permanently, and not even for longer than
>>>> the threshold that triggers fencing.
>>>> So no node will be fenced.
>>>>
>>>> This issue happens in a very poor network environment. Some messages may be
>>>> dropped by the switch under various conditions.
>>>> Frequent, rapid link up/down transitions will also cause this issue.
>>>>
>>>> In a nutshell, re-queuing ASTs and BASTs is crucial when the link between
>>>> nodes recovers quickly. It prevents the cluster from hanging.
>>> So you mean the tcp packet is lost due to connection reset? IIRC,
>> Yes, it's something like that exception, which I think deserves to be
>> fixed within OCFS2.
>>> Junxiao has posted a patchset to fix this issue.
>>> If you go with re-queuing, how do you make sure the original
>>> message is *truly* lost and the same ast/bast won't be sent twice?
>> With regard to the TCP layer, if it returns an error to OCFS2, the packets
>> cannot have been sent successfully. So no node will receive such an AST or BAST.
> Right, but not only AST/BAST: other messages pending in the tcp queue will
> also be lost if tcp returns an error to ocfs2, and this can also cause a hang.
> Besides, your fix may introduce the duplicated ast/bast messages Joseph
> mentioned.
> Ocfs2 depends on tcp a lot; it can't work well if tcp returns an error to it.
> To fix it, maybe ocfs2 should maintain its own message queue and ack
> messages, rather than depending on TCP.
>
Agreed. Or we can add a sequence number to distinguish duplicate messages.
With that, we can simply resend a message if sending fails.
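
A minimal sketch of the sequence idea, in Python rather than kernel C for
brevity. The names Sender, Receiver and send_with_retry are made up for
illustration and are not existing o2net or dlm interfaces. The assumption is
that the sender tags each message once and reuses the same tag on every
resend, and that the receiver remembers which tags it has already handled:

```python
from dataclasses import dataclass, field


@dataclass
class Sender:
    """Hypothetical sender: assigns a sequence number once per message."""
    next_seq: int = 0

    def new_message(self, payload):
        msg = (self.next_seq, payload)
        self.next_seq += 1
        return msg


@dataclass
class Receiver:
    """Hypothetical receiver: drops messages whose sequence was already handled."""
    # A real implementation would have to bound this set; it is unbounded here.
    seen: set = field(default_factory=set)
    delivered: list = field(default_factory=list)

    def handle(self, msg):
        seq, payload = msg
        if seq in self.seen:
            return False  # duplicate caused by a resend: ignore it
        self.seen.add(seq)
        self.delivered.append(payload)
        return True


def send_with_retry(transport, msg, max_tries=3):
    """Resend the same (seq, payload) until the transport reports success."""
    for _ in range(max_tries):
        if transport(msg):
            return True
    return False
```

With this, a resend after an ambiguous failure (the message arrived but the
sender saw an error) is harmless, because the second copy carries the same
sequence number and is dropped on the receiving side.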

Thanks,
Joseph
 
> Thanks,
> Junxiao.


