[Ocfs2-devel] [PATCH] ocfs2: re-queue AST or BAST if sending is failed to improve the reliability

Changwei Ge ge.changwei at h3c.com
Wed Sep 13 00:03:18 PDT 2017


Hi,

I don't think the duplicated AST issue mentioned earlier actually exists,
because a re-sent AST won't find any lock on the converting list or the
blocked list.
So how can the AST callback be called twice?
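
To make that reasoning concrete, here is a minimal userspace sketch in C. It is not the real ocfs2 DLM code: the names (struct lock, handle_ast, ast_callback, the queue enum) are hypothetical, and it assumes the receiver takes the lock off the converting/blocked list before it runs the AST callback. Under that assumption a second copy of the same AST finds nothing to act on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model: which list a lock currently sits on. */
enum lock_queue { QUEUE_GRANTED, QUEUE_CONVERTING, QUEUE_BLOCKED };

struct lock {
    enum lock_queue queue;  /* list the lock is currently on */
    int ast_calls;          /* how many times the AST callback has run */
};

/* Stand-in for the lock owner's AST callback (assumption, not ocfs2 API). */
static void ast_callback(struct lock *lk)
{
    lk->ast_calls++;
}

/*
 * Handle an incoming AST for lk.
 * Returns true if the AST was applied, false if it was ignored because
 * the lock is no longer on the converting or blocked list.
 */
static bool handle_ast(struct lock *lk)
{
    assert(lk != NULL);

    if (lk->queue != QUEUE_CONVERTING && lk->queue != QUEUE_BLOCKED)
        return false;           /* re-sent or stale AST: nothing to do */

    lk->queue = QUEUE_GRANTED;  /* leave the waiting list before the callback */
    ast_callback(lk);
    return true;
}
```

Calling handle_ast() twice on a lock that starts on the converting list runs the callback only once in this model. Whether the real receive path gives the same guarantee is exactly what needs to be checked.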

Thanks,
Changwei

> 
> On 2017/8/23 12:48, Gang He wrote:
>>
>>
>>> On 17/8/23 10:23, Junxiao Bi wrote:
>>>> On 08/10/2017 06:49 PM, Changwei Ge wrote:
>>>>> Hi Joseph,
>>>>>
>>>>>
>>>>> On 2017/8/10 17:53, Joseph Qi wrote:
>>>>>> Hi Changwei,
>>>>>>
>>>>>> On 17/8/9 23:24, ge changwei wrote:
>>>>>>> Hi
>>>>>>>
>>>>>>>
>>>>>>> On 2017/8/9 7:32 PM, Joseph Qi wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 17/8/7 15:13, Changwei Ge wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> In the current code, while flushing ASTs, we don't handle the
>>>>>>>>> case where sending an AST or BAST fails.
>>>>>>>>> But an AST or BAST can indeed be lost due to some
>>>>>>>>> kind of network fault.
>>>>>>>>>
>>>>>>>> Could you please describe this issue more clearly? It would be better
>>>>>>>> to analyze the issue along with the error messages and the status of the related nodes.
>>>>>>>> IMO, if the network is down, one of the two nodes will be fenced. So
>>>>>>>> what's your case here?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Joseph
>>>>>>> I have posted the status of the related lock resource in my preceding email.
>>>>>>> Please check it out.
>>>>>>>
>>>>>>> Moreover, the network is not down forever, nor even for longer than
>>>>>>> the fencing threshold.
>>>>>>> So no node will be fenced.
>>>>>>>
>>>>>>> This issue happens in a terrible network environment. Some messages
>>>>>>> may be dropped by the switch under various conditions.
>>>>>>> Frequent, rapid link up/down events can also cause this issue.
>>>>>>>
>>>>>>> In a nutshell, re-queuing the AST and BAST is crucial when the link
>>>>>>> between nodes recovers quickly. It prevents the cluster from hanging.
>>>>>> So you mean the tcp packet is lost due to connection reset? IIRC,
>>>>> Yes, it's that kind of exception, which I think deserves
>>>>> to be fixed within OCFS2.
>>>>>> Junxiao has posted a patchset to fix this issue.
>>>>>> If you take the re-queuing approach, how do you make sure the
>>>>>> original message is *truly* lost and the same ast/bast won't be sent twice?
>>>>> As for the TCP layer, if it returns an error to OCFS2, the packets
>>>>> cannot have been sent successfully. So no node will receive such an AST or BAST.
>>>> Right, but not only AST/BAST: other messages pending in the tcp queue
>>>> will also be lost if tcp returns an error to ocfs2, and this can also cause a hang.
>>>> Besides, your fix may introduce the duplicated ast/bast messages Joseph
>>>> mentioned.
>>>> Ocfs2 depends on tcp a lot; it can't work well if tcp returns an error to it.
>>>> To fix it, maybe ocfs2 should maintain its own message queue and ack
>>>> messages rather than depend on TCP.
>>> Agree. Or we can add a sequence number to distinguish duplicate messages.
>>> With this, we can simply resend a message if sending fails.
>> Looks like we need to make the messages stateless.
>> Maybe we can refer to GFS2 to see whether it has considered this issue.
>>
>> Thanks
>> Gang
> Um.
> Since Joseph, Junxiao and Gang all have different or opposing opinions on this hang fix, I will perform more tests to check whether the previously mentioned duplicated ast issue truly exists. If it does exist, I will try to figure out a new way to fix it and send an improved version of this patch.
> 
> I will report the test results in a few days. Anyway, thanks for your comments.
> 
> Thanks,
> Changwei.
>>> Thanks,
>>> Joseph
>>>   
>>>> Thanks,
>>>> Junxiao.
>>> _______________________________________________
>>> Ocfs2-devel mailing list
>>> Ocfs2-devel at oss.oracle.com
>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 
> 



