[Ocfs2-devel] [PATCH] ocfs2: re-queue AST or BAST if sending is failed to improve the reliability

Changwei Ge ge.changwei at h3c.com
Tue Aug 8 03:56:43 PDT 2017



On 2017/8/8 4:20, Mark Fasheh wrote:
> On Mon, Aug 7, 2017 at 2:13 AM, Changwei Ge <ge.changwei at h3c.com> wrote:
>> Hi,
>>
>> In the current code, while flushing ASTs, we don't handle the case where
>> sending an AST or BAST fails.
>> But it is indeed possible for an AST or BAST to be lost due to some kind
>> of network fault.
>>
>> If that happens, the requesting node will never get an AST back; hence,
>> it will never acquire the lock or abort the current locking operation.
>>
>> With this patch, I'd like to fix this issue by re-queuing the AST or
>> BAST if sending fails due to a network fault.
>>
>> The re-queued AST or BAST will be dropped if the requesting node is
>> dead.
>>
>> This should improve reliability considerably.
> Can you detail your testing? Code-wise this looks fine to me but as
> you note, this is a pretty hard to hit corner case so it'd be nice to
> hear that you were able to exercise it.
>
> Thanks,
>    --Mark
Hi Mark,

My test is quite simple to perform.
The test environment consists of 7 hosts. The Ethernet devices on 6 of
them are repeatedly brought down and back up.
After several rounds of this, some file operations hang.

Running the debugfs.ocfs2 tool on NODE 2, which was the owner of
lock resource 'O000000000000000011150300000000',
showed:

debugfs: dlm_locks O000000000000000011150300000000
Lockres: O000000000000000011150300000000   Owner: 2    State: 0x0
Last Used: 0      ASTs Reserved: 0    Inflight: 0    Migration Pending: No
Refs: 4    Locks: 2    On Lists: None
Reference Map: 3
 Lock-Queue  Node  Level  Conv  Cookie           Refs  AST  BAST  Pending-Action
 Granted     2     PR     -1    2:53             2     No   No    None
 Granted     3     PR     -1    3:48             2     No   No    None

That means NODE 2 had granted the lock to NODE 3 and, from its point of
view, the AST had already been sent to NODE 3.

Meanwhile, the debugfs.ocfs2 tool on NODE 3 showed:
debugfs: dlm_locks O000000000000000011150300000000
Lockres: O000000000000000011150300000000   Owner: 2    State: 0x0
Last Used: 0      ASTs Reserved: 0    Inflight: 0    Migration Pending: No
Refs: 3    Locks: 1    On Lists: None
Reference Map:
 Lock-Queue  Node  Level  Conv  Cookie           Refs  AST  BAST  Pending-Action
 Blocked     3     PR     -1    3:48             2     No   No    None

That means NODE 3 never received the AST that would move its local lock
from the blocked list to the granted list.
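
The comparison of the two dumps can be automated. The following is only a minimal sketch, assuming the lock rows look exactly like the ones shown above; `lock_rows` and `lost_ast_nodes` are hypothetical helper names, not part of debugfs.ocfs2, and the two sample strings contain only the lock rows copied from the dumps.

```python
import re

# Matches a lock row such as " Granted     3     PR     -1    3:48 ..."
# and captures (queue, node, level). Header lines do not match because
# the second field must be a number.
ROW = re.compile(r"^\s*([A-Za-z]+)\s+(\d+)\s+(\S+)\s")

def lock_rows(dump):
    """Return {node: (queue, level)} for every lock row in a dlm_locks dump."""
    rows = {}
    for line in dump.splitlines():
        m = ROW.match(line)
        if m:
            rows[int(m.group(2))] = (m.group(1), m.group(3))
    return rows

def lost_ast_nodes(owner_dump, requester_dump):
    """Nodes that the owner shows as Granted but that still see themselves Blocked."""
    owner = lock_rows(owner_dump)
    requester = lock_rows(requester_dump)
    return sorted(
        node
        for node, (queue, _level) in requester.items()
        if queue == "Blocked" and owner.get(node, ("",))[0] == "Granted"
    )

# Lock rows copied from the NODE 2 (owner) and NODE 3 (requester) dumps.
OWNER_VIEW = """\
 Granted     2     PR     -1    2:53             2     No   No    None
 Granted     3     PR     -1    3:48             2     No   No    None
"""
REQUESTER_VIEW = """\
 Blocked     3     PR     -1    3:48             2     No   No    None
"""

print(lost_ast_nodes(OWNER_VIEW, REQUESTER_VIEW))
```

A node reported by this check is one whose grant is recorded on the owner but whose AST apparently never arrived.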

This result makes sense, since sending the AST failed, as can be seen in
the kernel log.

The BAST case is more or less the same.
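
For illustration only, the behaviour the patch aims for (re-queue on a failed send, drop when the requesting node is dead) can be modelled in user space. This is a toy sketch, not the kernel code from the patch: `flush_asts`, `send`, `is_alive` and `max_rounds` are hypothetical names, and the retry limit exists only so the model terminates.

```python
from collections import deque

def flush_asts(pending, send, is_alive, max_rounds=10):
    """Toy model of flushing (node, kind) items, where kind is "AST" or "BAST".

    send(node, kind) returns True on success and False on a network fault.
    is_alive(node) returns False once the requesting node is dead.
    Returns (delivered, dropped, still_queued).
    """
    queue = deque(pending)
    delivered, dropped = [], []
    rounds = 0
    while queue and rounds < max_rounds:
        rounds += 1
        # Process only the items that were queued at the start of this round.
        for _ in range(len(queue)):
            item = queue.popleft()
            node, kind = item
            if not is_alive(node):
                dropped.append(item)      # requester is dead: drop it
            elif send(node, kind):
                delivered.append(item)    # sent successfully
            else:
                queue.append(item)        # send failed: re-queue for a later round
    return delivered, dropped, list(queue)

# Example: sends to node 3 fail twice before succeeding; node 5 is dead.
attempts = {}

def flaky_send(node, kind):
    attempts[node] = attempts.get(node, 0) + 1
    return attempts[node] > 2

delivered, dropped, left = flush_asts(
    [(3, "AST"), (5, "BAST")], flaky_send, lambda node: node != 5
)
print(delivered, dropped, left)
```

Without the re-queue branch, the AST for node 3 would be lost on the first failed send, which is the hang described above.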

Thanks
Changwei

