[Ocfs2-devel] [RFC] make ocfs2/o2net reliable
Changwei Ge
ge.changwei at h3c.com
Thu Nov 16 01:49:10 PST 2017
Hi all,
As far as we know, ocfs2/o2net is not a reliable messaging mechanism.
Messages can be lost when a TCP socket connection shuts down suddenly.
Since the only user of o2net is ocfs2/dlm, a lost message can cause
ocfs2/dlm to hang (for example, a missing AST or ASSERT MASTER).
Sometimes ocfs2/dlm also ends up waiting forever for DLM recovery to
complete. That recovery never happens, because the target node is still
heartbeating and so no DLM recovery procedure is launched.
So I think the cases above should drive us to make ocfs2/o2net more
reliable. I already have a draft design for it, and it does require
changing o2net's behavior.
To accomplish this goal, we tag each o2net message with a sequence
number, ::msg_seq, so that the receiver can tell whether an incoming
message is a duplicate. ::msg_seq also serves as the key for looking up
the structure described below in a red-black tree.
A brand new structure named *Message Holder* is added to o2net. It is
responsible for storing _handle_status_.
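To make the lookup concrete, here is a minimal userspace sketch of the receiver-side check, not the actual patch. The struct layout, the 32-bit width of msg_seq and the name mh_lookup_or_insert are all assumptions for illustration, and a plain unbalanced binary search tree stands in for the red-black tree to keep the example short. Sequence wrap-around is ignored.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical Message Holder layout; the proposal only states that it
 * is keyed by ::msg_seq and stores the handler status. */
struct msg_holder {
	uint32_t msg_seq;		/* search key (width is an assumption) */
	int handle_status;		/* status returned by the message handler */
	struct msg_holder *left;
	struct msg_holder *right;
};

/*
 * Look up the holder for msg_seq, inserting a fresh one if none exists.
 * *is_dup is set to 1 when a holder was already present (duplicated
 * message) and to 0 when a new holder was inserted.
 * Returns NULL only if allocation fails.
 */
struct msg_holder *mh_lookup_or_insert(struct msg_holder **root,
				       uint32_t msg_seq, int *is_dup)
{
	struct msg_holder **link = root;
	struct msg_holder *mh;

	while (*link) {
		if (msg_seq == (*link)->msg_seq) {
			*is_dup = 1;
			return *link;
		}
		link = msg_seq < (*link)->msg_seq ? &(*link)->left
						  : &(*link)->right;
	}

	mh = calloc(1, sizeof(*mh));
	if (!mh)
		return NULL;
	mh->msg_seq = msg_seq;
	*link = mh;
	*is_dup = 0;
	return mh;
}
```

With something like this, the receiver would run the handler only when *is_dup is 0, store the result in handle_status, and for a duplicate simply send back the stored status again.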
When a TCP connection has to shut down or reset for some unknown reason,
we lose the packets in the send and receive buffers, but o2net still
holds those messages. This gives o2net a chance to re-send them once the
TCP connection is established again.
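The sender side of this idea could look roughly like the sketch below: every message stays in a pending table until it is acknowledged, so whatever is still pending after a reconnect is the set to re-send. This is only an illustration under assumptions. The fixed-size table and the names send_queue, sq_send, sq_ack and sq_collect_unacked are hypothetical and do not come from o2net.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16	/* arbitrary limit for the sketch */

struct pending_msg {
	uint32_t msg_seq;
	int in_use;		/* non-zero until the peer acknowledges */
};

struct send_queue {
	struct pending_msg slots[MAX_PENDING];
	uint32_t next_seq;	/* next ::msg_seq to hand out */
};

/* Tag a new message with the next sequence number and keep it pending.
 * Returns 0 on success, -1 if the table is full. */
int sq_send(struct send_queue *q, uint32_t *seq_out)
{
	for (size_t i = 0; i < MAX_PENDING; i++) {
		if (!q->slots[i].in_use) {
			q->slots[i].in_use = 1;
			q->slots[i].msg_seq = q->next_seq++;
			*seq_out = q->slots[i].msg_seq;
			return 0;
		}
	}
	return -1;
}

/* Drop a message once its status has come back.
 * Returns 0 if it was pending, -1 otherwise. */
int sq_ack(struct send_queue *q, uint32_t msg_seq)
{
	for (size_t i = 0; i < MAX_PENDING; i++) {
		if (q->slots[i].in_use && q->slots[i].msg_seq == msg_seq) {
			q->slots[i].in_use = 0;
			return 0;
		}
	}
	return -1;
}

/* After the connection is re-established: collect the sequence numbers
 * of all messages that still need to be re-sent. Returns the count. */
size_t sq_collect_unacked(const struct send_queue *q, uint32_t *out,
			  size_t cap)
{
	size_t n = 0;

	for (size_t i = 0; i < MAX_PENDING && n < cap; i++) {
		if (q->slots[i].in_use)
			out[n++] = q->slots[i].msg_seq;
	}
	return n;
}
```

Because the receiver can recognize duplicates by ::msg_seq, re-sending a message whose first copy did arrive is harmless in this scheme.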
The diagram below shows how it works:
SEND                                  RECV

send message
tag message header with ::msg_seq
                                      search for Message Holder with
                                      ::msg_seq
                                      NOT FOUND - insert one
                                      (FOUND - means a duplicated one)
                                      handle message
                                      store status into Message Holder
                                      send back status
instruct RECV to remove MH
                                      notify SEND that MH is already
                                      removed
return to caller
I look forward to your comments, especially from @Mark, @Joseph and @Junxiao.
Thanks,
Changwei.