LU-17841 kfilnd: Race between hello and tagged RMA
A race exists between processing an incoming hello and initiating the
RMA for bulk operations that can result in RKEY re-use.
Initiator:
Posts tagged receive with RKEY based on peerA::kp_local_session_key X
and tn_mr_key Y
Bulk request (1) sent to target
Some earlier transaction fails:
- Deletes peerA::kp_local_session_key X
- Creates peerA::kp_local_session_key Z
- HELLO request sent to peerA
Target:
Processes HELLO request - updates kp_remote_session_key from X to Z.
Handles bulk request (1)
Performs RMA using session key Z and tn_mr_key Y, but completion is
delayed
Initiator:
Bulk request (1) hits timeout
- Tagged receive canceled, and tn_mr_key Y is released
Posts tagged receive with RKEY based on peerA::kp_local_session_key Z
and tn_mr_key Y
Bulk request (2) sent to target
Target:
RMA for (1) is completed using the RKEY for (2)
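The sequence above can be replayed in a few lines of C. This is only a
sketch: the RKEY packing (session key in the high bits, MR key in the
low 16 bits) and every sketch_* name are assumptions made for
illustration and are not taken from kfilnd.

```c
#include <stdint.h>

/* Hypothetical RKEY packing: session key in the high bits, MR key low. */
static uint64_t sketch_make_rkey(uint32_t session_key, uint16_t mr_key)
{
	return ((uint64_t)session_key << 16) | mr_key;
}

struct sketch_race {
	uint64_t posted_for_req1; /* initiator: tagged receive for request (1) */
	uint64_t used_by_target;  /* target: RKEY derived when handling (1) */
	uint64_t posted_for_req2; /* initiator: tagged receive for request (2) */
};

/* Replay the race with session keys x -> z and a reused MR key y. */
static struct sketch_race sketch_replay(uint32_t x, uint32_t z, uint16_t y)
{
	struct sketch_race r;
	uint32_t target_remote_key = x;

	/* Initiator posts the tagged receive for request (1) under key x. */
	r.posted_for_req1 = sketch_make_rkey(x, y);

	/* Target processes the HELLO before it handles request (1). */
	target_remote_key = z;
	r.used_by_target = sketch_make_rkey(target_remote_key, y);

	/* Request (1) times out, y is released and reused for request (2). */
	r.posted_for_req2 = sketch_make_rkey(z, y);
	return r;
}
```

With x != z, used_by_target no longer matches posted_for_req1 but is
equal to posted_for_req2, which is the RKEY re-use described above.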
The solution is to create a new bulk request message that contains
the session key used to set up the tagged buffer on the initiator.
This is compared against the session key exchanged during the hello
handshake before the RMA is initiated. If there is a mismatch, the
RMA is failed and the transaction is finalized. The session
key stored in the new bulk request is also used to generate the RKEY
rather than using the session key stored in the kfilnd_peer. This is
a protocol change so the KFILND_MSG_VERSION is bumped.
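A minimal sketch of the target-side check follows. The struct layouts,
the sketch_* names and the RKEY packing are hypothetical and only
illustrate the idea; the real message format and key encoding may
differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: the real RKEY encoding in kfilnd may differ. */
#define SKETCH_MR_KEY_BITS 16u
#define SKETCH_MR_KEY_MASK ((1u << SKETCH_MR_KEY_BITS) - 1u)

/* Illustrative stand-in for the new bulk request payload. */
struct sketch_bulk_req {
	uint32_t session_key; /* key the initiator used for its tagged receive */
	uint16_t mr_key;      /* per-transaction key (tn_mr_key in the text) */
};

/* Illustrative stand-in for the target's view of the peer. */
struct sketch_peer {
	uint32_t remote_session_key; /* learned from the last hello */
};

/* Build the RKEY from the key carried in the request, not from the peer. */
static uint64_t sketch_rkey(const struct sketch_bulk_req *req)
{
	return ((uint64_t)req->session_key << SKETCH_MR_KEY_BITS) |
	       (req->mr_key & SKETCH_MR_KEY_MASK);
}

/*
 * Returns true and fills *rkey when the RMA may proceed. Returns false,
 * leaving *rkey untouched, when the request was set up under a stale
 * session key and the RMA must be failed.
 */
static bool sketch_bulk_check(const struct sketch_peer *peer,
			      const struct sketch_bulk_req *req,
			      uint64_t *rkey)
{
	assert(peer && req && rkey);

	if (req->session_key != peer->remote_session_key)
		return false;

	*rkey = sketch_rkey(req);
	return true;
}
```

In the race described earlier, request (1) would carry session key X
while the target already holds Z, so the check fails and no RMA is
issued against the buffer posted for request (2).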
During testing it was found that kfilnd_msg::version was not being
set correctly for immediate and bulk messages. To allow interop,
kfilnd_msg::version must be set to the version negotiated during the
handshake, which is stored in kfilnd_peer::kp_version. This has been
fixed. The issue only affects kfilnd peers with message version > 1,
so backwards compatibility between versions 1 and 2 will work
correctly.
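As a rough illustration of the version fix, the sender stamps each
outgoing message with the per-peer negotiated version. The sketch_*
types and the helper below are hypothetical; only the field names
version and kp_version come from the text above.

```c
#include <stdint.h>

/* Illustrative stand-in for kfilnd_peer; only kp_version matters here. */
struct sketch_peer_ver {
	uint16_t kp_version; /* version agreed on during the hello handshake */
};

/* Illustrative stand-in for the kfilnd_msg header. */
struct sketch_msg {
	uint16_t version;
};

/* Stamp an outgoing immediate or bulk message with the peer's version. */
static void sketch_msg_set_version(struct sketch_msg *msg,
				   const struct sketch_peer_ver *peer)
{
	msg->version = peer->kp_version;
}
```

A peer that negotiated version 1 therefore keeps receiving version 1
messages even when the local node also supports version 2.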
The KFILND_TN_DEBUG macro is modified to print additional information
that was useful when debugging this issue.
Lastly, TN_EVENT_TAG_TX_OK was missing from tn_event_to_str(), so
it is added.
HPE-bug-id: LUS-12317
Test-Parameters: trivial
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: I0b52a8367cd45b7587ba9ec3fa5212f548bebb57
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/55072
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Ian Ziemba <ian.ziemba@hpe.com>
Reviewed-by: Ron Gredvig <ron.gredvig@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>