LU-17476 lnet: prefer to use bits only to match ME
author Serguei Smirnov <ssmirnov@whamcloud.com>
Sat, 27 Jan 2024 20:17:34 +0000 (12:17 -0800)
committer Andreas Dilger <adilger@whamcloud.com>
Fri, 2 Feb 2024 16:09:33 +0000 (16:09 +0000)
commit d1509ff2ca29b2ac35a773ecb31523f92a1f06c6
tree 7c53490180df4926b293d82393bca6c3154eceb5
parent bccdb8217c0587c223a34425d204bc9519fa64f8

In some cases, it has been observed that a reply will arrive
at the portal with the correct match bits, but is dropped by
lnet_parse_put().  This appears to happen with LNet Multi-Rail
peers, each having two separate NIDs.

If a reply arrives whose match bits are available and matching,
but whose NIDs don't match, confirm the match if the NIDs are
found to belong to the same peer.  This check runs only in cases
where the reply would otherwise be dropped entirely, causing
hundreds of seconds of delay until the RPC is resent, so the
extra overhead of checking for a peer match before dropping the
reply is confined to the error path and is minimal compared to
the alternative.
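
The matching order described above can be sketched roughly as follows.
This is a minimal illustration, not the code from the patch: the
sketch_* types and functions are hypothetical stand-ins, NIDs are
reduced to plain 64-bit integers, and the peer lookup is assumed to
have already been done by the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a Multi-Rail peer that owns several NIDs. */
struct sketch_peer {
	const uint64_t *nids;
	size_t nid_count;
};

/* Hypothetical stand-in for a match entry (ME). */
struct sketch_me {
	uint64_t match_bits;
	uint64_t ignore_bits;
	uint64_t nid;		/* NID the ME expects the message from */
};

/* Return true if the peer owns the given NID. */
static bool sketch_peer_has_nid(const struct sketch_peer *peer, uint64_t nid)
{
	for (size_t i = 0; i < peer->nid_count; i++)
		if (peer->nids[i] == nid)
			return true;
	return false;
}

/*
 * Decide whether an incoming message matches the ME.
 * The match bits are checked first; the same-peer lookup is reached
 * only when the bits match but the NIDs differ, i.e. on the path
 * where the message would otherwise be dropped.
 */
static bool sketch_me_matches(const struct sketch_me *me,
			      const struct sketch_peer *peer,
			      uint64_t src_nid, uint64_t mbits)
{
	/* Bits that are not ignored must be identical. */
	if (((me->match_bits ^ mbits) & ~me->ignore_bits) != 0)
		return false;

	/* Fast path: the source NID is the one the ME expects. */
	if (me->nid == src_nid)
		return true;

	/* Slow path: accept if both NIDs belong to the same peer. */
	return peer != NULL &&
	       sketch_peer_has_nid(peer, me->nid) &&
	       sketch_peer_has_nid(peer, src_nid);
}
```

With a peer owning NIDs 0x10 and 0x20 and an ME expecting 0x10, a
message from 0x20 with matching bits is accepted in this sketch, while
one from an unrelated NID is still rejected.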

Add CFS_FAIL_CHECK() for exercising the match NIDs code.

The CFS_FAIL_CHECK() is in a hot codepath, but it is marked unlikely()
and sits in the error case, which _should_ only be hit when the
message would have been dropped anyway, so it seems unlikely to impact
performance in any meaningful way.
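
As a rough illustration of why such a check is cheap, the sketch below
shows a fault-injection test wrapped in a branch hint.  It is not the
libcfs implementation: SKETCH_FAIL_CHECK(), sketch_fail_loc and the
fail id are hypothetical simplifications, the placement of the check
(forcing a NID mismatch) is an assumption, and __builtin_expect()
assumes a GCC- or Clang-compatible compiler.

```c
#include <stdbool.h>
#include <stdint.h>

/* Branch hint: tell the compiler the condition is rarely true. */
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical global fail location; 0 means no fault injection. */
static uint32_t sketch_fail_loc;

/* Hypothetical simplified check: true only when this id is armed. */
#define SKETCH_FAIL_CHECK(id) \
	sketch_unlikely(sketch_fail_loc != 0 && sketch_fail_loc == (id))

/* Hypothetical fail id used to force a NID mismatch in tests. */
#define SKETCH_FAIL_MATCH_NID 0x1u

/*
 * Compare NIDs, pretending they differ when the fail id is armed so
 * that a test can exercise the same-peer fallback path.
 */
static bool sketch_nid_matches(uint64_t me_nid, uint64_t src_nid)
{
	if (SKETCH_FAIL_CHECK(SKETCH_FAIL_MATCH_NID))
		return false;
	return me_nid == src_nid;
}
```

When no fail location is armed, the check in this sketch reduces to a
single integer comparison on a branch the compiler lays out as the
cold path.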

Lustre-change: https://review.whamcloud.com/53843
Lustre-commit: TBD (from 3360e892750d1bf4f2b7ceab60d9a637b3e649ad)

Test-Parameters: testlist=sanity-lnet env=ONLY=350,ONLY_REPEAT=10
Signed-off-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: I10e1a2142539ddf5dabc26ce962cec1f2cfcf3db
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/53846
Tested-by: jenkins <devops@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
lnet/include/lnet/lib-lnet.h
lnet/lnet/lib-ptl.c [changed mode: 0644->0755]
lustre/tests/sanity.sh