Whamcloud - gitweb
LU-17476 lnet: prefer to use bits only to match ME 43/53843/8
author     Serguei Smirnov <ssmirnov@whamcloud.com>
           Sat, 27 Jan 2024 20:17:34 +0000 (12:17 -0800)
committer  Oleg Drokin <green@whamcloud.com>
           Mon, 4 Mar 2024 20:03:54 +0000 (20:03 +0000)
commit     0b61b7d6d7940f67b75db2f4747169478512dd09
tree       d0a5ba895476107ca19c02bcaf66a45ab3aaebcd
parent     28b4d02161c38e624efb10d4815856cb9df3dc07
LU-17476 lnet: prefer to use bits only to match ME

In some cases, it has been observed that a reply arrives at the
portal with the correct match bits but is still dropped by
lnet_parse_put().  This appears to happen with LNet Multi-Rail
peers that each have two separate NIDs.
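A minimal sketch of why such a reply can be dropped, assuming a simplified model in which a match entry (ME) accepts a message only if the match bits agree (outside the ignore bits) and the source NID is exactly the one the ME was posted for. The types `me_sketch`/`msg_sketch` and the functions `bits_match()`/`strict_match()` are hypothetical stand-ins, not the real LNet `lnet_me` structures; PIDs and wildcard NIDs are omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for LNet types (not the real
 * lnet_me / match-info definitions). */
typedef uint64_t nid_t;             /* one network interface ID */

struct me_sketch {
	nid_t    match_nid;         /* NID the ME expects the reply from */
	uint64_t match_bits;        /* bits that must match */
	uint64_t ignore_bits;       /* bits excluded from the comparison */
};

struct msg_sketch {
	nid_t    src_nid;           /* NID the message actually came from */
	uint64_t mbits;             /* match bits carried by the message */
};

/* True when every bit not covered by ignore_bits is equal. */
static bool bits_match(const struct me_sketch *me,
		       const struct msg_sketch *msg)
{
	return ((me->match_bits ^ msg->mbits) & ~me->ignore_bits) == 0;
}

/* Strict matching: the bits AND the exact source NID must agree, so a
 * reply sent from a Multi-Rail peer's other NID is rejected. */
static bool strict_match(const struct me_sketch *me,
			 const struct msg_sketch *msg)
{
	return bits_match(me, msg) && me->match_nid == msg->src_nid;
}
```

For example, if the ME was posted for NID `0xA` and the same peer replies from its second NID `0xB` with identical match bits, `bits_match()` succeeds but `strict_match()` fails, which is the drop described above.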

If a reply arrives with match bits present and matching, but the
NIDs don't match, confirm the match if both NIDs are found to
belong to the same peer.  This only happens in cases where the
reply would otherwise be dropped entirely, causing hundreds of
seconds of delay until the RPC is resent.  The extra overhead of
checking for a peer match before dropping the message is therefore
confined to the error path and is minimal compared to the
alternative.
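The fallback described above can be sketched as follows, under the assumptions that a peer is modelled as a plain array of NIDs and that each NID belongs to at most one peer. `peer_sketch`, `peer_has_nid()` and `match_with_peer_fallback()` are hypothetical names for illustration only, not the functions added by this patch; the real code uses LNet's own peer tables.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t nid_t;

/* Hypothetical peer record: a Multi-Rail peer owns several NIDs. */
struct peer_sketch {
	const nid_t *nids;
	size_t       nnids;
};

/* Does this peer own the given NID? */
static bool peer_has_nid(const struct peer_sketch *peer, nid_t nid)
{
	for (size_t i = 0; i < peer->nnids; i++)
		if (peer->nids[i] == nid)
			return true;
	return false;
}

/* Compare match bits first.  If they agree but the NIDs differ, accept
 * the message only when both NIDs belong to the same peer.  The peer
 * lookup runs only on the path where the message would otherwise have
 * been dropped, so the common exact-match case pays nothing extra. */
static bool match_with_peer_fallback(nid_t me_nid, uint64_t me_bits,
				     uint64_t ignore_bits, nid_t src_nid,
				     uint64_t msg_bits,
				     const struct peer_sketch *peers,
				     size_t npeers)
{
	if (((me_bits ^ msg_bits) & ~ignore_bits) != 0)
		return false;              /* bits differ: not this ME */
	if (me_nid == src_nid)
		return true;               /* fast path: exact NID match */

	/* Error path: bits match but NIDs differ.  Assumes a NID belongs
	 * to at most one peer, so stop at the first peer owning me_nid. */
	for (size_t i = 0; i < npeers; i++)
		if (peer_has_nid(&peers[i], me_nid))
			return peer_has_nid(&peers[i], src_nid);
	return false;
}
```

With a Multi-Rail peer owning NIDs `0xA` and `0xB`, a reply from `0xB` to an ME posted for `0xA` is accepted, while a reply from an unrelated peer's NID is still rejected.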

Add a CFS_FAIL_CHECK() to exercise the NID-matching code.

This check is in a hot code path, but CFS_FAIL_CHECK() is marked
unlikely(), and the check sits in the error case, which _should_ only
be hit when the message would have been dropped anyway, so it seems
unlikely to impact performance in any meaningful way.
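The fail-injection idea can be illustrated with a small sketch, assuming a single global knob and GCC/Clang's `__builtin_expect()`, which is what the kernel's `unlikely()` expands to. `fail_loc_sketch`, `FAIL_CHECK_SKETCH()`, `FAIL_MATCH_NIDS_SKETCH` and `nid_mismatch()` are hypothetical; Lustre's real CFS_FAIL_CHECK() and fail_loc machinery are more elaborate.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fail-injection knob, normally 0 in production. */
static uint32_t fail_loc_sketch;

/* Branch hint: tell the compiler the injected failure is rare, so the
 * check adds little cost to the hot path. */
#define FAIL_CHECK_SKETCH(id) \
	__builtin_expect(fail_loc_sketch == (id), 0)

#define FAIL_MATCH_NIDS_SKETCH 0x1000u   /* hypothetical fail ID */

/* Report a NID mismatch either when the NIDs really differ or when a
 * test forces it, so the peer-match fallback can be exercised even
 * without a Multi-Rail setup that triggers the original bug. */
static bool nid_mismatch(uint64_t me_nid, uint64_t src_nid)
{
	return me_nid != src_nid ||
	       FAIL_CHECK_SKETCH(FAIL_MATCH_NIDS_SKETCH);
}
```

A test sets the knob, sends traffic that would normally match exactly, and checks that the fallback path still accepts the reply; clearing the knob restores normal behaviour.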

Test-Parameters: testlist=sanity env=ONLY=350,ONLY_REPEAT=10
Signed-off-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Change-Id: I10e1a2142539ddf5dabc26ce962cec1f2cfcf3db
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/53843
Reviewed-by: Cyril Bordage <cbordage@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
lnet/include/lnet/lib-lnet.h
lnet/lnet/lib-ptl.c
lustre/tests/sanity.sh