LU-14810 lnet: Do not issue multiple PUSHes 59/55559/2
author Chris Horn <chris.horn@hpe.com>
Thu, 27 Jun 2024 16:40:19 +0000 (10:40 -0600)
committer Oleg Drokin <green@whamcloud.com>
Sat, 13 Jul 2024 20:51:07 +0000 (20:51 +0000)
commit 72726a311814bc0c0eefb22a769c9ebf7912839e
tree 73f6b988bf5bbbdfd6c6c2aa858b1ebddf030a7e
parent d44d3f87e6afeceb3fc0aca3c7d3c435847a320a

A PUSH ACK may be delayed in the network. Meanwhile, some event (e.g. a
config change or NI state change) could cause the peer to go through
discovery again. The discovery state machine doesn't consider whether
there is an outstanding PUSH, so it may issue another one for the same
peer. When the delayed ACK arrives, it clears PUSH_SENT, so discovery
no longer knows that there is an outstanding PUSH. If discovery is then
stopped, it doesn't unlink the push MD, and this can cause an assert in
lnet_assert_handler_unused() because the push event handler is still
in use.

Modify the discovery state machine to check for PUSH_SENT when
determining whether a peer needs a PUSH.
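
The race and the fix can be illustrated with a small standalone model.
This is only a sketch, not the patch itself: every sk_* name, the flag
values and the in-flight counter are hypothetical and exist only to
replay the sequence described above. The one idea taken from this
commit is that the "needs a PUSH" check returns false while PUSH_SENT
is set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state bits, loosely named after the flags in the text. */
enum {
	SK_PEER_FORCE_PUSH = 1u << 0,	/* an event requested a new PUSH */
	SK_PEER_PUSH_SENT  = 1u << 1,	/* a PUSH is outstanding, no ACK yet */
};

struct sk_peer {
	unsigned int state;
	int pushes_in_flight;		/* PUSHes sent but not yet ACKed */
};

/* With check_push_sent set, an outstanding PUSH suppresses a new one. */
static bool sk_peer_needs_push(const struct sk_peer *lp, bool check_push_sent)
{
	if (check_push_sent && (lp->state & SK_PEER_PUSH_SENT))
		return false;
	return (lp->state & SK_PEER_FORCE_PUSH) != 0;
}

/* One pass of the (heavily simplified) discovery state machine. */
static void sk_discovery_pass(struct sk_peer *lp, bool check_push_sent)
{
	if (!sk_peer_needs_push(lp, check_push_sent))
		return;
	lp->state &= ~SK_PEER_FORCE_PUSH;
	lp->state |= SK_PEER_PUSH_SENT;
	lp->pushes_in_flight++;
}

/* An ACK clears PUSH_SENT, regardless of how many PUSHes were sent. */
static void sk_push_ack(struct sk_peer *lp)
{
	assert(lp->pushes_in_flight > 0);
	lp->pushes_in_flight--;
	lp->state &= ~SK_PEER_PUSH_SENT;
}

/* PUSH_SENT should be set exactly when a PUSH is still in flight. */
static bool sk_state_consistent(const struct sk_peer *lp)
{
	return ((lp->state & SK_PEER_PUSH_SENT) != 0) ==
	       (lp->pushes_in_flight > 0);
}

/* Replay: PUSH, rediscovery while the ACK is delayed, then the ACK. */
static bool sk_replay(bool check_push_sent)
{
	struct sk_peer lp = { .state = SK_PEER_FORCE_PUSH };

	sk_discovery_pass(&lp, check_push_sent);	/* first PUSH */
	lp.state |= SK_PEER_FORCE_PUSH;			/* config/NI change */
	sk_discovery_pass(&lp, check_push_sent);	/* ACK still delayed */
	sk_push_ack(&lp);				/* delayed ACK arrives */
	return sk_state_consistent(&lp);
}
```

In this model sk_replay(false) ends with one PUSH still in flight but
PUSH_SENT cleared, which is the untracked PUSH described above, while
sk_replay(true) defers the second PUSH until the first ACK has been
handled.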

sanity-lnet test_304 can reproduce this issue under an IPv6
configuration if modules are unloaded at the end of the test.

Test-Parameters: trivial
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Change-Id: Ic3f7a8b44f85a18afb939fdbfa1f9bc5dc64d93d
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/55559
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Cyril Bordage <cbordage@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
lnet/include/lnet/lib-lnet.h
lustre/tests/sanity-lnet.sh