LU-17440 lnet: prevent erroneous decref for asym route
author Gian-Carlo DeFazio <defazio1@llnl.gov>
Thu, 29 Feb 2024 00:44:48 +0000 (16:44 -0800)
committer Andreas Dilger <adilger@whamcloud.com>
Sat, 27 Apr 2024 22:28:16 +0000 (22:28 +0000)
commit 9ead05fa6255c92fc412479a3e7fe8020c84769a
tree 19df098c748962eaacffe574648483ddfc4fbcc0
parent e8a834333989bde0cf5426e8239c696265dfbaf6

The following stack trace was seen on a Lustre server:
Call Trace TBD:
[<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
[<0>] lbug_with_loc+0x3f/0x70 [libcfs]
[<0>] lnet_destroy_peer_ni_locked+0x44d/0x4e0 [lnet]
[<0>] lnet_handle_find_routed_path+0x86c/0xee0 [lnet]
[<0>] lnet_select_pathway+0xb95/0x16c0 [lnet]
[<0>] lnet_send+0x6d/0x1e0 [lnet]
[<0>] lnet_parse_local+0x3ed/0xdd0 [lnet]
[<0>] lnet_parse+0xd7d/0x1490 [lnet]
[<0>] kiblnd_handle_rx+0x30e/0x900 [ko2iblnd]
[<0>] kiblnd_scheduler+0x104b/0x10d0 [ko2iblnd]
[<0>] kthread+0x14c/0x170
[<0>] ret_from_fork+0x1f/0x40

It was discovered that the LNet routes between the server
and a client cluster were misconfigured, so that the clients
had routes to the server through all 8 available routers,
but the server had routes to the clients through only 7 of
the routers.

The server was contacted by a client node through the
router with the missing route. It incremented the ref count
of the struct lnet_peer_ni for that router, but then,
because it had no route through that peer, switched its
struct lnet_peer_ni pointer to a peer with a route back to
the client. It then decremented the ref count of the new
struct lnet_peer_ni, which dropped that count to 0 and
caused an LBUG.
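
The imbalance described above can be illustrated with a toy
model. This is a minimal sketch, not the LNet code: struct
toy_peer_ni, toy_addref(), toy_decref() and toy_buggy_reply()
are hypothetical names, and the starting ref count of 1 per
peer is an assumption made only for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct lnet_peer_ni: only a ref count. */
struct toy_peer_ni {
	int  refcount;
	bool destroyed;	/* set when the count reaches 0 */
};

static void toy_addref(struct toy_peer_ni *p)
{
	p->refcount++;
}

static void toy_decref(struct toy_peer_ni *p)
{
	assert(p->refcount > 0);
	if (--p->refcount == 0)
		p->destroyed = true;
}

/*
 * Buggy pattern: the reference is taken on the router the message
 * arrived through, but if that router has no route back, the pointer
 * is switched to another router and that one is decremented instead.
 */
static void toy_buggy_reply(struct toy_peer_ni *incoming,
			    struct toy_peer_ni *with_route,
			    bool incoming_has_route)
{
	struct toy_peer_ni *gw = incoming;

	toy_addref(gw);
	if (!incoming_has_route)
		gw = with_route;	/* now points at a different peer */

	/* ... the reply would be sent through gw here ... */

	toy_decref(gw);		/* unbalanced when gw != incoming */
}
```

With both peers starting at a count of 1, one call with
incoming_has_route set to false leaves the incoming router
at 2 (a leaked reference) and the substituted router at 0.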

Detect whether the peer is a router to the appropriate net.
If so, decrement its ref count at the end of the function;
if not, decrement its ref count immediately.
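
The rule above can be sketched in a toy model. This is an
illustration under stated assumptions, not the change made
to lnet/lnet/lib-move.c: struct toy_peer_ni, toy_addref(),
toy_decref() and toy_fixed_reply() are hypothetical names,
and the boolean argument stands in for whatever check the
real code uses to decide whether the peer is a router to
the destination net.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct lnet_peer_ni: only a ref count. */
struct toy_peer_ni {
	int refcount;
};

static void toy_addref(struct toy_peer_ni *p)
{
	p->refcount++;
}

static void toy_decref(struct toy_peer_ni *p)
{
	assert(p->refcount > 0);
	p->refcount--;
}

/*
 * Balanced pattern: every reference taken on 'incoming' is dropped on
 * 'incoming', either right away (not a usable router) or at the end of
 * the function (it is the gateway we send through).
 * Returns the gateway that was chosen.
 */
static struct toy_peer_ni *toy_fixed_reply(struct toy_peer_ni *incoming,
					   struct toy_peer_ni *with_route,
					   bool incoming_is_router_to_net)
{
	struct toy_peer_ni *held = NULL;	/* ref to drop on exit */
	struct toy_peer_ni *gw;

	toy_addref(incoming);
	if (incoming_is_router_to_net) {
		gw = incoming;
		held = incoming;	/* defer the decref */
	} else {
		toy_decref(incoming);	/* drop immediately */
		gw = with_route;
	}

	/* ... the reply would be sent through gw here ... */

	if (held != NULL)
		toy_decref(held);
	return gw;
}
```

In both branches the ref count of each peer is the same on
exit as on entry, so switching to a different gateway can
no longer drive an unrelated peer's count to 0.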

Lustre-change: https://review.whamcloud.com/53896
Lustre-commit: 2b210f39059be998b80b0acc13c12451960b63bb

Fixes: 60cfce ("LU-17062 lnet: Update lnet_peer_*_decref_locked usage")
Test-Parameters: testlist=sanity-lnet mdscount=1 osscount=2 clientcount=1
Signed-off-by: Gian-Carlo DeFazio <defazio1@llnl.gov>
Change-Id: I2d00faef60ae8768afa7afbb1b00a62ba90535bb
Reviewed-on: https://review.whamcloud.com/c/ex/lustre-release/+/54883
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
lnet/lnet/lib-move.c
lustre/tests/sanity-lnet.sh