LU-17440 lnet: prevent erroneous decref for asym route
author Gian-Carlo DeFazio <defazio1@llnl.gov>
Thu, 29 Feb 2024 00:44:48 +0000 (16:44 -0800)
committer Oleg Drokin <green@whamcloud.com>
Tue, 23 Apr 2024 19:45:38 +0000 (19:45 +0000)
commit 2b210f39059be998b80b0acc13c12451960b63bb
tree c34d2283d2ce7ada6f94bc12528a72bcfc134704
parent a6645f3f4c0b3e12a3f26203a898908a8277ddd7

The following stack trace was seen on a Lustre server:
Call Trace TBD:
[<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
[<0>] lbug_with_loc+0x3f/0x70 [libcfs]
[<0>] lnet_destroy_peer_ni_locked+0x44d/0x4e0 [lnet]
[<0>] lnet_handle_find_routed_path+0x86c/0xee0 [lnet]
[<0>] lnet_select_pathway+0xb95/0x16c0 [lnet]
[<0>] lnet_send+0x6d/0x1e0 [lnet]
[<0>] lnet_parse_local+0x3ed/0xdd0 [lnet]
[<0>] lnet_parse+0xd7d/0x1490 [lnet]
[<0>] kiblnd_handle_rx+0x30e/0x900 [ko2iblnd]
[<0>] kiblnd_scheduler+0x104b/0x10d0 [ko2iblnd]
[<0>] kthread+0x14c/0x170
[<0>] ret_from_fork+0x1f/0x40

It was discovered that the LNet routes between the server
and a client cluster were misconfigured: the clients had
routes to the server through all 8 available routers, but
the server had routes to the clients through only 7 of the
routers.

The server was contacted by a client node through the
router with the missing route. It incremented the ref count
of the struct lnet_peer_ni for that router, but then,
because it had no route through that peer, switched its
struct lnet_peer_ni pointer to a peer with a route back to
the client. It then decremented the ref count of the new
struct lnet_peer_ni instead of the one it had incremented,
which dropped that count to 0 and caused an LBUG.
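
The sequence above can be illustrated with a minimal, standalone
sketch. It is not the code from lib-move.c: struct peer_ni,
peer_ni_addref(), peer_ni_decref() and buggy_find_path() are
hypothetical stand-ins that model only a ref count and the pointer
switch described in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct lnet_peer_ni; only a ref count is modeled. */
struct peer_ni {
	int refcount;
};

static void peer_ni_addref(struct peer_ni *p)
{
	p->refcount++;
}

/* Returns true if the count dropped to 0. In the scenario described
 * above, that is the point at which the server hit the LBUG. */
static bool peer_ni_decref(struct peer_ni *p)
{
	assert(p->refcount > 0);
	return --p->refcount == 0;
}

/*
 * Buggy pattern (illustrative): take a reference on the router the
 * message arrived through (gw_in), switch the local pointer to another
 * router that does have a route (gw_out), then drop the reference
 * through the switched pointer.
 */
static bool buggy_find_path(struct peer_ni *gw_in, struct peer_ni *gw_out,
			    bool gw_in_has_route)
{
	struct peer_ni *lpni = gw_in;

	peer_ni_addref(lpni);		/* reference taken on gw_in */
	if (!gw_in_has_route)
		lpni = gw_out;		/* pointer now names a different peer */
	return peer_ni_decref(lpni);	/* wrong peer if the pointer was switched */
}
```

In this model, when gw_in has no route, gw_in keeps the extra
reference and gw_out loses one it never gained, which is the
imbalance described above.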

Detect whether the peer is a router to the appropriate net.
If so, decrement its ref count at the end of the function;
if not, decrement its ref count immediately.
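
A minimal sketch of that approach follows, assuming the same
simplified model of a peer as a bare ref count. The names
(struct peer_ni, fixed_find_path(), gw_in_routes_to_net) are
hypothetical and chosen for illustration; the actual change is in
lnet/lnet/lib-move.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct lnet_peer_ni; only a ref count is modeled. */
struct peer_ni {
	int refcount;
};

static void peer_ni_addref(struct peer_ni *p)
{
	p->refcount++;
}

static void peer_ni_decref(struct peer_ni *p)
{
	assert(p->refcount > 0);
	p->refcount--;
}

/*
 * Illustrative fix: decide up front whether the incoming router
 * (gw_in) routes to the target net. If it does, keep the reference
 * and drop it at the end. If it does not, drop the reference
 * immediately, before the pointer is switched to gw_out.
 * Returns the peer that was selected for the send.
 */
static struct peer_ni *fixed_find_path(struct peer_ni *gw_in,
				       struct peer_ni *gw_out,
				       bool gw_in_routes_to_net)
{
	struct peer_ni *lpni = gw_in;
	bool decref_at_end = false;

	peer_ni_addref(lpni);
	if (gw_in_routes_to_net) {
		decref_at_end = true;	/* lpni keeps naming gw_in */
	} else {
		peer_ni_decref(lpni);	/* release gw_in while we still hold it */
		lpni = gw_out;		/* no reference was taken on gw_out */
	}

	/* ... the selected peer would be used for the send here ... */

	if (decref_at_end)
		peer_ni_decref(lpni);	/* balanced: same peer as the addref */
	return lpni;
}
```

Either way, every increment is paired with a decrement on the same
peer, so both counts are unchanged when the function returns.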

Fixes: 2e27193 ("LU-17062 lnet: Update lnet_peer_*_decref_locked usage")
Test-Parameters: testlist=sanity-lnet mdscount=1 osscount=2 clientcount=1
Signed-off-by: Gian-Carlo DeFazio <defazio1@llnl.gov>
Change-Id: I2d00faef60ae8768afa7afbb1b00a62ba90535bb
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/53896
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
lnet/lnet/lib-move.c
lustre/tests/sanity-lnet.sh