git://git.whamcloud.com - fs/lustre-release.git/commit

LU-17906 ptlrpc: don't use non-uptodate peer at connect

If a peer is not yet discovered, LNET puts messages into a
pending queue until discovery is done. That pins the ptlrpc
request as well, so a connect RPC to a dead peer stays
stuck until peer discovery times out, regardless of the RPC
timeout. Moreover, no connect attempts to other peers are
made during that time:

nids_stats:
   "192.168.252.112@tcp": { connects: 1, ... sec_ago: 31 }
   "192.168.252.113@tcp": { connects: 0, ... sec_ago: never }
   "192.168.252.115@tcp": { connects: 0, ... sec_ago: never }

After 30s it is still stuck on the first NID and has never
tried any other, even though the connect RPC timeout in
ptlrpc is about 5-10s.

The patch prevents an RPC from getting stuck on a
non-uptodate peer simply by dropping such a request in
ptl_send_rpc(). That lets ptlrpc keep control over
connection request expiration and new connect attempts, so
all peers are tried one by one until one is ready.
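
The behaviour described above can be pictured with a minimal,
standalone sketch. It is not the code from this patch: the types,
the helper names (peer_discovered(), send_connect(),
connect_round()) and the -EAGAIN return value are all assumptions
made for illustration. It only shows the idea of failing fast on a
non-uptodate peer so that the caller can move on to the next
failover NID.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical discovery states; not the real LNet definitions. */
enum peer_state { PEER_UPTODATE, PEER_DISCOVERING, PEER_LOCAL };

struct peer {
	const char *nid;	/* label only, e.g. "nid-a" */
	enum peer_state state;
	int connects;		/* connect attempts made so far */
};

/*
 * Stand-in for a discovery check that returns an error code
 * instead of a boolean: 0 when the peer is usable now, a negative
 * errno otherwise. A local peer counts as uptodate.
 * The choice of -EAGAIN is an assumption of this sketch.
 */
static int peer_discovered(const struct peer *p)
{
	switch (p->state) {
	case PEER_UPTODATE:
	case PEER_LOCAL:
		return 0;
	default:
		return -EAGAIN;
	}
}

/*
 * Send-path check: count the attempt, then drop the request
 * right away if the peer is not uptodate, rather than letting it
 * sit in a pending queue until discovery finishes.
 */
static int send_connect(struct peer *p)
{
	int rc;

	p->connects++;
	rc = peer_discovered(p);
	if (rc != 0)
		return rc;	/* dropped; caller decides what is next */
	return 0;		/* request would be handed to the network */
}

/*
 * Try each failover peer in turn. Returns the index of the first
 * peer that accepted the request, or -1 if none was ready.
 */
static int connect_round(struct peer *peers, size_t n)
{
	size_t i;

	assert(peers != NULL && n > 0);
	for (i = 0; i < n; i++)
		if (send_connect(&peers[i]) == 0)
			return (int)i;
	return -1;
}
```

With three peers where only the last one is uptodate,
connect_round() records one attempt on each peer and returns
index 2, instead of blocking on the first.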

Results with patch:
nids_stats:
   "192.168.252.112@tcp": { connects: 4, ... sec_ago: 9 }
   "192.168.252.113@tcp": { connects: 4, ... sec_ago: 4 }
   "192.168.255.115@tcp": { connects: 3, ... sec_ago: 14 }

After the same 30s there were 11 connect attempts, with all
failover NIDs tried.

The patch also modifies LNetPeerDiscovered() to consider
a local peer as uptodate and to return an error code instead
of a boolean.

The import uptodate state is also no longer a boolean; it now
shows the discovery status.

Test-Parameters: env=ONLY=153a,ONLY_REPEAT=10 testlist=conf-sanity
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Change-Id: I51d8973aa8475ce1930f292c42aa22c70cfc13db
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/54286
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>

author	Mikhail Pershin <mpershin@whamcloud.com>
	Sun, 8 Sep 2024 08:10:55 +0000 (11:10 +0300)
committer	Oleg Drokin <green@whamcloud.com>
	Tue, 8 Oct 2024 06:20:13 +0000 (06:20 +0000)
commit	6fe522d3d4f92aa2a48a573419f4590b10ef13d3
tree	911b84fd382d7c31c27e6a10592d3bfcc3ad02db
parent	ff018bb77a371415a3973a58a70dfcc431862535

lnet/include/lnet/api.h
lnet/lnet/peer.c
lustre/include/lustre_import.h
lustre/obdclass/lprocfs_status.c
lustre/ptlrpc/import.c
lustre/ptlrpc/niobuf.c
lustre/tests/conf-sanity.sh