LU-17906 ptlrpc: don't use non-uptodate peer at connect
If a peer is not yet discovered, LNET puts messages into a
pending queue until discovery is done. That pins the ptlrpc
request as well, so a connect RPC to a dead peer is stuck
until peer discovery times out, regardless of the RPC timeout.
Moreover, no connect attempts to other peers are made during
that time:
nids_stats:
"192.168.252.112@tcp": { connects: 1, ... sec_ago: 31 }
"192.168.252.113@tcp": { connects: 0, ... sec_ago: never }
"192.168.252.115@tcp": { connects: 0, ... sec_ago: never }
After 30s it is still stuck on the first NID and has never
tried any other, even though the connect RPC timeout in
ptlrpc is about 5-10s.
The patch prevents an RPC from getting stuck on a
non-uptodate peer by dropping such a request in
ptl_send_rpc(). That lets ptlrpc keep control over connection
request expiration and new connect attempts, so all peers are
tried one by one until one is ready.
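The retry behaviour described above can be sketched as a small
standalone simulation. This is not the patch code: sim_peer,
sim_send_connect(), sim_connect_any() and the -EAGAIN return
value are hypothetical names and choices made for illustration
only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical discovery states, not the real LNet definitions. */
enum sim_peer_state { SIM_PEER_UPTODATE, SIM_PEER_DISCOVERING };

struct sim_peer {
	const char *nid;
	enum sim_peer_state state;
	int connects;	/* attempts made, like "connects" in nids_stats */
};

/*
 * Hypothetical stand-in for the check in the send path: fail fast
 * with -EAGAIN (an assumed error code) when the peer is not
 * uptodate, instead of letting the message wait in a pending queue.
 */
int sim_send_connect(struct sim_peer *p)
{
	p->connects++;
	if (p->state != SIM_PEER_UPTODATE)
		return -EAGAIN;
	return 0;
}

/*
 * Try peers round-robin for at most max_attempts. Returns the index
 * of the first peer that accepted the connect, or -1 if none did.
 */
int sim_connect_any(struct sim_peer *peers, size_t n, int max_attempts)
{
	assert(peers != NULL && n > 0);

	for (int i = 0; i < max_attempts; i++) {
		size_t idx = (size_t)i % n;

		if (sim_send_connect(&peers[idx]) == 0)
			return (int)idx;
	}
	return -1;
}
```

Because a dropped request returns immediately, the caller moves
on to the next NID instead of waiting for discovery of the
first one. In this simulation, three peers that never become
uptodate and a budget of 11 attempts end with counters of 4, 4
and 3.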
Results with patch:
nids_stats:
"192.168.252.112@tcp": { connects: 4, ... sec_ago: 9 }
"192.168.252.113@tcp": { connects: 4, ... sec_ago: 4 }
"192.168.255.115@tcp": { connects: 3, ... sec_ago: 14 }
After the same 30s there were 11 connect attempts, with all
failover NIDs tried.
The patch also modifies LNetPeerDiscovered() to consider
a local peer uptodate and to return an error code instead of
a boolean.
The import uptodate state is no longer a boolean either; it
now shows the discovery status.
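As a rough illustration of moving from a boolean to a status
code, the sketch below returns 0 for a usable peer and a
negative errno otherwise. The struct, the function name
demo_peer_discovered() and the specific error codes are
assumptions for this example; they are not the actual
LNetPeerDiscovered() signature or return values.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical peer description, for illustration only. */
struct demo_peer {
	bool is_local;		/* peer is the local node */
	bool discovery_done;	/* discovery finished successfully */
	bool discovery_failed;	/* discovery finished with an error */
};

/*
 * Returns 0 if the peer can be used right away, or a negative errno
 * saying why not. A boolean could not tell "still discovering" apart
 * from "discovery failed". The error codes here are assumed.
 */
int demo_peer_discovered(const struct demo_peer *p)
{
	assert(p != NULL);

	if (p->is_local)
		return 0;		/* local peer counts as uptodate */
	if (p->discovery_failed)
		return -EHOSTUNREACH;	/* assumed code for failure */
	if (!p->discovery_done)
		return -EAGAIN;		/* assumed code for in progress */
	return 0;
}
```

A caller can then store the returned value as the import's
discovery status rather than collapsing it to true or false.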
Test-Parameters: env=ONLY=153a,ONLY_REPEAT=10 testlist=conf-sanity
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Change-Id: I51d8973aa8475ce1930f292c42aa22c70cfc13db
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/54286
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>