LU-9120 lnet: handle local ni failure
Add an enumerated type listing the errors that the LND can
propagate up to LNet for further handling.
All local timeout errors will trigger a resend if the
system is configured for resends. Remote errors will
not trigger a resend, to avoid creating duplicate messages
on the receiving end. If a transmit error is encountered
where we're sure the message wasn't received by the remote end,
we will attempt a resend.
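The resend rules above can be sketched roughly as follows. This is an
illustration only: the enum, its member names, and the function are
hypothetical and are not taken from the patch. It also assumes that the
"known not received" transmit error case is subject to the same resend
configuration as local timeouts.

```c
#include <stdbool.h>

/* Hypothetical error classes; the names used in the patch may differ. */
enum sketch_health_status {
	SKETCH_OK = 0,
	SKETCH_LOCAL_TIMEOUT,	/* timed out on the local side */
	SKETCH_LOCAL_TX_ERROR,	/* transmit failed; peer known not to have the message */
	SKETCH_REMOTE_ERROR,	/* failure reported for the remote side */
};

/*
 * Decide whether a failed message is a resend candidate.
 * Remote errors are never resent, since the peer may already hold
 * the message and a resend could produce a duplicate.
 */
static bool sketch_should_resend(enum sketch_health_status st,
				 bool resends_enabled)
{
	if (!resends_enabled)
		return false;

	switch (st) {
	case SKETCH_LOCAL_TIMEOUT:
	case SKETCH_LOCAL_TX_ERROR:
		return true;
	default:
		return false;
	}
}
```

Keeping the decision in one switch makes it easy to see which error
classes are considered safe to retry.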
Add LNet-level logic to handle local NI failure. When the LND finalizes
a message, lnet_finalize() will check whether the message completed
successfully. If so, it increments the healthv of the local NI, but
not beyond the max. If it failed, it decrements the healthv, but not
below 0, and puts the message on the resend queue.
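A minimal sketch of the clamped health update described above. The
struct, the function name, the maximum of 1000, and the step size of 1
are all assumptions made for illustration; the commit message does not
state the actual values, and this is not the lnet_finalize() code.

```c
#include <stdbool.h>

#define SKETCH_MAX_HEALTH 1000	/* assumed maximum, for illustration */

/* Hypothetical stand-in for a local NI; only the health value is modelled. */
struct sketch_ni {
	int healthv;
};

/*
 * Update the NI health when a message is finalized.
 * Returns true if the message should go on the resend queue.
 */
static bool sketch_finalize(struct sketch_ni *ni, bool success)
{
	if (success) {
		/* Increment, but never beyond the maximum. */
		if (ni->healthv < SKETCH_MAX_HEALTH)
			ni->healthv++;
		return false;
	}

	/* Decrement, but never below 0, and ask for a resend. */
	if (ni->healthv > 0)
		ni->healthv--;
	return true;
}
```

Clamping at both ends keeps healthv within [0, max], so the selection
algorithm always compares values in a known range.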
On local NI failure, the local NI is placed on a recovery queue.
The monitor thread will wake up and resend all pending messages.
The selection algorithm will select the local and remote NIs
based on the new healthv.
The monitor thread will ping each NI on the local recovery queue. On
reply, it will check whether the NI's healthv is back to maximum. If it
is, the NI is removed from the recovery queue; otherwise it stays
there until it has fully recovered.
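The recovery check can be sketched as below. All names and the maximum
value are hypothetical. The queue is modelled as a flag on each NI
rather than a real list, and the sweep assumes a ping reply has arrived
for every queued NI, which the real monitor thread handles
asynchronously.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_RECOVERY_MAX_HEALTH 1000	/* assumed maximum, for illustration */

/* Hypothetical NI state relevant to recovery. */
struct sketch_recovering_ni {
	int healthv;
	bool on_recovery_queue;
};

/*
 * Handle a ping reply for a queued NI.
 * Returns true if the NI was fully recovered and removed from the queue.
 */
static bool sketch_on_ping_reply(struct sketch_recovering_ni *ni)
{
	if (ni->healthv >= SKETCH_RECOVERY_MAX_HEALTH) {
		ni->on_recovery_queue = false;
		return true;
	}
	return false;	/* not yet at maximum: stays queued */
}

/*
 * One pass over all NIs, treating each queued NI as having replied.
 * Returns how many NIs remain on the recovery queue.
 */
static size_t sketch_recovery_sweep(struct sketch_recovering_ni *nis, size_t n)
{
	size_t still_queued = 0;

	for (size_t i = 0; i < n; i++) {
		if (nis[i].on_recovery_queue && !sketch_on_ping_reply(&nis[i]))
			still_queued++;
	}
	return still_queued;
}
```

An NI only leaves the queue once its health value reaches the maximum,
so a single successful ping is not necessarily enough to clear it.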
Test-Parameters: forbuildonly
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Change-Id: I1cf5c6e74b9c5e5b06b15209f6ac77b49014e270
Reviewed-on: https://review.whamcloud.com/32764
Tested-by: Jenkins
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>