From: Cyril Bordage
Date: Thu, 18 Nov 2021 23:33:41 +0000 (+0100)
Subject: LUDOC-499 LNet health: fix bad formula
X-Git-Url: https://git.whamcloud.com/doc/manual.git/shortlog?p=doc%2Fmanual.git;a=commitdiff_plain;h=0d61dc723371fc8daee50bf2f06535e9f62cbbac

LUDOC-499 LNet health: fix bad formula

The formula for lnet_lnd_timeout is wrong. It should be
lnet_lnd_timeout = (lnet_transaction_timeout-1) / (retry_count+1)
Also, some default values were fixed.

Change-Id: I742a8a56ab0e49e68668db5d13170415a2306e4a
Signed-off-by: Cyril Bordage
Reviewed-on: https://review.whamcloud.com/c/doc/manual/+/45618
Tested-by: jenkins
Reviewed-by: Andreas Dilger
---

diff --git a/LNetMultiRail.xml b/LNetMultiRail.xml
index f742752..b103853 100644
--- a/LNetMultiRail.xml
+++ b/LNetMultiRail.xml
@@ -667,7 +667,7 @@ lnetctl route add --net o2ib0 --gateway <rtrX-nidB>@o2ib1
 mrhealth interface
 User Interface
-LNet Health is turned off by default. There are multiple module
+LNet Health is turned on by default. There are multiple module
 parameters available to control the LNet Health feature. All the
 module parameters are implemented in sysfs and are located in
 /sys/module/lnet/parameters/. They can be set directly by echoing a
@@ -697,12 +697,11 @@ lnetctl route add --net o2ib0 --gateway <rtrX-nidB>@o2ib1
 lnet_health_sensitivity. The greater the value, the
 longer it takes for that interface to become healthy again. The
 default value of lnet_health_sensitivity
-is set to 0, which means the health value will not be decremented.
-In essense, the health feature is turned off.
-The sensitivity value can be set greater than 0. A
-lnet_health_sensitivity of 100 would mean that
-10 consecutive message failures or a steady-state failure rate
-over 1% would degrade the interface Health Value until it is
+is set to 100. To disable LNet health, the value can be set to 0.
+
+An lnet_health_sensitivity of 100 means
+that 10 consecutive message failures or a steady-state failure
+rate over 1% would degrade the interface Health Value until it is
 disabled, while a lower failure rate would steer traffic away from
 the interface but it would continue to be available. When a failure
 occurs on an interface then its Health Value is
@@ -770,7 +769,7 @@ lnetctl route add --net o2ib0 --gateway <rtrX-nidB>@o2ib1
 re-sending a message it will check if a message has passed the
 maximum retry_count specified. After which if a message wasn't sent
 successfully a failure event will be passed up to the layer
-which initiated message sending.
+which initiated message sending. The default value is 2.
 Since the message retry interval (lnet_lnd_timeout) is computed
 from lnet_transaction_timeout / lnet_retry_count,
@@ -793,7 +792,7 @@ lnetctl route add --net o2ib0 --gateway <rtrX-nidB>@o2ib1
 two configurable parameters: lnet_transaction_timeout and
 retry_count.
-lnet_lnd_timeout = lnet_transaction_timeout / retry_count
+lnet_lnd_timeout = (lnet_transaction_timeout-1) / (retry_count+1)
 As such there is a restriction that
 lnet_transaction_timeout >= retry_count
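
For reference, the corrected formula from the last hunk can be checked with a small Python sketch. This is an illustration only, not LNet code: the function names are hypothetical, integer (truncating) division is an assumption about how the value is derived, and the transaction timeout values used in the usage note are example inputs rather than documented defaults. Only the retry count default of 2 comes from this patch.

```python
def lnd_timeout_new(transaction_timeout: int, retry_count: int) -> int:
    """Corrected formula from this patch (hypothetical helper name).

    Assumes whole-second values and truncating integer division.
    """
    if transaction_timeout < 1 or retry_count < 0:
        raise ValueError("expected transaction_timeout >= 1 and retry_count >= 0")
    return (transaction_timeout - 1) // (retry_count + 1)


def lnd_timeout_old(transaction_timeout: int, retry_count: int) -> int:
    """Formula removed by this patch, kept only for comparison.

    Raises ZeroDivisionError when retry_count is 0.
    """
    return transaction_timeout // retry_count


if __name__ == "__main__":
    # retry_count=2 is the default stated in the patch; 50 is an example input.
    print(lnd_timeout_new(50, 2))
    print(lnd_timeout_old(50, 2))
```

With the example input of 50 and the default retry count of 2, the corrected expression gives 16 and the removed one gives 25. The corrected expression is also defined for a retry count of 0, where the removed expression divides by zero.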