- <para>Imperative Recovery (IR) was first introduced in Lustre 2.2.0</para>
- <para>Large-scale lustre implementations have historically experienced problems recovering in a timely manner after a server failure. This is due to the way that clients detect the server failure and how the servers perform their recovery. Many of the processes are driven by the RPC timeout, which must be scaled with system size to prevent false diagnosis of server death. The purpose of imperative recovery is to reduce the recovery window by actively informing clients of server failure. The resulting reduction in the recovery window will minimize target downtime and therefore increase overall system availability. Imperative Recovery does not remove previous recovery mechanisms, and client timeout-based recovery actions can occur in a cluster when IR is enabled as each client can still independently disconnect and reconnect from a target. In case of a mix of IR and non-IR clients connecting to an OST or MDT, the server cannot reduce its recovery timeout window, because it cannot be sure that all clients have been notified of the server restart in a timely manner. Even in such mixed environments the time to complete recovery may be reduced, since IR-enabled clients will still be notified reconnect to the server promptly and allow recovery to complete as soon as the last the non-IR client detects the server failure.</para>
+ <para>Imperative Recovery (IR) was first introduced in Lustre software release 2.2.0.</para>
+ <para>Large-scale Lustre file system implementations have historically experienced problems
+ recovering in a timely manner after a server failure. This is due to the way that clients
+ detect the server failure and how the servers perform their recovery. Many of the processes
+ are driven by the RPC timeout, which must be scaled with system size to prevent false
+ diagnosis of server death. The purpose of Imperative Recovery is to reduce the recovery window
+ by actively informing clients of server failure. The resulting reduction in the recovery
+ window will minimize target downtime and therefore increase overall system availability.
+ Imperative Recovery does not remove previous recovery mechanisms, and client timeout-based
+ recovery actions can still occur in a cluster when IR is enabled, because each client can
+ independently disconnect from and reconnect to a target. When a mix of IR and non-IR
+ clients connects to an OST or MDT, the server cannot reduce its recovery timeout window,
+ because it cannot be sure that all clients have been notified of the server restart in a
+ timely manner. Even in such mixed environments, the time to complete recovery may be reduced,
+ because IR-enabled clients are still notified promptly to reconnect to the server, which allows
+ recovery to complete as soon as the last non-IR client detects the server failure.</para>