+ <section xml:id="imperativerecovery">
+ <title><indexterm><primary>imperative recovery</primary></indexterm>Imperative Recovery</title>
+ <para>Imperative Recovery (IR) was first introduced in Lustre software release 2.2.0.</para>
+ <para>Large-scale Lustre file system implementations have historically experienced problems
+ recovering in a timely manner after a server failure. This is due to the way that clients
+ detect the server failure and how the servers perform their recovery. Many of the processes
+ are driven by the RPC timeout, which must be scaled with system size to prevent false
+ diagnosis of server death. The purpose of imperative recovery is to reduce the recovery window
+ by actively informing clients of server failure. The resulting reduction in the recovery
+ window will minimize target downtime and therefore increase overall system availability.
+ Imperative Recovery does not remove previous recovery mechanisms, and client timeout-based
+ recovery actions can occur in a cluster when IR is enabled as each client can still
+ independently disconnect from and reconnect to a target. If a mix of IR and non-IR
+ clients connect to an OST or MDT, the server cannot reduce its recovery timeout window,
+ because it cannot be sure that all clients have been notified of the server restart in a
+ timely manner. Even in such mixed environments the time to complete recovery may be reduced,
+ since IR-enabled clients will still be notified to reconnect to the server promptly and allow
+ recovery to complete as soon as the last non-IR client detects the server failure.</para>
+ <section remap="h3">
+ <title><indexterm><primary>imperative recovery</primary><secondary>MGS role</secondary></indexterm>MGS role</title>
+ <para>The MGS now holds additional information about Lustre targets, in the form of a Target Status
+ Table. Whenever a target registers with the MGS, there is a corresponding entry in this
+ table identifying the target. This entry includes NID information, and state/version
+ information for the target. When a client mounts the file system, it caches a locked copy of
+ this table, in the form of a Lustre configuration log. When a target restart occurs, the MGS
+ revokes the client lock, forcing all clients to reload the table. Any restarted target will have
+ an updated version number; the client detects this and reconnects to the restarted target.
+ Since successful IR notification of server restart depends on all clients being registered
+ with the MGS, and there is no other node to notify clients in case of MGS restart, the MGS
+ will disable IR for a period when it first starts. This interval is configurable, as shown
+ in <xref linkend="imperativerecoveryparameters"/>.</para>
+ <para>Because of the increased importance of the MGS in recovery, it is strongly recommended that the MGS node be separate from the MDS. If the MGS is co-located on the MDS node, then in case of MDS/MGS failure there will be no IR notification for the MDS restart, and clients will always use timeout-based recovery for the MDS. IR notification would still be used in the case of OSS failure and recovery.</para>
+ <para>Unfortunately, it’s impossible for the MGS to know how many clients have been successfully notified or whether a specific client has received the restarting target information. The only thing the MGS can do is tell the target that, for example, all clients are imperative recovery-capable, so it is not necessary to wait as long for all clients to reconnect. For this reason, we still require a timeout policy on the target side, but this timeout value can be much shorter than normal recovery. </para>
+ </section>
+ <section remap="h3" xml:id="imperativerecoveryparameters">
+ <title><indexterm><primary>imperative recovery</primary><secondary>Tuning</secondary></indexterm>Tuning Imperative Recovery</title>
+ <para>Imperative recovery has a default parameter set which means it can work without any extra configuration. However, the default parameter set only fits a generic configuration. The following sections discuss the configuration items for imperative recovery.</para>
+ <section remap="h5">
+ <title>ir_factor</title>
+ <para><literal>ir_factor</literal> is used to control a target's recovery window. If imperative recovery is enabled, the recovery timeout window on the restarting target is calculated by: <emphasis>new timeout = recovery_time * ir_factor / 10</emphasis>. <literal>ir_factor</literal> must be a value in the range [1, 10]; the default value is 5. The following example sets the imperative recovery timeout to 80% of the normal recovery timeout on the target testfs-OST0000:</para>
+<screen>lctl conf_param obdfilter.testfs-OST0000.ir_factor=8</screen>
+ <note> <para>If this value is too small for the system, clients may be unnecessarily evicted.</para> </note>
+<para>You can read the current value of the parameter in the standard manner with <emphasis>lctl get_param</emphasis>:</para>
+ <screen>
+# lctl get_param obdfilter.testfs-OST0000.ir_factor
+obdfilter.testfs-OST0000.ir_factor=8
+</screen>
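+ <para>As a worked example, assuming a hypothetical normal recovery window (<emphasis>recovery_time</emphasis>) of 300 seconds, setting <literal>ir_factor</literal> to 8 shortens the window to 80% of its normal length:</para>
+<screen>
+$ echo $((300 * 8 / 10))   # new timeout = recovery_time * ir_factor / 10
+240
+</screen>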
+ </section>
+ <section remap="h5">
+ <title>Disabling Imperative Recovery</title>
+ <para>Imperative recovery can be disabled manually by a mount option. For example, imperative recovery can be disabled on an OST by:</para>
+ <screen># mount -t lustre -onoir /dev/sda /mnt/ost1</screen>
+ <para>Imperative recovery can also be disabled on the client side with the same mount option:</para>
+ <screen># mount -t lustre -onoir mymgsnid@tcp:/testfs /mnt/testfs</screen>
+ <note><para>When a single client is deactivated in this manner, the MGS will deactivate imperative recovery for the whole cluster. IR-enabled clients will still get notification of target restart, but targets will not be allowed to shorten the recovery window. </para></note>
+ <para>You can also disable imperative recovery globally on the MGS by writing <literal>state=disabled</literal> to the controlling procfs entry:</para>
+ <screen># lctl set_param mgs.MGS.live.testfs="state=disabled"</screen>
+ <para>The above command disables imperative recovery for the file system named <emphasis>testfs</emphasis>.</para>
+ </section>
+ <section remap="h5">
+ <title>Checking Imperative Recovery State - MGS</title>
+ <para>You can get the imperative recovery state from the MGS. The following example output illustrates the states of imperative recovery:</para>
+<screen>
+[mgs]$ lctl get_param mgs.MGS.live.testfs
+...
+imperative_recovery_state:
+ state: full
+ nonir_clients: 0
+ nidtbl_version: 242
+ notify_duration_total: 0.470000
+ notify_duration_max: 0.041000
+ notify_count: 38
+</screen>
+<informaltable frame="all">
+ <tgroup cols="2">
+ <colspec colname="c1" colwidth="50*"/>
+ <colspec colname="c2" colwidth="50*"/>
+ <thead>
+ <row>
+ <entry>
+ <para><emphasis role="bold">Item</emphasis></para>
+ </entry>
+ <entry>
+ <para><emphasis role="bold">Meaning</emphasis></para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>
+ <para><emphasis role="bold">
+ <literal>state</literal>
+ </emphasis></para>
+ </entry>
+ <entry>
+ <para><itemizedlist>
+ <listitem>
+ <para><emphasis role="bold">full: </emphasis>IR is working, all clients are connected and can be notified.</para>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">partial: </emphasis>some clients are not IR capable.</para>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">disabled: </emphasis>IR is disabled, no client notification.</para>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold">startup: </emphasis>the MGS was just restarted, so not all clients may have reconnected to the MGS yet.</para>
+ </listitem>
+ </itemizedlist></para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para><emphasis role="bold">
+ <literal>nonir_clients</literal>
+ </emphasis></para>
+ </entry>
+ <entry>
+ <para>Number of non-IR capable clients in the system.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para><emphasis role="bold">
+ <literal>nidtbl_version</literal>
+ </emphasis></para>
+ </entry>
+ <entry>
+ <para>Version number of the target status table. The client's version must match the MGS version.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para><emphasis role="bold">
+ <literal>notify_duration_total</literal>
+ </emphasis></para>
+ </entry>
+ <entry>
+ <para>[Seconds.microseconds] Total time spent by the MGS notifying clients.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para><emphasis role="bold">
+ <literal>notify_duration_max</literal>
+ </emphasis></para>
+ </entry>
+ <entry>
+ <para>[Seconds.microseconds] Maximum notification time for the MGS to notify a single IR client.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para><emphasis role="bold">
+ <literal>notify_count</literal>
+ </emphasis></para>
+ </entry>
+ <entry>
+ <para>Number of MGS restarts; to obtain the average notification time, divide <literal>notify_duration_total</literal> by <literal>notify_count</literal>.</para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+</informaltable>
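+ <para>From the sample output above, the average time for the MGS to notify a single client can be computed by dividing <literal>notify_duration_total</literal> by <literal>notify_count</literal>, for example with <literal>bc</literal>:</para>
+<screen>
+$ echo "scale=4; 0.470000 / 38" | bc
+.0123
+</screen>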
+
+ </section>
+ <section remap="h5">
+ <title>Checking Imperative Recovery State - client</title>
+ <para>A "client" in IR means a Lustre client or an MDT. You can get the IR state on any node
+ running a client or MDT; such nodes always have an MGC running. An example from a
+ client:</para>
+ <screen>
+[client]$ lctl get_param mgc.*.ir_state
+mgc.MGC192.168.127.6@tcp.ir_state=
+imperative_recovery: ON
+client_state:
+ - { client: testfs-client, nidtbl_version: 242 }
+ </screen>
+ <para>An example from an MDT:</para>
+ <screen>
+mgc.MGC192.168.127.6@tcp.ir_state=
+imperative_recovery: ON
+client_state:
+ - { client: testfs-MDT0000, nidtbl_version: 242 }
+ </screen>
+<informaltable frame="all">
+ <tgroup cols="2">
+ <colspec colname="c1" colwidth="50*"/>
+ <colspec colname="c2" colwidth="50*"/>
+ <thead>
+ <row>
+ <entry>
+ <para><emphasis role="bold">Item</emphasis></para>
+ </entry>
+ <entry>
+ <para><emphasis role="bold">Meaning</emphasis></para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>
+ <para><emphasis role="bold">
+ <literal>imperative_recovery</literal>
+ </emphasis></para>
+ </entry>
+ <entry>
+ <para><literal>imperative_recovery</literal> can be ON or OFF. If it is OFF, IR was disabled by the administrator at mount time. Normally it should be ON.</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para><emphasis role="bold">
+ <literal>client_state: client:</literal>
+ </emphasis></para>
+ </entry>
+ <entry>
+ <para>The name of the client</para>
+ </entry>
+ </row>
+ <row>
+ <entry>
+ <para><emphasis role="bold">
+ <literal>client_state: nidtbl_version</literal>
+ </emphasis></para>
+ </entry>
+ <entry>
+ <para>Version number of the target status table. The client's version must match the MGS version.</para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+</informaltable>
+ </section>
+ <section remap="h5">
+ <title>Target Instance Number</title>
+ <para>The target instance number is used to determine if a client is connecting to the latest instance of a target. The lowest 32 bits of the mount count are used as the target instance number. For an OST, you can get the target instance number of testfs-OST0001 in this way (the command is run from an OSS login prompt):</para>
+<screen>
+$ lctl get_param obdfilter.testfs-OST0001*.instance
+obdfilter.testfs-OST0001.instance=5
+</screen>
+ <para>From a client, query the relevant OSC:</para>
+<screen>
+$ lctl get_param osc.testfs-OST0001-osc-*.import |grep instance
+ instance: 5
+</screen>
+ </section>
+ </section>
+ <section remap="h3" xml:id="imperativerecoveryrecomendations">
+ <title><indexterm><primary>imperative recovery</primary><secondary>Configuration Suggestions</secondary></indexterm>Configuration Suggestions for Imperative Recovery</title>
+<para>Historically, the MGS and MDT0 were often built on the same target to save a server node. However, to
+ make IR work efficiently, it is strongly recommended to run the MGS on a separate node for any
+ significant Lustre file system installation. There are three main advantages of doing this:</para>
+<orderedlist>
+<listitem><para>The MGS can notify clients if MDT0 fails.</para></listitem>
+<listitem><para>Load balance. The load on the MDS may be very high, which could make a co-located MGS unable to notify the clients in time.</para></listitem>
+<listitem><para>Safety. The MGS code is simpler and much smaller than the MDT code, so the chance of MGS downtime due to a software bug is very low.</para></listitem>
+</orderedlist>
+ </section>
+ </section>
+
+ <section xml:id="suppressingpings">
+ <title><indexterm><primary>suppressing pings</primary></indexterm>Suppressing Pings</title>
+ <para>On clusters with large numbers of clients and OSTs, OBD_PING messages may impose
+ significant performance overheads. As an intermediate solution before a more self-contained
+ one is built, Lustre software release 2.4 introduces an option to suppress pings, allowing
+ ping overheads to be considerably reduced. Before turning on this option, administrators
+ should consider the following requirements and understand the trade-offs involved:</para>
+ <itemizedlist>
+ <listitem>
+ <para>When pings are suppressed, a target cannot detect client deaths, since clients no
+ longer send pings solely to keep their connections alive. Therefore, a mechanism external
+ to the Lustre file system must be set up to notify Lustre targets of client deaths in a
+ timely manner, so that stale connections do not persist for too long and lock callbacks to
+ dead clients do not always have to wait for timeouts.</para>
+ </listitem>
+ <listitem>
+ <para>Without pings, a client has to rely on Imperative Recovery to notify it of target failures in order to join recoveries in time. This dictates that the client must eagerly keep its MGS connection alive. Thus, a highly available standalone MGS is recommended; in addition, MGS pings are always sent regardless of how the option is set.</para>
+ </listitem>
+ <listitem>
+ <para>If a client has uncommitted requests to a target and is not sending any new requests on the connection, it will still ping that target even when pings should be suppressed. This is because the client needs to query the target's last committed transaction numbers in order to free up local uncommitted requests (and possibly other associated resources). However, these pings stop as soon as all the uncommitted requests have been freed or new requests need to be sent, rendering them unnecessary.</para>
+ </listitem>
+ </itemizedlist>
+ <section remap="h3">
+ <title><indexterm><primary>pings</primary><secondary>suppress_pings</secondary></indexterm>"suppress_pings" Kernel Module Parameter</title>
+ <para>The option that controls whether pings are suppressed is implemented as the ptlrpc kernel module parameter "suppress_pings". Setting it to "1" on a server turns on ping suppression for all targets on that server, while leaving it at the default value "0" retains the previous pinging behavior. The parameter is ignored on clients and the MGS. While it is recommended to set the parameter persistently via the modprobe.conf(5) mechanism, it also accepts online changes through sysfs. Note that an online change only affects connections established later; the pinging behavior of existing connections stays the same.</para>
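+ <para>For example, the parameter can be set persistently with an <literal>options</literal> line in a modprobe configuration file on the server (the file name below is only an example), or changed online through the standard sysfs module parameter path:</para>
+ <screen>
+# echo "options ptlrpc suppress_pings=1" &gt;&gt; /etc/modprobe.d/lustre.conf
+# echo 1 &gt; /sys/module/ptlrpc/parameters/suppress_pings
+ </screen>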
+ </section>
+ <section remap="h3">
+ <title><indexterm><primary>pings</primary><secondary>evict_client</secondary></indexterm>Client Death Notification</title>
+ <para>The required external client death notification must write the UUIDs of dead clients into the targets' "evict_client" procfs entries, such as:</para>
+ <screen>
+/proc/fs/lustre/obdfilter/testfs-OST0000/evict_client
+/proc/fs/lustre/obdfilter/testfs-OST0001/evict_client
+/proc/fs/lustre/mdt/testfs-MDT0000/evict_client
+ </screen>
+ <para>Clients' UUIDs can be obtained from their "uuid" procfs entries, such as:</para>
+ <screen>
+/proc/fs/lustre/llite/testfs-ffff8800612bf800/uuid
+ </screen>
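+ <para>For example, an external notification agent that has previously recorded a client's UUID (the UUID below is hypothetical) could evict that client from every OST on a server as follows:</para>
+ <screen>
+# UUID=8f2d3e4a-6b1c-4d5e-9f70-123456789abc   # recorded earlier from the client's uuid entry
+# for tgt in /proc/fs/lustre/obdfilter/*/evict_client; do
+    echo $UUID &gt; $tgt
+  done
+ </screen>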
+ </section>
+ </section>
+