LustreRecovery.xml

   1 <?xml version='1.0' encoding='UTF-8'?>
   2 <chapter xmlns="http://docbook.org/ns/docbook"
   3  xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US"
   4  xml:id="lustrerecovery">
   5   <title xml:id="lustrerecovery.title">Lustre File System Recovery</title>
   6   <para>This chapter describes how recovery is implemented in a Lustre file system and includes the
   7     following sections:</para>
   8   <itemizedlist>
   9     <listitem>
  10       <para><xref linkend="recoveryoverview"/></para>
  11     </listitem>
  12     <listitem>
  13       <para><xref linkend="metadatereplay"/></para>
  14     </listitem>
  15     <listitem>
  16       <para><xref linkend="replyreconstruction"/></para>
  17     </listitem>
  18     <listitem>
  19       <para><xref linkend="versionbasedrecovery"/></para>
  20     </listitem>
  21     <listitem>
  22       <para><xref linkend="commitonshare"/></para>
  23     </listitem>
  24     <listitem>
  25       <para><xref linkend="imperativerecovery"/></para>
  26     </listitem>
  27   </itemizedlist>
  28   <section xml:id="recoveryoverview">
  29       <title>
  30           <indexterm><primary>recovery</primary></indexterm>
  31           <indexterm><primary>recovery</primary><secondary>VBR</secondary><see>version-based recovery</see></indexterm>
  32           <indexterm><primary>recovery</primary><secondary>commit on share</secondary><see>commit on share</see></indexterm>
  33           <indexterm><primary>lustre</primary><secondary>recovery</secondary><see>recovery</see></indexterm>
  34           Recovery Overview</title>
  35     <para>The recovery feature provided in the Lustre software is responsible for dealing with node
  36       or network failure and returning the cluster to a consistent, performant state. Because the
  37       Lustre software allows servers to perform asynchronous update operations to the on-disk file
  38       system (i.e., the server can reply without waiting for the update to synchronously commit to
  39       disk), the clients may have state in memory that is newer than what the server can recover
  40       from disk after a crash.</para>
  41     <para>A handful of different types of failures can cause recovery to occur:</para>
  42     <itemizedlist>
  43       <listitem>
  44         <para> Client (compute node) failure</para>
  45       </listitem>
  46       <listitem>
  47         <para> MDS failure (and failover)</para>
  48       </listitem>
  49       <listitem>
  50         <para> OST failure (and failover)</para>
  51       </listitem>
  52       <listitem>
  53         <para> Transient network partition</para>
  54       </listitem>
  55     </itemizedlist>
  56     <para>All Lustre file system failure and recovery operations
  57       are based on the concept of connection failure; all imports or exports
  58       associated with a given connection are considered to fail if any of
  59       them fail.  The <xref linkend="imperativerecovery"/> feature allows
  60       the MGS to actively inform clients when a target restarts after a
  61       failure, failover, or other interruption to speed up recovery.</para>
  62     <para>For information on Lustre file system recovery, see
  63       <xref linkend="metadatereplay"/>. For information on recovering from a
  64       corrupt file system, see <xref linkend="commitonshare"/>. For
  65       information on resolving orphaned objects, a common issue after recovery,
  66       see <xref linkend="dbdoclet.50438225_13916"/>. For information on
  67       imperative recovery, see <xref linkend="imperativerecovery"/>.
  68     </para>
  69     <section remap="h3">
  70       <title><indexterm><primary>recovery</primary><secondary>client failure</secondary></indexterm>Client Failure</title>
  71       <para>Recovery from client failure in a Lustre file system is based on revoking the locks
  72         and other resources held by the failed client, so surviving clients can continue their
  73         work uninterrupted. If a client fails to respond in a timely manner to a blocking lock
  74         callback from the Distributed Lock Manager (DLM), or fails to communicate with the server
  75         for a long period of time (i.e., no pings), the client is forcibly removed from the cluster
  76         (evicted). This enables other clients to acquire locks blocked by the dead client&apos;s
  77         locks, and also frees resources (file handles, export data) associated with that client.
  78         Note that this scenario can be caused by a network partition, as well as an actual client
  79         node system failure. <xref linkend="networkpartition"/> describes this case in more detail.</para>
  80     </section>
  81     <section xml:id="clientevictions">
  82       <title><indexterm><primary>recovery</primary><secondary>client eviction</secondary></indexterm>Client Eviction</title>
  83       <para>If a client is not behaving properly from the server&apos;s point of view, it will be evicted. This ensures that the whole file system can continue to function in the presence of failed or misbehaving clients. An evicted client must invalidate all locks, which, in turn, results in all cached inodes becoming invalidated and all cached data being flushed.</para>
  84       <para>Reasons why a client might be evicted:</para>
  85       <itemizedlist>
  86         <listitem>
  87           <para>Failure to respond to a server request in a timely manner</para>
  88           <itemizedlist>
  89             <listitem>
  90               <para>Blocking lock callback (i.e., client holds lock that another client/server wants)</para>
  91             </listitem>
  92             <listitem>
  93               <para>Lock completion callback (i.e., client is granted lock previously held by another client)</para>
  94             </listitem>
  95             <listitem>
  96               <para>Lock glimpse callback (i.e., client is asked for size of object by another client)</para>
  97             </listitem>
  98             <listitem>
  99               <para>Server shutdown notification (with simplified interoperability)</para>
 100             </listitem>
 101           </itemizedlist>
 102         </listitem>
 103         <listitem>
 104           <para>Failure to ping the server in a timely manner, unless the server is receiving no RPC traffic at all (which may indicate a network partition).</para>
 105         </listitem>
 106       </itemizedlist>
 107     </section>
 108     <section remap="h3">
 109       <title><indexterm><primary>recovery</primary><secondary>MDS failure</secondary></indexterm>MDS Failure (Failover)</title>
 110       <para>Highly-available (HA) Lustre file system operation requires that the metadata server
 111         have a peer configured for failover, including the use of a shared storage device for the
 112         MDT backing file system. The actual mechanism for detecting peer failure, power off
 113         (STONITH) of the failed peer (to prevent it from continuing to modify the shared disk), and
 114         takeover of the Lustre MDS service on the backup node depends on external HA software such
 115         as Heartbeat. It is also possible to have MDS recovery with a single MDS node. In this case,
 116         recovery will take as long as is needed for the single MDS to be restarted.</para>
 117       <para>When <xref linkend="imperativerecovery"/> is enabled, clients are notified of an MDS restart (either the backup or a restored primary). Regardless, clients can always detect an MDS failure, either through timeouts of in-flight requests or through idle-time ping messages. In either case, the clients then connect to the new backup MDS and use the Metadata Replay protocol. Metadata Replay is responsible for ensuring that the backup MDS re-acquires state resulting from transactions whose effects were made visible to clients, but which were not committed to the disk.</para>
 118       <para>The reconnection to a new (or restarted) MDS is managed by the file system configuration loaded by the client when the file system is first mounted. If a failover MDS has been configured (using the <literal>--failnode=</literal> option to <literal>mkfs.lustre</literal> or <literal>tunefs.lustre</literal>), the client tries to reconnect to both the primary and backup MDS until one of them responds that the failed MDT is again available. At that point, the client begins recovery. For more information, see <xref linkend="metadatereplay"/>.</para>
 119       <para>Transaction numbers are used to ensure that operations are
 120       replayed in the order they were originally performed, so that they
 121       are guaranteed to succeed and present the same file system state as
 122       before the failure. In addition, clients inform the new server of their
 123       existing lock state (including locks that have not yet been granted).
 124       All metadata and lock replay must complete before new, non-recovery
 125       operations are permitted. In addition, only clients that were connected
 126       at the time of MDS failure are permitted to reconnect during the recovery
 127       window, to avoid the introduction of state changes that might conflict
 128       with what is being replayed by previously-connected clients.</para>
 129       <para>If multiple MDTs are in use, active-active failover
 130       is possible (e.g. two MDS nodes, each actively serving one or more
 131       different MDTs for the same filesystem). See
 132       <xref linkend="dbdoclet.mdtactiveactive"/> for more information.</para>
 133     </section>
 134     <section remap="h3">
 135       <title><indexterm><primary>recovery</primary><secondary>OST failure</secondary></indexterm>OST Failure (Failover)</title>
 136         <para>When an OST fails or has communication problems with the client, the default action is that the corresponding OSC enters recovery, and I/O requests going to that OST are blocked waiting for OST recovery or failover. It is possible to administratively mark the OSC as <emphasis>inactive</emphasis> on the client, in which case file operations that involve the failed OST will return an I/O error (<literal>-EIO</literal>). Otherwise, the application waits until the OST has recovered or the client process is interrupted (e.g., with <emphasis>CTRL-C</emphasis>).</para>
 137       <para>The MDS (via the LOV) detects that an OST is unavailable and skips it when assigning objects to new files. When the OST is restarted or re-establishes communication with the MDS, the MDS and OST automatically perform orphan recovery to destroy any objects that belong to files that were deleted while the OST was unavailable. For more information, see <xref linkend="troubleshootingrecovery"/> (Working with Orphaned Objects).</para>
 138       <para>While the OSC to OST operation recovery protocol is the same as that between the MDC and
 139         MDT using the Metadata Replay protocol, typically the OST commits bulk write operations to
 140         disk synchronously and each reply indicates that the request is already committed and the
 141         data does not need to be saved for recovery. In some cases, the OST replies to the client
 142         before the operation is committed to disk (e.g. truncate, destroy, setattr, and I/O
 143         operations in newer releases of the Lustre software), and normal replay and resend handling
 144         is done, including resending of the bulk writes. In this case, the client keeps a copy of
 145         the data available in memory until the server indicates that the write has committed to
 146         disk.</para>
 147       <para>To force an OST recovery, unmount the OST and then mount it again. If the OST was connected to clients before it failed, then a recovery process starts after the remount, enabling clients to reconnect to the OST and replay transactions in their queue. When the OST is in recovery mode, all new client connections are refused until the recovery finishes. The recovery is complete when either all previously-connected clients reconnect and their transactions are replayed or a client connection attempt times out. If a connection attempt times out, then all clients waiting to reconnect (and their transactions) are lost.</para>
 148       <note>
 149         <para>If you know an OST will not recover a previously-connected client (if, for example, the client has crashed), you can manually abort the recovery using this command:</para>
 150         <para><screen>oss# lctl --device <replaceable>lustre_device_number</replaceable> abort_recovery</screen></para>
 151         <para>To determine an OST&apos;s device number and device name, run the <literal>lctl dl</literal> command. Sample <literal>lctl dl</literal> command output is shown below:</para>
 152         <screen>7 UP obdfilter ddn_data-OST0009 ddn_data-OST0009_UUID 1159 </screen>
 153         <para>In this example, 7 is the OST device number. The device name is <literal>ddn_data-OST0009</literal>. In most instances, the device name can be used in place of the device number.</para>
 154       </note>
 155     </section>
 156     <section xml:id="networkpartition">
 157       <title><indexterm><primary>recovery</primary><secondary>network</secondary></indexterm>Network Partition</title>
 158       <para>Network failures may be transient. To avoid invoking recovery, the client initially tries to re-send any timed-out request to the server. If the resend also fails, the client tries to re-establish a connection to the server. Upon reconnecting, a client can tell that the partition was harmless if the server has not had any reason to evict it.</para>
 159       <para>If a request was processed by the server, but the reply was dropped (i.e., did not arrive back at the client), the server must reconstruct the reply when the client resends the request, rather than performing the same request twice.</para>
 160     </section>
 161     <section remap="h3">
 162       <title><indexterm><primary>recovery</primary><secondary>failed recovery</secondary></indexterm>Failed Recovery</title>
 163       <para>In the case of failed recovery, a client is evicted by the server and must reconnect after having flushed its saved state related to that server, as described in <xref linkend="clientevictions"/>, above. Failed recovery might occur for a number of reasons, including:</para>
 164       <itemizedlist>
 165         <listitem>
 166           <para> Failure of recovery</para>
 167           <itemizedlist>
 168             <listitem>
 169               <para> Recovery fails if the operations of one client directly depend on the operations of another client that failed to participate in recovery. Otherwise, Version Based Recovery (VBR) allows recovery to proceed for all of the connected clients, and only missing clients are evicted.</para>
 170             </listitem>
 171             <listitem>
 172               <para> Manual abort of recovery</para>
 173             </listitem>
 174           </itemizedlist>
 175         </listitem>
 176         <listitem>
 177           <para> Manual eviction by the administrator</para>
 178         </listitem>
 179       </itemizedlist>
 180     </section>
 181   </section>
 182   <section xml:id="metadatereplay">
 183     <title><indexterm><primary>recovery</primary><secondary>metadata replay</secondary></indexterm>Metadata Replay</title>
 184     <para>Highly available Lustre file system operation requires that the MDS have a peer configured
 185       for failover, including the use of a shared storage device for the MDT backing file system.
 186       When a client detects an MDS failure, it connects to the new MDS and uses the metadata replay
 187       protocol to replay its requests.</para>
 188     <para>Metadata replay ensures that the failover MDS re-accumulates state resulting from transactions whose effects were made visible to clients, but which were not committed to the disk.</para>
 189     <section remap="h3">
 190       <title>XID Numbers</title>
 191       <para>Each request sent by the client contains an XID number, which is a client-unique, monotonically increasing 64-bit integer. The initial value of the XID is chosen so that it is highly unlikely that the same client node reconnecting to the same server after a reboot would have the same XID sequence. The XID is used by the client to order all of the requests that it sends, until such a time that the request is assigned a transaction number. The XID is also used in Reply Reconstruction to uniquely identify per-client requests at the server.</para>
 192     </section>
 193     <section remap="h3">
 194       <title>Transaction Numbers</title>
 195       <para>Each client request processed by the server that involves any state change (metadata update, file open, write, etc., depending on server type) is assigned a transaction number by the server that is a target-unique, monotonically increasing, server-wide 64-bit integer. The transaction number for each file system-modifying request is sent back to the client along with the reply to that client request. The transaction numbers allow the client and server to unambiguously order every modification to the file system in case recovery is needed.</para>
 196       <para>Each reply sent to a client (regardless of request type) also contains the last
 197         committed transaction number that indicates the highest transaction number committed to the
 198         file system. The <literal>ldiskfs</literal> and <literal>ZFS</literal> backing file systems that the Lustre software
 199         uses enforce the requirement that any earlier disk operation will always be committed to
 200         disk before a later disk operation, so the last committed transaction number also reports
 201         that any requests with a lower transaction number have been committed to disk.</para>
 202     </section>
 203     <section remap="h3">
 204       <title>Replay and Resend</title>
 205       <para>Lustre file system recovery can be separated into two distinct types of operations:
 206           <emphasis>replay</emphasis> and <emphasis>resend</emphasis>.</para>
 207       <para><emphasis>Replay</emphasis> operations are those for which the client received a reply from the server that the operation had been successfully completed. These operations need to be redone in exactly the same manner after a server restart as had been reported before the server failed. Replay can only happen if the server failed; otherwise it will not have lost any state in memory.</para>
 208       <para><emphasis>Resend</emphasis> operations are those for which the client never received a reply, so their final state is unknown to the client. The client sends unanswered requests to the server again in XID order, and again awaits a reply for each one. In some cases, resent requests have been handled and committed to disk by the server (possibly also having dependent operations committed), in which case the server performs reply reconstruction for the lost reply. In other cases, the server did not receive the lost request at all and processing proceeds as with any normal request. Both of these cases can occur after a network interruption. It is also possible that the server received the request, but was unable to reply or commit it to disk before failure.</para>
 209     </section>
 210     <section remap="h3">
 211       <title>Client Replay List</title>
 212       <para>All file system-modifying requests have the potential to be required for server state recovery (replay) in case of a server failure. Replies that have an assigned transaction number that is higher than the last committed transaction number received in any reply from each server are preserved for later replay in a per-server replay list. As each reply is received from the server, it is checked to see if it has a higher last committed transaction number than the previous highest last committed number. Most requests that now have a lower transaction number can safely be removed from the replay list. One exception to this rule is for open requests, which need to be saved for replay until the file is closed so that the MDS can properly reference count open-unlinked files.</para>
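           <para>The pruning rule above can be sketched as a small in-memory model. This is only an illustration under stated assumptions, not the client implementation: the names <literal>Request</literal>, <literal>ReplayList</literal>, <literal>on_reply</literal> and <literal>on_close</literal> are hypothetical, and a real client tracks considerably more state per request.</para>
           <programlisting language="python"><![CDATA[
```python
from dataclasses import dataclass, field


@dataclass
class Request:
    xid: int                # client-assigned request identifier
    transno: int            # transaction number returned in the server's reply
    is_open: bool = False   # open requests are kept until the file is closed


@dataclass
class ReplayList:
    """Hypothetical per-server replay list kept by one client."""
    last_committed: int = 0
    requests: list = field(default_factory=list)

    def on_reply(self, request, last_committed):
        # Track the highest last committed transno seen in any reply.
        if last_committed > self.last_committed:
            self.last_committed = last_committed
        # Save the request only if it is not yet committed, or it is an open.
        if request.transno > self.last_committed or request.is_open:
            self.requests.append(request)
        self.prune()

    def prune(self):
        # Committed requests can be dropped; opens are the exception.
        self.requests = [
            r for r in self.requests
            if r.transno > self.last_committed or r.is_open
        ]

    def on_close(self, xid):
        # After close, the saved open is subject to the normal pruning rule.
        for r in self.requests:
            if r.xid == xid:
                r.is_open = False
        self.prune()
```
]]></programlisting>
           <para>In this sketch, an open with transaction number 10 stays on the list even after the server reports 11 as committed, and is dropped only once <literal>on_close</literal> is called for it.</para>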
 213     </section>
 214     <section remap="h3">
 215       <title>Server Recovery</title>
 216       <para>A server enters recovery if it was not shut down cleanly. If, upon startup, any client entries are in the <literal>last_rcvd</literal> file for any previously connected clients, the server enters recovery mode and waits for these previously-connected clients to reconnect and begin replaying or resending their requests. This allows the server to recreate state that was exposed to clients (a request that completed successfully) but was not committed to disk before failure.</para>
 217       <para>In the absence of any client connection attempts, the server waits indefinitely for the clients to reconnect. This is intended to handle the case where the server has a network problem and clients are unable to reconnect and/or if the server needs to be restarted repeatedly to resolve some problem with hardware or software. Once the server detects client connection attempts - either new clients or previously-connected clients - a recovery timer starts and forces recovery to finish in a finite time regardless of whether the previously-connected clients are available or not.</para>
 218       <para>If no client entries are present in the <literal>last_rcvd</literal> file, or if the administrator manually aborts recovery, the server does not wait for client reconnection and proceeds to allow all clients to connect.</para>
 219       <para>As clients connect, the server gathers information from each one to determine how long the recovery needs to take. Each client reports its connection UUID, and the server does a lookup for this UUID in the <literal>last_rcvd</literal> file to determine if this client was previously connected. If not, the client is refused connection and it will retry until recovery is completed. Each client reports its last seen transaction, so the server knows when all transactions have been replayed. The client also reports the amount of time that it was previously waiting for request completion so that the server can estimate how long some clients might need to detect the server failure and reconnect.</para>
 220       <para>If the client times out during replay, it attempts to reconnect. If the client is unable to reconnect, <literal>REPLAY</literal> fails and it returns to <literal>DISCON</literal> state. It is possible that clients will time out frequently during <literal>REPLAY</literal>, so reconnection should not delay an already slow process more than necessary. This can be mitigated by increasing the timeout during replay.</para>
 221     </section>
 222     <section remap="h3">
 223       <title>Request Replay</title>
 224       <para>If a client was previously connected, it gets a response from the server telling it that the server is in recovery and what the last committed transaction number on disk is. The client can then iterate through its replay list and use this last committed transaction number to prune any previously-committed requests. It replays any newer requests to the server in transaction number order, one at a time, waiting for a reply from the server before replaying the next request.</para>
 225       <para>Open requests that are on the replay list may have a transaction number lower than the server&apos;s last committed transaction number. The server processes those open requests immediately. The server then processes replayed requests from all of the clients in transaction number order, starting at the last committed transaction number to ensure that the state is updated on disk in exactly the same manner as it was before the crash. As each replayed request is processed, the last committed transaction is incremented. If the server receives a replay request from a client that is higher than the current last committed transaction, that request is put aside until other clients provide the intervening transactions. In this manner, the server replays requests in the same sequence as they were previously executed on the server until either all clients are out of requests to replay or there is a gap in a sequence.</para>
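           <para>The ordering behavior described above can be modeled in a few lines. The sketch below is illustrative only: <literal>ReplayServer</literal> and its methods are hypothetical names, and it assumes that transaction numbers on a target are consecutive integers, so that "the next request" is simply the last committed number plus one.</para>
           <programlisting language="python"><![CDATA[
```python
import heapq


class ReplayServer:
    """Hypothetical model of a target that is processing replayed requests."""

    def __init__(self, last_committed):
        self.last_committed = last_committed
        self.pending = []   # min-heap of (transno, client, payload) set aside
        self.applied = []   # payloads in the order they were re-executed

    def receive(self, client, transno, payload):
        if transno <= self.last_committed:
            return  # already on disk, nothing to replay
        heapq.heappush(self.pending, (transno, client, payload))
        self._drain()

    def _drain(self):
        # Apply requests strictly in transno order; stop at the first gap.
        while self.pending and self.pending[0][0] == self.last_committed + 1:
            transno, _client, payload = heapq.heappop(self.pending)
            self.applied.append(payload)
            self.last_committed = transno

    def gap(self):
        # Range of missing transaction numbers blocking replay, if any.
        if self.pending:
            return (self.last_committed + 1, self.pending[0][0] - 1)
        return None
```
]]></programlisting>
           <para>A request that arrives ahead of its turn is held in <literal>pending</literal> until another client supplies the intervening transactions. If none does, <literal>gap()</literal> reports the missing range.</para>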
 226     </section>
 227     <section remap="h3">
 228       <title>Gaps in the Replay Sequence</title>
 229       <para>In some cases, a gap may occur in the replay sequence. This might be caused by lost replies, where the request was processed and committed to disk but the reply was not received by the client. It can also be caused by clients missing from recovery due to partial network failure or client death.</para>
 230       <para>In the case where all clients have reconnected, but there is a gap in the replay sequence, the only possibility is that some requests were processed by the server but the reply was lost. Since the client must still have these requests in its resend list, they are processed after recovery is finished.</para>
 231       <para>In the case where not all clients have reconnected, it is likely that the failed clients had requests that will no longer be replayed. The VBR feature is used to determine if a request following a transaction gap is safe to be replayed. Each item in the file system (MDS inode or OST object) stores on disk the number of the last transaction in which it was modified. Each reply from the server contains the previous version number of the objects that it affects. During VBR replay, the server matches the previous version numbers in the replayed request against the current version number. If the versions match, the request is the next one that affects the object and can be safely replayed. For more information, see <xref linkend="versionbasedrecovery"/>.</para>
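           <para>A minimal sketch of the version comparison follows. It is an illustration under simplifying assumptions rather than the server code: <literal>VbrTarget</literal>, <literal>ReplayRequest</literal> and <literal>try_replay</literal> are hypothetical names, objects are identified by plain strings, and an object with no recorded version is treated as version 0.</para>
           <programlisting language="python"><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class ReplayRequest:
    transno: int
    # object id -> version the object had before the original operation
    pre_versions: dict


class VbrTarget:
    """Hypothetical target that stores one version per object."""

    def __init__(self, versions):
        # object id -> transno of the last transaction that modified it
        self.versions = dict(versions)

    def try_replay(self, req):
        # Every affected object must still be at the version the client saw.
        for obj, pre in req.pre_versions.items():
            if self.versions.get(obj, 0) != pre:
                return False  # an intervening change was lost; unsafe
        # Safe: re-execute and stamp the objects with this transno.
        for obj in req.pre_versions:
            self.versions[obj] = req.transno
        return True
```
]]></programlisting>
           <para>For example, if a missing client's transaction 42 changed only <literal>fileA</literal>, a later request touching only <literal>fileB</literal> still matches the on-disk version and replays, while a request that expects <literal>fileA</literal> at version 42 is rejected.</para>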
 232     </section>
 233     <section remap="h3">
 234       <title><indexterm><primary>recovery</primary><secondary>locks</secondary></indexterm>Lock Recovery</title>
 235       <para>If all requests were replayed successfully and all clients reconnected, clients then
 236         replay their locks; that is, every client sends information about every lock it holds from
 237         this server and its state (whether it was granted or not, what mode, what properties and so
 238         on), and then recovery completes successfully. Currently, the Lustre software does not do
 239         lock verification and just trusts clients to present an accurate lock state. This does not
 240         introduce any security concerns since Lustre software release 1.x clients are trusted for other
 241         information (e.g. user ID) during normal operation also.</para>
 242       <para>After all of the saved requests and locks have been replayed, the client sends an <literal>MDS_GETSTATUS</literal> request with last-replay flag set. The reply to that request is held back until all clients have completed replay (sent the same flagged getstatus request), so that clients don&apos;t send non-recovery requests before recovery is complete.</para>
 243     </section>
 244     <section remap="h3">
 245       <title>Request Resend</title>
 246       <para>Once all of the previously-shared state has been recovered on the server (the target file system is up-to-date with client cache and the server has recreated locks representing the locks held by the client), the client can resend any requests that did not receive an earlier reply. This processing is done like normal request processing, and, in some cases, the server may do reply reconstruction.</para>
 247     </section>
 248   </section>
 249   <section xml:id="replyreconstruction">
 250     <title>Reply Reconstruction</title>
 251     <para>When a reply is dropped, the MDS needs to be able to reconstruct the reply when the original request is re-sent. This must be done without repeating any non-idempotent operations, while preserving the integrity of the locking system. In the event of MDS failover, the information used to reconstruct the reply must be serialized on the disk in transactions that are joined or nested with those operating on the disk.</para>
 252     <section remap="h3">
 253       <title>Required State</title>
 254       <para>For the majority of requests, it is sufficient for the server to store three pieces of data in the <literal>last_rcvd</literal> file:</para>
 255       <itemizedlist>
 256         <listitem>
 257           <para> XID of the request</para>
 258         </listitem>
 259         <listitem>
 260           <para> Resulting transno (if any)</para>
 261         </listitem>
 262         <listitem>
 263           <para> Result code (<literal>req-&gt;rq_status</literal>)</para>
 264         </listitem>
 265       </itemizedlist>
 266       <para>For open requests, the &quot;disposition&quot; of the open must also be stored.</para>
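           <para>The way these stored fields let the server answer a resent request without repeating the operation can be sketched as follows. This is a simplified in-memory model, not the on-disk <literal>last_rcvd</literal> format: <literal>ReplyCache</literal> and <literal>handle</literal> are hypothetical names, only one saved reply per client is modeled, and the open disposition is omitted.</para>
           <programlisting language="python"><![CDATA[
```python
class ReplyCache:
    """Hypothetical model: one saved reply (xid, transno, status) per client."""

    def __init__(self):
        self.last_reply = {}    # client uuid -> (xid, transno, status)
        self.next_transno = 1

    def handle(self, client, xid, operation):
        saved = self.last_reply.get(client)
        if saved is not None and saved[0] == xid:
            # Resent request: rebuild the reply from saved state and
            # do not execute the non-idempotent operation a second time.
            return saved
        status = operation()            # execute the request once
        transno = self.next_transno
        self.next_transno += 1
        reply = (xid, transno, status)
        self.last_reply[client] = reply
        return reply
```
]]></programlisting>
           <para>Calling <literal>handle</literal> twice with the same client and XID returns the same reply tuple while running the operation only once.</para>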
 267     </section>
 268     <section remap="h3">
 269       <title>Reconstruction of Open Replies</title>
 270       <para>An open reply consists of up to three pieces of information (in addition to the contents of the &quot;request log&quot;):</para>
 271       <itemizedlist>
 272         <listitem>
 273           <para>File handle</para>
 274         </listitem>
 275         <listitem>
 276           <para>Lock handle</para>
 277         </listitem>
 278         <listitem>
 279           <para><literal>mds_body</literal> with information about the file created (for <literal>O_CREAT</literal>)</para>
 280         </listitem>
 281       </itemizedlist>
 282       <para>The disposition, status and request data (re-sent intact by the client) are sufficient to determine which type of lock handle was granted, whether an open file handle was created, and which resource should be described in the <literal>mds_body</literal>.</para>
 283       <section remap="h5">
 284         <title>Finding the File Handle</title>
 285         <para>The file handle can be found using the XID of the request and the list of per-export open file handles.</para>
 286       </section>
 287       <section remap="h5">
 288         <title>Finding the Resource/fid</title>
 289         <para>The file handle contains the resource/fid.</para>
 290       </section>
 291       <section remap="h5">
 292         <title>Finding the Lock Handle</title>
 293         <para>The lock handle can be found by walking the list of granted locks for the resource looking for one with the appropriate remote file handle (present in the re-sent request). Verify that the lock has the right mode (determined by performing the disposition/request/status analysis above) and is granted to the proper client.</para>
 294       </section>
 295     </section>
 296     <section remap="h3" condition="l28">
 297       <title>Multiple Reply Data per Client</title>
 298       <para>Since Lustre 2.8, the MDS is able to save multiple sets of reply data per client. The reply data is stored in the <literal>reply_data</literal> internal file of the MDT. In addition to the XID of the request, the transaction number, the result code and the open "disposition", the reply data contains a generation number that identifies the client, based on the content of the <literal>last_rcvd</literal> file.</para>
 299     </section>
 300   </section>
 301   <section xml:id="versionbasedrecovery">
 302     <title><indexterm><primary>Version-based recovery (VBR)</primary></indexterm>Version-based Recovery</title>
 303     <para>The Version-based Recovery (VBR) feature improves Lustre file system reliability in cases
 304       where client requests (RPCs) fail to replay during recovery <footnote>
 305         <para>There are two scenarios under which client RPCs are not replayed: (1) Non-functioning
 306           or isolated clients do not reconnect, and they cannot replay their RPCs, causing a gap in
 307           the replay sequence. These clients get errors and are evicted. (2) Functioning clients
 308           connect, but they cannot replay some or all of their RPCs that occurred after the gap
 309           caused by the non-functioning/isolated clients. These clients get errors (caused by the
 310           failed clients). With VBR, these requests have a better chance to replay because the
 311           &quot;gaps&quot; are only related to specific files that the missing client(s)
 312           changed.</para>
 313       </footnote>.</para>
 314     <para>In pre-VBR releases of the Lustre software, if the MGS or an OST went down and then
 315       recovered, a recovery process was triggered in which clients attempted to replay their
 316       requests. Clients were only allowed to replay RPCs in serial order. If a particular client
 317       could not replay its requests, then those requests were lost as well as the requests of
 318       clients later in the sequence. The &apos;&apos;downstream&apos;&apos; clients never got to
 319       replay their requests because of the wait on the earlier client&apos;s RPCs. Eventually, the
 320       recovery period would time out (so the component could accept new requests), leaving some
 321       number of clients evicted and their requests and data lost.</para>
 322     <para>With VBR, the recovery mechanism does not result in the loss of clients or their data, because changes in inode versions are tracked, and more clients are able to reintegrate into the cluster. With VBR, inode tracking looks like this:</para>
 323     <itemizedlist>
 324       <listitem>
 325         <para>Each inode<footnote>
 326             <para>Usually, there are two inodes, a parent and a child.</para>
 327           </footnote> stores a version, that is, the number of the last transaction (transno) in which the inode was changed.</para>
 328       </listitem>
 329       <listitem>
 330         <para>When an inode is about to be changed, a pre-operation version of the inode is saved in the client&apos;s data.</para>
 331       </listitem>
 332       <listitem>
 333         <para>The client keeps the pre-operation inode version and the post-operation version (transaction number) for replay, and sends them in the event of a server failure.</para>
 334       </listitem>
 335       <listitem>
 336         <para>If the pre-operation version matches, then the request is replayed. The post-operation version is assigned on all inodes modified in the request.</para>
 337       </listitem>
 338     </itemizedlist>
 339     <note>
 340       <para>An RPC can contain up to four pre-operation versions, because several inodes can be involved in an operation. In the case of a &apos;&apos;rename&apos;&apos; operation, four different inodes can be modified.</para>
 341     </note>
 342     <para>During normal operation, the server:</para>
 343     <itemizedlist>
 344       <listitem>
 345         <para>Updates the versions of all inodes involved in a given operation</para>
 346       </listitem>
 347       <listitem>
 348         <para>Returns the old and new inode versions to the client with the reply</para>
 349       </listitem>
 350     </itemizedlist>
 351     <para>When the recovery mechanism is underway, VBR follows these steps:</para>
 352     <orderedlist>
 353       <listitem>
        <para>VBR only allows clients to replay transactions if the affected inodes have the same version as during the original execution of the transactions, even if there is a gap in transactions due to a missed client.</para>
 355       </listitem>
 356       <listitem>
 357         <para>The server attempts to execute every transaction that the client offers, even if it encounters a re-integration failure.</para>
 358       </listitem>
 359       <listitem>
 360         <para>When the replay is complete, the client and server check if a replay failed on any transaction because of inode version mismatch. If the versions match, the client gets a successful re-integration message. If the versions do not match, then the client is evicted.</para>
 361       </listitem>
 362     </orderedlist>
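    <para>The version check described above can be illustrated with a minimal sketch. This is a simplified model rather than the actual server code: the function names, the dictionary layout, and the inode names are assumptions made for illustration only.</para>
    <programlisting language="python">
# Hypothetical model of version-based replay; all names are illustrative.

def replay_request(inode_versions, request):
    """Replay one request if every pre-operation version still matches.

    inode_versions: dict mapping an inode name to its current version (transno).
    request: dict with "transno" and "pre_versions" (inode name to version).
    Returns True if the request was replayed, False on a version mismatch.
    """
    for inode, pre_version in request["pre_versions"].items():
        if inode_versions.get(inode) != pre_version:
            return False  # version mismatch: an earlier change is missing
    # All versions match: apply the request and stamp the post-operation
    # version (the transaction number) on every inode it modified.
    for inode in request["pre_versions"]:
        inode_versions[inode] = request["transno"]
    return True


def replay_client(inode_versions, requests):
    """Attempt every offered request; evict the client if any replay failed."""
    results = [replay_request(inode_versions, r) for r in requests]
    return "recovered" if all(results) else "evicted"
</programlisting>
    <para>In this model, a client whose requests touch only inodes that no missing client changed sees matching versions and is recovered. A client whose request depends on a lost change sees a mismatch on that inode and is evicted.</para>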
 363     <para>VBR recovery is fully transparent to users. It may lead to slightly longer recovery times if the cluster loses several clients during server recovery.</para>
 364     <section remap="h3">
 365         <title><indexterm><primary>Version-based recovery (VBR)</primary><secondary>messages</secondary></indexterm>VBR Messages</title>
 366       <para>The VBR feature is built into the Lustre file system recovery functionality. It cannot
 367         be disabled. These are some VBR messages that may be displayed:</para>
 368       <screen>DEBUG_REQ(D_WARNING, req, &quot;Version mismatch during replay\n&quot;);</screen>
 369       <para>This message indicates why the client was evicted. No action is needed.</para>
 370       <screen>CWARN(&quot;%s: version recovery fails, reconnecting\n&quot;);</screen>
 371       <para>This message indicates why the recovery failed. No action is needed.</para>
 372     </section>
 373     <section remap="h3">
 374         <title><indexterm><primary>Version-based recovery (VBR)</primary><secondary>tips</secondary></indexterm>Tips for Using VBR</title>
      <para>VBR will be successful for clients that do not share data with other clients. Therefore, the strategy for reliable use of VBR is to store a client&apos;s data in its own directory, where possible. VBR can recover these clients, even if other clients are lost.</para>
 376     </section>
 377   </section>
 378   <section xml:id="commitonshare">
 379     <title><indexterm><primary>commit on share</primary></indexterm>Commit on Share</title>
 380     <para>The commit-on-share (COS) feature makes Lustre file system recovery more reliable by
 381       preventing missing clients from causing cascading evictions of other clients. With COS
 382       enabled, if some Lustre clients miss the recovery window after a reboot or a server failure,
 383       the remaining clients are not evicted.</para>
 384     <note>
      <para>The commit-on-share feature is enabled by default.</para>
 386     </note>
 387     <section remap="h3">
 388       <title><indexterm><primary>commit on share</primary><secondary>working with</secondary></indexterm>Working with Commit on Share</title>
 389       <para>To illustrate how COS works, let&apos;s first look at the old recovery scenario. After a service restart, the MDS would boot and enter recovery mode. Clients began reconnecting and replaying their uncommitted transactions. Clients could replay transactions independently as long as their transactions did not depend on each other (one client&apos;s transactions did not depend on a different client&apos;s transactions). The MDS is able to determine whether one transaction is dependent on another transaction via the <xref linkend="versionbasedrecovery"/> feature.</para>
 390       <para>If there was a dependency between client transactions (for example, creating and deleting the same file), and one or more clients did not reconnect in time, then some clients may have been evicted because their transactions depended on transactions from the missing clients. Evictions of those clients caused more clients to be evicted and so on, resulting in &quot;cascading&quot; client evictions.</para>
 391       <para>COS addresses the problem of cascading evictions by eliminating dependent transactions between clients. It ensures that one transaction is committed to disk if another client performs a transaction dependent on the first one. With no dependent, uncommitted transactions to apply, the clients replay their requests independently without the risk of being evicted.</para>
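      <para>The idea can be sketched with a small model. This is a hypothetical illustration, not the MDS implementation: the <literal>Server</literal> class, its fields, and the object names are assumptions chosen to show when a synchronous commit is forced.</para>
      <programlisting language="python">
# Hypothetical model of commit on share; not the actual MDS code.

class Server:
    def __init__(self, cos_enabled=True):
        self.cos_enabled = cos_enabled
        self.uncommitted = {}   # object name -> client with an uncommitted change
        self.sync_commits = 0   # number of forced synchronous commits

    def commit(self):
        """Flush all pending changes to disk."""
        self.uncommitted.clear()

    def modify(self, client, obj):
        """Apply a change made by `client` to `obj`."""
        owner = self.uncommitted.get(obj)
        if self.cos_enabled and owner is not None and owner != client:
            # Another client's change to this object is not yet on disk.
            # Commit it first so the new change never depends on a replay.
            self.commit()
            self.sync_commits += 1
        self.uncommitted[obj] = client
</programlisting>
      <para>In this model, repeated changes by one client never force a commit. A commit is forced only when a second client touches an object that still has an uncommitted change from a different client.</para>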
 392     </section>
 393     <section remap="h3">
 394       <title><indexterm><primary>commit on share</primary><secondary>tuning</secondary></indexterm>Tuning Commit On Share</title>
 395       <para>Commit on Share can be enabled or disabled using the <literal>mdt.commit_on_sharing</literal> tunable (0/1). This tunable can be set when the MDS is created (<literal>mkfs.lustre</literal>) or when the Lustre file system is active, using the <literal>lctl set/get_param</literal> or <literal>lctl conf_param</literal> commands.</para>
 396       <para>To set a default value for COS (disable/enable) when the file system is created, use:</para>
 397       <screen>--param mdt.commit_on_sharing=0/1
 398 </screen>
 399       <para>To disable or enable COS when the file system is running, use:</para>
 400       <screen>lctl set_param mdt.*.commit_on_sharing=0/1
 401 </screen>
 402       <note>
 403         <para>Enabling COS may cause the MDS to do a large number of synchronous disk operations, hurting performance. Placing the <literal>ldiskfs</literal> journal on a low-latency external device may improve file system performance.</para>
 404       </note>
 405     </section>
 406   </section>
 407    <section xml:id="imperativerecovery">
 408     <title><indexterm><primary>imperative recovery</primary></indexterm>Imperative Recovery</title>
      <para>Large-scale Lustre file systems will experience server hardware
 410       failures over their lifetime, and it is important that servers can
 411       recover in a timely manner after such failures.  High Availability
 412       software can move storage targets over to a backup server automatically.
 413       Clients can detect the server failure by RPC timeouts, which must be
 414       scaled with system size to prevent false diagnosis of server death in
 415       cases of heavy load. The purpose of imperative recovery is to reduce
 416       the recovery window by actively informing clients of server failure.
 417       The resulting reduction in the recovery window will minimize target
 418       downtime and therefore increase overall system availability.</para>
 419       <para>
 420       Imperative Recovery does not remove previous recovery mechanisms, and client timeout-based
 421       recovery actions can occur in a cluster when IR is enabled as each client can still
 422       independently disconnect and reconnect from a target. In case of a mix of IR and non-IR
 423       clients connecting to an OST or MDT, the server cannot reduce its recovery timeout window,
 424       because it cannot be sure that all clients have been notified of the server restart in a
 425       timely manner. Even in such mixed environments the time to complete recovery may be reduced,
 426       since IR-enabled clients will still be notified to reconnect to the server promptly and allow
 427       recovery to complete as soon as the last non-IR client detects the server failure.</para>
 428         <section remap="h3">
 429          <title><indexterm><primary>imperative recovery</primary><secondary>MGS role</secondary></indexterm>MGS role</title>
 430         <para>The MGS now holds additional information about Lustre targets, in the form of a Target Status
 431         Table. Whenever a target registers with the MGS, there is a corresponding entry in this
 432         table identifying the target. This entry includes NID information, and state/version
 433         information for the target. When a client mounts the file system, it caches a locked copy of
 434         this table, in the form of a Lustre configuration log. When a target restart occurs, the MGS
        revokes the client lock, forcing all clients to reload the table. Any new targets will have
        an updated version number; the client detects this and reconnects to the restarted target.
 437         Since successful IR notification of server restart depends on all clients being registered
 438         with the MGS, and there is no other node to notify clients in case of MGS restart, the MGS
 439         will disable IR for a period when it first starts. This interval is configurable, as shown
        in <xref linkend="imperativerecoveryparameters"/>.</para>
 441         <para>Because of the increased importance of the MGS in recovery, it is strongly recommended that the MGS node be separate from the MDS. If the MGS is co-located on the MDS node, then in case of MDS/MGS failure there will be no IR notification for the MDS restart, and clients will always use timeout-based recovery for the MDS.  IR notification would still be used in the case of OSS failure and recovery.</para>
 442         <para>Unfortunately, it’s impossible for the MGS to know how many clients have been successfully notified or whether a specific client has received the restarting target information. The only thing the MGS can do is tell the target that, for example, all clients are imperative recovery-capable, so it is not necessary to wait as long for all clients to reconnect. For this reason, we still require a timeout policy on the target side, but this timeout value can be much shorter than normal recovery. </para>
 443         </section>
 444         <section remap="h3" xml:id="imperativerecoveryparameters">
 445         <title><indexterm><primary>imperative recovery</primary><secondary>Tuning</secondary></indexterm>Tuning Imperative Recovery</title>
 446         <para>Imperative recovery has a default parameter set which means it can work without any extra configuration. However, the default parameter set only fits a generic configuration. The following sections discuss the configuration items for imperative recovery.</para>
 447         <section remap="h5">
 448         <title>ir_factor</title>
        <para><literal>ir_factor</literal> controls the targets&apos; recovery window. If imperative recovery is enabled, the recovery timeout window on the restarting target is calculated as: <emphasis>new timeout = recovery_time * ir_factor / 10</emphasis>. <literal>ir_factor</literal> must be a value in the range [1, 10]. The default value is 5. The following example sets the imperative recovery timeout to 80% of the normal recovery timeout on the target testfs-OST0000:</para>
 450 <screen>lctl conf_param obdfilter.testfs-OST0000.ir_factor=8</screen>
                <note> <para>If this value is too small for the system, clients may be unnecessarily evicted.</para> </note>
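        <para>As a worked example of the formula, the following sketch computes the shortened window. The function name and the <literal>recovery_time</literal> value of 300 seconds are assumptions for illustration, not Lustre defaults.</para>
        <programlisting language="python">
def ir_recovery_timeout(recovery_time, ir_factor=5):
    """Return recovery_time * ir_factor / 10, the shortened recovery window."""
    if ir_factor not in range(1, 11):
        raise ValueError("ir_factor must be in the range [1, 10]")
    return recovery_time * ir_factor / 10


# A recovery_time of 300 seconds is an assumed value.
print(ir_recovery_timeout(300))      # default ir_factor=5 gives 150.0
print(ir_recovery_timeout(300, 8))   # ir_factor=8 gives 240.0
</programlisting>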
 452 <para>You can read the current value of the parameter in the standard manner with <emphasis>lctl get_param</emphasis>:</para>
 453         <screen>
 454 # lctl get_param obdfilter.testfs-OST0000.ir_factor
obdfilter.testfs-OST0000.ir_factor=8
 456 </screen>
 457         </section>
 458         <section remap="h5">
 459         <title>Disabling Imperative Recovery</title>
 460         <para>Imperative recovery can be disabled manually by a mount option. For example, imperative recovery can be disabled on an OST by:</para>
 461         <screen># mount -t lustre -onoir /dev/sda /mnt/ost1</screen>
 462         <para>Imperative recovery can also be disabled on the client side with the same mount option:</para>
 463         <screen># mount -t lustre -onoir mymgsnid@tcp:/testfs /mnt/testfs</screen>
        <note><para>When imperative recovery is disabled on even a single client in this manner, the MGS will deactivate imperative recovery for the whole cluster. IR-enabled clients will still get notification of target restart, but targets will not be allowed to shorten the recovery window.</para></note>
        <para>You can also disable imperative recovery globally on the MGS by writing <literal>state=disabled</literal> to the controlling procfs entry:</para>
        <screen># lctl set_param mgs.MGS.live.testfs="state=disabled"</screen>
        <para>The above command disables imperative recovery for the file system named <emphasis>testfs</emphasis>.</para>
 468         </section>
 469         <section remap="h5">
 470         <title>Checking Imperative Recovery State - MGS</title>
        <para>You can get the imperative recovery state from the MGS. The following example shows the output, and the table after it explains each item:</para>
 472 <screen>
 473 [mgs]$ lctl get_param mgs.MGS.live.testfs
 474 ...
 475 imperative_recovery_state:
 476     state: full
 477     nonir_clients: 0
 478     nidtbl_version: 242
 479     notify_duration_total: 0.470000
 480     notify_duation_max: 0.041000
 481     notify_count: 38
 482 </screen>
 483 <informaltable frame="all">
 484         <tgroup cols="2">
 485         <colspec colname="c1" colwidth="50*"/>
 486         <colspec colname="c2" colwidth="50*"/>
 487         <thead>
 488                 <row>
 489                 <entry>
 490                 <para><emphasis role="bold">Item</emphasis></para>
 491                 </entry>
 492                 <entry>
 493                 <para><emphasis role="bold">Meaning</emphasis></para>
 494                 </entry>
 495                 </row>
 496         </thead>
 497         <tbody>
 498                 <row>
 499                 <entry>
 500                         <para><emphasis role="bold">
 501                         <literal>state</literal>
 502                         </emphasis></para>
 503                 </entry>
 504                 <entry>
 505                         <para><itemizedlist>
 506                         <listitem>
 507                         <para><emphasis role="bold">full: </emphasis>IR is working, all clients are connected and can be notified.</para>
 508                         </listitem>
 509                         <listitem>
 510                         <para><emphasis role="bold">partial: </emphasis>some clients are not IR capable.</para>
 511                         </listitem>
 512                         <listitem>
 513                         <para><emphasis role="bold">disabled: </emphasis>IR is disabled, no client notification.</para>
 514                         </listitem>
 515                         <listitem>
 516                         <para><emphasis role="bold">startup: </emphasis>the MGS was just restarted, so not all clients may reconnect to the MGS.</para>
 517                         </listitem>
 518                         </itemizedlist></para>
 519                 </entry>
 520                 </row>
 521                 <row>
 522                 <entry>
 523                         <para><emphasis role="bold">
 524                         <literal>nonir_clients</literal>
 525                         </emphasis></para>
 526                 </entry>
 527                 <entry>
 528                         <para>Number of non-IR capable clients in the system.</para>
 529                 </entry>
 530                 </row>
 531                 <row>
 532                 <entry>
 533                         <para><emphasis role="bold">
 534                         <literal>nidtbl_version</literal>
 535                         </emphasis></para>
 536                 </entry>
 537                 <entry>
 538                         <para>Version number of the target status table. Client version must match MGS.</para>
 539                 </entry>
 540                 </row>
 541                 <row>
 542                 <entry>
 543                         <para><emphasis role="bold">
 544                         <literal>notify_duration_total</literal>
 545                         </emphasis></para>
 546                 </entry>
 547                 <entry>
 548                         <para>[Seconds.microseconds] Total time spent by MGS notifying clients</para>
 549                 </entry>
 550                 </row>
 551                 <row>
 552                 <entry>
 553                         <para><emphasis role="bold">
 554                         <literal>notify_duration_max</literal>
 555                         </emphasis></para>
 556                 </entry>
 557                 <entry>
 558                         <para>[Seconds.microseconds] Maximum notification time for the MGS to notify a single IR client.</para>
 559                 </entry>
 560                 </row>
 561                 <row>
 562                 <entry>
 563                         <para><emphasis role="bold">
 564                         <literal>notify_count</literal>
 565                         </emphasis></para>
 566                 </entry>
 567                 <entry>
                        <para>Number of MGS restarts. To obtain the average notification time, divide <literal>notify_duration_total</literal> by <literal>notify_count</literal>.</para>
 569                 </entry>
 570                 </row>
 571         </tbody>
 572         </tgroup>
 573 </informaltable>
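        <para>The average notification time can be derived from this output with a short script. The following is a minimal sketch that assumes the output has been captured as text in the <quote>key: value</quote> layout of the example above; the function names are hypothetical.</para>
        <programlisting language="python">
def parse_ir_state(text):
    """Collect the "key: value" lines of the state output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():
            fields[key] = value.strip()
    return fields


def average_notify_time(fields):
    """Divide notify_duration_total by notify_count (0.0 if the count is 0)."""
    count = int(fields["notify_count"])
    if count == 0:
        return 0.0
    return float(fields["notify_duration_total"]) / count


# Values copied from the example output above.
sample = """imperative_recovery_state:
    state: full
    nonir_clients: 0
    nidtbl_version: 242
    notify_duration_total: 0.470000
    notify_count: 38
"""
print(round(average_notify_time(parse_ir_state(sample)), 4))
</programlisting>
        <para>For the values shown, 0.47 seconds divided by 38 gives an average of about 0.0124 seconds per notification.</para>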
 574
 575         </section>
 576         <section remap="h5">
 577         <title>Checking Imperative Recovery State - client</title>
        <para>A <quote>client</quote> in IR means a Lustre client or an MDT. You can get the IR state on
          any node that is running a client or an MDT, because those nodes always have an MGC
          running. An example from a client:</para>
 581         <screen>
 582 [client]$ lctl get_param mgc.*.ir_state
 583 mgc.MGC192.168.127.6@tcp.ir_state=
 584 imperative_recovery: ON
 585 client_state:
 586     - { client: testfs-client, nidtbl_version: 242 }
 587         </screen>
        <para>An example from an MDT:</para>
 589         <screen>
 590 mgc.MGC192.168.127.6@tcp.ir_state=
 591 imperative_recovery: ON
 592 client_state:
 593     - { client: testfs-MDT0000, nidtbl_version: 242 }
 594         </screen>
 595 <informaltable frame="all">
 596         <tgroup cols="2">
 597         <colspec colname="c1" colwidth="50*"/>
 598         <colspec colname="c2" colwidth="50*"/>
 599         <thead>
 600                 <row>
 601                 <entry>
 602                 <para><emphasis role="bold">Item</emphasis></para>
 603                 </entry>
 604                 <entry>
 605                 <para><emphasis role="bold">Meaning</emphasis></para>
 606                 </entry>
 607                 </row>
 608         </thead>
 609         <tbody>
 610                 <row>
 611                 <entry>
 612                         <para><emphasis role="bold">
 613                         <literal>imperative_recovery</literal>
 614                         </emphasis></para>
 615                 </entry>
 616                 <entry>
                        <para><literal>imperative_recovery</literal> can be ON or OFF. If it is OFF, IR was disabled by the administrator at mount time. Normally it should be ON.</para>
 618                 </entry>
 619                 </row>
 620                 <row>
 621                 <entry>
 622                         <para><emphasis role="bold">
 623                         <literal>client_state: client:</literal>
 624                         </emphasis></para>
 625                 </entry>
 626                 <entry>
 627                         <para>The name of the client</para>
 628                 </entry>
 629                 </row>
 630                 <row>
 631                 <entry>
 632                         <para><emphasis role="bold">
 633                         <literal>client_state: nidtbl_version</literal>
 634                         </emphasis></para>
 635                 </entry>
 636                 <entry>
 637                         <para>Version number of the target status table. Client version must match MGS.</para>
 638                 </entry>
 639                 </row>
 640         </tbody>
 641         </tgroup>
 642 </informaltable>
 643         </section>
 644         <section remap="h5">
 645         <title>Target Instance Number</title>
        <para>The target instance number is used to determine whether a client is connecting to the latest instance of a target. The lowest 32 bits of the mount count are used as the target instance number. For an OST, you can get the target instance number of testfs-OST0001 as follows (the command is run from an OSS login prompt):</para>
 647 <screen>
 648 $ lctl get_param obdfilter.testfs-OST0001*.instance
 649 obdfilter.testfs-OST0001.instance=5
 650 </screen>
 651         <para>From a client, query the relevant OSC:</para>
 652 <screen>
 653 $ lctl get_param osc.testfs-OST0001-osc-*.import |grep instance
 654     instance: 5
 655 </screen>
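        <para>The relationship between the mount count and the instance number can be sketched as follows. The function names are hypothetical, and the comparison only models the check described above; it is not taken from the Lustre source.</para>
        <programlisting language="python">
def target_instance(mount_count):
    """Keep only the lowest 32 bits of the mount count."""
    return mount_count % 2**32


def client_is_current(client_instance, mount_count):
    """True if the client's import refers to the latest target instance."""
    return client_instance == target_instance(mount_count)


print(target_instance(5))           # 5, as in the example output above
print(client_is_current(5, 5))      # True: client sees the latest instance
print(client_is_current(4, 5))      # False: the target restarted since
</programlisting>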
 656         </section>
 657         </section>
 658         <section remap="h3" xml:id="imperativerecoveryrecomendations">
 659         <title><indexterm><primary>imperative recovery</primary><secondary>Configuration Suggestions</secondary></indexterm>Configuration Suggestions for Imperative Recovery</title>
          <para>The MGS and MDT0000 were traditionally built on the same target
            to save a server node. However, to make IR work efficiently, we
            strongly recommend running the MGS on a separate node for any
            significant Lustre file system installation. There are three main
            advantages of doing this:</para>
          <orderedlist>
            <listitem><para>Clients can be notified when MDT0000 recovers.
            </para></listitem>
 668             <listitem><para>Improved load balance. The load on the MDS may be
 669               very high which may make the MGS unable to notify the clients in
 670               time.</para></listitem>
 671             <listitem><para>Robustness. The MGS code is simpler and much smaller
 672               compared to the MDS code. This means the chance of an MGS downtime
 673               due to a software bug is very low.
 674             </para></listitem>
 675             </orderedlist>
 676         </section>
 677   </section>
 678
 679   <section xml:id="suppressingpings">
 680   <title><indexterm><primary>suppressing pings</primary></indexterm>Suppressing Pings</title>
 681     <para>On clusters with large numbers of clients and OSTs,
 682       <literal>OBD_PING</literal> messages may impose significant performance
 683       overheads. There is an option to suppress pings, allowing ping overheads
 684       to be considerably reduced. Before turning on this option, administrators
 685       should consider the following requirements and understand the trade-offs
 686       involved:</para>
 687     <itemizedlist>
 688       <listitem>
 689         <para>When suppressing pings, a server cannot detect client deaths,
 690           since clients do not send pings that are only to keep their
 691           connections alive. Therefore, a mechanism external to the Lustre
 692           file system shall be set up to notify Lustre targets of client
 693           deaths in a timely manner, so that stale connections do not exist
 694           for too long and lock callbacks to
 695           dead clients do not always have to wait for timeouts.</para>
 696       </listitem>
 697       <listitem>
        <para>Without pings, a client has to rely on Imperative Recovery to notify it of target failures in order to join recoveries in time.  This dictates that the client shall eagerly keep its MGS connection alive.  Thus, a highly available standalone MGS is recommended, and MGS pings are always sent regardless of how the option is set.</para>
 699       </listitem>
 700       <listitem>
 701         <para>If a client has uncommitted requests to a target and it is not sending any new requests on the connection, it will still ping that target even when pings should be suppressed.  This is because the client needs to query the target's last committed transaction numbers in order to free up local uncommitted requests (and possibly other resources associated).  However, these pings shall stop as soon as all the uncommitted requests have been freed or new requests need to be sent, rendering the pings unnecessary.</para>
 702       </listitem>
 703     </itemizedlist>
 704     <section remap="h3">
 705     <title><indexterm><primary>pings</primary><secondary>suppress_pings</secondary></indexterm>"suppress_pings" Kernel Module Parameter</title>
      <para>The option that controls whether pings are suppressed is
        implemented as the ptlrpc kernel module parameter "suppress_pings".
        Setting it to "1" on a server turns on ping suppression for all
        targets on that server, while leaving it at the default value "0"
        keeps the previous pinging behavior.  The parameter is ignored on
        clients and the MGS.  While it is recommended to set the parameter
        persistently via the modprobe.conf(5) mechanism, it also accepts online
        changes through sysfs.  Note that an online change only affects
        connections established later; the pinging behavior of existing
        connections stays the same.
 715       </para>
 716     </section>
 717     <section remap="h3">
 718     <title><indexterm><primary>pings</primary><secondary>evict_client</secondary></indexterm>Client Death Notification</title>
 719       <para>The required external client death notification shall write UUIDs
 720       of dead clients into targets' <literal>evict_client</literal> procfs
 721       entries in order to remove stale clients from recovery.</para>
      <para>A client UUID can be obtained from its <literal>uuid</literal>
      procfs entry, and that UUID can be used to evict the client, for example:
 724       </para>
 725 <screen>
 726 client$ lctl get_param llite.testfs-*.uuid
 727 llite.testfs-ffff991ae1992000.uuid=dd599d28-0a85-a9e4-82cd-dc6357a42c77
 728 oss# lctl set_param obdfilter.testfs-*.evict_client=dd599d28-0a85-a9e4-82cd-dc6357a42c77
 729 mds# lctl set_param mdt.testfs-*.evict_client=dd599d28-0a85-a9e4-82cd-dc6357a42c77
 730 </screen>
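      <para>An external notification mechanism could generate these commands for each dead client. The following is a minimal sketch that only builds the command strings shown above and does not run them; the function name and the idea of grouping commands by server role are assumptions for illustration.</para>
      <programlisting language="python">
def evict_commands(fsname, client_uuid):
    """Build the lctl eviction commands for one dead client.

    Returns a dict keyed by the server role on which to run each command.
    """
    return {
        "oss": f"lctl set_param obdfilter.{fsname}-*.evict_client={client_uuid}",
        "mds": f"lctl set_param mdt.{fsname}-*.evict_client={client_uuid}",
    }


# UUID copied from the example above.
for role, cmd in evict_commands(
        "testfs", "dd599d28-0a85-a9e4-82cd-dc6357a42c77").items():
    print(f"{role}# {cmd}")
</programlisting>
      <para>How the commands are delivered to the servers, for example by a cluster resource manager or a node health monitor, is left to the site.</para>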
 731     </section>
 732   </section>
 733
 734 </chapter>
 735 <!--
 736   vim:expandtab:shiftwidth=2:tabstop=8:
 737   -->