FIX: xrefs and tidying

author Richard Henwood <rhenwood@whamcloud.com>

Wed, 18 May 2011 17:24:04 +0000 (12:24 -0500)

committer Richard Henwood <rhenwood@whamcloud.com>

Wed, 18 May 2011 17:24:04 +0000 (12:24 -0500)
diff --git a/LustreRecovery.xml b/LustreRecovery.xml

index b30a519..24d4d08 100644 (file)
--- a/LustreRecovery.xml
+++ b/LustreRecovery.xml
@@ -1,86 +1,59 @@
  <?xml version="1.0" encoding="UTF-8"?>
-<chapter version="5.0" xml:lang="en-US" xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink">
+<chapter version="5.0" xml:lang="en-US" xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" xml:id='lustrerecovery'>
    <info>
-    <title>Lustre Recovery</title>
+    <title  xml:id='lustrerecovery.title'>Lustre Recovery</title>
    </info>
    <para><anchor xml:id="dbdoclet.50438268_pgfId-1292351" xreflabel=""/>This chapter describes how recovery is implemented in Lustre and includes the following sections:</para>
    <itemizedlist><listitem>
-      <para><anchor xml:id="dbdoclet.50438268_pgfId-1287370" xreflabel=""/><link xl:href="LustreRecovery.html#50438268_58047">Recovery Overview</link></para>
+      <para><xref linkend="dbdoclet.50438268_58047"/></para>
      </listitem>
+
  <listitem>
-      <para> </para>
+      <para><xref linkend="dbdoclet.50438268_65824"/></para>
      </listitem>
+
  <listitem>
-      <para><anchor xml:id="dbdoclet.50438268_pgfId-1289682" xreflabel=""/><link xl:href="LustreRecovery.html#50438268_65824">Metadata Replay</link></para>
+      <para><xref linkend="dbdoclet.50438268_23736"/></para>
      </listitem>
+
  <listitem>
-      <para> </para>
+      <para><xref linkend="dbdoclet.50438268_80068"/></para>
      </listitem>
+
  <listitem>
-      <para><anchor xml:id="dbdoclet.50438268_pgfId-1291218" xreflabel=""/><link xl:href="LustreRecovery.html#50438268_23736">Reply Reconstruction</link></para>
-    </listitem>
-<listitem>
-      <para> </para>
-    </listitem>
-<listitem>
-      <para><anchor xml:id="dbdoclet.50438268_pgfId-1288133" xreflabel=""/><link xl:href="LustreRecovery.html#50438268_80068">Version-based Recovery</link></para>
-    </listitem>
-<listitem>
-      <para> </para>
-    </listitem>
-<listitem>
-      <para><anchor xml:id="dbdoclet.50438268_pgfId-1290063" xreflabel=""/><link xl:href="LustreRecovery.html#50438268_83826">Commit on Share</link></para>
-    </listitem>
-<listitem>
-      <para> </para>
+      <para><xref linkend="dbdoclet.50438268_83826"/></para>
      </listitem>
+
  </itemizedlist>
-   <informaltable frame="none">
-    <tgroup cols="1">
-      <colspec colname="c1" colwidth="100*"/>
-      <tbody>
-        <row>
-          <entry><para><emphasis role="bold">Note -</emphasis><anchor xml:id="dbdoclet.50438268_pgfId-1292744" xreflabel=""/>Usually the Lustre recovery process is transparent. For information about troubleshooting recovery when something goes wrong, see <link xl:href="TroubleShootingRecovery.html#50438225_33059">Chapter 27</link>: <link xl:href="LustreRecovery.html#50438268_93053">Lustre Recovery</link>.</para></entry>
-        </row>
-      </tbody>
-    </tgroup>
-  </informaltable>
-  <section remap="h2">
-    <title><anchor xml:id="dbdoclet.50438268_pgfId-1289094" xreflabel=""/></title>
-    <section remap="h2">
-      <title>30.1 <anchor xml:id="dbdoclet.50438268_58047" xreflabel=""/>Recovery Overview</title>
+
+<note><para>Usually the Lustre recovery process is transparent. For information about troubleshooting recovery when something goes wrong, see <xref linkend="dbdoclet.50438268_93053"/>.</para></note>
+
+    <section xml:id="dbdoclet.50438268_58047">
+      <title>30.1 Recovery Overview</title>
        <para><anchor xml:id="dbdoclet.50438268_pgfId-1291584" xreflabel=""/>Lustre&apos;s recovery feature is responsible for dealing with node or network failure and returning the cluster to a consistent, performant state. Because Lustre allows servers to perform asynchronous update operations to the on-disk file system (i.e., the server can reply without waiting for the update to synchronously commit to disk), the clients may have state in memory that is newer than what the server can recover from disk after a crash.</para>
        <para><anchor xml:id="dbdoclet.50438268_pgfId-1291585" xreflabel=""/>A handful of different types of failures can cause recovery to occur:</para>
        <itemizedlist><listitem>
            <para><anchor xml:id="dbdoclet.50438268_pgfId-1291586" xreflabel=""/> Client (compute node) failure</para>
          </listitem>
-<listitem>
-          <para> </para>
-        </listitem>
+
  <listitem>
            <para><anchor xml:id="dbdoclet.50438268_pgfId-1291587" xreflabel=""/> MDS failure (and failover)</para>
          </listitem>
-<listitem>
-          <para> </para>
-        </listitem>
+
  <listitem>
            <para><anchor xml:id="dbdoclet.50438268_pgfId-1291588" xreflabel=""/> OST failure (and failover)</para>
          </listitem>
-<listitem>
-          <para> </para>
-        </listitem>
+
  <listitem>
            <para><anchor xml:id="dbdoclet.50438268_pgfId-1291589" xreflabel=""/> Transient network partition</para>
          </listitem>
-<listitem>
-          <para> </para>
-        </listitem>
+
  </itemizedlist>
        <para><anchor xml:id="dbdoclet.50438268_pgfId-1291597" xreflabel=""/>Currently, all Lustre failure and recovery operations are based on the concept of connection failure; all imports or exports associated with a given connection are considered to fail if any of them fail.</para>
-      <para><anchor xml:id="dbdoclet.50438268_pgfId-1290652" xreflabel=""/>For information on Lustre recovery, see <link xl:href="LustreRecovery.html#50438268_65824">Metadata Replay</link>. For information on recovering from a corrupt file system, see <link xl:href="LustreRecovery.html#50438268_83826">Commit on Share</link>. For information on resolving orphaned objects, a common issue after recovery, see <link xl:href="TroubleShootingRecovery.html#50438225_13916">Working with Orphaned Objects</link>.</para>
+      <para><anchor xml:id="dbdoclet.50438268_pgfId-1290652" xreflabel=""/>For information on Lustre recovery, see <xref linkend="dbdoclet.50438268_65824"/>. For information on recovering from a corrupt file system, see <xref linkend="dbdoclet.50438268_83826"/>. For information on resolving orphaned objects, a common issue after recovery, see <xref linkend='troubleshootingrecovery'/> (Working with Orphaned Objects).</para>
        <section remap="h3">
          <title><anchor xml:id="dbdoclet.50438268_pgfId-1287395" xreflabel=""/>30.1.1 <anchor xml:id="dbdoclet.50438268_96839" xreflabel=""/>Client <anchor xml:id="dbdoclet.50438268_marker-1287394" xreflabel=""/>Failure</title>
-        <para><anchor xml:id="dbdoclet.50438268_pgfId-1287396" xreflabel=""/>Recovery from client failure in Lustre is based on lock revocation and other resources, so surviving clients can continue their work uninterrupted. If a client fails to timely respond to a blocking lock callback from the Distributed Lock Manager (DLM) or fails to communicate with the server in a long period of time (i.e., no pings), the client is forcibly removed from the cluster (evicted). This enables other clients to acquire locks blocked by the dead client&apos;s locks, and also frees resources (file handles, export data) associated with that client. Note that this scenario can be caused by a network partition, as well as an actual client node system failure. <link xl:href="LustreRecovery.html#50438268_96876">Network Partition</link> describes this case in more detail.</para>
+        <para><anchor xml:id="dbdoclet.50438268_pgfId-1287396" xreflabel=""/>Recovery from client failure in Lustre is based on lock revocation and other resources, so surviving clients can continue their work uninterrupted. If a client fails to timely respond to a blocking lock callback from the Distributed Lock Manager (DLM) or fails to communicate with the server in a long period of time (i.e., no pings), the client is forcibly removed from the cluster (evicted). This enables other clients to acquire locks blocked by the dead client&apos;s locks, and also frees resources (file handles, export data) associated with that client. Note that this scenario can be caused by a network partition, as well as an actual client node system failure. <xref linkend="dbdoclet.50438268_96876"/> describes this case in more detail.</para>
        </section>
        <section remap="h3">
          <title><anchor xml:id="dbdoclet.50438268_pgfId-1290714" xreflabel=""/>30.1.2 <anchor xml:id="dbdoclet.50438268_43796" xreflabel=""/>Client <anchor xml:id="dbdoclet.50438268_marker-1292164" xreflabel=""/>Eviction</title>
@@ -88,65 +61,44 @@
          <para><anchor xml:id="dbdoclet.50438268_pgfId-1291610" xreflabel=""/>Reasons why a client might be evicted:</para>
          <itemizedlist><listitem>
              <para><anchor xml:id="dbdoclet.50438268_pgfId-1291611" xreflabel=""/> Failure to respond to a server request in a timely manner</para>
-          </listitem>
-<listitem>
              <itemizedlist><listitem>
                  <para><anchor xml:id="dbdoclet.50438268_pgfId-1291612" xreflabel=""/> Blocking lock callback (i.e., client holds lock that another client/server wants)</para>
                </listitem>
-<listitem>
-                <para> </para>
-              </listitem>
+
  <listitem>
                  <para><anchor xml:id="dbdoclet.50438268_pgfId-1291613" xreflabel=""/> Lock completion callback (i.e., client is granted lock previously held by another client)</para>
                </listitem>
-<listitem>
-                <para> </para>
-              </listitem>
+
  <listitem>
                  <para><anchor xml:id="dbdoclet.50438268_pgfId-1291614" xreflabel=""/> Lock glimpse callback (i.e., client is asked for size of object by another client)</para>
                </listitem>
-<listitem>
-                <para> </para>
-              </listitem>
+
  <listitem>
                  <para><anchor xml:id="dbdoclet.50438268_pgfId-1291615" xreflabel=""/> Server shutdown notification (with simplified interoperability)</para>
                </listitem>
-<listitem>
-                <para> </para>
-              </listitem>
+
  </itemizedlist>
            </listitem>
  <listitem>
              <para><anchor xml:id="dbdoclet.50438268_pgfId-1291616" xreflabel=""/> Failure to ping the server in a timely manner, unless the server is receiving no RPC traffic at all (which may indicate a network partition).</para>
            </listitem>
-<listitem>
-            <para> </para>
-          </listitem>
+
  </itemizedlist>
        </section>
        <section remap="h3">
          <title><anchor xml:id="dbdoclet.50438268_pgfId-1287398" xreflabel=""/>30.1.3 <anchor xml:id="dbdoclet.50438268_37508" xreflabel=""/>MDS Failure <anchor xml:id="dbdoclet.50438268_marker-1287397" xreflabel=""/>(Failover)</title>
          <para><anchor xml:id="dbdoclet.50438268_pgfId-1291624" xreflabel=""/>Highly-available (HA) Lustre operation requires that the metadata server have a peer configured for failover, including the use of a shared storage device for the MDT backing file system. The actual mechanism for detecting peer failure, power off (STONITH) of the failed peer (to prevent it from continuing to modify the shared disk), and takeover of the Lustre MDS service on the backup node depends on external HA software such as Heartbeat. It is also possible to have MDS recovery with a single MDS node. In this case, recovery will take as long as is needed for the single MDS to be restarted.</para>
          <para><anchor xml:id="dbdoclet.50438268_pgfId-1291625" xreflabel=""/>When clients detect an MDS failure (either by timeouts of in-flight requests or idle-time ping messages), they connect to the new backup MDS and use the Metadata Replay protocol. Metadata Replay is responsible for ensuring that the backup MDS re-acquires state resulting from transactions whose effects were made visible to clients, but which were not committed to the disk.</para>
-        <para><anchor xml:id="dbdoclet.50438268_pgfId-1290890" xreflabel=""/>The reconnection to a new (or restarted) MDS is managed by the file system configuration loaded by the client when the file system is first mounted. If a failover MDS has been configured (using the --failnode= option to mkfs.lustre or tunefs.lustre), the client tries to reconnect to both the primary and backup MDS until one of them responds that the failed MDT is again available. At that point, the client begins recovery. For more information, see <link xl:href="LustreRecovery.html#50438268_65824">Metadata Replay</link>.</para>
+        <para><anchor xml:id="dbdoclet.50438268_pgfId-1290890" xreflabel=""/>The reconnection to a new (or restarted) MDS is managed by the file system configuration loaded by the client when the file system is first mounted. If a failover MDS has been configured (using the --failnode= option to mkfs.lustre or tunefs.lustre), the client tries to reconnect to both the primary and backup MDS until one of them responds that the failed MDT is again available. At that point, the client begins recovery. For more information, see <xref linkend="dbdoclet.50438268_65824"/>.</para>
          <para><anchor xml:id="dbdoclet.50438268_pgfId-1290891" xreflabel=""/>Transaction numbers are used to ensure that operations are replayed in the order they were originally performed, so that they are guaranteed to succeed and present the same filesystem state as before the failure. In addition, clients inform the new server of their existing lock state (including locks that have not yet been granted). All metadata and lock replay must complete before new, non-recovery operations are permitted. In addition, only clients that were connected at the time of MDS failure are permitted to reconnect during the recovery window, to avoid the introduction of state changes that might conflict with what is being replayed by previously-connected clients.</para>
        </section>
        <section remap="h3">
          <title><anchor xml:id="dbdoclet.50438268_pgfId-1289241" xreflabel=""/>30.1.4 <anchor xml:id="dbdoclet.50438268_28881" xreflabel=""/>OST <anchor xml:id="dbdoclet.50438268_marker-1289240" xreflabel=""/>Failure (Failover)</title>
          <para><anchor xml:id="dbdoclet.50438268_pgfId-1291633" xreflabel=""/>When an OST fails or has communication problems with the client, the default action is that the corresponding OSC enters recovery, and I/O requests going to that OST are blocked waiting for OST recovery or failover. It is possible to administratively mark the OSC as <emphasis>inactive</emphasis> on the client, in which case file operations that involve the failed OST will return an IO error (-EIO). Otherwise, the application waits until the OST has recovered or the client process is interrupted (e.g., with <emphasis>CTRL-C</emphasis>).</para>
-        <para><anchor xml:id="dbdoclet.50438268_pgfId-1290917" xreflabel=""/>The MDS (via the LOV) detects that an OST is unavailable and skips it when assigning objects to new files. When the OST is restarted or re-establishes communication with the MDS, the MDS and OST automatically perform orphan recovery to destroy any objects that belong to files that were deleted while the OST was unavailable. For more information, see <link xl:href="TroubleShootingRecovery.html#50438225_13916">Working with Orphaned Objects</link>.</para>
+        <para><anchor xml:id="dbdoclet.50438268_pgfId-1290917" xreflabel=""/>The MDS (via the LOV) detects that an OST is unavailable and skips it when assigning objects to new files. When the OST is restarted or re-establishes communication with the MDS, the MDS and OST automatically perform orphan recovery to destroy any objects that belong to files that were deleted while the OST was unavailable. For more information, see <xref linkend='troubleshootingrecovery'/> (Working with Orphaned Objects).</para>
          <para><anchor xml:id="dbdoclet.50438268_pgfId-1290921" xreflabel=""/>While the OSC to OST operation recovery protocol is the same as that between the MDC and MDT using the Metadata Replay protocol, typically the OST commits bulk write operations to disk synchronously and each reply indicates that the request is already committed and the data does not need to be saved for recovery. In some cases, the OST replies to the client before the operation is committed to disk (e.g. truncate, destroy, setattr, and I/O operations in very new versions of Lustre), and normal replay and resend handling is done, including resending of the bulk writes. In this case, the client keeps a copy of the data available in memory until the server indicates that the write has committed to disk.</para>
          <para><anchor xml:id="dbdoclet.50438268_pgfId-1290922" xreflabel=""/>To force an OST recovery, unmount the OST and then mount it again. If the OST was connected to clients before it failed, then a recovery process starts after the remount, enabling clients to reconnect to the OST and replay transactions in their queue. When the OST is in recovery mode, all new client connections are refused until the recovery finishes. The recovery is complete when either all previously-connected clients reconnect and their transactions are replayed or a client connection attempt times out. If a connection attempt times out, then all clients waiting to reconnect (and their transactions) are lost.</para>
-        <informaltable frame="none">
-          <tgroup cols="1">
-            <colspec colname="c1" colwidth="100*"/>
-            <tbody>
-              <row>
-                <entry><para><emphasis role="bold">Note -</emphasis><anchor xml:id="dbdoclet.50438268_pgfId-1289423" xreflabel=""/>If you know an OST will not recover a previously-connected client (if, for example, the client has crashed), you can manually abort the recovery using this command:</para><para>lctl --device &lt;OST device number&gt; abort_recovery To determine an OST’s device number and device name, run the lctl dl command. Sample lctl dl command output is shown below:</para><para>7 UP obdfilter ddn_data-OST0009 ddn_data-OST0009_UUID 1159 In this example, 7 is the OST device number. The device name is ddn_data-OST0009. In most instances, the device name can be used in place of the device number.</para></entry>
-              </row>
-            </tbody>
-          </tgroup>
-        </informaltable>
+                <note><para>If you know an OST will not recover a previously-connected client (if, for example, the client has crashed), you can manually abort the recovery using this command:</para><para>lctl --device &lt;OST device number&gt; abort_recovery To determine an OST's device number and device name, run the lctl dl command. Sample lctl dl command output is shown below:</para><para>7 UP obdfilter ddn_data-OST0009 ddn_data-OST0009_UUID 1159 In this example, 7 is the OST device number. The device name is ddn_data-OST0009. In most instances, the device name can be used in place of the device number.</para></note>
        </section>
        <section remap="h3">
          <title><anchor xml:id="dbdoclet.50438268_pgfId-1289389" xreflabel=""/>30.1.5 <anchor xml:id="dbdoclet.50438268_96876" xreflabel=""/>Network <anchor xml:id="dbdoclet.50438268_marker-1289388" xreflabel=""/>Partition</title>
@@ -158,33 +110,25 @@
          <para><anchor xml:id="dbdoclet.50438268_pgfId-1290945" xreflabel=""/>In the case of failed recovery, a client is evicted by the server and must reconnect after having flushed its saved state related to that server, as described in <link xl:href="LustreRecovery.html#50438268_43796">Client Eviction</link>, above. Failed recovery might occur for a number of reasons, including:</para>
          <itemizedlist><listitem>
              <para><anchor xml:id="dbdoclet.50438268_pgfId-1290949" xreflabel=""/> Failure of recovery</para>
-          </listitem>
-<listitem>
              <itemizedlist><listitem>
                  <para><anchor xml:id="dbdoclet.50438268_pgfId-1290951" xreflabel=""/> Recovery fails if the operations of one client directly depend on the operations of another client that failed to participate in recovery. Otherwise, Version Based Recovery (VBR) allows recovery to proceed for all of the connected clients, and only missing clients are evicted.</para>
                </listitem>
-<listitem>
-                <para> </para>
-              </listitem>
+
  <listitem>
                  <para><anchor xml:id="dbdoclet.50438268_pgfId-1290965" xreflabel=""/> Manual abort of recovery</para>
                </listitem>
-<listitem>
-                <para> </para>
-              </listitem>
+
  </itemizedlist>
            </listitem>
  <listitem>
              <para><anchor xml:id="dbdoclet.50438268_pgfId-1290953" xreflabel=""/> Manual eviction by the administrator</para>
            </listitem>
-<listitem>
-            <para> </para>
-          </listitem>
+
  </itemizedlist>
        </section>
      </section>
-    <section remap="h2">
-      <title>30.2 <anchor xml:id="dbdoclet.50438268_65824" xreflabel=""/>Metadata <anchor xml:id="dbdoclet.50438268_marker-1292175" xreflabel=""/>Replay</title>
+    <section xml:id="dbdoclet.50438268_65824">
+      <title>30.2 Metadata <anchor xml:id="dbdoclet.50438268_marker-1292175" xreflabel=""/>Replay</title>
        <para><anchor xml:id="dbdoclet.50438268_pgfId-1287399" xreflabel=""/>Highly available Lustre operation requires that the MDS have a peer configured for failover, including the use of a shared storage device for the MDS backing file system. When a client detects an MDS failure, it connects to the new MDS and uses the metadata replay protocol to replay its requests.</para>
        <para><anchor xml:id="dbdoclet.50438268_pgfId-1288905" xreflabel=""/>Metadata replay ensures that the failover MDS re-accumulates state resulting from transactions whose effects were made visible to clients, but which were not committed to the disk.</para>
        <section remap="h3">
@@ -235,8 +179,8 @@
          <para><anchor xml:id="dbdoclet.50438268_pgfId-1288653" xreflabel=""/>Once all of the previously-shared state has been recovered on the server (the target file system is up-to-date with client cache and the server has recreated locks representing the locks held by the client), the client can resend any requests that did not receive an earlier reply. This processing is done like normal request processing, and, in some cases, the server may do reply reconstruction.</para>
        </section>
      </section>
-    <section remap="h2">
-      <title>30.3 <anchor xml:id="dbdoclet.50438268_23736" xreflabel=""/>Reply <anchor xml:id="dbdoclet.50438268_marker-1292176" xreflabel=""/>Reconstruction</title>
+    <section xml:id="dbdoclet.50438268_23736">
+      <title>30.3 Reply <anchor xml:id="dbdoclet.50438268_marker-1292176" xreflabel=""/>Reconstruction</title>
        <para><anchor xml:id="dbdoclet.50438268_pgfId-1289740" xreflabel=""/>When a reply is dropped, the MDS needs to be able to reconstruct the reply when the original request is re-sent. This must be done without repeating any non-idempotent operations, while preserving the integrity of the locking system. In the event of MDS failover, the information used to reconstruct the reply must be serialized on the disk in transactions that are joined or nested with those operating on the disk.</para>
        <section remap="h3">
          <title><anchor xml:id="dbdoclet.50438268_pgfId-1289741" xreflabel=""/>30.3.1 Required State</title>
@@ -244,21 +188,15 @@
          <itemizedlist><listitem>
              <para><anchor xml:id="dbdoclet.50438268_pgfId-1289808" xreflabel=""/> XID of the request</para>
            </listitem>
-<listitem>
-            <para> </para>
-          </listitem>
+
  <listitem>
              <para><anchor xml:id="dbdoclet.50438268_pgfId-1289814" xreflabel=""/> Resulting transno (if any)</para>
            </listitem>
-<listitem>
-            <para> </para>
-          </listitem>
+
  <listitem>
              <para><anchor xml:id="dbdoclet.50438268_pgfId-1289821" xreflabel=""/> Result code (req-&gt;rq_status)</para>
            </listitem>
-<listitem>
-            <para> </para>
-          </listitem>
+
  </itemizedlist>
          <para><anchor xml:id="dbdoclet.50438268_pgfId-1289749" xreflabel=""/>For open requests, the &quot;disposition&quot; of the open must also be stored.</para>
        </section>
@@ -268,21 +206,15 @@
          <itemizedlist><listitem>
              <para><anchor xml:id="dbdoclet.50438268_pgfId-1289840" xreflabel=""/> File handle</para>
            </listitem>
-<listitem>
-            <para> </para>
-          </listitem>
+
  <listitem>
              <para><anchor xml:id="dbdoclet.50438268_pgfId-1289874" xreflabel=""/> Lock handle</para>
            </listitem>
-<listitem>
-            <para> </para>
-          </listitem>
+
  <listitem>
              <para><anchor xml:id="dbdoclet.50438268_pgfId-1289850" xreflabel=""/> mds_body with information about the file created (for O_CREAT)</para>
            </listitem>
-<listitem>
-            <para> </para>
-          </listitem>
+
  </itemizedlist>
          <para><anchor xml:id="dbdoclet.50438268_pgfId-1289738" xreflabel=""/>The disposition, status and request data (re-sent intact by the client) are sufficient to determine which type of lock handle was granted, whether an open file handle was created, and which resource should be described in the mds_body.</para>
          <section remap="h5">
@@ -299,66 +231,49 @@
          </section>
        </section>
      </section>
-    <section remap="h2">
-      <title>30.4 <anchor xml:id="dbdoclet.50438268_80068" xreflabel=""/>Version-based <anchor xml:id="dbdoclet.50438268_marker-1288580" xreflabel=""/>Recovery</title>
+    <section xml:id="dbdoclet.50438268_80068">
+      <title>30.4 Version-based <anchor xml:id="dbdoclet.50438268_marker-1288580" xreflabel=""/>Recovery</title>
        <para><anchor xml:id="dbdoclet.50438268_pgfId-1287888" xreflabel=""/>The Version-based Recovery (VBR) feature improves Lustre reliability in cases where client requests (RPCs) fail to replay during recovery
            <footnote><para><anchor xml:id="dbdoclet.50438268_pgfId-1288438" xreflabel=""/>There are two scenarios under which client RPCs are not replayed:   (1) Non-functioning or isolated clients do not reconnect, and they cannot replay their RPCs, causing a gap in the replay sequence. These clients get errors and are evicted.   (2) Functioning clients connect, but they cannot replay some or all of their RPCs that occurred after the gap caused by the non-functioning/isolated clients. These clients get errors (caused by the failed clients). With VBR, these requests have a better chance to replay because the &quot;gaps&quot; are only related to specific files that the missing client(s) changed.</para></footnote>.
            </para>
-      <para><anchor xml:id="dbdoclet.50438268_pgfId-1287894" xreflabel=""/>In pre-VBR versions of Lustre, if the MGS or an OST went down and then recovered, a recovery process was triggered in which clients attempted to replay their requests. Clients were only allowed to replay RPCs in serial order. If a particular client could not replay its requests, then those requests were lost as well as the requests of clients later in the sequence. The &apos;&apos;downstream&apos;&apos; clients never got to replay their requests because of the wait on the earlier client’s RPCs. Eventually, the recovery period would time out (so the component could accept new requests), leaving some number of clients evicted and their requests and data lost.</para>
+      <para><anchor xml:id="dbdoclet.50438268_pgfId-1287894" xreflabel=""/>In pre-VBR versions of Lustre, if the MGS or an OST went down and then recovered, a recovery process was triggered in which clients attempted to replay their requests. Clients were only allowed to replay RPCs in serial order. If a particular client could not replay its requests, then those requests were lost as well as the requests of clients later in the sequence. The &apos;&apos;downstream&apos;&apos; clients never got to replay their requests because of the wait on the earlier client's RPCs. Eventually, the recovery period would time out (so the component could accept new requests), leaving some number of clients evicted and their requests and data lost.</para>
        <para><anchor xml:id="dbdoclet.50438268_pgfId-1287896" xreflabel=""/>With VBR, the recovery mechanism does not result in the loss of clients or their data, because changes in inode versions are tracked, and more clients are able to reintegrate into the cluster. With VBR, inode tracking looks like this:</para>
        <itemizedlist><listitem>
            <para><anchor xml:id="dbdoclet.50438268_pgfId-1288169" xreflabel=""/> Each inode<footnote><para><anchor xml:id="dbdoclet.50438268_pgfId-1288489" xreflabel=""/>Usually, there are two inodes, a parent and a child.</para></footnote> stores a version, that is, the number of the last transaction (transno) in which the inode was changed.</para>
          </listitem>
+
  <listitem>
-          <para> </para>
-        </listitem>
-<listitem>
-          <para><anchor xml:id="dbdoclet.50438268_pgfId-1288212" xreflabel=""/> When an inode is about to be changed, a pre-operation version of the inode is saved in the client’s data.</para>
-        </listitem>
-<listitem>
-          <para> </para>
+          <para><anchor xml:id="dbdoclet.50438268_pgfId-1288212" xreflabel=""/> When an inode is about to be changed, a pre-operation version of the inode is saved in the client's data.</para>
          </listitem>
+
  <listitem>
            <para><anchor xml:id="dbdoclet.50438268_pgfId-1288241" xreflabel=""/> The client keeps the pre-operation inode version and the post-operation version (transaction number) for replay, and sends them in the event of a server failure.</para>
          </listitem>
-<listitem>
-          <para> </para>
-        </listitem>
+
  <listitem>
            <para><anchor xml:id="dbdoclet.50438268_pgfId-1288505" xreflabel=""/> If the pre-operation version matches, then the request is replayed. The post-operation version is assigned on all inodes modified in the request.</para>
          </listitem>
-<listitem>
-          <para> </para>
-        </listitem>
+
  </itemizedlist>
-      <informaltable frame="none">
-        <tgroup cols="1">
-          <colspec colname="c1" colwidth="100*"/>
-          <tbody>
-            <row>
-              <entry><para><emphasis role="bold">Note -</emphasis><anchor xml:id="dbdoclet.50438268_pgfId-1288473" xreflabel=""/>An RPC can contain up to four pre-operation versions, because several inodes can be involved in an operation. In the case of a &apos;&apos;rename&apos;&apos; operation, four different inodes can be modified.</para></entry>
-            </row>
-          </tbody>
-        </tgroup>
-      </informaltable>
+              <note><para>An RPC can contain up to four pre-operation versions, because several inodes can be involved in an operation. In the case of a &apos;&apos;rename&apos;&apos; operation, four different inodes can be modified.</para></note>
        <para><anchor xml:id="dbdoclet.50438268_pgfId-1287777" xreflabel=""/>During normal operation, the server:</para>
        <itemizedlist><listitem>
            <para><anchor xml:id="dbdoclet.50438268_pgfId-1287779" xreflabel=""/> Updates the versions of all inodes involved in a given operation</para>
          </listitem>
-<listitem>
-          <para> </para>
-        </listitem>
+
  <listitem>
            <para><anchor xml:id="dbdoclet.50438268_pgfId-1287952" xreflabel=""/> Returns the old and new inode versions to the client with the reply</para>
          </listitem>
-<listitem>
-          <para> </para>
-        </listitem>
+
  </itemizedlist>
        <para><anchor xml:id="dbdoclet.50438268_pgfId-1287979" xreflabel=""/>When the recovery mechanism is underway, VBR follows these steps:</para>
-      <para><anchor xml:id="dbdoclet.50438268_pgfId-1287980" xreflabel=""/>1. VBR only allows clients to replay transactions if the affected inodes have the same version as during the original execution of the transactions, even if there is gap in transactions due to a missed client.</para>
-      <para><anchor xml:id="dbdoclet.50438268_pgfId-1287992" xreflabel=""/> 2. The server attempts to execute every transaction that the client offers, even if it encounters a re-integration failure.</para>
-      <para><anchor xml:id="dbdoclet.50438268_pgfId-1288004" xreflabel=""/> 3. When the replay is complete, the client and server check if a replay failed on any transaction because of inode version mismatch. If the versions match, the client gets a successful re-integration message. If the versions do not match, then the client is evicted.</para>
+      <orderedlist><listitem>
+      <para><anchor xml:id="dbdoclet.50438268_pgfId-1287980" xreflabel=""/>VBR only allows clients to replay transactions if the affected inodes have the same version as during the original execution of the transactions, even if there is a gap in transactions due to a missed client.</para>
+  </listitem><listitem>
+      <para><anchor xml:id="dbdoclet.50438268_pgfId-1287992" xreflabel=""/>The server attempts to execute every transaction that the client offers, even if it encounters a re-integration failure.</para>
+  </listitem><listitem>
+      <para><anchor xml:id="dbdoclet.50438268_pgfId-1288004" xreflabel=""/>When the replay is complete, the client and server check whether a replay failed on any transaction because of an inode version mismatch. If the versions match, the client gets a successful re-integration message. If the versions do not match, then the client is evicted.</para>
+  </listitem></orderedlist>
        <para><anchor xml:id="dbdoclet.50438268_pgfId-1288023" xreflabel=""/>VBR recovery is fully transparent to users. It may lead to slightly longer recovery times if the cluster loses several clients during server recovery.</para>
        <section remap="h3">
          <title><anchor xml:id="dbdoclet.50438268_pgfId-1287803" xreflabel=""/>30.4.1 <anchor xml:id="dbdoclet.50438268_marker-1288583" xreflabel=""/>VBR Messages</title>
@@ -372,22 +287,13 @@
        </section>
        <section remap="h3">
          <title><anchor xml:id="dbdoclet.50438268_pgfId-1287839" xreflabel=""/>30.4.2 Tips for <anchor xml:id="dbdoclet.50438268_marker-1288584" xreflabel=""/>Using VBR</title>
-        <para><anchor xml:id="dbdoclet.50438268_pgfId-1287767" xreflabel=""/>VBR will be successful for clients which do not share data with other client. Therefore, the strategy for reliable use of VBR is to store a client’s data in its own directory, where possible. VBR can recover these clients, even if other clients are lost.</para>
+        <para><anchor xml:id="dbdoclet.50438268_pgfId-1287767" xreflabel=""/>VBR will be successful for clients which do not share data with other clients. Therefore, the strategy for reliable use of VBR is to store a client&apos;s data in its own directory, where possible. VBR can recover these clients, even if other clients are lost.</para>
        </section>
      </section>
-    <section remap="h2">
-      <title>30.5 <anchor xml:id="dbdoclet.50438268_83826" xreflabel=""/>Commit on <anchor xml:id="dbdoclet.50438268_marker-1292182" xreflabel=""/>Share</title>
+    <section xml:id="dbdoclet.50438268_83826">
+      <title>30.5 Commit on <anchor xml:id="dbdoclet.50438268_marker-1292182" xreflabel=""/>Share</title>
        <para><anchor xml:id="dbdoclet.50438268_pgfId-1292074" xreflabel=""/>The commit-on-share (COS) feature makes Lustre recovery more reliable by preventing missing clients from causing cascading evictions of other clients. With COS enabled, if some Lustre clients miss the recovery window after a reboot or a server failure, the remaining clients are not evicted.</para>
-      <informaltable frame="none">
-        <tgroup cols="1">
-          <colspec colname="c1" colwidth="100*"/>
-          <tbody>
-            <row>
-              <entry><para><emphasis role="bold">Note -</emphasis><anchor xml:id="dbdoclet.50438268_pgfId-1292117" xreflabel=""/>The commit-on-share feature is enabled, by default.</para></entry>
-            </row>
-          </tbody>
-        </tgroup>
-      </informaltable>
+              <note><para>The commit-on-share feature is enabled by default.</para></note>
        <section remap="h3">
          <title><anchor xml:id="dbdoclet.50438268_pgfId-1292075" xreflabel=""/>30.5.1 Working with Commit on Share</title>
          <para><anchor xml:id="dbdoclet.50438268_pgfId-1292076" xreflabel=""/>To illustrate how COS works, let&apos;s first look at the old recovery scenario. After a service restart, the MDS would boot and enter recovery mode. Clients began reconnecting and replaying their uncommitted transactions. Clients could replay transactions independently as long as their transactions did not depend on each other (one client&apos;s transactions did not depend on a different client&apos;s transactions). The MDS is able to determine whether one transaction is dependent on another transaction via the <link xl:href="LustreRecovery.html#50438268_80068">Version-based Recovery</link> feature.</para>
@@ -403,17 +309,7 @@
          <para><anchor xml:id="dbdoclet.50438268_pgfId-1292103" xreflabel=""/>To disable or enable COS when the file system is running, use:</para>
          <screen><anchor xml:id="dbdoclet.50438268_pgfId-1292104" xreflabel=""/>lctl set_param mdt.*.commit_on_sharing=0/1
  </screen>
-        <informaltable frame="none">
-          <tgroup cols="1">
-            <colspec colname="c1" colwidth="100*"/>
-            <tbody>
-              <row>
-                <entry><para><emphasis role="bold">Note -</emphasis><anchor xml:id="dbdoclet.50438268_pgfId-1292105" xreflabel=""/>Enabling COS may cause the MDS to do a large number of synchronous disk operations, hurting performance. Placing the ldiskfs journal on a low-latency external device may improve file system performance.</para></entry>
-              </row>
-            </tbody>
-          </tgroup>
-        </informaltable>
+                <note><para>Enabling COS may cause the MDS to do a large number of synchronous disk operations, hurting performance. Placing the ldiskfs journal on a low-latency external device may improve file system performance.</para></note>
        </section>
      </section>
-  </section>
  </chapter>