LUDOC-154 striping: Updated Ch 18. Also fixes LUDOC-132.

author Linda Bebernes <linda.bebernes@intel.com>

Mon, 5 Aug 2013 20:16:08 +0000 (13:16 -0700)

committer Richard Henwood <richard.henwood@intel.com>

Tue, 6 Aug 2013 18:31:01 +0000 (18:31 +0000)
diff --git a/LustreProc.xml b/LustreProc.xml
index 2baf720..52cfb53 100644
--- a/LustreProc.xml
+++ b/LustreProc.xml
@@ -1,11 +1,8 @@
  <?xml version='1.0' encoding='UTF-8'?>
-<!-- This document was created with Syntext Serna Free. -->
-<chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0"
-  xml:lang="en-US" xml:id="lustreproc">
+<!-- This document was created with Syntext Serna Free. --><chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="lustreproc">
    <title xml:id="lustreproc.title">LustreProc</title>
-  <para>The <literal>/proc</literal> file system acts as an interface to internal data structures in
-    the kernel. This chapter describes entries in <literal>/proc</literal> that are useful for
-    tuning and monitoring aspects of a Lustre file system. It includes these sections:</para>
+  <para>The <literal>/proc</literal> file system acts as an interface to internal data structures in the kernel. The <literal>/proc</literal> variables can be used to control aspects of Lustre performance and provide information.</para>
+  <para>This chapter describes Lustre /proc entries and includes the following sections:</para>
    <itemizedlist>
      <listitem>
        <para><xref linkend="dbdoclet.50438271_90999"/></para>
@@ -15,48 +12,33 @@
      </listitem>
      <listitem>
        <para><xref linkend="dbdoclet.50438271_83523"/></para>
-      <para>The <literal>/proc</literal> directory provides a file-system like interface to internal
-        data structures in the kernel. These data structures include settings and metrics for
-        components such as memory, networking, file systems, and kernel housekeeping routines, which
-        are available throughout the hierarchical file layout in <literal>/proc.</literal>
-        Typically, metrics are accessed by reading from <literal>/proc</literal> files and settings
-        are changed by writing to <literal>/proc</literal> files. </para>
-      <para>The <literal>/proc</literal> directory contains files that allow an operator to
-        interface with the Lustre file system to tune and monitor many aspects of system and
-        application performance.</para>
      </listitem>
    </itemizedlist>
    <section xml:id="dbdoclet.50438271_90999">
-    <title><indexterm>
-        <primary>proc</primary>
-      </indexterm> Lustre Entries in /proc</title>
+    <title><indexterm><primary>proc</primary></indexterm>Proc Entries for Lustre</title>
      <para>This section describes <literal>/proc</literal> entries for Lustre.</para>
      <section remap="h3">
        <title>Locating Lustre File Systems and Servers</title>
-      <para>Use the <literal>/proc</literal> files on the MGS to locate the following:</para>
+      <para>Use the proc files on the MGS to locate the following:</para>
        <itemizedlist>
          <listitem>
            <para> All known file systems</para>
            <screen>mgs# cat /proc/fs/lustre/mgs/MGS/filesystems
-testfs
+spfs
  lustre</screen>
          </listitem>
        </itemizedlist>
        <itemizedlist>
          <listitem>
-          <para> The names of the servers in a file system (for a file system that has at least one
-            server running)</para>
-          <screen>mgs# cat /proc/fs/lustre/mgs/MGS/live/testfs
-fsname: testfs
+          <para> The server names participating in a file system (for each file system that has at least one server running)</para>
+          <screen>mgs# cat /proc/fs/lustre/mgs/MGS/live/spfs
+fsname: spfs
  flags: 0x0         gen: 7
-testfs-MDT0000
-testfs-OST0000</screen>
+spfs-MDT0000
+spfs-OST0000</screen>
          </listitem>
        </itemizedlist>
-      <para>All servers are named according to the convention
-            <literal><replaceable>fsname</replaceable>-<replaceable>MDT|OSTnumber</replaceable></literal>.
-        Server names for live servers are listed under
-        <literal>/proc/fs/lustre/devices</literal>:</para>
+      <para>All servers are named according to this convention: <literal><replaceable>fsname</replaceable>-<replaceable>MDT|OSTnumber</replaceable></literal>. This can be shown for live servers under <literal>/proc/fs/lustre/devices</literal>:</para>
        <screen>mds# cat /proc/fs/lustre/devices 
  0 UP mgs MGS MGS 11
  1 UP mgc MGC192.168.10.34@tcp 1f45bb57-d9be-2ddb-c0b0-5431a49226705
@@ -69,126 +51,57 @@ testfs-OST0000</screen>
  8 UP mdc lustre-MDT0000-mdc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa05
  9 UP osc lustre-OST0000-osc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa05
  10 UP osc lustre-OST0001-osc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa05</screen>
-      <para>A server name can also be displayed by viewing the device label at any time.</para>
+      <para>Or from the device label at any time:</para>
        <screen>mds# e2label /dev/sda
  lustre-MDT0000</screen>
      </section>
      <section remap="h3">
-      <title><indexterm>
-          <primary>proc</primary>
-          <secondary>timeouts</secondary>
-        </indexterm>Timeouts in a Lustre File System</title>
-      <para>Two types of timeouts are used in a Lustre file system.</para>
-      <itemizedlist>
-        <listitem>
-          <para><emphasis role="italic">LND timeouts</emphasis> - LND timeouts ensure that
-            point-to-point communications complete in a finite time in the presence of failures.
-            These timeouts are logged with the <literal>S_LND</literal> flag set. They are not
-            printed as console messages, so you should check the Lustre log for
-              <literal>D_NETERROR</literal> messages or enable printing of
-              <literal>D_NETERROR</literal> messages to the console (<literal>lctl set_param
-              printk=+neterror</literal>).</para>
-          <para>Congested routers can be a source of spurious LND timeouts. To avoid this situation,
-            increase the number of LNET router buffers to reduce back-pressure and/or increase LND
-            timeouts on all nodes on all connected networks. Also consider increasing the total
-            number of LNET router nodes in the system so that the aggregate router bandwidth matches
-            the aggregate server bandwidth.</para>
-        </listitem>
-      </itemizedlist>
+      <title><indexterm><primary>proc</primary><secondary>timeouts</secondary></indexterm>Lustre Timeouts</title>
+      <para>Lustre uses two types of timeouts.</para>
        <itemizedlist>
          <listitem>
-          <para><emphasis role="italic">Lustre timeouts </emphasis>- Lustre timeouts ensure that
-            Lustre RPCs complete in a finite time in the presence of failures. These timeouts are
-            always printed as console messages. If Lustre timeouts are not accompanied by LNET
-            timeouts, then increase the Lustre timeout on both servers and clients.</para>
+          <para>LND timeouts that ensure point-to-point communications complete in finite time in the presence of failures. These timeouts are logged with the <literal>S_LND</literal> flag set. They may <emphasis>not</emphasis> be printed as console messages, so you should check the Lustre log for <literal>D_NETERROR</literal> messages, or enable printing of <literal>D_NETERROR</literal> messages to the console (<literal>lctl set_param printk=+neterror</literal>).</para>
          </listitem>
        </itemizedlist>
-      <para>Specific Lustre timeouts include:</para>
+      <para>Congested routers can be a source of spurious LND timeouts. To avoid this, increase the number of LNET router buffers to reduce back-pressure and/or increase LND timeouts on all nodes on all connected networks. You should also consider increasing the total number of LNET router nodes in the system so that the aggregate router bandwidth matches the aggregate server bandwidth.</para>
        <itemizedlist>
          <listitem>
-          <para><literal>/proc/sys/lustre/timeout</literal> - The time period that a client waits
-            for a server to complete an RPC (default is 100s). Servers wait half of this time for a
-            normal client RPC to complete and a quarter of this time for a single bulk request (read
-            or write of up to 4 MB) to complete. The client pings recoverable targets (MDS and OSTs)
-            at one quarter of the timeout, and the server waits one and a half times the timeout
-            before evicting a client for being &quot;stale.&quot;</para>
-          <note>
-            <para>A Lustre client sends periodic &apos;ping&apos; messages to servers with which it
-              has had no communication for a specified period of time. Any network activity between
-              a client and a server in the file system also serves as a ping.</para>
-          </note>
-        </listitem>
-        <listitem>
-          <para><literal>/proc/sys/lustre/ldlm_timeout</literal> - The time period for which a
-            server will wait for a client to reply to an initial AST (lock cancellation request),
-            where the default is 20s for an OST and 6s for an MDS. If the client replies to the AST,
-            the server will give it a normal timeout (half the client timeout) to flush any dirty
-            data and release the lock.</para>
-        </listitem>
-        <listitem>
-          <para><literal>/proc/sys/lustre/fail_loc</literal> - The internal debugging failure hook.
-            See <literal>lustre/include/linux/obd_support.h</literal> for the definitions of
-            individual failure locations. The default value is 0 (zero).</para>
-        </listitem>
-        <listitem>
-          <para><literal>/proc/sys/lustre/dump_on_timeout</literal> - Triggers dumps of the Lustre
-            debug log when timeouts occur. The default value is 0 (zero).</para>
-        </listitem>
-        <listitem>
-          <para><literal>/proc/sys/lustre/dump_on_eviction</literal> - Triggers dumps of the Lustre
-            debug log when an eviction occurs. The default value is 0 (zero). </para>
+          <para>Lustre timeouts that ensure Lustre RPCs complete in finite time in the presence of failures. These timeouts should <emphasis>always</emphasis> be printed as console messages. If Lustre timeouts are not accompanied by LNET timeouts, then you need to increase the lustre timeout on both servers and clients.</para>
          </listitem>
        </itemizedlist>
+      <para>Specific Lustre timeouts are described below.</para>
+      <para><literal> /proc/sys/lustre/timeout </literal></para>
+      <para>This is the time period that a client waits for a server to complete an RPC (default is 100s). Servers wait half of this time for a normal client RPC to complete and a quarter of this time for a single bulk request (read or write of up to 4 MB) to complete. The client pings recoverable targets (MDS and OSTs) at one quarter of the timeout, and the server waits one and a half times the timeout before evicting a client for being &quot;stale.&quot;</para>
+      <note>
+        <para>Lustre sends periodic &apos;PING&apos; messages to servers with which it had no communication for a specified period of time. Any network activity on the file system that triggers network traffic toward servers also works as a health check.</para>
+      </note>
+      <para><literal> /proc/sys/lustre/ldlm_timeout </literal></para>
+      <para>This is the time period for which a server will wait for a client to reply to an initial AST (lock cancellation request) where default is 20s for an OST and 6s for an MDS. If the client replies to the AST, the server will give it a normal timeout (half of the client timeout) to flush any dirty data and release the lock.</para>
+      <para><literal> /proc/sys/lustre/fail_loc </literal></para>
+      <para>This is the internal debugging failure hook.</para>
+      <para>See <literal>lustre/include/linux/obd_support.h</literal> for the definitions of individual failure locations. The default value is 0 (zero).</para>
+      <screen>sysctl -w lustre.fail_loc=0x80000122 # drop a single reply</screen>
+      <para><literal> /proc/sys/lustre/dump_on_timeout </literal></para>
+      <para>This triggers dumps of the Lustre debug log when timeouts occur. The default value is 0 (zero).</para>
+      <para><literal> /proc/sys/lustre/dump_on_eviction </literal></para>
+      <para>This triggers dumps of the Lustre debug log when an eviction occurs. The default value is 0 (zero). By default, debug logs are dumped to the /tmp folder; this location can be changed via /proc.</para>
      </section>
      <section remap="h3">
-      <title><indexterm>
-          <primary>proc</primary>
-          <secondary>adaptive timeouts</secondary>
-        </indexterm>Adaptive Timeouts</title>
-      <para>In a Lustre file system, an adaptive mechanism is used to set RPC timeouts. The adaptive
-        timeouts feature (enabled, by default) causes servers to track actual RPC completion times
-        and to report estimated completion times for future RPCs back to clients. The clients use
-        these estimates to set their future RPC timeout values. If server request processing slows
-        down for any reason, the RPC completion estimates increase, and the clients allow more time
-        for RPC completion.</para>
-      <para>If RPCs queued on the server approach their timeouts, then the server sends an early
-        reply to the client, telling the client to allow more time. In this manner, clients avoid
-        RPC timeouts and disconnect/reconnect cycles. Conversely, as a server speeds up, RPC timeout
-        values decrease, allowing faster detection of non-responsive servers and faster attempts to
-        reconnect to the failover partner of the server.</para>
-      <para>Adaptive timeouts were introduced in the Lustre 1.8.0.1 release. Prior to this release,
-        the static <literal>obd_timeout</literal> (<literal>/proc/sys/lustre/timeout</literal>)
-        value was used as the maximum completion time for all RPCs; this value also affected the
-        client-server ping interval and initial recovery timer. With adaptive timeouts,
-          <literal>obd_timeout</literal> is only used for the ping interval and initial recovery
-        estimate. When a client reconnects during recovery, the server uses the client&apos;s
-        timeout value to reset the recovery wait period; i.e., the server learns how long the client
-        had been willing to wait, and takes this into account when adjusting the recovery
-        period.</para>
+      <title><indexterm><primary>proc</primary><secondary>adaptive timeouts</secondary></indexterm>Adaptive Timeouts</title>
+      <para>Lustre offers an adaptive mechanism to set RPC timeouts. The adaptive timeouts feature (enabled, by default) causes servers to track actual RPC completion times, and to report estimated completion times for future RPCs back to clients. The clients use these estimates to set their future RPC timeout values. If server request processing slows down for any reason, the RPC completion estimates increase, and the clients allow more time for RPC completion.</para>
+      <para>If RPCs queued on the server approach their timeouts, then the server sends an early reply to the client, telling the client to allow more time. In this manner, clients avoid RPC timeouts and disconnect/reconnect cycles. Conversely, as a server speeds up, RPC timeout values decrease, allowing faster detection of non-responsive servers and faster attempts to reconnect to a server&apos;s failover partner.</para>
+      <para>In previous Lustre versions, the static obd_timeout (<literal>/proc/sys/lustre/timeout</literal>) value was used as the maximum completion time for all RPCs; this value also affected the client-server ping interval and initial recovery timer. Now, with adaptive timeouts, obd_timeout is only used for the ping interval and initial recovery estimate. When a client reconnects during recovery, the server uses the client&apos;s timeout value to reset the recovery wait period; i.e., the server learns how long the client had been willing to wait, and takes this into account when adjusting the recovery period.</para>
        <section remap="h4">
-        <title><indexterm>
-            <primary>proc</primary>
-            <secondary>configuring adaptive timeouts</secondary>
-          </indexterm><indexterm>
-            <primary>configuring</primary>
-            <secondary>adaptive timeouts</secondary>
-          </indexterm>Configuring Adaptive Timeouts</title>
-        <para>A goal of adaptive timeouts is to relieve users from having to tune the
-            <literal>obd_timeout</literal> value. In general, <literal>obd_timeout</literal> should
-          no longer need to be changed. However, several parameters related to adaptive timeouts can
-          be set by users. In most situations, the default values should be used.</para>
-        <para>The following parameters can be set persistently system-wide using <literal>lctl
-            conf_param</literal> on the MGS. For example, <literal>lctl conf_param
-            testfs.sys.at_max=1500</literal> sets the <literal>at_max</literal> value for all
-          servers and clients using the testfs file system.</para>
+        <title><indexterm><primary>proc</primary><secondary>configuring adaptive timeouts</secondary></indexterm><indexterm><primary>configuring</primary><secondary>adaptive timeouts</secondary></indexterm>Configuring Adaptive Timeouts</title>
+        <para>One of the goals of adaptive timeouts is to relieve users from having to tune the <literal>obd_timeout</literal> value. In general, <literal>obd_timeout</literal> should no longer need to be changed. However, there are several parameters related to adaptive timeouts that users can set. In most situations, the default values should be used.</para>
+        <para>The following parameters can be set persistently system-wide using <literal>lctl conf_param</literal> on the MGS. For example, <literal>lctl conf_param work1.sys.at_max=1500</literal> sets the at_max value for all servers and clients using the work1 file system.</para>
          <note>
-          <para>Nodes using multiple Lustre file systems must use the same <literal>at_*</literal>
-            values for all file systems.)</para>
+          <para>Nodes using multiple Lustre file systems must use the same <literal>at_*</literal> values for all file systems.)</para>
          </note>
          <informaltable frame="all">
            <tgroup cols="2">
-            <colspec colname="c1" colwidth="30*"/>
-            <colspec colname="c2" colwidth="80*"/>
+            <colspec colname="c1" colwidth="50*"/>
+            <colspec colname="c2" colwidth="50*"/>
              <thead>
                <row>
                  <entry>
@@ -202,151 +115,92 @@ lustre-MDT0000</screen>
              <tbody>
                <row>
                  <entry>
-                  <para>
-                    <literal> at_min </literal></para>
+                  <para> <literal> at_min </literal></para>
                  </entry>
                  <entry>
-                  <para>Sets the minimum adaptive timeout (in seconds). Default value is 0. The
-                      <literal>at_min</literal> parameter is the minimum processing time that a
-                    server will report. Clients base their timeouts on this value, but they do not
-                    use this value directly. If you experience cases in which, for unknown reasons,
-                    the adaptive timeout value is too short and clients time out their RPCs (usually
-                    due to temporary network outages), then you can increase the
-                      <literal>at_min</literal> value to compensate for this. Ideally, users should
-                    leave <literal>at_min</literal> set to its default.</para>
+                  <para>Sets the minimum adaptive timeout (in seconds). Default value is 0. The at_min parameter is the minimum processing time that a server will report. Clients base their timeouts on this value, but they do not use this value directly. If you experience cases in which, for unknown reasons, the adaptive timeout value is too short and clients time out their RPCs (usually due to temporary network outages), then you can increase the at_min value to compensate for this. Ideally, users should leave at_min set to its default.</para>
                  </entry>
                </row>
                <row>
                  <entry>
-                  <para>
-                    <literal> at_max </literal></para>
+                  <para> <literal> at_max </literal></para>
                  </entry>
                  <entry>
-                  <para>Sets the maximum adaptive timeout (in seconds). The
-                      <literal>at_max</literal> parameter is an upper-limit on the service time
-                    estimate and is used as a &apos;failsafe&apos; in case of rogue/bad/buggy code
-                    that would lead to never-ending estimate increases. If <literal>at_max</literal>
-                    is reached, an RPC request is considered &apos;broken&apos; and will time
-                    out.</para>
-                  <para>Setting <literal>at_max</literal> to 0 causes adaptive timeouts to be
-                    disabled and the static fixed-timeout method (<literal>obd_timeout</literal>) to
-                    be used.</para>
+                  <para>Sets the maximum adaptive timeout (in seconds). The <literal>at_max</literal> parameter is an upper-limit on the service time estimate, and is used as a &apos;failsafe&apos; in case of rogue/bad/buggy code that would lead to never-ending estimate increases. If at_max is reached, an RPC request is considered &apos;broken&apos; and should time out.</para>
+                  <para>Setting at_max to 0 causes adaptive timeouts to be disabled and the old fixed-timeout method (<literal>obd_timeout</literal>) to be used.</para>
                    <note>
-                    <para>It is possible that slow hardware might validly cause the service estimate
-                      to increase beyond the default value of <literal>at_max</literal>. In this
-                      case, you should increase <literal>at_max</literal> to the maximum time you
-                      are willing to wait for an RPC completion.</para>
+                    <para>It is possible that slow hardware might validly cause the service estimate to increase beyond the default value of at_max. In this case, you should increase at_max to the maximum time you are willing to wait for an RPC completion.</para>
                    </note>
                  </entry>
                </row>
                <row>
                  <entry>
-                  <para>
-                    <literal> at_history </literal></para>
+                  <para> <literal> at_history </literal></para>
                  </entry>
                  <entry>
-                  <para>Sets a time period (in seconds) within which adaptive timeouts remember the
-                    slowest event that occurred. Default value is 600.</para>
+                  <para>Sets a time period (in seconds) within which adaptive timeouts remember the slowest event that occurred. Default value is 600.</para>
                  </entry>
                </row>
                <row>
                  <entry>
-                  <para>
-                    <literal> at_early_margin </literal></para>
+                  <para> <literal> at_early_margin </literal></para>
                  </entry>
                  <entry>
-                  <para>Sets how far before the deadline the Lustre client sends an early reply.
-                    Default value is 5<footnote>
-                      <para>This default was chosen as a reasonable time in which to send a reply
-                        from the point at which it was sent.</para>
+                  <para>Sets how far before the deadline Lustre sends an early reply. Default value is 5<footnote>
+                      <para>This default was chosen as a reasonable time in which to send a reply from the point at which it was sent.</para>
                      </footnote>.</para>
                  </entry>
                </row>
                <row>
                  <entry>
-                  <para>
-                    <literal> at_extra </literal></para>
+                  <para> <literal> at_extra </literal></para>
                  </entry>
                  <entry>
-                  <para>Sets the incremental amount of time that a server asks for, with each early
-                    reply. The server does not know how much time the RPC will take, so it asks for
-                    a fixed value. Default value is 30<footnote>
-                      <para>This default was chosen as a balance between sending too many early
-                        replies for the same RPC and overestimating the actual completion
-                        time.</para>
-                    </footnote>. When a server finds a queued request about to time out (and needs
-                    to send an early reply out), the server adds the <literal>at_extra</literal>
-                    value. If the time expires, the Lustre client enters recovery status and
-                    reconnects to restore it to normal status.</para>
-                  <para>If you see multiple early replies for the same RPC asking for multiple
-                    30-second increases, change the <literal>at_extra</literal> value to a larger
-                    number to cut down on early replies sent and, therefore, network load.</para>
+                  <para>Sets the incremental amount of time that a server asks for, with each early reply. The server does not know how much time the RPC will take, so it asks for a fixed value. Default value is 30<footnote>
+                      <para>This default was chosen as a balance between sending too many early replies for the same RPC and overestimating the actual completion time</para>
+                    </footnote>. When a server finds a queued request about to time out (and needs to send an early reply out), the server adds the at_extra value. If the time expires, the Lustre client enters recovery status and reconnects to restore it to normal status.</para>
+                  <para>If you see multiple early replies for the same RPC asking for multiple 30-second increases, change the at_extra value to a larger number to cut down on early replies sent and, therefore, network load.</para>
                  </entry>
                </row>
                <row>
                  <entry>
-                  <para>
-                    <literal> ldlm_enqueue_min </literal></para>
+                  <para> <literal> ldlm_enqueue_min </literal></para>
                  </entry>
                  <entry>
-                  <para>Sets the minimum lock enqueue time. Default value is 100. The
-                      <literal>ldlm_enqueue</literal> time is the maximum of the measured enqueue
-                    estimate (influenced by <literal>at_min</literal> and <literal>at_max</literal>
-                    parameters), multiplied by a weighting factor, and the
-                      <literal>ldlm_enqueue_min</literal> setting. LDLM lock enqueues were based on
-                    the <literal>obd_timeout</literal> value; now they have a dedicated minimum
-                    value. Lock enqueues increase as the measured enqueue times increase (similar to
-                    adaptive timeouts).</para>
+                  <para> Sets the minimum lock enqueue time. Default value is 100. The <literal>ldlm_enqueue</literal> time is the maximum of the measured enqueue estimate (influenced by at_min and at_max parameters), multiplied by a weighting factor, and the <literal>ldlm_enqueue_min</literal> setting. LDLM lock enqueues were based on the <literal>obd_timeout</literal> value; now they have a dedicated minimum value. Lock enqueues increase as the measured enqueue times increase (similar to adaptive timeouts).</para>
                  </entry>
                </row>
              </tbody>
            </tgroup>
          </informaltable>
-        <para>Adaptive timeouts are enabled by default. To disable adaptive timeouts, at run time,
-          set <literal>at_max</literal> to 0. On the MGS, run:</para>
+        <para>Adaptive timeouts are enabled, by default. To disable adaptive timeouts, at run time, set <literal>at_max</literal> to 0. On the MGS, run:</para>
          <screen>$ lctl conf_param <replaceable>fsname</replaceable>.sys.at_max=0</screen>
          <note>
-          <para>Changing the status of adaptive timeouts at runtime may cause a transient client
-            timeout, recovery, and reconnection.</para>
+          <para>Changing adaptive timeouts status at runtime may cause transient timeout, reconnect, recovery, etc.</para>
          </note>
        </section>
        <section remap="h4">
-        <title><indexterm>
-            <primary>proc</primary>
-            <secondary>interpreting adaptive timeouts</secondary>
-          </indexterm>Interpreting Adaptive Timeout Information</title>
-        <para>Adaptive timeout information can be read from the timeouts files in
-            <literal>/proc/fs/lustre/*/</literal> for each server and client or by using the
-            <literal>lctl</literal> command.</para>
-        <para>To read information from timeouts file, enter a command similar to:</para>
+        <title><indexterm><primary>proc</primary><secondary>interpreting adaptive timeouts</secondary></indexterm>Interpreting Adaptive Timeouts Information</title>
+        <para>Adaptive timeouts information can be read from <literal>/proc/fs/lustre/*/timeouts</literal> files (for each service and client) or with the lctl command.</para>
+        <para>This is an example from the <literal>/proc/fs/lustre/*/timeouts</literal> files:</para>
          <screen>cfs21:~# cat /proc/fs/lustre/ost/OSS/ost_io/timeouts</screen>
-        <para>To use the <literal>lctl</literal> command, enter a command similar to:</para>
+        <para>This is an example using the <literal>lctl</literal> command:</para>
          <screen>$ lctl get_param -n ost.*.ost_io.timeouts</screen>
-        <para>Example output:</para>
+        <para>This is the sample output:</para>
          <screen>service : cur 33  worst 34 (at 1193427052, 0d0h26m40s ago) 1 1 33 2</screen>
-        <para>In this example, the <literal>ost_io</literal> service on this node is currently
-          reporting an estimate of 33 seconds. The worst RPC service time was 34 seconds, and it
-          happened 26 minutes ago.</para>
-        <para>The output also provides a history of service times. In this example, four
-          &quot;bins&quot; of <literal>adaptive_timeout_history</literal> are shown, with the
-          maximum RPC time in each bin reported. In 0-150 seconds, the maximum RPC time was 1, with
-          the same result in 150-300 seconds. From 300-450 seconds, the worst (maximum) RPC time was
-          33 seconds, and from 450-600s the worst time was 2 seconds. The current estimated service
-          time is the maximum value of the four bins (33 seconds in this example).</para>
+        <para>The <literal>ost_io</literal> service on this node is currently reporting an estimate of 33 seconds. The worst RPC service time was 34 seconds, and it happened 26 minutes ago.</para>
+        <para>The output also provides a history of service times. In the example, there are four &quot;bins&quot; of <literal>adaptive_timeout_history</literal>, with the maximum RPC time in each bin reported. From 0-150 seconds, the maximum RPC time was 1 second, with the same result from 150-300 seconds. From 300-450 seconds, the worst (maximum) RPC time was 33 seconds, and from 450-600 seconds the worst time was 2 seconds. The current estimated service time is the maximum value of the four bins (33 seconds in this example).</para>
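The relationship between the history bins and the current estimate can be checked with a short script. The following is a minimal sketch, not part of the Lustre tools: the function name `parse_timeouts_line` and the regular expression are illustrative assumptions based only on the sample output shown above, and the real file format may vary between releases.

```python
import re

# Assumed layout, taken from the sample line in this section:
#   <name> : cur <n> worst <n> (at <epoch>, <age> ago) <bin> <bin> <bin> <bin>
_LINE = re.compile(
    r"^\s*(?P<name>.+?)\s*:\s*cur\s+(?P<cur>\d+)\s+worst\s+(?P<worst>\d+)"
    r"\s+\(at\s+(?P<at>\d+),\s*(?P<ago>\S+)\s+ago\)\s*(?P<bins>[\d\s]*)$"
)


def parse_timeouts_line(line):
    """Parse one 'cur/worst' line of a timeouts file into a dict."""
    match = _LINE.match(line)
    if match is None:
        # Lines such as 'last reply : ...' do not follow this layout.
        raise ValueError("unrecognized timeouts line: %r" % line)
    return {
        "name": match.group("name"),
        "cur": int(match.group("cur")),
        "worst": int(match.group("worst")),
        "worst_at": int(match.group("at")),
        "bins": [int(value) for value in match.group("bins").split()],
    }


sample = "service : cur 33  worst 34 (at 1193427052, 0d0h26m40s ago) 1 1 33 2"
info = parse_timeouts_line(sample)
# The current estimate equals the largest value in the history bins.
print(info["cur"], max(info["bins"]))
```

For the sample line, both printed values are 33, which matches the statement that the current estimate is the maximum of the four bins.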
          <para>Service times (as reported by the servers) are also tracked in the client OBDs:</para>
          <screen>cfs21:# lctl get_param osc.*.timeouts
  last reply : 1193428639, 0d0h00m00s ago
-network    : cur  1 worst  2 (at 1193427053, 0d0h26m26s ago)  1  1  1  1
-portal 6   : cur 33 worst 34 (at 1193427052, 0d0h26m27s ago) 33 33 33  2
-portal 28  : cur  1 worst  1 (at 1193426141, 0d0h41m38s ago)  1  1  1  1
-portal 7   : cur  1 worst  1 (at 1193426141, 0d0h41m38s ago)  1  0  1  1
-portal 17  : cur  1 worst  1 (at 1193426177, 0d0h41m02s ago)  1  0  0  1
+network    : cur   1  worst   2 (at 1193427053, 0d0h26m26s ago)   1   1   1   1
+portal 6   : cur  33  worst  34 (at 1193427052, 0d0h26m27s ago)  33  33  33   2
+portal 28  : cur   1  worst   1 (at 1193426141, 0d0h41m38s ago)   1   1   1   1
+portal 7   : cur   1  worst   1 (at 1193426141, 0d0h41m38s ago)   1   0   1   1
+portal 17  : cur   1  worst   1 (at 1193426177, 0d0h41m02s ago)   1   0   0   1
  </screen>
-        <para>In this case, RPCs to portal 6, the <literal>OST_IO_PORTAL</literal> (see
-            <literal>lustre/include/lustre/lustre_idl.h</literal>), shows the history of what the
-            <literal>ost_io</literal> portal has reported as the service estimate.</para>
-        <para>Server statistic files also show the range of estimates in the order
-          min/max/sum/sumsq.</para>
+        <para>In this case, the entry for portal 6, the <literal>OST_IO_PORTAL</literal> (see <literal>lustre/include/lustre/lustre_idl.h</literal>), shows the history of what the <literal>ost_io</literal> portal has reported as the service estimate.</para>
+        <para>Server statistic files also show the range of estimates in the order min/max/sum/sumsq.</para>
          <screen>cfs21:~# lctl get_param mdt.*.mdt.stats
  ...
  req_timeout               6 samples [sec] 1 10 15 105
@@ -355,336 +209,253 @@ req_timeout               6 samples [sec] 1 10 15 105
        </section>
      </section>
      <section remap="h3">
-      <title><indexterm>
-          <primary>proc</primary>
-          <secondary>LNET</secondary>
-        </indexterm><indexterm>
-          <primary>LNET</primary>
-          <secondary>proc</secondary>
-        </indexterm>LNET Information</title>
-      <para>This section describes <literal>/proc</literal> entries containing LNET information.
-        These entries include:<itemizedlist>
-          <listitem>
-            <para><literal>/proc/sys/lnet/peers</literal> - Shows all NIDs known to this node and
-              provides information on the queue state.</para>
-            <para>Example:</para>
-            <screen># cat /proc/sys/lnet/peers
-nid                refs   state  max  rtr  min   tx    min   queue
-0@lo               1      ~rtr   0    0    0     0     0     0
-192.168.10.35@tcp  1      ~rtr   8    8    8     8     6     0
-192.168.10.36@tcp  1      ~rtr   8    8    8     8     6     0
-192.168.10.37@tcp  1      ~rtr   8    8    8     8     6     0</screen>
-            <para>The fields are explained in the table below:</para>
-            <informaltable frame="all">
-              <tgroup cols="2">
-                <colspec colname="c1" colwidth="30*"/>
-                <colspec colname="c2" colwidth="80*"/>
-                <thead>
-                  <row>
-                    <entry>
-                      <para><emphasis role="bold">Field</emphasis></para>
-                    </entry>
-                    <entry>
-                      <para><emphasis role="bold">Description</emphasis></para>
-                    </entry>
-                  </row>
-                </thead>
-                <tbody>
-                  <row>
-                    <entry>
-                      <para>
+      <title><indexterm><primary>proc</primary><secondary>LNET</secondary></indexterm><indexterm><primary>LNET</primary><secondary>proc</secondary></indexterm>LNET Information</title>
+      <para>This section describes <literal>/proc</literal> entries for LNET information.</para>
+      <para><literal>/proc/sys/lnet/peers</literal></para>
+      <para>Shows all NIDs known to this node and provides information on the queue state.</para>
+      <screen># cat /proc/sys/lnet/peers
+nid                refs   state  max  rtr  min   tx    min   queue
+0@lo               1      ~rtr   0    0    0     0     0     0
+192.168.10.35@tcp  1      ~rtr   8    8    8     8     6     0
+192.168.10.36@tcp  1      ~rtr   8    8    8     8     6     0
+192.168.10.37@tcp  1      ~rtr   8    8    8     8     6     0</screen>
+      <para>The fields are explained below:</para>
+      <informaltable frame="all">
+        <tgroup cols="2">
+          <colspec colname="c1" colwidth="50*"/>
+          <colspec colname="c2" colwidth="50*"/>
+          <thead>
+            <row>
+              <entry>
+                <para><emphasis role="bold">Field</emphasis></para>
+              </entry>
+              <entry>
+                <para><emphasis role="bold">Description</emphasis></para>
+              </entry>
+            </row>
+          </thead>
+          <tbody>
+            <row>
+              <entry>
+                <para> 
                          <literal>
-                          <replaceable>refs</replaceable>
-                        </literal>
-                      </para>
-                    </entry>
-                    <entry>
-                      <para>A reference count (principally used for debugging).</para>
-                    </entry>
-                  </row>
-                  <row>
-                    <entry>
-                      <para>
+                    <replaceable>refs</replaceable>
+                  </literal>
+                  </para>
+              </entry>
+              <entry>
+                <para>A reference count (principally used for debugging).</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> 
                          <literal>
-                          <replaceable>state</replaceable>
-                        </literal>
-                      </para>
-                    </entry>
-                    <entry>
-                      <para>Only valid to refer to routers. Possible values:</para>
-                      <itemizedlist>
-                        <listitem>
-                          <para><literal>~rtr</literal> (indicates this node is not a router)</para>
-                        </listitem>
-                        <listitem>
-                          <para><literal>up/down</literal> (indicates this node is a router)</para>
-                        </listitem>
-                        <listitem>
-                          <para><literal>auto_fail</literal> (if enabled)</para>
-                        </listitem>
-                      </itemizedlist>
-                    </entry>
-                  </row>
-                  <row>
-                    <entry>
-                      <para>
-                        <literal> max </literal></para>
-                    </entry>
-                    <entry>
-                      <para>Maximum number of concurrent sends from this peer.</para>
-                    </entry>
-                  </row>
-                  <row>
-                    <entry>
-                      <para>
-                        <literal> rtr </literal></para>
-                    </entry>
-                    <entry>
-                      <para>Routing buffer credits.</para>
-                    </entry>
-                  </row>
-                  <row>
-                    <entry>
-                      <para>
-                        <literal> min </literal></para>
-                    </entry>
-                    <entry>
-                      <para>Minimum routing buffer credits seen.</para>
-                    </entry>
-                  </row>
-                  <row>
-                    <entry>
-                      <para>
-                        <literal> tx </literal></para>
-                    </entry>
-                    <entry>
-                      <para>Send credits.</para>
-                    </entry>
-                  </row>
-                  <row>
-                    <entry>
-                      <para>
-                        <literal> min </literal></para>
-                    </entry>
-                    <entry>
-                      <para>Minimum send credits seen.</para>
-                    </entry>
-                  </row>
-                  <row>
-                    <entry>
-                      <para>
-                        <literal> queue </literal></para>
-                    </entry>
-                    <entry>
-                      <para>Total bytes in active/queued sends.</para>
-                    </entry>
-                  </row>
-                </tbody>
-              </tgroup>
-            </informaltable>
-            <para>Credits work like a semaphore. They are initialized to allow a certain number of
-              operations (8 in the example above). LNET keeps a track of the minimum value so that
-              you can see how congested a resource is.</para>
-            <para>A value of <literal>rtr/tx</literal> less than <literal>max</literal> indicates
-              operations are in progress. The number of operations is equal to
-                <literal>rtr</literal> or <literal>tx</literal> subtracted from
-                <literal>max</literal>.</para>
-            <para>A value of <literal>rtr/tx</literal> greater that <literal>max</literal> indicates
-              operations are blocking.</para>
-            <para>LNET also limits concurrent sends and router buffers allocated to a single peer so
-              that no peer can occupy all these resources.</para>
-          </listitem>
-        </itemizedlist><itemizedlist>
-          <listitem>
-            <para><literal>/proc/sys/lnet/nis</literal> - Shows the current queue health on this
-              node.</para>
-            <para>Example:</para>
-            <screen># cat /proc/sys/lnet/nis
-nid                    refs   peer    max   tx    min
-0@lo                   3      0       0     0     0
-192.168.10.34@tcp      4      8       256   256   252
+                    <replaceable>state</replaceable>
+                  </literal>
+                  </para>
+              </entry>
+              <entry>
+                <para>Only valid to refer to routers. Possible values:</para>
+                <itemizedlist>
+                  <listitem>
+                    <para><literal>~rtr</literal> (indicates this node is not a router)</para>
+                  </listitem>
+                  <listitem>
+                    <para>up/down (indicates this node is a router)</para>
+                  </listitem>
+                  <listitem>
+                    <para>auto_fail must be enabled</para>
+                  </listitem>
+                </itemizedlist>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal> max </literal></para>
+              </entry>
+              <entry>
+                <para>Maximum number of concurrent sends from this peer.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal> rtr </literal></para>
+              </entry>
+              <entry>
+                <para>Routing buffer credits.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal> min </literal></para>
+              </entry>
+              <entry>
+                <para>Minimum routing buffer credits seen.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal> tx </literal></para>
+              </entry>
+              <entry>
+                <para>Send credits.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal> min </literal></para>
+              </entry>
+              <entry>
+                <para>Minimum send credits seen.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal> queue </literal></para>
+              </entry>
+              <entry>
+                <para>Total bytes in active/queued sends.</para>
+              </entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </informaltable>
+      <para>Credits work like a semaphore. They are initialized at startup to allow a certain number of operations (8 in this example). LNET keeps track of the minimum value so that you can see how congested a resource has been.</para>
+      <para>If <literal>rtr/tx</literal> is less than <literal>max</literal>, there are operations in progress. The number of operations is equal to <literal>rtr</literal> or <literal>tx</literal> subtracted from <literal>max</literal>.</para>
+      <para>If <literal>rtr/tx</literal> is greater than <literal>max</literal>, there are operations blocking.</para>
+      <para>LNET also limits concurrent sends and router buffers allocated to a single peer so that no peer can occupy all these resources.</para>
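The credit arithmetic described above can be applied to the sample `peers` output with a few lines of Python. This is a rough sketch only: the function name `peer_credit_usage` and the result keys are hypothetical, the columns are read by position because two of them are both named `min`, and the `peak_sends` figure (computed from the minimum send credits seen) is an extrapolation of the rule in the text rather than something the file reports directly.

```python
SAMPLE = """\
nid                refs   state  max  rtr  min   tx    min   queue
0@lo               1      ~rtr   0    0    0     0     0     0
192.168.10.35@tcp  1      ~rtr   8    8    8     8     6     0
"""


def peer_credit_usage(peers_text):
    """Return per-NID operation counts derived from the credit columns."""
    usage = {}
    for line in peers_text.strip().splitlines()[1:]:  # skip the header row
        # Positional unpacking: the header has two 'min' columns.
        nid, _refs, _state, max_c, rtr, _rtr_min, tx, tx_min, _queue = line.split()
        max_c, rtr, tx, tx_min = int(max_c), int(rtr), int(tx), int(tx_min)
        usage[nid] = {
            "routing_in_progress": max_c - rtr,  # max minus rtr
            "sends_in_progress": max_c - tx,     # max minus tx
            "peak_sends": max_c - tx_min,        # assumption: low-water mark
        }
    return usage


usage = peer_credit_usage(SAMPLE)
print(usage["192.168.10.35@tcp"])
```

In the sample, no operations are in progress for `192.168.10.35@tcp`, but the minimum send credit value of 6 suggests that two sends were outstanding at the busiest point.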
+      <para><literal>/proc/sys/lnet/nis</literal></para>
+      <screen># cat /proc/sys/lnet/nis
+nid                    refs   peer    max   tx    min
+0@lo                   3      0       0     0     0
+192.168.10.34@tcp      4      8       256   256   252
  </screen>
-            <para> The fields are explained below:</para>
-            <informaltable frame="all">
-              <tgroup cols="2">
-                <colspec colname="c1" colwidth="30*"/>
-                <colspec colname="c2" colwidth="80*"/>
-                <thead>
-                  <row>
-                    <entry>
-                      <para><emphasis role="bold">Field</emphasis></para>
-                    </entry>
-                    <entry>
-                      <para><emphasis role="bold">Description</emphasis></para>
-                    </entry>
-                  </row>
-                </thead>
-                <tbody>
-                  <row>
-                    <entry>
-                      <para>
-                        <literal> nid </literal></para>
-                    </entry>
-                    <entry>
-                      <para>Network interface.</para>
-                    </entry>
-                  </row>
-                  <row>
-                    <entry>
-                      <para>
-                        <literal> refs </literal></para>
-                    </entry>
-                    <entry>
-                      <para>Internal reference counter.</para>
-                    </entry>
-                  </row>
-                  <row>
-                    <entry>
-                      <para>
-                        <literal> peer </literal></para>
-                    </entry>
-                    <entry>
-                      <para>Number of peer-to-peer send credits on this NID. Credits are used to
-                        size buffer pools.</para>
-                    </entry>
-                  </row>
-                  <row>
-                    <entry>
-                      <para>
-                        <literal> max </literal></para>
-                    </entry>
-                    <entry>
-                      <para>Total number of send credits on this NID.</para>
-                    </entry>
-                  </row>
-                  <row>
-                    <entry>
-                      <para>
-                        <literal> tx </literal></para>
-                    </entry>
-                    <entry>
-                      <para>Current number of send credits available on this NID.</para>
-                    </entry>
-                  </row>
-                  <row>
-                    <entry>
-                      <para>
-                        <literal> min </literal></para>
-                    </entry>
-                    <entry>
-                      <para>Lowest number of send credits available on this NID.</para>
-                    </entry>
-                  </row>
-                  <row>
-                    <entry>
-                      <para>
-                        <literal> queue </literal></para>
-                    </entry>
-                    <entry>
-                      <para>Total bytes in active/queued sends.</para>
-                    </entry>
-                  </row>
-                </tbody>
-              </tgroup>
-            </informaltable>
-            <para>Subtracting <literal>max</literal> - <literal>tx</literal> yields the number of
-              sends currently active. A large or increasing number of active sends may indicate a
-              problem.</para>
-            <para>Example:</para>
-            <screen># cat /proc/sys/lnet/nis
-nid                   refs       peer       max        tx         min
-0@lo                  2          0          0          0          0
-10.67.73.173@tcp      4          8          256        256        253
+      <para>Shows the current queue health on this node. The fields are explained below:</para>
+      <informaltable frame="all">
+        <tgroup cols="2">
+          <colspec colname="c1" colwidth="50*"/>
+          <colspec colname="c2" colwidth="50*"/>
+          <thead>
+            <row>
+              <entry>
+                <para><emphasis role="bold">Field</emphasis></para>
+              </entry>
+              <entry>
+                <para><emphasis role="bold">Description</emphasis></para>
+              </entry>
+            </row>
+          </thead>
+          <tbody>
+            <row>
+              <entry>
+                <para> <literal> nid </literal></para>
+              </entry>
+              <entry>
+                <para>Network interface.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal> refs </literal></para>
+              </entry>
+              <entry>
+                <para>Internal reference counter.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal> peer </literal></para>
+              </entry>
+              <entry>
+                <para>Number of peer-to-peer send credits on this NID. Credits are used to size buffer pools.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal> max </literal></para>
+              </entry>
+              <entry>
+                <para>Total number of send credits on this NID.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal> tx </literal></para>
+              </entry>
+              <entry>
+                <para>Current number of send credits available on this NID.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal> min </literal></para>
+              </entry>
+              <entry>
+                <para>Lowest number of send credits available on this NID.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal> queue </literal></para>
+              </entry>
+              <entry>
+                <para>Total bytes in active/queued sends.</para>
+              </entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </informaltable>
+      <para>Subtracting <literal>tx</literal> from <literal>max</literal> yields the number of sends currently active. A large or increasing number of active sends may indicate a problem.</para>
+      <screen># cat /proc/sys/lnet/nis
+nid                   refs       peer       max        tx         min
+0@lo                  2          0          0          0          0
+10.67.73.173@tcp      4          8          256        256        253
  </screen>
-          </listitem>
-        </itemizedlist></para>
      </section>
      <section remap="h3">
        <title><indexterm>
            <primary>proc</primary>
-          <secondary>free space</secondary>
+          <secondary>free space distribution</secondary>
          </indexterm>Free Space Distribution</title>
-      <para>Free-space stripe weighting, as set, gives a priority of &quot;0&quot; to free space
-        (versus trying to place the stripes &quot;widely&quot; -- nicely distributed across OSSs and
-        OSTs to maximize network balancing). To adjust this priority as a percentage, use the
-          <literal>/proc</literal> tunable<literal>qos_prio_free</literal>:</para>
-      <screen>$ cat /proc/fs/lustre/lov/<replaceable>fsname</replaceable>-mdtlov/qos_prio_free</screen>
-      <para>The default is 90%. You can permanently set this value by running this command on the
-        MGS:</para>
-      <screen>$ lctl conf_param <replaceable>fsname</replaceable>-MDT0000.lov.qos_prio_free=90</screen>
-      <para>Setting the priority to 100% means that OSS distribution does not count in the
-        weighting, but the stripe assignment is still done via weighting. If OST 2 has twice as much
-        free space as OST 1, it is twice as likely to be used, but it is NOT guaranteed to be
-        used.</para>
-      <para>Also note that free-space stripe weighting does not activate until two OSTs are
-        imbalanced by more than 20%. Until then, a faster round-robin stripe allocator is used. (The
-        round-robin order also maximizes network balancing.)</para>
-      <section remap="h4">
-        <title><indexterm>
-            <primary>proc</primary>
-            <secondary>striping</secondary>
-          </indexterm>Managing Stripe Allocation</title>
-        <para>The MDS uses two methods to manage stripe allocation and determine which OSTs to use
-          for file object storage:</para>
-        <itemizedlist>
-          <listitem>
-            <para><emphasis role="bold">QOS</emphasis></para>
-            <para>Quality of Service (QOS) considers an OST&apos;s available blocks, speed, and the
-              number of existing objects, etc. Using these criteria, the MDS selects OSTs with more
-              free space more often than OSTs with less free space.</para>
-          </listitem>
-        </itemizedlist>
-        <itemizedlist>
-          <listitem>
-            <para><emphasis role="bold">RR</emphasis></para>
-            <para>Round-Robin (RR) allocates objects evenly across all OSTs. The RR stripe allocator
-              is faster than QOS, and used often because it distributes space usage/load best in
-              most situations, maximizing network balancing and improving performance.</para>
-          </listitem>
-        </itemizedlist>
-        <para>Whether QOS or RR is used depends on the setting of the
-            <literal>qos_threshold_rr</literal> proc tunable. The
-            <literal>qos_threshold_rr</literal> variable specifies a percentage threshold where the
-          use of QOS or RR becomes more/less likely. The <literal>qos_threshold_rr</literal> tunable
-          can be set as an integer, from 0 to 100, and results in this stripe allocation
-          behavior:</para>
-        <itemizedlist>
-          <listitem>
-            <para> If <literal>qos_threshold_rr</literal> is set to 0, then QOS is always
-              used</para>
-          </listitem>
-          <listitem>
-            <para> If <literal>qos_threshold_rr</literal> is set to 100, then RR is always
-              used</para>
-          </listitem>
-          <listitem>
-            <para> The larger the <literal>qos_threshold_rr</literal> setting, the greater the
-              possibility that RR is used instead of QOS</para>
-          </listitem>
-        </itemizedlist>
-      </section>
+      <para>Free space is allocated using either a round-robin or a weighted algorithm. The
+        allocation method is determined by the amount of free-space imbalance on the OSTs. When free
+        space is relatively balanced across OSTs, the faster round-robin allocator is used, which
+        maximizes network balancing. The weighted allocator is used when any two OSTs are out of
+        balance by more than a specified threshold.</para>
+      <para>Free space distribution can be tuned using these two <literal>/proc</literal>
+        tunables:</para>
+      <itemizedlist>
+        <listitem>
+          <para><literal>qos_threshold_rr</literal> - The threshold at which the allocation method
+            switches from round-robin to weighted is set in this file. The default is to switch to
+            the weighted algorithm when any two OSTs are out of balance by more than 17
+            percent.</para>
+        </listitem>
+        <listitem>
+          <para><literal>qos_prio_free</literal> - The weighting priority used by the weighted
+            allocator can be adjusted in this file. Increasing the value of
+              <literal>qos_prio_free</literal> puts more weighting on the amount of free space
+            available on each OST and less on how stripes are distributed across OSTs. The default
+            value is 91 percent. When the free space priority is set to 100, weighting is based
+            entirely on free space and location is no longer used by the striping algorithm.</para>
+        </listitem>
+      </itemizedlist>
+      <para>For more information about managing free space and setting <literal>/proc</literal>
+        tunables, see <xref xmlns:xlink="http://www.w3.org/1999/xlink"
+          linkend="dbdoclet.50438209_10424"/>.</para>
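The switch between the two allocators can be pictured with a small model. This is only an illustrative sketch: the text does not say how Lustre measures imbalance, so the formula below (the gap between the most-free and least-free OST, as a percentage of the most-free) is an assumption, and `choose_allocator` is a hypothetical name rather than a Lustre function. Only the 17 percent default comes from the description of `qos_threshold_rr` above.

```python
def choose_allocator(free_space, threshold_pct=17):
    """Pick an allocator name from per-OST free space.

    free_space maps an OST name to its free space (any consistent unit).
    The imbalance formula is an assumption made for illustration.
    """
    most = max(free_space.values())
    least = min(free_space.values())
    if most == 0:
        return "round-robin"  # nothing free anywhere, so no imbalance to weigh
    imbalance_pct = 100.0 * (most - least) / most
    return "weighted" if imbalance_pct > threshold_pct else "round-robin"


print(choose_allocator({"OST0000": 100, "OST0001": 90}))  # 10% apart
print(choose_allocator({"OST0000": 100, "OST0001": 80}))  # 20% apart
```

Under these assumptions, the first call stays with the round-robin allocator and the second switches to the weighted allocator, because only the second pair of OSTs differs by more than the threshold.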
      </section>
    </section>
    <section xml:id="dbdoclet.50438271_78950">
-    <title><indexterm>
-        <primary>proc</primary>
-        <secondary>I/O tunables</secondary>
-      </indexterm>Lustre I/O Tunables</title>
-    <para>This section describes I/O tunables.</para>
+      <title><indexterm><primary>proc</primary><secondary>I/O tunables</secondary></indexterm>Lustre I/O Tunables</title>
+    <para>This section describes I/O tunables.</para>
      <para><literal> llite.<replaceable>fsname-instance</replaceable>/max_cache_mb</literal></para>
      <screen>client# lctl get_param llite.lustre-ce63ca00.max_cached_mb
  128</screen>
-    <para>This tunable is the maximum amount of inactive data cached by the client (default is 3/4
-      of RAM).</para>
+    <para>This tunable is the maximum amount of inactive data cached by the client (default is 3/4 of RAM).</para>
      <section remap="h3">
-      <title><indexterm>
-          <primary>proc</primary>
-          <secondary>RPC tunables</secondary>
-        </indexterm>Client I/O RPC Stream Tunables</title>
-      <para>The Lustre engine always attempts to pack an optimal amount of data into each I/O RPC
-        and attempts to keep a consistent number of issued RPCs in progress at a time. Lustre
-        exposes several tuning variables to adjust behavior according to network conditions and
-        cluster size. Each OSC has its own tree of these tunables. For example:</para>
+      <title><indexterm><primary>proc</primary><secondary>RPC tunables</secondary></indexterm>Client I/O RPC Stream Tunables</title>
+      <para>The Lustre engine always attempts to pack an optimal amount of data into each I/O RPC and attempts to keep a consistent number of issued RPCs in progress at a time. Lustre exposes several tuning variables to adjust behavior according to network conditions and cluster size. Each OSC has its own tree of these tunables. For example:</para>
        <screen>$ ls -d /proc/fs/lustre/osc/OSC_client_ost1_MNT_client_2 /localhost
  /proc/fs/lustre/osc/OSC_uml0_ost1_MNT_localhost
  /proc/fs/lustre/osc/OSC_uml0_ost2_MNT_localhost
@@ -693,131 +464,49 @@ $ ls /proc/fs/lustre/osc/OSC_uml0_ost1_MNT_localhost
  blocksizefilesfree max_dirty_mb ost_server_uuid stats</screen>
        <para>... and so on.</para>
        <para>RPC stream tunables are described below.</para>
-      <para>
-        <itemizedlist>
-          <listitem xml:id="lustreproc.maxdirtymb">
-            <para><literal>osc.<replaceable>osc_instance</replaceable>.max_dirty_mb</literal> - This
-              tunable controls how many MBs of dirty data can be written and queued up in the
-                <literal>OSC. POSIX</literal> file writes that are cached contribute to this count.
-              When the limit is reached, additional writes stall until previously-cached writes are
-              written to the server. This may be changed by writing a single ASCII integer to the
-              file. Only values between 0 and 2048 or 1/4 of RAM are allowable. If 0 is given, no
-              writes are cached. Performance suffers noticeably unless you use large writes (1 MB or
-              more).</para>
-          </listitem>
-          <listitem>
-            <para><literal>osc.<replaceable>osc_instance</replaceable>.cur_dirty_bytes</literal> -
-              This tunable is a read-only value that returns the current amount of bytes written and
-              cached on this OSC.</para>
-          </listitem>
-          <listitem>
-            <para><literal>osc.<replaceable>osc_instance</replaceable>.max_pages_per_rpc</literal> -
-              This tunable is the maximum number of pages that will undergo I/O in a single RPC to
-              the OST. The minimum is a single page and the maximum for this setting is 1024 (for
-              systems with 4kB <literal>PAGE_SIZE</literal>), with the default maximum of 1MB in the
-              RPC. It is also possible to specify a units suffix (e.g. <literal>4M</literal>), so
-              that the RPC size can be specified independently of the client
-                <literal>PAGE_SIZE</literal>.</para>
-          </listitem>
-          <listitem>
-            <para><literal>osc.<replaceable>osc_instance</replaceable>.max_rpcs_in_flight</literal>
-              - This tunable is the maximum number of concurrent RPCs in flight from an OSC to its
-              OST. If the OSC tries to initiate an RPC but finds that it already has the same number
-              of RPCs outstanding, it will wait to issue further RPCs until some complete. The
-              minimum setting is 1 and maximum setting is 256. If you are looking to improve small
-              file I/O performance, increase the <literal>max_rpcs_in_flight</literal> value.</para>
-          </listitem>
-        </itemizedlist>
-      </para>
-      <para>To maximize performance, the value for <literal>max_dirty_mb</literal> is recommended to
-        be 4 * <literal>max_pages_per_rpc</literal> * <literal>max_rpcs_in_flight</literal>.</para>
+      <para><literal> osc.<replaceable>osc_instance</replaceable>.max_dirty_mb </literal></para>
+      <para xml:id='lustreproc.maxdirtymb'>This tunable controls how many MBs of dirty data can be written and queued up in the OSC. POSIX file writes that are cached contribute to this count. When the limit is reached, additional writes stall until previously-cached writes are written to the server. This may be changed by writing a single ASCII integer to the file. Only values between 0 and 2048 or 1/4 of RAM are allowable. If 0 is given, no writes are cached. Performance suffers noticeably unless you use large writes (1 MB or more).</para>
+      <para><literal> osc.<replaceable>osc_instance</replaceable>.cur_dirty_bytes </literal></para>
+      <para>This tunable is a read-only value that returns the current number of bytes written and cached on this OSC.</para>
+      <para><literal> osc.<replaceable>osc_instance</replaceable>.max_pages_per_rpc </literal></para>
+      <para>This tunable is the maximum number of pages that will undergo I/O in a single RPC to the OST. The minimum is a single page and the maximum for this setting is 1024 (for systems with 4kB <literal>PAGE_SIZE</literal>), with the default maximum of 1MB in the RPC. It is also possible to specify a units suffix (e.g. <literal>4M</literal>), so that the RPC size can be specified independently of the client <literal>PAGE_SIZE</literal>.</para>
+      <para><literal> osc.<replaceable>osc_instance</replaceable>.max_rpcs_in_flight </literal></para>
+      <para>This tunable is the maximum number of concurrent RPCs in flight from an OSC to its OST. If the OSC tries to initiate an RPC but finds that it already has that many RPCs outstanding, it waits to issue further RPCs until some complete. The minimum setting is 1 and the maximum setting is 256. To improve small file I/O performance, increase the <literal>max_rpcs_in_flight</literal> value.</para>
+      <para>To maximize performance, the value for <literal>max_dirty_mb</literal> is recommended to be 4 * <literal>max_pages_per_rpc</literal> * <literal>max_rpcs_in_flight</literal>.</para>
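The sizing rule above can be worked through numerically. The following is a minimal sketch, not part of the Lustre tools: the function name is hypothetical, and it assumes that max_pages_per_rpc is converted to megabytes using a 4 KiB client page size before the multiplication.

```python
# Illustrative only: recommended_max_dirty_mb is a hypothetical helper,
# and the 4 KiB page size is an assumption about the client.
PAGE_SIZE = 4096  # bytes


def recommended_max_dirty_mb(max_pages_per_rpc, max_rpcs_in_flight,
                             page_size=PAGE_SIZE):
    """Return 4 * (RPC size) * (RPCs in flight), expressed in MiB."""
    rpc_bytes = max_pages_per_rpc * page_size
    dirty_bytes = 4 * rpc_bytes * max_rpcs_in_flight
    return dirty_bytes // (1024 * 1024)


# 256 pages of 4 KiB is a 1 MiB RPC; with 8 RPCs in flight the rule gives 32.
print(recommended_max_dirty_mb(256, 8))
```

The result is only a starting point; the limits on max_dirty_mb described above still apply.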
        <note>
-        <para>The <literal><replaceable>osc_instance</replaceable></literal> is typically
-              <literal><replaceable>fsname</replaceable>-OST<replaceable>ost_index</replaceable>-osc-<replaceable>mountpoint_instance</replaceable></literal>.
-          The <literal><replaceable>mountpoint_instance</replaceable></literal> is a unique value
-          per mount point to allow associating osc, mdc, lov, lmv, and llite parameters for the same
-          mount point. For <literal><replaceable>osc_instance</replaceable></literal> examples,
-          refer to the sample command output.</para>
+        <para>The <literal><replaceable>osc_instance</replaceable></literal> is typically <literal><replaceable>fsname</replaceable>-OST<replaceable>ost_index</replaceable>-osc-<replaceable>mountpoint_instance</replaceable></literal>. The <literal><replaceable>mountpoint_instance</replaceable></literal> is a unique value per mountpoint to allow associating osc, mdc, lov, lmv, and llite parameters for the same mountpoint. For <literal><replaceable>osc_instance</replaceable></literal> examples, refer to the sample command output.</para>
        </note>
      </section>
      <section remap="h3">
-      <title><indexterm>
-          <primary>proc</primary>
-          <secondary>watching RPC</secondary>
-        </indexterm>Watching the Client RPC Stream</title>
-      <para>The same directory contains an <literal>rpc_stats</literal> file with a histogram
-        showing the composition of previous RPCs. The histogram can be cleared by writing any value
-        into the <literal>rpc_stats</literal> file.</para>
-      <screen># cat /proc/fs/lustre/osc/testfs-OST0000-osc-c45f9c00/rpc_stats
-snapshot_time:                       1174867307.156604 (secs.usecs)
-read RPCs in flight:                 0
-write RPCs in flight:                0
-pending write pages:                 0
-pending read pages:                  0
-                read                                write
-pages per rpc   rpcs  %   cum   %    |   rpcs   %   cum     %
-1:              0     0   0          |   0          0       0
+      <title><indexterm><primary>proc</primary><secondary>watching RPC</secondary></indexterm>Watching the Client RPC Stream</title>
+      <para>The same directory contains an <literal>rpc_stats</literal> file with a histogram showing the composition of previous RPCs. The histogram can be cleared by writing any value into the <literal>rpc_stats</literal> file.</para>
+      <screen># cat /proc/fs/lustre/osc/spfs-OST0000-osc-c45f9c00/rpc_stats
+snapshot_time:                                     1174867307.156604 (secs.usecs)
+read RPCs in flight:                               0
+write RPCs in flight:                              0
+pending write pages:                               0
+pending read pages:                                0
+                   read                                    write
+pages per rpc              rpcs    %       cum     %       |       rpcs    %       cum     %
+1:                 0       0       0               |       0               0       0
   
-                read                                write
-rpcs in flight  rpcs  %   cum   %    |   rpcs   %   cum     %
-0:              0     0   0          |   0          0       0
+                   read                                    write
+rpcs in flight             rpcs    %       cum     %       |       rpcs    %       cum     %
+0:                 0       0       0               |       0               0       0
   
-                read                                write
-offset          rpcs  %   cum   %    |   rpcs   %   cum     %
-0:              0     0   0          |   0          0       0
-
-
-# cat /proc/fs/lustre/osc/testfs-OST0000-osc-ffff810058d2f800/rpc_stats
-snapshot_time:            1372786692.389858 (secs.usecs)
-read RPCs in flight:      0
-write RPCs in flight:     1
-dio read RPCs in flight:  0
-dio write RPCs in flight: 0
-pending write pages:      256
-pending read pages:       0
-
-                     read                   write
-pages per rpc   rpcs   % cum % |       rpcs   % cum %
-1:                 0   0   0   |          0   0   0
-2:                 0   0   0   |          1   0   0
-4:                 0   0   0   |          0   0   0
-8:                 0   0   0   |          0   0   0
-16:                0   0   0   |          0   0   0
-32:                0   0   0   |          2   0   0
-64:                0   0   0   |          2   0   0
-128:               0   0   0   |          5   0   0
-256:             850 100 100   |      18346  99 100
-
-                     read                   write
-rpcs in flight  rpcs   % cum % |       rpcs   % cum %
-0:               691  81  81   |       1740   9   9
-1:                48   5  86   |        938   5  14
-2:                29   3  90   |       1059   5  20
-3:                17   2  92   |       1052   5  26
-4:                13   1  93   |        920   5  31
-5:                12   1  95   |        425   2  33
-6:                10   1  96   |        389   2  35
-7:                30   3 100   |      11373  61  97
-8:                 0   0 100   |        460   2 100
-
-                     read                   write
-offset          rpcs   % cum % |       rpcs   % cum %
-0:               850 100 100   |      18347  99  99
-1:                 0   0 100   |          0   0  99
-2:                 0   0 100   |          0   0  99
-4:                 0   0 100   |          0   0  99
-8:                 0   0 100   |          0   0  99
-16:                0   0 100   |          1   0  99
-32:                0   0 100   |          1   0  99
-64:                0   0 100   |          3   0  99
-128:               0   0 100   |          4   0 100
-
+                   read                                    write
+offset                     rpcs    %       cum     %       |       rpcs    %       cum     %
+0:                 0       0       0               |       0               0       0
  </screen>
        <para>Where:</para>
        <informaltable frame="all">
          <tgroup cols="2">
-          <colspec colname="c1" colwidth="40*"/>
-          <colspec colname="c2" colwidth="60*"/>
+          <colspec colname="c1" colwidth="50*"/>
+          <colspec colname="c2" colwidth="50*"/>
            <thead>
              <row>
                <entry>
@@ -831,56 +520,40 @@ offset          rpcs   % cum % |       rpcs   % cum %
            <tbody>
              <row>
                <entry>
-                <para> {read,write} RPCs in flight</para>
+                <para> <emphasis role="bold">{read,write} RPCs in flight</emphasis></para>
                </entry>
                <entry>
-                <para>Number of read/write RPCs issued by the OSC, but not complete at the time of
-                  the snapshot. This value should always be less than or equal to
-                    <literal>max_rpcs_in_flight</literal>.</para>
+                <para>Number of read/write RPCs issued by the OSC, but not complete at the time of the snapshot. This value should always be less than or equal to max_rpcs_in_flight.</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para> pending {read,write} pages</para>
+                <para> <emphasis role="bold">pending {read,write} pages</emphasis></para>
                </entry>
                <entry>
-                <para>Number of pending read/write pages that have been queued for I/O in the
-                  OSC.</para>
+                <para>Number of pending read/write pages that have been queued for I/O in the OSC.</para>
                </entry>
              </row>
              <row>
-              <entry>dio {read,write} RPCs in flight</entry>
-              <entry>Direct I/O (as opposed to block I/O) read/write RPCs issued but not completed
-                at the time of the snapshot.</entry>
-            </row>
-            <row>
                <entry>
-                <para> pages per RPC</para>
+                <para> <emphasis role="bold">pages per RPC</emphasis></para>
                </entry>
                <entry>
-                <para>When an RPC is sent, the number of pages it consists of is recorded (in
-                  order). A single page RPC increments the <literal>0:</literal> row.</para>
+                <para>When an RPC is sent, the number of pages it consists of is recorded (in order). A single page RPC increments the 0: row.</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para> RPCs in flight</para>
+                <para> <emphasis role="bold">RPCs in flight</emphasis></para>
                </entry>
                <entry>
-                <para>When an RPC is sent, the number of other RPCs that are pending is recorded.
-                  When the first RPC is sent, the <literal>0:</literal> row is incremented. If the
-                  first RPC is sent while another is pending, the <literal>1:</literal> row is
-                  incremented and so on. As each RPC *completes*, the number of pending RPCs is not
-                  tabulated.</para>
-                <para>This table is a good way to visualize the concurrency of the RPC stream.
-                  Ideally, you will see a large clump around the
-                    <literal>max_rpcs_in_flight</literal> value, which shows that the network is
-                  being kept busy.</para>
+                <para>When an RPC is sent, the number of other RPCs that are pending is recorded. When the first RPC is sent, the 0: row is incremented. If the first RPC is sent while another is pending, the 1: row is incremented and so on. As each RPC *completes*, the number of pending RPCs is not tabulated.</para>
+                <para>This table is a good way to visualize the concurrency of the RPC stream. Ideally, you will see a large clump around the max_rpcs_in_flight value, which shows that the network is being kept busy.</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para> offset</para>
+                <para> <emphasis role="bold">offset</emphasis></para>
                </entry>
                <entry>
                  <para> </para>
@@ -889,31 +562,21 @@ offset          rpcs   % cum % |       rpcs   % cum %
            </tbody>
          </tgroup>
        </informaltable>
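The histograms described in the table above can be read programmatically. The following is a rough sketch that assumes each data row has the form "bucket: rpcs % cum | rpcs % cum"; the function name and the embedded sample rows are illustrative and are not part of any Lustre tool.

```python
import re

# A few rows in the rpc_stats layout, used here only as sample input.
SAMPLE = """\
                     read                   write
rpcs in flight  rpcs   % cum % |       rpcs   % cum %
0:               691  81  81   |       1740   9   9
1:                48   5  86   |        938   5  14
7:                30   3 100   |      11373  61  97
"""

# One histogram row: bucket, three read columns, a bar, three write columns.
ROW = re.compile(
    r"^(\d+):\s+(\d+)\s+(\d+)\s+(\d+)\s+\|\s+(\d+)\s+(\d+)\s+(\d+)\s*$")
HEADINGS = ("pages per rpc", "rpcs in flight", "offset")


def parse_rpc_stats(text):
    """Map histogram name to {bucket: {"read": (...), "write": (...)}}."""
    tables = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        for heading in HEADINGS:
            if stripped.startswith(heading):
                current = tables.setdefault(heading, {})
        match = ROW.match(stripped)
        if match and current is not None:
            bucket, *values = (int(group) for group in match.groups())
            current[bucket] = {"read": tuple(values[:3]),
                               "write": tuple(values[3:])}
    return tables


in_flight = parse_rpc_stats(SAMPLE)["rpcs in flight"]
# Bucket with the most write RPCs, i.e. where the write stream clumps.
busiest = max(in_flight, key=lambda b: in_flight[b]["write"][0])
print(busiest, in_flight[busiest]["write"])
```

Comparing the busiest bucket with the max_rpcs_in_flight setting is one way to check whether the RPC stream is clumping near the limit.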
-      <para>Each row in the table shows the number of reads or writes occurring for the statistic
-        (ios), the relative percentage of total reads or writes (%), and the cumulative percentage
-        to that point in the table for the statistic (cum %).</para>
      </section>
      <section remap="h3">
-      <title><indexterm>
-          <primary>proc</primary>
-          <secondary>read/write survey</secondary>
-        </indexterm>Client Read-Write Offset Survey</title>
-      <para>The <literal>offset_stats</literal> parameter maintains statistics for occurrences where
-        a series of read or write calls from a process did not access the next sequential location.
-        The offset field is reset to 0 (zero) whenever a different file is read/written.</para>
-      <para>Read/write offset statistics are off by default. The statistics can be activated by
-        writing anything into the <literal>offset_stats</literal> file.</para>
+        <title><indexterm><primary>proc</primary><secondary>read/write survey</secondary></indexterm>Client Read-Write Offset Survey</title>
+      <para>The offset_stats parameter maintains statistics for occurrences where a series of read or write calls from a process did not access the next sequential location. The offset field is reset to 0 (zero) whenever a different file is read/written.</para>
+      <para>Read/write offset statistics are off by default. The statistics can be activated by writing anything into the <literal>offset_stats</literal> file.</para>
        <para>Example:</para>
        <screen># cat /proc/fs/lustre/llite/lustre-f57dee00/rw_offset_stats
  snapshot_time: 1155748884.591028 (secs.usecs)
-             RANGE   RANGE    SMALLEST   LARGEST   
-R/W   PID    START   END      EXTENT     EXTENT    OFFSET
-R     8385   0       128      128        128       0
-R     8385   0       224      224        224       -128
-W     8385   0       250      50         100       0
-W     8385   100     1110     10         500       -150
-W     8384   0       5233     5233       5233      0
-R     8385   500     600      100        100       -610</screen>
+R/W                PID             RANGE START             RANGE END               SMALLEST EXTENT         LARGEST EXTENT                          OFFSET
+R          8385            0                       128                     128                     128                             0
+R          8385            0                       224                     224                     224                             -128
+W          8385            0                       250                     50                      100                             0
+W          8385            100                     1110                    10                      500                             -150
+W          8384            0                       5233                    5233                    5233                            0
+R          8385            500                     600                     100                     100                             -610</screen>
        <para>Where:</para>
        <informaltable frame="all">
          <tgroup cols="2">
@@ -932,7 +595,7 @@ R     8385   500     600      100        100       -610</screen>
            <tbody>
              <row>
                <entry>
-                <para> R/W</para>
+                <para> <literal> R/W </literal></para>
                </entry>
                <entry>
                  <para>Whether the non-sequential call was a read or write</para>
@@ -940,7 +603,7 @@ R     8385   500     600      100        100       -610</screen>
              </row>
              <row>
                <entry>
-                <para> PID </para>
+                <para> <literal> PID </literal></para>
                </entry>
                <entry>
                  <para>Process ID which made the read/write call.</para>
@@ -948,7 +611,7 @@ R     8385   500     600      100        100       -610</screen>
              </row>
              <row>
                <entry>
-                <para> Range Start/Range End</para>
+                <para> <literal> Range Start/Range End </literal></para>
                </entry>
                <entry>
                  <para>Range in which the read/write calls were sequential.</para>
@@ -956,7 +619,7 @@ R     8385   500     600      100        100       -610</screen>
              </row>
              <row>
                <entry>
-                <para> Smallest Extent </para>
+                <para> <literal> Smallest Extent </literal></para>
                </entry>
                <entry>
                  <para>Smallest extent (single read/write) in the corresponding range.</para>
@@ -964,7 +627,7 @@ R     8385   500     600      100        100       -610</screen>
              </row>
              <row>
                <entry>
-                <para> Largest Extent </para>
+                <para> <literal> Largest Extent </literal></para>
                </entry>
                <entry>
                  <para>Largest extent (single read/write) in the corresponding range.</para>
@@ -972,18 +635,13 @@ R     8385   500     600      100        100       -610</screen>
              </row>
              <row>
                <entry>
-                <para> Offset </para>
+                <para> <literal> Offset </literal></para>
                </entry>
                <entry>
-                <para>Difference between the previous range end and the current range start.</para>
-                <para>For example, Smallest-Extent indicates that the writes in the range 100 to
-                  1110 were sequential, with a minimum write of 10 and a maximum write of 500. This
-                  range was started with an offset of -150. That means this is the difference
-                  between the last entry&apos;s range-end and this entry&apos;s range-start for the
-                  same file.</para>
-                <para>The <literal>rw_offset_stats</literal> file can be cleared by writing to
-                  it:</para>
-                <para><literal>lctl set_param llite.*.rw_offset_stats=0</literal></para>
+                <para>Difference from the previous range end to the current range start.</para>
+                <para>For example, the fourth row of the sample output indicates that the writes in the range 100 to 1110 were sequential, with a smallest extent of 10 and a largest extent of 500. This range started with an offset of -150, which is the difference between the previous entry&apos;s range end and this entry&apos;s range start for the same file.</para>
+                <para>The <literal>rw_offset_stats</literal> file can be cleared by writing to it:</para>
+                <screen>lctl set_param llite.*.rw_offset_stats=0</screen>
                </entry>
              </row>
            </tbody>
@@ -991,38 +649,26 @@ R     8385   500     600      100        100       -610</screen>
        </informaltable>
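The Offset column can be reproduced from the range columns. The following is a small sketch of that arithmetic, assuming the ranges belong to the same process, file, and operation type; the function name is hypothetical.

```python
def range_offsets(ranges):
    """Given (start, end) pairs in order, return each range's offset.

    The offset is the current range start minus the previous range end.
    The first range has no predecessor, so its offset is reported as 0.
    """
    offsets = []
    previous_end = None
    for start, end in ranges:
        offsets.append(0 if previous_end is None else start - previous_end)
        previous_end = end
    return offsets


# Write ranges for PID 8385 in the sample output: 0 to 250, then 100 to 1110.
print(range_offsets([(0, 250), (100, 1110)]))  # prints [0, -150]
```

A negative offset means the process moved backwards in the file before starting the next sequential run.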
      </section>
      <section xml:id="lustreproc.clientstats" remap="h3">
-      <title><indexterm>
-          <primary>proc</primary>
-          <secondary>client stats</secondary>
-        </indexterm>Client Statistics </title>
-      <para>The <literal>stats</literal> parameter maintains statistics of activity across the VFS
-        interface of the Lustre file system. Only non-zero parameters are displayed in the file.
-        This section describes the statistics that accumulate during typical operation of a
-        client.</para>
-      <para>Client statistics are enabled by default. The statistics can be cleared by echoing an
-        empty string into the <literal>stats</literal> file or by using the command: <literal>lctl
-          set_param llite.*.stats=0</literal>. Statistics for an individual file system can be
-        displayed, for example, as shown below:</para>
+        <title><indexterm><primary>proc</primary><secondary>client stats</secondary></indexterm>Client stats</title>
+      <para>The stats parameter maintains statistics of activity across the VFS interface of the Lustre file system. Only non-zero parameters are displayed in the file. This section of the manual covers the statistics that will accumulate during typical operation of a client.</para>
+      <para>Client statistics are enabled by default. The statistics can be cleared by echoing an empty string into the <literal>stats</literal> file or with the command: <literal>lctl set_param llite.*.stats=0</literal>. Statistics for an individual file system can be displayed, for example:</para>
        <screen>client# lctl get_param llite.*.stats
-snapshot_time          1308343279.169704 secs.usecs
-dirty_pages_hits       14819716 samples [regs]
-dirty_pages_misses     81473472 samples [regs]
-read_bytes             36502963 samples [bytes] 1 26843582 55488794
-write_bytes            22985001 samples [bytes] 0 125912 3379002
-brw_read               2279 samples [pages] 1 1 2270
-ioctl                  186749 samples [regs]
-open                   3304805 samples [regs]
-close                  3331323 samples [regs]
-seek                   48222475 samples [regs]
-fsync                  963 samples [regs]
-truncate               9073 samples [regs]
-setxattr               19059 samples [regs]
-getxattr               61169 samples [regs]
+snapshot_time             1308343279.169704 secs.usecs
+dirty_pages_hits          14819716 samples [regs]
+dirty_pages_misses        81473472 samples [regs]
+read_bytes                36502963 samples [bytes] 1 26843582 55488794
+write_bytes               22985001 samples [bytes] 0 125912 3379002
+brw_read                  2279 samples [pages] 1 1 2270
+ioctl                     186749 samples [regs]
+open                      3304805 samples [regs]
+close                     3331323 samples [regs]
+seek                      48222475 samples [regs]
+fsync                     963 samples [regs]
+truncate                  9073 samples [regs]
+setxattr                  19059 samples [regs]
+getxattr                  61169 samples [regs]
  </screen>
-      <note>
-        <para>Statistics for all mounted file systems can be discovered by issuing the
-            <literal>lctl</literal> command <literal>lctl get_param llite.*.stats</literal></para>
-      </note>
+<note><para>Statistics for all mounted file systems can be discovered by issuing the lctl command: <literal>lctl get_param llite.*.stats</literal></para></note>
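Each counter line in the sample output above follows the pattern "name count samples [unit]", optionally followed by min, max, and sum. The following is a minimal parsing sketch under that assumption; the function name is illustrative and is not part of lctl.

```python
def parse_stats_line(line):
    """Parse one counter line from the stats output into a dict.

    Returns None for lines that are not counters, such as snapshot_time.
    """
    fields = line.split()
    if fields[2:3] != ["samples"]:
        return None
    entry = {
        "name": fields[0],
        "samples": int(fields[1]),
        "unit": fields[3].strip("[]") if len(fields) >= 4 else "",
    }
    # Byte and page counters carry three extra columns: min, max, sum.
    if len(fields) >= 7:
        entry["min"], entry["max"], entry["sum"] = (
            int(value) for value in fields[4:7])
    return entry


line = "write_bytes               22985001 samples [bytes] 0 125912 3379002"
print(parse_stats_line(line))
```

Counters reported in [regs] have no min, max, or sum columns, so the sketch returns only the name, sample count, and unit for them.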
        <informaltable frame="all">
          <tgroup cols="2">
            <colspec colname="c1" colwidth="3*"/>
@@ -1040,152 +686,114 @@ getxattr               61169 samples [regs]
            <tbody>
              <row>
                <entry>
-                <para>
-                  <literal>snapshot_time</literal></para>
+                <para> <literal>snapshot_time</literal></para>
+              </entry>
+              <entry>
+                <para>UNIX epoch time at which the stats file was read.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>dirty_pages_hits</literal></para>
                </entry>
                <entry>
-                <para>UNIX* epoch instant the stats file was read.</para>
+                <para>A count of the number of write operations that have been satisfied by the dirty page cache. See <xref linkend='lustreproc.maxdirtymb'/> for dirty cache behavior in Lustre.</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <literal>dirty_page_hits</literal></para>
+                <para> <literal>dirty_pages_misses</literal></para>
                </entry>
                <entry>
-                <para>A count of the number of write operations that have been satisfied by the
-                  dirty page cache. See <xref xmlns:xlink="http://www.w3.org/1999/xlink"
-                    linkend="lustreproc.maxdirtymb"/> for dirty cache behavior in a Lustre file
-                  system.</para>
+                <para>A count of the number of write operations that were not satisfied by the dirty page cache.</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <literal>dirty_page_misses</literal></para>
+                <para> <literal>read_bytes</literal></para>
                </entry>
                <entry>
-                <para>A count of the number of write operations that were not satisfied by the dirty
-                  page cache.</para>
+                  <para>A count of the number of read operations that have occurred (samples). Three additional parameters are given:</para>
+                  <variablelist>
+                      <varlistentry>
+                          <term>min</term>
+                          <listitem><para>The minimum number of bytes read in a single request since the counter was reset.</para>
+                          </listitem>
+                      </varlistentry>
+                      <varlistentry>
+                          <term>max</term>
+                          <listitem><para>The maximum number of bytes read in a single request since the counter was reset.</para>
+                          </listitem>
+                      </varlistentry>
+                      <varlistentry>
+                          <term>sum</term>
+                          <listitem><para>The accumulated sum of bytes of all read requests since the counter was reset.</para>
+                          </listitem>
+                      </varlistentry>
+                  </variablelist>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <literal>read_bytes</literal></para>
+                <para> <literal>write_bytes</literal></para>
                </entry>
                <entry>
-                <para>A count of the number of read operations that have occurred (samples). Three
-                  additional parameters are given:</para>
-                <variablelist>
-                  <varlistentry>
-                    <term>min</term>
-                    <listitem>
-                      <para>The minimum number of bytes read in a single request since the counter
-                        was reset.</para>
-                    </listitem>
-                  </varlistentry>
-                  <varlistentry>
-                    <term>max</term>
-                    <listitem>
-                      <para>The maximum number of bytes read in a single request since the counter
-                        was reset.</para>
-                    </listitem>
-                  </varlistentry>
-                  <varlistentry>
-                    <term>sum</term>
-                    <listitem>
-                      <para>The accumulated sum of bytes of all read requests since the counter was
-                        reset.</para>
-                    </listitem>
-                  </varlistentry>
-                </variablelist>
+                  <para>A count of the number of write operations that have occurred (samples). Three additional parameters are given:</para>
+                  <variablelist>
+                      <varlistentry>
+                          <term>min</term>
+                          <listitem><para>The minimum number of bytes written in a single request since the counter was reset.</para>
+                          </listitem>
+                      </varlistentry>
+                      <varlistentry>
+                          <term>max</term>
+                          <listitem><para>The maximum number of bytes written in a single request since the counter was reset.</para>
+                          </listitem>
+                      </varlistentry>
+                      <varlistentry>
+                          <term>sum</term>
+                          <listitem><para>The accumulated sum of bytes of all write requests since the counter was reset.</para>
+                          </listitem>
+                      </varlistentry>
+                  </variablelist>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <literal>write_bytes</literal></para>
+                <para> <literal>brw_read</literal></para>
                </entry>
                <entry>
-                <para>A count of the number of write operations that have occurred (samples). Three
-                  additional parameters are given:</para>
-                <variablelist>
-                  <varlistentry>
-                    <term>min</term>
-                    <listitem>
-                      <para>The minimum number of bytes written in a single request since the
-                        counter was reset.</para>
-                    </listitem>
-                  </varlistentry>
-                  <varlistentry>
-                    <term>max</term>
-                    <listitem>
-                      <para>The maximum number of bytes written in a single request since the
-                        counter was reset.</para>
-                    </listitem>
-                  </varlistentry>
-                  <varlistentry>
-                    <term>sum</term>
-                    <listitem>
-                      <para>The accumulated sum of bytes of all write requests since the counter was
-                        reset.</para>
-                    </listitem>
-                  </varlistentry>
-                </variablelist>
+                  <para>A count of the number of pages that have been read.</para> <warning><para><literal>brw_</literal> stats are only tallied when the lloop device driver is present. The lloop device is not currently supported.</para></warning><para>Three additional parameters are given:</para>
+                  <variablelist>
+                      <varlistentry>
+                          <term>min</term>
+                          <listitem><para>The minimum number of bytes read in a single brw read request since the counter was reset.</para>
+                          </listitem>
+                      </varlistentry>
+                      <varlistentry>
+                          <term>max</term>
+                          <listitem><para>The maximum number of bytes read in a single brw read request since the counter was reset.</para>
+                          </listitem>
+                      </varlistentry>
+                      <varlistentry>
+                          <term>sum</term>
+                          <listitem><para>The accumulated sum of bytes of all brw read requests since the counter was reset.</para>
+                          </listitem>
+                      </varlistentry>
+                  </variablelist>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <literal>brw_read</literal></para>
+                <para> <literal>ioctl</literal></para>
+              </entry>
+              <entry>
+                <para>A count of the combined file and directory ioctl operations.</para>
                </entry>
+            </row>
+            <row>
                <entry>
-                <para>A count of the number of pages that have been read.</para>
-                <warning>
-                  <para><literal>brw_</literal> stats are only tallied when the lloop device driver
-                    is present. lloop device is not currently supported.</para>
-                </warning>
-                <para>Three additional parameters are given:</para>
-                <variablelist>
-                  <varlistentry>
-                    <term>min</term>
-                    <listitem>
-                      <para>The minimum number of bytes read in a single brw read requests since the
-                        counter was reset.</para>
-                    </listitem>
-                  </varlistentry>
-                  <varlistentry>
-                    <term>max</term>
-                    <listitem>
-                      <para>The maximum number of bytes read in a single brw read requests since the
-                        counter was reset.</para>
-                    </listitem>
-                  </varlistentry>
-                  <varlistentry>
-                    <term>sum</term>
-                    <listitem>
-                      <para>The accumulated sum of bytes of all brw read requests since the counter
-                        was reset.</para>
-                    </listitem>
-                  </varlistentry>
-                </variablelist>
-              </entry>
-            </row>
-            <row>
-              <entry>
-                <para>
-                  <literal>ioctl</literal></para>
-              </entry>
-              <entry>
-                <para>A count of the number of the combined file and directory ioctl
-                  operations.</para>
-              </entry>
-            </row>
-            <row>
-              <entry>
-                <para>
-                  <literal>open</literal></para>
+                <para> <literal>open</literal></para>
                </entry>
                <entry>
                  <para>A count of the number of open operations that have succeeded.</para>
@@ -1193,8 +801,7 @@ getxattr               61169 samples [regs]
              </row>
              <row>
                <entry>
-                <para>
-                  <literal>close</literal></para>
+                <para> <literal>close</literal></para>
                </entry>
                <entry>
                  <para>A count of the number of close operations that have succeeded.</para>
@@ -1202,51 +809,42 @@ getxattr               61169 samples [regs]
              </row>
              <row>
                <entry>
-                <para>
-                  <literal>seek</literal></para>
+                <para> <literal>seek</literal></para>
                </entry>
                <entry>
-                <para>A count of the number of times <literal>seek</literal> has been called.</para>
+                  <para>A count of the number of times <literal>seek</literal> has been called.</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <literal>fsync</literal></para>
+                <para> <literal>fsync</literal></para>
                </entry>
                <entry>
-                <para>A count of the number of times <literal>fsync</literal> has been
-                  called.</para>
+                  <para>A count of the number of times <literal>fsync</literal> has been called.</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <literal>truncate</literal></para>
+                <para> <literal>truncate</literal></para>
                </entry>
                <entry>
-                <para>A count of the total number of calls to both locked and lockless
-                  truncate.</para>
+                <para>A count of the total number of calls to both locked and lockless truncate.</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <literal>setxattr</literal></para>
+                <para> <literal>setxattr</literal></para>
                </entry>
                <entry>
-                <para>A count of the number of times <literal>ll_setxattr</literal> has been
-                  called.</para>
+                  <para>A count of the number of times <literal>ll_setxattr</literal> has been called.</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <literal>getxattr</literal></para>
+                <para> <literal>getxattr</literal></para>
                </entry>
                <entry>
-                <para>A count of the number of times <literal>ll_getxattr</literal> has been
-                  called.</para>
+                  <para>A count of the number of times <literal>ll_getxattr</literal> has been called.</para>
                </entry>
              </row>
            </tbody>
@@ -1254,182 +852,96 @@ getxattr               61169 samples [regs]
        </informaltable>
      </section>
      <section remap="h3">
-      <title><indexterm>
-          <primary>proc</primary>
-          <secondary>read/write survey</secondary>
-        </indexterm>Client Read-Write Extents Survey</title>
+        <title><indexterm><primary>proc</primary><secondary>read/write survey</secondary></indexterm>Client Read-Write Extents Survey</title>
        <para><emphasis role="bold">Client-Based I/O Extent Size Survey</emphasis></para>
-      <para>The <literal>rw_extent_stats</literal> histogram in the <literal>llite</literal>
-        directory shows you the statistics for the sizes of the read-write I/O extents. This file
-        does not maintain the per-process statistics.</para>
+      <para>The <literal>extents_stats</literal> histogram in the <literal>llite</literal> directory shows statistics for the sizes of the read-write I/O extents. This file does not maintain per-process statistics.</para>
        <para>Example:</para>
        <screen>client# lctl get_param llite.testfs-*.extents_stats
  snapshot_time:                     1213828728.348516 (secs.usecs)
-                       read           |            write
-extents          calls  %      cum%   |     calls  %     cum%
+                           read            |               write
+extents            calls   %       cum%    |       calls   %       cum%
   
-0K - 4K :        0      0      0      |     2      2     2
-4K - 8K :        0      0      0      |     0      0     2
-8K - 16K :       0      0      0      |     0      0     2
-16K - 32K :      0      0      0      |     20     23    26
-32K - 64K :      0      0      0      |     0      0     26
-64K - 128K :     0      0      0      |     51     60    86
-128K - 256K :    0      0      0      |     0      0     86
-256K - 512K :    0      0      0      |     0      0     86
-512K - 1024K :   0      0      0      |     0      0     86
-1M - 2M :        0      0      0      |     11     13    100</screen>
+0K - 4K :          0       0       0       |       2       2       2
+4K - 8K :          0       0       0       |       0       0       2
+8K - 16K :         0       0       0       |       0       0       2
+16K - 32K :        0       0       0       |       20      23      26
+32K - 64K :        0       0       0       |       0       0       26
+64K - 128K :       0       0       0       |       51      60      86
+128K - 256K :      0       0       0       |       0       0       86
+256K - 512K :      0       0       0       |       0       0       86
+512K - 1024K :     0       0       0       |       0       0       86
+1M - 2M :          0       0       0       |       11      13      100</screen>
        <para>The file can be cleared by issuing the following command:</para>
        <screen>client# lctl set_param llite.testfs-*.extents_stats=0</screen>
        <para><emphasis role="bold">Per-Process Client I/O Statistics</emphasis></para>
-      <para>The <literal>extents_stats_per_process</literal> file maintains the I/O extent size
-        statistics on a per-process basis. So you can track the per-process statistics for the last
-          <literal>MAX_PER_PROCESS_HIST</literal> processes.</para>
+      <para>The <literal>extents_stats_per_process</literal> file maintains the I/O extent size statistics on a per-process basis. This lets you track per-process statistics for the last <literal>MAX_PER_PROCESS_HIST</literal> processes.</para>
        <para>Example:</para>
        <screen>lctl get_param llite.testfs-*.extents_stats_per_process
  snapshot_time:                     1213828762.204440 (secs.usecs)
-                          read            |             write
-extents            calls   %      cum%    |      calls   %       cum%
+                           read            |               write
+extents            calls   %        cum%   |       calls   %       cum%
   
  PID: 11488
-   0K - 4K :       0       0       0      |      0       0       0
-   4K - 8K :       0       0       0      |      0       0       0
-   8K - 16K :      0       0       0      |      0       0       0
-   16K - 32K :     0       0       0      |      0       0       0
-   32K - 64K :     0       0       0      |      0       0       0
-   64K - 128K :    0       0       0      |      0       0       0
-   128K - 256K :   0       0       0      |      0       0       0
-   256K - 512K :   0       0       0      |      0       0       0
-   512K - 1024K :  0       0       0      |      0       0       0
-   1M - 2M :       0       0       0      |      10      100     100
+   0K - 4K :       0       0        0      |       0       0       0
+   4K - 8K :       0       0        0      |       0       0       0
+   8K - 16K :      0       0        0      |       0       0       0
+   16K - 32K :     0       0        0      |       0       0       0
+   32K - 64K :     0       0        0      |       0       0       0
+   64K - 128K :    0       0        0      |       0       0       0
+   128K - 256K :   0       0        0      |       0       0       0
+   256K - 512K :   0       0        0      |       0       0       0
+   512K - 1024K :  0       0        0      |       0       0       0
+   1M - 2M :       0       0        0      |       10      100     100
   
  PID: 11491
-   0K - 4K :       0       0       0      |      0       0       0
-   4K - 8K :       0       0       0      |      0       0       0
-   8K - 16K :      0       0       0      |      0       0       0
-   16K - 32K :     0       0       0      |      20      100     100
+   0K - 4K :       0       0        0      |       0       0       0
+   4K - 8K :       0       0        0      |       0       0       0
+   8K - 16K :      0       0        0      |       0       0       0
+   16K - 32K :     0       0        0      |       20      100     100
     
  PID: 11424
-   0K - 4K :       0       0       0      |      0       0       0
-   4K - 8K :       0       0       0      |      0       0       0
-   8K - 16K :      0       0       0      |      0       0       0
-   16K - 32K :     0       0       0      |      0       0       0
-   32K - 64K :     0       0       0      |      0       0       0
-   64K - 128K :    0       0       0      |      16      100     100
+   0K - 4K :       0       0        0      |       0       0       0
+   4K - 8K :       0       0        0      |       0       0       0
+   8K - 16K :      0       0        0      |       0       0       0
+   16K - 32K :     0       0        0      |       0       0       0
+   32K - 64K :     0       0        0      |       0       0       0
+   64K - 128K :    0       0        0      |       16      100     100
   
  PID: 11426
-   0K - 4K :       0       0       0      |      1       100     100
+   0K - 4K :       0       0        0      |       1       100     100
   
  PID: 11429
-   0K - 4K :       0       0       0      |      1       100     100
+   0K - 4K :       0       0        0      |       1       100     100
   
  </screen>
-      <para>Each row in the table shows the number of reads or writes occurring for the statistic
-        (ios), the relative percentage of total reads or writes (%), and the cumulative percentage
-        to that point in the table for the statistic (cum %).</para>
      </section>
      <section xml:id="dbdoclet.50438271_55057">
-      <title><indexterm>
-          <primary>proc</primary>
-          <secondary>block I/O</secondary>
-        </indexterm>Watching the OST Block I/O Stream</title>
-      <para>Similarly, a <literal>brw_stats</literal> histogram in the obdfilter directory shows the
-        statistics for number of I/O requests sent to the disk, their size, and whether they are
-        contiguous on the disk or not.</para>
+        <title><indexterm><primary>proc</primary><secondary>block I/O</secondary></indexterm>Watching the OST Block I/O Stream</title>
+      <para>Similarly, the <literal>brw_stats</literal> histogram in the obdfilter directory shows statistics for the number of I/O requests sent to the disk, their size, and whether they are contiguous on the disk.</para>
        <screen>oss# lctl get_param obdfilter.testfs-OST0000.brw_stats 
  snapshot_time:                     1174875636.764630 (secs:usecs)
-                   read                         write
-pages per brw      brws    %      cum %   |     rpcs    %      cum %
-1:                 0       0      0       |     0       0      0
-                   read                         write
-discont pages      rpcs    %      cum %   |     rpcs    %      cum %
-1:                 0       0      0       |     0       0      0
-                   read                         write
-discont blocks     rpcs    %      cum %   |     rpcs    %      cum %
-1:                 0       0      0       |     0       0      0
-                   read                         write
-dio frags          rpcs    %      cum %   |     rpcs    %      cum %
-1:                 0       0      0       |     0       0      0
-                   read                         write
-disk ios in flight rpcs    %      cum %   |     rpcs    %      cum %
-1:                 0       0      0       |     0       0      0
-                   read                         write
-io time (1/1000s)  rpcs    %      cum %   |     rpcs    %      cum %
-1:                 0       0      0       |     0       0      0
-                   read                         write
-disk io size       rpcs    %      cum %   |     rpcs    %      cum %
-1:                 0       0      0       |     0       0      0
-                   read                         write
-
-# cat ./obdfilter/testfs-OST0000/brw_stats
-snapshot_time:         1372775039.769045 (secs.usecs)
-
-                           read      |      write
-pages per bulk r/w     rpcs  % cum % |  rpcs   % cum %
-1:                     108 100 100   |    39   0   0
-2:                       0   0 100   |     6   0   0
-4:                       0   0 100   |     1   0   0
-8:                       0   0 100   |     0   0   0
-16:                      0   0 100   |     4   0   0
-32:                      0   0 100   |    17   0   0
-64:                      0   0 100   |    12   0   0
-128:                     0   0 100   |    24   0   0
-256:                     0   0 100   | 23142  99 100
-
-                           read      |      write
-discontiguous pages    rpcs  % cum % |  rpcs   % cum %
-0:                     108 100 100   | 23245 100 100
-
-                           read      |      write
-discontiguous blocks   rpcs  % cum % |  rpcs   % cum %
-0:                     108 100 100   | 23243  99  99
-1:                       0   0 100   |     2   0 100
-
-                           read      |      write
-disk fragmented I/Os   ios   % cum % |   ios   % cum %
-0:                      94  87  87   |     0   0   0
-1:                      14  12 100   | 23243  99  99
-2:                       0   0 100   |     2   0 100
-
-                           read      |      write
-disk I/Os in flight    ios   % cum % |   ios   % cum %
-1:                      14 100 100   | 20896  89  89
-2:                       0   0 100   |  1071   4  94
-3:                       0   0 100   |   573   2  96
-4:                       0   0 100   |   300   1  98
-5:                       0   0 100   |   166   0  98
-6:                       0   0 100   |   108   0  99
-7:                       0   0 100   |    81   0  99
-8:                       0   0 100   |    47   0  99
-9:                       0   0 100   |     5   0 100
-
-                           read      |      write
-I/O time (1/1000s)     ios   % cum % |   ios   % cum %
-1:                      94  87  87   |     0   0   0
-2:                       0   0  87   |     7   0   0
-4:                      14  12 100   |    27   0   0
-8:                       0   0 100   |    14   0   0
-16:                      0   0 100   |    31   0   0
-32:                      0   0 100   |    38   0   0
-64:                      0   0 100   | 18979  81  82
-128:                     0   0 100   |   943   4  86
-256:                     0   0 100   |  1233   5  91
-512:                     0   0 100   |  1825   7  99
-1K:                      0   0 100   |   99   0  99
-2K:                      0   0 100   |     0   0  99
-4K:                      0   0 100   |     0   0  99
-8K:                      0   0 100   |    49   0 100
-
-                           read      |      write
-disk I/O size          ios   % cum % |   ios   % cum %
-4K:                     14 100 100   |    41   0   0
-8K:                      0   0 100   |     6   0   0
-16K:                     0   0 100   |     1   0   0
-32K:                     0   0 100   |     0   0   0
-64K:                     0   0 100   |     4   0   0
-128K:                    0   0 100   |    17   0   0
-256K:                    0   0 100   |    12   0   0
-512K:                    0   0 100   |    24   0   0
-1M:                      0   0 100   | 23142  99 100
+                           read                            write
+pages per brw      brws    %       cum %   |       rpcs    %       cum %
+1:                 0       0       0       |       0       0       0
+                           read                            write
+discont pages      rpcs    %       cum %   |       rpcs    %       cum %
+1:                 0       0       0       |       0       0       0
+                           read                            write
+discont blocks     rpcs    %       cum %   |       rpcs    %       cum %
+1:                 0       0       0       |       0       0       0
+                           read                            write
+dio frags          rpcs    %       cum %   |       rpcs    %       cum %
+1:                 0       0       0       |       0       0       0
+                           read                            write
+disk ios in flight rpcs    %       cum %   |       rpcs    %       cum %
+1:                 0       0       0       |       0       0       0
+                           read                            write
+io time (1/1000s)  rpcs    %       cum %   |       rpcs    %       cum %
+1:                 0       0       0       |       0       0       0
+                           read                            write
+disk io size       rpcs    %       cum %   |       rpcs    %       cum %
+1:                 0       0       0       |       0       0       0
+                           read                            write
  </screen>
        <para>The fields are explained below:</para>
        <informaltable frame="all">
@@ -1449,72 +961,31 @@ disk I/O size          ios   % cum % |   ios   % cum %
            <tbody>
              <row>
                <entry>
-                <para>
-                  <literal>pages per bulk r/w</literal></para>
-              </entry>
-              <entry>
-                <para>Number of pages per RPC request, which should match aggregate client
-                    <literal>rpc_stats</literal>.</para>
-              </entry>
-            </row>
-            <row>
-              <entry>
-                <para>
-                  <literal>discontiguous pages</literal></para>
+                <para><literal>pages per brw</literal></para>
                </entry>
                <entry>
-                <para>Number of discontinuities in the logical file offset of each page in a single
-                  RPC.</para>
+                <para>Number of pages per RPC request, which should match aggregate client <literal>rpc_stats</literal>.</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <literal>discontiguous blocks</literal></para>
+                <para><literal>discont pages</literal></para>
                </entry>
                <entry>
-                <para>Number of discontinuities in the physical block allocation in the file system
-                  for a single RPC.</para>
+                <para>Number of discontinuities in the logical file offset of each page in a single RPC.</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para><literal>disk fragmented I/Os</literal></para>
+                <para><literal>discont blocks</literal></para>
                </entry>
                <entry>
-                <para>Number of I/Os that were not written entirely sequentially.</para>
-              </entry>
-            </row>
-            <row>
-              <entry>
-                <para><literal>disk I/Os in flight</literal></para>
-              </entry>
-              <entry>
-                <para>Number of disk I/Os currently pending.</para>
-              </entry>
-            </row>
-            <row>
-              <entry>
-                <para><literal>I/O time (1/1000s)</literal></para>
-              </entry>
-              <entry>
-                <para>Amount of time for each I/O operation to complete.</para>
-              </entry>
-            </row>
-            <row>
-              <entry>
-                <para><literal>disk I/O size</literal></para>
-              </entry>
-              <entry>
-                <para>Size of each I/O operation.</para>
+                <para>Number of discontinuities in the physical block allocation in the file system for a single RPC.</para>
                </entry>
              </row>
            </tbody>
          </tgroup>
        </informaltable>
-      <para>Each row in the table shows the number of reads or writes occurring for the statistic
-        (ios), the relative percentage of total reads or writes (%), and the cumulative percentage
-        to that point in the table for the statistic (cum %).</para>
        <para>For each Lustre service, the following information is provided:</para>
        <itemizedlist>
          <listitem>
@@ -1538,62 +1009,31 @@ disk I/O size          ios   % cum % |   ios   % cum %
        </itemizedlist>
      </section>
      <section remap="h3">
-      <title><indexterm>
-          <primary>proc</primary>
-          <secondary>readahead</secondary>
-        </indexterm>Using File Readahead and Directory Statahead</title>
-      <para>Lustre 1.6.5.1 introduced file readahead and directory statahead functionality that read
-        data into memory in anticipation of a process actually requesting the data. File readahead
-        functionality reads file content data into memory. Directory statahead functionality reads
-        metadata into memory. When readahead and/or statahead work well, a data-consuming process
-        finds that the information it needs is available when requested, and it is unnecessary to
-        wait for network I/O.</para>
-      <para>Since Lustre 2.2.0, the directory statahead feature has been improved to enhance
-        directory traversal performance. The improvements have concentrated on two main
-        issues:</para>
+      <title><indexterm><primary>proc</primary><secondary>readahead</secondary></indexterm>Using File Readahead and Directory Statahead</title>
+      <para>Lustre 1.6.5.1 introduced file readahead and directory statahead functionality that reads data into memory in anticipation of a process actually requesting the data. File readahead functionality reads file content data into memory. Directory statahead functionality reads metadata into memory. When readahead and/or statahead work well, a data-consuming process finds that the information it needs is available when requested, and it is unnecessary to wait for network I/O.</para>
+      <para>Since Lustre 2.2.0, the directory statahead feature has been improved to enhance directory traversal performance. The improvements have concentrated on two main issues:</para>
        <orderedlist>
          <listitem>
-          <para>A race condition between statahead thread and other VFS operations while processing
-            asynchronous getattr RPC replies.</para>
+          <para>A race condition between the statahead thread and other VFS operations while processing asynchronous getattr RPC replies.</para>
          </listitem>
          <listitem>
-          <para>There is no file size/block attributes pre-fetching and the traversing thread has to
-            send synchronous glimpse size RPCs to OST(s).</para>
+          <para>The lack of pre-fetching for file size/block attributes, which forced the traversing thread to send synchronous glimpse size RPCs to the OST(s).</para>
          </listitem>
        </orderedlist>
-      <para>The first issue is resolved by using statahead local dcache, and the second one is
-        resolved by using asynchronous glimpse lock (AGL) RPCs for pre-fetching file size/block
-        attributes from OST(s).</para>
+      <para>The first issue is resolved by using a statahead local dcache, and the second is resolved by using asynchronous glimpse lock (AGL) RPCs to pre-fetch file size/block attributes from the OST(s).</para>
        <section remap="h4">
          <title>Tuning File Readahead</title>
-        <para>File readahead is triggered when two or more sequential reads by an application fail
-          to be satisfied by the Linux buffer cache. The size of the initial readahead is 1 MB.
-          Additional readaheads grow linearly, and increment until the readahead cache on the client
-          is full at 40 MB.</para>
-        <para><literal> llite.<replaceable>fsname-instance</replaceable>.max_read_ahead_mb
-          </literal></para>
-        <para>This tunable controls the maximum amount of data readahead on a file. Files are read
-          ahead in RPC-sized chunks (1 MB or the size of read() call, if larger) after the second
-          sequential read on a file descriptor. Random reads are done at the size of the read() call
-          only (no readahead). Reads to non-contiguous regions of the file reset the readahead
-          algorithm, and readahead is not triggered again until there are sequential reads again. To
-          disable readahead, set this tunable to 0. The default value is 40 MB.</para>
-        <para><literal> llite.<replaceable>fsname-instance</replaceable>.max_read_ahead_whole_mb
-          </literal></para>
-        <para>This tunable controls the maximum size of a file that is read in its entirety,
-          regardless of the size of the <literal>read()</literal>.</para>
+        <para>File readahead is triggered when two or more sequential reads by an application fail to be satisfied by the Linux buffer cache. The size of the initial readahead is 1 MB. Subsequent readaheads grow linearly until the readahead cache on the client is full at 40 MB.</para>
+        <para><literal> llite.<replaceable>fsname-instance</replaceable>.max_read_ahead_mb </literal></para>
+        <para>This tunable controls the maximum amount of data read ahead on a file. Files are read ahead in RPC-sized chunks (1 MB or the size of the read() call, if larger) after the second sequential read on a file descriptor. Random reads are done at the size of the read() call only (no readahead). Reads to non-contiguous regions of the file reset the readahead algorithm, and readahead is not triggered again until sequential reads resume. To disable readahead, set this tunable to 0. The default value is 40 MB.</para>
+        <para><literal> llite.<replaceable>fsname-instance</replaceable>.max_read_ahead_whole_mb </literal></para>
+        <para>This tunable controls the maximum size of a file that is read in its entirety, regardless of the size of the <literal>read()</literal>.</para>
        </section>
        <section remap="h4">
          <title>Tuning Directory Statahead and AGL</title>
-        <para>Many system commands, like <literal>ls –l</literal>, <literal>du</literal>,
-            <literal>find</literal>, etc., will traverse directory sequentially. To make these
-          commands run efficiently, the directory statahead and AGL (asynchronous glimpse lock) can
-          be enabled to improve the performance of traversing.</para>
+        <para>Many system commands, such as <literal>ls -l</literal>, <literal>du</literal>, and <literal>find</literal>, traverse a directory sequentially. To make these commands run efficiently, directory statahead and AGL (asynchronous glimpse lock) can be enabled to improve traversal performance.</para>
          <para><literal> /proc/fs/lustre/llite/*/statahead_max </literal></para>
-        <para>This proc interface controls whether directory statahead is enabled and the maximum
-          statahead windows size (which means how many files can be pre-fetched by the statahead
-          thread). By default, statahead is enabled and the value of
-            <literal>statahead_max</literal> is 32.</para>
+        <para>This proc interface controls whether directory statahead is enabled and sets the maximum statahead window size (that is, how many files can be pre-fetched by the statahead thread). By default, statahead is enabled and the value of <literal>statahead_max</literal> is 32.</para>
          <para>To disable statahead, run:</para>
          <screen>lctl set_param llite.*.statahead_max=0</screen>
          <para>To set the maximum statahead window size (n), run:</para>
@@ -1603,33 +1043,22 @@ disk I/O size          ios   % cum % |   ios   % cum %
          <screen>lctl set_param llite.*.statahead_agl=n</screen>
          <para>If &quot;n&quot; is 0, then the AGL is disabled, else the AGL is enabled.</para>
          <para><literal> /proc/fs/lustre/llite/*/statahead_stats </literal></para>
-        <para>This is a read-only interface that indicates the current statahead and AGL
-          status.</para>
+        <para>This is a read-only interface that indicates the current statahead and AGL status.</para>
          <note>
-          <para>The AGL is affected by statahead because the inodes processed by AGL are built by
-            the statahead thread, which means the statahead thread is the input of AGL pipeline. So
-            if statahead is disabled, then the AGL is disabled by force.</para>
+          <para>The AGL is affected by statahead because the inodes processed by AGL are built by the statahead thread, which means the statahead thread is the input of the AGL pipeline. If statahead is disabled, the AGL is therefore forcibly disabled as well.</para>
          </note>
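+        <para>As a minimal sketch, assuming a client with a mounted Lustre file system and that the <literal>statahead_stats</literal> entry listed above can be read through <literal>lctl get_param</literal> in the same way as the other <literal>llite</literal> entries, the current settings and status can be checked together (output is not shown because it varies by client):</para>
+        <screen>$ lctl get_param llite.*.statahead_max llite.*.statahead_agl
+$ lctl get_param llite.*.statahead_stats</screen>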
        </section>
      </section>
      <section remap="h3">
-      <title><indexterm>
-          <primary>proc</primary>
-          <secondary>read cache</secondary>
-        </indexterm>OSS Read Cache</title>
-      <para>The OSS read cache feature provides read-only caching of data on an OSS. This
-        functionality uses the regular Linux page cache to store the data. Just like caching from a
-        regular filesystem in Linux, OSS read cache uses as much physical memory as is
-        allocated.</para>
+      <title><indexterm><primary>proc</primary><secondary>read cache</secondary></indexterm>OSS Read Cache</title>
+      <para>The OSS read cache feature provides read-only caching of data on an OSS. This functionality uses the regular Linux page cache to store the data. Just like caching from a regular filesystem in Linux, OSS read cache uses as much physical memory as is allocated.</para>
        <para>OSS read cache improves Lustre performance in these situations:</para>
        <itemizedlist>
          <listitem>
-          <para>Many clients are accessing the same data set (as in HPC applications and when
-            diskless clients boot from Lustre)</para>
+          <para>Many clients are accessing the same data set (as in HPC applications and when diskless clients boot from Lustre)</para>
          </listitem>
          <listitem>
-          <para>One client is storing data while another client is reading it (essentially
-            exchanging data via the OST)</para>
+          <para>One client is storing data while another client is reading it (essentially exchanging data via the OST)</para>
          </listitem>
          <listitem>
            <para>A client has very limited caching of its own</para>
@@ -1649,29 +1078,15 @@ disk I/O size          ios   % cum % |   ios   % cum %
        </itemizedlist>
        <section remap="h4">
          <title>Using OSS Read Cache</title>
-        <para>OSS read cache is implemented on the OSS, and does not require any special support on
-          the client side. Since OSS read cache uses the memory available in the Linux page cache,
-          you should use I/O patterns to determine the appropriate amount of memory for the cache;
-          if the data is mostly reads, then more cache is required than for writes.</para>
+        <para>OSS read cache is implemented on the OSS, and does not require any special support on the client side. Since OSS read cache uses the memory available in the Linux page cache, you should use I/O patterns to determine the appropriate amount of memory for the cache; if the data is mostly reads, then more cache is required than for writes.</para>
          <para>OSS read cache is enabled, by default, and managed by the following tunables:</para>
          <itemizedlist>
            <listitem>
-            <para><literal>read_cache_enable</literal> controls whether data read from disk during a
-              read request is kept in memory and available for later read requests for the same
-              data, without having to re-read it from disk. By default, read cache is enabled
-                (<literal>read_cache_enable = 1</literal>).</para>
+            <para><literal>read_cache_enable</literal>  controls whether data read from disk during a read request is kept in memory and available for later read requests for the same data, without having to re-read it from disk. By default, read cache is enabled (<literal>read_cache_enable = 1</literal>).</para>
            </listitem>
          </itemizedlist>
-        <para>When the OSS receives a read request from a client, it reads data from disk into its
-          memory and sends the data as a reply to the requests. If read cache is enabled, this data
-          stays in memory after the client&apos;s request is finished, and the OSS skips reading
-          data from disk when subsequent read requests for the same are received. The read cache is
-          managed by the Linux kernel globally across all OSTs on that OSS, and the least recently
-          used cache pages will be dropped from memory when the amount of free memory is running
-          low.</para>
-        <para>If read cache is disabled (<literal>read_cache_enable = 0</literal>), then the OSS
-          will discard the data after the client&apos;s read requests are serviced and, for
-          subsequent read requests, the OSS must read the data from disk.</para>
+        <para>When the OSS receives a read request from a client, it reads data from disk into its memory and sends the data as a reply to the request. If read cache is enabled, this data stays in memory after the client&apos;s request is finished, and the OSS skips reading data from disk when subsequent read requests for the same data are received. The read cache is managed by the Linux kernel globally across all OSTs on that OSS, and the least recently used cache pages are dropped from memory when the amount of free memory is running low.</para>
+        <para>If read cache is disabled (<literal>read_cache_enable = 0</literal>), then the OSS will discard the data after the client&apos;s read requests are serviced and, for subsequent read requests, the OSS must read the data from disk.</para>
          <para>To disable read cache on all OSTs of an OSS, run:</para>
          <screen>root@oss1# lctl set_param obdfilter.*.read_cache_enable=0</screen>
          <para>To re-enable read cache on one OST, run:</para>
@@ -1680,153 +1095,87 @@ disk I/O size          ios   % cum % |   ios   % cum %
          <screen>root@oss1# lctl get_param obdfilter.*.read_cache_enable</screen>
          <itemizedlist>
            <listitem>
-            <para><literal>writethrough_cache_enable</literal> controls whether data sent to the OSS
-              as a write request is kept in the read cache and available for later reads, or if it
-              is discarded from cache when the write is completed. By default, writethrough cache is
-              enabled (<literal>writethrough_cache_enable = 1</literal>).</para>
+            <para><literal>writethrough_cache_enable</literal>  controls whether data sent to the OSS as a write request is kept in the read cache and available for later reads, or if it is discarded from cache when the write is completed. By default, writethrough cache is enabled (<literal>writethrough_cache_enable = 1</literal>).</para>
            </listitem>
          </itemizedlist>
-        <para>When the OSS receives write requests from a client, it receives data from the client
-          into its memory and writes the data to disk. If writethrough cache is enabled, this data
-          stays in memory after the write request is completed, allowing the OSS to skip reading
-          this data from disk if a later read request, or partial-page write request, for the same
-          data is received.</para>
-        <para>If writethrough cache is disabled (<literal>writethrough_cache_enabled = 0</literal>),
-          then the OSS discards the data after the client&apos;s write request is completed, and for
-          subsequent read request, or partial-page write request, the OSS must re-read the data from
-          disk.</para>
-        <para>Enabling writethrough cache is advisable if clients are doing small or unaligned
-          writes that would cause partial-page updates, or if the files written by one node are
-          immediately being accessed by other nodes. Some examples where this might be useful
-          include producer-consumer I/O models or shared-file writes with a different node doing I/O
-          not aligned on 4096-byte boundaries. Disabling writethrough cache is advisable in the case
-          where files are mostly written to the file system but are not re-read within a short time
-          period, or files are only written and re-read by the same node, regardless of whether the
-          I/O is aligned or not.</para>
+        <para>When the OSS receives write requests from a client, it receives data from the client into its memory and writes the data to disk. If writethrough cache is enabled, this data stays in memory after the write request is completed, allowing the OSS to skip reading this data from disk if a later read request, or partial-page write request, for the same data is received.</para>
+        <para>If writethrough cache is disabled (<literal>writethrough_cache_enable = 0</literal>), then the OSS discards the data after the client&apos;s write request is completed, and for subsequent read requests or partial-page write requests, the OSS must re-read the data from disk.</para>
+        <para>Enabling writethrough cache is advisable if clients are doing small or unaligned writes that would cause partial-page updates, or if the files written by one node are immediately being accessed by other nodes. Some examples where this might be useful include producer-consumer I/O models or shared-file writes with a different node doing I/O not aligned on 4096-byte boundaries. Disabling writethrough cache is advisable in the case where files are mostly written to the file system but are not re-read within a short time period, or files are only written and re-read by the same node, regardless of whether the I/O is aligned or not.</para>
          <para>To disable writethrough cache on all OSTs of an OSS, run:</para>
          <screen>root@oss1# lctl set_param obdfilter.*.writethrough_cache_enable=0</screen>
          <para>To re-enable writethrough cache on one OST, run:</para>
-        <screen>root@oss1# lctl set_param obdfilter.{OST_name}.writethrough_cache_enable=1</screen>
+        <screen>root@oss1# lctl set_param \
+obdfilter.{OST_name}.writethrough_cache_enable=1</screen>
         <para>To check if writethrough cache is enabled on all OSTs of an OSS, run:</para>
         <screen>root@oss1# lctl get_param obdfilter.*.writethrough_cache_enable</screen>
          <itemizedlist>
            <listitem>
-            <para><literal>readcache_max_filesize</literal> controls the maximum size of a file that
-              both the read cache and writethrough cache will try to keep in memory. Files larger
-              than <literal>readcache_max_filesize</literal> will not be kept in cache for either
-              reads or writes.</para>
+            <para><literal>readcache_max_filesize</literal>  controls the maximum size of a file that both the read cache and writethrough cache will try to keep in memory. Files larger than <literal>readcache_max_filesize</literal> will not be kept in cache for either reads or writes.</para>
            </listitem>
          </itemizedlist>
-        <para>This can be very useful for workloads where relatively small files are repeatedly
-          accessed by many clients, such as job startup files, executables, log files, etc., but
-          large files are read or written only once. By not putting the larger files into the cache,
-          it is much more likely that more of the smaller files will remain in cache for a longer
-          time.</para>
-        <para>When setting <literal>readcache_max_filesize</literal>, the input value can be
-          specified in bytes, or can have a suffix to indicate other binary units such as <emphasis
-            role="bold">K</emphasis>ilobytes, <emphasis role="bold">M</emphasis>egabytes, <emphasis
-            role="bold">G</emphasis>igabytes, <emphasis role="bold">T</emphasis>erabytes, or
-            <emphasis role="bold">P</emphasis>etabytes.</para>
+        <para>This can be very useful for workloads where relatively small files are repeatedly accessed by many clients, such as job startup files, executables, log files, etc., but large files are read or written only once. By not putting the larger files into the cache, it is much more likely that more of the smaller files will remain in cache for a longer time.</para>
+        <para>When setting <literal>readcache_max_filesize</literal>, the input value can be specified in bytes, or can have a suffix to indicate other binary units such as <emphasis role="bold">K</emphasis>ilobytes, <emphasis role="bold">M</emphasis>egabytes, <emphasis role="bold">G</emphasis>igabytes, <emphasis role="bold">T</emphasis>erabytes, or <emphasis role="bold">P</emphasis>etabytes.</para>
          <para>To limit the maximum cached file size to 32MB on all OSTs of an OSS, run:</para>
          <screen>root@oss1# lctl set_param obdfilter.*.readcache_max_filesize=32M</screen>
          <para>To disable the maximum cached file size on an OST, run:</para>
-        <screen>root@oss1# lctl set_param obdfilter.{OST_name}.readcache_max_filesize=-1</screen>
+        <screen>root@oss1# lctl set_param \
+obdfilter.{OST_name}.readcache_max_filesize=-1</screen>
          <para>To check the current maximum cached file size on all OSTs of an OSS, run:</para>
          <screen>root@oss1# lctl get_param obdfilter.*.readcache_max_filesize</screen>
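+        <para>The three tunables above can also be inspected in one pass. The following is a sketch only, assuming that <literal>lctl get_param</literal> is given several parameter patterns in a single invocation. The output depends on the OSTs configured on the OSS and is not shown:</para>
+        <screen>root@oss1# lctl get_param obdfilter.*.read_cache_enable \
+obdfilter.*.writethrough_cache_enable \
+obdfilter.*.readcache_max_filesize</screen>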
        </section>
      </section>
      <section remap="h3">
-      <title><indexterm>
-          <primary>proc</primary>
-          <secondary>OSS journal</secondary>
-        </indexterm>OSS Asynchronous Journal Commit</title>
-      <para>The OSS asynchronous journal commit feature synchronously writes data to disk without
-        forcing a journal flush. This reduces the number of seeks and significantly improves
-        performance on some hardware.</para>
+      <title><indexterm><primary>proc</primary><secondary>OSS journal</secondary></indexterm>OSS Asynchronous Journal Commit</title>
+      <para>The OSS asynchronous journal commit feature asynchronously writes data to disk without forcing a journal flush. This reduces the number of seeks and significantly improves performance on some hardware.</para>
        <note>
-        <para>Asynchronous journal commit cannot work with O_DIRECT writes, a journal flush is still
-          forced.</para>
+        <para>Asynchronous journal commit cannot work with O_DIRECT writes; for these writes, a journal flush is still forced.</para>
        </note>
-      <para>When asynchronous journal commit is enabled, client nodes keep data in the page cache (a
-        page reference). Lustre clients monitor the last committed transaction number (transno) in
-        messages sent from the OSS to the clients. When a client sees that the last committed
-        transno reported by the OSS is at least the bulk write transno, it releases the reference on
-        the corresponding pages. To avoid page references being held for too long on clients after a
-        bulk write, a 7 second ping request is scheduled (jbd commit time is 5 seconds) after the
-        bulk write reply is received, so the OSS has an opportunity to report the last committed
-        transno.</para>
-      <para>If the OSS crashes before the journal commit occurs, then the intermediate data is lost.
-        However, new OSS recovery functionality (introduced in the asynchronous journal commit
-        feature), causes clients to replay their write requests and compensate for the missing disk
-        updates by restoring the state of the file system.</para>
-      <para>To enable asynchronous journal commit, set the <literal>sync_journal parameter</literal>
-        to zero (<literal>sync_journal=0</literal>):</para>
+      <para>When asynchronous journal commit is enabled, client nodes keep data in the page cache (a page reference). Lustre clients monitor the last committed transaction number (transno) in messages sent from the OSS to the clients. When a client sees that the last committed transno reported by the OSS is at least the bulk write transno, it releases the reference on the corresponding pages. To avoid page references being held for too long on clients after a bulk write, a 7 second ping request is scheduled (jbd commit time is 5 seconds) after the bulk write reply is received, so the OSS has an opportunity to report the last committed transno.</para>
+      <para>If the OSS crashes before the journal commit occurs, then the intermediate data is lost. However, OSS recovery functionality introduced with the asynchronous journal commit feature causes clients to replay their write requests and compensate for the missing disk updates by restoring the state of the file system.</para>
+      <para>To enable asynchronous journal commit, set the <literal>sync_journal</literal> parameter to zero (<literal>sync_journal=0</literal>):</para>
        <screen>$ lctl set_param obdfilter.*.sync_journal=0 
  obdfilter.lol-OST0001.sync_journal=0</screen>
-      <para>By default, <literal>sync_journal</literal> is disabled
-          (<literal>sync_journal=1</literal>), which forces a journal flush after every bulk
-        write.</para>
-      <para>When asynchronous journal commit is used, clients keep a page reference until the
-        journal transaction commits. This can cause problems when a client receives a blocking
-        callback, because pages need to be removed from the page cache, but they cannot be removed
-        because of the extra page reference.</para>
-      <para>This problem is solved by forcing a journal flush on lock cancellation. When this
-        happens, the client is granted the metadata blocks that have hit the disk, and it can safely
-        release the page reference before processing the blocking callback. The parameter which
-        controls this action is <literal>sync_on_lock_cancel</literal>, which can be set to the
-        following values:</para>
+      <para>By default, asynchronous journal commit is disabled (<literal>sync_journal=1</literal>), which forces a journal flush after every bulk write.</para>
+      <para>When asynchronous journal commit is used, clients keep a page reference until the journal transaction commits. This can cause problems when a client receives a blocking callback, because pages need to be removed from the page cache, but they cannot be removed because of the extra page reference.</para>
+      <para>This problem is solved by forcing a journal flush on lock cancellation. When this happens, the client is granted the metadata blocks that have hit the disk, and it can safely release the page reference before processing the blocking callback. The parameter which controls this action is <literal>sync_on_lock_cancel</literal>, which can be set to the following values:</para>
        <itemizedlist>
          <listitem>
            <para><literal>always</literal>: Always force a journal flush on lock cancellation</para>
          </listitem>
          <listitem>
-          <para><literal>blocking</literal>: Force a journal flush only when the local cancellation
-            is due to a blocking callback</para>
+          <para><literal>blocking</literal>: Force a journal flush only when the local cancellation is due to a blocking callback</para>
          </listitem>
          <listitem>
            <para><literal>never</literal>: Do not force any journal flush</para>
          </listitem>
        </itemizedlist>
-      <para>Here is an example of <literal>sync_on_lock_cancel</literal> being set not to force a
-        journal flush:</para>
+      <para>Here is an example of <literal>sync_on_lock_cancel</literal> being set not to force a journal flush:</para>
        <screen>$ lctl get_param obdfilter.*.sync_on_lock_cancel
  obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>
-      <para>By default, <literal>sync_on_lock_cancel</literal> is set to never, because asynchronous
-        journal commit is disabled by default.</para>
-      <para>When asynchronous journal commit is enabled (<literal>sync_journal=0</literal>),
-          <literal>sync_on_lock_cancel</literal> is automatically set to always, if it was
-        previously set to never.</para>
-      <para>Similarly, when asynchronous journal commit is disabled,
-          (<literal>sync_journal=1</literal>), <literal>sync_on_lock_cancel</literal> is enforced to
-        never.</para>
+      <para>By default, <literal>sync_on_lock_cancel</literal> is set to never, because asynchronous journal commit is disabled by default.</para>
+      <para>When asynchronous journal commit is enabled (<literal>sync_journal=0</literal>), <literal>sync_on_lock_cancel</literal> is automatically set to always, if it was previously set to never.</para>
+      <para>Similarly, when asynchronous journal commit is disabled (<literal>sync_journal=1</literal>), <literal>sync_on_lock_cancel</literal> is forced to never.</para>
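+      <para>As an illustrative sequence built only from the parameters described above, asynchronous journal commit could be enabled, the resulting lock-cancel policy checked, and the policy then narrowed to blocking callbacks. Whether <literal>blocking</literal> is an appropriate choice is a site-specific assumption, and the output is omitted because it varies by OST:</para>
+      <screen>$ lctl set_param obdfilter.*.sync_journal=0
+$ lctl get_param obdfilter.*.sync_on_lock_cancel
+$ lctl set_param obdfilter.*.sync_on_lock_cancel=blocking</screen>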
      </section>
      <section remap="h3">
-      <title><indexterm>
-          <primary>proc</primary>
-          <secondary>mballoc history</secondary>
-        </indexterm><literal>mballoc</literal> History</title>
+      <title><indexterm><primary>proc</primary><secondary>mballoc history</secondary></indexterm><literal>mballoc</literal> History</title>
        <para><literal> /proc/fs/ldiskfs/sda/mb_history </literal></para>
-      <para>Multi-Block-Allocate (<literal>mballoc</literal>), enables Lustre to ask
-          <literal>ldiskfs</literal> to allocate multiple blocks with a single request to the block
-        allocator. Typically, an <literal>ldiskfs</literal> file system allocates only one block per
-        time. Each <literal>mballoc</literal>-enabled partition has this file. This is sample
-        output:</para>
-      <screen>pid  inode  goal       result      found grps cr   merge tail broken
-2838 139267 17/12288/1 17/12288/1  1     0    0    M     1    8192
-2838 139267 17/12289/1 17/12289/1  1     0    0    M     0    0
-2838 139267 17/12290/1 17/12290/1  1     0    0    M     1    2
-2838 24577  3/12288/1  3/12288/1   1     0    0    M     1    8192
-2838 24578  3/12288/1  3/771/1     1     1    1          0    0
-2838 32769  4/12288/1  4/12288/1   1     0    0    M     1    8192
-2838 32770  4/12288/1  4/12289/1   13    1    1          0    0
-2838 32771  4/12288/1  5/771/1     26    2    1          0    0
-2838 32772  4/12288/1  5/896/1     31    2    1          1    128
-2838 32773  4/12288/1  5/897/1     31    2    1          0    0
-2828 32774  4/12288/1  5/898/1     31    2    1          1    2
-2838 32775  4/12288/1  5/899/1     31    2    1          0    0
-2838 32776  4/12288/1  5/900/1     31    2    1          1    4
-2838 32777  4/12288/1  5/901/1     31    2    1          0    0
-2838 32778  4/12288/1  5/902/1     31    2    1          1    2</screen>
+      <para>Multi-Block-Allocate (<literal>mballoc</literal>) enables Lustre to ask <literal>ldiskfs</literal> to allocate multiple blocks with a single request to the block allocator. Typically, an <literal>ldiskfs</literal> file system allocates only one block at a time. Each <literal>mballoc</literal>-enabled partition has this file. This is sample output:</para>
+      <screen>pid  inode  goal       result      found grps cr   merge tail broken
+2838 139267 17/12288/1 17/12288/1  1     0    0    M     1    8192
+2838 139267 17/12289/1 17/12289/1  1     0    0    M     0    0
+2838 139267 17/12290/1 17/12290/1  1     0    0    M     1    2
+2838 24577  3/12288/1  3/12288/1   1     0    0    M     1    8192
+2838 24578  3/12288/1  3/771/1     1     1    1          0    0
+2838 32769  4/12288/1  4/12288/1   1     0    0    M     1    8192
+2838 32770  4/12288/1  4/12289/1   13    1    1          0    0
+2838 32771  4/12288/1  5/771/1     26    2    1          0    0
+2838 32772  4/12288/1  5/896/1     31    2    1          1    128
+2838 32773  4/12288/1  5/897/1     31    2    1          0    0
+2828 32774  4/12288/1  5/898/1     31    2    1          1    2
+2838 32775  4/12288/1  5/899/1     31    2    1          0    0
+2838 32776  4/12288/1  5/900/1     31    2    1          1    4
+2838 32777  4/12288/1  5/901/1     31    2    1          0    0
+2838 32778  4/12288/1  5/902/1     31    2    1          1    2</screen>
        <para>The parameters are described below:</para>
        <informaltable frame="all">
          <tgroup cols="2">
@@ -1845,8 +1194,7 @@ obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>
            <tbody>
              <row>
                <entry>
-                <para>
-                  <emphasis role="bold">
+                <para> <emphasis role="bold">
                      <literal>pid</literal>
                    </emphasis></para>
                </entry>
@@ -1856,8 +1204,7 @@ obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>
              </row>
              <row>
                <entry>
-                <para>
-                  <emphasis role="bold">
+                <para> <emphasis role="bold">
                      <literal>inode</literal>
                    </emphasis></para>
                </entry>
@@ -1867,20 +1214,17 @@ obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>
              </row>
              <row>
                <entry>
-                <para>
-                  <emphasis role="bold">
+                <para> <emphasis role="bold">
                      <literal>goal</literal>
                    </emphasis></para>
                </entry>
                <entry>
-                <para>Initial request that came to <literal>mballoc</literal>
-                  (group/block-in-group/number-of-blocks)</para>
+                <para>Initial request that came to <literal>mballoc</literal> (group/block-in-group/number-of-blocks)</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <emphasis role="bold">
+                <para> <emphasis role="bold">
                      <literal>result</literal>
                    </emphasis></para>
                </entry>
@@ -1890,52 +1234,41 @@ obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>
              </row>
              <row>
                <entry>
-                <para>
-                  <emphasis role="bold">
+                <para> <emphasis role="bold">
                      <literal>found</literal>
                    </emphasis></para>
                </entry>
                <entry>
-                <para>Number of free chunks <literal>mballoc</literal> found and measured before the
-                  final decision.</para>
+                <para>Number of free chunks <literal>mballoc</literal> found and measured before the final decision.</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <emphasis role="bold">
+                <para> <emphasis role="bold">
                      <literal>grps</literal>
                    </emphasis></para>
                </entry>
                <entry>
-                <para>Number of groups <literal>mballoc</literal> scanned to satisfy the
-                  request.</para>
+                <para>Number of groups <literal>mballoc</literal> scanned to satisfy the request.</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <emphasis role="bold">
+                <para> <emphasis role="bold">
                      <literal>cr</literal>
                    </emphasis></para>
                </entry>
                <entry>
                  <para>Stage at which <literal>mballoc</literal> found the result:</para>
-                <para><emphasis role="bold">0</emphasis> - best in terms of resource allocation. The
-                  request was 1MB or larger and was satisfied directly via the kernel buddy
-                  allocator.</para>
-                <para><emphasis role="bold">1</emphasis> - regular stage (good at resource
-                  consumption)</para>
-                <para><emphasis role="bold">2</emphasis> - fs is quite fragmented (not that bad at
-                  resource consumption)</para>
-                <para><emphasis role="bold">3</emphasis> - fs is very fragmented (worst at resource
-                  consumption)</para>
+                <para><emphasis role="bold">0</emphasis> - best in terms of resource allocation. The request was 1MB or larger and was satisfied directly via the kernel buddy allocator.</para>
+                <para><emphasis role="bold">1</emphasis> - regular stage (good at resource consumption)</para>
+                <para><emphasis role="bold">2</emphasis> - fs is quite fragmented (not that bad at resource consumption)</para>
+                <para><emphasis role="bold">3</emphasis> - fs is very fragmented (worst at resource consumption)</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <emphasis role="bold">
+                <para> <emphasis role="bold">
                      <literal>queue</literal>
                    </emphasis></para>
                </entry>
@@ -1945,33 +1278,27 @@ obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>
              </row>
              <row>
                <entry>
-                <para>
-                  <emphasis role="bold">
+                <para> <emphasis role="bold">
                      <literal>merge</literal>
                    </emphasis></para>
                </entry>
                <entry>
-                <para>Whether the request hit the goal. This is good as extents code can now merge
-                  new blocks to existing extent, eliminating the need for extents tree
-                  growth.</para>
+                <para>Whether the request hit the goal. This is good because the extents code can then merge new blocks into an existing extent, eliminating the need for extents tree growth.</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <emphasis role="bold">
+                <para> <emphasis role="bold">
                      <literal>tail</literal>
                    </emphasis></para>
                </entry>
                <entry>
-                <para>Number of blocks left free after the allocation breaks large free
-                  chunks.</para>
+                <para>Number of blocks left free after the allocation breaks large free chunks.</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <emphasis role="bold">
+                <para> <emphasis role="bold">
                      <literal>broken</literal>
                    </emphasis></para>
                </entry>
@@ -1982,33 +1309,19 @@ obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>
            </tbody>
          </tgroup>
        </informaltable>
-      <para>Most users are probably interested in found/cr. If cr is 0 1 and found is less than 100,
-        then <literal>mballoc</literal> is doing quite well.</para>
-      <para>Also, number-of-blocks-in-request (third number in the goal triple) can tell the number
-        of blocks requested by the <literal>obdfilter</literal>. If the <literal>obdfilter</literal>
-        is doing a lot of small requests (just few blocks), then either the client is processing
-        input/output to a lot of small files, or something may be wrong with the client (because it
-        is better if client sends large input/output requests). This can be investigated with the
-        OSC <literal>rpc_stats</literal> or OST <literal>brw_stats</literal> mentioned above.</para>
-      <para>Number of groups scanned (<literal>grps</literal> column) should be small. If it reaches
-        a few dozen often, then either your disk file system is pretty fragmented or
-          <literal>mballoc</literal> is doing something wrong in the group selection part.</para>
+      <para>Most users are probably interested in found/cr. If cr is 0 or 1 and found is less than 100, then <literal>mballoc</literal> is doing quite well.</para>
+      <para>Also, number-of-blocks-in-request (the third number in the goal triple) indicates the number of blocks requested by the <literal>obdfilter</literal>. If the <literal>obdfilter</literal> is making a lot of small requests (just a few blocks each), then either the client is processing input/output to a lot of small files, or something may be wrong with the client (because it is better if the client sends large input/output requests). This can be investigated with the OSC <literal>rpc_stats</literal> or OST <literal>brw_stats</literal> mentioned above.</para>
+      <para>The number of groups scanned (<literal>grps</literal> column) should be small. If it often reaches a few dozen, then either your disk file system is badly fragmented or <literal>mballoc</literal> is doing something wrong in the group selection part.</para>
      </section>
      <section remap="h3">
-      <title><indexterm>
-          <primary>proc</primary>
-          <secondary>mballoc tunables</secondary>
-        </indexterm><literal>mballoc</literal> Tunables</title>
-      <para>Lustre ldiskfs includes a multi-block allocation for ldiskfs to improve the efficiency
-        of space allocation in the OST storage. Multi-block allocation adds the following
-        features:</para>
+      <title><indexterm><primary>proc</primary><secondary>mballoc tunables</secondary></indexterm><literal>mballoc</literal> Tunables</title>
+      <para>Lustre ldiskfs includes multi-block allocation to improve the efficiency of space allocation in the OST storage. Multi-block allocation adds the following features:</para>
        <itemizedlist>
          <listitem>
            <para> Pre-allocation for single files (helps to resist fragmentation)</para>
          </listitem>
          <listitem>
-          <para> Pre-allocation for a group of files (helps to pack small files into large,
-            contiguous chunks)</para>
+          <para> Pre-allocation for a group of files (helps to pack small files into large, contiguous chunks)</para>
          </listitem>
          <listitem>
            <para> Stream allocation (helps to decrease the seek rate)</para>
@@ -2032,75 +1345,59 @@ obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>
            <tbody>
              <row>
                <entry>
-                <para>
-                  <literal>mb_max_to_scan</literal></para>
+                <para> <literal>mb_max_to_scan</literal></para>
                </entry>
                <entry>
-                <para>Maximum number of free chunks that <literal>mballoc</literal> finds before a
-                  final decision to avoid livelock.</para>
+                <para>Maximum number of free chunks that <literal>mballoc</literal> examines before making a final decision; the limit avoids livelock.</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <literal>mb_min_to_scan</literal></para>
+                <para> <literal>mb_min_to_scan</literal></para>
                </entry>
                <entry>
-                <para>Minimum number of free chunks that <literal>mballoc</literal> searches before
-                  picking the best chunk for allocation. This is useful for a very small request, to
-                  resist fragmentation of big free chunks.</para>
+                <para>Minimum number of free chunks that <literal>mballoc</literal> searches before picking the best chunk for allocation. This is useful for a very small request, to resist fragmentation of big free chunks.</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <literal>mb_order2_req</literal></para>
+                <para> <literal>mb_order2_req</literal></para>
                </entry>
                <entry>
-                <para>For requests equal to 2^N (where N &gt;= <literal>order2_req</literal>), a
-                  very fast search via buddy structures is used.</para>
+                <para>For requests equal to 2^N (where N &gt;= <literal>order2_req</literal>), a very fast search via buddy structures is used.</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <literal>mb_small_req</literal></para>
+                <para> <literal>mb_small_req</literal></para>
                </entry>
                <entry morerows="1">
                  <para>All requests are divided into 3 categories:</para>
                  <para>&lt; small_req (packed together to form large, aggregated requests)</para>
                  <para>&lt; large_req (allocated mostly linearly)</para>
                  <para>&gt; large_req (very large requests so the arm seek does not matter)</para>
-                <para>The idea is that we try to pack small requests to form large requests, and
-                  then place all large requests (including compound from the small ones) close to
-                  one another, causing as few arm seeks as possible.</para>
+                <para>The idea is to pack small requests together to form large requests, and then place all large requests (including those compounded from small ones) close to one another, causing as few arm seeks as possible.</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <literal>mb_large_req</literal></para>
+                <para> <literal>mb_large_req</literal></para>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <literal>mb_prealloc_table</literal></para>
+                <para> <literal>mb_prealloc_table</literal></para>
                </entry>
                <entry>
-                <para>The amount of space to preallocate depends on the current file size. The idea
-                  is that for small files we do not need 1 MB preallocations and for large files, 1
-                  MB preallocations are not large enough; it is better to preallocate 4 MB.</para>
+                <para>The amount of space to preallocate depends on the current file size. The idea is that for small files we do not need 1 MB preallocations and for large files, 1 MB preallocations are not large enough; it is better to preallocate 4 MB.</para>
                </entry>
              </row>
              <row>
                <entry>
-                <para>
-                  <literal>mb_group_prealloc</literal></para>
+                <para> <literal>mb_group_prealloc</literal></para>
                </entry>
                <entry>
-                <para>The amount of space (in kilobytes) preallocated for groups of small
-                  requests.</para>
+                <para>The amount of space (in kilobytes) preallocated for groups of small requests.</para>
                </entry>
              </row>
            </tbody>
@@ -2108,44 +1405,23 @@ obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>
        </informaltable>
      </section>
      <section remap="h3">
-      <title><indexterm>
-          <primary>proc</primary>
-          <secondary>locking</secondary>
-        </indexterm>Locking</title>
-      <para><literal> ldlm.namespaces.<replaceable>osc_name|mdc_name</replaceable>.lru_size
-        </literal></para>
-      <para>The <literal>lru_size</literal> parameter is used to control the number of client-side
-        locks in an LRU queue. LRU size is dynamic, based on load. This optimizes the number of
-        locks available to nodes that have different workloads (e.g., login/build nodes vs. compute
-        nodes vs. backup nodes).</para>
-      <para>The total number of locks available is a function of the server&apos;s RAM. The default
-        limit is 50 locks/1 MB of RAM. If there is too much memory pressure, then the LRU size is
-        shrunk. The number of locks on the server is limited to
-          <replaceable>targets_on_server</replaceable> * <replaceable>client_count</replaceable> *
-          <replaceable>client_lru_size</replaceable>.</para>
+      <title><indexterm><primary>proc</primary><secondary>locking</secondary></indexterm>Locking</title>
+      <para><literal> ldlm.namespaces.<replaceable>osc_name|mdc_name</replaceable>.lru_size </literal></para>
+      <para>The <literal>lru_size</literal> parameter is used to control the number of client-side locks in an LRU queue. LRU size is dynamic, based on load. This optimizes the number of locks available to nodes that have different workloads (e.g., login/build nodes vs. compute nodes vs. backup nodes).</para>
+      <para>The total number of locks available is a function of the server&apos;s RAM. The default limit is 50 locks/1 MB of RAM. If there is too much memory pressure, then the LRU size is shrunk. The number of locks on the server is limited to <replaceable>targets_on_server</replaceable> * <replaceable>client_count</replaceable> * <replaceable>client_lru_size</replaceable>.</para>
        <itemizedlist>
          <listitem>
-          <para>To enable automatic LRU sizing, set the <literal>lru_size</literal> parameter to 0.
-            In this case, the <literal>lru_size</literal> parameter shows the current number of
-            locks being used on the export. LRU sizing is enabled by default starting with Lustre
-            1.6.5.1.</para>
+          <para>To enable automatic LRU sizing, set the <literal>lru_size</literal> parameter to 0. In this case, the <literal>lru_size</literal> parameter shows the current number of locks being used on the export.  LRU sizing is enabled by default starting with Lustre 1.6.5.1.</para>
          </listitem>
          <listitem>
-          <para>To specify a maximum number of locks, set the lru_size parameter to a value other
-            than 0 (former numbers are okay, 100 * <replaceable>core_count</replaceable>). We
-            recommend that you only increase the LRU size on a few login nodes where users access
-            the file system interactively.</para>
+          <para>To specify a maximum number of locks, set the <literal>lru_size</literal> parameter to a value other than 0 (the formerly used value, 100 * <replaceable>core_count</replaceable>, is acceptable). We recommend that you only increase the LRU size on a few login nodes where users access the file system interactively.</para>
          </listitem>
        </itemizedlist>
-      <para>To clear the LRU on a single client, and as a result flush client cache, without
-        changing the <literal>lru_size</literal> value:</para>
+      <para>To clear the LRU on a single client, and as a result flush client cache, without changing the <literal>lru_size</literal> value:</para>
        <screen>$ lctl set_param ldlm.namespaces.<replaceable>osc_name|mdc_name</replaceable>.lru_size=clear</screen>
-      <para>If you shrink the LRU size below the number of existing unused locks, then the unused
-        locks are canceled immediately. Use echo clear to cancel all locks without changing the
-        value.</para>
+      <para>If you shrink the LRU size below the number of existing unused locks, then the unused locks are canceled immediately. Set the parameter to <literal>clear</literal> to cancel all locks without changing the value.</para>
        <note>
-        <para>Currently, the lru_size parameter can only be set temporarily with <literal>lctl
-            set_param</literal>; it cannot be set permanently.</para>
+        <para>Currently, the lru_size parameter can only be set temporarily with <literal>lctl set_param</literal>; it cannot be set permanently.</para>
        </note>
        <para>To disable LRU sizing, run this command on the Lustre clients:</para>
        <screen>$ lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))</screen>
@@ -2154,16 +1430,8 @@ obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>
        <screen>$ lctl get_param ldlm.namespaces.*.pool.limit</screen>
      </section>
      <section xml:id="dbdoclet.50438271_87260">
-      <title><indexterm>
-          <primary>proc</primary>
-          <secondary>thread counts</secondary>
-        </indexterm>Setting MDS and OSS Thread Counts</title>
-      <para>MDS and OSS thread counts (minimum and maximum) can be set via the
-          <literal>{min,max}_thread_count tunable</literal>. For each service, a new
-          <literal>/proc/fs/lustre/{service}/*/thread_{min,max,started}</literal> entry is created.
-        The tunable, <literal>{service}.thread_{min,max,started}</literal>, can be used to set the
-        minimum and maximum thread counts or get the current number of running threads for the
-        following services.</para>
+      <title><indexterm><primary>proc</primary><secondary>thread counts</secondary></indexterm>Setting MDS and OSS Thread Counts</title>
+      <para>MDS and OSS thread counts (minimum and maximum) can be set via the <literal>{min,max}_thread_count</literal> tunable. For each service, a new <literal>/proc/fs/lustre/{service}/*/threads_{min,max,started}</literal> entry is created. The tunable, <literal>{service}.threads_{min,max,started}</literal>, can be used to set the minimum and maximum thread counts or get the current number of running threads for the following services.</para>
        <informaltable frame="all">
          <tgroup cols="2">
            <colspec colname="c1" colwidth="50*"/>
@@ -2171,12 +1439,10 @@ obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>
            <tbody>
              <row>
                <entry>
-                <para>
-                  <emphasis role="bold">Service</emphasis></para>
+                <para> <emphasis role="bold">Service</emphasis></para>
                </entry>
                <entry>
-                <para>
-                  <emphasis role="bold">Description</emphasis></para>
+                <para> <emphasis role="bold">Description</emphasis></para>
                </entry>
              </row>
              <row>
@@ -2256,8 +1522,7 @@ obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>
          <listitem>
            <para>To permanently set this tunable, run:</para>
            <screen># lctl conf_param {service}.threads_{min,max,started}</screen>
-          <para>The following examples show how to set thread counts and get the number of running
-            threads for the ost_io service.</para>
+          <para>The following examples show how to set thread counts and get the number of running threads for the ost_io service.</para>
          </listitem>
        </itemizedlist>
        <itemizedlist>
@@ -2278,8 +1543,7 @@ obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>
        </itemizedlist>
        <itemizedlist>
          <listitem>
-          <para> To set the maximum thread count to 256 instead of 512 (to avoid overloading the
-            storage or for an array with requests), run:</para>
+          <para> To set the maximum thread count to 256 instead of 512 (to avoid overloading the storage or an array with requests), run:</para>
            <screen># lctl set_param ost.OSS.ost_io.threads_max=256</screen>
            <para>The command output will be:</para>
            <screen>ost.OSS.ost_io.threads_max=256</screen>
@@ -2294,38 +1558,23 @@ obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>
          </listitem>
        </itemizedlist>
        <note>
-        <para>Currently, the maximum thread count setting is advisory because Lustre does not reduce
-          the number of service threads in use, even if that number exceeds the
-            <literal>threads_max</literal> value. Lustre does not stop service threads once they are
-          started.</para>
+        <para>Currently, the maximum thread count setting is advisory because Lustre does not reduce the number of service threads in use, even if that number exceeds the <literal>threads_max</literal> value. Lustre does not stop service threads once they are started.</para>
        </note>
      </section>
    </section>
    <section xml:id="dbdoclet.50438271_83523">
-    <title><indexterm>
-        <primary>proc</primary>
-        <secondary>debug</secondary>
-      </indexterm>Debug</title>
+    <title><indexterm><primary>proc</primary><secondary>debug</secondary></indexterm>Debug</title>
      <para><literal> /proc/sys/lnet/debug </literal></para>
-    <para>By default, Lustre generates a detailed log of all operations to aid in debugging. The
-      level of debugging can affect the performance or speed you achieve with Lustre. Therefore, it
-      is useful to reduce this overhead by turning down the debug level<footnote>
-        <para>This controls the level of Lustre debugging kept in the internal log buffer. It does
-          not alter the level of debugging that goes to syslog.</para>
-      </footnote> to improve performance. Raise the debug level when you need to collect the logs
-      for debugging problems. The debugging mask can be set with &quot;symbolic names&quot; instead
-      of the numerical values that were used in prior releases. The new symbolic format is shown in
-      the examples below.</para>
+    <para>By default, Lustre generates a detailed log of all operations to aid in debugging. The level of debugging can affect the performance or speed you achieve with Lustre. Therefore, it is useful to reduce this overhead by turning down the debug level<footnote>
+        <para>This controls the level of Lustre debugging kept in the internal log buffer. It does not alter the level of debugging that goes to syslog.</para>
+      </footnote> to improve performance. Raise the debug level when you need to collect the logs for debugging problems. The debugging mask can be set with &quot;symbolic names&quot; instead of the numerical values that were used in prior releases. The new symbolic format is shown in the examples below.</para>
      <note>
-      <para>All of the commands below must be run as root; note the <literal>#</literal>
-        nomenclature.</para>
+      <para>All of the commands below must be run as root; note the <literal>#</literal> nomenclature.</para>
      </note>
-    <para>To verify the debug level used by examining the <literal>sysctl</literal> that controls
-      debugging, run:</para>
+    <para>To verify the debug level by examining the <literal>sysctl</literal> that controls debugging, run:</para>
      <screen># sysctl lnet.debug 
  lnet.debug = ioctl neterror warning error emerg ha config console</screen>
-    <para>To turn off debugging (except for network error debugging), run this command on all
-      concerned nodes:</para>
+    <para>To turn off debugging (except for network error debugging), run this command on all concerned nodes:</para>
      <screen># sysctl -w lnet.debug=&quot;neterror&quot; 
  lnet.debug = neterror</screen>
      <para>To turn off debugging completely, run this command on all concerned nodes:</para>
@@ -2334,13 +1583,11 @@ lnet.debug = 0</screen>
      <para>To set an appropriate debug level for a production environment, run:</para>
      <screen># sysctl -w lnet.debug=&quot;warning dlmtrace error emerg ha rpctrace vfstrace&quot; 
  lnet.debug = warning dlmtrace error emerg ha rpctrace vfstrace</screen>
-    <para>The flags above collect enough high-level information to aid debugging, but they do not
-      cause any serious performance impact.</para>
+    <para>The flags above collect enough high-level information to aid debugging, but they do not cause any serious performance impact.</para>
      <para>To clear all flags and set new ones, run:</para>
      <screen># sysctl -w lnet.debug=&quot;warning&quot; 
  lnet.debug = warning</screen>
-    <para>To add new flags to existing ones, prefix them with a
-      &quot;<literal>+</literal>&quot;:</para>
+    <para>To add new flags to existing ones, prefix them with a &quot;<literal>+</literal>&quot;:</para>
      <screen># sysctl -w lnet.debug=&quot;+neterror +ha&quot; 
  lnet.debug = +neterror +ha
  # sysctl lnet.debug 
@@ -2350,8 +1597,7 @@ lnet.debug = neterror warning ha</screen>
  lnet.debug = -ha
  # sysctl lnet.debug 
  lnet.debug = neterror warning</screen>
-    <para>You can verify and change the debug level using the <literal>/proc</literal> interface in
-      Lustre. To use the flags with <literal>/proc</literal>, run:</para>
+    <para>You can verify and change the debug level using the <literal>/proc</literal> interface in Lustre. To use the flags with <literal>/proc</literal>, run:</para>
      <screen># lctl get_param debug
  debug=
  neterror warning
@@ -2364,32 +1610,20 @@ neterror warning ha
  debug=
  neterror ha</screen>
      <para><literal> /proc/sys/lnet/subsystem_debug </literal></para>
-    <para>This controls the debug logs for subsystems (see <literal>S_*</literal>
-      definitions).</para>
+    <para>This controls the debug logs for subsystems (see <literal>S_*</literal> definitions).</para>
      <para><literal> /proc/sys/lnet/debug_path </literal></para>
-    <para>This indicates the location where debugging symbols should be stored for
-        <literal>gdb</literal>. The default is set to
-        <literal>/r/tmp/lustre-log-localhost.localdomain</literal>.</para>
+    <para>This indicates the location where debugging symbols should be stored for <literal>gdb</literal>. The default is set to <literal>/r/tmp/lustre-log-localhost.localdomain</literal>.</para>
      <para>These values can also be set via <literal>sysctl -w lnet.debug={value}</literal></para>
      <note>
        <para>The above entries only exist when Lustre has already been loaded.</para>
      </note>
      <para><literal> /proc/sys/lnet/panic_on_lbug </literal></para>
-    <para>This causes Lustre to call &apos;&apos;panic&apos;&apos; when it detects an internal
-      problem (an <literal>LBUG</literal>); panic crashes the node. This is particularly useful when
-      a kernel crash dump utility is configured. The crash dump is triggered when the internal
-      inconsistency is detected by Lustre.</para>
+    <para>This causes Lustre to call &apos;&apos;panic&apos;&apos; when it detects an internal problem (an <literal>LBUG</literal>); panic crashes the node. This is particularly useful when a kernel crash dump utility is configured. The crash dump is triggered when the internal inconsistency is detected by Lustre.</para>
      <para><literal> /proc/sys/lnet/upcall </literal></para>
-    <para>This allows you to specify the path to the binary which will be invoked when an
-        <literal>LBUG</literal> is encountered. This binary is called with four parameters. The
-      first one is the string &apos;&apos;<literal>LBUG</literal>&apos;&apos;. The second one is the
-      file where the <literal>LBUG</literal> occurred. The third one is the function name. The
-      fourth one is the line number in the file.</para>
+    <para>This allows you to specify the path to the binary which will be invoked when an <literal>LBUG</literal> is encountered. This binary is called with four parameters. The first one is the string &apos;&apos;<literal>LBUG</literal>&apos;&apos;. The second one is the file where the <literal>LBUG</literal> occurred. The third one is the function name. The fourth one is the line number in the file.</para>
      <section remap="h3">
        <title>RPC Information for Other OBD Devices</title>
-      <para>Some OBD devices maintain a count of the number of RPC events that they process.
-        Sometimes these events are more specific to operations of the device, like llite, than
-        actual raw RPC counts.</para>
+      <para>Some OBD devices maintain a count of the number of RPC events that they process. Sometimes these events are more specific to operations of the device, like llite, than actual raw RPC counts.</para>
        <screen>$ find /proc/fs/lustre/ -name stats
  /proc/fs/lustre/osc/lustre-OST0001-osc-ce63ca00/stats
  /proc/fs/lustre/osc/lustre-OST0000-osc-ce63ca00/stats
@@ -2398,74 +1632,56 @@ neterror ha</screen>
  /proc/fs/lustre/mdt/MDS/mds_readpage/stats
  /proc/fs/lustre/mdt/MDS/mds_setattr/stats
  /proc/fs/lustre/mdt/MDS/mds/stats
-/proc/fs/lustre/mds/lustre-MDT0000/exports/
-       ab206805-0630-6647-8543-d24265c91a3d/stats
-/proc/fs/lustre/mds/lustre-MDT0000/exports/
-       08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0/stats
+/proc/fs/lustre/mds/lustre-MDT0000/exports/ab206805-0630-6647-8543-d24265c91a3d/stats
+/proc/fs/lustre/mds/lustre-MDT0000/exports/08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0/stats
  /proc/fs/lustre/mds/lustre-MDT0000/stats
  /proc/fs/lustre/ldlm/services/ldlm_canceld/stats
  /proc/fs/lustre/ldlm/services/ldlm_cbd/stats
  /proc/fs/lustre/llite/lustre-ce63ca00/stats
  </screen>
        <section remap="h4">
-        <title><indexterm>
-            <primary>proc</primary>
-            <secondary>statistics</secondary>
-          </indexterm>Interpreting OST Statistics</title>
+        <title><indexterm><primary>proc</primary><secondary>statistics</secondary></indexterm>Interpreting OST Statistics</title>
          <note>
-          <para>See also <xref linkend="dbdoclet.50438219_84890"/> (<literal>llobdstat</literal>)
-            and <xref linkend="dbdoclet.50438273_80593"/> (<literal>collectl</literal>).</para>
+          <para>See also <xref linkend="dbdoclet.50438219_84890"/> (llobdstat) and <xref linkend="dbdoclet.50438273_80593"/> (CollectL).</para>
          </note>
-        <para>The OST <literal>.../stats</literal> files can be used to track client statistics
-          (client activity) for each OST. It is possible to get a periodic dump of values from these
-          file (for example, every 10 seconds), that show the RPC rates (similar to
-            <literal>iostat</literal>) by using the <literal>llstat</literal> tool:</para>
+        <para>The OST .../stats files can be used to track client statistics (client activity) for each OST. It is possible to get a periodic dump of values from these files (for example, every 10 seconds) that shows the RPC rates (similar to iostat) by using the <literal>llstat.pl</literal> tool:</para>
          <screen># llstat /proc/fs/lustre/osc/lustre-OST0000-osc/stats 
-/usr/bin/llstat: STATS on 09/14/07 
-       /proc/fs/lustre/osc/lustre-OST0000-osc/ stats on 192.168.10.34@tcp                             
+/usr/bin/llstat: STATS on 09/14/07 /proc/fs/lustre/osc/lustre-OST0000-osc/stats on 192.168.10.34@tcp
  snapshot_time                      1189732762.835363
  ost_create                 1
-ost_get_info               1
-ost_connect                1
-ost_set_info               1
+ost_get_info                       1
+ost_connect                        1
+ost_set_info                       1
  obd_ping                   212</screen>
-        <para>To clear the statistics, give the <literal>-c</literal> option to
-            <literal>llstat</literal>. To specify how frequently the statistics should be cleared
-          (in seconds), use an integer for the <literal>-i</literal> option. This is sample output
-          with <literal>-c</literal> and <literal>-i10</literal> options used, providing statistics
-          every 10s):</para>
-        <screen role="smaller">$ llstat -c -i10 /proc/fs/lustre/ost/OSS/ost_io/stats
+        <para>To clear the statistics, give the <literal>-c</literal> option to <literal>llstat.pl</literal>. To specify how frequently the statistics should be cleared (in seconds), use an integer for the <literal>-i</literal> option. The following sample output was produced with the <literal>-c</literal> and <literal>-i10</literal> options, providing statistics every 10 seconds:</para>
+        <screen>$ llstat -c -i10 /proc/fs/lustre/ost/OSS/ost_io/stats
   
-/usr/bin/llstat: STATS on 06/06/07 
-        /proc/fs/lustre/ost/OSS/ost_io/ stats on 192.168.16.35@tcp
+/usr/bin/llstat: STATS on 06/06/07 /proc/fs/lustre/ost/OSS/ost_io/ stats on 192.168.16.35@tcp
  snapshot_time                              1181074093.276072
   
  /proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074103.284895
-Name         Cur.  Cur. #
-             Count Rate Events Unit   last   min    avg       max    stddev
-req_waittime 8     0    8     [usec]  2078   34     259.75    868    317.49
-req_qdepth   8     0    8     [reqs]  1      0      0.12      1      0.35
-req_active   8     0    8     [reqs]  11     1      1.38      2      0.52
-reqbuf_avail 8     0    8     [bufs]  511    63     63.88     64     0.35
-ost_write    8     0    8     [bytes] 169767 72914  212209.62 387579 91874.29
+Name               Cur.Count       Cur.Rate        #Events Unit            \last               min             avg             max             stddev
+req_waittime       8               0               8       [usec]          2078\               34              259.75          868             317.49
+req_qdepth 8               0               8       [reqs]          1\          0               0.12            1               0.35
+req_active 8               0               8       [reqs]          11\                 1               1.38            2               0.52
+reqbuf_avail       8               0               8       [bufs]          511\                63              63.88           64              0.35
+ost_write  8               0               8       [bytes]         1697677\    72914           212209.62       387579          91874.29
   
  /proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074113.290180
-Name         Cur.  Cur. #
-             Count Rate Events Unit   last    min   avg       max    stddev
-req_waittime 31    3    39    [usec]  30011   34    822.79    12245  2047.71
-req_qdepth   31    3    39    [reqs]  0       0     0.03      1      0.16
-req_active   31    3    39    [reqs]  58      1     1.77      3      0.74
-reqbuf_avail 31    3    39    [bufs]  1977    63    63.79     64     0.41
-ost_write    30    3    38    [bytes] 1028467 15019 315325.16 910694 197776.51
+Name               Cur.Count       Cur.Rate        #Events Unit            \last               min             avg             max             stddev
+req_waittime       31              3               39      [usec]          30011\              34              822.79          12245           2047.71
+req_qdepth 31              3               39      [reqs]          0\          0               0.03            1               0.16
+req_active 31              3               39      [reqs]          58\         1               1.77            3               0.74
+reqbuf_avail       31              3               39      [bufs]          1977\               63              63.79           64              0.41
+ost_write  30              3               38      [bytes]         10284679\   15019           315325.16       910694          197776.51
   
  /proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074123.325560
-Name         Cur.  Cur. #
-             Count Rate Events Unit   last    min    avg       max    stddev
-req_waittime 21    2    60    [usec]  14970   34     784.32    12245  1878.66
-req_qdepth   21    2    60    [reqs]  0       0      0.02      1      0.13
-req_active   21    2    60    [reqs]  33      1      1.70      3      0.70
-reqbuf_avail 21    2    60    [bufs]  1341    63     63.82     64     0.39
-ost_write    21    2    59    [bytes] 7648424 15019  332725.08 910694 180397.87
+Name               Cur.Count       Cur.Rate        #Events Unit            \last               min             avg             max             stddev
+req_waittime       21              2               60      [usec]          14970\              34              784.32          12245           1878.66
+req_qdepth 21              2               60      [reqs]          0\          0               0.02            1               0.13
+req_active 21              2               60      [reqs]          33\                 1               1.70            3               0.70
+reqbuf_avail       21              2               60      [bufs]          1341\               63              63.82           64              0.39
+ost_write  21              2               59      [bytes]         7648424\    15019           332725.08       910694          180397.87
  </screen>
          <para>Where:</para>
          <informaltable frame="all">
@@ -2485,18 +1701,15 @@ ost_write    21    2    59    [bytes] 7648424 15019  332725.08 910694 180397.87
              <tbody>
                <row>
                  <entry>
-                  <para>
-                    <literal> Cur. Count </literal></para>
+                  <para> <literal> Cur. Count </literal></para>
                  </entry>
                  <entry>
-                  <para>Number of events of each type sent in the last interval (in this example,
-                    10s)</para>
+                  <para>Number of events of each type sent in the last interval (in this example, 10s)</para>
                  </entry>
                </row>
                <row>
                  <entry>
-                  <para>
-                    <literal> Cur. Rate </literal></para>
+                  <para> <literal> Cur. Rate </literal></para>
                  </entry>
                  <entry>
                    <para>Number of events per second in the last interval</para>
@@ -2504,8 +1717,7 @@ ost_write    21    2    59    [bytes] 7648424 15019  332725.08 910694 180397.87
                </row>
                <row>
                  <entry>
-                  <para>
-                    <literal> #Events </literal></para>
+                  <para> <literal> #Events </literal></para>
                  </entry>
                  <entry>
                    <para>Total number of such events since the system started</para>
@@ -2513,30 +1725,23 @@ ost_write    21    2    59    [bytes] 7648424 15019  332725.08 910694 180397.87
                </row>
                <row>
                  <entry>
-                  <para>
-                    <literal> Unit </literal></para>
+                  <para> <literal> Unit </literal></para>
                  </entry>
                  <entry>
-                  <para>Unit of measurement for that statistic (microseconds, requests,
-                    buffers)</para>
+                  <para>Unit of measurement for that statistic (microseconds, requests, buffers)</para>
                  </entry>
                </row>
                <row>
                  <entry>
-                  <para>
-                    <literal> last </literal></para>
+                  <para> <literal> last </literal></para>
                  </entry>
                  <entry>
-                  <para>Average rate of these events (in units/event) for the last interval during
-                    which they arrived. For instance, in the above mentioned case of
-                      <literal>ost_destroy</literal> it took an average of 736 microseconds per
-                    destroy for the 400 object destroys in the previous 10 seconds.</para>
+                  <para>Average rate of these events (in units/event) for the last interval during which they arrived. For instance, in the above-mentioned case of <literal>ost_destroy</literal>, it took an average of 736 microseconds per destroy for the 400 object destroys in the previous 10 seconds.</para>
                  </entry>
                </row>
                <row>
                  <entry>
-                  <para>
-                    <literal> min </literal></para>
+                  <para> <literal> min </literal></para>
                  </entry>
                  <entry>
                     <para>Minimum rate (in units/event) since the service started</para>
@@ -2544,8 +1749,7 @@ ost_write    21    2    59    [bytes] 7648424 15019  332725.08 910694 180397.87
                </row>
                <row>
                  <entry>
-                  <para>
-                    <literal> avg </literal></para>
+                  <para> <literal> avg </literal></para>
                  </entry>
                  <entry>
                    <para>Average rate</para>
@@ -2553,8 +1757,7 @@ ost_write    21    2    59    [bytes] 7648424 15019  332725.08 910694 180397.87
                </row>
                <row>
                  <entry>
-                  <para>
-                    <literal> max </literal></para>
+                  <para> <literal> max </literal></para>
                  </entry>
                  <entry>
                    <para>Maximum rate</para>
@@ -2562,8 +1765,7 @@ ost_write    21    2    59    [bytes] 7648424 15019  332725.08 910694 180397.87
                </row>
                <row>
                  <entry>
-                  <para>
-                    <literal> stddev </literal></para>
+                  <para> <literal> stddev </literal></para>
                  </entry>
                  <entry>
                    <para>Standard deviation (not measured in all cases)</para>
@@ -2590,28 +1792,23 @@ ost_write    21    2    59    [bytes] 7648424 15019  332725.08 910694 180397.87
              <tbody>
                <row>
                  <entry>
-                  <para>
-                    <literal> req_waittime </literal></para>
+                  <para> <literal> req_waittime </literal></para>
                  </entry>
                  <entry>
-                  <para>Amount of time a request waited in the queue before being handled by an
-                    available server thread.</para>
+                  <para>Amount of time a request waited in the queue before being handled by an available server thread.</para>
                  </entry>
                </row>
                <row>
                  <entry>
-                  <para>
-                    <literal> req_qdepth </literal></para>
+                  <para> <literal> req_qdepth </literal></para>
                  </entry>
                  <entry>
-                  <para>Number of requests waiting to be handled in the queue for this
-                    service.</para>
+                  <para>Number of requests waiting to be handled in the queue for this service.</para>
                  </entry>
                </row>
                <row>
                  <entry>
-                  <para>
-                    <literal> req_active </literal></para>
+                  <para> <literal> req_active </literal></para>
                  </entry>
                  <entry>
                    <para>Number of requests currently being handled.</para>
@@ -2619,8 +1816,7 @@ ost_write    21    2    59    [bytes] 7648424 15019  332725.08 910694 180397.87
                </row>
                <row>
                  <entry>
-                  <para>
-                    <literal> reqbuf_avail </literal></para>
+                  <para> <literal> reqbuf_avail </literal></para>
                  </entry>
                  <entry>
                     <para>Number of unsolicited LNet request buffers for this service.</para>
@@ -2647,8 +1843,7 @@ ost_write    21    2    59    [bytes] 7648424 15019  332725.08 910694 180397.87
              <tbody>
                <row>
                  <entry>
-                  <para>
-                    <literal> ldlm_enqueue </literal></para>
+                  <para> <literal> ldlm_enqueue </literal></para>
                  </entry>
                  <entry>
                    <para>Time it takes to enqueue a lock (this includes file open on the MDS)</para>
@@ -2656,13 +1851,10 @@ ost_write    21    2    59    [bytes] 7648424 15019  332725.08 910694 180397.87
                </row>
                <row>
                  <entry>
-                  <para>
-                    <literal> mds_reint </literal></para>
+                  <para> <literal> mds_reint </literal></para>
                  </entry>
                  <entry>
-                  <para>Time it takes to process an MDS modification record (includes create,
-                      <literal>mkdir</literal>, <literal>unlink</literal>, <literal>rename</literal>
-                    and <literal>setattr</literal>)</para>
+                  <para>Time it takes to process an MDS modification record (includes create, <literal>mkdir</literal>, <literal>unlink</literal>, <literal>rename</literal> and <literal>setattr</literal>)</para>
                  </entry>
                </row>
              </tbody>
@@ -2670,29 +1862,24 @@ ost_write    21    2    59    [bytes] 7648424 15019  332725.08 910694 180397.87
          </informaltable>
        </section>
        <section remap="h4">
-        <title><indexterm>
-            <primary>proc</primary>
-            <secondary>statistics</secondary>
-          </indexterm>Interpreting MDT Statistics</title>
+        <title><indexterm><primary>proc</primary><secondary>statistics</secondary></indexterm>Interpreting MDT Statistics</title>
          <note>
-          <para>See also <xref linkend="dbdoclet.50438219_84890"/> (llobdstat) and <xref
-              linkend="dbdoclet.50438273_80593"/> (CollectL).</para>
+          <para>See also <xref linkend="dbdoclet.50438219_84890"/> (llobdstat) and <xref linkend="dbdoclet.50438273_80593"/> (CollectL).</para>
          </note>
-        <para>The MDT .../stats files can be used to track MDT statistics for the MDS. Here is
-          sample output for an MDT stats file:</para>
+        <para>The MDT .../stats files can be used to track MDT statistics for the MDS. Here is sample output for an MDT stats file:</para>
          <screen># cat /proc/fs/lustre/mds/*-MDT0000/stats 
-snapshot_time                   1244832003.676892 secs.usecs 
-open                            2 samples [reqs] 
-close                           1 samples [reqs] 
-getxattr                        3 samples [reqs] 
-process_config                  1 samples [reqs] 
-connect                         2 samples [reqs] 
-disconnect                      2 samples [reqs] 
-statfs                          3 samples [reqs] 
-setattr                         1 samples [reqs] 
-getattr                         3 samples [reqs] 
-llog_init                       6 samples [reqs] 
-notify                          16 samples [reqs]</screen>
+snapshot_time                   1244832003.676892 secs.usecs
+open                            2 samples [reqs]
+close                           1 samples [reqs]
+getxattr                        3 samples [reqs]
+process_config                  1 samples [reqs]
+connect                         2 samples [reqs]
+disconnect                      2 samples [reqs]
+statfs                          3 samples [reqs]
+setattr                         1 samples [reqs]
+getattr                         3 samples [reqs]
+llog_init                       6 samples [reqs]
+notify                          16 samples [reqs]</screen>
        </section>
      </section>
    </section>
diff --git a/ManagingStripingFreeSpace.xml b/ManagingStripingFreeSpace.xml

index 7a53750..066a83e 100644 (file)
--- a/ManagingStripingFreeSpace.xml
+++ b/ManagingStripingFreeSpace.xml
@@ -1,7 +1,8 @@
-<?xml version='1.0' encoding='UTF-8'?>
-<!-- This document was created with Syntext Serna Free. --><chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="managingstripingfreespace">
-  <title xml:id="managingstripingfreespace.title">Managing File Striping and Free Space</title>
-  <para>This chapter describes file striping and I/O options, and includes the following sections:</para>
+<?xml version='1.0' encoding='UTF-8'?><chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="managingstripingfreespace">
+  <title xml:id="managingstripingfreespace.title">Managing File Layout (Striping) and Free
+    Space</title>
+  <para>This chapter describes file layout (striping) and I/O options, and includes the following
+    sections:</para>
    <itemizedlist>
      <listitem>
        <para><xref linkend="dbdoclet.50438209_79324"/></para>
@@ -18,84 +19,168 @@
      <listitem>
        <para><xref linkend="dbdoclet.50438209_10424"/></para>
      </listitem>
+    <listitem>
+      <para><xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="section_syy_gcl_qk"/></para>
+    </listitem>
    </itemizedlist>
    <section xml:id="dbdoclet.50438209_79324">
        <title>
-          <indexterm><primary>space</primary></indexterm>
-          <indexterm><primary>striping</primary><secondary>how it works</secondary></indexterm>
-          <indexterm><primary>striping</primary><see>space</see></indexterm>
-          <indexterm><primary>space</primary><secondary>striping</secondary></indexterm>
-  How Lustre Striping Works</title>
-    <para>Lustre uses a round-robin algorithm for selecting the next OST to which a stripe is to be written. Normally the usage of OSTs is well balanced. However, if users create a small number of exceptionally large files or incorrectly specify striping parameters, imbalanced OST usage may result.</para>
-    <para>The MDS allocates objects on sequential OSTs. Periodically, it will adjust the striping layout to eliminate some degenerated cases where applications that create very regular file layouts (striping patterns) would preferentially use a particular OST in the sequence.</para>
-    <para>Stripes are written to sequential OSTs until free space across the OSTs differs by more than 20%. The MDS will then use weighted random allocations with a preference for allocating objects on OSTs with more free space. This can reduce I/O performance until space usage is rebalanced to within 20% again.</para>
-    <para>For a more detailed description of stripe assignments, see <xref linkend="dbdoclet.50438209_10424"/>.</para>
+      <indexterm>
+        <primary>space</primary>
+      </indexterm>
+      <indexterm>
+        <primary>striping</primary>
+        <secondary>how it works</secondary>
+      </indexterm>
+      <indexterm>
+        <primary>striping</primary>
+        <see>space</see>
+      </indexterm>
+      <indexterm>
+        <primary>space</primary>
+        <secondary>striping</secondary>
+      </indexterm>How Lustre* File System Striping Works</title>
+    <para>In a Lustre* file system, the MDS allocates objects to OSTs using either a round-robin
+      algorithm or a weighted algorithm. When the amount of free space is well balanced (i.e., by
+      default, when the free space across OSTs differs by less than 17%), the round-robin algorithm
+      is used to select the next OST to which a stripe is to be written. Periodically, the MDS
+      adjusts the striping layout to eliminate some degenerated cases in which applications that
+      create very regular file layouts (striping patterns) preferentially use a particular OST in
+      the sequence.</para>
+    <para> Normally the usage of OSTs is well balanced. However, if users create a small number of
+      exceptionally large files or incorrectly specify striping parameters, imbalanced OST usage may
+      result. When the free space across OSTs differs by more than a specific amount (17% by
+      default), the MDS then uses weighted random allocations with a preference for allocating
+      objects on OSTs with more free space. (This can reduce I/O performance until space usage is
+      rebalanced again.) For a more detailed description of how striping is allocated, see <xref
+        linkend="dbdoclet.50438209_10424"/>.</para>
+    <para condition="l22">Files can only be striped over a finite number of OSTs. Prior to the
+      Lustre 2.2 release, the maximum number of OSTs that a file could be striped across was limited
+      to 160. As of the Lustre 2.2 release, the maximum number of OSTs is 2000. For more
+      information, see <xref xmlns:xlink="http://www.w3.org/1999/xlink" linkend="section_syy_gcl_qk"
+      />.</para>
    </section>
    <section xml:id="dbdoclet.50438209_48033">
-      <title><indexterm><primary>striping</primary><secondary>considerations</secondary></indexterm>
-          <indexterm><primary>space</primary><secondary>considerations</secondary></indexterm>
-          Lustre File Striping Considerations</title>
-    <para>Whether you should set up file striping and what parameter values you select depends on your need. A good rule of thumb is to stripe over as few objects as will meet those needs and no more.</para>
+      <title><indexterm>
+        <primary>file layout</primary>
+        <see>striping</see>
+      </indexterm><indexterm>
+        <primary>striping</primary>
+        <secondary>considerations</secondary>
+      </indexterm>
+      <indexterm>
+        <primary>space</primary>
+        <secondary>considerations</secondary>
+      </indexterm> Lustre File Layout (Striping) Considerations</title>
+    <para>Whether you should set up file striping and what parameter values you select depends on
+      your needs. A good rule of thumb is to stripe over as few objects as will meet those needs and
+      no more.</para>
      <para>Some reasons for using striping include:</para>
      <itemizedlist>
        <listitem>
-        <para><emphasis role="bold">Providing high-bandwidth access</emphasis>  - Many applications require high-bandwidth access to a single file - more bandwidth than can be provided by a single OSS. For example, scientific applications that write to a single file from hundreds of nodes, or a binary executable that is loaded by many nodes when an application starts.</para>
-        <para>In cases like these, a file can be striped over as many OSSs as it takes to achieve the required peak aggregate bandwidth for that file. Striping across a larger number of OSSs should only be used when the file size is very large and/or is accessed by many nodes at a time. Currently, Lustre files can be striped across up to 2000 OSTs, the maximum stripe count for an ldiskfs file system.</para>
+        <para><emphasis role="bold">Providing high-bandwidth access.</emphasis> Many applications
+          require high-bandwidth access to a single file, which may be more bandwidth than can be
+          provided by a single OSS. Examples are a scientific application that writes to a single
+          file from hundreds of nodes, or a binary executable that is loaded by many nodes when an
+          application starts.</para>
+        <para>In cases like these, a file can be striped over as many OSSs as it takes to achieve
+          the required peak aggregate bandwidth for that file. Striping across a larger number of
+          OSSs should only be used when the file is very large and/or is accessed by many nodes
+          at a time. Currently, Lustre files can be striped across up to 2000 OSTs, the maximum
+          stripe count for an <literal>ldiskfs</literal> file system.</para>
        </listitem>
        <listitem>
-        <para><emphasis role="bold">Improving performance when OSS bandwidth is exceeded</emphasis>  - Striping across many OSSs can improve performance if the aggregate client bandwidth exceeds the server bandwidth and the application reads and writes data fast enough to take advantage of the additional OSS bandwidth. The largest useful stripe count is bounded by the I/O rate of the clients/jobs divided by the performance per OSS.</para>
+        <para><emphasis role="bold">Improving performance when OSS bandwidth is exceeded.</emphasis>
+          Striping across many OSSs can improve performance if the aggregate client bandwidth
+          exceeds the server bandwidth and the application reads and writes data fast enough to take
+          advantage of the additional OSS bandwidth. The largest useful stripe count is bounded by
+          the I/O rate of the clients/jobs divided by the performance per OSS.</para>
        </listitem>
        <listitem>
-        <para><emphasis role="bold">Providing space for very large files.</emphasis>  Striping is also useful when a single OST does not have enough free space to hold the entire file.</para>
+        <para><emphasis role="bold">Providing space for very large files.</emphasis> Striping is
+          useful when a single OST does not have enough free space to hold the entire file.</para>
        </listitem>
      </itemizedlist>
      <para>Some reasons to minimize or avoid striping:</para>
      <itemizedlist>
        <listitem>
-        <para><emphasis role="bold">Increased overhead</emphasis>  - Striping results in more locks and extra network operations during common operations such as stat and unlink. Even when these operations are performed in parallel, one network operation takes less time than 100 operations.</para>
-        <para>Increased overhead also results from server contention. Consider a cluster with 100 clients and 100 OSSs, each with one OST. If each file has exactly one object and the load is distributed evenly, there is no contention and the disks on each server can manage sequential I/O. If each file has 100 objects, then the clients all compete with one another for the attention of the servers, and the disks on each node seek in 100 different directions. In this case, there is needless contention.</para>
+        <para><emphasis role="bold">Increased overhead.</emphasis> Striping results in more locks
+          and extra network operations during common operations such as <literal>stat</literal> and
+            <literal>unlink</literal>. Even when these operations are performed in parallel, one
+          network operation takes less time than 100 operations.</para>
+        <para>Increased overhead also results from server contention. Consider a cluster with 100
+          clients and 100 OSSs, each with one OST. If each file has exactly one object and the load
+          is distributed evenly, there is no contention and the disks on each server can manage
+          sequential I/O. If each file has 100 objects, then the clients all compete with one
+          another for the attention of the servers, and the disks on each node seek in 100 different
+          directions, resulting in needless contention.</para>
        </listitem>
        <listitem>
-        <para><emphasis role="bold">Increased risk</emphasis>  - When a file is striped across all servers and one of the servers breaks down, a small part of each striped file is lost. By comparison, if each file has exactly one stripe, you lose fewer files, but you lose them in their entirety. Many users would prefer to lose some of their files entirely than all of their files partially.</para>
+        <para><emphasis role="bold">Increased risk.</emphasis> When files are striped across all
+          servers and one of the servers breaks down, a small part of each striped file is lost. By
+          comparison, if each file has exactly one stripe, fewer files are lost, but they are lost
+          in their entirety. Many users would rather lose some of their files entirely than all
+          of their files partially.</para>
        </listitem>
      </itemizedlist>
      <section remap="h3">
          <title><indexterm><primary>striping</primary><secondary>size</secondary></indexterm>
              Choosing a Stripe Size</title>
-      <para>Choosing a stripe size is a small balancing act, but there are reasonable defaults.</para>
+      <para>Choosing a stripe size is a balancing act, but reasonable defaults are described below.
+        The stripe size has no effect on a single-stripe file.</para>
        <itemizedlist>
          <listitem>
-          <para><emphasis role="bold">The stripe size must be a multiple of the page size.</emphasis>  Lustre&apos;s tools enforce a multiple of 64 KB (the maximum page size on ia64 and PPC64 nodes) so that users on platforms with smaller pages do not accidentally create files that might cause problems for ia64 clients.</para>
-        </listitem>
-        <listitem>
-          <para><emphasis role="bold">The smallest recommended stripe size is 512 KB.</emphasis>  Although you can create files with a stripe size of 64 KB, the smallest practical stripe size is 512 KB because Lustre sends 1MB chunks over the network. Choosing a smaller stripe size may result in inefficient I/O to the disks and reduced performance.</para>
+          <para><emphasis role="bold">The stripe size must be a multiple of the page
+              size.</emphasis> Lustre software tools enforce a multiple of 64 KB (the maximum page
+            size on ia64 and PPC64 nodes) so that users on platforms with smaller pages do not
+            accidentally create files that might cause problems for ia64 clients.</para>
          </listitem>
          <listitem>
-          <para><emphasis role="bold">A good stripe size for sequential I/O using high-speed networks is between 1 MB and 4 MB.</emphasis>  In most situations, stripe sizes larger than 4 MB may result in longer lock hold times and contention on shared file access.</para>
+          <para><emphasis role="bold">The smallest recommended stripe size is 512 KB.</emphasis>
+            Although you can create files with a stripe size of 64 KB, the smallest practical stripe
+            size is 512 KB because the Lustre file system sends 1 MB chunks over the network.
+            Choosing a smaller stripe size may result in inefficient I/O to the disks and reduced
+            performance.</para>
          </listitem>
          <listitem>
-          <para><emphasis role="bold">The maximum stripe size is 4GB.</emphasis>  Using a large stripe size can improve performance when accessing very large files. It allows each client to have exclusive access to its own part of a file. However, it can be counterproductive in some cases if it does not match your I/O pattern.</para>
+          <para><emphasis role="bold">A good stripe size for sequential I/O using high-speed
+              networks is between 1 MB and 4 MB.</emphasis> In most situations, stripe sizes larger
+            than 4 MB may result in longer lock hold times and contention during shared file
+            access.</para>
          </listitem>
          <listitem>
-          <para><emphasis role="bold">Choose a stripe pattern that takes into account your application&apos;s write patterns.</emphasis>  Writes that cross an object boundary are slightly less efficient than writes that go entirely to one server. If the file is written in a very consistent and aligned way, make the stripe size a multiple of the write() size.</para>
+          <para><emphasis role="bold">The maximum stripe size is 4 GB.</emphasis> Using a large
+            stripe size can improve performance when accessing very large files. It allows each
+            client to have exclusive access to its own part of a file. However, a large stripe size
+            can be counterproductive in cases where it does not match your I/O pattern.</para>
          </listitem>
          <listitem>
-          <para><emphasis role="bold">The choice of stripe size has no effect on a single-stripe file.</emphasis></para>
+          <para><emphasis role="bold">Choose a stripe pattern that takes into account the write
+              patterns of your application.</emphasis> Writes that cross an object boundary are
+            slightly less efficient than writes that go entirely to one server. If the file is
+            written in a consistent and aligned way, make the stripe size a multiple of the
+              <literal>write()</literal> size.</para>
          </listitem>
        </itemizedlist>
      </section>
    </section>
    <section xml:id="dbdoclet.50438209_78664">
-      <title><indexterm><primary>striping</primary><secondary>configuration</secondary></indexterm>
-          Setting the File Layout/Striping Configuration (<literal>lfs setstripe</literal>)</title>
+      <title><indexterm>
+        <primary>striping</primary>
+        <secondary>configuration</secondary>
+      </indexterm>Setting the File Layout/Striping Configuration (<literal>lfs
+      setstripe</literal>)</title>
      <para>Use the <literal>lfs setstripe</literal> command to create new files with a specific file layout (stripe pattern) configuration.</para>
-    <screen>lfs setstripe [--size|-s stripe_size] [--count|-c stripe_count] 
+    <screen>lfs setstripe [--size|-s stripe_size] [--count|-c stripe_count] \
  [--index|-i start_ost] [--pool|-p pool_name] <replaceable>filename|dirname</replaceable> </screen>
      <para><emphasis role="bold">
          <literal>stripe_size</literal>
        </emphasis>
        </para>
-    <para>The <literal>stripe_size</literal> indicates how much data to write to one OST before moving to the next OST. The default <literal>stripe_size</literal> is 1 MB, and passing a stripe_size of 0 causes the default stripe size to be used. Otherwise, the <literal>stripe_size</literal> value must be a multiple of 64 KB.</para>
+    <para>The <literal>stripe_size</literal> indicates how much data to write to one OST before
+      moving to the next OST. The default <literal>stripe_size</literal> is 1 MB. Passing a
+        <literal>stripe_size</literal> of 0 causes the default stripe size to be used. Otherwise,
+      the <literal>stripe_size</literal> value must be a multiple of 64 KB.</para>
      <para><emphasis role="bold">
          <literal>stripe_count</literal>
        </emphasis>
@@ -107,16 +192,23 @@
        </para>
      <para>The start OST is the first OST to which files are written. The default value for <literal>start_ost</literal> is -1, which allows the MDS to choose the starting index. This setting is strongly recommended, as it allows space and load balancing to be done by the MDS as needed. Otherwise, the file starts on the specified OST index. The numbering of the OSTs starts at 0.</para>
      <note>
-      <para>If you pass a <literal>start_ost</literal> value of 0 and a <literal>stripe_count</literal> value of <emphasis>1</emphasis>, all files are written to OST 0, until space is exhausted. This is probably not what you meant to do. If you only want to adjust the stripe count and keep the other parameters at their default settings, do not specify any of the other parameters:</para>
+      <para>If you pass a <literal>start_ost</literal> value of 0 and a
+          <literal>stripe_count</literal> value of <emphasis>1</emphasis>, all files are written to
+        OST 0, until space is exhausted. <emphasis role="italic">This is probably not what you meant
+          to do.</emphasis> If you only want to adjust the stripe count and keep the other
+        parameters at their default settings, do not specify any of the other parameters:</para>
        <para><screen>client# lfs setstripe -c <replaceable>stripe_count</replaceable> <replaceable>filename</replaceable></screen></para>
      </note>
      <para><emphasis role="bold">
          <literal>pool_name</literal>
        </emphasis>
        </para>
-    <para>Specify the OST pool on which the file will be written. This allows limiting the OSTs used to a subset of all OSTs in the file system. For more details about using OST pools, see <link xl:href="ManagingFileSystemIO.html#50438211_75549">Creating and Managing OST Pools</link>.</para>
+    <para>The <literal>pool_name</literal> specifies the OST pool to which the file will be written.
+      This allows limiting the OSTs used to a subset of all OSTs in the file system. For more
+      details about using OST pools, see <link xl:href="ManagingFileSystemIO.html#50438211_75549"
+        >Creating and Managing OST Pools</link>.</para>
      <section remap="h3">
-      <title>Using a Specific Striping Pattern/File Layout for a Single File</title>
+      <title>Specifying a File Layout (Striping Pattern) for a Single File</title>
        <para>It is possible to specify the file layout when a new file is created using the command <literal>lfs setstripe</literal>. This allows users to override the file system default parameters to tune the file layout more optimally for their application. Execution of an <literal>lfs setstripe</literal> command fails if the file already exists.</para>
        <section xml:id="dbdoclet.50438209_60155">
          <title>Setting the Stripe Size</title>
@@ -129,95 +221,159 @@
  lmm_stripe_count:   1
  lmm_stripe_size:    4194304
  lmm_stripe_offset:  1
-obdidx     objid                   objid                           group
-1  690550                  0xa8976                         0 </screen>
-        <para>As can be seen, the stripe size is 4 MB.</para>
+obdidx     objid        objid           group
+1          690550       0xa8976         0 </screen>
+        <para>In this example, the stripe size is 4 MB.</para>
        </section>
        <section remap="h4">
            <title><indexterm><primary>striping</primary><secondary>count</secondary></indexterm>
                Setting the Stripe Count</title>
-        <para>The command below creates a new file with a stripe count of -1 to specify striping over all available OSTs:</para>
+        <para>The command below creates a new file with a stripe count of <literal>-1</literal> to
+          specify striping over all available OSTs:</para>
          <screen>[client]# lfs setstripe -c -1 /mnt/lustre/full_stripe</screen>
-        <para>The example below indicates that the file full_stripe is striped over all six active OSTs in the configuration:</para>
+        <para>The example below indicates that the file <literal>full_stripe</literal> is striped
+          over all six active OSTs in the configuration:</para>
          <screen>[client]# lfs getstripe /mnt/lustre/full_stripe
  /mnt/lustre/full_stripe
-obdidx objid objid group
-0  8       0x8             0
-1  4       0x4             0
-2  5       0x5             0
-3  5       0x5             0
-4  4       0x4             0
-5  2       0x2             0</screen>
-        <para> This is in contrast to the output in <xref linkend="dbdoclet.50438209_60155"/> that shows only a single object for the file.</para>
+  obdidx   objid   objid   group
+  0        8       0x8     0
+  1        4       0x4     0
+  2        5       0x5     0
+  3        5       0x5     0
+  4        4       0x4     0
+  5        2       0x2     0</screen>
+        <para> This is in contrast to the output in <xref linkend="dbdoclet.50438209_60155"/>, which
+          shows only a single object for the file.</para>
        </section>
      </section>
      <section remap="h3">
-      <title><indexterm><primary>striping</primary><secondary>per directory</secondary></indexterm>Changing Striping for a Directory</title>
-      <para>In a directory, the <literal>lfs setstripe</literal> command sets a default striping configuration for files created in the directory. The usage is the same as <literal>lfs setstripe</literal> for a regular file, except that the directory must exist prior to setting the default striping configuration. If a file is created in a directory with a default stripe configuration (without otherwise specifying striping), Lustre uses those striping parameters instead of the file system default for the new file.</para>
-      <para>To change the striping pattern (file layout) for a sub-directory, create a directory with desired file layout as described above. Sub-directories inherit the file layout of the root/parent directory.</para>
+      <title><indexterm>
+          <primary>striping</primary>
+          <secondary>per directory</secondary>
+        </indexterm>Setting the Striping Layout for a Directory</title>
+      <para>In a directory, the <literal>lfs setstripe</literal> command sets a default striping
+        configuration for files created in the directory. The usage is the same as <literal>lfs
+          setstripe</literal> for a regular file, except that the directory must exist prior to
+        setting the default striping configuration. If a file is created in a directory with a
+        default stripe configuration (without otherwise specifying striping), the Lustre file system
+        uses those striping parameters instead of the file system default for the new file.</para>
+      <para>To change the striping pattern for a sub-directory, create a directory with the desired
+        file layout as described above. Sub-directories inherit the file layout of the root/parent
+        directory.</para>
      </section>
      <section remap="h3">
-      <title><indexterm><primary>striping</primary><secondary>per file system</secondary></indexterm>Changing Striping for a File System</title>
-      <para>Change the striping on the file system root will change the striping for all newly created files that would otherwise have a striping parameter from the parent directory or explicitly on the command line.</para>
+      <title><indexterm>
+          <primary>striping</primary>
+          <secondary>per file system</secondary>
+        </indexterm>Setting the Striping Layout for a File System</title>
+      <para>Setting the striping specification on the <literal>root</literal> directory determines
+        the striping for all new files created in the file system unless an overriding striping
+        specification takes precedence (such as a striping layout specified by the application, or
+        set using <literal>lfs setstripe</literal>, or specified for the parent directory).</para>
        <note>
-        <para>Striping of new files and sub-directories is done per the striping parameter settings of the root directory. Once you set striping on the root directory, then, by default, it applies to any new child directories created in that root directory (unless they have their own striping settings).</para>
+        <para>The striping settings for a <literal>root</literal> directory are, by default, applied
+          to any new child directories created in the root directory, unless striping settings have
+          been specified for the child directory.</para>
        </note>
      </section>
      <section remap="h3">
-      <title><indexterm><primary>striping</primary><secondary>on specific OST</secondary></indexterm>Creating a File on a Specific OST</title>
-      <para>You can use <literal>lfs setstripe</literal> to create a file on a specific OST. In the following example, the file &quot;<literal>bob</literal>&quot; will be created on the first OST (id 0).</para>
-      <screen>$ lfs setstripe --count 1 --index 0 bob
-$ dd if=/dev/zero of=bob count=1 bs=100M
+      <title><indexterm>
+          <primary>striping</primary>
+          <secondary>on specific OST</secondary>
+        </indexterm>Creating a File on a Specific OST</title>
+      <para>You can use <literal>lfs setstripe</literal> to create a file on a specific OST. In the
+        following example, the file <literal>file1</literal> is created on the first OST (OST index
+        is 0).</para>
+      <screen>$ lfs setstripe --count 1 --index 0 file1
+$ dd if=/dev/zero of=file1 count=1 bs=100M
  1+0 records in
  1+0 records out
-$ lfs getstripe bob</screen>
-      <para>OBDS:</para>
-      <screen>0: home-OST0000_UUID ACTIVE
-[...]
-bob
-   obdidx          objid                   objid                   group
-   0               33459243                0x1fe8c2b               0</screen>
+
+$ lfs getstripe file1
+/mnt/testfs/file1
+/mnt/testfs/file1
+lmm_stripe_count:   1
+lmm_stripe_size:    1048576
+lmm_stripe_offset:  0               
+     obdidx    objid   objid    group                    
+     0         37364   0x91f4   0</screen>
      </section>
    </section>
    <section xml:id="dbdoclet.50438209_44776">
      <title><indexterm><primary>striping</primary><secondary>getting information</secondary></indexterm>Retrieving File Layout/Striping Information (<literal>getstripe</literal>)</title>
-    <para>The <literal>lfs getstripe</literal> command is used to display information that shows over which OSTs a file is distributed. For each OST, the index and UUID is displayed, along with the OST index and object ID for each stripe in the file. For directories, the default settings for files created in that directory are printed.</para>
+    <para>The <literal>lfs getstripe</literal> command is used to display information that shows
+      over which OSTs a file is distributed. For each OST, the index and UUID are displayed, along
+      with the OST index and object ID for each stripe in the file. For directories, the default
+      settings for files created in that directory are displayed.</para>
      <section remap="h3">
        <title>Displaying the Current Stripe Size</title>
-      <para>To see the current stripe size, use the <literal>lfs getstripe</literal> command on a Lustre file or directory. For example:</para>
+      <para>To see the current stripe size for a Lustre file or directory, use the <literal>lfs
+          getstripe</literal> command. For example, to view information for a directory, enter a
+        command similar to:</para>
        <screen>[client]# lfs getstripe /mnt/lustre </screen>
-      <para>This command produces output similar to this:</para>
+      <para>This command produces output similar to:</para>
        <screen>/mnt/lustre 
  (Default) stripe_count: 1 stripe_size: 1M stripe_offset: -1</screen>
-      <para>In this example, the default stripe count is 1 (data blocks are striped over a single OSTs), the default stripe size is 1 MB, and objects are created over all available OSTs.</para>
+      <para>In this example, the default stripe count is <literal>1</literal> (data blocks are
+        striped over a single OST), the default stripe size is 1 MB, and the objects are created
+        over all available OSTs.</para>
+      <para>To view information for a file, enter a command similar to:</para>
+      <screen>$ lfs getstripe /mnt/lustre/foo
+/mnt/lustre/foo
+  obdidx   objid    objid      group
+  2        835487   0xcbf9f    0 </screen>
+      <para>In this example, the file is located on <literal>obdidx 2</literal>, which corresponds
+        to the OST <literal>lustre-OST0002</literal>. To see which node is serving that OST, run:
+        <screen>$ lctl get_param osc.lustre-OST0002-osc.ost_conn_uuid
+osc.lustre-OST0002-osc.ost_conn_uuid=192.168.20.1@tcp</screen></para>
      </section>
      <section remap="h3">
        <title>Inspecting the File Tree</title>
-      <para>To inspect an entire tree of files, use the <literal>lfs</literal> find command:</para>
+      <para>To inspect an entire tree of files, use the <literal>lfs find</literal> command:</para>
        <screen>lfs find [--recursive | -r] <replaceable>file|directory</replaceable> ...</screen>
-      <para>You can also use <literal>ls -l /proc/<replaceable>pid</replaceable>/fd/</literal> to find open files using Lustre. For example:</para>
-      <screen>$ lfs getstripe $(readlink /proc/$(pidof cat)/fd/1)</screen>
-      <para>Typical output is:</para>
-      <screen>/mnt/lustre/foo
-obdidx                     objid                   objid                   \
-group
-2                  835487                  0xcbf9f                 0</screen>
-      <para>In this example, the file lives on <literal>obdidx</literal><literal> 2</literal>, which is <literal>lustre-OST0002</literal>. To see which node is serving that OST, run:</para>
-      <screen>$ lctl get_param osc.lustre-OST0002-osc.ost_conn_uuid</screen>
-      <para>Typical output is:</para>
-      <screen>osc.lustre-OST0002-osc.ost_conn_uuid=192.168.20.1@tcp</screen>
      </section>
-       <section condition='l24'>
-    <title><indexterm><primary>striping</primary><secondary>remote directories</secondary></indexterm>Locating the MDT for a remote directory</title>
-       <para>Lustre 2.4 can be configured with multiple MDTs in the same filesystem. Each sub-directory can have a different MDT. To identify which MDT a given subdirectory is located on, pass the <literal>getstripe -M</literal> parameters to <literal>lfs</literal>. An example of this command is provided in the section <xref linkend='dbdoclet.rmremotedir'/>.</para>
-       </section>
+       <section>
+      <title><indexterm>
+          <primary>striping</primary>
+          <secondary>remote directories</secondary>
+        </indexterm>Locating the MDT for a remote directory</title>
+      <para condition="l24">Lustre release 2.4 can be configured with multiple MDTs in the same
+        file system. Each sub-directory can have a different MDT. To identify on which MDT a given
+        sub-directory is located, pass the <literal>getstripe -M</literal> parameters to
+          <literal>lfs</literal>. An example of this command is provided in the section <xref
+          linkend="dbdoclet.rmremotedir"/>.</para>
+    </section>
    </section>
    <section xml:id="dbdoclet.50438209_10424">
-    <title><indexterm><primary>space</primary><secondary>free space</secondary></indexterm>Managing Free Space</title>
-    <para>The MDT assigns file stripes to OSTs based on location (which OSS) and size considerations (free space) to optimize file system performance. Emptier OSTs are preferentially selected for stripes, and stripes are preferentially spread out between OSSs to increase network bandwidth utilization. The weighting factor between these two optimizations can be adjusted by the user.</para>
+    <title><indexterm>
+        <primary>space</primary>
+        <secondary>free space</secondary>
+      </indexterm><indexterm>
+        <primary>striping</primary>
+        <secondary>round-robin algorithm</secondary>
+      </indexterm><indexterm>
+        <primary>striping</primary>
+        <secondary>weighted algorithm</secondary>
+      </indexterm><indexterm>
+        <primary>round-robin algorithm</primary>
+      </indexterm><indexterm>
+        <primary>weighted algorithm</primary>
+      </indexterm>Managing Free Space</title>
+    <para>To optimize file system performance, the MDT assigns file stripes to OSTs based on two
+      allocation algorithms. The <emphasis role="italic">round-robin</emphasis> allocator gives
+      preference to location (spreading out stripes across OSSs to increase network bandwidth
+      utilization) and the weighted allocator gives preference to available space (balancing loads
+      across OSTs). Threshold and weighting factors for these two algorithms can be adjusted by the
+      user. This section describes how to check available free space on disks and how free space is
+      allocated. It then describes how to set the threshold and weighting factors for the allocation
+      algorithms.</para>
      <section xml:id="dbdoclet.50438209_35838">
        <title>Checking File System Free Space</title>
-      <para>Free space is an important consideration in assigning file stripes. The <literal>lfs df</literal> command shows available disk space on the mounted Lustre file system and space consumption per OST. If multiple Lustre file systems are mounted, a path may be specified, but is not required.</para>
+      <para>Free space is an important consideration in assigning file stripes. The <literal>lfs
+          df</literal> command can be used to show available disk space on the mounted Lustre file
+        system and space consumption per OST. If multiple Lustre file systems are mounted, a path
+        may be specified, but is not required. Options to the <literal>lfs df</literal> command are
+        shown below.</para>
        <informaltable frame="all">
          <tgroup cols="2">
            <colspec colname="c1" colwidth="50*"/>
@@ -238,7 +394,7 @@ group
                  <para> <literal>-h</literal></para>
                </entry>
                <entry>
-                <para> Human-readable print sizes in human readable format (for example: 1K, 234M, 5G).</para>
+                <para> Displays sizes in human readable format (for example: 1K, 234M, 5G).</para>
                </entry>
              </row>
              <row>
@@ -253,59 +409,73 @@ group
          </tgroup>
        </informaltable>
        <note>
-        <para>The <literal>df -i</literal> and <literal>lfs df -i</literal> commands show the minimum number of inodes that can be created in the file system. Depending on the configuration, it may be possible to create more inodes than initially reported by <literal>df -i</literal>. Later, <literal>df -i</literal> operations will show the current, estimated free inode count.</para>
-        <para>If the underlying file system has fewer free blocks than inodes, then the total inode count for the file system reports only as many inodes as there are free blocks. This is done because Lustre may need to store an external attribute for each new inode, and it is better to report a free inode count that is the guaranteed, minimum number of inodes that can be created.</para>
+        <para>The <literal>df -i</literal> and <literal>lfs df -i</literal> commands show the
+            <emphasis role="italic">minimum</emphasis> number of inodes that can be created in the
+          file system at the current time. Depending on the current state of the file system and the
+          OSTs, it may be possible to create more inodes than currently reported by <literal>df
+            -i</literal>. As more files are created in the file system, <literal>df -i</literal>
+          will show the current estimated free inode count.</para>
+        <para>If the underlying file system has fewer free blocks than inodes, the total inode count
+          reported for the file system is only as many inodes as there are free blocks. This is
+          because the Lustre file system may need to store an external attribute for each new inode,
+          and it is better to report a free inode count that corresponds to the guaranteed minimum
+          number of inodes that can be created.</para>
+        <para>If the total number of objects available across all of the OSTs is smaller than those
+          available on the MDT(s), taking into account the default file striping, then <literal>df
+            -i</literal> will also report a smaller number of inodes than could be created. Running
+            <literal>lfs df -i</literal> will report the actual number of inodes that are free on
+          each target.</para>
+        <para>For ZFS file systems, the number of inodes that can be created is dynamic and depends
+          on the free space in the file system. The Free and Total inode counts reported for a ZFS
+          file system are only an estimate based on the current usage for each target. The Used
+          inode count is the actual number of inodes used by the file system.</para>
        </note>
        <para><emphasis role="bold">Examples</emphasis></para>
-      <screen>[lin-cli1] $ lfs df
-UUID                       1K-blockS               Used                    \
-Available               Use%            Mounted on
-mds-lustre-0_UUID  9174328                 1020024                 8154304 \
-                11%             /mnt/lustre[MDT:0]
-ost-lustre-0_UUID  94181368                56330708                37850660\
-                59%             /mnt/lustre[OST:0]
-ost-lustre-1_UUID  94181368                56385748                37795620\
-                59%             /mnt/lustre[OST:1]
-ost-lustre-2_UUID  94181368                54352012                39829356\
-                57%             /mnt/lustre[OST:2]
-filesystem summary:        282544104               167068468               \
-39829356                57%             /mnt/lustre
+      <screen>[client1] $ lfs df
+UUID                1K-blocks  Used      Available Use% Mounted on
+mds-lustre-0_UUID   9174328    1020024   8154304   11%  /mnt/lustre[MDT:0]
+ost-lustre-0_UUID   94181368   56330708  37850660  59%  /mnt/lustre[OST:0]
+ost-lustre-1_UUID   94181368   56385748  37795620  59%  /mnt/lustre[OST:1]
+ost-lustre-2_UUID   94181368   54352012  39829356  57%  /mnt/lustre[OST:2]
+filesystem summary: 282544104  167068468 39829356  57%  /mnt/lustre
   
-[lin-cli1] $ lfs df -h
-UUID                       bytes                   Used                    \
-Available               Use%            Mounted on
-mds-lustre-0_UUID  8.7G                    996.1M                  7.8G    \
-                11%             /mnt/lustre[MDT:0]
-ost-lustre-0_UUID  89.8G                   53.7G                   36.1G   \
-                59%             /mnt/lustre[OST:0]
-ost-lustre-1_UUID  89.8G                   53.8G                   36.0G   \
-                59%             /mnt/lustre[OST:1]
-ost-lustre-2_UUID  89.8G                   51.8G                   38.0G   \
-                57%             /mnt/lustre[OST:2]
-filesystem summary:        269.5G                  159.3G                  \
-110.1G                  59%             /mnt/lustre
+[client1] $ lfs df -h
+UUID                bytes    Used    Available   Use%  Mounted on
+mds-lustre-0_UUID   8.7G     996.1M  7.8G        11%   /mnt/lustre[MDT:0]
+ost-lustre-0_UUID   89.8G    53.7G   36.1G       59%   /mnt/lustre[OST:0]
+ost-lustre-1_UUID   89.8G    53.8G   36.0G       59%   /mnt/lustre[OST:1]
+ost-lustre-2_UUID   89.8G    51.8G   38.0G       57%   /mnt/lustre[OST:2]
+filesystem summary: 269.5G   159.3G  110.1G      59%   /mnt/lustre
   
-[lin-cli1] $ lfs df -i 
-UUID                       Inodes                  IUsed                   \
-IFree                   IUse%           Mounted on
-mds-lustre-0_UUID  2211572                 41924                   2169648 \
-                1%              /mnt/lustre[MDT:0]
-ost-lustre-0_UUID  737280                  12183                   725097  \
-                1%              /mnt/lustre[OST:0]
-ost-lustre-1_UUID  737280                  12232                   725048  \
-                1%              /mnt/lustre[OST:1]
-ost-lustre-2_UUID  737280                  12214                   725066  \
-                1%              /mnt/lustre[OST:2]
-filesystem summary:        2211572                 41924                   \
-2169648                 1%              /mnt/lustre[OST:2]</screen>
+[client1] $ lfs df -i 
+UUID                Inodes  IUsed IFree   IUse% Mounted on
+mds-lustre-0_UUID   2211572 41924 2169648 1%    /mnt/lustre[MDT:0]
+ost-lustre-0_UUID   737280  12183 725097  1%    /mnt/lustre[OST:0]
+ost-lustre-1_UUID   737280  12232 725048  1%    /mnt/lustre[OST:1]
+ost-lustre-2_UUID   737280  12214 725066  1%    /mnt/lustre[OST:2]
+filesystem summary: 2211572 41924 2169648 1%    /mnt/lustre</screen>
      </section>
      <section remap="h3">
-        <title><indexterm><primary>striping</primary><secondary>allocations</secondary></indexterm>
-            Using Stripe Allocations</title>
-      <para>Two stripe allocation methods are provided: <emphasis>round-robin</emphasis> and <emphasis>weighted</emphasis>. By default, the allocation method is determined by the amount of free-space imbalance on the OSTs. The weighted allocator is used when any two OSTs are imbalanced by more than 20%. Otherwise, the faster round-robin allocator is used. (The round-robin order maximizes network balancing.)</para>
+        <title><indexterm>
+          <primary>striping</primary>
+          <secondary>allocations</secondary>
+        </indexterm> Stripe Allocation Methods</title>
+      <para>Two stripe allocation methods are provided:</para>
        <itemizedlist>
          <listitem>
-          <para><emphasis role="bold">Round-robin allocator</emphasis> - When OSTs have approximately the same amount of free space (within 20%), an efficient round-robin allocator is used. The round-robin allocator alternates stripes between OSTs on different OSSs, so the OST used for stripe 0 of each file is evenly distributed among OSTs, regardless of the stripe count. Here are several sample round-robin stripe orders (each letter represents a different OST on a single OSS):</para>
+          <para><emphasis role="bold">Round-robin allocator</emphasis> - When the OSTs have
+            approximately the same amount of free space, the round-robin allocator alternates
+            stripes between OSTs on different OSSs, so the OST used for stripe 0 of each file is
+            evenly distributed among OSTs, regardless of the stripe count. In a simple example with
+            eight OSTs numbered 0-7, objects would be allocated like this:</para>
+          <para>
+            <screen>File 1: OST1, OST2, OST3, OST4
+File 2: OST5, OST6, OST7
+File 3: OST0, OST1, OST2, OST3, OST4, OST5
+File 4: OST6, OST7, OST0</screen>
+          </para>
+          <para>Here are several more sample round-robin stripe orders (each letter represents a
+            different OST on a single OSS):</para>
            <informaltable frame="none">
              <tgroup cols="2">
                <colspec colname="c1" colwidth="50*"/>
@@ -356,18 +526,65 @@ filesystem summary:        2211572                 41924                   \
            </informaltable>
          </listitem>
          <listitem>
-          <para><emphasis role="bold">Weighted allocator</emphasis>  - When the free space difference between the OSTs is significant (by default, 20% of the free space), then a weighting algorithm is used to influence OST ordering based on size and location. Note that these are weightings for a random algorithm, so the OST with the most free space is not necessarily chosen each time. On average, the weighted allocator fills the emptier OSTs faster.</para>
+          <para><emphasis role="bold">Weighted allocator</emphasis> - When the free space difference
+            between the OSTs becomes significant, the weighting algorithm is used to influence OST
+            ordering based on size (amount of free space available on each OST) and location
+            (stripes evenly distributed across OSTs). The weighted allocator fills the emptier OSTs
+            faster, but uses a weighted random algorithm, so the OST with the most free space is not
+            necessarily chosen each time.</para>
          </listitem>
        </itemizedlist>
+      <para>The allocation method is determined by the amount of free-space imbalance on the OSTs.
+        When free space is relatively balanced across OSTs, the faster round-robin allocator is
+        used, which maximizes network balancing. The weighted allocator is used when any two OSTs
+        are out of balance by more than the specified threshold (17% by default). The threshold
+        between the two allocation methods is defined in the file
+            <literal>/proc/fs/<replaceable>fsname</replaceable>/lov/<replaceable>fsname</replaceable>-mdtlov/qos_threshold_rr</literal>. </para>
+      <para>To set the <literal>qos_threshold_rr</literal> to <literal>25</literal>, enter this
+        command on the
+        MGS:<screen>lctl set_param lov.<replaceable>fsname</replaceable>-mdtlov.qos_threshold_rr=25</screen></para>
      </section>
      <section remap="h3">
-        <title><indexterm><primary>space</primary><secondary>location weighting</secondary></indexterm>Adjusting the Weighting Between Free Space and Location</title>
-      <para>The weighting priority can be adjusted in the <literal>/proc</literal> file <literal>/proc/fs/lustre/lov/lustre-mdtlov/qos_prio_free proc</literal>. The default value is 90%. Use this command on the MGS to permanently change this weighting:</para>
-      <screen>lctl conf_param <replaceable>fsname</replaceable>-MDT0000.lov.qos_prio_free=90</screen>
-      <para>Increasing this value puts more weighting on free space. When the free space priority is set to 100%, then location is no longer used in stripe-ordering calculations and weighting is based entirely on free space.</para>
+      <title><indexterm>
+          <primary>space</primary>
+          <secondary>location weighting</secondary>
+        </indexterm>Adjusting the Weighting Between Free Space and Location</title>
+      <para>The weighting priority used by the weighted allocator is set in the file
+            <literal>/proc/fs/<replaceable>fsname</replaceable>/lov/<replaceable>fsname</replaceable>-mdtlov/qos_prio_free</literal>.
+        Increasing the value of <literal>qos_prio_free</literal> puts more weighting on the amount
+        of free space available on each OST and less on how stripes are distributed across OSTs. The
+        default value is <literal>91</literal> (percent). When the free space priority is set to
+          <literal>100</literal> (percent), weighting is based entirely on free space and location
+        is no longer used by the striping algorithm. </para>
+      <para>To change the allocator weighting to <literal>100</literal>, enter this command on the
+        MGS:</para>
+      <screen>lctl conf_param <replaceable>fsname</replaceable>-MDT0000.lov.qos_prio_free=100</screen>
+      <para> .</para>
        <note>
-        <para>Setting the priority to 100% means that OSS distribution does not count in the weighting, but the stripe assignment is still done via a weighting. For example, if OST2 has twice as much free space as OST1, then OST2 is twice as likely to be used, but it is not guaranteed to be used.</para>
+        <para>When <literal>qos_prio_free</literal> is set to <literal>100</literal>, a weighted
+          random algorithm is still used to assign stripes, so, for example, if OST2 has twice as
+          much free space as OST1, OST2 is twice as likely to be used, but it is not guaranteed to
+          be used.</para>
        </note>
      </section>
    </section>
+  <section xml:id="section_syy_gcl_qk">
+    <title><indexterm>
+        <primary>striping</primary>
+        <secondary>wide striping</secondary>
+      </indexterm><indexterm>
+        <primary>wide striping</primary>
+      </indexterm>Lustre Striping Internals</title>
+    <para>For Lustre releases prior to Lustre release 2.2, files can be striped across a maximum of
+      160 OSTs. Lustre inodes use an extended attribute to record the location of each object (the
+      object ID and the number of the OST on which it is stored). The size of the extended attribute
+      limits the maximum stripe count to 160 objects.</para>
+    <para condition="l22">In Lustre release 2.2 and subsequent releases, the maximum number of OSTs
+      over which files can be striped has been raised to 2000 by allocating a new block on which to
+      store the extended attribute that holds the object information. This feature, known as "wide
+      striping," only allocates the additional extended attribute data block if the file is striped
+      with a stripe count greater than 160. The file layout (object ID, OST number) is stored on the
+      new data block with a pointer to this block stored in the original Lustre inode for the file.
+      For files with a stripe count of 160 or fewer, the Lustre inode is used to store the file layout.</para>
+  </section>
  </chapter>
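
Note on the allocator text in the hunks above: the round-robin example (eight OSTs, Files 1-4) and the qos_prio_free=100 note can be checked with a small Python model. This is only a sketch under stated assumptions, not the Lustre allocator itself. The function names, the stripe counts 4, 3, 6, 3 (read off the example), the starting OST index 1, and the free-space figures are illustrative. The model ignores OSS placement and treats the weighting as purely proportional to free space, which is what the note describes for a priority of 100.

```python
def round_robin_layouts(num_osts, stripe_counts, first_ost=0):
    """Give each file consecutive OST indices, wrapping around at num_osts.

    Hypothetical model: each file starts where the previous one stopped.
    """
    layouts = []
    next_ost = first_ost
    for stripes in stripe_counts:
        layouts.append([(next_ost + i) % num_osts for i in range(stripes)])
        next_ost = (next_ost + stripes) % num_osts
    return layouts


def selection_probabilities(free_space):
    """Chance of picking each OST when weights are proportional to free space.

    Assumption: location plays no part, as with qos_prio_free set to 100.
    """
    total = sum(free_space.values())
    return {ost: free / total for ost, free in free_space.items()}


# Reproduce the four-file example from the round-robin list item.
layouts = round_robin_layouts(num_osts=8, stripe_counts=[4, 3, 6, 3], first_ost=1)
for number, layout in enumerate(layouts, start=1):
    print(f"File {number}: " + ", ".join(f"OST{index}" for index in layout))

# OST2 has twice the free space of OST1 (made-up figures), so it is twice as
# likely to be chosen, but OST1 can still be picked.
probabilities = selection_probabilities({"OST1": 100, "OST2": 200})
print(probabilities)
```

The first loop prints the same four layouts as the screen example in the round-robin item. The second part shows why the most-empty OST is favored but not guaranteed: the choice is weighted, not deterministic.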