FIX: proofed against original

author Richard Henwood <rhenwood@whamcloud.com>

Fri, 20 May 2011 16:44:31 +0000 (11:44 -0500)

committer Richard Henwood <rhenwood@whamcloud.com>

Fri, 20 May 2011 16:44:31 +0000 (11:44 -0500)
diff --git a/I_LustreIntro.xml b/I_LustreIntro.xml

index f3ed8a4..9b976ed 100644
--- a/I_LustreIntro.xml
+++ b/I_LustreIntro.xml
@@ -8,17 +8,17 @@
    <itemizedlist>
        <listitem>
            <para>
-              <link linkend='understandinglustre' endterm='understandinglustre.title'/>
+              <xref linkend='understandinglustre' endterm='understandinglustre.title'/>
            </para>
        </listitem>
        <listitem>
            <para>
-              <link linkend='understandinglustrenetworking' endterm='understandinglustrenetworking.title'/>
+              <xref linkend='understandinglustrenetworking' endterm='understandinglustrenetworking.title'/>
            </para>
        </listitem>
        <listitem>
            <para>
-              <link linkend='understandingfailover' endterm='understandingfailover.title'/>
+              <xref linkend='understandingfailover' endterm='understandingfailover.title'/>
            </para>
        </listitem>
    </itemizedlist>
diff --git a/InstallingLustreFromSourceCode.xml b/InstallingLustreFromSourceCode.xml

index d203a1e..6e3ed30 100644
--- a/InstallingLustreFromSourceCode.xml
+++ b/InstallingLustreFromSourceCode.xml
@@ -1,258 +1,273 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<chapter version="5.0" xml:lang="en-US" xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" xml:id='installinglustrefromsourcecode'>
+<?xml version='1.0' encoding='UTF-8'?>
+<!-- This document was created with Syntext Serna Free. --><chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="installinglustrefromsourcecode">
    <info>
-    <title xml:id='installinglustrefromsourcecode.title'>Installing Lustre from Source Code</title>
+    <title xml:id="installinglustrefromsourcecode.title">Installing Lustre from Source Code</title>
    </info>
    <para>If you need to build a customized Lustre server kernel or are using a Linux kernel that has not been tested with the version of Lustre you are installing, you may need to build and install Lustre from source code. This chapter describes:</para>
-
-  <itemizedlist><listitem>
+  <itemizedlist>
+    <listitem>
        <para><xref linkend="dbdoclet.50438210_69313"/></para>
      </listitem>
-
-<listitem>
+    <listitem>
        <para><xref linkend="dbdoclet.50438210_65411"/></para>
      </listitem>
-
-<listitem>
+    <listitem>
        <para><xref linkend="dbdoclet.50438210_47529"/></para>
      </listitem>
-
-<listitem>
+    <listitem>
        <para><xref linkend="dbdoclet.50438210_27248"/></para>
      </listitem>
-
-</itemizedlist>
-
-    <section xml:id="dbdoclet.50438210_69313">
-      <title>29.1 Overview and Prerequisites</title>
-      <para>Lustre can be installed from either pre-built binary packages (RPMs) or freely-available source code. Installing from the package release is recommended unless you need to customize the Lustre server kernel or will be using an Linux kernel that has not been tested with Lustre. For a list of supported Linux distributions and architectures, see the topic <link xl:href="http://wiki.lustre.org/index.php/Lustre_2.0">Lustre_2.0</link> on the Lustre wiki. The procedure for installing Lustre from RPMs is describe in <link xl:href="InstallingLustre.html#50438261_81829">Chapter 8</link>: <link xl:href="InstallingLustre.html#50438261_62973">Installing the Lustre Software</link>.</para>
-      <para>To install Lustre from source code, the following are required:</para>
-      <itemizedlist><listitem>
-          <para> Linux kernel patched with Lustre-specific patches</para>
+  </itemizedlist>
+  <section xml:id="dbdoclet.50438210_69313">
+    <title>29.1 Overview and Prerequisites</title>
+    <para>Lustre can be installed from either pre-built binary packages (RPMs) or freely-available source code. Installing from the package release is recommended unless you need to customize the Lustre server kernel or will be using a Linux kernel that has not been tested with Lustre. For a list of supported Linux distributions and architectures, see the topic <ulink xl:href="http://wiki.lustre.org/index.php/Lustre_2.0">Lustre_2.0</ulink> on the Lustre wiki. The procedure for installing Lustre from RPMs is described in <xref linkend="installinglustre">Chapter 8</xref>.</para>
+    <para>To install Lustre from source code, the following are required:</para>
+    <itemizedlist>
+      <listitem>
+        <para> Linux kernel patched with Lustre-specific patches</para>
+      </listitem>
+      <listitem>
+        <para> Lustre modules compiled for the Linux kernel</para>
+      </listitem>
+      <listitem>
+        <para> Lustre utilities required for Lustre configuration</para>
+      </listitem>
+    </itemizedlist>
+    <para>The installation procedure involves several steps:</para>
+    <itemizedlist>
+      <listitem>
+        <para> Patching the core kernel</para>
+      </listitem>
+      <listitem>
+        <para> Configuring the kernel to work with Lustre</para>
+      </listitem>
+      <listitem>
+        <para> Creating Lustre and kernel RPMs from source code</para>
+      </listitem>
+    </itemizedlist>
+    <note>
+      <para>When using third-party network hardware with Lustre, the third-party modules (typically, the drivers) must be linked against the Linux kernel. The LNET modules in Lustre also need these references. To meet these requirements, a specific process must be followed to install and recompile Lustre. See <xref linkend="dbdoclet.50438210_27248">Installing Lustre with a Third-Party Network Stack</xref> for an example showing how to install Lustre 1.6.6 using the Myricom MX 1.2.7 driver. The same process can be used for other third-party network stacks.</para>
+    </note>
+  </section>
+  <section xml:id="dbdoclet.50438210_65411">
+    <title>29.2 Patching the Kernel</title>
+    <para>If you are using non-standard hardware, plan to apply a Lustre patch, or have another reason not to use packaged Lustre binaries, you have to apply several Lustre patches to the core kernel and run the Lustre configure script against the kernel.</para>
+    <section remap="h3">
+      <title>29.2.1 Introducing the <literal>quilt</literal> Utility</title>
+      <para>To simplify the process of applying Lustre patches to the kernel, we recommend that you use the <literal>quilt</literal> utility.</para>
+      <para><literal>quilt</literal> manages a stack of patches on a single source tree. A series file lists the patch files and the order in which they are applied. Patches are applied, incrementally, on the base tree and all preceding patches. You can:</para>
+      <itemizedlist>
+        <listitem>
+          <para>Apply patches from the stack (<literal>quilt push</literal>)</para>
          </listitem>
-
-<listitem>
-          <para> Lustre modules compiled for the Linux kernel</para>
+        <listitem>
+          <para>Remove patches from the stack (<literal>quilt pop</literal>)</para>
          </listitem>
-
-<listitem>
-          <para> Lustre utilities required for Lustre configuration</para>
+        <listitem>
+          <para>Query the contents of the series file (<literal>quilt series</literal>), the contents of the stack (<literal>quilt applied</literal>, <literal>quilt previous</literal>, <literal>quilt top</literal>), and the patches that are not applied at a particular moment (<literal>quilt next</literal>, <literal>quilt unapplied</literal>).</para>
          </listitem>
-
-</itemizedlist>
-      <para>The installation procedure involves several steps:</para>
-      <itemizedlist><listitem>
-          <para> Patching the core kernel</para>
+        <listitem>
+          <para>Edit and refresh (update) patches with <literal>quilt</literal>, as well as revert inadvertent changes, and fork or clone patches and show the diffs before and after work.</para>
          </listitem>
-
-<listitem>
-          <para> Configuring the kernel to work with Lustre</para>
+      </itemizedlist>
+      <para>A variety of <literal>quilt</literal> packages (RPMs, SRPMs and tarballs) are available from various sources. Use the most recent version you can find. <literal>quilt</literal> depends on several other utilities, e.g., the coreutils RPM that is only available in RedHat 9. For other RedHat kernels, you have to get the required packages to successfully install <literal>quilt</literal>. If you cannot locate a <literal>quilt</literal> package or fulfill its dependencies, you can build <literal>quilt</literal> from a tarball, available at the <literal>quilt</literal> project website:</para>
+      <para><ulink xl:href="http://savannah.nongnu.org/projects/quilt">http://savannah.nongnu.org/projects/quilt</ulink></para>
+      <para>For additional information on using Quilt, including its commands, see <ulink xl:href="http://www.suse.de/~agruen/quilt.pdf">Introduction to Quilt</ulink> and the <ulink xl:href="http://linux.die.net/man/1/quilt"><literal>quilt(1)</literal> man page</ulink>.</para>
+    </section>
+    <section remap="h3">
+      <title>29.2.2 Get the Lustre Source and Unpatched Kernel</title>
+      <para>The Lustre Engineering Team has targeted several Linux kernels for use with Lustre servers (MDS/OSS) and provides a series of patches for each one. The Lustre patches are maintained in the kernel_patches directory bundled with the Lustre source code.</para>
+      <caution>
+        <para>Lustre contains kernel modifications which interact with storage devices and may introduce security issues and data loss if not installed, configured and administered correctly. Before installing Lustre, be cautious and back up ALL data.</para>
+      </caution>
+      <note>
+        <para>Each patch series has been tailored to a specific kernel version, and may or may not apply cleanly to other versions of the kernel.</para>
+      </note>
+      <para>To obtain the Lustre source and unpatched kernel:</para>
+      <orderedlist>
+        <listitem>
+          <para>Verify that all of the Lustre installation requirements have been met.</para>
+          <para>For more information on these prerequisites, see:</para>
+          <itemizedlist>
+            <listitem>
+              <para> Hardware requirements in <xref linkend="settinguplustresystem">Chapter 5</xref>.</para>
+            </listitem>
+            <listitem>
+              <para> Software and environmental requirements in <xref linkend="dbdoclet.50438261_99193">Preparing to Install the Lustre Software</xref>.</para>
+            </listitem>
+          </itemizedlist>
          </listitem>
-
-<listitem>
-          <para> Creating Lustre and kernel RPMs from source code.</para>
+        <listitem>
+          <para>Download the Lustre source code.</para>
+          <para>On the <ulink xl:href="http://git.whamcloud.com/">Lustre download site</ulink>, select a version of Lustre to download and then select &apos;Source&apos; as the platform.</para>
          </listitem>
-
-</itemizedlist>
-              <note><para>When using third-party network hardware with Lustre, the third-party modules (typically, the drivers) must be linked against the Linux kernel. The LNET modules in Lustre also need these references. To meet these requirements, a specific process must be followed to install and recompile Lustre. See <link xl:href="InstallingLustrefrSourceCode.html#50438210_27248">Installing Lustre with a Third-Party Network Stack</link>, for an example showing how to install Lustre 1.6.6 using the Myricom MX 1.2.7 driver. The same process can be used for other third-party network stacks.</para></note>
+        <listitem>
+          <para>Download the unpatched kernel.</para>
+          <para>Obtain the kernel source from your Linux distributor.</para>
+        </listitem>
+      </orderedlist>
      </section>
-    <section xml:id="dbdoclet.50438210_65411">
-      <title>29.2 Patching the Kernel</title>
-      <para>If you are using non-standard hardware, plan to apply a Lustre patch, or have another reason not to use packaged Lustre binaries, you have to apply several Lustre patches to the core kernel and run the Lustre configure script against the kernel.</para>
-      <section remap="h3">
-        <title>29.2.1 Introducing the Quilt Utility</title>
-        <para>To simplify the process of applying Lustre patches to the kernel, we recommend that you use the Quilt utility.</para>
-        <para>Quilt manages a stack of patches on a single source tree. A series file lists the patch files and the order in which they are applied. Patches are applied, incrementally, on the base tree and all preceding patches. You can:</para>
-        <itemizedlist><listitem>
-            <para> Apply patches from the stack (quilt push)</para>
-          </listitem>
-
-<listitem>
-            <para> Remove patches from the stack (quilt pop)</para>
-          </listitem>
-
-<listitem>
-            <para> Query the contents of the series file (quilt series), the contents of the stack (quilt applied, quilt previous, quilt top), and the patches that are not applied at a particular moment (quilt next, quilt unapplied).</para>
-          </listitem>
-
-<listitem>
-            <para> Edit and refresh (update) patches with Quilt, as well as revert inadvertent changes, and fork or clone patches and show the diffs before and after work.</para>
-          </listitem>
-
-</itemizedlist>
-        <para>A variety of Quilt packages (RPMs, SRPMs and tarballs) are available from various sources. Use the most recent version you can find. Quilt depends on several other utilities, e.g., the coreutils RPM that is only available in RedHat 9. For other RedHat kernels, you have to get the required packages to successfully install Quilt. If you cannot locate a Quilt package or fulfill its dependencies, you can build Quilt from a tarball, available at the Quilt project website:</para>
-        <para><link xl:href="http://savannah.nongnu.org/projects/quilt">http://savannah.nongnu.org/projects/quilt</link></para>
-        <para>For additional information on using Quilt, including its commands, see <link xl:href="http://www.suse.de/~agruen/quilt.pdf">Introduction to Quilt</link> and the <link xl:href="http://linux.die.net/man/1/quilt">quilt(1) man page.</link></para>
-      </section>
-      <section remap="h3">
-        <title>29.2.2 <anchor xml:id="dbdoclet.50438210_44148" xreflabel=""/>Get the Lustre Source and Unpatched Kernel</title>
-        <para>The Lustre Engineering Team has targeted several Linux kernels for use with Lustre servers (MDS/OSS) and provides a series of patches for each one. The Lustre patches are maintained in the kernel_patch directory bundled with the Lustre source code.</para>
-
-                <caution><para>Lustre contains kernel modifications which interact with storage devices and may introduce security issues and data loss if not installed, configured and administered correctly. Before installing Lustre, be cautious and back up ALL data.</para></caution>
-                <note><para>Each patch series has been tailored to a specific kernel version, and may or may not apply cleanly to other versions of the kernel.</para></note>
-        <para>To obtain the Lustre source and unpatched kernel:</para>
-        <orderedlist><listitem>
-        <para>Verify that all of the Lustre installation requirements have been met.</para>
-        <para>For more information on these prerequisites, see:</para>
-        <itemizedlist><listitem>
-            <para> Hardware requirements in <link xl:href="SettingUpLustreSystem.html#50438256_38751">Chapter 5</link>: <link xl:href="SettingUpLustreSystem.html#50438256_66186">Setting Up a Lustre File System</link></para>
-          </listitem>
-
-<listitem>
-            <para> Software and environmental requirements in <link xl:href="InstallingLustre.html#50438261_99193">Preparing to Install the Lustre Software</link></para>
-          </listitem>
-
-</itemizedlist>
-
-        </listitem><listitem>
-        <para>Download the Lustre source code.</para>
-        <para>On the <link xl:href="http://www.oracle.com/technetwork/indexes/downloads/sun-az-index-095901.html#L">Lustre download site</link>, select a version of Lustre to download and then select 'Source' as the platform.</para>
-        </listitem><listitem>
-        <para>Download the unpatched kernel.</para>
-        <para>For convenience, Oracle maintains an archive of unpatched kernel sources at:</para>
-        <para><link xl:href="http://downloads.lustre.org/public/kernels/">http://downloads.lustre.org/public/kernels/</link><anchor xml:id="dbdoclet.50438210_28487" xreflabel=""/></para>
-</listitem></orderedlist>
-      </section>
-      <section remap="h3">
-        <title>29.2.3 Patch the Kernel</title>
-        <para>This procedure describes how to use Quilt to apply the Lustre patches to the kernel. To illustrate the steps in this procedure, a RHEL 5 kernel is patched for Lustre 1.6.5.1.</para>
-        <orderedlist><listitem>
-        <para>Unpack the Lustre source and kernel to separate source trees.</para>
-        <orderedlist><listitem>
-        <para>Unpack the Lustre source.</para>
-        <para>For this procedure, we assume that the resulting source tree is in /tmp/lustre-1.6.5.1</para>
-        </listitem><listitem>
-        <para>Unpack the kernel.</para>
-        <para>For this procedure, we assume that the resulting source tree (also known as the destination tree) is in /tmp/kernels/linux-2.6.18</para>
-        </listitem></orderedlist>
-        </listitem><listitem>
-        <para>Select a config file for your kernel, located in the kernel_configs directory (lustre/kernel_patches/kernel_config).</para>
-        <para>The kernel_config directory contains the .config files, which are named to indicate the kernel and architecture with which they are associated. For example, the configuration file for the 2.6.18 kernel shipped with RHEL 5 (suitable for i686 SMP systems) is kernel-2.6.18-2.6-rhel5-i686-smp.config.</para>
-        </listitem><listitem>
-        <para>Select the series file for your kernel, located in the series directory (lustre/kernel_patches/series).</para>
-        <para>The series file contains the patches that need to be applied to the kernel.</para>
-        </listitem><listitem>
-        <para>Set up the necessary symlinks between the kernel patches and the Lustre source.</para>
-        <para>This example assumes that the Lustre source files are unpacked under /tmp/lustre-1.6.5.1 and you have chosen the 2.6-rhel5.series file). Run:</para>
-        <screen>$ cd /tmp/kernels/linux-2.6.18
+    <section remap="h3">
+      <title>29.2.3 Patch the Kernel</title>
+      <para>This procedure describes how to use Quilt to apply the Lustre patches to the kernel. To illustrate the steps in this procedure, a RHEL 5 kernel is patched for Lustre 1.6.5.1.</para>
+      <orderedlist>
+        <listitem>
+          <para>Unpack the Lustre source and kernel to separate source trees.</para>
+          <orderedlist>
+            <listitem>
+              <para>Unpack the Lustre source.</para>
+              <para>For this procedure, we assume that the resulting source tree is in <literal>/tmp/lustre-1.6.5.1</literal>.</para>
+            </listitem>
+            <listitem>
+              <para>Unpack the kernel.</para>
+              <para>For this procedure, we assume that the resulting source tree (also known as the destination tree) is in <literal>/tmp/kernels/linux-2.6.18</literal>.</para>
+            </listitem>
+          </orderedlist>
+        </listitem>
+        <listitem>
+          <para>Select a <literal>config</literal> file for your kernel, located in the <literal>kernel_configs</literal> directory (<literal>lustre/kernel_patches/kernel_configs</literal>).</para>
+          <para>The <literal>kernel_configs</literal> directory contains the <literal>.config</literal> files, which are named to indicate the kernel and architecture with which they are associated. For example, the configuration file for the 2.6.18 kernel shipped with RHEL 5 (suitable for i686 SMP systems) is <literal>kernel-2.6.18-2.6-rhel5-i686-smp.config</literal>.</para>
+        </listitem>
+        <listitem>
+          <para>Select the series file for your kernel, located in the series directory (<literal>lustre/kernel_patches/series</literal>).</para>
+          <para>The series file contains the patches that need to be applied to the kernel.</para>
+        </listitem>
+        <listitem>
+          <para>Set up the necessary symlinks between the kernel patches and the Lustre source.</para>
+          <para>This example assumes that the Lustre source files are unpacked under <literal>/tmp/lustre-1.6.5.1</literal> and that you have chosen the <literal>2.6-rhel5.series</literal> file. Run:</para>
+          <screen>$ cd /tmp/kernels/linux-2.6.18
  $ rm -f patches series
-$ ln -s /tmp/lustre-1.6.5.1/lustre/kernel_patches/series/2.6-\ rhel5.series\
- ./series
-$ ln -s /tmp/lustre-1.6.5.1/lustre/kernel_patches/patches .
-</screen>
-        </listitem><listitem>
-        <para>Use Quilt to apply the patches in the selected series file to the unpatched kernel. Run:</para>
-        <screen>$ cd /tmp/kernels/linux-2.6.18
-$ quilt push -av
-</screen>
-        <para>The patched destination tree acts as a base Linux source tree for Lustre.</para>
-        </listitem></orderedlist>
-      </section>
+$ ln -s /tmp/lustre-1.6.5.1/lustre/kernel_patches/series/2.6-rhel5.series ./series
+$ ln -s /tmp/lustre-1.6.5.1/lustre/kernel_patches/patches .</screen>
+        </listitem>
+        <listitem>
+          <para>Use <literal>quilt</literal> to apply the patches in the selected series file to the unpatched kernel. Run:</para>
+          <screen>$ cd /tmp/kernels/linux-2.6.18
+$ quilt push -av</screen>
+          <para>The patched destination tree acts as a base Linux source tree for Lustre.</para>
+        </listitem>
+      </orderedlist>
      </section>
-    <section xml:id="dbdoclet.50438210_47529">
-      <title>29.3 Creating and Installing the Lustre Packages</title>
-      <para>After patching the kernel, configure it to work with Lustre, create the Lustre packages (RPMs) and install them.</para>
-
-        <orderedlist><listitem>
-      <para>Configure the patched kernel to run with Lustre. Run:</para>
-      <screen>$ cd &lt;path to kernel tree&gt;
+  </section>
+  <section xml:id="dbdoclet.50438210_47529">
+    <title>29.3 Creating and Installing the Lustre Packages</title>
+    <para>After patching the kernel, configure it to work with Lustre, create the Lustre packages (RPMs) and install them.</para>
+    <orderedlist>
+      <listitem>
+        <para>Configure the patched kernel to run with Lustre. Run:</para>
+        <screen>$ cd &lt;path to kernel tree&gt;
  $ cp /boot/config-`uname -r` .config
  $ make oldconfig || make menuconfig
  $ make include/asm
  $ make include/linux/version.h
  $ make SUBDIRS=scripts
-$ make include/linux/utsrelease.h
-</screen>
-        </listitem><listitem>
-      <para>Run the Lustre configure script against the patched kernel and create the Lustre packages.</para>
-      <screen>$ cd &lt;path to lustre source tree&gt;
+$ make include/linux/utsrelease.h</screen>
+      </listitem>
+      <listitem>
+        <para>Run the Lustre configure script against the patched kernel and create the Lustre packages.</para>
+        <screen>$ cd &lt;path to lustre source tree&gt;
  $ ./configure --with-linux=&lt;path to kernel tree&gt;
-$ make rpms
-</screen>
-      <para>This creates a set of .rpms in /usr/src/redhat/RPMS/&lt;arch&gt; with an appended date-stamp. The SuSE path is /usr/src/packages.</para>
-              <note><para>You do not need to run the Lustre configure script against an unpatched kernel.</para></note>
-      <para><emphasis role="bold">Example set of RPMs:</emphasis></para>
-      <screen>lustre-1.6.5.1-\2.6.18_53.xx.xx.el5_lustre.1.6.5.1.custom_20081021.i686.rpm
- 
-lustre-debuginfo-1.6.5.1-\2.6.18_53.xx.xx.el5_lustre.1.6.5.1.custom_2008102\
-1.i686.rpm
- 
-lustre-modules-1.6.5.1-\2.6.18_53.xx.xxel5_lustre.1.6.5.1.custom_20081021.i\
-686.rpm
- 
-lustre-source-1.6.5.1-\2.6.18_53.xx.xx.el5_lustre.1.6.5.1.custom_20081021.i\
-686.rpm
-</screen>
-              <note><para>If the steps to create the RPMs fail, contact Lustre Support by reporting a bug. See <link xl:href="LustreTroubleshooting.html#50438198_30989">Reporting a Lustre Bug</link>.</para></note>
-              <note><para>Several features and packages are available that extend the core functionality of Lustre. These features/packages can be enabled at the build time by issuing appropriate arguments to the configure command. For a list of these features and packages, run ./configure -help in the Lustre source tree. The configs/ directory of the kernel source contains the config files matching each the kernel version. Copy one to .config at the root of the kernel tree.</para></note>
-        </listitem><listitem>
-      <para><anchor xml:id="dbdoclet.50438210_41207" xreflabel=""/>Create the kernel package. Navigate to the kernel source directory and run:</para>
-      <screen>$ make rpm
-</screen>
-      <para>Example result:</para>
-      <screen>kernel-2.6.95.0.3.EL_lustre.1.6.5.1custom-1.i686.rpm
-</screen>
-              <note><para><link xl:href="InstallingLustrefrSourceCode.html#50438210_41207">Step 3</link> is only valid for RedHat and SuSE kernels. If you are using a stock Linux kernel, you need to get a script to create the kernel RPM.</para></note>
-        </listitem><listitem>
-       <para>Install the Lustre packages.</para>
-      <para>Some Lustre packages are installed on servers (MDS and OSSs), and others are installed on Lustre clients. For guidance on where to install specific packages, see <link xl:href="InstallingLustre.html#50438261_21654">TABLE 8-1</link> in <link xl:href="InstallingLustre.html#50438261_99193">Preparing to Install the Lustre Software</link>. which lists required packages and for each package, where to install it. Depending on the selected platform, not all of the packages listed in <link xl:href="InstallingLustre.html#50438261_21654">TABLE 8-1</link> need to be installed.</para>
-              <note><para>Running the patched server kernel on the clients is optional. It is not necessary unless the clients will be used for multiple purposes, for example, to run as a client and an OST.</para></note>
-       <para>Lustre packages should be installed in this order:</para>
-        <orderedlist><listitem>
-      <para>Install the kernel, modules and ldiskfs packages.</para>
-      <para>Navigate to the directory where the RPMs are stored, and use the rpm -ivh command to install the kernel, module and ldiskfs packages.</para>
-      <screen>$ rpm -ivh kernel-lustre-smp-&lt;ver&gt; \
+$ make rpms</screen>
+        <para>This creates a set of <literal>.rpm</literal> files in <literal>/usr/src/redhat/RPMS/&lt;arch&gt;</literal> with an appended date-stamp. The SuSE path is <literal>/usr/src/packages</literal>.</para>
+        <note>
+          <para>You do not need to run the Lustre configure script against an unpatched kernel.</para>
+        </note>
+        <para><emphasis role="bold">Example set of RPMs:</emphasis></para>
+        <screen>lustre-1.6.5.1-2.6.18_53.xx.xx.el5_lustre.1.6.5.1.custom_20081021.i686.rpm
+
+lustre-debuginfo-1.6.5.1-2.6.18_53.xx.xx.el5_lustre.1.6.5.1.custom_20081021.i686.rpm
+
+lustre-modules-1.6.5.1-2.6.18_53.xx.xx.el5_lustre.1.6.5.1.custom_20081021.i686.rpm
+
+lustre-source-1.6.5.1-2.6.18_53.xx.xx.el5_lustre.1.6.5.1.custom_20081021.i686.rpm</screen>
+        <note>
+          <para>If the steps to create the RPMs fail, contact Lustre Support by reporting a bug. See <xref linkend="dbdoclet.50438198_30989">Reporting a Lustre Bug</xref>.</para>
+        </note>
+        <note>
+          <para>Several features and packages are available that extend the core functionality of Lustre. These features/packages can be enabled at build time by passing the appropriate arguments to the configure command. For a list of these features and packages, run <literal>./configure --help</literal> in the Lustre source tree. The configs/ directory of the kernel source contains the config files matching each kernel version. Copy one to <literal>.config</literal> at the root of the kernel tree.</para>
+        </note>
+      </listitem>
+      <listitem xml:id="dbdoclet.50438210_41207">
+        <para>Create the kernel package. Navigate to the kernel source directory and run:</para>
+        <screen>$ make rpm</screen>
+        <para>Example result:</para>
+        <screen>kernel-2.6.95.0.3.EL_lustre.1.6.5.1custom-1.i686.rpm</screen>
+        <note>
+          <para><xref linkend="dbdoclet.50438210_41207">Step 3</xref> is only valid for RedHat and SuSE kernels. If you are using a stock Linux kernel, you need to get a script to create the kernel RPM.</para>
+        </note>
+      </listitem>
+      <listitem>
+        <para>Install the Lustre packages.</para>
+        <para>Some Lustre packages are installed on servers (MDS and OSSs), and others are installed on Lustre clients. For guidance on where to install specific packages, see <xref linkend="installinglustre.tab.req">TABLE 8-1</xref>, which lists the required packages and, for each package, where to install it. Depending on the selected platform, not all of the packages listed in <xref linkend="installinglustre.tab.req">TABLE 8-1</xref> need to be installed.</para>
+        <note>
+          <para>Running the patched server kernel on the clients is optional. It is not necessary unless the clients will be used for multiple purposes, for example, to run as a client and an OST.</para>
+        </note>
+        <para>Lustre packages should be installed in this order:</para>
+        <orderedlist>
+          <listitem>
+            <para>Install the kernel, modules and <literal>ldiskfs</literal> packages.</para>
+            <para>Navigate to the directory where the RPMs are stored, and use the <literal>rpm -ivh</literal> command to install the kernel, module and ldiskfs packages.</para>
+            <screen>$ rpm -ivh kernel-lustre-smp-&lt;ver&gt; \
  kernel-ib-&lt;ver&gt; \
  lustre-modules-&lt;ver&gt; \
-lustre-ldiskfs-&lt;ver&gt;
-</screen>
-        </listitem><listitem>
-      <para>Install the utilities/userspace packages.</para>
-      <para>Use the rpm -ivh command to install the utilities packages. For example:</para>
-      <screen>$ rpm -ivh lustre-&lt;ver&gt;
-</screen>
-        </listitem><listitem>
-      <para>Install the e2fsprogs package.</para>
-      <para>Make sure the e2fsprogs package is unpacked, and use the rpm -i command to install it. For example:</para>
-      <screen>$ rpm -i e2fsprogs-&lt;ver&gt;
-</screen>
-        </listitem><listitem>
-      <para>(Optional) If you want to add optional packages to your Lustre system, install them now.</para>
-        </listitem></orderedlist>
-        </listitem><listitem>
-      <para>Verify that the boot loader (grub.conf or lilo.conf) has been updated to load the patched kernel.</para>
-        </listitem><listitem>
-      <para>Reboot the patched clients and the servers.</para>
-        <orderedlist><listitem>
-      <para>If you applied the patched kernel to any clients, reboot them.</para>
-      <para>Unpatched clients do not need to be rebooted.</para>
-        </listitem><listitem>
-      <para>Reboot the servers.</para>
-      <para>Once all the machines have rebooted, the next steps are to configure Lustre Networking (LNET) and the Lustre file system. See <link xl:href="ConfiguringLustre.html#50438267_88428">Chapter 10</link>: <link xl:href="ConfiguringLustre.html#50438267_66186">Configuring Lustre</link>.</para>
-        </listitem></orderedlist>
-        </listitem></orderedlist>
-    </section>
-    <section xml:id="dbdoclet.50438210_27248">
-      <title>29.4 Installing Lustre with a Third-Party Network Stack</title>
-      <para>When using third-party network hardware, you must follow a specific process to install and recompile Lustre. This section provides an installation example, describing how to install Lustre 1.6.6 while using the Myricom MX 1.2.7 driver. The same process is used for other third-party network stacks, by replacing MX-specific references in <xref linkend="dbdoclet.50438210_12366"/> (Step 2) with the stack-specific build and using the proper --with option when configuring the Lustre source code.</para>
-        <orderedlist><listitem>
-      <para>Compile and install the Lustre kernel.</para>
-        <orderedlist><listitem>
-      <para>Install the necessary build tools.</para>
-      <para>GCC and related tools must be installed. For more information, see <link xl:href="InstallingLustre.html#50438261_37079">Required Software</link>.</para>
-      <screen>$ yum install rpm-build redhat-rpm-config
+lustre-ldiskfs-&lt;ver&gt;</screen>
+          </listitem>
+          <listitem>
+            <para>Install the utilities/userspace packages.</para>
+            <para>Use the <literal>rpm -ivh</literal> command to install the utilities packages. For example:</para>
+            <screen>$ rpm -ivh lustre-&lt;ver&gt;</screen>
+          </listitem>
+          <listitem>
+            <para>Install the <literal>e2fsprogs</literal> package.</para>
+            <para>Make sure the <literal>e2fsprogs</literal> package is unpacked, and use the <literal>rpm -i</literal> command to install it. For example:</para>
+            <screen>$ rpm -i e2fsprogs-&lt;ver&gt;</screen>
+          </listitem>
+          <listitem>
+            <para>(Optional) If you want to add optional packages to your Lustre system, install them now.</para>
+          </listitem>
+        </orderedlist>
+      </listitem>
+      <listitem>
+        <para>Verify that the boot loader (<literal>grub.conf</literal> or <literal>lilo.conf</literal>) has been updated to load the patched kernel.</para>
+      </listitem>
+      <listitem>
+        <para>Reboot the patched clients and the servers.</para>
+        <orderedlist>
+          <listitem>
+            <para>If you applied the patched kernel to any clients, reboot them.</para>
+            <para>Unpatched clients do not need to be rebooted.</para>
+          </listitem>
+          <listitem>
+            <para>Reboot the servers.</para>
+            <para>Once all the machines have rebooted, the next steps are to configure Lustre Networking (LNET) and the Lustre file system. See <xref linkend="configuringlustre">Configuring Lustre</xref>.</para>
+          </listitem>
+        </orderedlist>
+      </listitem>
+    </orderedlist>
+  </section>
+  <section xml:id="dbdoclet.50438210_27248">
+    <title>29.4 Installing Lustre with a Third-Party Network Stack</title>
+    <para>When using third-party network hardware, you must follow a specific process to install and recompile Lustre. This section provides an installation example, describing how to install Lustre 1.6.6 while using the Myricom MX 1.2.7 driver. The same process is used for other third-party network stacks, by replacing MX-specific references in <xref linkend="dbdoclet.50438210_12366"/>  with the stack-specific build and using the proper <literal>--with</literal> option when configuring the Lustre source code.</para>
+    <orderedlist>
+      <listitem xml:id="dbdoclet.50438210_12366">
+        <para>Compile and install the Lustre kernel.</para>
+        <orderedlist>
+          <listitem>
+            <para>Install the necessary build tools.</para>
+            <para>GCC and related tools must be installed. For more information, see <xref linkend="dbdoclet.50438261_37079">Required Software</xref>.</para>
+            <screen>$ yum install rpm-build redhat-rpm-config
  $ mkdir -p rpmbuild/{BUILD,RPMS,SOURCES,SPECS,SRPMS}
-$ echo &apos;%_topdir %(echo $HOME)/rpmbuild&apos; &gt; .rpmmacros
-</screen>
-        </listitem><listitem>
-      <para>Install the patched Lustre source code.</para>
-      <para>This RPM is available at the <link xl:href="http://www.oracle.com/technetwork/indexes/downloads/sun-az-index-095901.html#L">Lustre download site</link>.</para>
-      <screen>$ rpm -ivh \
-kernel-lustre-source-2.6.18-92.1.10.el5_lustre.1.6.6.x86_64.rpm
-</screen>
-        </listitem><listitem>
-      <para>Build the Linux kernel RPM.</para>
-      <screen>$ cd /usr/src/linux-2.6.18-92.1.10.el5_lustre.1.6.6
+$ echo &apos;%_topdir %(echo $HOME)/rpmbuild&apos; &gt; .rpmmacros</screen>
+          </listitem>
+          <listitem>
+            <para>Install the patched Lustre source code.</para>
+            <para>This RPM is available at the <ulink xl:href="http://build.whamcloud.com">Lustre download site</ulink>.</para>
+            <screen>$ rpm -ivh \
+kernel-lustre-source-2.6.18-92.1.10.el5_lustre.1.6.6.x86_64.rpm</screen>
+          </listitem>
+          <listitem>
+            <para>Build the Linux kernel RPM.</para>
+            <screen>$ cd /usr/src/linux-2.6.18-92.1.10.el5_lustre.1.6.6
  $ make distclean
  $ make oldconfig dep bzImage modules
  $ cp /boot/config-`uname -r` .config
@@ -260,77 +275,82 @@ $ make oldconfig || make menuconfig
  $ make include/asm
  $ make include/linux/version.h
  $ make SUBDIRS=scripts
-$ make rpm
-</screen>
-        </listitem><listitem>
-      <para>Install the Linux kernel RPM.</para>
-      <para>If you are building a set of RPMs for a cluster installation, this step is not necessary. Source RPMs are only needed on the build machine.</para>
-      <screen>$ rpm -ivh \
+$ make rpm</screen>
+          </listitem>
+          <listitem>
+            <para>Install the Linux kernel RPM.</para>
+            <para>If you are building a set of RPMs for a cluster installation, this step is not necessary. Source RPMs are only needed on the build machine.</para>
+            <screen>$ rpm -ivh \
  ~/rpmbuild/kernel-lustre-2.6.18-92.1.10.el5_\lustre.1.6.6.x86_64.rpm
-$ mkinitrd /boot/2.6.18-92.1.10.el5_lustre.1.6.6
-</screen>
-        </listitem><listitem>
-      <para>Update the boot loader (/etc/grub.conf) with the new kernel boot information.</para>
-      <screen>$ /sbin/shutdown 0 -r
-</screen>
-        </listitem></orderedlist>
-        </listitem><listitem>
-      <para><anchor xml:id="dbdoclet.50438210_12366" xreflabel=""/>Compile and install the MX stack.</para>
-      <screen>$ cd /usr/src/
+$ mkinitrd /boot/2.6.18-92.1.10.el5_lustre.1.6.6</screen>
+          </listitem>
+          <listitem>
+            <para>Update the boot loader (<literal>/etc/grub.conf</literal>) with the new kernel boot information.</para>
+            <screen>$ /sbin/shutdown 0 -r</screen>
+          </listitem>
+        </orderedlist>
+      </listitem>
+      <listitem>
+        <para>Compile and install the MX stack.</para>
+        <screen>$ cd /usr/src/
  $ gunzip mx_1.2.7.tar.gz (can be obtained from www.myri.com/scs/)
  $ tar -xvf mx_1.2.7.tar
  $ cd mx-1.2.7
  $ ln -s common include
  $ ./configure --with-kernel-lib
  $ make
-$ make install
-</screen>
-        </listitem><listitem>
-      <para>Compile and install the Lustre source code.</para>
-        <orderedlist><listitem>
-      <para>Install the Lustre source (this can be done via RPM or tarball). The source file is available at the <link xl:href="http://www.oracle.com/technetwork/indexes/downloads/sun-az-index-095901.html#L">Lustre download site</link>. This example shows installation via the tarball.</para>
-      <screen>$ cd /usr/src/
+$ make install</screen>
+      </listitem>
+      <listitem>
+        <para>Compile and install the Lustre source code.</para>
+        <orderedlist>
+          <listitem>
+            <para>Install the Lustre source (this can be done via RPM or tarball). The source file is available at the <ulink xl:href="http://git.whamcloud.com/">Lustre download site</ulink>. This example shows installation via the tarball.</para>
+            <screen>$ cd /usr/src/
  $ gunzip lustre-1.6.6.tar.gz
  $ tar -xvf lustre-1.6.6.tar
  </screen>
-        </listitem><listitem>
-      <para>Configure and build the Lustre source code.</para>
-      <para>The ./configure --help command shows a list of all of the --with options. All third-party network stacks are built in this manner.</para>
-      <screen>$ cd lustre-1.6.6
+          </listitem>
+          <listitem>
+            <para>Configure and build the Lustre source code.</para>
+            <para>The <literal>./configure --help</literal> command shows a list of all of the <literal>--with</literal> options. All third-party network stacks are built in this manner.</para>
+            <screen>$ cd lustre-1.6.6
  $ ./configure --with-linux=/usr/src/linux \
  --with-mx=/usr/src/mx-1.2.7
  $ make
-$ make rpms
-</screen>
-      <para>The make rpms command output shows the location of the generated RPMs</para>
-        </listitem></orderedlist>
-        </listitem><listitem>
-      <para>Use the rpm -ivh command to install the RPMS.</para>
-      <screen>$ rpm -ivh \
+$ make rpms</screen>
+            <para>The <literal>make rpms</literal> command output shows the location of the generated RPMs.</para>
+          </listitem>
+        </orderedlist>
+      </listitem>
+      <listitem>
+        <para>Use the <literal>rpm -ivh</literal> command to install the RPMs.</para>
+        <screen>$ rpm -ivh \
  lustre-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64.rpm
  $ rpm -ivh \
  lustre-modules-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6\
  smp.x86_64.rpm
  $ rpm -ivh \
  lustre-ldiskfs-3.0.6-2.6.18_92.1.10.el5_lustre.1.6.6\
-smp.x86_64.rpm
-</screen>
-        </listitem><listitem>
-      <para>Add the following lines to the /etc/modprobe.conf file.</para>
-      <screen>options kmxlnd hosts=/etc/hosts.mxlnd
-options lnet networks=mx0(myri0),tcp0(eth0)
-</screen>
-        </listitem><listitem>
-      <para>Populate the myri0 configuration with the proper IP addresses.</para>
-      <screen>vim /etc/sysconfig/network-scripts/myri0
-</screen>
-        </listitem><listitem>
-      <para>Add the following line to the /etc/hosts.mxlnd file.</para>
-      <screen>$ IP HOST BOARD EP_ID
-</screen>
-        </listitem><listitem>
-      <para>Start Lustre.</para>
-      <para>Once all the machines have rebooted, the next steps are to configure Lustre Networking (LNET) and the Lustre file system. See <link xl:href="ConfiguringLustre.html#50438267_88428">Chapter 10</link>: <link xl:href="InstallingLustrefrSourceCode.html#50438210_67391">Installing Lustre from Source Code</link>.</para>
-        </listitem></orderedlist>
-    </section>
+smp.x86_64.rpm</screen>
+      </listitem>
+      <listitem>
+        <para>Add the following lines to the <literal>/etc/modprobe.conf</literal> file.</para>
+        <screen>options kmxlnd hosts=/etc/hosts.mxlnd
+options lnet networks=mx0(myri0),tcp0(eth0)</screen>
+      </listitem>
+      <listitem>
+        <para>Populate the <literal>myri0</literal> configuration with the proper IP addresses.</para>
+        <screen>vim /etc/sysconfig/network-scripts/myri0</screen>
+      </listitem>
+      <listitem>
+        <para>Add the following line to the <literal>/etc/hosts.mxlnd</literal> file.</para>
+        <screen>$ IP HOST BOARD EP_ID </screen>
+      </listitem>
+      <listitem>
+        <para>Start Lustre.</para>
+        <para>Once all the machines have rebooted, the next steps are to configure Lustre Networking (LNET) and the Lustre file system. See <xref linkend="configuringlustre">Chapter 10</xref>.</para>
+      </listitem>
+    </orderedlist>
+  </section>
  </chapter>
diff --git a/LustreDebugging.xml b/LustreDebugging.xml

index 009f864..1ca61f0 100644 (file)
--- a/LustreDebugging.xml
+++ b/LustreDebugging.xml
@@ -1,722 +1,983 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<chapter version="5.0" xml:lang="en-US" xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" xml:id='lustredebugging'>
+<?xml version='1.0' encoding='UTF-8'?>
+<!-- This document was created with Syntext Serna Free. -->
+<chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="lustredebugging">
    <info>
-    <title xml:id='lustredebugging.title'>Lustre Debugging</title>
+    <title xml:id="lustredebugging.title">Lustre Debugging</title>
    </info>
-
    <para>This chapter describes tips and information to debug Lustre, and includes the following sections:</para>
-  <itemizedlist><listitem>
+  <itemizedlist>
+    <listitem>
        <para><xref linkend="dbdoclet.50438274_15874"/></para>
      </listitem>
-
-<listitem>
+    <listitem>
        <para><xref linkend="dbdoclet.50438274_23607"/></para>
      </listitem>
-
-<listitem>
+    <listitem>
        <para><xref linkend="dbdoclet.50438274_80443"/></para>
      </listitem>
-
-</itemizedlist>
-
-    <section xml:id="dbdoclet.50438274_15874">
-      <title>28.1 Diagnostic and Debugging Tools</title>
-      <para>A variety of diagnostic and analysis tools are available to debug issues with the Lustre software. Some of these are provided in Linux distributions, while others have been developed and are made available by the Lustre project.</para>
-      <section remap="h3">
-        <title>28.1.1 Lustre Debugging Tools</title>
-        <para>The following in-kernel debug mechanisms are incorporated into the Lustre software:</para>
-        <itemizedlist><listitem>
-            <para><emphasis role="bold">Debug logs</emphasis>  - A circular debug buffer to which Lustre internal debug messages are written (in contrast to error messages, which are printed to the syslog or console). Entries to the Lustre debug log are controlled by the mask set by /proc/sys/lnet/debug. The log size defaults to 5 MB per CPU but can be increased as a busy system will quickly overwrite 5 MB. When the buffer fills, the oldest information is discarded.</para>
+  </itemizedlist>
+  <section xml:id="dbdoclet.50438274_15874">
+    <title>28.1 Diagnostic and Debugging Tools</title>
+    <para>A variety of diagnostic and analysis tools are available to debug issues with the Lustre software. Some of these are provided in Linux distributions, while others have been developed and are made available by the Lustre project.</para>
+    <section remap="h3">
+      <title>28.1.1 Lustre Debugging Tools</title>
+      <para>The following in-kernel debug mechanisms are incorporated into the Lustre software:</para>
+      <itemizedlist>
+        <listitem>
+          <para><emphasis role="bold">Debug logs</emphasis>  - A circular debug buffer to which Lustre internal debug messages are written (in contrast to error messages, which are printed to the syslog or console). Entries to the Lustre debug log are controlled by the mask set by <literal>/proc/sys/lnet/debug</literal>. The log size defaults to 5 MB per CPU but can be increased as a busy system will quickly overwrite 5 MB. When the buffer fills, the oldest information is discarded.</para>
+        </listitem>
+        <listitem>
+          <para><emphasis role="bold">Debug daemon</emphasis>  - The debug daemon controls logging of debug messages.</para>
+        </listitem>
+        <listitem>
+          <para><emphasis role="bold">
+              <literal>/proc/sys/lnet/debug</literal>
+            </emphasis>  - This file contains a mask that can be used to delimit the debugging information written out to the kernel debug logs.</para>
+        </listitem>
+      </itemizedlist>
+      <para>The following tools are also provided with the Lustre software:</para>
+      <itemizedlist>
+        <listitem>
+          <para><emphasis role="bold">
+              <literal>lctl</literal>
+            </emphasis>  - This tool is used with the debug_kernel option to manually dump the Lustre debugging log or post-process debugging logs that are dumped automatically. For more information about the lctl tool, see <xref linkend="dbdoclet.50438274_62472"/> and <xref linkend="dbdoclet.50438219_38274"/>.</para>
+        </listitem>
+        <listitem>
+          <para><emphasis role="bold">Lustre subsystem asserts</emphasis>  - A panic-style assertion (LBUG) in the kernel causes Lustre to dump the debug log to the file <literal>/tmp/lustre-log.<emphasis>&lt;timestamp&gt;</emphasis></literal> where it can be retrieved after a reboot. For more information, see <xref linkend="dbdoclet.50438198_40669">Viewing Error Messages</xref>.</para>
+        </listitem>
+        <listitem>
+          <para><emphasis role="bold">
+              <literal>lfs</literal>
+            </emphasis>  - This utility provides access to the extended attributes (EAs) of a Lustre file (along with other information). For more information about lfs, see <xref linkend="dbdoclet.50438206_94597">lfs</xref>.</para>
+        </listitem>
+      </itemizedlist>
+    </section>
+    <section remap="h3">
+      <title>28.1.2 External Debugging Tools</title>
+      <para>The tools described in this section are provided in the Linux kernel or are available at an external website. For information about using some of these tools for Lustre debugging, see <xref linkend="dbdoclet.50438274_23607">Lustre Debugging Procedures</xref> and <xref linkend="dbdoclet.50438274_80443">Lustre Debugging for Developers</xref>.</para>
+      <section remap="h4">
+        <title>28.1.2.1 Tools for Administrators and Developers</title>
+        <para>Some general debugging tools provided as a part of the standard Linux distro are:</para>
+        <itemizedlist>
+          <listitem>
+            <para><emphasis role="bold">
+                <literal>strace</literal>
+              </emphasis> . This tool allows a system call to be traced.</para>
            </listitem>
-
-<listitem>
-            <para><emphasis role="bold">Debug daemon</emphasis>  - The debug daemon controls logging of debug messages.</para>
+          <listitem>
+            <para><emphasis role="bold">
+                <literal>/var/log/messages</literal>
+              </emphasis> . <literal>syslogd</literal> prints fatal or serious messages to this log.</para>
            </listitem>
-
-<listitem>
-            <para><emphasis role="bold">/proc/sys/lnet/debug</emphasis>  - This file contains a mask that can be used to delimit the debugging information written out to the kernel debug logs.</para>
+          <listitem>
+            <para><emphasis role="bold">Crash dumps</emphasis> . On crash-dump enabled kernels, sysrq c produces a crash dump. Lustre enhances this crash dump with a log dump (the last 64 KB of the log) to the console.</para>
            </listitem>
-
-</itemizedlist>
-        <para>The following tools are also provided with the Lustre software:</para>
-        <itemizedlist><listitem>
-                <para><emphasis role="bold">lctl</emphasis>  - This tool is used with the debug_kernel option to manually dump the Lustre debugging log or post-process debugging logs that are dumped automatically. For more information about the lctl tool, see <xref linkend="dbdoclet.50438274_62472"/> and <xref linkend="systemconfigurationutilities"/>(lctl).</para>
+          <listitem>
+            <para><emphasis role="bold">
+                <literal>debugfs</literal>
+              </emphasis>. Interactive file system debugger.</para>
            </listitem>
-
-<listitem>
-            <para><emphasis role="bold">Lustre subsystem asserts</emphasis>  - A panic-style assertion (LBUG) in the kernel causes Lustre to dump the debug log to the file /tmp/lustre-log.<emphasis>&lt;timestamp&gt;</emphasis> where it can be retrieved after a reboot. For more information, see <link xl:href="LustreTroubleshooting.html#50438198_40669">Viewing Error Messages</link>.</para>
+        </itemizedlist>
+        <para>The following logging and data collection tools can be used to collect information for debugging Lustre kernel issues:</para>
+        <itemizedlist>
+          <listitem>
+            <para><emphasis role="bold">
+                <literal>kdump</literal>
+              </emphasis> . A Linux kernel crash utility useful for debugging a system running Red Hat Enterprise Linux. For more information about <literal>kdump</literal>, see the Red Hat knowledge base article <ulink xl:href="http://kbase.redhat.com/faq/docs/DOC-6039">How do I configure kexec/kdump on Red Hat Enterprise Linux 5?</ulink>. To download <literal>kdump</literal>, go to the <ulink xl:href="http://fedoraproject.org/wiki/SystemConfig/kdump#Download">Fedora Project Download</ulink> site.</para>
            </listitem>
-
-<listitem>
-            <para><emphasis role="bold">lfs</emphasis>  - This utility provides access to the extended attributes (EAs) of a Lustre file (along with other information). For more inforamtion about lfs, see <link xl:href="UserUtilities.html#50438206_94597">lfs</link>.</para>
+          <listitem>
+            <para><emphasis role="bold">
+                <literal>netconsole</literal>
+              </emphasis>. Enables kernel-level network logging over UDP. A system request (SysRq) allows users to collect relevant data through <literal>netconsole</literal>.</para>
            </listitem>
-
-</itemizedlist>
-      </section>
-      <section remap="h3">
-        <title>28.1.2 External Debugging Tools</title>
-        <para>The tools described in this section are provided in the Linux kernel or are available at an external website. For information about using some of these tools for Lustre debugging, see <link xl:href="LustreDebugging.html#50438274_23607">Lustre Debugging Procedures</link> and <link xl:href="LustreDebugging.html#50438274_80443">Lustre Debugging for Developers</link>.</para>
-        <section remap="h4">
-          <title>28.1.2.1 Tools for Administrators and Developers</title>
-          <para>Some general debugging tools provided as a part of the standard Linux distro are:</para>
-          <itemizedlist><listitem>
-              <para><emphasis role="bold">strace</emphasis> . This tool allows a system call to be traced.</para>
-            </listitem>
-
-<listitem>
-              <para><emphasis role="bold">/var/log/messages</emphasis> . syslogd prints fatal or serious messages at this log.</para>
-            </listitem>
-
-<listitem>
-              <para><emphasis role="bold">Crash dumps</emphasis> . On crash-dump enabled kernels, sysrq c produces a crash dump. Lustre enhances this crash dump with a log dump (the last 64 KB of the log) to the console.</para>
-            </listitem>
-
-<listitem>
-              <para><emphasis role="bold">debugfs</emphasis> . Interactive file system debugger.</para>
-            </listitem>
-
-</itemizedlist>
-          <para>The following logging and data collection tools can be used to collect information for debugging Lustre kernel issues:</para>
-          <itemizedlist><listitem>
-              <para><emphasis role="bold">kdump</emphasis> . A Linux kernel crash utility useful for debugging a system running Red Hat Enterprise Linux. For more information about kdump, see the Red Hat knowledge base article <link xl:href="http://kbase.redhat.com/faq/docs/DOC-6039">How do I configure kexec/kdump on Red Hat Enterprise Linux 5?</link>. To download kdump, go to the <link xl:href="http://fedoraproject.org/wiki/SystemConfig/kdump#Download">Fedora Project Download</link> site.</para>
-            </listitem>
-
-<listitem>
-              <para><emphasis role="bold">netconsole</emphasis> . Enables kernel-level network logging over UDP. A system requires (SysRq) allows users to collect relevant data through netconsole.</para>
-            </listitem>
-
-<listitem>
-              <para><emphasis role="bold">netdump</emphasis> . A crash dump utility from Red Hat that allows memory images to be dumped over a network to a central server for analysis. The netdump utility was replaced by kdump in RHEL 5. For more information about netdump, see <link xl:href="http://www.redhat.com/support/wpapers/redhat/netdump/">Red Hat, Inc.&apos;s Network Console and Crash Dump Facility</link>.</para>
-            </listitem>
-
-</itemizedlist>
-        </section>
-        <section remap="h4">
-          <title>28.1.2.2 Tools for Developers</title>
-          <para>The tools described in this section may be useful for debugging Lustre in a development environment.</para>
-          <para>Of general interest is:</para>
-          <itemizedlist><listitem>
-              <para><emphasis role="bold">leak_finder.pl</emphasis> . This program provided with Lustre is useful for finding memory leaks in the code.</para>
-            </listitem>
-
-</itemizedlist>
-          <para>A virtual machine is often used to create an isolated development and test environment. Some commonly-used virtual machines are:</para>
-          <itemizedlist><listitem>
-              <para><emphasis role="bold">VirtualBox Open Source Edition</emphasis> . Provides enterprise-class virtualization capability for all major platforms and is available free at <link xl:href="http://www.sun.com/software/products/virtualbox/get.jsp?intcmp=2945">Get Sun VirtualBox</link>.</para>
-            </listitem>
-
-<listitem>
-              <para><emphasis role="bold">VMware Server</emphasis> . Virtualization platform available as free introductory software at <link xl:href="http://downloads.vmware.com/d/info/datacenter_downloads/vmware_server/2_0">Download VMware Server</link>.</para>
-            </listitem>
-
-<listitem>
-              <para><emphasis role="bold">Xen</emphasis> . A para-virtualized environment with virtualization capabilities similar to VMware Server and Virtual Box. However, Xen allows the use of modified kernels to provide near-native performance and the ability to emulate shared storage. For more information, go to <link xl:href="http://xen.org/">xen.org</link>.</para>
-            </listitem>
-
-</itemizedlist>
-          <para>A variety of debuggers and analysis tools are available including:</para>
-          <itemizedlist><listitem>
-              <para><emphasis role="bold">kgdb</emphasis> . The Linux Kernel Source Level Debugger kgdb is used in conjunction with the GNU Debugger gdb for debugging the Linux kernel. For more information about using kgdb with gdb, see <link xl:href="http://www.linuxtopia.org/online_books/redhat_linux_debugging_with_gdb/running.html">Chapter 6. Running Programs Under gdb</link> in the <emphasis>Red Hat Linux 4 Debugging with GDB</emphasis> guide.</para>
-            </listitem>
-
-<listitem>
-              <para><emphasis role="bold">crash</emphasis> . Used to analyze saved crash dump data when a system had panicked or locked up or appears unresponsive. For more information about using crash to analyze a crash dump, see:</para>
-              <itemizedlist><listitem>
-                  <para> Red Hat Magazine article: <link xl:href="http://magazine.redhat.com/2007/08/15/a-quick-overview-of-linux-kernel-crash-dump-analysis/">A quick overview of Linux kernel crash dump analysis</link></para>
-                </listitem>
-
-<listitem>
-                  <para><link xl:href="http://people.redhat.com/anderson/crash_whitepaper/#EXAMPLES">Crash Usage: A Case Study</link>  from the white paper <emphasis>Red Hat Crash Utility</emphasis> by David Anderson</para>
-                </listitem>
-
-<listitem>
-                  <para> Kernel Trap forum entry: <link xl:href="http://kerneltrap.org/node/5758">Linux: Kernel Crash Dumps</link></para>
-                </listitem>
-
-<listitem>
-                  <para> White paper: <link xl:href="http://www.google.com/url?sa=t&amp;source=web&amp;ct=res&amp;cd=8&amp;ved=0CCUQFjAH&amp;url=http%3A%2F%2Fwww.kernel.sg%2Fpapers%2Fcrash-dump-analysis.pdf&amp;rct=j&amp;q=redhat+crash+dump&amp;ei=6aQBS-ifK4T8tAPcjdiHCw&amp;usg=AFQjCNEk03E3GDtAsawG3gfpwc1gGNELAg">A Quick Overview of Linux Kernel Crash Dump Analysis</link></para>
-                </listitem>
-
-</itemizedlist>
-            </listitem>
-</itemizedlist>
-        </section>
-      </section>
-    </section>
-    <section xml:id="dbdoclet.50438274_23607">
-      <title>28.2 Lustre Debugging Procedures</title>
-      <para>The procedures below may be useful to administrators or developers debugging a Lustre files system.</para>
-      <section remap="h3">
-        <title>28.2.1 Understanding the Lustre Debug Messaging Format</title>
-        <para>Lustre debug messages are categorized by originating sybsystem, message type, and locaton in the source code. For a list of subsystems and message types, see <xref linkend="dbdoclet.50438274_57603"/>.</para>
-                <note><para>For a current list of subsystems and debug message types, see lnet/include/libcfs/libcfs.h in the Lustre tree</para></note>
-                <para>The elements of a Lustre debug message are described in <xref linkend="dbdoclet.50438274_57177"/>Format of Lustre Debug Messages.</para>
-        <section remap="h4">
-          <title>28.2.1.1 <anchor xml:id="dbdoclet.50438274_57603" xreflabel=""/>Lustre <anchor xml:id="dbdoclet.50438274_marker-1295746" xreflabel=""/>Debug Messages</title>
-          <para>Each Lustre debug message has the tag of the subsystem it originated in, the message type, and the location in the source code. The subsystems and debug types used in Lustre are as follows:</para>
-          <itemizedlist><listitem>
-              <para>  Standard Subsystems:</para>
-            </listitem>
-
-</itemizedlist>
-          <para> mdc, mds, osc, ost, obdclass, obdfilter, llite, ptlrpc, portals, lnd, ldlm, lov</para>
-          <itemizedlist><listitem>
-              <para>  Debug Types:</para>
-            </listitem>
-<listitem>
-              <para><informaltable frame="all">
-                  <tgroup cols="2">
-                    <colspec colname="c1" colwidth="50*"/>
-                    <colspec colname="c2" colwidth="50*"/>
-                    <thead>
-                      <row>
-                        <entry><para><emphasis role="bold">Types</emphasis></para></entry>
-                        <entry><para><emphasis role="bold">Description</emphasis></para></entry>
-                      </row>
-                    </thead>
-                    <tbody>
-                      <row>
-                        <entry><para> <emphasis role="bold">trace</emphasis></para></entry>
-                        <entry><para> Entry/Exit markers</para></entry>
-                      </row>
-                      <row>
-                        <entry><para> <emphasis role="bold">dlmtrace</emphasis></para></entry>
-                        <entry><para> Locking-related information</para></entry>
-                      </row>
-                      <row>
-                        <entry><para> <emphasis role="bold">inode</emphasis></para></entry>
-                        <entry><para>  </para></entry>
-                      </row>
-                      <row>
-                        <entry><para> <emphasis role="bold">super</emphasis></para></entry>
-                        <entry><para>  </para></entry>
-                      </row>
-                      <row>
-                        <entry><para> <emphasis role="bold">ext2</emphasis></para></entry>
-                        <entry><para> Anything from the ext2_debug</para></entry>
-                      </row>
-                      <row>
-                        <entry><para> <emphasis role="bold">malloc</emphasis></para></entry>
-                        <entry><para> Print malloc or free information</para></entry>
-                      </row>
-                      <row>
-                        <entry><para> <emphasis role="bold">cache</emphasis></para></entry>
-                        <entry><para> Cache-related information</para></entry>
-                      </row>
-                      <row>
-                        <entry><para> <emphasis role="bold">info</emphasis></para></entry>
-                        <entry><para> General information</para></entry>
-                      </row>
-                      <row>
-                        <entry><para> <emphasis role="bold">ioctl</emphasis></para></entry>
-                        <entry><para> IOCTL-related information</para></entry>
-                      </row>
-                      <row>
-                        <entry><para> <emphasis role="bold">blocks</emphasis></para></entry>
-                        <entry><para> Ext2 block allocation information</para></entry>
-                      </row>
-                      <row>
-                        <entry><para> <emphasis role="bold">net</emphasis></para></entry>
-                        <entry><para> Networking</para></entry>
-                      </row>
-                      <row>
-                        <entry><para> <emphasis role="bold">warning</emphasis></para></entry>
-                        <entry><para>  </para></entry>
-                      </row>
-                      <row>
-                        <entry><para> <emphasis role="bold">buffs</emphasis></para></entry>
-                        <entry><para>  </para></entry>
-                      </row>
-                      <row>
-                        <entry><para> <emphasis role="bold">other</emphasis></para></entry>
-                        <entry><para>  </para></entry>
-                      </row>
-                      <row>
-                        <entry><para> <emphasis role="bold">dentry</emphasis></para></entry>
-                        <entry><para>  </para></entry>
-                      </row>
-                      <row>
-                        <entry><para> <emphasis role="bold">portals</emphasis></para></entry>
-                        <entry><para> Entry/Exit markers</para></entry>
-                      </row>
-                      <row>
-                        <entry><para> <emphasis role="bold">page</emphasis></para></entry>
-                        <entry><para> Bulk page handling</para></entry>
-                      </row>
-                      <row>
-                        <entry><para> <emphasis role="bold">error</emphasis></para></entry>
-                        <entry><para> Error messages</para></entry>
-                      </row>
-                      <row>
-                        <entry><para> <emphasis role="bold">emerg</emphasis></para></entry>
-                        <entry><para>  </para></entry>
-                      </row>
-                      <row>
-                        <entry><para> <emphasis role="bold">rpctrace</emphasis></para></entry>
-                        <entry><para> For distributed debugging</para></entry>
-                      </row>
-                      <row>
-                        <entry><para> <emphasis role="bold">ha</emphasis></para></entry>
-                        <entry><para> Failover and recovery-related information</para></entry>
-                      </row>
-                    </tbody>
-                  </tgroup>
-                </informaltable>
-</para>
-            </listitem>
-</itemizedlist>
-        </section>
-        <section remap="h4">
-          <title>28.2.1.2 <anchor xml:id="dbdoclet.50438274_57177" xreflabel=""/>Format of Lustre Debug Messages</title>
-          <para>Lustre uses the CDEBUG and CERROR macros to print the debug or error messages. To print the message, the CDEBUG macro uses portals_debug_msg (portals/linux/oslib/debug.c). The message format is described below, along with an example.</para>
-          <informaltable frame="all">
-            <tgroup cols="2">
-              <colspec colname="c1" colwidth="50*"/>
-              <colspec colname="c2" colwidth="50*"/>
-              <thead>
-                <row>
-                  <entry><para><emphasis role="bold">Parameter</emphasis></para></entry>
-                  <entry><para><emphasis role="bold">Description</emphasis></para></entry>
-                </row>
-              </thead>
-              <tbody>
-                <row>
-                  <entry><para> <emphasis role="bold">subsystem</emphasis></para></entry>
-                  <entry><para> 800000</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">debug mask</emphasis></para></entry>
-                  <entry><para> 000010</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">smp_processor_id</emphasis></para></entry>
-                  <entry><para> 0</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">sec.used</emphasis></para></entry>
-                  <entry><para> 10818808</para><para> 47.677302</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">stack size</emphasis></para></entry>
-                  <entry><para> 1204:</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">pid</emphasis></para></entry>
-                  <entry><para> 2973:</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">host pid (if uml) or zero</emphasis></para></entry>
-                  <entry><para> 31070:</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">(file:line #:functional())</emphasis></para></entry>
-                  <entry><para> (as_dev.c:144:create_write_buffers())</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">debug message</emphasis></para></entry>
-                  <entry><para> kmalloced &apos;*obj&apos;: 24 at a375571c (tot 17447717)</para></entry>
-                </row>
-              </tbody>
-            </tgroup>
-          </informaltable>
-        </section>
-        <section remap="h4">
-          <title>28.2.1.3 Lustre <anchor xml:id="dbdoclet.50438274_marker-1295885" xreflabel=""/>Debug Messages Buffer</title>
-          <para>Lustre debug messages are maintained in a buffer, with the maximum buffer size specified (in MBs) by the debug_mb parameter (/proc/sys/lnet/debug_mb). The buffer is circular, so debug messages are kept until the allocated buffer limit is reached, and then the first messages are overwritten.</para>
-        </section>
+          <listitem>
+            <para><emphasis role="bold">
+                <literal>netdump</literal>
+              </emphasis>. A crash dump utility from Red Hat that allows memory images to be dumped over a network to a central server for analysis. The <literal>netdump</literal> utility was replaced by <literal>kdump</literal> in RHEL 5. For more information about <literal>netdump</literal>, see <ulink xl:href="http://www.redhat.com/support/wpapers/redhat/netdump/">Red Hat, Inc.&apos;s Network Console and Crash Dump Facility</ulink>.</para>
+          </listitem>
+        </itemizedlist>
        </section>
-      <section remap="h3">
-        <title>28.2.2 <anchor xml:id="dbdoclet.50438274_62472" xreflabel=""/>Using the lctl Tool to View Debug Messages</title>
-        <para>The lctl tool allows debug messages to be filtered based on subsystems and message types to extract information useful for troubleshooting from a kernel debug log. For a command reference, see <link xl:href="SystemConfigurationUtilities.html#50438219_38274">lctl</link>.</para>
-        <para>You can use lctl to:</para>
-        <itemizedlist><listitem>
-            <para> Obtain a list of all the types and subsystems:</para>
+      <section remap="h4">
+        <title>28.1.2.2 Tools for Developers</title>
+        <para>The tools described in this section may be useful for debugging Lustre in a development environment.</para>
+        <para>Of general interest is:</para>
+        <itemizedlist>
+          <listitem>
+            <para><literal>
+                <emphasis role="bold">leak_finder.pl</emphasis>
+              </literal>. This program, provided with Lustre, is useful for finding memory leaks in the code.</para>
            </listitem>
-
-</itemizedlist>
-        <screen>lctl &gt; debug_list <emphasis>&lt;subs | types&gt;</emphasis></screen>
-        <itemizedlist><listitem>
-            <para> Filter the debug log:</para>
+        </itemizedlist>
+        <para>A virtual machine is often used to create an isolated development and test environment. Some commonly used virtual machines are:</para>
+        <itemizedlist>
+          <listitem>
+            <para><emphasis role="bold">VirtualBox Open Source Edition</emphasis>. Provides enterprise-class virtualization capability for all major platforms and is available free at <ulink xl:href="http://www.sun.com/software/products/virtualbox/get.jsp?intcmp=2945">Get Sun VirtualBox</ulink>.</para>
            </listitem>
-
-</itemizedlist>
-        <screen>lctl &gt; filter <emphasis>&lt;subsystem name | debug type&gt;</emphasis></screen>
-                <note><para>When lctl filters, it removes unwanted lines from the displayed output. This does not affect the contents of the debug log in the kernel&apos;s memory. As a result, you can print the log many times with different filtering levels without worrying about losing data.</para></note>
-
-        <itemizedlist><listitem>
-            <para> Show debug messages belonging to certain subsystem or type:</para>
+          <listitem>
+            <para><emphasis role="bold">VMware Server</emphasis>. Virtualization platform available as free introductory software at <ulink xl:href="http://downloads.vmware.com/d/info/datacenter_downloads/vmware_server/2_0">Download VMware Server</ulink>.</para>
            </listitem>
-
-</itemizedlist>
-        <screen>lctl &gt; show <emphasis>&lt;subsystem name | debug type&gt;</emphasis></screen>
-        <para>debug_kernel pulls the data from the kernel logs, filters it appropriately, and displays or saves it as per the specified options</para>
-        <screen>lctl &gt; debug_kernel [<emphasis>output filename</emphasis>]
-</screen>
-        <para>If the debugging is being done on User Mode Linux (UML), it might be useful to save the logs on the host machine so that they can be used at a later time.</para>
-        <itemizedlist><listitem>
-            <para> Filter a log on disk, if you already have a debug log saved to disk (likely from a crash):</para>
+          <listitem>
+            <para><emphasis role="bold">Xen</emphasis>. A para-virtualized environment with virtualization capabilities similar to VMware Server and VirtualBox. However, Xen allows the use of modified kernels to provide near-native performance and the ability to emulate shared storage. For more information, go to <ulink xl:href="http://xen.org/">xen.org</ulink>.</para>
            </listitem>
-
-</itemizedlist>
-        <screen>lctl &gt; debug_file <emphasis>&lt;input filename&gt;</emphasis> [<emphasis>output filename</emphasis>] 
-</screen>
-        <para>During the debug session, you can add markers or breaks to the log for any reason:</para>
-        <screen>lctl &gt; mark [marker text] 
-</screen>
-        <para>The marker text defaults to the current date and time in the debug log (similar to the example shown below):</para>
-        <screen>DEBUG MARKER: Tue Mar 5 16:06:44 EST 2002 
-</screen>
-        <itemizedlist><listitem>
-            <para> Completely flush the kernel debug buffer:</para>
+        </itemizedlist>
+        <para>A variety of debuggers and analysis tools are available, including:</para>
+        <itemizedlist>
+          <listitem>
+            <para><emphasis role="bold">
+                <literal>kgdb</literal>
+              </emphasis>. The Linux Kernel Source Level Debugger <literal>kgdb</literal> is used in conjunction with the GNU Debugger <literal>gdb</literal> to debug the Linux kernel. For more information about using <literal>kgdb</literal> with <literal>gdb</literal>, see <ulink xl:href="http://www.linuxtopia.org/online_books/redhat_linux_debugging_with_gdb/running.html">Chapter 6. Running Programs Under gdb</ulink> in the <emphasis>Red Hat Linux 4 Debugging with GDB</emphasis> guide.</para>
            </listitem>
-
-</itemizedlist>
-        <screen>lctl &gt; clear
-</screen>
-                <note><para>Debug messages displayed with lctl are also subject to the kernel debug masks; the filters are additive.</para></note>
-        <section remap="h4">
-          <title>28.2.2.1 Sample lctl<anchor xml:id="dbdoclet.50438274_marker-1295914" xreflabel=""/>Run</title>
-          <para>Below is a sample run using the lctl command.</para>
-          <screen>bash-2.04# ./lctl 
-lctl &gt; debug_kernel /tmp/lustre_logs/log_all 
-Debug log: 324 lines, 324 kept, 0 dropped. 
-lctl &gt; filter trace 
-Disabling output of type &quot;trace&quot; 
-lctl &gt; debug_kernel /tmp/lustre_logs/log_notrace 
-Debug log: 324 lines, 282 kept, 42 dropped. 
-lctl &gt; show trace 
-Enabling output of type &quot;trace&quot; 
-lctl &gt; filter portals 
-Disabling output from subsystem &quot;portals&quot; 
-lctl &gt; debug_kernel /tmp/lustre_logs/log_noportals 
-Debug log: 324 lines, 258 kept, 66 dropped. 
-</screen>
-        </section>
-      </section>
-      <section remap="h3">
-        <title>28.2.3 Dumping the Buffer to a File (debug_daemon)</title>
-        <para>The debug_daemon option is used by lctl to control the dumping of the debug_kernel buffer to a user-specified file. This functionality uses a kernel thread on top of debug_kernel, which works in parallel with the debug_daemon command.</para>
-        <para>The debug_daemon is highly dependent on file system write speed. File system write operations may not be fast enough to flush out all of the debug_buffer if the Lustre file system is under heavy system load and continues to CDEBUG to the debug_buffer. The debug_daemon will write the message DEBUG MARKER: Trace buffer full into the debug_buffer to indicate the debug_buffer contents are overlapping before the debug_daemon flushes data to a file.</para>
-        <para>Users can use lctlcontrol to start or stop the Lustre daemon from dumping the debug_buffer to a file. Users can also temporarily hold daemon from dumping the file. Use of the debug_daemon sub-command to lctl can provide the same function.</para>
-        <section remap="h4">
-          <title>28.2.3.1 lctldebug_daemon Commands</title>
-          <para>This section describes lctldebug_daemon commands.</para>
-          <para>To initiate the debug_daemon to start dumping debug_buffer into a file., enter</para>
-          <screen>$ lctl debug_daemon start [{file} {megabytes}]
-</screen>
-          <para>The file can be a system default file, as shown in /proc/sys/lnet/debug_path. After Lustre starts, the default path is /tmp/lustre-log-$HOSTNAME. Users can specify a new filename for debug_daemon to output debug_buffer. The new file name shows up in /proc/sys/lnet/debug_path. Megabytes is the limitation of the file size in MBs.</para>
-          <para>The daemon wraps around and dumps data to the beginning of the file when the output file size is over the limit of the user-specified file size. To decode the dumped file to ASCII and order the log entries by time, run:</para>
-          <screen>lctl debug_file {file} &gt; {newfile}
-</screen>
-          <para>The output is internally sorted by the lctl command using quicksort.</para>
-          <para>To completely shut down the debug_daemon operation and flush the file output, enter:</para>
-          <screen>debug_daemon stop
-</screen>
-          <para>Otherwise, debug_daemon is shut down as part of the Lustre file system shutdown process. Users can restart debug_daemon by using start command after each stop command issued.</para>
-          <para>This is an example using debug_daemon with the interactive mode of lctl to dump debug logs to a 10 MB file.</para>
-          <screen>#~/utils/lctl
-</screen>
-          <para>To start the daemon to dump debug_buffer into a 40 MB /tmp/dump file, enter:</para>
-          <screen>lctl &gt; debug_daemon start /trace/log 40 
-</screen>
-          <para>To completely shut down the daemon, enter:</para>
-          <screen>lctl &gt; debug_daemon stop 
-</screen>
-          <para>To start another daemon with an unlimited file size, enter:</para>
-          <screen>lctl &gt; debug_daemon start /tmp/unlimited 
-</screen>
-          <para>The text message *** End of debug_daemon trace log *** appears at the end of each output file.</para>
-        </section>
-      </section>
-      <section remap="h3">
-        <title>28.2.4 Controlling Information Written to the Kernel <anchor xml:id="dbdoclet.50438274_marker-1295955" xreflabel=""/>Debug Log</title>
-        <para>Masks are provided in /proc/sys/lnet/subsystem_debug and /proc/sys/lnet/debug to be used with the systctl command to determine what information is to be written to the debug log. The subsystem_debug mask determines the information written to the log based on the subsystem (such as iobdfilter, net, portals, or OSC). The debug mask controls information based on debug type (such as info, error, trace, or alloc).</para>
-        <para>To turn off Lustre debugging completely:</para>
-        <screen>sysctl -w lnet.debug=0 
-</screen>
-        <para>To turn on full Lustre debugging:</para>
-        <screen>sysctl -w lnet.debug=-1 
-</screen>
-        <para>To turn on logging of messages related to network communications:</para>
-        <screen>sysctl -w lnet.debug=net 
-</screen>
-        <para>To turn on logging of messages related to network communications and existing debug flags:</para>
-        <screen>sysctl -w lnet.debug=+net 
-</screen>
-        <para>To turn off network logging with changing existing flags:</para>
-        <screen>sysctl -w lnet.debug=-net 
-</screen>
-        <para>The various options available to print to kernel debug logs are listed in lnet/include/libcfs/libcfs.h</para>
-      </section>
-      <section remap="h3">
-        <title>28.2.5 <anchor xml:id="dbdoclet.50438274_26909" xreflabel=""/>Troubleshooting with strace<anchor xml:id="dbdoclet.50438274_marker-1295969" xreflabel=""/></title>
-        <para>The strace utility provided with the Linux distribution enables system calls to be traced by intercepting all the system calls made by a process and recording the system call name, aruguments, and return values.</para>
-        <para>To invoke strace on a program, enter:</para>
-        <screen>$ strace <emphasis>&lt;program&gt; &lt;args&gt;</emphasis> 
-</screen>
-        <para>Sometimes, a system call may fork child processes. In this situation, use the -f option of strace to trace the child processes:</para>
-        <screen>$ strace -f <emphasis>&lt;program&gt; &lt;args&gt;</emphasis> 
-</screen>
-        <para>To redirect the strace output to a file, enter:</para>
-        <screen>$ strace -o <emphasis>&lt;filename&gt; &lt;program&gt; &lt;args&gt;</emphasis> 
-</screen>
-        <para>Use the -ff option, along with -o, to save the trace output in filename.pid, where pid is the process ID of the process being traced. Use the -ttt option to timestamp all lines in the strace output, so they can be correlated to operations in the lustre kernel debug log.</para>
-        <para>If the debugging is done in UML, save the traces on the host machine. In this example, hostfs is mounted on /r:</para>
-        <screen>$ strace -o /r/tmp/vi.strace 
-</screen>
-      </section>
-      <section remap="h3">
-        <title>28.2.6 <anchor xml:id="dbdoclet.50438274_54455" xreflabel=""/>Looking at Disk <anchor xml:id="dbdoclet.50438274_marker-1295982" xreflabel=""/>Content</title>
-        <para>In Lustre, the inodes on the metadata server contain extended attributes (EAs) that store information about file striping. EAs contain a list of all object IDs and their locations (that is, the OST that stores them). The lfs tool can be used to obtain this information for a given file using the getstripe subcommand. Use a corresponding lfs setstripe command to specify striping attributes for a new file or directory.</para>
-        <para>The lfsgetstripe utility is written in C; it takes a Lustre filename as input and lists all the objects that form a part of this file. To obtain this information for the file /mnt/lustre/frog in Lustre file system, run:</para>
-        <screen>$ lfs getstripe /mnt/lustre/frog
-$
-   obdix                           objid
-   0                               17
-   1                               4
-</screen>
-        <para>The debugfs tool is provided in the e2fsprogs package. It can be used for interactive debugging of an ldiskfs file system. The debugfs tool can either be used to check status or modify information in the file system. In Lustre, all objects that belong to a file are stored in an underlying ldiskfs file system on the OSTs. The file system uses the object IDs as the file names. Once the object IDs are known, use the debugfs tool to obtain the attributes of all objects from different OSTs.</para>
-        <para>A sample run for the /mnt/lustre/frog file used in the above example is shown here:</para>
-        <screen>     $ debugfs -c /tmp/ost1
-   debugfs: cd O
-   debugfs: cd 0                                   /* for files in group 0 \
-*/
-   debugfs: cd d&lt;objid % 32&gt;
-   debugfs: stat &lt;objid&gt;                             /* for getattr on object\
- */
-   debugfs: quit
-## Suppose object id is 36, then follow the steps below:
-   $ debugfs /tmp/ost1
-   debugfs: cd O
-   debugfs: cd 0
-   debugfs: cd d4                                  /* objid % 32 */
-   debugfs: stat 36                                /* for getattr on obj 4*\
-/
-   debugfs: dump 36 /tmp/obj.36                    /* dump contents of obj \
-4 */
-   debugfs: quit
-</screen>
-      </section>
-      <section remap="h3">
-        <title>28.2.7 Finding the Lustre <anchor xml:id="dbdoclet.50438274_marker-1296007" xreflabel=""/>UUID of an OST</title>
-        <para>To determine the Lustre UUID of an obdfilter disk (for example, if you mix up the cables on your OST devices or the SCSI bus numbering suddenly changes and the SCSI devices get new names), use debugfs to get the last_rcvd file.</para>
-      </section>
-      <section remap="h3">
-        <title>28.2.8 Printing Debug Messages to the Console</title>
-        <para>To dump debug messages to the console (/var/log/messages), set the corresponding debug mask in the printk flag:</para>
-        <screen>sysctl -w lnet.printk=-1 
-</screen>
-        <para>This slows down the system dramatically. It is also possible to selectively enable or disable this capability for particular flags using:</para>
-        <screen>sysctl -w lnet.printk=+vfstrace 
-sysctl -w lnet.printk=-vfstrace 
-</screen>
-        <para>It is possible to disable warning, error , and console messages, though it is strongly recommended to have something like lctldebug_daemon runing to capture this data to a local file system for debugging purposes.</para>
-      </section>
-      <section remap="h3">
-        <title>28.2.9 Tracing <anchor xml:id="dbdoclet.50438274_marker-1296017" xreflabel=""/>Lock Traffic</title>
-        <para>Lustre has a specific debug type category for tracing lock traffic. Use:</para>
-        <screen>lctl&gt; filter all_types 
-lctl&gt; show dlmtrace 
-lctl&gt; debug_kernel [filename] 
-</screen>
+          <listitem>
+            <para><emphasis role="bold">
+                <literal>crash</literal>
+              </emphasis>. Used to analyze saved crash dump data when a system has panicked, locked up, or appears unresponsive. For more information about using <literal>crash</literal> to analyze a crash dump, see:</para>
+            <itemizedlist>
+              <listitem>
+                <para> Red Hat Magazine article: <ulink xl:href="http://magazine.redhat.com/2007/08/15/a-quick-overview-of-linux-kernel-crash-dump-analysis/">A quick overview of Linux kernel crash dump analysis</ulink></para>
+              </listitem>
+              <listitem>
+                <para><ulink xl:href="http://people.redhat.com/anderson/crash_whitepaper/#EXAMPLES">Crash Usage: A Case Study</ulink>  from the white paper <emphasis>Red Hat Crash Utility</emphasis> by David Anderson</para>
+              </listitem>
+              <listitem>
+                <para> Kernel Trap forum entry: <ulink xl:href="http://kerneltrap.org/node/5758">Linux: Kernel Crash Dumps</ulink></para>
+              </listitem>
+              <listitem>
+                <para> White paper: <ulink xl:href="http://www.kernel.sg/papers/crash-dump-analysis.pdf">A Quick Overview of Linux Kernel Crash Dump Analysis</ulink></para>
+              </listitem>
+            </itemizedlist>
+          </listitem>
+        </itemizedlist>
        </section>
      </section>
-    <section xml:id="dbdoclet.50438274_80443">
-      <title>28.3 Lustre Debugging for Developers</title>
-      <para>The procedures in this section may be useful to developers debugging Lustre code.</para>
-      <section remap="h3">
-        <title>28.3.1 Adding Debugging to the <anchor xml:id="dbdoclet.50438274_marker-1296026" xreflabel=""/>Lustre Source Code</title>
-        <para>The debugging infrastructure provides a number of macros that can be used in Lustre source code to aid in debugging or reporting serious errors.</para>
-        <para>To use these macros, you will need to set the DEBUG_SUBSYSTEM variable at the top of the file as shown below:</para>
-        <screen>#define DEBUG_SUBSYSTEM S_PORTALS
-</screen>
-        <para>A list of available macros with descritions is provided in the table below.</para>
+  </section>
+  <section xml:id="dbdoclet.50438274_23607">
+    <title>28.2 Lustre Debugging Procedures</title>
+    <para>The procedures below may be useful to administrators or developers debugging a Lustre file system.</para>
+    <section remap="h3">
+      <title>28.2.1 Understanding the Lustre Debug Messaging Format</title>
+      <para>Lustre debug messages are categorized by originating subsystem, message type, and location in the source code. For a list of subsystems and message types, see <xref linkend="dbdoclet.50438274_57603"/>.</para>
+      <note>
+        <para>For a current list of subsystems and debug message types, see <literal>lnet/include/libcfs/libcfs.h</literal> in the Lustre tree.</para>
+      </note>
+      <para>The elements of a Lustre debug message are described in <xref linkend="dbdoclet.50438274_57177"/> Format of Lustre Debug Messages.</para>
+      <section xml:id="dbdoclet.50438274_57603">
+        <title>28.2.1.1 Lustre Debug Messages</title>
+        <para>Each Lustre debug message has the tag of the subsystem it originated in, the message type, and the location in the source code. The subsystems and debug types used in Lustre are as follows:</para>
+        <itemizedlist>
+          <listitem>
+            <para>  Standard Subsystems:</para>
+            <para> mdc, mds, osc, ost, obdclass, obdfilter, llite, ptlrpc, portals, lnd, ldlm, lov</para>
+          </listitem>
+        </itemizedlist>
+        <itemizedlist>
+          <listitem>
+            <para>  Debug Types:</para>
+          </listitem>
+          <listitem>
+            <para><informaltable frame="all">
+                <tgroup cols="2">
+                  <colspec colname="c1" colwidth="50*"/>
+                  <colspec colname="c2" colwidth="50*"/>
+                  <thead>
+                    <row>
+                      <entry>
+                        <para><emphasis role="bold">Types</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para><emphasis role="bold">Description</emphasis></para>
+                      </entry>
+                    </row>
+                  </thead>
+                  <tbody>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">trace</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> Entry/Exit markers</para>
+                      </entry>
+                    </row>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">dlmtrace</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> Locking-related information</para>
+                      </entry>
+                    </row>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">inode</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> &#160;</para>
+                      </entry>
+                    </row>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">super</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> &#160;</para>
+                      </entry>
+                    </row>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">ext2</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> Anything from the ext2_debug</para>
+                      </entry>
+                    </row>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">malloc</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> Print malloc or free information</para>
+                      </entry>
+                    </row>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">cache</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> Cache-related information</para>
+                      </entry>
+                    </row>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">info</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> General information</para>
+                      </entry>
+                    </row>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">ioctl</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> IOCTL-related information</para>
+                      </entry>
+                    </row>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">blocks</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> Ext2 block allocation information</para>
+                      </entry>
+                    </row>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">net</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> Networking</para>
+                      </entry>
+                    </row>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">warning</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> &#160;</para>
+                      </entry>
+                    </row>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">buffs</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> &#160;</para>
+                      </entry>
+                    </row>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">other</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> &#160;</para>
+                      </entry>
+                    </row>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">dentry</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> &#160;</para>
+                      </entry>
+                    </row>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">portals</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> Entry/Exit markers</para>
+                      </entry>
+                    </row>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">page</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> Bulk page handling</para>
+                      </entry>
+                    </row>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">error</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> Error messages</para>
+                      </entry>
+                    </row>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">emerg</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> &#160;</para>
+                      </entry>
+                    </row>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">rpctrace</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> For distributed debugging</para>
+                      </entry>
+                    </row>
+                    <row>
+                      <entry>
+                        <para> <emphasis role="bold">ha</emphasis></para>
+                      </entry>
+                      <entry>
+                        <para> Failover and recovery-related information</para>
+                      </entry>
+                    </row>
+                  </tbody>
+                </tgroup>
+              </informaltable>
+</para>
+          </listitem>
+        </itemizedlist>
+      </section>
+      <section xml:id="dbdoclet.50438274_57177">
+        <title>28.2.1.2 Format of Lustre Debug Messages</title>
+        <para>Lustre uses the <literal>CDEBUG</literal> and <literal>CERROR</literal> macros to print the debug or error messages. To print the message, the <literal>CDEBUG</literal> macro uses <literal>portals_debug_msg</literal> (<literal>portals/linux/oslib/debug.c</literal>). The message format is described below, along with an example.</para>
          <informaltable frame="all">
            <tgroup cols="2">
              <colspec colname="c1" colwidth="50*"/>
              <colspec colname="c2" colwidth="50*"/>
              <thead>
                <row>
-                <entry><para><emphasis role="bold">Macro</emphasis></para></entry>
-                <entry><para><emphasis role="bold">Description</emphasis></para></entry>
+                <entry>
+                  <para><emphasis role="bold">Parameter</emphasis></para>
+                </entry>
+                <entry>
+                  <para><emphasis role="bold">Description</emphasis></para>
+                </entry>
                </row>
              </thead>
              <tbody>
                <row>
-                <entry><para> <emphasis role="bold">LBUG</emphasis></para></entry>
-                <entry><para> A panic-style assertion in the kernel which causes Lustre to dump its circular log to the /tmp/lustre-log file. This file can be retrieved after a reboot. LBUG freezes the thread to allow capture of the panic stack. A system reboot is needed to clear the thread.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">LASSERT</emphasis></para></entry>
-                <entry><para> Validates a given expression as true, otherwise calls LBUG. The failed expression is printed on the console, although the values that make up the expression are not printed.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">LASSERTF</emphasis></para></entry>
-                <entry><para> Similar to LASSERT but allows a free-format message to be printed, like printf/printk.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">CDEBUG</emphasis></para></entry>
-                <entry><para> The basic, most commonly used debug macro that takes just one more argument than standard printf - the debug type. This message adds to the debug log with the debug mask set accordingly. Later, when a user retrieves the log for troubleshooting, they can filter based on this type.</para><para>CDEBUG(D_INFO, &quot;This is my debug message: the number is %d\n&quot;, number).</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">CERROR</emphasis></para></entry>
-                <entry><para> Behaves similarly to CDEBUG, but unconditionally prints the message in the debug log and to the console. This is appropriate for serious errors or fatal conditions:</para><para>CERROR(&quot;Something very bad has happened, and the return code is %d.\n&quot;, rc);</para></entry>
+                <entry>
+                  <para> <emphasis role="bold">subsystem</emphasis></para>
+                </entry>
+                <entry>
+                  <para> 800000</para>
+                </entry>
                </row>
                <row>
-                <entry><para> <emphasis role="bold">ENTRY and EXIT</emphasis></para></entry>
-                <entry><para> Add messages to aid in call tracing (takes no arguments). When using these macros, cover all exit conditions to avoid confusion when the debug log reports that a function was entered, but never exited.</para></entry>
+                <entry>
+                  <para> <emphasis role="bold">debug mask</emphasis></para>
+                </entry>
+                <entry>
+                  <para> 000010</para>
+                </entry>
                </row>
                <row>
-                <entry><para> <emphasis role="bold">LDLM_DEBUG and LDLM_DEBUG_NOLOCK</emphasis></para></entry>
-                <entry><para> Used when tracing MDS and VFS operations for locking. These macros build a thin trace that shows the protocol exchanges between nodes.</para></entry>
+                <entry>
+                  <para> <emphasis role="bold">smp_processor_id</emphasis></para>
+                </entry>
+                <entry>
+                  <para> 0</para>
+                </entry>
                </row>
                <row>
-                <entry><para> <emphasis role="bold">DEBUG_REQ</emphasis></para></entry>
-                <entry><para> Prints information about the given ptlrpc_request structure.</para></entry>
+                <entry>
+                  <para> <emphasis role="bold">sec.used</emphasis></para>
+                </entry>
+                <entry>
+                  <para> 10818808</para>
+                  <para> 47.677302</para>
+                </entry>
                </row>
                <row>
-                <entry><para> <emphasis role="bold">OBD_FAIL_CHECK</emphasis></para></entry>
-                <entry><para> Allows insertion of failure points into the Lustre code. This is useful to generate regression tests that can hit a very specific sequence of events. This works in conjunction with &quot;sysctl -w lustre.fail_loc={fail_loc}&quot; to set a specific failure point for which a given OBD_FAIL_CHECK will test.</para></entry>
+                <entry>
+                  <para> <emphasis role="bold">stack size</emphasis></para>
+                </entry>
+                <entry>
+                  <para> 1204:</para>
+                </entry>
                </row>
                <row>
-                <entry><para> <emphasis role="bold">OBD_FAIL_TIMEOUT</emphasis></para></entry>
-                <entry><para> Similar to OBD_FAIL_CHECK. Useful to simulate hung, blocked or busy processes or network devices. If the given fail_loc is hit, OBD_FAIL_TIMEOUT waits for the specified number of seconds.</para></entry>
+                <entry>
+                  <para> <emphasis role="bold">pid</emphasis></para>
+                </entry>
+                <entry>
+                  <para> 2973:</para>
+                </entry>
                </row>
                <row>
-                <entry><para> <emphasis role="bold">OBD_RACE</emphasis></para></entry>
-                <entry><para> Similar to OBD_FAIL_CHECK. Useful to have multiple processes execute the same code concurrently to provoke locking races. The first process to hit OBD_RACE sleeps until a second process hits OBD_RACE, then both processes continue.</para></entry>
+                <entry>
+                  <para> <emphasis role="bold">host pid (if uml) or zero</emphasis></para>
+                </entry>
+                <entry>
+                  <para> 31070:</para>
+                </entry>
                </row>
                <row>
-                <entry><para> <emphasis role="bold">OBD_FAIL_ONCE</emphasis></para></entry>
-                <entry><para> A flag set on a lustre.fail_loc breakpoint to cause the OBD_FAIL_CHECK condition to be hit only one time. Otherwise, a fail_loc is permanent until it is cleared with &quot;sysctl -w lustre.fail_loc=0&quot;.</para></entry>
+                <entry>
+                  <para> <emphasis role="bold">(file:line #:function())</emphasis></para>
+                </entry>
+                <entry>
+                  <para> (as_dev.c:144:create_write_buffers())</para>
+                </entry>
                </row>
                <row>
-                <entry><para> <emphasis role="bold">OBD_FAIL_RAND</emphasis></para></entry>
-                <entry><para> Has OBD_FAIL_CHECK fail randomly; on average every (1 / lustre.fail_val) times.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">OBD_FAIL_SKIP</emphasis></para></entry>
-                <entry><para> Has OBD_FAIL_CHECK succeed lustre.fail_val times, and then fail permanently or once with OBD_FAIL_ONCE.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">OBD_FAIL_SOME</emphasis></para></entry>
-                <entry><para> Has OBD_FAIL_CHECK fail lustre.fail_val times, and then succeed.</para></entry>
+                <entry>
+                  <para> <emphasis role="bold">debug message</emphasis></para>
+                </entry>
+                <entry>
+                  <para> kmalloced &apos;*obj&apos;: 24 at a375571c (tot 17447717)</para>
+                </entry>
                </row>
              </tbody>
            </tgroup>
          </informaltable>
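+        <para>The fields in the table above can be pulled apart programmatically. The following is a minimal Python sketch, not part of Lustre. It assumes that the header fields are colon-separated and that the location is wrapped in parentheses; the sample line, including the joined timestamp value, is a hypothetical one assembled from the values in the table, and <literal>parse_debug_line</literal> is an illustrative name.</para>

```python
def parse_debug_line(line):
    """Split one debug message into the fields listed in the table.

    Assumption (not confirmed by the manual): header fields are
    colon-separated and the location is wrapped in parentheses.
    """
    # Seven plain header tokens, then the parenthesized location
    # followed by the free-form message.
    (subsystem, mask, cpu, timestamp,
     stack, pid, host_pid, rest) = line.split(":", 7)
    end = rest.index(") ")  # closing parenthesis of the location
    src_file, src_line, func = rest[1:end].split(":", 2)
    return {
        "subsystem": subsystem,
        "debug_mask": mask,
        "cpu": int(cpu),
        "timestamp": timestamp,
        "stack_size": int(stack),
        "pid": int(pid),
        "host_pid": int(host_pid),
        "file": src_file,
        "line": int(src_line),
        "function": func,
        "message": rest[end + 2:],
    }


# Hypothetical line assembled from the sample values in the table.
SAMPLE = ("800000:000010:0:1081880847.677302:1204:2973:31070:"
          "(as_dev.c:144:create_write_buffers()) "
          "kmalloced '*obj': 24 at a375571c (tot 17447717)")

fields = parse_debug_line(SAMPLE)
print(fields["file"], fields["line"], fields["function"])
print(fields["message"])
```

+        <para>Real log lines may differ in layout between Lustre versions, so treat the field order as something to check against actual output before relying on it.</para>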
        </section>
-      <section remap="h3">
-        <title>28.3.2 Accessing a Ptlrpc <anchor xml:id="dbdoclet.50438274_marker-1296099" xreflabel=""/>Request History</title>
-        <para>Each service maintains a request history, which can be useful for first occurrence troubleshooting.</para>
-        <para>Ptlrpc is an RPC protocol layered on LNET that deals with stateful servers and has semantics and built-in support for recovery.</para>
-        <para>A prlrpc request history works as follows:</para>
-        <orderedlist><listitem>
-        <para>Request_in_callback() adds the new request to the service&apos;s request history.</para>
-    </listitem><listitem>
-        <para>When a request buffer becomes idle, it is added to the service&apos;s request buffer history list.</para>
-    </listitem><listitem>
-        <para>Buffers are culled from the service&apos;s request buffer history if it has grown above</para>
-        <para>req_buffer_history_max and its reqs are removed from the service&apos;s request history.</para>
-    </listitem></orderedlist>
-        <para>Request history is accessed and controlled using the following /proc files under the service directory:</para>
-        <itemizedlist><listitem>
-            <para>req_buffer_history_len</para>
-          </listitem>
-
-</itemizedlist>
-        <para>Number of request buffers currently in the history</para>
-        <itemizedlist><listitem>
-            <para>req_buffer_history_max</para>
-          </listitem>
-
-</itemizedlist>
-        <para>Maximum number of request buffers to keep</para>
-        <itemizedlist><listitem>
-            <para>req_history</para>
-          </listitem>
-
-</itemizedlist>
-        <para>The request history</para>
-        <para>Requests in the history include &quot;live&quot; requests that are currently being handled. Each line in req_history looks like:</para>
-        <screen>&lt;seq&gt;:&lt;target NID&gt;:&lt;client ID&gt;:&lt;xid&gt;:&lt;length&gt;:&lt;phase&gt; &lt;svc specific&gt; 
-</screen>
-        <informaltable frame="all">
-          <tgroup cols="2">
-            <colspec colname="c1" colwidth="50*"/>
-            <colspec colname="c2" colwidth="50*"/>
-            <thead>
-              <row>
-                <entry><para><emphasis role="bold">Parameter</emphasis></para></entry>
-                <entry><para><emphasis role="bold">Description</emphasis></para></entry>
-              </row>
-            </thead>
-            <tbody>
-              <row>
-                <entry><para> <emphasis role="bold">seq</emphasis></para></entry>
-                <entry><para> Request sequence number</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">target NID</emphasis></para></entry>
-                <entry><para> Destination NID of the incoming request</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">client ID</emphasis></para></entry>
-                <entry><para> Client PID and NID</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">xid</emphasis></para></entry>
-                <entry><para> rq_xid</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">length</emphasis></para></entry>
-                <entry><para> Size of the request message</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">phase</emphasis></para></entry>
-                <entry><para><itemizedlist><listitem>
-                        <para> New (waiting to be handled or could not be unpacked)</para>
-                      </listitem>
-<listitem>
-                        <para> Interpret (unpacked or being handled)</para>
-                      </listitem>
-<listitem>
-                        <para> Complete (handled)</para>
-                      </listitem>
-</itemizedlist></para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">svc specific</emphasis></para></entry>
-                <entry><para> Service-specific request printout. Currently, the only service that does this is the OST (which prints the opcode if the message has been unpacked successfully</para></entry>
-              </row>
-            </tbody>
-          </tgroup>
-        </informaltable>
+      <section remap="h4">
+        <title>28.2.1.3 Lustre Debug Messages Buffer</title>
+        <para>Lustre debug messages are maintained in a buffer, with the maximum buffer size specified (in MBs) by the <literal>debug_mb</literal> parameter (<literal>/proc/sys/lnet/debug_mb</literal>). The buffer is circular, so debug messages are kept until the allocated buffer limit is reached, and then the first messages are overwritten.</para>
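+        <para>The overwrite behavior of a circular buffer can be illustrated with a short Python sketch. This is a simplified model only: the real buffer is sized in megabytes through <literal>debug_mb</literal>, whereas this toy version counts messages, and the class name <literal>DebugBuffer</literal> is hypothetical.</para>

```python
from collections import deque


class DebugBuffer:
    """Toy circular buffer that keeps only the newest messages."""

    def __init__(self, capacity):
        # deque(maxlen=...) drops the oldest entry when a new one
        # is appended to a full buffer.
        self.messages = deque(maxlen=capacity)

    def log(self, message):
        self.messages.append(message)

    def dump(self):
        return list(self.messages)


buf = DebugBuffer(capacity=3)
for n in range(5):
    buf.log("message %d" % n)

# Messages 0 and 1 were overwritten once the buffer filled up.
print(buf.dump())
```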
        </section>
-      <section remap="h3">
-        <title>28.3.3 Finding Memory <anchor xml:id="dbdoclet.50438274_marker-1296153" xreflabel=""/>Leaks Using leak_finder.pl</title>
-        <para>Memory leaks can occur in code when memory has been allocated and then not freed once it is no longer required. The leak_finder.pl program provides a way to find memory leaks.</para>
-        <para>Before running this program, you must turn on debugging to collect all malloc and free entries. Run:</para>
-        <screen>sysctl -w lnet.debug=+malloc 
+    </section>
+    <section xml:id='dbdoclet.50438274_62472'>
+      <title>28.2.2 Using the lctl Tool to View Debug Messages</title>
+      <para>The <literal>lctl</literal> tool allows debug messages to be filtered based on subsystems and message types to extract information useful for troubleshooting from a kernel debug log. For a command reference, see <xref linkend="dbdoclet.50438219_38274">lctl</xref>.</para>
+      <para>You can use <literal>lctl</literal> to:</para>
+      <itemizedlist>
+        <listitem>
+          <para>Obtain a list of all the types and subsystems:</para>
+          <screen>lctl &gt; debug_list <emphasis>&lt;subs | types&gt;</emphasis></screen>
+        </listitem>
+      </itemizedlist>
+      <itemizedlist>
+        <listitem>
+          <para>Filter the debug log:</para>
+          <screen>lctl &gt; filter <emphasis>&lt;subsystem name | debug type&gt;</emphasis></screen>
+        </listitem>
+      </itemizedlist>
+      <note>
+        <para>When <literal>lctl</literal> filters, it removes unwanted lines from the displayed output. This does not affect the contents of the debug log in the kernel&apos;s memory. As a result, you can print the log many times with different filtering levels without worrying about losing data.</para>
+      </note>
+      <itemizedlist>
+        <listitem>
+          <para>Show debug messages belonging to a certain subsystem or type:</para>
+          <screen>lctl &gt; show <emphasis>&lt;subsystem name | debug type&gt;</emphasis></screen>
+          <para><literal>debug_kernel</literal> pulls the data from the kernel logs, filters it appropriately, and displays or saves it according to the specified options:</para>
+          <screen>lctl &gt; debug_kernel [<emphasis>output filename</emphasis>]</screen>
+          <para>If the debugging is being done on User Mode Linux (UML), it might be useful to save the logs on the host machine so that they can be used at a later time.</para>
+        </listitem>
+      </itemizedlist>
+      <itemizedlist>
+        <listitem>
+          <para>Filter a log on disk, if you already have a debug log saved to disk (likely from a crash):</para>
+          <screen>lctl &gt; debug_file <emphasis>&lt;input filename&gt;</emphasis> [<emphasis>output filename</emphasis>] </screen>
+          <para>During the debug session, you can add markers or breaks to the log for any reason:</para>
+          <screen>lctl &gt; mark [marker text] </screen>
+          <para>The marker text defaults to the current date and time in the debug log (similar to the example shown below):</para>
+          <screen>DEBUG MARKER: Tue Mar 5 16:06:44 EST 2002 
+</screen>
+        </listitem>
+      </itemizedlist>
+      <itemizedlist>
+        <listitem>
+          <para>Completely flush the kernel debug buffer:</para>
+          <screen>lctl &gt; clear
+</screen>
+        </listitem>
+      </itemizedlist>
+      <note>
+        <para>Debug messages displayed with <literal>lctl</literal> are also subject to the kernel debug masks; the filters are additive.</para>
+      </note>
+      <section remap="h4">
+        <title>28.2.2.1 Sample <literal>lctl</literal> Run</title>
+        <para>Below is a sample run using the <literal>lctl</literal> command.</para>
+        <screen>bash-2.04# ./lctl 
+lctl &gt; debug_kernel /tmp/lustre_logs/log_all 
+Debug log: 324 lines, 324 kept, 0 dropped. 
+lctl &gt; filter trace 
+Disabling output of type &quot;trace&quot; 
+lctl &gt; debug_kernel /tmp/lustre_logs/log_notrace 
+Debug log: 324 lines, 282 kept, 42 dropped. 
+lctl &gt; show trace 
+Enabling output of type &quot;trace&quot; 
+lctl &gt; filter portals 
+Disabling output from subsystem &quot;portals&quot; 
+lctl &gt; debug_kernel /tmp/lustre_logs/log_noportals 
+Debug log: 324 lines, 258 kept, 66 dropped. 
  </screen>
-        <para>Then complete the following steps:</para>
-        <orderedlist><listitem> 
-        <para> 1. Dump the log into a user-specified log file using lctl (see <link xl:href="LustreDebugging.html#50438274_62472">Using the lctl Tool to View Debug Messages</link>).</para>
-        </listitem><listitem>
-        <para> 2. Run the leak finder on the newly-created log dump:</para>
-        <screen>perl leak_finder.pl &lt;ascii-logname&gt;
+      </section>
+    </section>
+    <section remap="h3">
+      <title>28.2.3 Dumping the Buffer to a File (<literal>debug_daemon</literal>)</title>
+      <para>The <literal>debug_daemon</literal> option is used by <literal>lctl</literal> to control the dumping of the <literal>debug_kernel</literal> buffer to a user-specified file. This functionality uses a kernel thread on top of <literal>debug_kernel</literal>, which works in parallel with the <literal>debug_daemon</literal> command.</para>
+      <para>The <literal>debug_daemon</literal> is highly dependent on file system write speed. File system write operations may not be fast enough to flush out all of the <literal>debug_buffer</literal> if the Lustre file system is under heavy system load and continues to log <literal>CDEBUG</literal> messages to the <literal>debug_buffer</literal>. In that case, the <literal>debug_daemon</literal> writes the message <literal>DEBUG MARKER: Trace buffer full</literal> into the <literal>debug_buffer</literal> to indicate that the <literal>debug_buffer</literal> contents were overwritten before the <literal>debug_daemon</literal> could flush them to a file.</para>
+      <para>Users can use <literal>lctl control</literal> to start or stop the Lustre daemon from dumping the <literal>debug_buffer</literal> to a file. Users can also temporarily hold the daemon from dumping the file. The <literal>debug_daemon</literal> sub-command of <literal>lctl</literal> provides the same function.</para>
+      <section remap="h4">
+        <title>28.2.3.1 <literal>lctl debug_daemon</literal> Commands</title>
+        <para>This section describes <literal>lctl debug_daemon</literal> commands.</para>
+        <para>To start the <literal>debug_daemon</literal> dumping the <literal>debug_buffer</literal> into a file, enter:</para>
+        <screen>$ lctl debug_daemon start [{file} {megabytes}]</screen>
+        <para>The file can be a system default file, as shown in <literal>/proc/sys/lnet/debug_path</literal>. After Lustre starts, the default path is <literal>/tmp/lustre-log-$HOSTNAME</literal>. Users can specify a new file name for <literal>debug_daemon</literal> to write the <literal>debug_buffer</literal> to. The new file name shows up in <literal>/proc/sys/lnet/debug_path</literal>. The megabytes value limits the size of the file, in MB.</para>
+        <para>The daemon wraps around and dumps data to the beginning of the file when the output file size is over the limit of the user-specified file size. To decode the dumped file to ASCII and order the log entries by time, run:</para>
+        <screen>lctl debug_file {file} &gt; {newfile}</screen>
+        <para>The output is internally sorted by the <literal>lctl</literal> command using quicksort.</para>
+        <para>To completely shut down the <literal>debug_daemon</literal> operation and flush the file output, enter:</para>
+        <screen>debug_daemon stop</screen>
+        <para>Otherwise, <literal>debug_daemon</literal> is shut down as part of the Lustre file system shutdown process. Users can restart <literal>debug_daemon</literal> by issuing the start command after each stop command.</para>
+        <para>This is an example using <literal>debug_daemon</literal> with the interactive mode of <literal>lctl</literal> to dump debug logs to a 40 MB file.</para>
+        <screen>#~/utils/lctl</screen>
+        <para>To start the daemon to dump the <literal>debug_buffer</literal> into a 40 MB <literal>/trace/log</literal> file, enter:</para>
+        <screen>lctl &gt; debug_daemon start /trace/log 40 </screen>
+        <para>To completely shut down the daemon, enter:</para>
+        <screen>lctl &gt; debug_daemon stop </screen>
+        <para>To start another daemon with an unlimited file size, enter:</para>
+        <screen>lctl &gt; debug_daemon start /tmp/unlimited </screen>
+        <para>The text message <literal>*** End of debug_daemon trace log ***</literal> appears at the end of each output file.</para>
+      </section>
+    </section>
+    <section remap="h3">
+      <title>28.2.4 Controlling Information Written to the Kernel Debug Log</title>
+      <para>Masks are provided in <literal>/proc/sys/lnet/subsystem_debug</literal> and <literal>/proc/sys/lnet/debug</literal> to be used with the <literal>sysctl</literal> command to determine what information is written to the debug log. The <literal>subsystem_debug</literal> mask determines the information written to the log based on the subsystem (such as obdfilter, net, portals, or OSC). The <literal>debug</literal> mask controls information based on debug type (such as info, error, trace, or alloc).</para>
+      <para>To turn off Lustre debugging completely:</para>
+      <screen>sysctl -w lnet.debug=0 </screen>
+      <para>To turn on full Lustre debugging:</para>
+      <screen>sysctl -w lnet.debug=-1 </screen>
+      <para>To turn on logging of messages related to network communications:</para>
+      <screen>sysctl -w lnet.debug=net </screen>
+      <para>To turn on logging of messages related to network communications and existing debug flags:</para>
+      <screen>sysctl -w lnet.debug=+net </screen>
+      <para>To turn off network logging without changing existing flags:</para>
+      <screen>sysctl -w lnet.debug=-net </screen>
+      <para>The various options available to print to kernel debug logs are listed in <literal>lnet/include/libcfs/libcfs.h</literal>.</para>
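+      <para>The difference between the plain, <literal>+</literal>, and <literal>-</literal> forms shown above can be modeled with a small Python sketch. It is an illustration of the semantics only, not how the kernel stores the mask: flag names are kept in a set instead of a bitmask, <literal>ALL_TYPES</literal> is an illustrative subset of the debug types, and <literal>apply_debug_setting</literal> is a hypothetical helper.</para>

```python
# Illustrative subset of debug types; the real list is longer.
ALL_TYPES = {"trace", "info", "net", "malloc", "error", "ha"}


def apply_debug_setting(current, setting):
    """Return the new set of enabled debug types after one sysctl write."""
    setting = setting.strip()
    if setting == "0":            # turn debugging off completely
        return set()
    if setting == "-1":           # turn on full debugging
        return set(ALL_TYPES)
    if setting.startswith("+"):   # add a flag, keep existing ones
        return current | {setting[1:]}
    if setting.startswith("-"):   # remove a flag, keep the others
        return current - {setting[1:]}
    return {setting}              # plain name replaces the whole mask


enabled = {"info", "error"}
enabled = apply_debug_setting(enabled, "+net")
print(sorted(enabled))
enabled = apply_debug_setting(enabled, "-net")
print(sorted(enabled))
```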
+    </section>
+    <section remap="h3">
+      <title>28.2.5 Troubleshooting with <literal>strace</literal></title>
+      <para>The <literal>strace</literal> utility provided with the Linux distribution enables system calls to be traced by intercepting all the system calls made by a process and recording the system call name, arguments, and return values.</para>
+      <para>To invoke <literal>strace</literal> on a program, enter:</para>
+      <screen>$ strace <emphasis>&lt;program&gt; &lt;args&gt;</emphasis> </screen>
+      <para>Sometimes, a system call may fork child processes. In this situation, use the <literal>-f</literal> option of <literal>strace</literal> to trace the child processes:</para>
+      <screen>$ strace -f <emphasis>&lt;program&gt; &lt;args&gt;</emphasis> </screen>
+      <para>To redirect the <literal>strace</literal> output to a file, enter:</para>
+      <screen>$ strace -o <emphasis>&lt;filename&gt; &lt;program&gt; &lt;args&gt;</emphasis> </screen>
+      <para>Use the <literal>-ff</literal> option, along with <literal>-o</literal>, to save the trace output in <literal>filename.pid</literal>, where <literal>pid</literal> is the process ID of the process being traced. Use the <literal>-ttt</literal> option to timestamp all lines in the <literal>strace</literal> output, so they can be correlated to operations in the Lustre kernel debug log.</para>
+      <para>If the debugging is done in UML, save the traces on the host machine. In this example, <literal>hostfs</literal> is mounted on <literal>/r</literal>:</para>
+      <screen>$ strace -o /r/tmp/vi.strace </screen>
+    </section>
+    <section remap="h3">
+      <title>28.2.6 Looking at Disk Content</title>
+      <para>In Lustre, the inodes on the metadata server contain extended attributes (EAs) that store information about file striping. EAs contain a list of all object IDs and their locations (that is, the OST that stores them). The <literal>lfs</literal> tool can be used to obtain this information for a given file using the <literal>getstripe</literal> subcommand. Use a corresponding <literal>lfs setstripe</literal> command to specify striping attributes for a new file or directory.</para>
+      <para>The <literal>lfs getstripe</literal> utility is written in C; it takes a Lustre filename as input and lists all the objects that form a part of this file. To obtain this information for the file <literal>/mnt/lustre/frog</literal> in a Lustre file system, run:</para>
+      <screen>$ lfs getstripe /mnt/lustre/frog
+$
+   obdix                           objid
+   0                               17
+   1                               4
  </screen>
-        </listitem></orderedlist>
-        <para>The output is:</para>
-        <screen>malloced 8bytes at a3116744 (called pathcopy) 
+      <para>The <literal>debugfs</literal> tool is provided in the <literal>e2fsprogs</literal> package. It can be used for interactive debugging of an <literal>ldiskfs</literal> file system, either to check status or to modify information in the file system. In Lustre, all objects that belong to a file are stored in an underlying <literal>ldiskfs</literal> file system on the OSTs. The file system uses the object IDs as the file names. Once the object IDs are known, use the <literal>debugfs</literal> tool to obtain the attributes of all objects from different OSTs.</para>
+      <para>A sample run for the <literal>/mnt/lustre/frog</literal> file used in the above example is shown here:</para>
+      <screen>     $ debugfs -c /tmp/ost1
+   debugfs: cd O
+   debugfs: cd 0                                   /* for files in group 0 */
+   debugfs: cd d&lt;objid % 32&gt;
+   debugfs: stat &lt;objid&gt;                           /* for getattr on object */
+   debugfs: quit
+## Suppose object id is 36, then follow the steps below:
+   $ debugfs /tmp/ost1
+   debugfs: cd O
+   debugfs: cd 0
+   debugfs: cd d4                                  /* objid % 32 */
+   debugfs: stat 36                                /* for getattr on obj 36 */
+   debugfs: dump 36 /tmp/obj.36                    /* dump contents of obj 36 */
+   debugfs: quit</screen>
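+      <para>The directory that holds an object follows directly from the steps in the session above: group directory, then a subdirectory named after the object ID modulo 32, then the object ID itself. A minimal Python sketch of that calculation is shown here; <literal>object_path</literal> is a hypothetical helper, not a Lustre tool.</para>

```python
def object_path(objid, group=0):
    """Return an object's path relative to the OST file system root."""
    # Objects are spread over subdirectories selected by objid modulo 32,
    # matching the "cd d(objid % 32)" step in the debugfs session.
    return "O/%d/d%d/%d" % (group, objid % 32, objid)


print(object_path(36))  # prints O/0/d4/36
```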
+    </section>
+    <section remap="h3">
+      <title>28.2.7 Finding the Lustre UUID of an OST</title>
+      <para>To determine the Lustre UUID of an obdfilter disk (for example, if you mix up the cables on your OST devices or the SCSI bus numbering suddenly changes and the SCSI devices get new names), use <literal>debugfs</literal> to get the <literal>last_rcvd</literal> file.</para>
+    </section>
+    <section remap="h3">
+      <title>28.2.8 Printing Debug Messages to the Console</title>
+      <para>To dump debug messages to the console (<literal>/var/log/messages</literal>), set the corresponding debug mask in the <literal>printk</literal> flag:</para>
+      <screen>sysctl -w lnet.printk=-1 </screen>
+      <para>This slows down the system dramatically. It is also possible to selectively enable or disable this capability for particular flags using:</para>
+      <screen>sysctl -w lnet.printk=+vfstrace 
+sysctl -w lnet.printk=-vfstrace </screen>
+      <para>It is possible to disable warning, error, and console messages, though it is strongly recommended to have something like <literal>lctl debug_daemon</literal> running to capture this data to a local file system for debugging purposes.</para>
+    </section>
+    <section remap="h3">
+      <title>28.2.9 Tracing Lock Traffic</title>
+      <para>Lustre has a specific debug type category for tracing lock traffic. Use:</para>
+      <screen>lctl&gt; filter all_types 
+lctl&gt; show dlmtrace 
+lctl&gt; debug_kernel [filename] </screen>
+    </section>
+  </section>
+  <section xml:id="dbdoclet.50438274_80443">
+    <title>28.3 Lustre Debugging for Developers</title>
+    <para>The procedures in this section may be useful to developers debugging Lustre code.</para>
+    <section remap="h3">
+      <title>28.3.1 Adding Debugging to the Lustre Source Code</title>
+      <para>The debugging infrastructure provides a number of macros that can be used in Lustre source code to aid in debugging or reporting serious errors.</para>
+      <para>To use these macros, you will need to set the <literal>DEBUG_SUBSYSTEM</literal> variable at the top of the file as shown below:</para>
+      <screen>#define DEBUG_SUBSYSTEM S_PORTALS</screen>
+      <para>A list of available macros with descriptions is provided in the table below.</para>
+      <informaltable frame="all">
+        <tgroup cols="2">
+          <colspec colname="c1" colwidth="50*"/>
+          <colspec colname="c2" colwidth="50*"/>
+          <thead>
+            <row>
+              <entry>
+                <para><emphasis role="bold">Macro</emphasis></para>
+              </entry>
+              <entry>
+                <para><emphasis role="bold">Description</emphasis></para>
+              </entry>
+            </row>
+          </thead>
+          <tbody>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>LBUG</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>A panic-style assertion in the kernel which causes Lustre to dump its circular log to the <literal>/tmp/lustre-log</literal> file. This file can be retrieved after a reboot. LBUG freezes the thread to allow capture of the panic stack. A system reboot is needed to clear the thread.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>LASSERT</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Validates a given expression as true, otherwise calls LBUG. The failed expression is printed on the console, although the values that make up the expression are not printed.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>LASSERTF</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Similar to LASSERT but allows a free-format message to be printed, like <literal>printf/printk</literal>.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>CDEBUG</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>The basic, most commonly used debug macro that takes just one more argument than standard <literal>printf</literal> - the debug type. This message adds to the debug log with the debug mask set accordingly. Later, when a user retrieves the log for troubleshooting, they can filter based on this type.</para>
+                <para><literal>CDEBUG(D_INFO, &quot;This is my debug message: the number is %d\n&quot;, number)</literal>.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>CERROR</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para> Behaves similarly to <literal>CDEBUG</literal>, but unconditionally prints the message in the debug log and to the console. This is appropriate for serious errors or fatal conditions:</para>
+                <para><literal>CERROR(&quot;Something very bad has happened, and the return code is %d.\n&quot;, rc);</literal></para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold"><literal>ENTRY</literal> and <literal>EXIT</literal></emphasis></para>
+              </entry>
+              <entry>
+                <para> Add messages to aid in call tracing (they take no arguments). When using these macros, cover all exit conditions to avoid confusion when the debug log reports that a function was entered, but never exited.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold"><literal>LDLM_DEBUG</literal> and <literal>LDLM_DEBUG_NOLOCK</literal></emphasis></para>
+              </entry>
+              <entry>
+                <para>Used when tracing MDS and VFS operations for locking. These macros build a thin trace that shows the protocol exchanges between nodes.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">DEBUG_REQ</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Prints information about the given <literal>ptlrpc_request</literal> structure.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">OBD_FAIL_CHECK</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Allows insertion of failure points into the Lustre code. This is useful to generate regression tests that can hit a very specific sequence of events. This works in conjunction with &quot;<literal>sysctl -w lustre.fail_loc={fail_loc}</literal>&quot; to set a specific failure point for which a given <literal>OBD_FAIL_CHECK</literal> will test.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">OBD_FAIL_TIMEOUT</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Similar to <literal>OBD_FAIL_CHECK</literal>. Useful to simulate hung, blocked or busy processes or network devices. If the given <literal>fail_loc</literal> is hit, <literal>OBD_FAIL_TIMEOUT</literal> waits for the specified number of seconds.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">OBD_RACE</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Similar to <literal>OBD_FAIL_CHECK</literal>. Useful to have multiple processes execute the same code concurrently to provoke locking races. The first process to hit <literal>OBD_RACE</literal> sleeps until a second process hits <literal>OBD_RACE</literal>, then both processes continue.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">OBD_FAIL_ONCE</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>A flag set on a <literal>lustre.fail_loc</literal> breakpoint to cause the <literal>OBD_FAIL_CHECK</literal> condition to be hit only one time. Otherwise, a <literal>fail_loc</literal> is permanent until it is cleared with &quot;<literal>sysctl -w lustre.fail_loc=0</literal>&quot;.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para><literal>
+                     <emphasis role="bold">OBD_FAIL_RAND</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Has <literal>OBD_FAIL_CHECK</literal> fail randomly; on average, once in every <literal>lustre.fail_val</literal> checks (a probability of 1 / <literal>lustre.fail_val</literal>).</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>OBD_FAIL_SKIP</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Has <literal>OBD_FAIL_CHECK</literal> succeed <literal>lustre.fail_val</literal> times, and then fail permanently or once with <literal>OBD_FAIL_ONCE</literal>.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">OBD_FAIL_SOME</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Has <literal>OBD_FAIL_CHECK</literal> fail <literal>lustre.fail_val</literal> times, and then succeed.</para>
+              </entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </informaltable>
+    </section>
+    <section remap="h3">
+      <title>28.3.2 Accessing a <literal>ptlrpc</literal> Request History</title>
+      <para>Each service maintains a request history, which can be useful for first occurrence troubleshooting.</para>
+      <para><literal>ptlrpc</literal> is an RPC protocol layered on LNET that deals with stateful servers and has semantics and built-in support for recovery.</para>
+      <para>A <literal>ptlrpc</literal> request history works as follows:</para>
+      <orderedlist>
+        <listitem>
+          <para><literal>request_in_callback()</literal> adds the new request to the service&apos;s request history.</para>
+        </listitem>
+        <listitem>
+          <para>When a request buffer becomes idle, it is added to the service&apos;s request buffer history list.</para>
+        </listitem>
+        <listitem>
+          <para>When the service&apos;s request buffer history grows above <literal>req_buffer_history_max</literal>, buffers are culled from it and their requests are removed from the service&apos;s request history.</para>
+        </listitem>
+      </orderedlist>
+      <para>Request history is accessed and controlled using the following <literal>/proc</literal> files under the service directory:</para>
+      <itemizedlist>
+        <listitem>
+          <para><literal>req_buffer_history_len</literal></para>
+          <para>Number of request buffers currently in the history</para>
+        </listitem>
+      </itemizedlist>
+      <itemizedlist>
+        <listitem>
+          <para><literal>req_buffer_history_max</literal></para>
+          <para>Maximum number of request buffers to keep</para>
+        </listitem>
+      </itemizedlist>
+      <itemizedlist>
+        <listitem>
+          <para><literal>req_history</literal></para>
+          <para>The request history</para>
+        </listitem>
+      </itemizedlist>
+      <para>Requests in the history include &quot;live&quot; requests that are currently being handled. Each line in <literal>req_history</literal> looks like:</para>
+      <screen>&lt;seq&gt;:&lt;target NID&gt;:&lt;client ID&gt;:&lt;xid&gt;:&lt;length&gt;:&lt;phase&gt; &lt;svc specific&gt; </screen>
+      <informaltable frame="all">
+        <tgroup cols="2">
+          <colspec colname="c1" colwidth="50*"/>
+          <colspec colname="c2" colwidth="50*"/>
+          <thead>
+            <row>
+              <entry>
+                <para><emphasis role="bold">Parameter</emphasis></para>
+              </entry>
+              <entry>
+                <para><emphasis role="bold">Description</emphasis></para>
+              </entry>
+            </row>
+          </thead>
+          <tbody>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>seq</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para> Request sequence number</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">target NID</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para> Destination <literal>NID</literal> of the incoming request</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">client ID</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para> Client <literal>PID</literal> and <literal>NID</literal></para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>xid</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para> <literal>rq_xid</literal></para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>length</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para> Size of the request message</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>phase</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para><itemizedlist>
+                    <listitem>
+                      <para>New (waiting to be handled or could not be unpacked)</para>
+                    </listitem>
+                    <listitem>
+                      <para>Interpret (unpacked or being handled)</para>
+                    </listitem>
+                    <listitem>
+                      <para>Complete (handled)</para>
+                    </listitem>
+                  </itemizedlist></para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>svc specific</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Service-specific request printout. Currently, the only service that does this is the OST (which prints the opcode if the message has been unpacked successfully).</para>
+              </entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </informaltable>
+    </section>
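The `req_history` line format described above can be split into its fields with a few lines of Python. This is a sketch under stated assumptions: the names `ReqHistoryEntry` and `parse_req_history_line` are hypothetical, the sample line is invented for illustration, and the parser assumes that none of the colon-separated fields contains a colon or a space (which holds for NIDs of the form `192.168.10.34@tcp`).

```python
from typing import NamedTuple


class ReqHistoryEntry(NamedTuple):
    seq: int            # request sequence number
    target_nid: str     # destination NID of the incoming request
    client_id: str      # client PID and NID
    xid: str            # rq_xid, kept as text because its base is not assumed
    length: int         # size of the request message
    phase: str          # New, Interpret or Complete
    svc_specific: str   # optional service-specific printout


def parse_req_history_line(line: str) -> ReqHistoryEntry:
    # The service-specific part follows the first space and may be empty.
    head, _, tail = line.strip().partition(" ")
    seq, target_nid, client_id, xid, length, phase = head.split(":")
    return ReqHistoryEntry(int(seq), target_nid, client_id, xid,
                           int(length), phase, tail.strip())


if __name__ == "__main__":
    # Invented sample line; real output may differ in detail.
    sample = "17:192.168.10.34@tcp:12345-192.168.10.35@tcp:1380:296:Complete opc 400"
    entry = parse_req_history_line(sample)
    print(entry.seq, entry.phase, entry.svc_specific)
```

Keeping `xid` as a string avoids guessing how the value is printed; convert it only after checking real output from the service in question.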
+    <section remap="h3">
+      <title>28.3.3 Finding Memory Leaks Using <literal>leak_finder.pl</literal></title>
+      <para>Memory leaks can occur in code when memory has been allocated and then not freed once it is no longer required. The <literal>leak_finder.pl</literal> program provides a way to find memory leaks.</para>
+      <para>Before running this program, you must turn on debugging to collect all <literal>malloc</literal> and free entries. Run:</para>
+      <screen>sysctl -w lnet.debug=+malloc </screen>
+      <para>Then complete the following steps:</para>
+      <orderedlist>
+        <listitem>
+          <para>Dump the log into a user-specified log file using lctl (see <xref linkend="dbdoclet.50438274_62472">Using the <literal>lctl</literal> Tool to View Debug Messages</xref>).</para>
+        </listitem>
+        <listitem>
+          <para>Run the leak finder on the newly-created log dump:</para>
+          <screen>perl leak_finder.pl &lt;ascii-logname&gt;</screen>
+        </listitem>
+      </orderedlist>
+      <para>The output is:</para>
+      <screen>malloced 8bytes at a3116744 (called pathcopy) 
  (lprocfs_status.c:lprocfs_add_vars:80) 
  freed 8bytes at a3116744 (called pathcopy) 
  (lprocfs_status.c:lprocfs_add_vars:80) 
  </screen>
-        <para>The tool displays the following output to show the leaks found:</para>
-        <screen>Leak:32bytes allocated at a23a8fc(service.c:ptlrpc_init_svc:144,debug file \
-line 241)
-</screen>
-      </section>
+      <para>The tool displays the following output to show the leaks found:</para>
+      <screen>Leak:32bytes allocated at a23a8fc(service.c:ptlrpc_init_svc:144,debug file line 241)</screen>
+    </section>
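The idea behind leak detection, pairing each allocation with a later free of the same address, can be sketched as follows. This is not the implementation of `leak_finder.pl`; it is a minimal illustration that assumes input lines in the `malloced`/`freed` form shown in the sample output above. The names `EVENT` and `find_leaks` are hypothetical.

```python
import re

# Matches lines such as "malloced 8bytes at a3116744 (called pathcopy)".
EVENT = re.compile(r"(malloced|freed)\s+(\d+)\s*bytes at ([0-9a-f]+)")


def find_leaks(lines):
    """Return {address: size} for allocations that have no matching free."""
    live = {}
    for line in lines:
        match = EVENT.search(line)
        if not match:
            continue  # skip source-location lines and anything unrecognized
        kind, size, addr = match.group(1), int(match.group(2)), match.group(3)
        if kind == "malloced":
            live[addr] = size
        else:
            # A free of an address not seen before is silently ignored.
            live.pop(addr, None)
    return live


if __name__ == "__main__":
    sample = [
        "malloced 8bytes at a3116744 (called pathcopy)",
        "(lprocfs_status.c:lprocfs_add_vars:80)",
        "freed 8bytes at a3116744 (called pathcopy)",
        "malloced 32bytes at a23a8fc (called svc)",
    ]
    print(find_leaks(sample))
```

Any address still present in the returned dictionary was allocated but never freed within the captured log, which corresponds to a `Leak:` line in the tool's report.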
    </section>
  </chapter>
diff --git a/LustreProc.xml b/LustreProc.xml

index cafc116..6ec2582 100644 (file)
--- a/LustreProc.xml
+++ b/LustreProc.xml
@@ -1,52 +1,47 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<chapter version="5.0" xml:lang="en-US" xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" xml:id='lustreproc'>
+<?xml version='1.0' encoding='UTF-8'?>
+<!-- This document was created with Syntext Serna Free. --><chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="lustreproc">
    <info>
-    <title xml:id='lustreproc.title'>LustreProc</title>
+    <title xml:id="lustreproc.title">LustreProc</title>
    </info>
-  <para>The /proc file system acts as an interface to internal data structures in the kernel. The /proc variables can be used to control aspects of Lustre performance and provide information.</para>
+  <para>The <literal>/proc</literal> file system acts as an interface to internal data structures in the kernel. The <literal>/proc</literal> variables can be used to control aspects of Lustre performance and provide information.</para>
    <para>This chapter describes Lustre /proc entries and includes the following sections:</para>
-  <itemizedlist><listitem>
+  <itemizedlist>
+    <listitem>
        <para><xref linkend="dbdoclet.50438271_90999"/></para>
      </listitem>
-
-<listitem>
+    <listitem>
        <para><xref linkend="dbdoclet.50438271_78950"/></para>
      </listitem>
-
-<listitem>
+    <listitem>
        <para><xref linkend="dbdoclet.50438271_83523"/></para>
      </listitem>
-
-</itemizedlist>
-
-    <section xml:id="dbdoclet.50438271_90999">
-      <title>31.1 Proc Entries for Lustre</title>
-      <para>This section describes /proc entries for Lustre.</para>
-      <section remap="h3">
-        <title>31.1.1 Locating Lustre <anchor xml:id="dbdoclet.50438271_marker-1296151" xreflabel=""/>File Systems and Servers</title>
-        <para>Use the proc files on the MGS to locate the following:</para>
-        <itemizedlist><listitem>
-            <para> All known file systems</para>
-          </listitem>
-
-</itemizedlist>
-        <screen># cat /proc/fs/lustre/mgs/MGS/filesystems
+  </itemizedlist>
+  <section xml:id="dbdoclet.50438271_90999">
+    <title>31.1 Proc Entries for Lustre</title>
+    <para>This section describes <literal>/proc</literal> entries for Lustre.</para>
+    <section remap="h3">
+      <title>31.1.1 Locating Lustre File Systems and Servers</title>
+      <para>Use the proc files on the MGS to locate the following:</para>
+      <itemizedlist>
+        <listitem>
+          <para> All known file systems</para>
+          <screen># cat /proc/fs/lustre/mgs/MGS/filesystems
  spfs
-lustre
-</screen>
-        <itemizedlist><listitem>
-            <para> The server names participating in a file system (for each file system that has at least one server running)</para>
-          </listitem>
-
-</itemizedlist>
-        <screen># cat /proc/fs/lustre/mgs/MGS/live/spfs
+lustre</screen>
+        </listitem>
+      </itemizedlist>
+      <itemizedlist>
+        <listitem>
+          <para> The server names participating in a file system (for each file system that has at least one server running)</para>
+          <screen># cat /proc/fs/lustre/mgs/MGS/live/spfs
  fsname: spfs
  flags: 0x0         gen: 7
  spfs-MDT0000
-spfs-OST0000
-</screen>
-        <para>All servers are named according to this convention: &lt;fsname&gt;-&lt;MDT|OST&gt;&lt;XXXX&gt; This can be shown for live servers under /proc/fs/lustre/devices:</para>
-        <screen># cat /proc/fs/lustre/devices 
+spfs-OST0000</screen>
+        </listitem>
+      </itemizedlist>
+      <para>All servers are named according to this convention: <literal>&lt;fsname&gt;-&lt;MDT|OST&gt;&lt;XXXX&gt;</literal>. This can be shown for live servers under <literal>/proc/fs/lustre/devices</literal>:</para>
+      <screen># cat /proc/fs/lustre/devices 
  0 UP mgs MGS MGS 11
  1 UP mgc MGC192.168.10.34@tcp 1f45bb57-d9be-2ddb-c0b0-5431a49226705
  2 UP mdt MDS MDS_uuid 3
@@ -57,516 +52,705 @@ spfs-OST0000
  7 UP lov lustre-clilov-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa04
  8 UP mdc lustre-MDT0000-mdc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa05
  9 UP osc lustre-OST0000-osc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa05
-10 UP osc lustre-OST0001-osc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa05
-</screen>
-        <para>Or from the device label at any time:</para>
-        <screen># e2label /dev/sda
-lustre-MDT0000
-</screen>
-      </section>
-      <section remap="h3">
-        <title>31.1.2 Lustre <anchor xml:id="dbdoclet.50438271_marker-1296153" xreflabel=""/>Timeouts</title>
-        <para>Lustre uses two types of timeouts.</para>
-        <itemizedlist><listitem>
-            <para>LND timeouts that ensure point-to-point communications complete in finite time in the presence of failures. These timeouts are logged with the S_LND flag set. They may <emphasis>not</emphasis> be printed as console messages, so you should check the Lustre log for D_NETERROR messages, or enable printing of D_NETERROR messages to the console (echo + neterror &gt; /proc/sys/lnet/printk).</para>
-          </listitem>
-
-</itemizedlist>
-        <para>Congested routers can be a source of spurious LND timeouts. To avoid this, increase the number of LNET router buffers to reduce back-pressure and/or increase LND timeouts on all nodes on all connected networks. You should also consider increasing the total number of LNET router nodes in the system so that the aggregate router bandwidth matches the aggregate server bandwidth.</para>
-        <itemizedlist><listitem>
-            <para>Lustre timeouts that ensure Lustre RPCs complete in finite time in the presence of failures. These timeouts should <emphasis>always</emphasis> be printed as console messages. If Lustre timeouts are not accompanied by LNET timeouts, then you need to increase the lustre timeout on both servers and clients.</para>
-          </listitem>
-
-</itemizedlist>
-        <para>Specific Lustre timeouts are described below.</para>
-        <para><emphasis role="bold">/proc/sys/lustre/timeout</emphasis></para>
-        <para>This is the time period that a client waits for a server to complete an RPC (default is 100s). Servers wait half of this time for a normal client RPC to complete and a quarter of this time for a single bulk request (read or write of up to 1 MB) to complete. The client pings recoverable targets (MDS and OSTs) at one quarter of the timeout, and the server waits one and a half times the timeout before evicting a client for being &quot;stale.&quot;</para>
-                <note><para>Lustre sends periodic 'PING' messages to servers with which it had no communication for a specified period of time. Any network activity on the file system that triggers network traffic toward servers also works as a health check.</para></note>
-        <para><emphasis role="bold">/proc/sys/lustre/ldlm_timeout</emphasis></para>
-        <para>This is the time period for which a server will wait for a client to reply to an initial AST (lock cancellation request) where default is 20s for an OST and 6s for an MDS. If the client replies to the AST, the server will give it a normal timeout (half of the client timeout) to flush any dirty data and release the lock.</para>
-        <para><emphasis role="bold">/proc/sys/lustre/fail_loc</emphasis></para>
-        <para>This is the internal debugging failure hook.</para>
-        <para>See lustre/include/linux/obd_support.h for the definitions of individual failure locations. The default value is 0 (zero).</para>
-        <screen>sysctl -w lustre.fail_loc=0x80000122 # drop a single reply
-</screen>
-        <para><emphasis role="bold">/proc/sys/lustre/dump_on_timeout</emphasis></para>
-        <para>This triggers dumps of the Lustre debug log when timeouts occur. The default value is 0 (zero).</para>
-        <para><emphasis role="bold">/proc/sys/lustre/dump_on_eviction</emphasis></para>
-        <para>This triggers dumps of the Lustre debug log when an eviction occurs. The default value is 0 (zero). By default, debug logs are dumped to the /tmp folder; this location can be changed via /proc.</para>
-      </section>
-      <section remap="h3">
-        <title>31.1.3 Adaptive <anchor xml:id="dbdoclet.50438271_marker-1293380" xreflabel=""/>Timeouts</title>
-        <para>Lustre offers an adaptive mechanism to set RPC timeouts. The adaptive timeouts feature (enabled, by default) causes servers to track actual RPC completion times, and to report estimated completion times for future RPCs back to clients. The clients use these estimates to set their future RPC timeout values. If server request processing slows down for any reason, the RPC completion estimates increase, and the clients allow more time for RPC completion.</para>
-        <para>If RPCs queued on the server approach their timeouts, then the server sends an early reply to the client, telling the client to allow more time. In this manner, clients avoid RPC timeouts and disconnect/reconnect cycles. Conversely, as a server speeds up, RPC timeout values decrease, allowing faster detection of non-responsive servers and faster attempts to reconnect to a server&apos;s failover partner.</para>
-        <para>In previous Lustre versions, the static obd_timeout (/proc/sys/lustre/timeout) value was used as the maximum completion time for all RPCs; this value also affected the client-server ping interval and initial recovery timer. Now, with adaptive timeouts, obd_timeout is only used for the ping interval and initial recovery estimate. When a client reconnects during recovery, the server uses the client&apos;s timeout value to reset the recovery wait period; i.e., the server learns how long the client had been willing to wait, and takes this into account when adjusting the recovery period.</para>
-        <section remap="h4">
-          <title>31.1.3.1 Configuring <anchor xml:id="dbdoclet.50438271_marker-1293381" xreflabel=""/>Adaptive Timeouts</title>
-          <para>One of the goals of adaptive timeouts is to relieve users from having to tune the obd_timeout value. In general, obd_timeout should no longer need to be changed. However, there are several parameters related to adaptive timeouts that users can set. In most situations, the default values should be used.</para>
-          <para>The following parameters can be set persistently system-wide using lctl conf_param on the MGS. For example, lctl conf_param work1.sys.at_max=1500 sets the at_max value for all servers and clients using the work1 file system.</para>
-                  <note><para>Nodes using multiple Lustre file systems must use the same at_* values for all file systems.)</para></note>
-           <informaltable frame="all">
-            <tgroup cols="2">
-              <colspec colname="c1" colwidth="50*"/>
-              <colspec colname="c2" colwidth="50*"/>
-              <thead>
-                <row>
-                  <entry><para><emphasis role="bold">Parameter</emphasis></para></entry>
-                  <entry><para><emphasis role="bold">Description</emphasis></para></entry>
-                </row>
-              </thead>
-              <tbody>
-                <row>
-                  <entry><para> <emphasis role="bold">at_min</emphasis></para></entry>
-                  <entry><para> Sets the minimum adaptive timeout (in seconds). Default value is 0. The at_min parameter is the minimum processing time that a server will report. Clients base their timeouts on this value, but they do not use this value directly. If you experience cases in which, for unknown reasons, the adaptive timeout value is too short and clients time out their RPCs (usually due to temporary network outages), then you can increase the at_min value to compensate for this. Ideally, users should leave at_min set to its default.</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">at_max</emphasis></para></entry>
-                  <entry><para> Sets the maximum adaptive timeout (in seconds). The at_max parameter is an upper-limit on the service time estimate, and is used as a &apos;failsafe&apos; in case of rogue/bad/buggy code that would lead to never-ending estimate increases. If at_max is reached, an RPC request is considered &apos;broken&apos; and should time out.</para><para>Setting at_max to 0 causes adaptive timeouts to be disabled and the old fixed-timeout method (obd_timeout) to be used. This is the default value in Lustre 1.6.5.</para><para> </para>
-                      
-                      <note><para>It is possible that slow hardware might validly cause the service estimate to increase beyond the default value of at_max. In this case, you should increase at_max to the maximum time you are willing to wait for an RPC completion.</para></note></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">at_history</emphasis></para></entry>
-                  <entry><para> Sets a time period (in seconds) within which adaptive timeouts remember the slowest event that occurred. Default value is 600.</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">at_early_margin</emphasis></para></entry>
-                  <entry><para> Sets how far before the deadline Lustre sends an early reply. Default value is 5<footnote><para>This default was chosen as a reasonable time in which to send a reply from the point at which it was sent.</para></footnote>.</para></entry>
-          
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">at_extra</emphasis></para></entry>
-                  <entry><para> Sets the incremental amount of time that a server asks for, with each early reply. The server does not know how much time the RPC will take, so it asks for a fixed value. Default value is 30<footnote><para>This default was chosen as a balance between sending too many early replies for the same RPC and overestimating the actual completion time</para></footnote>. When a server finds a queued request about to time out (and needs to send an early reply out), the server adds the at_extra value. If the time expires, the Lustre client enters recovery status and reconnects to restore it to normal status.</para><para>If you see multiple early replies for the same RPC asking for multiple 30-second increases, change the at_extra value to a larger number to cut down on early replies sent and, therefore, network load.</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">ldlm_enqueue_min</emphasis></para></entry>
-                  <entry><para> Sets the minimum lock enqueue time. Default value is 100. The ldlm_enqueue time is the maximum of the measured enqueue estimate (influenced by at_min and at_max parameters), multiplied by a weighting factor, and the ldlm_enqueue_min setting. LDLM lock enqueues were based on the obd_timeout value; now they have a dedicated minimum value. Lock enqueues increase as the measured enqueue times increase (similar to adaptive timeouts).</para></entry>
-                </row>
-              </tbody>
-            </tgroup>
-          </informaltable>
-          <para>Adaptive timeouts are enabled, by default. To disable adaptive timeouts, at run time, set at_max to 0. On the MGS, run:</para>
-          <screen>$ lctl conf_param &lt;fsname&gt;.sys.at_max=0
-</screen>
-                  <note><para>Changing adaptive timeouts status at runtime may cause transient timeout, reconnect, recovery, etc.</para></note>
-        </section>
-        <section remap="h4">
-          <title>31.1.3.2 Interpreting <anchor xml:id="dbdoclet.50438271_marker-1293383" xreflabel=""/>Adaptive Timeouts Information</title>
-          <para>Adaptive timeouts information can be read from /proc/fs/lustre/*/timeouts files (for each service and client) or with the lctl command.</para>
-          <para>This is an example from the /proc/fs/lustre/*/timeouts files:</para>
-          <screen>cfs21:~# cat /proc/fs/lustre/ost/OSS/ost_io/timeouts
-</screen>
-          <para>This is an example using the lctl command:</para>
-          <screen>$ lctl get_param -n ost.*.ost_io.timeouts
-</screen>
-          <para>This is the sample output:</para>
-          <screen>service : cur 33  worst 34 (at 1193427052, 0d0h26m40s ago) 1 1 33 2
-</screen>
-          <para>The ost_io service on this node is currently reporting an estimate of 33 seconds. The worst RPC service time was 34 seconds, and it happened 26 minutes ago.</para>
-          <para>The output also provides a history of service times. In the example, there are 4 &quot;bins&quot; of adaptive_timeout_history, with the maximum RPC time in each bin reported. In 0-150 seconds, the maximum RPC time was 1, with the same result in 150-300 seconds. From 300-450 seconds, the worst (maximum) RPC time was 33 seconds, and from 450-600s the worst time was 2 seconds. The current estimated service time is the maximum value of the 4 bins (33 seconds in this example).</para>
-          <para>Service times (as reported by the servers) are also tracked in the client OBDs:</para>
-          <screen>cfs21:# lctl get_param osc.*.timeouts
-last reply : 1193428639, 0d0h00m00s ago
-network    : cur   1  worst   2 (at 1193427053, 0d0h26m26s ago)   1   1   1\
-   1
-portal 6   : cur  33  worst  34 (at 1193427052, 0d0h26m27s ago)  33  33  33\
-   2
-portal 28  : cur   1  worst   1 (at 1193426141, 0d0h41m38s ago)   1   1   1\
-   1
-portal 7   : cur   1  worst   1 (at 1193426141, 0d0h41m38s ago)   1   0   1\
-   1
-portal 17  : cur   1  worst   1 (at 1193426177, 0d0h41m02s ago)   1   0   0\
-   1
-</screen>
-          <para>In this case, RPCs to portal 6, the OST_IO_PORTAL (see lustre/include/lustre/lustre_idl.h), shows the history of what the ost_io portal has reported as the service estimate.</para>
-          <para>Server statistic files also show the range of estimates in the normal min/max/sum/sumsq manner.</para>
-          <screen>cfs21:~# lctl get_param mdt.*.mdt.stats
-...
-req_timeout               6 samples [sec] 1 10 15 105
-...
-</screen>
-        </section>
-      </section>
-      <section remap="h3">
-        <title>31.1.4 LNET <anchor xml:id="dbdoclet.50438271_marker-1296164" xreflabel=""/>Information</title>
-        <para>This section describes /proc entries for LNET information.</para>
-        <para><emphasis role="bold">/proc/sys/lnet/peers</emphasis></para>
-        <para>Shows all NIDs known to this node and also gives information on the queue state.</para>
-        <screen># cat /proc/sys/lnet/peers
-nid                        refs            state           max             \
-rtr             min             tx              min             queue
-0@lo                       1               ~rtr            0               \
-0               0               0               0               0
-192.168.10.35@tcp  1               ~rtr            8               8       \
-        8               8               6               0
-192.168.10.36@tcp  1               ~rtr            8               8       \
-        8               8               6               0
-192.168.10.37@tcp  1               ~rtr            8               8       \
-        8               8               6               0
-</screen>
-        <para>The fields are explained below:</para>
+10 UP osc lustre-OST0001-osc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa05</screen>
+      <para>Or from the device label at any time:</para>
+      <screen># e2label /dev/sda
+lustre-MDT0000</screen>
+    </section>
+    <section remap="h3">
+      <title>31.1.2 Lustre Timeouts</title>
+      <para>Lustre uses two types of timeouts.</para>
+      <itemizedlist>
+        <listitem>
+          <para>LND timeouts that ensure point-to-point communications complete in finite time in the presence of failures. These timeouts are logged with the <literal>S_LND</literal> flag set. They may <emphasis>not</emphasis> be printed as console messages, so you should check the Lustre log for <literal>D_NETERROR</literal> messages, or enable printing of <literal>D_NETERROR</literal> messages to the console (<literal>echo + neterror &gt; /proc/sys/lnet/printk</literal>).</para>
+        </listitem>
+      </itemizedlist>
+      <para>Congested routers can be a source of spurious LND timeouts. To avoid this, increase the number of LNET router buffers to reduce back-pressure and/or increase LND timeouts on all nodes on all connected networks. You should also consider increasing the total number of LNET router nodes in the system so that the aggregate router bandwidth matches the aggregate server bandwidth.</para>
+      <itemizedlist>
+        <listitem>
+          <para>Lustre timeouts that ensure Lustre RPCs complete in finite time in the presence of failures. These timeouts should <emphasis>always</emphasis> be printed as console messages. If Lustre timeouts are not accompanied by LNET timeouts, then you need to increase the Lustre timeout on both servers and clients.</para>
+        </listitem>
+      </itemizedlist>
+      <para>Specific Lustre timeouts are described below.</para>
+      <para><literal>
+          <emphasis role="bold">/proc/sys/lustre/timeout</emphasis>
+        </literal></para>
+      <para>This is the time period that a client waits for a server to complete an RPC (default is 100s). Servers wait half of this time for a normal client RPC to complete and a quarter of this time for a single bulk request (read or write of up to 1 MB) to complete. The client pings recoverable targets (MDS and OSTs) at one quarter of the timeout, and the server waits one and a half times the timeout before evicting a client for being &quot;stale.&quot;</para>
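+      <para>As a rough illustration of the ratios described above, the following Python sketch derives each interval from the static timeout. The function name <literal>derived_timeouts</literal> and the dictionary keys are hypothetical labels chosen for this example; they are not Lustre parameters.</para>

```python
def derived_timeouts(obd_timeout=100):
    """Return the intervals (in seconds) that the text derives from the timeout.

    The key names are illustrative only; Lustre does not expose them under
    these names.
    """
    return {
        # Client waits the full timeout for a server to complete an RPC.
        "client_rpc_wait": float(obd_timeout),
        # Server waits half the timeout for a normal client RPC.
        "server_rpc_wait": obd_timeout / 2,
        # Server waits a quarter of the timeout for a single bulk request.
        "server_bulk_wait": obd_timeout / 4,
        # Client pings recoverable targets at a quarter of the timeout.
        "client_ping_interval": obd_timeout / 4,
        # Server waits one and a half times the timeout before eviction.
        "server_eviction_wait": obd_timeout * 1.5,
    }


if __name__ == "__main__":
    for name, seconds in derived_timeouts(100).items():
        print(f"{name}: {seconds:g}s")
```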
+      <note>
+        <para>Lustre sends periodic &apos;PING&apos; messages to servers with which it has had no communication for a specified period of time. Any network activity on the file system that triggers network traffic toward servers also works as a health check.</para>
+      </note>
+      <para><literal>
+          <emphasis role="bold">/proc/sys/lustre/ldlm_timeout</emphasis>
+        </literal></para>
+      <para>This is the time period for which a server will wait for a client to reply to an initial AST (lock cancellation request). The default is 20s for an OST and 6s for an MDS. If the client replies to the AST, the server will give it a normal timeout (half of the client timeout) to flush any dirty data and release the lock.</para>
+      <para><literal>
+          <emphasis role="bold">/proc/sys/lustre/fail_loc</emphasis>
+        </literal></para>
+      <para>This is the internal debugging failure hook.</para>
+      <para>See <literal>lustre/include/linux/obd_support.h</literal> for the definitions of individual failure locations. The default value is 0 (zero).</para>
+      <screen>sysctl -w lustre.fail_loc=0x80000122 # drop a single reply</screen>
+      <para><literal>
+          <emphasis role="bold">/proc/sys/lustre/dump_on_timeout</emphasis>
+        </literal></para>
+      <para>This triggers dumps of the Lustre debug log when timeouts occur. The default value is 0 (zero).</para>
+      <para><literal>
+          <emphasis role="bold">/proc/sys/lustre/dump_on_eviction</emphasis>
+        </literal></para>
+      <para>This triggers dumps of the Lustre debug log when an eviction occurs. The default value is 0 (zero). By default, debug logs are dumped to the /tmp folder; this location can be changed via /proc.</para>
+    </section>
+    <section remap="h3">
+      <title>31.1.3 Adaptive Timeouts</title>
+      <para>Lustre offers an adaptive mechanism to set RPC timeouts. The adaptive timeouts feature (enabled, by default) causes servers to track actual RPC completion times, and to report estimated completion times for future RPCs back to clients. The clients use these estimates to set their future RPC timeout values. If server request processing slows down for any reason, the RPC completion estimates increase, and the clients allow more time for RPC completion.</para>
+      <para>If RPCs queued on the server approach their timeouts, then the server sends an early reply to the client, telling the client to allow more time. In this manner, clients avoid RPC timeouts and disconnect/reconnect cycles. Conversely, as a server speeds up, RPC timeout values decrease, allowing faster detection of non-responsive servers and faster attempts to reconnect to a server&apos;s failover partner.</para>
+      <para>In previous Lustre versions, the static obd_timeout (<literal>/proc/sys/lustre/timeout</literal>) value was used as the maximum completion time for all RPCs; this value also affected the client-server ping interval and initial recovery timer. Now, with adaptive timeouts, obd_timeout is only used for the ping interval and initial recovery estimate. When a client reconnects during recovery, the server uses the client&apos;s timeout value to reset the recovery wait period; i.e., the server learns how long the client had been willing to wait, and takes this into account when adjusting the recovery period.</para>
+      <section remap="h4">
+        <title>31.1.3.1 Configuring Adaptive Timeouts</title>
+        <para>One of the goals of adaptive timeouts is to relieve users from having to tune the <literal>obd_timeout</literal> value. In general, <literal>obd_timeout</literal> should no longer need to be changed. However, there are several parameters related to adaptive timeouts that users can set. In most situations, the default values should be used.</para>
+        <para>The following parameters can be set persistently system-wide using <literal>lctl conf_param</literal> on the MGS. For example, <literal>lctl conf_param work1.sys.at_max=1500</literal> sets the at_max value for all servers and clients using the work1 file system.</para>
+        <note>
+          <para>Nodes using multiple Lustre file systems must use the same <literal>at_*</literal> values for all file systems.</para>
+        </note>
          <informaltable frame="all">
            <tgroup cols="2">
              <colspec colname="c1" colwidth="50*"/>
              <colspec colname="c2" colwidth="50*"/>
              <thead>
                <row>
-                <entry><para><emphasis role="bold">Field</emphasis></para></entry>
-                <entry><para><emphasis role="bold">Description</emphasis></para></entry>
+                <entry>
+                  <para><emphasis role="bold">Parameter</emphasis></para>
+                </entry>
+                <entry>
+                  <para><emphasis role="bold">Description</emphasis></para>
+                </entry>
                </row>
              </thead>
              <tbody>
                <row>
-                <entry><para> <emphasis role="bold">refs</emphasis></para></entry>
-                <entry><para> A reference count (principally used for debugging)</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">state</emphasis></para></entry>
-                <entry><para> Only valid to refer to routers. Possible values:</para><itemizedlist><listitem>
-                      <para> ~ rtr (indicates this node is not a router)</para>
-                    </listitem>
-<listitem>
-                      <para> up/down (indicates this node is a router)</para>
-                    </listitem>
-<listitem>
-                      <para> auto_fail must be enabled</para>
-                    </listitem>
-</itemizedlist></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">max</emphasis></para></entry>
-                <entry><para> Maximum number of concurrent sends from this peer</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">rtr</emphasis></para></entry>
-                <entry><para> Routing buffer credits.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">min</emphasis></para></entry>
-                <entry><para> Minimum routing buffer credits seen.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">tx</emphasis></para></entry>
-                <entry><para> Send credits.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">min</emphasis></para></entry>
-                <entry><para> Minimum send credits seen.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">queue</emphasis></para></entry>
-                <entry><para> Total bytes in active/queued sends.</para></entry>
+                <entry>
+                  <para> <literal>
+                      <emphasis role="bold">at_min</emphasis>
+                    </literal></para>
+                </entry>
+                <entry>
+                  <para>Sets the minimum adaptive timeout (in seconds). Default value is 0. The at_min parameter is the minimum processing time that a server will report. Clients base their timeouts on this value, but they do not use this value directly. If you experience cases in which, for unknown reasons, the adaptive timeout value is too short and clients time out their RPCs (usually due to temporary network outages), then you can increase the at_min value to compensate for this. Ideally, users should leave at_min set to its default.</para>
+                </entry>
+              </row>
+              <row>
+                <entry>
+                  <para> <literal>
+                      <emphasis role="bold">at_max</emphasis>
+                    </literal></para>
+                </entry>
+                <entry>
+                  <para>Sets the maximum adaptive timeout (in seconds). The <literal>at_max</literal> parameter is an upper limit on the service time estimate, and is used as a &apos;failsafe&apos; in case of rogue/bad/buggy code that would lead to never-ending estimate increases. If at_max is reached, an RPC request is considered &apos;broken&apos; and should time out.</para>
+                  <para>Setting at_max to 0 causes adaptive timeouts to be disabled and the old fixed-timeout method (<literal>obd_timeout</literal>) to be used. This is the default value in Lustre 1.6.5.</para>
+                  <note>
+                    <para>It is possible that slow hardware might validly cause the service estimate to increase beyond the default value of at_max. In this case, you should increase at_max to the maximum time you are willing to wait for an RPC completion.</para>
+                  </note>
+                </entry>
+              </row>
+              <row>
+                <entry>
+                  <para> <literal>
+                      <emphasis role="bold">at_history</emphasis>
+                    </literal></para>
+                </entry>
+                <entry>
+                  <para>Sets a time period (in seconds) within which adaptive timeouts remember the slowest event that occurred. Default value is 600.</para>
+                </entry>
+              </row>
+              <row>
+                <entry>
+                  <para> <literal>
+                      <emphasis role="bold">at_early_margin</emphasis>
+                    </literal></para>
+                </entry>
+                <entry>
+                  <para>Sets how far before the deadline Lustre sends an early reply. Default value is 5<footnote>
+                      <para>This default was chosen as a reasonable time in which to send a reply from the point at which it was sent.</para>
+                    </footnote>.</para>
+                </entry>
+              </row>
+              <row>
+                <entry>
+                  <para> <literal>
+                      <emphasis role="bold">at_extra</emphasis>
+                    </literal></para>
+                </entry>
+                <entry>
+                  <para>Sets the incremental amount of time that a server asks for, with each early reply. The server does not know how much time the RPC will take, so it asks for a fixed value. Default value is 30<footnote>
+                      <para>This default was chosen as a balance between sending too many early replies for the same RPC and overestimating the actual completion time.</para>
+                    </footnote>. When a server finds a queued request about to time out (and needs to send an early reply out), the server adds the at_extra value. If the time expires, the Lustre client enters recovery status and reconnects to restore it to normal status.</para>
+                  <para>If you see multiple early replies for the same RPC asking for multiple 30-second increases, change the at_extra value to a larger number to cut down on early replies sent and, therefore, network load.</para>
+                </entry>
+              </row>
+              <row>
+                <entry>
+                  <para> <literal>
+                      <emphasis role="bold">ldlm_enqueue_min</emphasis>
+                    </literal></para>
+                </entry>
+                <entry>
+                  <para> Sets the minimum lock enqueue time. Default value is 100. The <literal>ldlm_enqueue</literal> time is the maximum of the measured enqueue estimate (influenced by at_min and at_max parameters), multiplied by a weighting factor, and the <literal>ldlm_enqueue_min</literal> setting. LDLM lock enqueues were based on the <literal>obd_timeout</literal> value; now they have a dedicated minimum value. Lock enqueues increase as the measured enqueue times increase (similar to adaptive timeouts).</para>
+                </entry>
                </row>
              </tbody>
            </tgroup>
          </informaltable>
-        <para>Credits work like a semaphore. At start they are initialized to allow a certain number of operations (8 in this example). LNET keeps a track of the minimum value so that you can see how congested a resource was.</para>
-        <para>If rtr/tx is less than max, there are operations in progress. The number of operations is equal to rtr or tx subtracted from max.</para>
-        <para>If rtr/tx is greater that max, there are operations blocking.</para>
-        <para>LNET also limits concurrent sends and router buffers allocated to a single peer so that no peer can occupy all these resources.</para>
-        <para><emphasis role="bold">/proc/sys/lnet/nis</emphasis></para>
-        <screen># cat /proc/sys/lnet/nis
-nid                                refs            peer            max     \
-        tx              min
-0@lo                               3               0               0       \
-        0               0
-192.168.10.34@tcp          4               8               256             \
-256             252
+        <para>Adaptive timeouts are enabled by default. To disable adaptive timeouts at run time, set <literal>at_max</literal> to 0. On the MGS, run:</para>
+        <screen>$ lctl conf_param &lt;fsname&gt;.sys.at_max=0</screen>
+        <note>
+          <para>Changing the status of adaptive timeouts at run time may cause transient timeouts, reconnects, and recovery.</para>
+        </note>
+      </section>
+      <section remap="h4">
+        <title>31.1.3.2 Interpreting Adaptive Timeouts Information</title>
+        <para>Adaptive timeouts information can be read from <literal>/proc/fs/lustre/*/timeouts</literal> files (for each service and client) or with the lctl command.</para>
+        <para>This is an example from the <literal>/proc/fs/lustre/*/timeouts</literal> files:</para>
+        <screen>cfs21:~# cat /proc/fs/lustre/ost/OSS/ost_io/timeouts</screen>
+        <para>This is an example using the <literal>lctl</literal> command:</para>
+        <screen>$ lctl get_param -n ost.*.ost_io.timeouts</screen>
+        <para>This is the sample output:</para>
+        <screen>service : cur 33  worst 34 (at 1193427052, 0d0h26m40s ago) 1 1 33 2</screen>
+        <para>The <literal>ost_io</literal> service on this node is currently reporting an estimate of 33 seconds. The worst RPC service time was 34 seconds, and it happened 26 minutes ago.</para>
+        <para>The output also provides a history of service times. In the example, there are 4 &quot;bins&quot; of <literal>adaptive_timeout_history</literal>, with the maximum RPC time in each bin reported. In 0-150 seconds, the maximum RPC time was 1, with the same result in 150-300 seconds. From 300-450 seconds, the worst (maximum) RPC time was 33 seconds, and from 450-600s the worst time was 2 seconds. The current estimated service time is the maximum value of the 4 bins (33 seconds in this example).</para>
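+        <para>The reading of the sample output above can be sketched in Python. This is a minimal parser written against the single sample line shown in this section; the regular expression, the function name <literal>parse_timeouts_line</literal> and the returned keys are assumptions made for illustration, and the real file format may vary between Lustre versions.</para>

```python
import re

# Sample line copied from the output shown in this section.
SAMPLE = "service : cur 33  worst 34 (at 1193427052, 0d0h26m40s ago) 1 1 33 2"

# Pattern derived only from the sample above (an assumption, not a format spec).
PATTERN = re.compile(
    r"^(?P<name>.+?)\s*:\s*cur\s+(?P<cur>\d+)\s+worst\s+(?P<worst>\d+)\s+"
    r"\(at\s+(?P<at>\d+),\s*(?P<ago>\S+)\s+ago\)\s+(?P<bins>[\d\s]+)$"
)


def parse_timeouts_line(line):
    """Split one timeouts line into its current, worst, and history values."""
    match = PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized timeouts line: {line!r}")
    return {
        "name": match.group("name"),
        "cur": int(match.group("cur")),
        "worst": int(match.group("worst")),
        "worst_at": int(match.group("at")),  # epoch seconds of the worst event
        "ago": match.group("ago"),
        "bins": [int(value) for value in match.group("bins").split()],
    }


if __name__ == "__main__":
    info = parse_timeouts_line(SAMPLE)
    # The current estimate is the maximum over the history bins.
    print(info["name"], "estimate:", max(info["bins"]), "bins:", info["bins"])
```

+        <para>With the default history of 600 seconds and 4 bins, each bin in this sketch covers 150 seconds, which matches the ranges described above.</para>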
+        <para>Service times (as reported by the servers) are also tracked in the client OBDs:</para>
+        <screen>cfs21:# lctl get_param osc.*.timeouts
+last reply : 1193428639, 0d0h00m00s ago
+network    : cur   1  worst   2 (at 1193427053, 0d0h26m26s ago)   1   1   1   1
+portal 6   : cur  33  worst  34 (at 1193427052, 0d0h26m27s ago)  33  33  33   2
+portal 28  : cur   1  worst   1 (at 1193426141, 0d0h41m38s ago)   1   1   1   1
+portal 7   : cur   1  worst   1 (at 1193426141, 0d0h41m38s ago)   1   0   1   1
+portal 17  : cur   1  worst   1 (at 1193426177, 0d0h41m02s ago)   1   0   0   1
  </screen>
-        <para>Shows the current queue health on this node. The fields are explained below:</para>
-        <informaltable frame="all">
-          <tgroup cols="2">
-            <colspec colname="c1" colwidth="50*"/>
-            <colspec colname="c2" colwidth="50*"/>
-            <thead>
-              <row>
-                <entry><para><emphasis role="bold">Field</emphasis></para></entry>
-                <entry><para><emphasis role="bold">Description</emphasis></para></entry>
-              </row>
-            </thead>
-            <tbody>
-              <row>
-                <entry><para> <emphasis role="bold">nid</emphasis></para></entry>
-                <entry><para> Network interface</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">refs</emphasis></para></entry>
-                <entry><para> Internal reference counter</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">peer</emphasis></para></entry>
-                <entry><para> Number of peer-to-peer send credits on this NID. Credits are used to size buffer pools</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">max</emphasis></para></entry>
-                <entry><para> Total number of send credits on this NID.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">tx</emphasis></para></entry>
-                <entry><para> Current number of send credits available on this NID.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">min</emphasis></para></entry>
-                <entry><para> Lowest number of send credits available on this NID.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">queue</emphasis></para></entry>
-                <entry><para> Total bytes in active/queued sends.</para></entry>
-              </row>
-            </tbody>
-          </tgroup>
-        </informaltable>
-        <para>Subtracting max - tx yields the number of sends currently active. A large or increasing number of active sends may indicate a problem.</para>
-        <screen># cat /proc/sys/lnet/nis
-nid                                refs            peer            max     \
-        tx              min
-0@lo                               2               0               0       \
-        0               0
-10.67.73.173@tcp           4               8               256             \
-256             253
+        <para>In this case, RPCs to portal 6, the <literal>OST_IO_PORTAL</literal> (see <literal>lustre/include/lustre/lustre_idl.h</literal>), show the history of what the <literal>ost_io</literal> portal has reported as the service estimate.</para>
+        <para>Server statistic files also show the range of estimates in the normal min/max/sum/sumsq manner.</para>
+        <screen>cfs21:~# lctl get_param mdt.*.mdt.stats
+...
+req_timeout               6 samples [sec] 1 10 15 105
+...
  </screen>
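+        <para>The min/max/sum/sumsq fields are enough to recover a mean and a standard deviation. The Python sketch below does this for the <literal>req_timeout</literal> line above, assuming the field order is name, sample count, the word <literal>samples</literal>, unit, min, max, sum, sumsq. The function name <literal>summarize_stat</literal> is hypothetical.</para>

```python
import math

# Line copied from the sample stats output in this section.
SAMPLE = "req_timeout               6 samples [sec] 1 10 15 105"


def summarize_stat(line):
    """Derive mean and population standard deviation from one stats line.

    Assumed field order: name, count, 'samples', [unit], min, max, sum, sumsq.
    """
    fields = line.split()
    name = fields[0]
    count = int(fields[1])
    unit = fields[3].strip("[]")
    minimum, maximum, total, sumsq = (int(value) for value in fields[4:8])
    mean = total / count
    # Population variance: E[x^2] - (E[x])^2, clamped against rounding error.
    variance = max(sumsq / count - mean ** 2, 0.0)
    return {
        "name": name,
        "unit": unit,
        "count": count,
        "min": minimum,
        "max": maximum,
        "mean": mean,
        "variance": variance,
        "stddev": math.sqrt(variance),
    }


if __name__ == "__main__":
    stats = summarize_stat(SAMPLE)
    print(f"{stats['name']}: mean {stats['mean']:.2f} {stats['unit']}, "
          f"stddev {stats['stddev']:.2f} {stats['unit']}")
```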
        </section>
-      <section remap="h3">
-        <title>31.1.5 Free Space <anchor xml:id="dbdoclet.50438271_marker-1296165" xreflabel=""/>Distribution</title>
-        <para>Free-space stripe weighting, as set, gives a priority of &quot;0&quot; to free space (versus trying to place the stripes &quot;widely&quot; -- nicely distributed across OSSs and OSTs to maximize network balancing). To adjust this priority (as a percentage), use the qos_prio_free proc tunable:</para>
-        <screen>$ cat /proc/fs/lustre/lov/&lt;fsname&gt;-mdtlov/qos_prio_free
+    </section>
+    <section remap="h3">
+      <title>31.1.4 LNET Information</title>
+      <para>This section describes <literal>/proc</literal> entries for LNET information.</para>
+      <para><literal>
+          <emphasis role="bold">/proc/sys/lnet/peers</emphasis>
+        </literal></para>
+      <para>Shows all NIDs known to this node and also gives information on the queue state.</para>
+      <screen># cat /proc/sys/lnet/peers
+nid                        refs            state           max             rtr             min             tx              min             queue
+0@lo                       1               ~rtr            0               0               0               0               0               0
+192.168.10.35@tcp  1               ~rtr            8               8               8               8               6               0
+192.168.10.36@tcp  1               ~rtr            8               8               8               8               6               0
+192.168.10.37@tcp  1               ~rtr            8               8               8               8               6               0</screen>
+      <para>The fields are explained below:</para>
+      <informaltable frame="all">
+        <tgroup cols="2">
+          <colspec colname="c1" colwidth="50*"/>
+          <colspec colname="c2" colwidth="50*"/>
+          <thead>
+            <row>
+              <entry>
+                <para><emphasis role="bold">Field</emphasis></para>
+              </entry>
+              <entry>
+                <para><emphasis role="bold">Description</emphasis></para>
+              </entry>
+            </row>
+          </thead>
+          <tbody>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>refs</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>A reference count (principally used for debugging)</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>state</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Only valid to refer to routers. Possible values:</para>
+                <itemizedlist>
+                  <listitem>
+                    <para>~ rtr (indicates this node is not a router)</para>
+                  </listitem>
+                  <listitem>
+                    <para>up/down (indicates this node is a router)</para>
+                  </listitem>
+                  <listitem>
+                    <para>auto_fail must be enabled</para>
+                  </listitem>
+                </itemizedlist>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">max</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Maximum number of concurrent sends from this peer</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">rtr</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Routing buffer credits.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>min</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Minimum routing buffer credits seen.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">tx</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Send credits.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>min</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Minimum send credits seen.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>queue</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Total bytes in active/queued sends.</para>
+              </entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </informaltable>
+      <para>Credits work like a semaphore. At start they are initialized to allow a certain number of operations (8 in this example). LNET keeps a track of the minimum value so that you can see how congested a resource was.</para>
+      <para>If <literal>rtr/tx</literal> is less than max, there are operations in progress. The number of operations is equal to <literal>rtr</literal> or <literal>tx</literal> subtracted from max.</para>
+      <para>If <literal>rtr/tx</literal> is greater than max, there are operations blocking.</para>
+      <para>LNET also limits concurrent sends and router buffers allocated to a single peer so that no peer can occupy all these resources.</para>
+      <para><literal>
+          <emphasis role="bold">/proc/sys/lnet/nis</emphasis>
+        </literal></para>
+      <screen># cat /proc/sys/lnet/nis
+nid                                refs            peer            max             tx              min
+0@lo                               3               0               0               0               0
+192.168.10.34@tcp          4               8               256             256             252
  </screen>
-        <para>Currently, the default is 90%. You can permanently set this value by running this command on the MGS:</para>
-        <screen>$ lctl conf_param &lt;fsname&gt;-MDT0000.lov.qos_prio_free=90
+      <para>Shows the current queue health on this node. The fields are explained below:</para>
+      <informaltable frame="all">
+        <tgroup cols="2">
+          <colspec colname="c1" colwidth="50*"/>
+          <colspec colname="c2" colwidth="50*"/>
+          <thead>
+            <row>
+              <entry>
+                <para><emphasis role="bold">Field</emphasis></para>
+              </entry>
+              <entry>
+                <para><emphasis role="bold">Description</emphasis></para>
+              </entry>
+            </row>
+          </thead>
+          <tbody>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">nid</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Network interface</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">refs</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Internal reference counter</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>peer</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Number of peer-to-peer send credits on this NID. Credits are used to size buffer pools.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">max</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Total number of send credits on this NID.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>tx</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Current number of send credits available on this NID.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>min</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Lowest number of send credits available on this NID.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>queue</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Total bytes in active/queued sends.</para>
+              </entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </informaltable>
+      <para>Subtracting <literal>tx</literal> from <literal>max</literal> yields the number of sends currently active. A large or increasing number of active sends may indicate a problem.</para>
+      <screen># cat /proc/sys/lnet/nis
+nid                                refs            peer            max             tx              min
+0@lo                               2               0               0               0               0
+10.67.73.173@tcp           4               8               256             256             253
  </screen>
-        <para>Setting the priority to 100% means that OSS distribution does not count in the weighting, but the stripe assignment is still done via weighting. If OST 2 has twice as much free space as OST 1, it is twice as likely to be used, but it is NOT guaranteed to be used.</para>
-        <para>Also note that free-space stripe weighting does not activate until two OSTs are imbalanced by more than 20%. Until then, a faster round-robin stripe allocator is used. (The new round-robin order also maximizes network balancing.)</para>
-        <section remap="h4">
-          <title>31.1.5.1 Managing Stripe Allocation</title>
-          <para>The MDS uses two methods to manage stripe allocation and determine which OSTs to use for file object storage:</para>
-          <itemizedlist><listitem>
-              <para><emphasis role="bold">QOS</emphasis></para>
-            </listitem>
-
-</itemizedlist>
-          <para>Quality of Service (QOS) considers an OST's available blocks, speed, and the number of existing objects, etc. Using these criteria, the MDS selects OSTs with more free space more often than OSTs with less free space.</para>
-          <itemizedlist><listitem>
-              <para><emphasis role="bold">RR</emphasis></para>
-            </listitem>
-
-</itemizedlist>
-          <para>Round-Robin (RR) allocates objects evenly across all OSTs. The RR stripe allocator is faster than QOS, and used often because it distributes space usage/load best in most situations, maximizing network balancing and improving performance.</para>
-          <para>Whether QOS or RR is used depends on the setting of the qos_threshold_rr proc tunable. The qos_threshold_rr variable specifies a percentage threshold where the use of QOS or RR becomes more/less likely. The qos_threshold_rr tunable can be set as an integer, from 0 to 100, and results in this stripe allocation behavior:</para>
-          <itemizedlist><listitem>
-              <para> If qos_threshold_rr is set to 0, then QOS is always used</para>
-            </listitem>
-
-<listitem>
-              <para> If qos_threshold_rr is set to 100, then RR is always used</para>
-            </listitem>
-
-<listitem>
-              <para> The larger the qos_threshold_rr setting, the greater the possibility that RR is used instead of QOS</para>
-            </listitem>
-
-</itemizedlist>
-        </section>
+    </section>
+    <section remap="h3">
+      <title>31.1.5 Free Space Distribution</title>
+      <para>Free-space stripe weighting, as set, gives a priority of &quot;0&quot; to free space (versus trying to place the stripes &quot;widely&quot; -- nicely distributed across OSSs and OSTs to maximize network balancing). To adjust this priority (as a percentage), use the <literal>qos_prio_free</literal> proc tunable:</para>
+      <screen>$ cat /proc/fs/lustre/lov/&lt;fsname&gt;-mdtlov/qos_prio_free</screen>
+      <para>Currently, the default is 90%. You can permanently set this value by running this command on the MGS:</para>
+      <screen>$ lctl conf_param &lt;fsname&gt;-MDT0000.lov.qos_prio_free=90</screen>
+      <para>Setting the priority to 100% means that OSS distribution does not count in the weighting, but the stripe assignment is still done via weighting. If OST 2 has twice as much free space as OST 1, it is twice as likely to be used, but it is NOT guaranteed to be used.</para>
+      <para>Also note that free-space stripe weighting does not activate until two OSTs are imbalanced by more than 20%. Until then, a faster round-robin stripe allocator is used. (The new round-robin order also maximizes network balancing.)</para>
+      <section remap="h4">
+        <title>31.1.5.1 Managing Stripe Allocation</title>
+        <para>The MDS uses two methods to manage stripe allocation and determine which OSTs to use for file object storage:</para>
+        <itemizedlist>
+          <listitem>
+            <para><emphasis role="bold">QOS</emphasis></para>
+            <para>Quality of Service (QOS) considers an OST&apos;s available blocks, speed, and the number of existing objects, etc. Using these criteria, the MDS selects OSTs with more free space more often than OSTs with less free space.</para>
+          </listitem>
+        </itemizedlist>
+        <itemizedlist>
+          <listitem>
+            <para><emphasis role="bold">RR</emphasis></para>
+            <para>Round-Robin (RR) allocates objects evenly across all OSTs. The RR stripe allocator is faster than QOS, and used often because it distributes space usage/load best in most situations, maximizing network balancing and improving performance.</para>
+          </listitem>
+        </itemizedlist>
+        <para>Whether QOS or RR is used depends on the setting of the <literal>qos_threshold_rr</literal> proc tunable. The <literal>qos_threshold_rr</literal> variable specifies a percentage threshold where the use of QOS or RR becomes more/less likely. The <literal>qos_threshold_rr</literal> tunable can be set as an integer, from 0 to 100, and results in this stripe allocation behavior:</para>
+        <itemizedlist>
+          <listitem>
+            <para> If <literal>qos_threshold_rr</literal> is set to 0, then QOS is always used</para>
+          </listitem>
+          <listitem>
+            <para> If <literal>qos_threshold_rr</literal> is set to 100, then RR is always used</para>
+          </listitem>
+          <listitem>
+            <para> The larger the <literal>qos_threshold_rr</literal> setting, the greater the possibility that RR is used instead of QOS</para>
+          </listitem>
+        </itemizedlist>
        </section>
      </section>
-    <section xml:id="dbdoclet.50438271_78950">
-      <title>31.2 Lustre I/O <anchor xml:id="dbdoclet.50438271_marker-1290508" xreflabel=""/>Tunables</title>
-      <para>The section describes I/O tunables.</para>
-      <para><emphasis role="bold">/proc/fs/lustre/llite/&lt;fsname&gt;-&lt;uid&gt;/max_cache_mb</emphasis></para>
-      <screen># cat /proc/fs/lustre/llite/lustre-ce63ca00/max_cached_mb 128
-</screen>
-      <para>This tunable is the maximum amount of inactive data cached by the client (default is 3/4 of RAM).</para>
-      <section remap="h3">
-        <title>31.2.1 Client I/O RPC<anchor xml:id="dbdoclet.50438271_marker-1290514" xreflabel=""/> Stream Tunables</title>
-        <para>The Lustre engine always attempts to pack an optimal amount of data into each I/O RPC and attempts to keep a consistent number of issued RPCs in progress at a time. Lustre exposes several tuning variables to adjust behavior according to network conditions and cluster size. Each OSC has its own tree of these tunables. For example:</para>
-        <screen>$ ls -d /proc/fs/lustre/osc/OSC_client_ost1_MNT_client_2 /localhost
+  </section>
+  <section xml:id="dbdoclet.50438271_78950">
+    <title>31.2 Lustre I/O Tunables</title>
+    <para>This section describes I/O tunables.</para>
+    <para><literal>
+        <emphasis role="bold">/proc/fs/lustre/llite/&lt;fsname&gt;-&lt;uid&gt;/max_cached_mb</emphasis>
+      </literal></para>
+    <screen># cat /proc/fs/lustre/llite/lustre-ce63ca00/max_cached_mb 128</screen>
+    <para>This tunable is the maximum amount of inactive data cached by the client (default is 3/4 of RAM).</para>
+    <section remap="h3">
+      <title>31.2.1 Client I/O RPC Stream Tunables</title>
+      <para>The Lustre engine always attempts to pack an optimal amount of data into each I/O RPC and attempts to keep a consistent number of issued RPCs in progress at a time. Lustre exposes several tuning variables to adjust behavior according to network conditions and cluster size. Each OSC has its own tree of these tunables. For example:</para>
+      <screen>$ ls -d /proc/fs/lustre/osc/OSC_client_ost1_MNT_client_2 /localhost
  /proc/fs/lustre/osc/OSC_uml0_ost1_MNT_localhost
  /proc/fs/lustre/osc/OSC_uml0_ost2_MNT_localhost
  /proc/fs/lustre/osc/OSC_uml0_ost3_MNT_localhost
  $ ls /proc/fs/lustre/osc/OSC_uml0_ost1_MNT_localhost
-blocksizefilesfree max_dirty_mb ost_server_uuid stats
-</screen>
-        <para>... and so on.</para>
-        <para>RPC stream tunables are described below.</para>
-        <para><emphasis role="bold">/proc/fs/lustre/osc/&lt;object name&gt;/max_dirty_mb</emphasis></para>
-        <para>This tunable controls how many MBs of dirty data can be written and queued up in the OSC. POSIX file writes that are cached contribute to this count. When the limit is reached, additional writes stall until previously-cached writes are written to the server. This may be changed by writing a single ASCII integer to the file. Only values between 0 and 512 are allowable. If 0 is given, no writes are cached. Performance suffers noticeably unless you use large writes (1 MB or more).</para>
-        <para><emphasis role="bold">/proc/fs/lustre/osc/&lt;object name&gt;/cur_dirty_bytes</emphasis></para>
-        <para>This tunable is a read-only value that returns the current amount of bytes written and cached on this OSC.</para>
-        <para><emphasis role="bold">/proc/fs/lustre/osc/&lt;object name&gt;/max_pages_per_rpc</emphasis></para>
-        <para>This tunable is the maximum number of pages that will undergo I/O in a single RPC to the OST. The minimum is a single page and the maximum for this setting is platform dependent (256 for i386/x86_64, possibly less for ia64/PPC with larger PAGE_SIZE), though generally amounts to a total of 1 MB in the RPC.</para>
-        <para><emphasis role="bold">/proc/fs/lustre/osc/&lt;object name&gt;/max_rpcs_in_flight</emphasis></para>
-        <para>This tunable is the maximum number of concurrent RPCs in flight from an OSC to its OST. If the OSC tries to initiate an RPC but finds that it already has the same number of RPCs outstanding, it will wait to issue further RPCs until some complete. The minimum setting is 1 and maximum setting is 32. If you are looking to improve small file I/O performance, increase the max_rpcs_in_flight value.</para>
-        <para>To maximize performance, the value for max_dirty_mb is recommended to be 4 * max_pages_per_rpc * max_rpcs_in_flight.</para>
-                <note><para>The &lt;object name&gt; varies depending on the specific Lustre configuration. For &lt;object name&gt; examples, refer to the sample command output.</para></note>
-      </section>
-      <section remap="h3">
-        <title>31.2.2 Watching the <anchor xml:id="dbdoclet.50438271_marker-1290535" xreflabel=""/>Client RPC Stream</title>
-        <para>The same directory contains a rpc_stats file with a histogram showing the composition of previous RPCs. The histogram can be cleared by writing any value into the rpc_stats file.</para>
-        <screen># cat /proc/fs/lustre/osc/spfs-OST0000-osc-c45f9c00/rpc_stats
-snapshot_time:                                     1174867307.156604 (secs.\
-usecs)
+blocksizefilesfree max_dirty_mb ost_server_uuid stats</screen>
+      <para>... and so on.</para>
+      <para>RPC stream tunables are described below.</para>
+      <para><literal>
+          <emphasis role="bold">/proc/fs/lustre/osc/&lt;object name&gt;/max_dirty_mb</emphasis>
+        </literal></para>
+      <para>This tunable controls how many MBs of dirty data can be written and queued up in the OSC. POSIX file writes that are cached contribute to this count. When the limit is reached, additional writes stall until previously-cached writes are written to the server. This may be changed by writing a single ASCII integer to the file. Only values between 0 and 512 are allowable. If 0 is given, no writes are cached. Performance suffers noticeably unless you use large writes (1 MB or more).</para>
+      <para><literal>
+          <emphasis role="bold">/proc/fs/lustre/osc/&lt;object name&gt;/cur_dirty_bytes</emphasis>
+        </literal></para>
+      <para>This tunable is a read-only value that returns the current number of bytes written and cached on this OSC.</para>
+      <para><literal>
+          <emphasis role="bold">/proc/fs/lustre/osc/&lt;object name&gt;/max_pages_per_rpc</emphasis>
+        </literal></para>
+      <para>This tunable is the maximum number of pages that will undergo I/O in a single RPC to the OST. The minimum is a single page and the maximum for this setting is platform dependent (256 for i386/x86_64, possibly less for ia64/PPC with larger <literal>PAGE_SIZE</literal>), though generally amounts to a total of 1 MB in the RPC.</para>
+      <para><literal>
+          <emphasis role="bold">/proc/fs/lustre/osc/&lt;object name&gt;/max_rpcs_in_flight</emphasis>
+        </literal></para>
+      <para>This tunable is the maximum number of concurrent RPCs in flight from an OSC to its OST. If the OSC tries to initiate an RPC but finds that it already has the same number of RPCs outstanding, it will wait to issue further RPCs until some complete. The minimum setting is 1 and maximum setting is 32. If you are looking to improve small file I/O performance, increase the <literal>max_rpcs_in_flight</literal> value.</para>
+      <para>To maximize performance, the value for <literal>max_dirty_mb</literal> is recommended to be 4 * <literal>max_pages_per_rpc</literal> * <literal>max_rpcs_in_flight</literal>.</para>
+      <note>
+        <para>The <emphasis role="italic">
+            <literal>&lt;object name&gt;</literal>
+          </emphasis> varies depending on the specific Lustre configuration. For <literal>&lt;object name&gt;</literal> examples, refer to the sample command output.</para>
+      </note>
+    </section>
+    <section remap="h3">
+      <title>31.2.2 Watching the Client RPC Stream</title>
+      <para>The same directory contains a <literal>rpc_stats</literal> file with a histogram showing the composition of previous RPCs. The histogram can be cleared by writing any value into the <literal>rpc_stats</literal> file.</para>
+      <screen># cat /proc/fs/lustre/osc/spfs-OST0000-osc-c45f9c00/rpc_stats
+snapshot_time:                                     1174867307.156604 (secs.usecs)
  read RPCs in flight:                               0
  write RPCs in flight:                              0
  pending write pages:                               0
  pending read pages:                                0
                     read                                    write
-pages per rpc              rpcs    %       cum     %       |       rpcs    \
-%       cum     %
-1:                 0       0       0               |       0               \
-0       0
+pages per rpc              rpcs    %       cum     %       |       rpcs    %       cum     %
+1:                 0       0       0               |       0               0       0
   
                     read                                    write
-rpcs in flight             rpcs    %       cum     %       |       rpcs    \
-%       cum     %
-0:                 0       0       0               |       0               \
-0       0
+rpcs in flight             rpcs    %       cum     %       |       rpcs    %       cum     %
+0:                 0       0       0               |       0               0       0
   
                     read                                    write
-offset                     rpcs    %       cum     %       |       rpcs    \
-%       cum     %
-0:                 0       0       0               |       0               \
-0       0
+offset                     rpcs    %       cum     %       |       rpcs    %       cum     %
+0:                 0       0       0               |       0               0       0
  </screen>
-        <para>Where:</para>
-        <informaltable frame="all">
-          <tgroup cols="2">
-            <colspec colname="c1" colwidth="50*"/>
-            <colspec colname="c2" colwidth="50*"/>
-            <thead>
-              <row>
-                <entry><para><emphasis role="bold">Field</emphasis></para></entry>
-                <entry><para><emphasis role="bold">Description</emphasis></para></entry>
-              </row>
-            </thead>
-            <tbody>
-              <row>
-                <entry><para> <emphasis role="bold">{read,write} RPCs in flight</emphasis></para></entry>
-                <entry><para> Number of read/write RPCs issued by the OSC, but not complete at the time of the snapshot. This value should always be less than or equal to max_rpcs_in_flight.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">pending {read,write} pages</emphasis></para></entry>
-                <entry><para> Number of pending read/write pages that have been queued for I/O in the OSC.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">pages per RPC</emphasis></para></entry>
-                <entry><para> When an RPC is sent, the number of pages it consists of is recorded (in order). A single page RPC increments the 0: row.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">RPCs in flight</emphasis></para></entry>
-                <entry><para> When an RPC is sent, the number of other RPCs that are pending is recorded. When the first RPC is sent, the 0: row is incremented. If the first RPC is sent while another is pending, the 1: row is incremented and so on. As each RPC *completes*, the number of pending RPCs is not tabulated.</para><para>This table is a good way to visualize the concurrency of the RPC stream. Ideally, you will see a large clump around the max_rpcs_in_flight value, which shows that the network is being kept busy.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">offset</emphasis></para></entry>
-                <entry><para>  </para></entry>
-              </row>
-            </tbody>
-          </tgroup>
-        </informaltable>
-      </section>
-      <section remap="h3">
-        <title>31.2.3 Client Read-Write <anchor xml:id="dbdoclet.50438271_marker-1290564" xreflabel=""/>Offset Survey</title>
-        <para>The offset_stats parameter maintains statistics for occurrences where a series of read or write calls from a process did not access the next sequential location. The offset field is reset to 0 (zero) whenever a different file is read/written.</para>
-        <para>Read/write offset statistics are off, by default. The statistics can be activated by writing anything into the offset_stats file.</para>
-        <para>Example:</para>
-        <screen># cat /proc/fs/lustre/llite/lustre-f57dee00/rw_offset_stats
+      <para>Where:</para>
+      <informaltable frame="all">
+        <tgroup cols="2">
+          <colspec colname="c1" colwidth="50*"/>
+          <colspec colname="c2" colwidth="50*"/>
+          <thead>
+            <row>
+              <entry>
+                <para><emphasis role="bold">Field</emphasis></para>
+              </entry>
+              <entry>
+                <para><emphasis role="bold">Description</emphasis></para>
+              </entry>
+            </row>
+          </thead>
+          <tbody>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">{read,write} RPCs in flight</emphasis></para>
+              </entry>
+              <entry>
+                <para>Number of read/write RPCs issued by the OSC, but not complete at the time of the snapshot. This value should always be less than or equal to max_rpcs_in_flight.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">pending {read,write} pages</emphasis></para>
+              </entry>
+              <entry>
+                <para>Number of pending read/write pages that have been queued for I/O in the OSC.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">pages per RPC</emphasis></para>
+              </entry>
+              <entry>
+                <para>When an RPC is sent, the number of pages it consists of is recorded (in order). A single page RPC increments the 0: row.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">RPCs in flight</emphasis></para>
+              </entry>
+              <entry>
+                <para>When an RPC is sent, the number of other RPCs that are pending is recorded. When the first RPC is sent, the 0: row is incremented. If the first RPC is sent while another is pending, the 1: row is incremented and so on. As each RPC *completes*, the number of pending RPCs is not tabulated.</para>
+                <para>This table is a good way to visualize the concurrency of the RPC stream. Ideally, you will see a large clump around the max_rpcs_in_flight value, which shows that the network is being kept busy.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">offset</emphasis></para>
+              </entry>
+              <entry>
+                <para> </para>
+              </entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </informaltable>
+    </section>
+    <section remap="h3">
+      <title>31.2.3 Client Read-Write Offset Survey</title>
+      <para>The <literal>rw_offset_stats</literal> parameter maintains statistics for occurrences where a series of read or write calls from a process did not access the next sequential location. The offset field is reset to 0 (zero) whenever a different file is read/written.</para>
+      <para>Read/write offset statistics are off by default. The statistics can be activated by writing anything into the <literal>rw_offset_stats</literal> file.</para>
+      <para>Example:</para>
+      <screen># cat /proc/fs/lustre/llite/lustre-f57dee00/rw_offset_stats
  snapshot_time: 1155748884.591028 (secs.usecs)
-R/W                PID             RANGE START             RANGE END       \
-        SMALLEST EXTENT         LARGEST EXTENT                          OFF\
-SET
-R          8385            0                       128                     \
-128                     128                             0
-R          8385            0                       224                     \
-224                     224                             -128
-W          8385            0                       250                     \
-50                      100                             0
-W          8385            100                     1110                    \
-10                      500                             -150
-W          8384            0                       5233                    \
-5233                    5233                            0
-R          8385            500                     600                     \
-100                     100                             -610
-</screen>
-        <para>Where:</para>
-        <informaltable frame="all">
-          <tgroup cols="2">
-            <colspec colname="c1" colwidth="50*"/>
-            <colspec colname="c2" colwidth="50*"/>
-            <thead>
-              <row>
-                <entry><para><emphasis role="bold">Field</emphasis></para></entry>
-                <entry><para><emphasis role="bold">Description</emphasis></para></entry>
-              </row>
-            </thead>
-            <tbody>
-              <row>
-                <entry><para> <emphasis role="bold">R/W</emphasis></para></entry>
-                <entry><para> Whether the non-sequential call was a read or write</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">PID</emphasis></para></entry>
-                <entry><para> Process ID which made the read/write call.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">Range Start/Range End</emphasis></para></entry>
-                <entry><para> Range in which the read/write calls were sequential.</para><para> </para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">Smallest Extent</emphasis></para></entry>
-                <entry><para> Smallest extent (single read/write) in the corresponding range.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">Largest Extent</emphasis></para></entry>
-                <entry><para> Largest extent (single read/write) in the corresponding range.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">Offset</emphasis></para></entry>
-                <entry><para> Difference from the previous range end to the current range start.</para><para>For example, Smallest-Extent indicates that the writes in the range 100 to 1110 were sequential, with a minimum write of 10 and a maximum write of 500. This range was started with an offset of -150. That means this is the difference between the last entry's range-end and this entry's range-start for the same file.</para><para>The rw_offset_stats file can be cleared by writing to it:</para><screen> 
-echo &gt; /proc/fs/lustre/llite/lustre-f57dee00/rw_offset_stats
-</screen></entry>
-              </row>
-            </tbody>
-          </tgroup>
-        </informaltable>
-      </section>
-      <section remap="h3">
-        <title>31.2.4 Client Read-Write <anchor xml:id="dbdoclet.50438271_marker-1290612" xreflabel=""/>Extents Survey</title>
-        <para><emphasis role="bold">Client-Based I/O Extent Size Survey</emphasis></para>
-        <para>The rw_extent_stats histogram in the llite directory shows you the statistics for the sizes of the read-write I/O extents. This file does not maintain the per-process statistics.</para>
-        <para>Example:</para>
-        <screen>$ cat /proc/fs/lustre/llite/lustre-ee5af200/extents_stats
+R/W                PID             RANGE START             RANGE END               SMALLEST EXTENT         LARGEST EXTENT                          OFFSET
+R          8385            0                       128                     128                     128                             0
+R          8385            0                       224                     224                     224                             -128
+W          8385            0                       250                     50                      100                             0
+W          8385            100                     1110                    10                      500                             -150
+W          8384            0                       5233                    5233                    5233                            0
+R          8385            500                     600                     100                     100                             -610</screen>
+      <para>Where:</para>
+      <informaltable frame="all">
+        <tgroup cols="2">
+          <colspec colname="c1" colwidth="50*"/>
+          <colspec colname="c2" colwidth="50*"/>
+          <thead>
+            <row>
+              <entry>
+                <para><emphasis role="bold">Field</emphasis></para>
+              </entry>
+              <entry>
+                <para><emphasis role="bold">Description</emphasis></para>
+              </entry>
+            </row>
+          </thead>
+          <tbody>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">R/W</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Whether the non-sequential call was a read or a write.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>PID</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Process ID which made the read/write call.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">Range Start/Range End</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Range in which the read/write calls were sequential.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">Smallest Extent</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Smallest extent (single read/write) in the corresponding range.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">Largest Extent</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Largest extent (single read/write) in the corresponding range.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">Offset</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Difference from the previous range end to the current range start.</para>
+                <para>For example, the fourth entry in the output above shows that the writes in the range 100 to 1110 were sequential, with a minimum write of 10 and a maximum write of 500. This range started with an offset of -150, which is the difference between the previous entry&apos;s range-end and this entry&apos;s range-start for the same file.</para>
+                <para>The <literal>rw_offset_stats</literal> file can be cleared by writing to it:</para>
+                <screen>echo &gt; /proc/fs/lustre/llite/lustre-f57dee00/rw_offset_stats</screen>
+              </entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </informaltable>
+    </section>
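The Offset field described above can be illustrated with a minimal Python sketch. The function name `offsets_for_ranges` is hypothetical, and reporting 0 for the first range is an assumption drawn from the sample output, not from the Lustre source.

```python
def offsets_for_ranges(ranges):
    """Return the Offset value for each (range_start, range_end) pair.

    Offset is the current range start minus the previous range end.
    The first range has no predecessor, so 0 is reported (assumption).
    """
    offsets = []
    prev_end = None
    for start, end in ranges:
        offsets.append(0 if prev_end is None else start - prev_end)
        prev_end = end
    return offsets


# Ranges for PID 8385 taken from the sample output: 0-250, 100-1110, 500-600.
sample_offsets = offsets_for_ranges([(0, 250), (100, 1110), (500, 600)])
print(sample_offsets)
```

A negative value means the new sequential range begins before the point where the previous range ended.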
+    <section remap="h3">
+      <title>31.2.4 Client Read-Write Extents Survey</title>
+      <para><emphasis role="bold">Client-Based I/O Extent Size Survey</emphasis></para>
+      <para>The <literal>extents_stats</literal> histogram in the <literal>llite</literal> directory shows statistics for the sizes of read-write I/O extents. This file does not maintain per-process statistics.</para>
+      <para>Example:</para>
+      <screen>$ cat /proc/fs/lustre/llite/lustre-ee5af200/extents_stats
  snapshot_time:                     1213828728.348516 (secs.usecs)
                             read            |               write
-extents                    calls   %       cum%    |       calls   %       \
-cum%
+extents                    calls   %       cum%    |       calls   %       cum%
   
  0K - 4K :          0       0       0       |       2       2       2
  4K - 8K :          0       0       0       |       0       0       2
  8K - 16K :         0       0       0       |       0       0       2
-16K - 32K :                0       0       0       |       20      23      \
-26
-32K - 64K :                0       0       0       |       0       0       \
-26
-64K - 128K :               0       0       0       |       51      60      \
-86
-128K - 256K :              0       0       0       |       0       0       \
-86
-256K - 512K :              0       0       0       |       0       0       \
-86
-512K - 1024K :             0       0       0       |       0       0       \
-86
-1M - 2M :          0       0       0       |       11      13      100
- 
-</screen>
-        <para>The file can be cleared by issuing the following command:</para>
-        <screen>$ echo &gt; cat /proc/fs/lustre/llite/lustre-ee5af200/extents_stats
-</screen>
-        <para><emphasis role="bold">Per-Process Client I/O Statistics</emphasis></para>
-        <para>The extents_stats_per_process file maintains the I/O extent size statistics on a per-process basis. So you can track the per-process statistics for the last MAX_PER_PROCESS_HIST processes.</para>
-        <para>Example:</para>
-        <screen>$ cat /proc/fs/lustre/llite/lustre-ee5af200/extents_stats_per_process
+16K - 32K :                0       0       0       |       20      23      26
+32K - 64K :                0       0       0       |       0       0       26
+64K - 128K :               0       0       0       |       51      60      86
+128K - 256K :              0       0       0       |       0       0       86
+256K - 512K :              0       0       0       |       0       0       86
+512K - 1024K :             0       0       0       |       0       0       86
+1M - 2M :          0       0       0       |       11      13      100</screen>
+      <para>The file can be cleared by issuing the following command:</para>
+      <screen>$ echo &gt; /proc/fs/lustre/llite/lustre-ee5af200/extents_stats</screen>
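The `%` and `cum%` columns of the extents histogram can be reproduced from the `calls` column. The Python sketch below assumes the percentages are truncated to whole numbers; that is inferred from the sample output above, not confirmed against the Lustre source. The function name `histogram_percentages` is hypothetical.

```python
def histogram_percentages(calls):
    """Compute per-bucket and cumulative percentages from call counts.

    Integer (floor) division is used because it matches the sample
    output; this rounding behaviour is an assumption.
    """
    total = sum(calls)
    pct, cum = [], []
    running = 0
    for n in calls:
        running += n
        pct.append(n * 100 // total if total else 0)
        cum.append(running * 100 // total if total else 0)
    return pct, cum


# Write-side call counts from the sample, buckets 0K-4K through 1M-2M.
write_calls = [2, 0, 0, 20, 0, 51, 0, 0, 0, 11]
pct, cum = histogram_percentages(write_calls)
print(pct)
print(cum)
```

Reading the columns this way shows, for instance, that 86% of the writes in the sample were 128K or smaller.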
+      <para><emphasis role="bold">Per-Process Client I/O Statistics</emphasis></para>
+      <para>The <literal>extents_stats_per_process</literal> file maintains the I/O extent size statistics on a per-process basis, so you can track the statistics for the last <literal>MAX_PER_PROCESS_HIST</literal> processes.</para>
+      <para>Example:</para>
+      <screen>$ cat /proc/fs/lustre/llite/lustre-ee5af200/extents_stats_per_process
  snapshot_time:                     1213828762.204440 (secs.usecs)
                             read            |               write
-extents                    calls   %       cum%    |       calls   %       \
-cum%
+extents                    calls   %       cum%    |       calls   %       cum%
   
  PID: 11488
     0K - 4K :       0       0        0      |       0       0       0
@@ -601,23 +785,20 @@ PID: 11429
     0K - 4K :       0       0        0      |       1       100     100
   
  </screen>
-      </section>
-      <section remap="h3">
-        <title>31.2.5 <anchor xml:id="dbdoclet.50438271_55057" xreflabel=""/> Watching the <anchor xml:id="dbdoclet.50438271_marker-1290631" xreflabel=""/>OST Block I/O Stream</title>
-        <para>Similarly, there is a brw_stats histogram in the obdfilter directory which shows you the statistics for number of I/O requests sent to the disk, their size and whether they are contiguous on the disk or not.</para>
-        <screen>cat /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats 
+    </section>
+    <section xml:id="dbdoclet.50438271_55057">
+      <title>31.2.5 Watching the OST Block I/O Stream</title>
+      <para>Similarly, there is a <literal>brw_stats</literal> histogram in the <literal>obdfilter</literal> directory which shows statistics for the number of I/O requests sent to the disk, their size, and whether they are contiguous on the disk.</para>
+      <screen>cat /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats 
  snapshot_time:                     1174875636.764630 (secs:usecs)
                             read                            write
-pages per brw              brws    %       cum %   |       rpcs    %       \
-cum %
+pages per brw              brws    %       cum %   |       rpcs    %       cum %
  1:                 0       0       0       |       0       0       0
                             read                                    write
-discont pages              rpcs    %       cum %   |       rpcs    %       \
-cum %
+discont pages              rpcs    %       cum %   |       rpcs    %       cum %
  1:                 0       0       0       |       0       0       0
                             read                                    write
-discont blocks             rpcs    %       cum %   |       rpcs    %       \
-cum %
+discont blocks             rpcs    %       cum %   |       rpcs    %       cum %
  1:                 0       0       0       |       0       0       0
                             read                                    write
  dio frags          rpcs    %       cum %   |       rpcs    %       cum %
@@ -629,780 +810,1086 @@ disk ios in flight rpcs    %       cum %   |       rpcs    %       cum %
  io time (1/1000s)  rpcs    %       cum %   |       rpcs    %       cum %
  1:                 0       0       0       |       0       0       0
                             read                                    write
-disk io size               rpcs    %       cum %   |       rpcs    %       \
-cum %
+disk io size               rpcs    %       cum %   |       rpcs    %       cum %
  1:                 0       0       0       |       0       0       0
                             read                                    write
  </screen>
-        <para>The fields are explained below:</para>
-        <informaltable frame="all">
-          <tgroup cols="2">
-            <colspec colname="c1" colwidth="50*"/>
-            <colspec colname="c2" colwidth="50*"/>
-            <thead>
-              <row>
-                <entry><para><emphasis role="bold">Field</emphasis></para></entry>
-                <entry><para><emphasis role="bold">Description</emphasis></para></entry>
-              </row>
-            </thead>
-            <tbody>
-              <row>
-                <entry><para> <emphasis role="bold">pages per brw</emphasis></para></entry>
-                <entry><para> Number of pages per RPC request, which should match aggregate client rpc_stats.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">discont pages</emphasis></para></entry>
-                <entry><para> Number of discontinuities in the logical file offset of each page in a single RPC.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">discont blocks</emphasis></para></entry>
-                <entry><para> Number of discontinuities in the physical block allocation in the file system for a single RPC.</para></entry>
-              </row>
-            </tbody>
-          </tgroup>
-        </informaltable>
-        <para>For each Lustre service, the following information is provided:</para>
-        <itemizedlist><listitem>
-            <para> Number of requests</para>
-          </listitem>
-
-<listitem>
-            <para> Request wait time (avg, min, max and std dev)</para>
-          </listitem>
-
-<listitem>
-            <para> Service idle time (% of elapsed time)</para>
-          </listitem>
-
-</itemizedlist>
-        <para>Additionally, data on each Lustre service is provided by service type:</para>
-        <itemizedlist><listitem>
-            <para> Number of requests of this type</para>
-          </listitem>
-
-<listitem>
-            <para> Request service time (avg, min, max and std dev)</para>
-          </listitem>
-
-</itemizedlist>
+      <para>The fields are explained below:</para>
+      <informaltable frame="all">
+        <tgroup cols="2">
+          <colspec colname="c1" colwidth="50*"/>
+          <colspec colname="c2" colwidth="50*"/>
+          <thead>
+            <row>
+              <entry>
+                <para><emphasis role="bold">Field</emphasis></para>
+              </entry>
+              <entry>
+                <para><emphasis role="bold">Description</emphasis></para>
+              </entry>
+            </row>
+          </thead>
+          <tbody>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">pages per brw</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Number of pages per RPC request, which should match aggregate client <literal>rpc_stats</literal>.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">discont pages</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Number of discontinuities in the logical file offset of each page in a single RPC.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">discont blocks</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Number of discontinuities in the physical block allocation in the file system for a single RPC.</para>
+              </entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </informaltable>
+      <para>For each Lustre service, the following information is provided:</para>
+      <itemizedlist>
+        <listitem>
+          <para>Number of requests</para>
+        </listitem>
+        <listitem>
+          <para>Request wait time (avg, min, max and std dev)</para>
+        </listitem>
+        <listitem>
+          <para>Service idle time (% of elapsed time)</para>
+        </listitem>
+      </itemizedlist>
+      <para>Additionally, data on each Lustre service is provided by service type:</para>
+      <itemizedlist>
+        <listitem>
+          <para>Number of requests of this type</para>
+        </listitem>
+        <listitem>
+          <para>Request service time (avg, min, max and std dev)</para>
+        </listitem>
+      </itemizedlist>
+    </section>
+    <section remap="h3">
+      <title>31.2.6 Using File Readahead and Directory Statahead</title>
+      <para>Lustre 1.6.5.1 introduced file readahead and directory statahead functionality that read data into memory in anticipation of a process actually requesting the data. File readahead functionality reads file content data into memory. Directory statahead functionality reads metadata into memory. When readahead and/or statahead work well, a data-consuming process finds that the information it needs is available when requested, and it is unnecessary to wait for network I/O.</para>
+      <section remap="h4">
+        <title>31.2.6.1 Tuning File Readahead</title>
+        <para>File readahead is triggered when two or more sequential reads by an application fail to be satisfied by the Linux buffer cache. The size of the initial readahead is 1 MB. Additional readaheads grow linearly, and increment until the readahead cache on the client is full at 40 MB.</para>
+        <para><literal>
+            <emphasis role="bold">/proc/fs/lustre/llite/&lt;fsname&gt;-&lt;uid&gt;/max_read_ahead_mb</emphasis>
+          </literal></para>
+        <para>This tunable controls the maximum amount of data readahead on a file. Files are read ahead in RPC-sized chunks (1 MB or the size of read() call, if larger) after the second sequential read on a file descriptor. Random reads are done at the size of the read() call only (no readahead). Reads to non-contiguous regions of the file reset the readahead algorithm, and readahead is not triggered again until there are sequential reads again. To disable readahead, set this tunable to 0. The default value is 40 MB.</para>
+        <para><literal>
+            <emphasis role="bold">/proc/fs/lustre/llite/&lt;fsname&gt;-&lt;uid&gt;/max_read_ahead_whole_mb</emphasis>
+          </literal></para>
+        <para>This tunable controls the maximum size of a file that is read in its entirety, regardless of the size of the <literal>read()</literal>.</para>
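The readahead behaviour described above can be modelled with a simplified Python sketch. It is not the Lustre implementation: the function `simulate_readahead`, the 1 MB growth step, and the exact reset rule are assumptions chosen to mirror the prose (readahead starts at the second sequential read, grows linearly, is capped by `max_read_ahead_mb`, and resets on a non-contiguous read).

```python
MB = 1 << 20


def simulate_readahead(reads, max_read_ahead_mb=40, step_mb=1):
    """Return the modelled readahead window (bytes) after each read.

    reads is a list of (offset, length) tuples on one file descriptor.
    """
    max_bytes = max_read_ahead_mb * MB
    windows = []
    window = 0
    sequential = 0
    next_expected = None
    for offset, length in reads:
        if next_expected is not None and offset == next_expected:
            sequential += 1
        else:
            # Non-contiguous read: restart the run and drop the window.
            sequential = 1
            window = 0
        if max_bytes > 0 and sequential >= 2:
            # Linear growth, capped at the tunable's value.
            window = min(window + step_mb * MB, max_bytes)
        next_expected = offset + length
        windows.append(window)
    return windows


reads = [(0, 4096), (4096, 4096), (8192, 4096), (1 << 30, 4096)]
print([w // MB for w in simulate_readahead(reads)])
```

Passing `max_read_ahead_mb=0` keeps the window at zero for every read, which corresponds to disabling readahead.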
        </section>
-      <section remap="h3">
-        <title>31.2.6 Using File <anchor xml:id="dbdoclet.50438271_marker-1294292" xreflabel=""/>Readahead and Directory Statahead</title>
-        <para>Lustre 1.6.5.1 introduced file readahead and directory statahead functionality that read data into memory in anticipation of a process actually requesting the data. File readahead functionality reads file content data into memory. Directory statahead functionality reads metadata into memory. When readahead and/or statahead work well, a data-consuming process finds that the information it needs is available when requested, and it is unnecessary to wait for network I/O.</para>
-        <section remap="h4">
-          <title>31.2.6.1 Tuning <anchor xml:id="dbdoclet.50438271_marker-1295183" xreflabel=""/>File Readahead</title>
-          <para>File readahead is triggered when two or more sequential reads by an application fail to be satisfied by the Linux buffer cache. The size of the initial readahead is 1 MB. Additional readaheads grow linearly, and increment until the readahead cache on the client is full at 40 MB.</para>
-          <para><emphasis role="bold">/proc/fs/lustre/llite/&lt;fsname&gt;-&lt;uid&gt;/max_read_ahead_mb</emphasis></para>
-          <para>This tunable controls the maximum amount of data readahead on a file. Files are read ahead in RPC-sized chunks (1 MB or the size of read() call, if larger) after the second sequential read on a file descriptor. Random reads are done at the size of the read() call only (no readahead). Reads to non-contiguous regions of the file reset the readahead algorithm, and readahead is not triggered again until there are sequential reads again. To disable readahead, set this tunable to 0. The default value is 40 MB.</para>
-          <para><emphasis role="bold">/proc/fs/lustre/llite/&lt;fsname&gt;-&lt;uid&gt;/max_read_ahead_whole_mb</emphasis></para>
-          <para>This tunable controls the maximum size of a file that is read in its entirety, regardless of the size of the read().</para>
-        </section>
-        <section remap="h4">
-          <title>31.2.6.2 Tuning Directory <anchor xml:id="dbdoclet.50438271_marker-1295184" xreflabel=""/>Statahead</title>
-          <para>When the ls -l process opens a directory, its process ID is recorded. When the first directory entry is &apos;&apos;stated&apos;&apos; with this recorded process ID, a statahead thread is triggered which stats ahead all of the directory entries, in order. The ls -l process can use the stated directory entries directly, improving performance.</para>
-          <para>/proc/fs/lustre/llite/*/statahead_max</para>
-          <para>This tunable controls whether directory statahead is enabled and the maximum statahead count. By default, statahead is active.</para>
-          <para>To disable statahead, set this tunable to:</para>
-          <para>echo 0 &gt; /proc/fs/lustre/llite/*/statahead_max</para>
-          <para>To set the maximum statahead count (n), set this tunable to:</para>
-          <screen>echo n &gt; /proc/fs/lustre/llite/*/statahead_max
-</screen>
-          <para>The maximum value of n is 8192.</para>
-          <para>/proc/fs/lustre/llite/*/statahead_status</para>
-          <para>This is a read-only interface that indicates the current statahead status.</para>
-        </section>
+      <section remap="h4">
+        <title>31.2.6.2 Tuning Directory Statahead</title>
+        <para>When the <literal>ls -l</literal> process opens a directory, its process ID is recorded. When the first directory entry is &apos;&apos;stated&apos;&apos; with this recorded process ID, a statahead thread is triggered which stats ahead all of the directory entries, in order. The <literal>ls -l</literal> process can use the stated directory entries directly, improving performance.</para>
+        <para><literal>
+            <emphasis role="bold">/proc/fs/lustre/llite/*/statahead_max</emphasis>
+          </literal></para>
+        <para>This tunable controls whether directory <literal>statahead</literal> is enabled and the maximum statahead count. By default, statahead is active.</para>
+        <para>To disable statahead, set this tunable to:</para>
+        <screen>echo 0 &gt; /proc/fs/lustre/llite/*/statahead_max</screen>
+        <para>To set the maximum statahead count (n), set this tunable to:</para>
+        <screen>echo n &gt; /proc/fs/lustre/llite/*/statahead_max</screen>
+        <para>The maximum value of n is 8192.</para>
+        <para><literal>
+            <emphasis role="bold">/proc/fs/lustre/llite/*/statahead_status</emphasis>
+          </literal></para>
+        <para>This is a read-only interface that indicates the current statahead status.</para>
        </section>
-      <section remap="h3">
-        <title>31.2.7 OSS <anchor xml:id="dbdoclet.50438271_marker-1296183" xreflabel=""/>Read Cache</title>
-        <para>The OSS read cache feature provides read-only caching of data on an OSS. This functionality uses the regular Linux page cache to store the data. Just like caching from a regular filesytem in Linux, OSS read cache uses as much physical memory as is allocated.</para>
-        <para>OSS read cache improves Lustre performance in these situations:</para>
-        <itemizedlist><listitem>
-            <para> Many clients are accessing the same data set (as in HPC applications and when diskless clients boot from Lustre)</para>
-          </listitem>
-
-<listitem>
-            <para>  One client is storing data while another client is reading it (essentially exchanging data via the OST)</para>
-          </listitem>
-
-<listitem>
-            <para> A client has very limited caching of its own</para>
-          </listitem>
-
-</itemizedlist>
-        <para>OSS read cache offers these benefits:</para>
-        <itemizedlist><listitem>
-            <para> Allows OSTs to cache read data more frequently</para>
+    </section>
+    <section remap="h3">
+      <title>31.2.7 OSS Read Cache</title>
+      <para>The OSS read cache feature provides read-only caching of data on an OSS. This functionality uses the regular Linux page cache to store the data. Just like caching from a regular filesystem in Linux, OSS read cache uses as much physical memory as is allocated.</para>
+      <para>OSS read cache improves Lustre performance in these situations:</para>
+      <itemizedlist>
+        <listitem>
+          <para>Many clients are accessing the same data set (as in HPC applications and when diskless clients boot from Lustre)</para>
+        </listitem>
+        <listitem>
+          <para>One client is storing data while another client is reading it (essentially exchanging data via the OST)</para>
+        </listitem>
+        <listitem>
+          <para>A client has very limited caching of its own</para>
+        </listitem>
+      </itemizedlist>
+      <para>OSS read cache offers these benefits:</para>
+      <itemizedlist>
+        <listitem>
+          <para>Allows OSTs to cache read data more frequently</para>
+        </listitem>
+        <listitem>
+          <para>Improves repeated reads to match network speeds instead of disk speeds</para>
+        </listitem>
+        <listitem>
+          <para>Provides the building blocks for OST write cache (small-write aggregation)</para>
+        </listitem>
+      </itemizedlist>
+      <section remap="h4">
+        <title>31.2.7.1 Using OSS Read Cache</title>
+        <para>OSS read cache is implemented on the OSS, and does not require any special support on the client side. Since OSS read cache uses the memory available in the Linux page cache, you should use I/O patterns to determine the appropriate amount of memory for the cache; if the data is mostly reads, then more cache is required than for writes.</para>
+        <para>OSS read cache is enabled, by default, and managed by the following tunables:</para>
+        <itemizedlist>
+          <listitem>
+            <para><literal>read_cache_enable</literal>  controls whether data read from disk during a read request is kept in memory and available for later read requests for the same data, without having to re-read it from disk. By default, read cache is enabled (<literal>read_cache_enable = 1</literal>).</para>
            </listitem>
-
-<listitem>
-            <para> Improves repeated reads to match network speeds instead of disk speeds</para>
+        </itemizedlist>
+        <para>When the OSS receives a read request from a client, it reads data from disk into its memory and sends the data as a reply to the request. If read cache is enabled, this data stays in memory after the client&apos;s request is finished, and the OSS skips reading data from disk when subsequent read requests for the same data are received. The read cache is managed by the Linux kernel globally across all OSTs on that OSS, and the least recently used cache pages are dropped from memory when the amount of free memory is running low.</para>
+        <para>If read cache is disabled (<literal>read_cache_enable = 0</literal>), then the OSS will discard the data after the client&apos;s read requests are serviced and, for subsequent read requests, the OSS must read the data from disk.</para>
+        <para>To disable read cache on all OSTs of an OSS, run:</para>
+        <screen>root@oss1# lctl set_param obdfilter.*.read_cache_enable=0</screen>
+        <para>To re-enable read cache on one OST, run:</para>
+        <screen>root@oss1# lctl set_param obdfilter.{OST_name}.read_cache_enable=1</screen>
+        <para>To check if read cache is enabled on all OSTs on an OSS, run:</para>
+        <screen>root@oss1# lctl get_param obdfilter.*.read_cache_enable</screen>
+        <itemizedlist>
+          <listitem>
+            <para><literal>writethrough_cache_enable</literal>  controls whether data sent to the OSS as a write request is kept in the read cache and available for later reads, or if it is discarded from cache when the write is completed. By default, writethrough cache is enabled (<literal>writethrough_cache_enable = 1</literal>).</para>
            </listitem>
-
-<listitem>
-            <para> Provides the building blocks for OST write cache (small-write aggregation)</para>
+        </itemizedlist>
+        <para>When the OSS receives write requests from a client, it receives data from the client into its memory and writes the data to disk. If writethrough cache is enabled, this data stays in memory after the write request is completed, allowing the OSS to skip reading this data from disk if a later read request, or partial-page write request, for the same data is received.</para>
+        <para>If writethrough cache is disabled (<literal>writethrough_cache_enable = 0</literal>), then the OSS discards the data after the client&apos;s write request is completed, and for subsequent read requests, or partial-page write requests, the OSS must re-read the data from disk.</para>
+        <para>Enabling writethrough cache is advisable if clients are doing small or unaligned writes that would cause partial-page updates, or if the files written by one node are immediately being accessed by other nodes. Some examples where this might be useful include producer-consumer I/O models or shared-file writes with a different node doing I/O not aligned on 4096-byte boundaries. Disabling writethrough cache is advisable in the case where files are mostly written to the file system but are not re-read within a short time period, or files are only written and re-read by the same node, regardless of whether the I/O is aligned or not.</para>
+        <para>To disable writethrough cache on all OSTs of an OSS, run:</para>
+        <screen>root@oss1# lctl set_param obdfilter.*.writethrough_cache_enable=0</screen>
+        <para>To re-enable writethrough cache on one OST, run:</para>
+        <screen>root@oss1# lctl set_param \
+obdfilter.{OST_name}.writethrough_cache_enable=1</screen>
+        <para>To check if writethrough cache is enabled on all OSTs of an OSS, run:</para>
+        <screen>root@oss1# lctl get_param obdfilter.*.writethrough_cache_enable</screen>
+        <itemizedlist>
+          <listitem>
+            <para><literal>readcache_max_filesize</literal>  controls the maximum size of a file that both the read cache and writethrough cache will try to keep in memory. Files larger than <literal>readcache_max_filesize</literal> will not be kept in cache for either reads or writes.</para>
            </listitem>
-
-</itemizedlist>
-        <section remap="h4">
-          <title>31.2.7.1 Using OSS Read Cache</title>
-          <para>OSS read cache is implemented on the OSS, and does not require any special support on the client side. Since OSS read cache uses the memory available in the Linux page cache, you should use I/O patterns to determine the appropriate amount of memory for the cache; if the data is mostly reads, then more cache is required than for writes.</para>
-          <para>OSS read cache is enabled, by default, and managed by the following tunables:</para>
-          <itemizedlist><listitem>
-              <para>read_cache_enable  controls whether data read from disk during a read request is kept in memory and available for later read requests for the same data, without having to re-read it from disk. By default, read cache is enabled (read_cache_enable = 1).</para>
-            </listitem>
-
-</itemizedlist>
-          <para>When the OSS receives a read request from a client, it reads data from disk into its memory and sends the data as a reply to the requests. If read cache is enabled, this data stays in memory after the client's request is finished, and the OSS skips reading data from disk when subsequent read requests for the same are received. The read cache is managed by the Linux kernel globally across all OSTs on that OSS, and the least recently used cache pages will be dropped from memory when the amount of free memory is running low.</para>
-          <para>If read cache is disabled (read_cache_enable = 0), then the OSS will discard the data after the client's read requests are serviced and, for subsequent read requests, the OSS must read the data from disk.</para>
-          <para>To disable read cache on all OSTs of an OSS, run:</para>
-          <screen>root@oss1# lctl set_param obdfilter.*.read_cache_enable=0
-</screen>
-          <para>To re-enable read cache on one OST, run:</para>
-          <screen>root@oss1# lctl set_param obdfilter.{OST_name}.read_cache_enable=1
-</screen>
-          <para>To check if read cache is enabled on all OSTs on an OSS, run:</para>
-          <screen>root@oss1# lctl get_param obdfilter.*.read_cache_enable
-</screen>
-          <itemizedlist><listitem>
-              <para>writethrough_cache_enable  controls whether data sent to the OSS as a write request is kept in the read cache and available for later reads, or if it is discarded from cache when the write is completed. By default, writethrough cache is enabled (writethrough_cache_enable = 1).</para>
-            </listitem>
-
-</itemizedlist>
-          <para>When the OSS receives write requests from a client, it receives data from the client into its memory and writes the data to disk. If writethrough cache is enabled, this data stays in memory after the write request is completed, allowing the OSS to skip reading this data from disk if a later read request, or partial-page write request, for the same data is received.</para>
-          <para>If writethrough cache is disabled (writethrough_cache_enabled = 0), then the OSS discards the data after the client's write request is completed, and for subsequent read request, or partial-page write request, the OSS must re-read the data from disk.</para>
-          <para>Enabling writethrough cache is advisable if clients are doing small or unaligned writes that would cause partial-page updates, or if the files written by one node are immediately being accessed by other nodes. Some examples where this might be useful include producer-consumer I/O models or shared-file writes with a different node doing I/O not aligned on 4096-byte boundaries. Disabling writethrough cache is advisable in the case where files are mostly written to the file system but are not re-read within a short time period, or files are only written and re-read by the same node, regardless of whether the I/O is aligned or not.</para>
-          <para>To disable writethrough cache on all OSTs of an OSS, run:</para>
-          <screen>root@oss1# lctl set_param obdfilter.*.writethrough_cache_enable=0
-</screen>
-          <para>To re-enable writethrough cache on one OST, run:</para>
-          <screen>root@oss1# lctl set_param \obdfilter.{OST_name}.writethrough_cache_enable=1
-</screen>
-          <para>To check if writethrough cache is</para>
-          <screen>root@oss1# lctl set_param obdfilter.*.writethrough_cache_enable=1
-</screen>
-          <itemizedlist><listitem>
-              <para>readcache_max_filesize  controls the maximum size of a file that both the read cache and writethrough cache will try to keep in memory. Files larger than readcache_max_filesize will not be kept in cache for either reads or writes.</para>
-            </listitem>
-
-</itemizedlist>
-          <para>This can be very useful for workloads where relatively small files are repeatedly accessed by many clients, such as job startup files, executables, log files, etc., but large files are read or written only once. By not putting the larger files into the cache, it is much more likely that more of the smaller files will remain in cache for a longer time.</para>
-          <para>When setting readcache_max_filesize, the input value can be specified in bytes, or can have a suffix to indicate other binary units such as <emphasis role="bold">K</emphasis>ilobytes, <emphasis role="bold">M</emphasis>egabytes, <emphasis role="bold">G</emphasis>igabytes, <emphasis role="bold">T</emphasis>erabytes, or <emphasis role="bold">P</emphasis>etabytes.</para>
-          <para>To limit the maximum cached file size to 32MB on all OSTs of an OSS, run:</para>
-          <screen>root@oss1# lctl set_param obdfilter.*.readcache_max_filesize=32M
-</screen>
-          <para>To disable the maximum cached file size on an OST, run:</para>
-          <screen>root@oss1# lctl set_param \obdfilter.{OST_name}.readcache_max_filesize=-1
-</screen>
-          <para>To check the current maximum cached file size on all OSTs of an OSS, run:</para>
-          <screen>root@oss1# lctl get_param obdfilter.*.readcache_max_filesize
-</screen>
-        </section>
+        </itemizedlist>
+        <para>This can be very useful for workloads where relatively small files are repeatedly accessed by many clients, such as job startup files, executables, log files, etc., but large files are read or written only once. By not putting the larger files into the cache, it is much more likely that more of the smaller files will remain in cache for a longer time.</para>
+        <para>When setting <literal>readcache_max_filesize</literal>, the input value can be specified in bytes, or can have a suffix to indicate other binary units such as <emphasis role="bold">K</emphasis>ilobytes, <emphasis role="bold">M</emphasis>egabytes, <emphasis role="bold">G</emphasis>igabytes, <emphasis role="bold">T</emphasis>erabytes, or <emphasis role="bold">P</emphasis>etabytes.</para>
+        <para>To limit the maximum cached file size to 32MB on all OSTs of an OSS, run:</para>
+        <screen>root@oss1# lctl set_param obdfilter.*.readcache_max_filesize=32M</screen>
+        <para>To disable the maximum cached file size on an OST, run:</para>
+        <screen>root@oss1# lctl set_param \
+obdfilter.{OST_name}.readcache_max_filesize=-1</screen>
+        <para>To check the current maximum cached file size on all OSTs of an OSS, run:</para>
+        <screen>root@oss1# lctl get_param obdfilter.*.readcache_max_filesize</screen>
        </section>
-      <section remap="h3">
-        <title>31.2.8 OSS Asynchronous Journal Commit</title>
-        <para>The OSS asynchronous journal commit feature synchronously writes data to disk without forcing a journal flush. This reduces the number of seeks and significantly improves performance on some hardware.</para>
-                <note><para>Asynchronous journal commit cannot work with O_DIRECT writes, a journal flush is still forced.</para></note>
-        <para>When asynchronous journal commit is enabled, client nodes keep data in the page cache (a page reference). Lustre clients monitor the last committed transaction number (transno) in messages sent from the OSS to the clients. When a client sees that the last committed transno reported by the OSS is &gt;=bulk write transno, it releases the reference on the corresponding pages. To avoid page references being held for too long on clients after a bulk write, a 7 second ping request is scheduled (jbd commit time is 5 seconds) after the bulk write reply is received, so the OSS has an opportunity to report the last committed transno.</para>
-        <para>If the OSS crashes before the journal commit occurs, then the intermediate data is lost. However, new OSS recovery functionality (introduced in the asynchronous journal commit feature), causes clients to replay their write requests and compensate for the missing disk updates by restoring the state of the file system.</para>
-        <para>To enable asynchronous journal commit, set the sync_journal parameter to zero (sync_journal=0):</para>
-        <screen>$ lctl set_param obdfilter.*.sync_journal=0 
-obdfilter.lol-OST0001.sync_journal=0
-</screen>
-        <para>By default, sync_journal is disabled (sync_journal=1), which forces a journal flush after every bulk write.</para>
-        <para>When asynchronous journal commit is used, clients keep a page reference until the journal transaction commits. This can cause problems when a client receives a blocking callback, because pages need to be removed from the page cache, but they cannot be removed because of the extra page reference.</para>
-        <para>This problem is solved by forcing a journal flush on lock cancellation. When this happens, the client is granted the metadata blocks that have hit the disk, and it can safely release the page reference before processing the blocking callback. The parameter which controls this action is sync_on_lock_cancel, which can be set to the following values:</para>
-        <para>always: Always force a journal flush on lock cancellation</para>
-        <para>blocking: Force a journal flush only when the local cancellation is due to a blocking callback</para>
-        <para>never: Do not force any journal flush</para>
-        <para>Here is an example of sync_on_lock_cancel being set not to force a journal flush:</para>
-        <screen>$ lctl get_param obdfilter.*.sync_on_lock_cancel
-obdfilter.lol-OST0001.sync_on_lock_cancel=never
+    </section>
+    <section remap="h3">
+      <title>31.2.8 OSS Asynchronous Journal Commit</title>
+      <para>The OSS asynchronous journal commit feature synchronously writes data to disk without forcing a journal flush. This reduces the number of seeks and significantly improves performance on some hardware.</para>
+      <note>
+        <para>Asynchronous journal commit cannot work with O_DIRECT writes; for these writes, a journal flush is still forced.</para>
+      </note>
+      <para>When asynchronous journal commit is enabled, client nodes keep data in the page cache (a page reference). Lustre clients monitor the last committed transaction number (transno) in messages sent from the OSS to the clients. When a client sees that the last committed transno reported by the OSS is greater than or equal to the bulk write transno, it releases the reference on the corresponding pages. To avoid page references being held for too long on clients after a bulk write, a 7-second ping request is scheduled (the jbd commit time is 5 seconds) after the bulk write reply is received, so the OSS has an opportunity to report the last committed transno.</para>
+      <para>If the OSS crashes before the journal commit occurs, then the intermediate data is lost. However, new OSS recovery functionality (introduced in the asynchronous journal commit feature) causes clients to replay their write requests and compensate for the missing disk updates by restoring the state of the file system.</para>
+      <para>To enable asynchronous journal commit, set the <literal>sync_journal</literal> parameter to zero (<literal>sync_journal=0</literal>):</para>
+      <screen>$ lctl set_param obdfilter.*.sync_journal=0 
+obdfilter.lol-OST0001.sync_journal=0</screen>
+      <para>By default, asynchronous journal commit is disabled (<literal>sync_journal=1</literal>), which forces a journal flush after every bulk write.</para>
+      <para>When asynchronous journal commit is used, clients keep a page reference until the journal transaction commits. This can cause problems when a client receives a blocking callback, because pages need to be removed from the page cache, but they cannot be removed because of the extra page reference.</para>
+      <para>This problem is solved by forcing a journal flush on lock cancellation. When this happens, the client is granted the metadata blocks that have hit the disk, and it can safely release the page reference before processing the blocking callback. The parameter which controls this action is <literal>sync_on_lock_cancel</literal>, which can be set to the following values:</para>
+      <itemizedlist>
+        <listitem>
+          <para><literal>always</literal>: Always force a journal flush on lock cancellation</para>
+        </listitem>
+        <listitem>
+          <para><literal>blocking</literal>: Force a journal flush only when the local cancellation is due to a blocking callback</para>
+        </listitem>
+        <listitem>
+          <para><literal>never</literal>: Do not force any journal flush</para>
+        </listitem>
+      </itemizedlist>
+      <para>Here is an example of <literal>sync_on_lock_cancel</literal> being set not to force a journal flush:</para>
+      <screen>$ lctl get_param obdfilter.*.sync_on_lock_cancel
+obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>
+      <para>By default, <literal>sync_on_lock_cancel</literal> is set to <literal>never</literal>, because asynchronous journal commit is disabled by default.</para>
+      <para>When asynchronous journal commit is enabled (<literal>sync_journal=0</literal>), <literal>sync_on_lock_cancel</literal> is automatically set to <literal>always</literal> if it was previously set to <literal>never</literal>.</para>
+      <para>Similarly, when asynchronous journal commit is disabled (<literal>sync_journal=1</literal>), <literal>sync_on_lock_cancel</literal> is forced to <literal>never</literal>.</para>
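+      <para>The following sketch ties these settings together, using only the parameters described above. It enables asynchronous journal commit on all OSTs of an OSS, checks the resulting <literal>sync_on_lock_cancel</literal> value, and then limits forced flushes to blocking callbacks. Whether <literal>blocking</literal> suits a given workload is an assumption to verify on your own system:</para>
+      <screen>$ lctl set_param obdfilter.*.sync_journal=0
+$ lctl get_param obdfilter.*.sync_on_lock_cancel
+$ lctl set_param obdfilter.*.sync_on_lock_cancel=blocking</screen>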
+    </section>
+    <section remap="h3">
+      <title>31.2.9 <literal>mballoc</literal> History</title>
+      <para><literal>
+          <emphasis role="bold">/proc/fs/ldiskfs/sda/mb_history</emphasis>
+        </literal></para>
+      <para>Multi-Block-Allocate (<literal>mballoc</literal>) enables Lustre to ask <literal>ldiskfs</literal> to allocate multiple blocks with a single request to the block allocator. Typically, an <literal>ldiskfs</literal> file system allocates only one block at a time. Each <literal>mballoc</literal>-enabled partition has this file. This is sample output:</para>
+      <screen>pid  inode   goal            result          found   grps    cr      \   merge   tail    broken
+2838       139267  17/12288/1      17/12288/1      1       0       0       \   M       1       8192
+2838       139267  17/12289/1      17/12289/1      1       0       0       \   M       0       0
+2838       139267  17/12290/1      17/12290/1      1       0       0       \   M       1       2
+2838       24577   3/12288/1       3/12288/1       1       0       0       \   M       1       8192
+2838       24578   3/12288/1       3/771/1         1       1       1       \           0       0
+2838       32769   4/12288/1       4/12288/1       1       0       0       \   M       1       8192
+2838       32770   4/12288/1       4/12289/1       13      1       1       \           0       0
+2838       32771   4/12288/1       5/771/1         26      2       1       \           0       0
+2838       32772   4/12288/1       5/896/1         31      2       1       \           1       128
+2838       32773   4/12288/1       5/897/1         31      2       1       \           0       0
+2828       32774   4/12288/1       5/898/1         31      2       1       \           1       2
+2838       32775   4/12288/1       5/899/1         31      2       1       \           0       0
+2838       32776   4/12288/1       5/900/1         31      2       1       \           1       4
+2838       32777   4/12288/1       5/901/1         31      2       1       \           0       0
+2838       32778   4/12288/1       5/902/1         31      2       1       \           1       2</screen>
+      <para>The parameters are described below:</para>
+      <informaltable frame="all">
+        <tgroup cols="2">
+          <colspec colname="c1" colwidth="50*"/>
+          <colspec colname="c2" colwidth="50*"/>
+          <thead>
+            <row>
+              <entry>
+                <para><emphasis role="bold">Parameter</emphasis></para>
+              </entry>
+              <entry>
+                <para><emphasis role="bold">Description</emphasis></para>
+              </entry>
+            </row>
+          </thead>
+          <tbody>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>pid</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Process that made the allocation.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>inode</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Number of the inode for which blocks were allocated.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>goal</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Initial request that came to <literal>mballoc</literal> (group/block-in-group/number-of-blocks)</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>result</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>What <literal>mballoc</literal> actually found for this request.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>found</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Number of free chunks <literal>mballoc</literal> found and measured before the final decision.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>grps</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Number of groups <literal>mballoc</literal> scanned to satisfy the request.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>cr</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Stage at which <literal>mballoc</literal> found the result:</para>
+                <para><emphasis role="bold">0</emphasis> - best in terms of resource allocation. The request was 1MB or larger and was satisfied directly via the kernel buddy allocator.</para>
+                <para><emphasis role="bold">1</emphasis> - regular stage (good at resource consumption)</para>
+                <para><emphasis role="bold">2</emphasis> - fs is quite fragmented (not that bad at resource consumption)</para>
+                <para><emphasis role="bold">3</emphasis> - fs is very fragmented (worst at resource consumption)</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>queue</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Total bytes in active/queued sends.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>merge</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Whether the request hit the goal. This is good because the extents code can then merge the new blocks into an existing extent, eliminating the need for extents tree growth.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>tail</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Number of blocks left free after the allocation breaks large free chunks.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>broken</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>How large the broken chunk was.</para>
+              </entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </informaltable>
+      <para>Most customers are probably interested in <literal>found</literal> and <literal>cr</literal>. If <literal>cr</literal> is 0 or 1 and <literal>found</literal> is less than 100, then <literal>mballoc</literal> is doing quite well.</para>
+      <para>Also, number-of-blocks-in-request (the third number in the goal triple) shows the number of blocks requested by the <literal>obdfilter</literal>. If the <literal>obdfilter</literal> is making a lot of small requests (just a few blocks), then either the client is processing input/output to a lot of small files, or something may be wrong with the client (because it is better if the client sends large input/output requests). This can be investigated with the OSC <literal>rpc_stats</literal> or OST <literal>brw_stats</literal> mentioned above.</para>
+      <para>Number of groups scanned (<literal>grps</literal> column) should be small. If it reaches a few dozen often, then either your disk file system is pretty fragmented or <literal>mballoc</literal> is doing something wrong in the group selection part.</para>
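+      <para>As a minimal sketch of this kind of check, the following <literal>awk</literal> command counts the history entries whose <literal>cr</literal> is at most 1 and whose <literal>found</literal> is below 100. It assumes the column layout of the sample output above (<literal>found</literal> in column 5, <literal>cr</literal> in column 7) and the example device <literal>sda</literal>; adjust both for your system:</para>
+      <screen>$ awk &apos;$1 ~ /^[0-9]+$/ { total++; if ($7 &lt;= 1 &amp;&amp; $5 &lt; 100) good++ }
+END { if (total) printf &quot;%d of %d allocations look healthy\n&quot;, good, total }&apos; \
+/proc/fs/ldiskfs/sda/mb_history</screen>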
+    </section>
+    <section remap="h3">
+      <title>31.2.10 <literal>mballoc3</literal> Tunables</title>
+      <para>Lustre version 1.6.1 and later includes <literal>mballoc3</literal>, which was built on top of <literal>mballoc2</literal>. By default, <literal>mballoc3</literal> is enabled and adds these features:</para>
+      <itemizedlist>
+        <listitem>
+          <para> Pre-allocation for single files (helps to resist fragmentation)</para>
+        </listitem>
+        <listitem>
+          <para> Pre-allocation for a group of files (helps to pack small files into large, contiguous chunks)</para>
+        </listitem>
+        <listitem>
+          <para> Stream allocation (helps to decrease the seek rate)</para>
+        </listitem>
+      </itemizedlist>
+      <para>The following <literal>mballoc3</literal> tunables are available:</para>
+      <informaltable frame="all">
+        <tgroup cols="2">
+          <colspec colname="c1" colwidth="50*"/>
+          <colspec colname="c2" colwidth="50*"/>
+          <thead>
+            <row>
+              <entry>
+                <para><emphasis role="bold">Field</emphasis></para>
+              </entry>
+              <entry>
+                <para><emphasis role="bold">Description</emphasis></para>
+              </entry>
+            </row>
+          </thead>
+          <tbody>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>stats</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Enables/disables the collection of statistics. Collected statistics can be found in <literal>/proc/fs/ldiskfs2/&lt;dev&gt;/mb_history</literal>.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">max_to_scan</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Maximum number of free chunks that <literal>mballoc</literal> finds before a final decision to avoid livelock.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">min_to_scan</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Minimum number of free chunks that <literal>mballoc</literal> finds before a final decision. This is useful for a very small request, to resist fragmentation of big free chunks.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">order2_req</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>For requests equal to 2^N (where N &gt;= <literal>order2_req</literal>), a very fast search via buddy structures is used.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">stream_req</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Requests smaller than or equal to this value are packed together to form large write I/Os.</para>
+              </entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </informaltable>
+      <para>The following tunables, providing more control over allocation policy, will be available in the next version:</para>
+      <informaltable frame="all">
+        <tgroup cols="2">
+          <colspec colname="c1" colwidth="50*"/>
+          <colspec colname="c2" colwidth="50*"/>
+          <thead>
+            <row>
+              <entry>
+                <para><emphasis role="bold">Field</emphasis></para>
+              </entry>
+              <entry>
+                <para><emphasis role="bold">Description</emphasis></para>
+              </entry>
+            </row>
+          </thead>
+          <tbody>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">
+                    <literal>stats</literal>
+                  </emphasis></para>
+              </entry>
+              <entry>
+                <para>Enables/disables the collection of statistics. Collected statistics can be found in <literal>/proc/fs/ldiskfs2/&lt;dev&gt;/mb_history</literal>.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">max_to_scan</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Maximum number of free chunks that <literal>mballoc</literal> finds before a final decision to avoid livelock.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">min_to_scan</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>Minimum number of free chunks that <literal>mballoc</literal> finds before a final decision. This is useful for a very small request, to resist fragmentation of big free chunks.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">order2_req</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>For requests equal to 2^N (where N &gt;= <literal>order2_req</literal>), a very fast search via buddy structures is used.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">small_req</emphasis>
+                  </literal></para>
+              </entry>
+              <entry morerows="1">
+                <para>All requests are divided into 3 categories:</para>
+                <para>&lt; small_req (packed together to form large, aggregated requests)</para>
+                <para>&lt; large_req (allocated mostly linearly)</para>
+                <para>&gt; large_req (very large requests so the arm seek does not matter)</para>
+                <para>The idea is that we try to pack small requests to form large requests, and then place all large requests (including those compounded from the small ones) close to one another, causing as few arm seeks as possible.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">large_req</emphasis>
+                  </literal></para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">prealloc_table</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>The amount of space to preallocate depends on the current file size. The idea is that for small files we do not need 1 MB preallocations and for large files, 1 MB preallocations are not large enough; it is better to preallocate 4 MB.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> <literal>
+                    <emphasis role="bold">group_prealloc</emphasis>
+                  </literal></para>
+              </entry>
+              <entry>
+                <para>The amount of space preallocated for small requests to be grouped.</para>
+              </entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </informaltable>
+    </section>
+    <section remap="h3">
+      <title>31.2.11 Locking</title>
+      <para><literal>
+          <emphasis role="bold">/proc/fs/lustre/ldlm/namespaces/&lt;OSC name|MDC name&gt;/lru_size</emphasis>
+        </literal></para>
+      <para>The <literal>lru_size</literal> parameter is used to control the number of client-side locks in an LRU queue. LRU size is dynamic, based on load. This optimizes the number of locks available to nodes that have different workloads (e.g., login/build nodes vs. compute nodes vs. backup nodes).</para>
+      <para>The total number of locks available is a function of the server&apos;s RAM. The default limit is 50 locks/1 MB of RAM. If there is too much memory pressure, then the LRU size is shrunk. The number of locks on the server is limited to {number of OST/MDT on node} * {number of clients} * {client lru_size}.</para>
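+      <para>As a rough illustration of this limit, assume a hypothetical OSS with 4 OSTs, 1000 clients, and a client <literal>lru_size</literal> of 400 (all three numbers are made up for the example). The upper bound on the number of locks for that server works out as follows:</para>
+      <screen>$ echo $((4 * 1000 * 400))
+1600000</screen>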
+      <itemizedlist>
+        <listitem>
+          <para>To enable automatic LRU sizing, set the <literal>lru_size</literal> parameter to 0. In this case, the <literal>lru_size</literal> parameter shows the current number of locks being used on the export. (In Lustre 1.6.5.1 and later, LRU sizing is enabled by default.)</para>
+        </listitem>
+        <listitem>
+          <para>To specify a maximum number of locks, set the <literal>lru_size</literal> parameter to a value &gt; 0 (the formerly used value of 100 * NR_CPU is acceptable). We recommend that you only increase the LRU size on a few login nodes where users access the file system interactively.</para>
+        </listitem>
+      </itemizedlist>
+      <para>To clear the LRU on a single client, and as a result flush client cache, without changing the <literal>lru_size</literal> value:</para>
+      <screen>$ lctl set_param ldlm.namespaces.&lt;osc_name|mdc_name&gt;.lru_size=clear</screen>
+      <para>If you shrink the LRU size below the number of existing unused locks, then the unused locks are canceled immediately. Use echo clear to cancel all locks without changing the value.</para>
+      <note>
+        <para>Currently, the <literal>lru_size</literal> parameter can only be set temporarily with <literal>lctl set_param</literal>; it cannot be set permanently.</para>
+      </note>
+      <para>To disable LRU sizing, run this command on the Lustre clients:</para>
+      <screen>$ lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))</screen>
+      <para>Replace <literal>NR_CPU</literal> with the number of CPUs on the node.</para>
+      <para>To determine the number of locks being granted:</para>
+      <screen>$ lctl get_param ldlm.namespaces.*.pool.limit</screen>
+    </section>
+    <section xml:id="dbdoclet.50438271_87260">
+      <title>31.2.12 Setting MDS and OSS Thread Counts</title>
+      <para>MDS and OSS thread counts (minimum and maximum) can be set via the <literal>{min,max}_thread_count</literal> tunable. For each service, a new <literal>/proc/fs/lustre/{service}/*/threads_{min,max,started}</literal> entry is created. The tunable, <literal>{service}.threads_{min,max,started}</literal>, can be used to set the minimum and maximum thread counts or get the current number of running threads for the following services.</para>
+      <informaltable frame="all">
+        <tgroup cols="2">
+          <colspec colname="c1" colwidth="50*"/>
+          <colspec colname="c2" colwidth="50*"/>
+          <tbody>
+            <row>
+              <entry>
+                <para> <emphasis role="bold">Service</emphasis></para>
+              </entry>
+              <entry>
+                <para> <emphasis role="bold">Description</emphasis></para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para><literal>mdt.MDS.mds</literal></para>
+              </entry>
+              <entry>
+                <para>normal metadata ops</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para><literal>mdt.MDS.mds_readpage</literal></para>
+              </entry>
+              <entry>
+                <para>metadata readdir</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para><literal>mdt.MDS.mds_setattr</literal></para>
+              </entry>
+              <entry>
+                <para>metadata setattr</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para><literal>ost.OSS.ost</literal></para>
+              </entry>
+              <entry>
+                <para>normal data</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para><literal>ost.OSS.ost_io</literal></para>
+              </entry>
+              <entry>
+                <para>bulk data IO</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para><literal>ost.OSS.ost_create</literal></para>
+              </entry>
+              <entry>
+                <para>OST object pre-creation service</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para><literal>ldlm.services.ldlm_canceld</literal></para>
+              </entry>
+              <entry>
+                <para>DLM lock cancel</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para><literal>ldlm.services.ldlm_cbd</literal></para>
+              </entry>
+              <entry>
+                <para>DLM lock grant</para>
+              </entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </informaltable>
+      <itemizedlist>
+        <listitem>
+          <para>To temporarily set this tunable, run:</para>
+          <screen># lctl {get,set}_param {service}.threads_{min,max,started} </screen>
+        </listitem>
+      </itemizedlist>
+      <itemizedlist>
+        <listitem>
+          <para>To permanently set this tunable, run:</para>
+          <screen># lctl conf_param {service}.threads_{min,max,started} </screen>
+          <para>The following examples show how to set thread counts and get the number of running threads for the ost_io service.</para>
+        </listitem>
+      </itemizedlist>
+      <itemizedlist>
+        <listitem>
+          <para>To get the number of running threads, run:</para>
+          <screen># lctl get_param ost.OSS.ost_io.threads_started</screen>
+          <para>The command output will be similar to this:</para>
+          <screen>ost.OSS.ost_io.threads_started=128</screen>
+        </listitem>
+      </itemizedlist>
+      <itemizedlist>
+        <listitem>
+          <para>To get the maximum number of threads (512 in this example), run:</para>
+          <screen># lctl get_param ost.OSS.ost_io.threads_max</screen>
+          <para>The command output will be:</para>
+          <screen>ost.OSS.ost_io.threads_max=512</screen>
+        </listitem>
+      </itemizedlist>
+      <itemizedlist>
+        <listitem>
+          <para> To set the maximum thread count to 256 instead of 512 (for example, to avoid overloading the storage array with requests), run:</para>
+          <screen># lctl set_param ost.OSS.ost_io.threads_max=256</screen>
+          <para>The command output will be:</para>
+          <screen>ost.OSS.ost_io.threads_max=256</screen>
+        </listitem>
+      </itemizedlist>
+      <itemizedlist>
+        <listitem>
+          <para> To check if the new <literal>threads_max</literal> setting is active, run:</para>
+          <screen># lctl get_param ost.OSS.ost_io.threads_max</screen>
+          <para>The command output will be similar to this:</para>
+          <screen>ost.OSS.ost_io.threads_max=256</screen>
+        </listitem>
+      </itemizedlist>
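+      <para>As a minimal sketch, the three thread counters for one service can be read in a single call. This assumes that <literal>lctl get_param</literal> accepts a wildcard in the parameter name on your Lustre version; the output is omitted because it depends on the node.</para>
+      <screen># lctl get_param ost.OSS.ost_io.threads_*</screen>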
+      <note>
+        <para>Currently, the maximum thread count setting is advisory because Lustre does not reduce the number of service threads in use, even if that number exceeds the <literal>threads_max</literal> value. Lustre does not stop service threads once they are started.</para>
+      </note>
+    </section>
+  </section>
+  <section xml:id="dbdoclet.50438271_83523">
+    <title>31.3 Debug</title>
+    <para><literal>
+        <emphasis role="bold">/proc/sys/lnet/debug</emphasis>
+      </literal></para>
+    <para>By default, Lustre generates a detailed log of all operations to aid in debugging. The level of debugging can affect the performance or speed you achieve with Lustre. Therefore, it is useful to reduce this overhead by turning down the debug level<footnote>
+        <para>This controls the level of Lustre debugging kept in the internal log buffer. It does not alter the level of debugging that goes to syslog.</para>
+      </footnote> to improve performance. Raise the debug level when you need to collect the logs for debugging problems. The debugging mask can be set with &quot;symbolic names&quot; instead of the numerical values that were used in prior releases. The new symbolic format is shown in the examples below.</para>
+    <note>
+      <para>All of the commands below must be run as root, as indicated by the <literal>#</literal> prompt.</para>
+    </note>
+    <para>To verify the debug level in use, examine the <literal>sysctl</literal> that controls debugging by running:</para>
+    <screen># sysctl lnet.debug 
+lnet.debug = ioctl neterror warning error emerg ha config console</screen>
+    <para>To turn off debugging (except for network error debugging), run this command on all concerned nodes:</para>
+    <screen># sysctl -w lnet.debug=&quot;neterror&quot; 
+lnet.debug = neterror</screen>
+    <para>To turn off debugging completely, run this command on all concerned nodes:</para>
+    <screen># sysctl -w lnet.debug=0 
+lnet.debug = 0</screen>
+    <para>To set an appropriate debug level for a production environment, run:</para>
+    <screen># sysctl -w lnet.debug=&quot;warning dlmtrace error emerg ha rpctrace vfstrace&quot; 
+lnet.debug = warning dlmtrace error emerg ha rpctrace vfstrace</screen>
+    <para>The flags above collect enough high-level information to aid debugging, but they do not cause any serious performance impact.</para>
+    <para>To clear all flags and set new ones, run:</para>
+    <screen># sysctl -w lnet.debug=&quot;warning&quot; 
+lnet.debug = warning</screen>
+    <para>To add new flags to existing ones, prefix them with a &quot;<literal>+</literal>&quot;:</para>
+    <screen># sysctl -w lnet.debug=&quot;+neterror +ha&quot; 
+lnet.debug = +neterror +ha
+# sysctl lnet.debug 
+lnet.debug = neterror warning ha</screen>
+    <para>To remove flags, prefix them with a &quot;<literal>-</literal>&quot;:</para>
+    <screen># sysctl -w lnet.debug=&quot;-ha&quot; 
+lnet.debug = -ha
+# sysctl lnet.debug 
+lnet.debug = neterror warning</screen>
+    <para>You can verify and change the debug level using the <literal>/proc</literal> interface in Lustre. To use the flags with <literal>/proc</literal>, run:</para>
+    <screen># cat /proc/sys/lnet/debug 
+neterror warning
+# echo &quot;+ha&quot; &gt; /proc/sys/lnet/debug 
+# cat /proc/sys/lnet/debug 
+neterror warning ha
+# echo &quot;-warning&quot; &gt; /proc/sys/lnet/debug
+# cat /proc/sys/lnet/debug 
+neterror ha</screen>
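+    <para>The commands above can be combined to raise the debug level only while a problem is being reproduced. The following is a sketch, not a definitive procedure: the shell variable <literal>old_mask</literal> is a name chosen for illustration, and it assumes that writing a plain list of flags to <literal>/proc/sys/lnet/debug</literal> replaces the current mask, as it does with <literal>sysctl -w</literal>. Reproduce the problem between the second and third commands.</para>
+    <screen># old_mask=&quot;$(cat /proc/sys/lnet/debug)&quot;
+# echo &quot;+rpctrace&quot; &gt; /proc/sys/lnet/debug
+# echo &quot;$old_mask&quot; &gt; /proc/sys/lnet/debug</screen>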
+    <para><literal>
+        <emphasis role="bold">/proc/sys/lnet/subsystem_debug</emphasis>
+      </literal></para>
+    <para>This controls the debug logs for subsystems (see <literal>S_*</literal> definitions).</para>
+    <para><literal>
+        <emphasis role="bold">/proc/sys/lnet/debug_path</emphasis>
+      </literal></para>
+    <para>This indicates the location where debugging symbols should be stored for <literal>gdb</literal>. The default is set to <literal>/r/tmp/lustre-log-localhost.localdomain</literal>.</para>
+    <para>These values can also be set via <literal>sysctl -w lnet.debug={value}</literal>.</para>
+    <note>
+      <para>The above entries only exist when Lustre has already been loaded.</para>
+    </note>
+    <para><literal>
+        <emphasis role="bold">/proc/sys/lnet/panic_on_lbug</emphasis>
+      </literal></para>
+    <para>This causes Lustre to call &quot;panic&quot; when it detects an internal problem (an <literal>LBUG</literal>); panic crashes the node. This is particularly useful when a kernel crash dump utility is configured. The crash dump is triggered when the internal inconsistency is detected by Lustre.</para>
+    <para><literal>
+        <emphasis role="bold">/proc/sys/lnet/upcall</emphasis>
+      </literal></para>
+    <para>This allows you to specify the path to the binary which will be invoked when an <literal>LBUG</literal> is encountered. This binary is called with four parameters. The first one is the string &quot;<literal>LBUG</literal>&quot;. The second one is the file where the <literal>LBUG</literal> occurred. The third one is the function name. The fourth one is the line number in the file.</para>
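+    <para>As an illustration of the four parameters, a hypothetical upcall script might record them in syslog. The script below is an assumption for illustration only: the path <literal>/usr/local/sbin/lbug_upcall.sh</literal> and the tag <literal>lustre-lbug</literal> are invented names, and it assumes the standard <literal>logger</literal> utility is installed.</para>
+    <screen>#!/bin/sh
+# Hypothetical LBUG upcall: $1 is the string LBUG, $2 the file,
+# $3 the function name, $4 the line number.
+logger -t lustre-lbug &quot;$1 in $3() at $2:$4&quot;</screen>
+    <para>Assuming the <literal>/proc</literal> entry accepts a path written with <literal>echo</literal>, the script could be registered with:</para>
+    <screen># echo /usr/local/sbin/lbug_upcall.sh &gt; /proc/sys/lnet/upcall</screen>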
+    <section remap="h3">
+      <title>31.3.1 RPC Information for Other OBD Devices</title>
+      <para>Some OBD devices maintain a count of the number of RPC events that they process. Sometimes these events are more specific to operations of the device, like llite, than actual raw RPC counts.</para>
+      <screen>$ find /proc/fs/lustre/ -name stats
+/proc/fs/lustre/osc/lustre-OST0001-osc-ce63ca00/stats
+/proc/fs/lustre/osc/lustre-OST0000-osc-ce63ca00/stats
+/proc/fs/lustre/osc/lustre-OST0001-osc/stats
+/proc/fs/lustre/osc/lustre-OST0000-osc/stats
+/proc/fs/lustre/mdt/MDS/mds_readpage/stats
+/proc/fs/lustre/mdt/MDS/mds_setattr/stats
+/proc/fs/lustre/mdt/MDS/mds/stats
+/proc/fs/lustre/mds/lustre-MDT0000/exports/ab206805-0630-6647-8543-d24265c91a3d/stats
+/proc/fs/lustre/mds/lustre-MDT0000/exports/08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0/stats
+/proc/fs/lustre/mds/lustre-MDT0000/stats
+/proc/fs/lustre/ldlm/services/ldlm_canceld/stats
+/proc/fs/lustre/ldlm/services/ldlm_cbd/stats
+/proc/fs/lustre/llite/lustre-ce63ca00/stats
  </screen>
-        <para>By default, sync_on_lock_cancel is set to never, because asynchronous journal commit is disabled by default.</para>
-        <para>When asynchronous journal commit is enabled (sync_journal=0), sync_on_lock_cancel is automatically set to always, if it was previously set to never.</para>
-        <para>Similarly, when asynchronous journal commit is disabled, (sync_journal=1), sync_on_lock_cancel is enforced to never.</para>
-      </section>
-      <section remap="h3">
-        <title>31.2.9 mballoc <anchor xml:id="dbdoclet.50438271_marker-1297045" xreflabel=""/>History</title>
-        <para><emphasis role="bold">/proc/fs/ldiskfs/sda/mb_history</emphasis></para>
-        <para>Multi-Block-Allocate (mballoc), enables Lustre to ask ldiskfs to allocate multiple blocks with a single request to the block allocator. Typically, an ldiskfs file system allocates only one block per time. Each mballoc-enabled partition has this file. This is sample output:</para>
-        <screen>pid  inode   goal            result          found   grps    cr      \   me\
-rge   tail    broken
-2838       139267  17/12288/1      17/12288/1      1       0       0       \
-\   M       1       8192
-2838       139267  17/12289/1      17/12289/1      1       0       0       \
-\   M       0       0
-2838       139267  17/12290/1      17/12290/1      1       0       0       \
-\   M       1       2
-2838       24577   3/12288/1       3/12288/1       1       0       0       \
-\   M       1       8192
-2838       24578   3/12288/1       3/771/1         1       1       1       \
-\           0       0
-2838       32769   4/12288/1       4/12288/1       1       0       0       \
-\   M       1       8192
-2838       32770   4/12288/1       4/12289/1       13      1       1       \
-\           0       0
-2838       32771   4/12288/1       5/771/1         26      2       1       \
-\           0       0
-2838       32772   4/12288/1       5/896/1         31      2       1       \
-\           1       128
-2838       32773   4/12288/1       5/897/1         31      2       1       \
-\           0       0
-2828       32774   4/12288/1       5/898/1         31      2       1       \
-\           1       2
-2838       32775   4/12288/1       5/899/1         31      2       1       \
-\           0       0
-2838       32776   4/12288/1       5/900/1         31      2       1       \
-\           1       4
-2838       32777   4/12288/1       5/901/1         31      2       1       \
-\           0       0
-2838       32778   4/12288/1       5/902/1         31      2       1       \
-\           1       2
+      <section remap="h4">
+        <title>31.3.1.1 Interpreting OST Statistics</title>
+        <note>
+          <para>See also <xref linkend="dbdoclet.50438219_84890"/> (llobdstat) and <xref linkend="dbdoclet.50438273_80593"/> (CollectL).</para>
+        </note>
+        <para>The OST .../stats files can be used to track client statistics (client activity) for each OST. The <literal>llstat.pl</literal> tool can produce a periodic dump of the values in these files (for example, every 10 seconds) that shows the RPC rates, similar to <literal>iostat</literal>:</para>
+        <screen># llstat /proc/fs/lustre/osc/lustre-OST0000-osc/stats 
+/usr/bin/llstat: STATS on 09/14/07 /proc/fs/lustre/osc/lustre-OST0000-osc/stats on 192.168.10.34@tcp
+snapshot_time                      1189732762.835363
+ost_create                 1
+ost_get_info                       1
+ost_connect                        1
+ost_set_info                       1
+obd_ping                   212</screen>
+        <para>To clear the statistics, give the <literal>-c</literal> option to <literal>llstat.pl</literal>. To specify how frequently the statistics should be cleared (in seconds), use an integer for the <literal>-i</literal> option. This is sample output with the <literal>-c</literal> and <literal>-i10</literal> options used, providing statistics every 10 seconds:</para>
+        <screen>$ llstat -c -i10 /proc/fs/lustre/ost/OSS/ost_io/stats
+ 
+/usr/bin/llstat: STATS on 06/06/07 /proc/fs/lustre/ost/OSS/ost_io/stats on 192.168.16.35@tcp
+snapshot_time                              1181074093.276072
+ 
+/proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074103.284895
+Name          Cur.Count  Cur.Rate  #Events  Unit     last      min    avg        max     stddev
+req_waittime  8          0         8        [usec]   2078      34     259.75     868     317.49
+req_qdepth    8          0         8        [reqs]   1         0      0.12       1       0.35
+req_active    8          0         8        [reqs]   11        1      1.38       2       0.52
+reqbuf_avail  8          0         8        [bufs]   511       63     63.88      64      0.35
+ost_write     8          0         8        [bytes]  1697677   72914  212209.62  387579  91874.29
+ 
+/proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074113.290180
+Name          Cur.Count  Cur.Rate  #Events  Unit     last      min    avg        max     stddev
+req_waittime  31         3         39       [usec]   30011     34     822.79     12245   2047.71
+req_qdepth    31         3         39       [reqs]   0         0      0.03       1       0.16
+req_active    31         3         39       [reqs]   58        1      1.77       3       0.74
+reqbuf_avail  31         3         39       [bufs]   1977      63     63.79      64      0.41
+ost_write     30         3         38       [bytes]  10284679  15019  315325.16  910694  197776.51
+ 
+/proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074123.325560
+Name          Cur.Count  Cur.Rate  #Events  Unit     last      min    avg        max     stddev
+req_waittime  21         2         60       [usec]   14970     34     784.32     12245   1878.66
+req_qdepth    21         2         60       [reqs]   0         0      0.02       1       0.13
+req_active    21         2         60       [reqs]   33        1      1.70       3       0.70
+reqbuf_avail  21         2         60       [bufs]   1341      63     63.82      64      0.39
+ost_write     21         2         59       [bytes]  7648424   15019  332725.08  910694  180397.87
  </screen>
-        <para>The parameters are described below:</para>
+        <para>Where:</para>
          <informaltable frame="all">
            <tgroup cols="2">
              <colspec colname="c1" colwidth="50*"/>
              <colspec colname="c2" colwidth="50*"/>
              <thead>
                <row>
-                <entry><para><emphasis role="bold">Parameter</emphasis></para></entry>
-                <entry><para><emphasis role="bold">Description</emphasis></para></entry>
+                <entry>
+                  <para><emphasis role="bold">Parameter</emphasis></para>
+                </entry>
+                <entry>
+                  <para><emphasis role="bold">Description</emphasis></para>
+                </entry>
                </row>
              </thead>
              <tbody>
                <row>
-                <entry><para> <emphasis role="bold">pid</emphasis></para></entry>
-                <entry><para> Process that made the allocation.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">inode</emphasis></para></entry>
-                <entry><para> inode number allocated blocks</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">goal</emphasis></para></entry>
-                <entry><para> Initial request that came to mballoc (group/block-in-group/number-of-blocks)</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">result</emphasis></para></entry>
-                <entry><para> What mballoc actually found for this request.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">found</emphasis></para></entry>
-                <entry><para> Number of free chunks mballoc found and measured before the final decision.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">grps</emphasis></para></entry>
-                <entry><para> Number of groups mballoc scanned to satisfy the request.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">cr</emphasis></para></entry>
-                <entry><para> Stage at which mballoc found the result:</para><para><emphasis role="bold">0</emphasis> - best in terms of resource allocation. The request was 1MB or larger and was satisfied directly via the kernel buddy allocator.</para><para><emphasis role="bold">1</emphasis> - regular stage (good at resource consumption)</para><para><emphasis role="bold">2</emphasis> - fs is quite fragmented (not that bad at resource consumption)</para><para><emphasis role="bold">3</emphasis> - fs is very fragmented (worst at resource consumption)</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">queue</emphasis></para></entry>
-                <entry><para> Total bytes in active/queued sends.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">merge</emphasis></para></entry>
-                <entry><para> Whether the request hit the goal. This is good as extents code can now merge new blocks to existing extent, eliminating the need for extents tree growth.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">tail</emphasis></para></entry>
-                <entry><para> Number of blocks left free after the allocation breaks large free chunks.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">broken</emphasis></para></entry>
-                <entry><para> How large the broken chunk was.</para></entry>
+                <entry>
+                  <para> <literal>
+                      <emphasis role="bold">Cur.Count</emphasis>
+                    </literal></para>
+                </entry>
+                <entry>
+                  <para>Number of events of each type sent in the last interval (in this example, 10s)</para>
+                </entry>
+              </row>
+              <row>
+                <entry>
+                  <para> <literal>
+                      <emphasis role="bold">Cur.Rate</emphasis>
+                    </literal></para>
+                </entry>
+                <entry>
+                  <para>Number of events per second in the last interval</para>
+                </entry>
+              </row>
+              <row>
+                <entry>
+                  <para> <literal>
+                      <emphasis role="bold">#Events</emphasis>
+                    </literal></para>
+                </entry>
+                <entry>
+                  <para>Total number of such events since the system started</para>
+                </entry>
+              </row>
+              <row>
+                <entry>
+                  <para> <emphasis role="bold">
+                      <literal>Unit</literal>
+                    </emphasis></para>
+                </entry>
+                <entry>
+                  <para>Unit of measurement for that statistic (microseconds, requests, buffers)</para>
+                </entry>
+              </row>
+              <row>
+                <entry>
+                  <para> <emphasis role="bold">
+                      <literal>last</literal>
+                    </emphasis></para>
+                </entry>
+                <entry>
+                  <para>Average rate of these events (in units/event) for the last interval during which they arrived. For instance, for <literal>ost_destroy</literal>, 400 object destroys in the previous 10 seconds might take an average of 736 microseconds per destroy.</para>
+                </entry>
+              </row>
+              <row>
+                <entry>
+                  <para> <emphasis role="bold">
+                      <literal>min</literal>
+                    </emphasis></para>
+                </entry>
+                <entry>
+                  <para>Minimum rate (in units/event) since the service started</para>
+                </entry>
+              </row>
+              <row>
+                <entry>
+                  <para> <emphasis role="bold">
+                      <literal>avg</literal>
+                    </emphasis></para>
+                </entry>
+                <entry>
+                  <para>Average rate</para>
+                </entry>
+              </row>
+              <row>
+                <entry>
+                  <para> <emphasis role="bold">
+                      <literal>max</literal>
+                    </emphasis></para>
+                </entry>
+                <entry>
+                  <para>Maximum rate</para>
+                </entry>
+              </row>
+              <row>
+                <entry>
+                  <para> <emphasis role="bold">
+                      <literal>stddev</literal>
+                    </emphasis></para>
+                </entry>
+                <entry>
+                  <para>Standard deviation (not measured in all cases)</para>
+                </entry>
                </row>
              </tbody>
            </tgroup>
          </informaltable>
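+        <para>As a worked check against the sample output above (an observation about this sample, not a statement of how <literal>llstat</literal> is implemented): in the first interval, dividing <literal>last</literal> by <literal>Cur.Count</literal> reproduces <literal>avg</literal>, and in the second interval the cumulative <literal>avg</literal> equals the sum of both <literal>last</literal> values divided by <literal>#Events</literal>.</para>
+        <screen>req_waittime, first interval:   2078 / 8            = 259.75 usec
+req_waittime, second interval:  (2078 + 30011) / 39 = 822.79 usec (rounded)</screen>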
-        <para>Most customers are probably interested in found/cr. If cr is 0 1 and found is less than 100, then mballoc is doing quite well.</para>
-        <para>Also, number-of-blocks-in-request (third number in the goal triple) can tell the number of blocks requested by the obdfilter. If the obdfilter is doing a lot of small requests (just few blocks), then either the client is processing input/output to a lot of small files, or something may be wrong with the client (because it is better if client sends large input/output requests). This can be investigated with the OSC rpc_stats or OST brw_stats mentioned above.</para>
-        <para>Number of groups scanned (grps column) should be small. If it reaches a few dozen often, then either your disk file system is pretty fragmented or mballoc is doing something wrong in the group selection part.</para>
-      </section>
-      <section remap="h3">
-        <title>31.2.10 mballoc3<anchor xml:id="dbdoclet.50438271_marker-1290795" xreflabel=""/> Tunables</title>
-        <para>Lustre version 1.6.1 and later includes mballoc3, which was built on top of mballoc2. By default, mballoc3 is enabled, and adds these features:</para>
-        <itemizedlist><listitem>
-            <para> Pre-allocation for single files (helps to resist fragmentation)</para>
-          </listitem>
-
-<listitem>
-            <para> Pre-allocation for a group of files (helps to pack small files into large, contiguous chunks)</para>
-          </listitem>
-
-<listitem>
-            <para> Stream allocation (helps to decrease the seek rate)</para>
-          </listitem>
-
-</itemizedlist>
-        <para>The following mballoc3 tunables are available:</para>
+        <para>The events common to all services are:</para>
          <informaltable frame="all">
            <tgroup cols="2">
              <colspec colname="c1" colwidth="50*"/>
              <colspec colname="c2" colwidth="50*"/>
              <thead>
                <row>
-                <entry><para><emphasis role="bold">Field</emphasis></para></entry>
-                <entry><para><emphasis role="bold">Description</emphasis></para></entry>
+                <entry>
+                  <para><emphasis role="bold">Parameter</emphasis></para>
+                </entry>
+                <entry>
+                  <para><emphasis role="bold">Description</emphasis></para>
+                </entry>
                </row>
              </thead>
              <tbody>
                <row>
-                <entry><para> <emphasis role="bold">stats</emphasis></para></entry>
-                <entry><para> Enables/disables the collection of statistics. Collected statistics can be found</para><para>in /proc/fs/ldiskfs2/&lt;dev&gt;/mb_history.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">max_to_scan</emphasis></para></entry>
-                <entry><para> Maximum number of free chunks that mballoc finds before a final decision to avoid livelock.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">min_to_scan</emphasis></para></entry>
-                <entry><para> Minimum number of free chunks that mballoc finds before a final decision. This is useful for a very small request, to resist fragmentation of big free chunks.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">order2_req</emphasis></para></entry>
-                <entry><para> For requests equal to 2^N (where N &gt;= order2_req), a very fast search via buddy structures is used.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">stream_req</emphasis></para></entry>
-                <entry><para> Requests smaller or equal to this value are packed together to form large write I/Os.</para></entry>
+                <entry>
+                  <para> <literal>
+                      <emphasis role="bold">req_waittime</emphasis>
+                    </literal></para>
+                </entry>
+                <entry>
+                  <para>Amount of time a request waited in the queue before being handled by an available server thread.</para>
+                </entry>
+              </row>
+              <row>
+                <entry>
+                  <para> <literal>
+                      <emphasis role="bold">req_qdepth</emphasis>
+                    </literal></para>
+                </entry>
+                <entry>
+                  <para>Number of requests waiting to be handled in the queue for this service.</para>
+                </entry>
+              </row>
+              <row>
+                <entry>
+                  <para> <literal>
+                      <emphasis role="bold">req_active</emphasis>
+                    </literal></para>
+                </entry>
+                <entry>
+                  <para>Number of requests currently being handled.</para>
+                </entry>
+              </row>
+              <row>
+                <entry>
+                  <para> <literal>
+                      <emphasis role="bold">reqbuf_avail</emphasis>
+                    </literal></para>
+                </entry>
+                <entry>
+                  <para>Number of unsolicited lnet request buffers for this service.</para>
+                </entry>
                </row>
              </tbody>
            </tgroup>
          </informaltable>
-        <para>The following tunables, providing more control over allocation policy, will be available in the next version:</para>
+        <para>Some service-specific events of interest are:</para>
          <informaltable frame="all">
            <tgroup cols="2">
              <colspec colname="c1" colwidth="50*"/>
              <colspec colname="c2" colwidth="50*"/>
              <thead>
                <row>
-                <entry><para><emphasis role="bold">Field</emphasis></para></entry>
-                <entry><para><emphasis role="bold">Description</emphasis></para></entry>
+                <entry>
+                  <para><emphasis role="bold">Parameter</emphasis></para>
+                </entry>
+                <entry>
+                  <para><emphasis role="bold">Description</emphasis></para>
+                </entry>
                </row>
              </thead>
              <tbody>
                <row>
-                <entry><para> <emphasis role="bold">stats</emphasis></para></entry>
-                <entry><para> Enables/disables the collection of statistics. Collected statistics can be found in /proc/fs/ldiskfs2/&lt;dev&gt;/mb_history.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">max_to_scan</emphasis></para></entry>
-                <entry><para> Maximum number of free chunks that mballoc finds before a final decision to avoid livelock.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">min_to_scan</emphasis></para></entry>
-                <entry><para> Minimum number of free chunks that mballoc finds before a final decision. This is useful for a very small request, to resist fragmentation of big free chunks.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">order2_req</emphasis></para></entry>
-                <entry><para> For requests equal to 2^N (where N &gt;= order2_req), a very fast search via buddy structures is used.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">small_req</emphasis></para></entry>
-                <entry morerows="1"><para> All requests are divided into 3 categories:</para><para>&lt; small_req (packed together to form large, aggregated requests)</para><para>&lt; large_req (allocated mostly in linearly)</para><para>&gt; large_req (very large requests so the arm seek does not matter)</para><para>The idea is that we try to pack small requests to form large requests, and then place all large requests (including compound from the small ones) close to one another, causing as few arm seeks as possible.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">large_req</emphasis></para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">prealloc_table</emphasis></para></entry>
-                <entry><para> The amount of space to preallocate depends on the current file size. The idea is that for small files we do not need 1 MB preallocations and for large files, 1 MB preallocations are not large enough; it is better to preallocate 4 MB.</para></entry>
-              </row>
-              <row>
-                <entry><para> <emphasis role="bold">group_prealloc</emphasis></para></entry>
-                <entry><para> The amount of space preallocated for small requests to be grouped.</para></entry>
+                <entry>
+                  <para> <literal>
+                      <emphasis role="bold">ldlm_enqueue</emphasis>
+                    </literal></para>
+                </entry>
+                <entry>
+                  <para>Time it takes to enqueue a lock (this includes file open on the MDS)</para>
+                </entry>
+              </row>
+              <row>
+                <entry>
+                  <para> <literal>
+                      <emphasis role="bold">mds_reint</emphasis>
+                    </literal></para>
+                </entry>
+                <entry>
+                  <para>Time it takes to process an MDS modification record (includes create, <literal>mkdir</literal>, <literal>unlink</literal>, <literal>rename</literal> and <literal>setattr</literal>)</para>
+                </entry>
                </row>
              </tbody>
            </tgroup>
          </informaltable>
        </section>
-      <section remap="h3">
-        <title>31.2.11 <anchor xml:id="dbdoclet.50438271_13474" xreflabel=""/>Lo<anchor xml:id="dbdoclet.50438271_marker-1290874" xreflabel=""/>cking</title>
-        <para><emphasis role="bold">/proc/fs/lustre/ldlm/ldlm/namespaces/&lt;OSC name|MDC name&gt;/lru_size</emphasis></para>
-        <para>The lru_size parameter is used to control the number of client-side locks in an LRU queue. LRU size is dynamic, based on load. This optimizes the number of locks available to nodes that have different workloads (e.g., login/build nodes vs. compute nodes vs. backup nodes).</para>
-        <para>The total number of locks available is a function of the server's RAM. The default limit is 50 locks/1 MB of RAM. If there is too much memory pressure, then the LRU size is shrunk. The number of locks on the server is limited to {number of OST/MDT on node} * {number of clients} * {client lru_size}.</para>
-        <itemizedlist><listitem>
-            <para> To enable automatic LRU sizing, set the lru_size parameter to 0. In this case, the lru_size parameter shows the current number of locks being used on the export. (In Lustre 1.6.5.1 and later, LRU sizing is enabled, by default.)</para>
-          </listitem>
-
-<listitem>
-            <para> To specify a maximum number of locks, set the lru_size parameter to a value &gt; 0 (former numbers are okay, 100 * CPU_NR). We recommend that you only increase the LRU size on a few login nodes where users access the file system interactively.</para>
-          </listitem>
-
-</itemizedlist>
-        <para>To clear the LRU on a single client, and as a result flush client cache, without changing the lru_size value:</para>
-        <screen>$ lctl set_param ldlm.namespaces.&lt;osc_name|mdc_name&gt;.lru_size=clear
-</screen>
-        <para>If you shrink the LRU size below the number of existing unused locks, then the unused locks are canceled immediately. Use echo clear to cancel all locks without changing the value.</para>
-                <note><para>Currently, the lru_size parameter can only be set temporarily with lctl set_param; it cannot be set permanently.</para></note>
-        <para>To disable LRU sizing, run this command on the Lustre clients:</para>
-        <screen>$ lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))
-</screen>
-        <para>Replace NR_CPU value with the number of CPUs on the node.</para>
-        <para>To determine the number of locks being granted:</para>
-        <screen>$ lctl get_param ldlm.namespaces.*.pool.limit
-</screen>
-      </section>
-      <section remap="h3">
-        <title>31.2.12 <anchor xml:id="dbdoclet.50438271_87260" xreflabel=""/>Setting MDS and OSS Thread Counts</title>
-        <para>MDS and OSS thread counts (minimum and maximum) can be set via the {min,max}_thread_count tunable. For each service, a new /proc/fs/lustre/{service}/*/thread_{min,max,started} entry is created. The tunable, {service}.thread_{min,max,started}, can be used to set the minimum and maximum thread counts or get the current number of running threads for the following services.</para>
-        <informaltable frame="all">
-          <tgroup cols="2">
-            <colspec colname="c1" colwidth="50*"/>
-            <colspec colname="c2" colwidth="50*"/>
-            <tbody>
-              <row>
-                <entry><para> <emphasis role="bold">Service</emphasis></para></entry>
-                <entry><para> <emphasis role="bold">Description</emphasis></para></entry>
-              </row>
-              <row>
-                <entry><para> mdt.MDS.mds</para></entry>
-                <entry><para> normal metadata ops</para></entry>
-              </row>
-              <row>
-                <entry><para> mdt.MDS.mds_readpage</para></entry>
-                <entry><para> metadata readdir</para></entry>
-              </row>
-              <row>
-                <entry><para> mdt.MDS.mds_setattr</para></entry>
-                <entry><para> metadata setattr</para></entry>
-              </row>
-              <row>
-                <entry><para> ost.OSS.ost</para></entry>
-                <entry><para> normal data</para></entry>
-              </row>
-              <row>
-                <entry><para> ost.OSS.ost_io</para></entry>
-                <entry><para> bulk data IO</para></entry>
-              </row>
-              <row>
-                <entry><para> ost.OSS.ost_create</para></entry>
-                <entry><para> OST object pre-creation service</para></entry>
-              </row>
-              <row>
-                <entry><para> ldlm.services.ldlm_canceld</para></entry>
-                <entry><para> DLM lock cancel</para></entry>
-              </row>
-              <row>
-                <entry><para> ldlm.services.ldlm_cbd</para></entry>
-                <entry><para> DLM lock grant</para></entry>
-              </row>
-            </tbody>
-          </tgroup>
-        </informaltable>
-        <itemizedlist><listitem>
-            <para> To temporarily set this tunable, run:</para>
-          </listitem>
-
-</itemizedlist>
-        <screen># lctl {get,set}_param {service}.thread_{min,max,started} 
-</screen>
-        <itemizedlist><listitem>
-            <para> To permanently set this tunable, run:</para>
-          </listitem>
-
-</itemizedlist>
-        <screen># lctl conf_param {service}.thread_{min,max,started} </screen>
-        <para>The following examples show how to set thread counts and get the number of running threads for the ost_io service.</para>
-        <itemizedlist><listitem>
-            <para> To get the number of running threads, run:</para>
-          </listitem>
-
-</itemizedlist>
-        <screen># lctl get_param ost.OSS.ost_io.threads_started</screen>
-        <para>The command output will be similar to this:</para>
-        <screen>ost.OSS.ost_io.threads_started=128
-</screen>
-        <itemizedlist><listitem>
-            <para> To set the maximum number of threads (512), run:</para>
-          </listitem>
-
-</itemizedlist>
-        <screen># lctl get_param ost.OSS.ost_io.threads_max
-</screen>
-        <para>The command output will be:</para>
-        <screen>ost.OSS.ost_io.threads_max=512
-</screen>
-        <itemizedlist><listitem>
-            <para> To set the maximum thread count to 256 instead of 512 (to avoid overloading the storage or for an array with requests), run:</para>
-          </listitem>
-
-</itemizedlist>
-        <screen># lctl set_param ost.OSS.ost_io.threads_max=256
-</screen>
-        <para>The command output will be:</para>
-        <screen>ost.OSS.ost_io.threads_max=256
-</screen>
-        <itemizedlist><listitem>
-            <para> To check if the new threads_max setting is active, run:</para>
-          </listitem>
-
-</itemizedlist>
-        <screen># lctl get_param ost.OSS.ost_io.threads_max
-</screen>
-        <para>The command output will be similar to this:</para>
-        <screen>ost.OSS.ost_io.threads_max=256
-</screen>
-                <note><para>Currently, the maximum thread count setting is advisory because Lustre does not reduce the number of service threads in use, even if that number exceeds the threads_max value. Lustre does not stop service threads once they are started.</para></note>
-      </section>
-    </section>
-    <section xml:id="dbdoclet.50438271_83523">
-      <title>31.3 Debug</title>
-      <para><emphasis role="bold">/proc/sys/lnet/debug</emphasis></para>
-      <para>By default, Lustre generates a detailed log of all operations to aid in debugging. The level of debugging can affect the performance or speed you achieve with Lustre. Therefore, it is useful to reduce this overhead by turning down the debug level<footnote><para>This controls the level of Lustre debugging kept in the internal log buffer. It does not alter the level of debugging that goes to syslog.</para></footnote> to improve performance. Raise the debug level when you need to collect the logs for debugging problems. The debugging mask can be set with &quot;symbolic names&quot; instead of the numerical values that were used in prior releases. The new symbolic format is shown in the examples below.</para>
-              <note><para>All of the commands below must be run as root; note the # nomenclature.</para></note>
-      <para>To verify the debug level used by examining the sysctl that controls debugging, run:</para>
-      <screen># sysctl lnet.debug 
-lnet.debug = ioctl neterror warning error emerg ha config console
-</screen>
-      <para>To turn off debugging (except for network error debugging), run this command on all concerned nodes:</para>
-      <screen># sysctl -w lnet.debug=&quot;neterror&quot; 
-lnet.debug = neterror
-</screen>
-      <para>To turn off debugging completely, run this command on all concerned nodes:</para>
-      <screen># sysctl -w lnet.debug=0 
-lnet.debug = 0
-</screen>
-      <para>To set an appropriate debug level for a production environment, run:</para>
-      <screen># sysctl -w lnet.debug=&quot;warning dlmtrace error emerg ha rpctrace vfstrace&quot; 
-lnet.debug = warning dlmtrace error emerg ha rpctrace vfstrace
-</screen>
-      <para>The flags above collect enough high-level information to aid debugging, but they do not cause any serious performance impact.</para>
-      <para>To clear all flags and set new ones, run:</para>
-      <screen># sysctl -w lnet.debug=&quot;warning&quot; 
-lnet.debug = warning
-</screen>
-      <para>To add new flags to existing ones, prefix them with a &quot;+&quot;:</para>
-      <screen># sysctl -w lnet.debug=&quot;+neterror +ha&quot; 
-lnet.debug = +neterror +ha
-# sysctl lnet.debug 
-lnet.debug = neterror warning ha
-</screen>
-      <para>To remove flags, prefix them with a &quot;-&quot;:</para>
-      <screen># sysctl -w lnet.debug=&quot;-ha&quot; 
-lnet.debug = -ha
-# sysctl lnet.debug 
-lnet.debug = neterror warning
-</screen>
-      <para>You can verify and change the debug level using the /proc interface in Lustre. To use the flags with /proc, run:</para>
-      <screen># cat /proc/sys/lnet/debug 
-neterror warning
-# echo &quot;+ha&quot; &gt; /proc/sys/lnet/debug 
-# cat /proc/sys/lnet/debug 
-neterror warning ha
-# echo &quot;-warning&quot; &gt; /proc/sys/lnet/debug
-# cat /proc/sys/lnet/debug 
-neterror ha
-</screen>
-      <para><emphasis role="bold">/proc/sys/lnet/subsystem_debug</emphasis></para>
-      <para>This controls the debug logs3 for subsystems (see S_* definitions).</para>
-      <para><emphasis role="bold">/proc/sys/lnet/debug_path</emphasis></para>
-      <para>This indicates the location where debugging symbols should be stored for gdb. The default is set to /r/tmp/lustre-log-localhost.localdomain.</para>
-      <para>These values can also be set via sysctl -w lnet.debug={value}</para>
-              <note><para>The above entries only exist when Lustre has already been loaded.</para></note>
-      <para><emphasis role="bold">/proc/sys/lnet/panic_on_lbug</emphasis></para>
-      <para>This causes Lustre to call &apos;&apos;panic&apos;&apos; when it detects an internal problem (an LBUG); panic crashes the node. This is particularly useful when a kernel crash dump utility is configured. The crash dump is triggered when the internal inconsistency is detected by Lustre.</para>
-      <para><emphasis role="bold">/proc/sys/lnet/upcall</emphasis></para>
-      <para>This allows you to specify the path to the binary which will be invoked when an LBUG is encountered. This binary is called with four parameters. The first one is the string &apos;&apos;LBUG&apos;&apos;. The second one is the file where the LBUG occurred. The third one is the function name. The fourth one is the line number in the file.</para>
-      <section remap="h3">
-        <title>31.3.1 RPC Information for Other OBD Devices</title>
-        <para>Some OBD devices maintain a count of the number of RPC events that they process. Sometimes these events are more specific to operations of the device, like llite, than actual raw RPC counts.</para>
-        <screen>$ find /proc/fs/lustre/ -name stats
-/proc/fs/lustre/osc/lustre-OST0001-osc-ce63ca00/stats
-/proc/fs/lustre/osc/lustre-OST0000-osc-ce63ca00/stats
-/proc/fs/lustre/osc/lustre-OST0001-osc/stats
-/proc/fs/lustre/osc/lustre-OST0000-osc/stats
-/proc/fs/lustre/mdt/MDS/mds_readpage/stats
-/proc/fs/lustre/mdt/MDS/mds_setattr/stats
-/proc/fs/lustre/mdt/MDS/mds/stats
-/proc/fs/lustre/mds/lustre-MDT0000/exports/ab206805-0630-6647-8543-d24265c9\
-1a3d/stats
-/proc/fs/lustre/mds/lustre-MDT0000/exports/08ac6584-6c4a-3536-2c6d-b36cf9cb\
-daa0/stats
-/proc/fs/lustre/mds/lustre-MDT0000/stats
-/proc/fs/lustre/ldlm/services/ldlm_canceld/stats
-/proc/fs/lustre/ldlm/services/ldlm_cbd/stats
-/proc/fs/lustre/llite/lustre-ce63ca00/stats
-</screen>
-        <section remap="h4">
-          <title>31.3.1.1 Interpreting OST Statistics</title>
-          <note><para>See also <xref linkend="dbdoclet.50438219_84890"/> (llobdstat) and <xref linkend="dbdoclet.50438273_80593"/> (CollectL).</para></note>
-
-          <para>The OST .../stats files can be used to track client statistics (client activity) for each OST. It is possible to get a periodic dump of values from these file (for example, every 10 seconds), that show the RPC rates (similar to iostat) by using the llstat.pl tool:</para>
-          <screen># llstat /proc/fs/lustre/osc/lustre-OST0000-osc/stats 
-/usr/bin/llstat: STATS on 09/14/07 /proc/fs/lustre/osc/lustre-OST0000-osc/s\
-tats on 192.168.10.34@tcp
-snapshot_time                      1189732762.835363
-ost_create                 1
-ost_get_info                       1
-ost_connect                        1
-ost_set_info                       1
-obd_ping                   212
-</screen>
-          <para>To clear the statistics, give the -c option to llstat.pl. To specify how frequently the statistics should be cleared (in seconds), use an integer for the -i option. This is sample output with -c and -i10 options used, providing statistics every 10s):</para>
-          <screen>$ llstat -c -i10 /proc/fs/lustre/ost/OSS/ost_io/stats
- 
-/usr/bin/llstat: STATS on 06/06/07 /proc/fs/lustre/ost/OSS/ost_io/ stats on\
- 192.168.16.35@tcp
-snapshot_time                              1181074093.276072
- 
-/proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074103.284895
-Name               Cur.Count       Cur.Rate        #Events Unit            \
-\last               min             avg             max             stddev
-req_waittime       8               0               8       [usec]          \
-2078\               34              259.75          868             317.49
-req_qdepth 8               0               8       [reqs]          1\      \
-    0               0.12            1               0.35
-req_active 8               0               8       [reqs]          11\     \
-            1               1.38            2               0.52
-reqbuf_avail       8               0               8       [bufs]          \
-511\                63              63.88           64              0.35
-ost_write  8               0               8       [bytes]         1697677\\
-    72914           212209.62       387579          91874.29
- 
-/proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074113.290180
-Name               Cur.Count       Cur.Rate        #Events Unit            \
-\last               min             avg             max             stddev
-req_waittime       31              3               39      [usec]          \
-30011\              34              822.79          12245           2047.71
-req_qdepth 31              3               39      [reqs]          0\      \
-    0               0.03            1               0.16
-req_active 31              3               39      [reqs]          58\     \
-    1               1.77            3               0.74
-reqbuf_avail       31              3               39      [bufs]          \
-1977\               63              63.79           64              0.41
-ost_write  30              3               38      [bytes]         10284679\
-\   15019           315325.16       910694          197776.51
- 
-/proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074123.325560
-Name               Cur.Count       Cur.Rate        #Events Unit            \
-\last               min             avg             max             stddev
-req_waittime       21              2               60      [usec]          \
-14970\              34              784.32          12245           1878.66
-req_qdepth 21              2               60      [reqs]          0\      \
-    0               0.02            1               0.13
-req_active 21              2               60      [reqs]          33\     \
-            1               1.70            3               0.70
-reqbuf_avail       21              2               60      [bufs]          \
-1341\               63              63.82           64              0.39
-ost_write  21              2               59      [bytes]         7648424\\
-    15019           332725.08       910694          180397.87
-</screen>
-          <para>Where:</para>
-          <informaltable frame="all">
-            <tgroup cols="2">
-              <colspec colname="c1" colwidth="50*"/>
-              <colspec colname="c2" colwidth="50*"/>
-              <thead>
-                <row>
-                  <entry><para><emphasis role="bold">Parameter</emphasis></para></entry>
-                  <entry><para><emphasis role="bold">Description</emphasis></para></entry>
-                </row>
-              </thead>
-              <tbody>
-                <row>
-                  <entry><para> <emphasis role="bold">Cur. Count</emphasis></para></entry>
-                  <entry><para> Number of events of each type sent in the last interval (in this example, 10s)</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">Cur. Rate</emphasis></para></entry>
-                  <entry><para> Number of events per second in the last interval</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">#Events</emphasis></para></entry>
-                  <entry><para> Total number of such events since the system started</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">Unit</emphasis></para></entry>
-                  <entry><para> Unit of measurement for that statistic (microseconds, requests, buffers)</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">last</emphasis></para></entry>
-                  <entry><para> Average rate of these events (in units/event) for the last interval during which they arrived. For instance, in the above mentioned case of ost_destroy it took an average of 736 microseconds per destroy for the 400 object destroys in the previous 10 seconds.</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">min</emphasis></para></entry>
-                  <entry><para> Minimum rate (in units/events) since the service started</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">avg</emphasis></para></entry>
-                  <entry><para> Average rate</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">max</emphasis></para></entry>
-                  <entry><para> Maximum rate</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">stddev</emphasis></para></entry>
-                  <entry><para> Standard deviation (not measured in all cases)</para></entry>
-                </row>
-              </tbody>
-            </tgroup>
-          </informaltable>
-          <para>The events common to all services are:</para>
-          <informaltable frame="all">
-            <tgroup cols="2">
-              <colspec colname="c1" colwidth="50*"/>
-              <colspec colname="c2" colwidth="50*"/>
-              <thead>
-                <row>
-                  <entry><para><emphasis role="bold">Parameter</emphasis></para></entry>
-                  <entry><para><emphasis role="bold">Description</emphasis></para></entry>
-                </row>
-              </thead>
-              <tbody>
-                <row>
-                  <entry><para> <emphasis role="bold">req_waittime</emphasis></para></entry>
-                  <entry><para> Amount of time a request waited in the queue before being handled by an available server thread.</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">req_qdepth</emphasis></para></entry>
-                  <entry><para> Number of requests waiting to be handled in the queue for this service.</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">req_active</emphasis></para></entry>
-                  <entry><para> Number of requests currently being handled.</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">reqbuf_avail</emphasis></para></entry>
-                  <entry><para> Number of unsolicited lnet request buffers for this service.</para></entry>
-                </row>
-              </tbody>
-            </tgroup>
-          </informaltable>
-          <para>Some service-specific events of interest are:</para>
-          <informaltable frame="all">
-            <tgroup cols="2">
-              <colspec colname="c1" colwidth="50*"/>
-              <colspec colname="c2" colwidth="50*"/>
-              <thead>
-                <row>
-                  <entry><para><emphasis role="bold">Parameter</emphasis></para></entry>
-                  <entry><para><emphasis role="bold">Description</emphasis></para></entry>
-                </row>
-              </thead>
-              <tbody>
-                <row>
-                  <entry><para> <emphasis role="bold">ldlm_enqueue</emphasis></para></entry>
-                  <entry><para> Time it takes to enqueue a lock (this includes file open on the MDS)</para></entry>
-                </row>
-                <row>
-                  <entry><para> <emphasis role="bold">mds_reint</emphasis></para></entry>
-                  <entry><para> Time it takes to process an MDS modification record (includes create, mkdir, unlink, rename and setattr)</para></entry>
-                </row>
-              </tbody>
-            </tgroup>
-          </informaltable>
-        </section>
-        <section remap="h4">
-          <title>31.3.1.2 Interpreting MDT Statistics</title>
-          <note><para>See also <xref linkend="dbdoclet.50438219_84890"/> (llobdstat) and <xref linkend="dbdoclet.50438273_80593"/> (CollectL).</para></note>
-          <para>The MDT .../stats files can be used to track MDT statistics for the MDS. Here is sample output for an MDT stats file:</para>
-          <screen># cat /proc/fs/lustre/mds/*-MDT0000/stats 
+      <section remap="h4">
+        <title>31.3.1.2 Interpreting MDT Statistics</title>
+        <note>
+          <para>See also <xref linkend="dbdoclet.50438219_84890"/> (llobdstat) and <xref linkend="dbdoclet.50438273_80593"/> (CollectL).</para>
+        </note>
+        <para>The MDT .../stats files can be used to track MDT statistics for the MDS. Here is sample output for an MDT stats file:</para>
+        <screen># cat /proc/fs/lustre/mds/*-MDT0000/stats 
  snapshot_time                              1244832003.676892 secs.usecs 
  open                                       2 samples [reqs] 
  close                                      1 samples [reqs] 
@@ -1416,7 +1903,7 @@ getattr                                    3 samples [reqs]
  llog_init                          6 samples [reqs] 
  notify                                     16 samples [reqs]
  </screen>
-        </section>
        </section>
      </section>
+  </section>
  </chapter>
diff --git a/LustreRecovery.xml b/LustreRecovery.xml
index cbf8b5f..b18a41b 100644
--- a/LustreRecovery.xml
+++ b/LustreRecovery.xml
@@ -1,313 +1,311 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<chapter version="5.0" xml:lang="en-US" xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" xml:id='lustrerecovery'>
+<?xml version='1.0' encoding='UTF-8'?>
+<!-- This document was created with Syntext Serna Free. -->
+<chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="lustrerecovery">
    <info>
-    <title  xml:id='lustrerecovery.title'>Lustre Recovery</title>
+    <title xml:id="lustrerecovery.title">Lustre Recovery</title>
    </info>
    <para>This chapter describes how recovery is implemented in Lustre and includes the following sections:</para>
-  <itemizedlist><listitem>
+  <itemizedlist>
+    <listitem>
        <para><xref linkend="dbdoclet.50438268_58047"/></para>
      </listitem>
-
-<listitem>
+    <listitem>
        <para><xref linkend="dbdoclet.50438268_65824"/></para>
      </listitem>
-
-<listitem>
+    <listitem>
        <para><xref linkend="dbdoclet.50438268_23736"/></para>
      </listitem>
-
-<listitem>
+    <listitem>
        <para><xref linkend="dbdoclet.50438268_80068"/></para>
      </listitem>
-
-<listitem>
+    <listitem>
        <para><xref linkend="dbdoclet.50438268_83826"/></para>
      </listitem>
-
-</itemizedlist>
-
-    <section xml:id="dbdoclet.50438268_58047">
-      <title>30.1 Recovery Overview</title>
-      <para>Lustre&apos;s recovery feature is responsible for dealing with node or network failure and returning the cluster to a consistent, performant state. Because Lustre allows servers to perform asynchronous update operations to the on-disk file system (i.e., the server can reply without waiting for the update to synchronously commit to disk), the clients may have state in memory that is newer than what the server can recover from disk after a crash.</para>
-      <para>A handful of different types of failures can cause recovery to occur:</para>
-      <itemizedlist><listitem>
-          <para> Client (compute node) failure</para>
+  </itemizedlist>
+  <section xml:id="dbdoclet.50438268_58047">
+    <title>30.1 Recovery Overview</title>
+    <para>Lustre&apos;s recovery feature is responsible for dealing with node or network failure and returning the cluster to a consistent, performant state. Because Lustre allows servers to perform asynchronous update operations to the on-disk file system (i.e., the server can reply without waiting for the update to synchronously commit to disk), the clients may have state in memory that is newer than what the server can recover from disk after a crash.</para>
+    <para>A handful of different types of failures can cause recovery to occur:</para>
+    <itemizedlist>
+      <listitem>
+        <para> Client (compute node) failure</para>
+      </listitem>
+      <listitem>
+        <para> MDS failure (and failover)</para>
+      </listitem>
+      <listitem>
+        <para> OST failure (and failover)</para>
+      </listitem>
+      <listitem>
+        <para> Transient network partition</para>
+      </listitem>
+    </itemizedlist>
+    <para>Currently, all Lustre failure and recovery operations are based on the concept of connection failure; all imports or exports associated with a given connection are considered to fail if any of them fail.</para>
+    <para>For information on Lustre recovery, see <xref linkend="dbdoclet.50438268_65824"/>. For information on recovering from a corrupt file system, see <xref linkend="dbdoclet.50438268_83826"/>. For information on resolving orphaned objects, a common issue after recovery, see <xref linkend="dbdoclet.50438225_13916"/>.</para>
+    <section remap="h3">
+      <title>30.1.1 Client Failure</title>
+      <para>Recovery from client failure in Lustre is based on the revocation of locks and other resources, so surviving clients can continue their work uninterrupted. If a client fails to respond in a timely manner to a blocking lock callback from the Distributed Lock Manager (DLM) or fails to communicate with the server for a long period of time (i.e., no pings), the client is forcibly removed from the cluster (evicted). This enables other clients to acquire locks blocked by the dead client&apos;s locks, and also frees resources (file handles, export data) associated with that client. Note that this scenario can be caused by a network partition, as well as an actual client node system failure. <xref linkend="dbdoclet.50438268_96876"/> describes this case in more detail.</para>
+    </section>
+    <section xml:id="dbdoclet.50438268_43796">
+      <title>30.1.2 Client Eviction</title>
+      <para>If a client is not behaving properly from the server&apos;s point of view, it will be evicted. This ensures that the whole file system can continue to function in the presence of failed or misbehaving clients. An evicted client must invalidate all locks, which, in turn, results in all cached inodes becoming invalidated and all cached data being flushed.</para>
+      <para>Reasons why a client might be evicted:</para>
+      <itemizedlist>
+        <listitem>
+          <para> Failure to respond to a server request in a timely manner</para>
+          <itemizedlist>
+            <listitem>
+              <para> Blocking lock callback (i.e., client holds lock that another client/server wants)</para>
+            </listitem>
+            <listitem>
+              <para> Lock completion callback (i.e., client is granted lock previously held by another client)</para>
+            </listitem>
+            <listitem>
+              <para> Lock glimpse callback (i.e., client is asked for size of object by another client)</para>
+            </listitem>
+            <listitem>
+              <para> Server shutdown notification (with simplified interoperability)</para>
+            </listitem>
+          </itemizedlist>
          </listitem>
-
-<listitem>
-          <para> MDS failure (and failover)</para>
+        <listitem>
+          <para> Failure to ping the server in a timely manner, unless the server is receiving no RPC traffic at all (which may indicate a network partition).</para>
          </listitem>
-
-<listitem>
-          <para> OST failure (and failover)</para>
+      </itemizedlist>
+    </section>
+    <section remap="h3">
+      <title>30.1.3 MDS Failure (Failover)</title>
+      <para>Highly-available (HA) Lustre operation requires that the metadata server have a peer configured for failover, including the use of a shared storage device for the MDT backing file system. The actual mechanism for detecting peer failure, power off (STONITH) of the failed peer (to prevent it from continuing to modify the shared disk), and takeover of the Lustre MDS service on the backup node depends on external HA software such as Heartbeat. It is also possible to have MDS recovery with a single MDS node. In this case, recovery will take as long as is needed for the single MDS to be restarted.</para>
+      <para>When clients detect an MDS failure (either by timeouts of in-flight requests or idle-time ping messages), they connect to the new backup MDS and use the Metadata Replay protocol. Metadata Replay is responsible for ensuring that the backup MDS re-acquires state resulting from transactions whose effects were made visible to clients, but which were not committed to the disk.</para>
+      <para>The reconnection to a new (or restarted) MDS is managed by the file system configuration loaded by the client when the file system is first mounted. If a failover MDS has been configured (using the <literal>--failnode=</literal> option to <literal>mkfs.lustre</literal> or <literal>tunefs.lustre</literal>), the client tries to reconnect to both the primary and backup MDS until one of them responds that the failed MDT is again available. At that point, the client begins recovery. For more information, see <xref linkend="dbdoclet.50438268_65824"/>.</para>
+      <para>Transaction numbers are used to ensure that operations are replayed in the order they were originally performed, so that they are guaranteed to succeed and present the same file system state as before the failure. In addition, clients inform the new server of their existing lock state (including locks that have not yet been granted). All metadata and lock replay must complete before new, non-recovery operations are permitted. Furthermore, only clients that were connected at the time of MDS failure are permitted to reconnect during the recovery window, to avoid the introduction of state changes that might conflict with what is being replayed by previously-connected clients.</para>
+    </section>
+    <section remap="h3">
+      <title>30.1.4 OST Failure (Failover)</title>
+      <para>When an OST fails or has communication problems with the client, the default action is that the corresponding OSC enters recovery, and I/O requests going to that OST are blocked waiting for OST recovery or failover. It is possible to administratively mark the OSC as <emphasis>inactive</emphasis> on the client, in which case file operations that involve the failed OST will return an I/O error (<literal>-EIO</literal>). Otherwise, the application waits until the OST has recovered or the client process is interrupted (e.g., with <emphasis>CTRL-C</emphasis>).</para>
+      <para>The MDS (via the LOV) detects that an OST is unavailable and skips it when assigning objects to new files. When the OST is restarted or re-establishes communication with the MDS, the MDS and OST automatically perform orphan recovery to destroy any objects that belong to files that were deleted while the OST was unavailable. For more information, see <xref linkend="troubleshootingrecovery"/> (Working with Orphaned Objects).</para>
+      <para>While the OSC to OST operation recovery protocol is the same as that between the MDC and MDT using the Metadata Replay protocol, typically the OST commits bulk write operations to disk synchronously and each reply indicates that the request is already committed and the data does not need to be saved for recovery. In some cases, the OST replies to the client before the operation is committed to disk (e.g., truncate, destroy, setattr, and I/O operations in very new versions of Lustre), and normal replay and resend handling is done, including resending of the bulk writes. In this case, the client keeps a copy of the data available in memory until the server indicates that the write has committed to disk.</para>
+      <para>To force an OST recovery, unmount the OST and then mount it again. If the OST was connected to clients before it failed, then a recovery process starts after the remount, enabling clients to reconnect to the OST and replay transactions in their queue. When the OST is in recovery mode, all new client connections are refused until the recovery finishes. The recovery is complete when either all previously-connected clients reconnect and their transactions are replayed or a client connection attempt times out. If a connection attempt times out, then all clients waiting to reconnect (and their transactions) are lost.</para>
+      <note>
+        <para>If you know an OST will not recover a previously-connected client (if, for example, the client has crashed), you can manually abort the recovery using this command:</para>
+        <para><screen>lctl --device &lt;OST device number&gt; abort_recovery</screen></para>
+        <para>To determine an OST&apos;s device number and device name, run the <literal>lctl dl</literal> command. Sample <literal>lctl dl</literal> command output is shown below:</para>
+        <screen>7 UP obdfilter ddn_data-OST0009 ddn_data-OST0009_UUID 1159 </screen>
+        <para>In this example, 7 is the OST device number. The device name is <literal>ddn_data-OST0009</literal>. In most instances, the device name can be used in place of the device number.</para>
+      </note>
+    </section>
+    <section xml:id="dbdoclet.50438268_96876">
+      <title>30.1.5 Network Partition</title>
+      <para>Network failures may be transient. To avoid invoking recovery, the client initially tries to resend any timed-out request to the server. If the resend also fails, the client tries to re-establish a connection to the server. Clients can detect a harmless partition upon reconnect if the server has not had any reason to evict the client.</para>
+      <para>If a request was processed by the server, but the reply was dropped (i.e., did not arrive back at the client), the server must reconstruct the reply when the client resends the request, rather than performing the same request twice.</para>
+    </section>
+    <section remap="h3">
+      <title>30.1.6 Failed Recovery</title>
+      <para>In the case of failed recovery, a client is evicted by the server and must reconnect after having flushed its saved state related to that server, as described in <link linkend="dbdoclet.50438268_43796">Client Eviction</link>, above. Failed recovery might occur for a number of reasons, including:</para>
+      <itemizedlist>
+        <listitem>
+          <para> Failure of recovery</para>
+          <itemizedlist>
+            <listitem>
+              <para> Recovery fails if the operations of one client directly depend on the operations of another client that failed to participate in recovery. Otherwise, Version Based Recovery (VBR) allows recovery to proceed for all of the connected clients, and only missing clients are evicted.</para>
+            </listitem>
+            <listitem>
+              <para> Manual abort of recovery</para>
+            </listitem>
+          </itemizedlist>
          </listitem>
-
-<listitem>
-          <para> Transient network partition</para>
+        <listitem>
+          <para> Manual eviction by the administrator</para>
          </listitem>
-
-</itemizedlist>
-      <para>Currently, all Lustre failure and recovery operations are based on the concept of connection failure; all imports or exports associated with a given connection are considered to fail if any of them fail.</para>
-      <para>For information on Lustre recovery, see <xref linkend="dbdoclet.50438268_65824"/>. For information on recovering from a corrupt file system, see <xref linkend="dbdoclet.50438268_83826"/>. For information on resolving orphaned objects, a common issue after recovery, see <xref linkend='troubleshootingrecovery'/> (Working with Orphaned Objects).</para>
-      <section remap="h3">
-        <title>30.1.1 <anchor xml:id="dbdoclet.50438268_96839" xreflabel=""/>Client <anchor xml:id="dbdoclet.50438268_marker-1287394" xreflabel=""/>Failure</title>
-        <para>Recovery from client failure in Lustre is based on lock revocation and other resources, so surviving clients can continue their work uninterrupted. If a client fails to timely respond to a blocking lock callback from the Distributed Lock Manager (DLM) or fails to communicate with the server in a long period of time (i.e., no pings), the client is forcibly removed from the cluster (evicted). This enables other clients to acquire locks blocked by the dead client&apos;s locks, and also frees resources (file handles, export data) associated with that client. Note that this scenario can be caused by a network partition, as well as an actual client node system failure. <xref linkend="dbdoclet.50438268_96876"/> describes this case in more detail.</para>
-      </section>
-      <section remap="h3">
-        <title>30.1.2 <anchor xml:id="dbdoclet.50438268_43796" xreflabel=""/>Client <anchor xml:id="dbdoclet.50438268_marker-1292164" xreflabel=""/>Eviction</title>
-        <para>If a client is not behaving properly from the server&apos;s point of view, it will be evicted. This ensures that the whole file system can continue to function in the presence of failed or misbehaving clients. An evicted client must invalidate all locks, which in turn, results in all cached inodes becoming invalidated and all cached data being flushed.</para>
-        <para>Reasons why a client might be evicted:</para>
-        <itemizedlist><listitem>
-            <para> Failure to respond to a server request in a timely manner</para>
-            <itemizedlist><listitem>
-                <para> Blocking lock callback (i.e., client holds lock that another client/server wants)</para>
-              </listitem>
-
-<listitem>
-                <para> Lock completion callback (i.e., client is granted lock previously held by another client)</para>
-              </listitem>
-
-<listitem>
-                <para> Lock glimpse callback (i.e., client is asked for size of object by another client)</para>
-              </listitem>
-
-<listitem>
-                <para> Server shutdown notification (with simplified interoperability)</para>
-              </listitem>
-
-</itemizedlist>
-          </listitem>
-<listitem>
-            <para> Failure to ping the server in a timely manner, unless the server is receiving no RPC traffic at all (which may indicate a network partition).</para>
-          </listitem>
-
-</itemizedlist>
-      </section>
-      <section remap="h3">
-        <title>30.1.3 <anchor xml:id="dbdoclet.50438268_37508" xreflabel=""/>MDS Failure <anchor xml:id="dbdoclet.50438268_marker-1287397" xreflabel=""/>(Failover)</title>
-        <para>Highly-available (HA) Lustre operation requires that the metadata server have a peer configured for failover, including the use of a shared storage device for the MDT backing file system. The actual mechanism for detecting peer failure, power off (STONITH) of the failed peer (to prevent it from continuing to modify the shared disk), and takeover of the Lustre MDS service on the backup node depends on external HA software such as Heartbeat. It is also possible to have MDS recovery with a single MDS node. In this case, recovery will take as long as is needed for the single MDS to be restarted.</para>
-        <para>When clients detect an MDS failure (either by timeouts of in-flight requests or idle-time ping messages), they connect to the new backup MDS and use the Metadata Replay protocol. Metadata Replay is responsible for ensuring that the backup MDS re-acquires state resulting from transactions whose effects were made visible to clients, but which were not committed to the disk.</para>
-        <para>The reconnection to a new (or restarted) MDS is managed by the file system configuration loaded by the client when the file system is first mounted. If a failover MDS has been configured (using the --failnode= option to mkfs.lustre or tunefs.lustre), the client tries to reconnect to both the primary and backup MDS until one of them responds that the failed MDT is again available. At that point, the client begins recovery. For more information, see <xref linkend="dbdoclet.50438268_65824"/>.</para>
-        <para>Transaction numbers are used to ensure that operations are replayed in the order they were originally performed, so that they are guaranteed to succeed and present the same filesystem state as before the failure. In addition, clients inform the new server of their existing lock state (including locks that have not yet been granted). All metadata and lock replay must complete before new, non-recovery operations are permitted. In addition, only clients that were connected at the time of MDS failure are permitted to reconnect during the recovery window, to avoid the introduction of state changes that might conflict with what is being replayed by previously-connected clients.</para>
-      </section>
-      <section remap="h3">
-        <title>30.1.4 <anchor xml:id="dbdoclet.50438268_28881" xreflabel=""/>OST <anchor xml:id="dbdoclet.50438268_marker-1289240" xreflabel=""/>Failure (Failover)</title>
-        <para>When an OST fails or has communication problems with the client, the default action is that the corresponding OSC enters recovery, and I/O requests going to that OST are blocked waiting for OST recovery or failover. It is possible to administratively mark the OSC as <emphasis>inactive</emphasis> on the client, in which case file operations that involve the failed OST will return an IO error (-EIO). Otherwise, the application waits until the OST has recovered or the client process is interrupted (e.g. ,with <emphasis>CTRL-C</emphasis>).</para>
-        <para>The MDS (via the LOV) detects that an OST is unavailable and skips it when assigning objects to new files. When the OST is restarted or re-establishes communication with the MDS, the MDS and OST automatically perform orphan recovery to destroy any objects that belong to files that were deleted while the OST was unavailable. For more information, see <xref linkend='troubleshootingrecovery'/> (Working with Orphaned Objects).</para>
-        <para>While the OSC to OST operation recovery protocol is the same as that between the MDC and MDT using the Metadata Replay protocol, typically the OST commits bulk write operations to disk synchronously and each reply indicates that the request is already committed and the data does not need to be saved for recovery. In some cases, the OST replies to the client before the operation is committed to disk (e.g. truncate, destroy, setattr, and I/O operations in very new versions of Lustre), and normal replay and resend handling is done, including resending of the bulk writes. In this case, the client keeps a copy of the data available in memory until the server indicates that the write has committed to disk.</para>
-        <para>To force an OST recovery, unmount the OST and then mount it again. If the OST was connected to clients before it failed, then a recovery process starts after the remount, enabling clients to reconnect to the OST and replay transactions in their queue. When the OST is in recovery mode, all new client connections are refused until the recovery finishes. The recovery is complete when either all previously-connected clients reconnect and their transactions are replayed or a client connection attempt times out. If a connection attempt times out, then all clients waiting to reconnect (and their transactions) are lost.</para>
-                <note><para>If you know an OST will not recover a previously-connected client (if, for example, the client has crashed), you can manually abort the recovery using this command:</para><para>lctl --device &lt;OST device number&gt; abort_recovery To determine an OST's device number and device name, run the lctl dl command. Sample lctl dl command output is shown below:</para><para>7 UP obdfilter ddn_data-OST0009 ddn_data-OST0009_UUID 1159 In this example, 7 is the OST device number. The device name is ddn_data-OST0009. In most instances, the device name can be used in place of the device number.</para></note>
-      </section>
-      <section remap="h3">
-        <title>30.1.5 <anchor xml:id="dbdoclet.50438268_96876" xreflabel=""/>Network <anchor xml:id="dbdoclet.50438268_marker-1289388" xreflabel=""/>Partition</title>
-        <para>Network failures may be transient. To avoid invoking recovery, the client tries, initially, to re-send any timed out request to the server. If the resend also fails, the client tries to re-establish a connection to the server. Clients can detect harmless partition upon reconnect if the server has not had any reason to evict the client.</para>
-        <para>If a request was processed by the server, but the reply was dropped (i.e., did not arrive back at the client), the server must reconstruct the reply when the client resends the request, rather than performing the same request twice.</para>
-      </section>
-      <section remap="h3">
-        <title>30.1.6 Failed Recovery</title>
-        <para>In the case of failed recovery, a client is evicted by the server and must reconnect after having flushed its saved state related to that server, as described in <link xl:href="LustreRecovery.html#50438268_43796">Client Eviction</link>, above. Failed recovery might occur for a number of reasons, including:</para>
-        <itemizedlist><listitem>
-            <para> Failure of recovery</para>
-            <itemizedlist><listitem>
-                <para> Recovery fails if the operations of one client directly depend on the operations of another client that failed to participate in recovery. Otherwise, Version Based Recovery (VBR) allows recovery to proceed for all of the connected clients, and only missing clients are evicted.</para>
-              </listitem>
-
-<listitem>
-                <para> Manual abort of recovery</para>
-              </listitem>
-
-</itemizedlist>
-          </listitem>
-<listitem>
-            <para> Manual eviction by the administrator</para>
-          </listitem>
-
-</itemizedlist>
-      </section>
+      </itemizedlist>
      </section>
-    <section xml:id="dbdoclet.50438268_65824">
-      <title>30.2 Metadata <anchor xml:id="dbdoclet.50438268_marker-1292175" xreflabel=""/>Replay</title>
-      <para>Highly available Lustre operation requires that the MDS have a peer configured for failover, including the use of a shared storage device for the MDS backing file system. When a client detects an MDS failure, it connects to the new MDS and uses the metadata replay protocol to replay its requests.</para>
-      <para>Metadata replay ensures that the failover MDS re-accumulates state resulting from transactions whose effects were made visible to clients, but which were not committed to the disk.</para>
-      <section remap="h3">
-        <title>30.2.1 XID Numbers</title>
-        <para>Each request sent by the client contains an XID number, which is a client-unique, monotonically increasing 64-bit integer. The initial value of the XID is chosen so that it is highly unlikely that the same client node reconnecting to the same server after a reboot would have the same XID sequence. The XID is used by the client to order all of the requests that it sends, until such a time that the request is assigned a transaction number. The XID is also used in Reply Reconstruction to uniquely identify per-client requests at the server.</para>
-      </section>
-      <section remap="h3">
-        <title>30.2.2 Transaction Numbers</title>
-        <para>Each client request processed by the server that involves any state change (metadata update, file open, write, etc., depending on server type) is assigned a transaction number by the server that is a target-unique, monontonically increasing, server-wide 64-bit integer. The transaction number for each file system-modifying request is sent back to the client along with the reply to that client request. The transaction numbers allow the client and server to unambiguously order every modification to the file system in case recovery is needed.</para>
-        <para>Each reply sent to a client (regardless of request type) also contains the last committed transaction number that indicates the highest transaction number committed to the file system. The ldiskfs backing file system that Lustre uses enforces the requirement that any earlier disk operation will always be committed to disk before a later disk operation, so the last committed transaction number also reports that any requests with a lower transaction number have been committed to disk.</para>
-      </section>
-      <section remap="h3">
-        <title>30.2.3 Replay and Resend</title>
-        <para>Lustre recovery can be separated into two distinct types of operations: <emphasis>replay</emphasis> and <emphasis>resend</emphasis>.</para>
-        <para><emphasis>Replay</emphasis> operations are those for which the client received a reply from the server that the operation had been successfully completed. These operations need to be redone in exactly the same manner after a server restart as had been reported before the server failed. Replay can only happen if the server failed; otherwise it will not have lost any state in memory.</para>
-        <para><emphasis>Resend</emphasis> operations are those for which the client never received a reply, so their final state is unknown to the client. The client sends unanswered requests to the server again in XID order, and again awaits a reply for each one. In some cases, resent requests have been handled and committed to disk by the server (possibly also having dependent operations committed), in which case, the server performs reply reconstruction for the lost reply. In other cases, the server did not receive the lost request at all and processing proceeds as with any normal request. These are what happen in the case of a network interruption. It is also possible that the server received the request, but was unable to reply or commit it to disk before failure.</para>
-      </section>
-      <section remap="h3">
-        <title>30.2.4 Client Replay List</title>
-        <para>All file system-modifying requests have the potential to be required for server state recovery (replay) in case of a server failure. Replies that have an assigned transaction number that is higher than the last committed transaction number received in any reply from each server are preserved for later replay in a per-server replay list. As each reply is received from the server, it is checked to see if it has a higher last committed transaction number than the previous highest last committed number. Most requests that now have a lower transaction number can safely be removed from the replay list. One exception to this rule is for open requests, which need to be saved for replay until the file is closed so that the MDS can properly reference count open-unlinked files.</para>
-      </section>
-      <section remap="h3">
-        <title>30.2.5 Server Recovery</title>
-        <para>A server enters recovery if it was not shut down cleanly. If, upon startup, if any client entries are in the last_rcvd file for any previously connected clients, the server enters recovery mode and waits for these previously-connected clients to reconnect and begin replaying or resending their requests. This allows the server to recreate state that was exposed to clients (a request that completed successfully) but was not committed to disk before failure.</para>
-        <para>In the absence of any client connection attempts, the server waits indefinitely for the clients to reconnect. This is intended to handle the case where the server has a network problem and clients are unable to reconnect and/or if the server needs to be restarted repeatedly to resolve some problem with hardware or software. Once the server detects client connection attempts - either new clients or previously-connected clients - a recovery timer starts and forces recovery to finish in a finite time regardless of whether the previously-connected clients are available or not.</para>
-        <para>If no client entries are present in the last_rcvd file, or if the administrator manually aborts recovery, the server does not wait for client reconnection and proceeds to allow all clients to connect.</para>
-        <para>As clients connect, the server gathers information from each one to determine how long the recovery needs to take. Each client reports its connection UUID, and the server does a lookup for this UUID in the last_rcvd file to determine if this client was previously connected. If not, the client is refused connection and it will retry until recovery is completed. Each client reports its last seen transaction, so the server knows when all transactions have been replayed. The client also reports the amount of time that it was previously waiting for request completion so that the server can estimate how long some clients might need to detect the server failure and reconnect.</para>
-        <para>If the client times out during replay, it attempts to reconnect. If the client is unable to reconnect, REPLAY fails and it returns to DISCON state. It is possible that clients will timeout frequently during REPLAY, so reconnection should not delay an already slow process more than necessary. We can mitigate this by increasing the timeout during replay.</para>
-      </section>
-      <section remap="h3">
-        <title>30.2.6 Request Replay</title>
-        <para>If a client was previously connected, it gets a response from the server telling it that the server is in recovery and what the last committed transaction number on disk is. The client can then iterate through its replay list and use this last committed transaction number to prune any previously-committed requests. It replays any newer requests to the server in transaction number order, one at a time, waiting for a reply from the server before replaying the next request.</para>
-        <para>Open requests that are on the replay list may have a transaction number lower than the server&apos;s last committed transaction number. The server processes those open requests immediately. The server then processes replayed requests from all of the clients in transaction number order, starting at the last committed transaction number to ensure that the state is updated on disk in exactly the same manner as it was before the crash. As each replayed request is processed, the last committed transaction is incremented. If the server receives a replay request from a client that is higher than the current last committed transaction, that request is put aside until other clients provide the intervening transactions. In this manner, the server replays requests in the same sequence as they were previously executed on the server until either all clients are out of requests to replay or there is a gap in a sequence.</para>
-      </section>
-      <section remap="h3">
-        <title>30.2.7 Gaps in the Replay Sequence</title>
-        <para>In some cases, a gap may occur in the reply sequence. This might be caused by lost replies, where the request was processed and committed to disk but the reply was not received by the client. It can also be caused by clients missing from recovery due to partial network failure or client death.</para>
-        <para>In the case where all clients have reconnected, but there is a gap in the replay sequence the only possibility is that some requests were processed by the server but the reply was lost. Since the client must still have these requests in its resend list, they are processed after recovery is finished.</para>
-        <para>In the case where all clients have not reconnected, it is likely that the failed clients had requests that will no longer be replayed. The VBR feature is used to determine if a request following a transaction gap is safe to be replayed. Each item in the file system (MDS inode or OST object) stores on disk the number of the last transaction in which it was modified. Each reply from the server contains the previous version number of the objects that it affects. During VBR replay, the server matches the previous version numbers in the resend request against the current version number. If the versions match, the request is the next one that affects the object and can be safely replayed. For more information, see <link xl:href="LustreRecovery.html#50438268_80068">Version-based Recovery</link>.</para>
-      </section>
-      <section remap="h3">
-        <title>30.2.8 Lock Recovery</title>
-        <para>If all requests were replayed successfully and all clients reconnected, clients then do lock replay locks -- that is, every client sends information about every lock it holds from this server and its state (whenever it was granted or not, what mode, what properties and so on), and then recovery completes successfully. Currently, Lustre does not do lock verification and just trusts clients to present an accurate lock state. This does not impart any security concerns since Lustre 1.x clients are trusted for other information (e.g. user ID) during normal operation also.</para>
-        <para>After all of the saved requests and locks have been replayed, the client sends an MDS_GETSTATUS request with last-replay flag set. The reply to that request is held back until all clients have completed replay (sent the same flagged getstatus request), so that clients don&apos;t send non-recovery requests before recovery is complete.</para>
-      </section>
-      <section remap="h3">
-        <title>30.2.9 Request Resend</title>
-        <para>Once all of the previously-shared state has been recovered on the server (the target file system is up-to-date with client cache and the server has recreated locks representing the locks held by the client), the client can resend any requests that did not receive an earlier reply. This processing is done like normal request processing, and, in some cases, the server may do reply reconstruction.</para>
-      </section>
+  </section>
+  <section xml:id="dbdoclet.50438268_65824">
+    <title>30.2 Metadata Replay</title>
+    <para>Highly available Lustre operation requires that the MDS have a peer configured for failover, including the use of a shared storage device for the MDS backing file system. When a client detects an MDS failure, it connects to the new MDS and uses the metadata replay protocol to replay its requests.</para>
+    <para>Metadata replay ensures that the failover MDS re-accumulates state resulting from transactions whose effects were made visible to clients, but which were not committed to the disk.</para>
+    <section remap="h3">
+      <title>30.2.1 XID Numbers</title>
+      <para>Each request sent by the client contains an XID number, which is a client-unique, monotonically increasing 64-bit integer. The initial value of the XID is chosen so that it is highly unlikely that the same client node reconnecting to the same server after a reboot would have the same XID sequence. The XID is used by the client to order all of the requests that it sends, until such time as the request is assigned a transaction number. The XID is also used in Reply Reconstruction to uniquely identify per-client requests at the server.</para>
      </section>
-    <section xml:id="dbdoclet.50438268_23736">
-      <title>30.3 Reply <anchor xml:id="dbdoclet.50438268_marker-1292176" xreflabel=""/>Reconstruction</title>
-      <para>When a reply is dropped, the MDS needs to be able to reconstruct the reply when the original request is re-sent. This must be done without repeating any non-idempotent operations, while preserving the integrity of the locking system. In the event of MDS failover, the information used to reconstruct the reply must be serialized on the disk in transactions that are joined or nested with those operating on the disk.</para>
-      <section remap="h3">
-        <title>30.3.1 Required State</title>
-        <para>For the majority of requests, it is sufficient for the server to store three pieces of data in the last_rcvd file:</para>
-        <itemizedlist><listitem>
-            <para> XID of the request</para>
-          </listitem>
-
-<listitem>
-            <para> Resulting transno (if any)</para>
-          </listitem>
-
-<listitem>
-            <para> Result code (req-&gt;rq_status)</para>
-          </listitem>
-
-</itemizedlist>
-        <para>For open requests, the &quot;disposition&quot; of the open must also be stored.</para>
-      </section>
-      <section remap="h3">
-        <title>30.3.2 Reconstruction of Open Replies</title>
-        <para>An open reply consists of up to three pieces of information (in addition to the contents of the &quot;request log&quot;):</para>
-        <itemizedlist><listitem>
-            <para> File handle</para>
-          </listitem>
-
-<listitem>
-            <para> Lock handle</para>
-          </listitem>
-
-<listitem>
-            <para> mds_body with information about the file created (for O_CREAT)</para>
-          </listitem>
-
-</itemizedlist>
-        <para>The disposition, status and request data (re-sent intact by the client) are sufficient to determine which type of lock handle was granted, whether an open file handle was created, and which resource should be described in the mds_body.</para>
-        <section remap="h5">
-          <title>Finding the File Handle</title>
-          <para>The file handle can be found in the XID of the request and the list of per-export open file handles. The file handle contains the resource/FID.</para>
-        </section>
-        <section remap="h5">
-          <title>Finding the Resource/fid</title>
-          <para>The file handle contains the resource/fid.</para>
-        </section>
-        <section remap="h5">
-          <title>Finding the Lock Handle</title>
-          <para>The lock handle can be found by walking the list of granted locks for the resource looking for one with the appropriate remote file handle (present in the re-sent request). Verify that the lock has the right mode (determined by performing the disposition/request/status analysis above) and is granted to the proper client.</para>
-        </section>
-      </section>
+    <section remap="h3">
+      <title>30.2.2 Transaction Numbers</title>
+      <para>Each client request processed by the server that involves any state change (metadata update, file open, write, etc., depending on server type) is assigned a transaction number by the server that is a target-unique, monotonically increasing, server-wide 64-bit integer. The transaction number for each file system-modifying request is sent back to the client along with the reply to that client request. The transaction numbers allow the client and server to unambiguously order every modification to the file system in case recovery is needed.</para>
+      <para>Each reply sent to a client (regardless of request type) also contains the last committed transaction number that indicates the highest transaction number committed to the file system. The <literal>ldiskfs</literal> backing file system that Lustre uses enforces the requirement that any earlier disk operation will always be committed to disk before a later disk operation, so the last committed transaction number also reports that any requests with a lower transaction number have been committed to disk.</para>
+    </section>
+    <section remap="h3">
+      <title>30.2.3 Replay and Resend</title>
+      <para>Lustre recovery can be separated into two distinct types of operations: <emphasis>replay</emphasis> and <emphasis>resend</emphasis>.</para>
+      <para><emphasis>Replay</emphasis> operations are those for which the client received a reply from the server that the operation had been successfully completed. These operations need to be redone in exactly the same manner after a server restart as had been reported before the server failed. Replay can only happen if the server failed; otherwise it will not have lost any state in memory.</para>
+      <para><emphasis>Resend</emphasis> operations are those for which the client never received a reply, so their final state is unknown to the client. The client sends unanswered requests to the server again in XID order, and again awaits a reply for each one. In some cases, resent requests have been handled and committed to disk by the server (possibly also having dependent operations committed), in which case, the server performs reply reconstruction for the lost reply. In other cases, the server did not receive the lost request at all and processing proceeds as with any normal request. Both cases can arise from a network interruption. It is also possible that the server received the request, but was unable to reply or commit it to disk before failure.</para>
+    </section>
+    <section remap="h3">
+      <title>30.2.4 Client Replay List</title>
+      <para>All file system-modifying requests have the potential to be required for server state recovery (replay) in case of a server failure. Replies that have an assigned transaction number that is higher than the last committed transaction number received in any reply from each server are preserved for later replay in a per-server replay list. As each reply is received from the server, it is checked to see if it has a higher last committed transaction number than the previous highest last committed number. Most requests that now have a lower transaction number can safely be removed from the replay list. One exception to this rule is for open requests, which need to be saved for replay until the file is closed so that the MDS can properly reference count open-unlinked files.</para>
+    </section>
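The pruning rule for the client replay list can be sketched as below. This is an illustrative Python model only, not Lustre code: the `Request` and `ReplayList` classes, their method names, and the `file_open` flag are assumptions introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    xid: int
    transno: int             # transaction number assigned by the server
    file_open: bool = False  # True for an open whose file is not yet closed


@dataclass
class ReplayList:
    """Hypothetical per-server replay list kept by one client."""
    last_committed: int = 0
    requests: dict = field(default_factory=dict)  # transno -> Request

    def on_reply(self, req: Request, last_committed: int) -> None:
        # Every reply carries the server's last committed transaction number.
        self.last_committed = max(self.last_committed, last_committed)
        # Keep the request if it is not yet committed, or if it is an open
        # that must be retained until the file is closed.
        if req.transno > self.last_committed or req.file_open:
            self.requests[req.transno] = req
        self.prune()

    def prune(self) -> None:
        # Drop requests the server has committed, except retained opens.
        self.requests = {
            t: r for t, r in self.requests.items()
            if t > self.last_committed or r.file_open
        }

    def on_close(self, transno: int) -> None:
        # After close, a committed open no longer needs to be replayed.
        req = self.requests.get(transno)
        if req is not None and req.file_open:
            req.file_open = False
            self.prune()
```

In this model an open request with a committed transaction number stays on the list until `on_close` is called, which mirrors the exception for open requests described above.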
+    <section remap="h3">
+      <title>30.2.5 Server Recovery</title>
+      <para>A server enters recovery if it was not shut down cleanly. If, upon startup, any client entries are in the <literal>last_rcvd</literal> file for any previously connected clients, the server enters recovery mode and waits for these previously-connected clients to reconnect and begin replaying or resending their requests. This allows the server to recreate state that was exposed to clients (a request that completed successfully) but was not committed to disk before failure.</para>
+      <para>In the absence of any client connection attempts, the server waits indefinitely for the clients to reconnect. This is intended to handle the case where the server has a network problem and clients are unable to reconnect and/or if the server needs to be restarted repeatedly to resolve some problem with hardware or software. Once the server detects client connection attempts - either new clients or previously-connected clients - a recovery timer starts and forces recovery to finish in a finite time regardless of whether the previously-connected clients are available or not.</para>
+      <para>If no client entries are present in the <literal>last_rcvd</literal> file, or if the administrator manually aborts recovery, the server does not wait for client reconnection and proceeds to allow all clients to connect.</para>
+      <para>As clients connect, the server gathers information from each one to determine how long the recovery needs to take. Each client reports its connection UUID, and the server does a lookup for this UUID in the <literal>last_rcvd</literal> file to determine if this client was previously connected. If not, the client is refused connection and it will retry until recovery is completed. Each client reports its last seen transaction, so the server knows when all transactions have been replayed. The client also reports the amount of time that it was previously waiting for request completion so that the server can estimate how long some clients might need to detect the server failure and reconnect.</para>
+      <para>If the client times out during replay, it attempts to reconnect. If the client is unable to reconnect, <literal>REPLAY</literal> fails and it returns to <literal>DISCON</literal> state. It is possible that clients will time out frequently during <literal>REPLAY</literal>, so reconnection should not delay an already slow process more than necessary. We can mitigate this by increasing the timeout during replay.</para>
+    </section>
+    <section remap="h3">
+      <title>30.2.6 Request Replay</title>
+      <para>If a client was previously connected, it gets a response from the server telling it that the server is in recovery and what the last committed transaction number on disk is. The client can then iterate through its replay list and use this last committed transaction number to prune any previously-committed requests. It replays any newer requests to the server in transaction number order, one at a time, waiting for a reply from the server before replaying the next request.</para>
+      <para>Open requests that are on the replay list may have a transaction number lower than the server&apos;s last committed transaction number. The server processes those open requests immediately. The server then processes replayed requests from all of the clients in transaction number order, starting at the last committed transaction number to ensure that the state is updated on disk in exactly the same manner as it was before the crash. As each replayed request is processed, the last committed transaction is incremented. If the server receives a replay request from a client that is higher than the current last committed transaction, that request is put aside until other clients provide the intervening transactions. In this manner, the server replays requests in the same sequence as they were previously executed on the server until either all clients are out of requests to replay or there is a gap in a sequence.</para>
+    </section>
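The server-side ordering described in this section can be sketched as a small Python model. It is a simplification under stated assumptions, not the Lustre implementation: the `ReplayServer` class, its attribute names, and the use of a heap for held requests are all hypothetical.

```python
import heapq


class ReplayServer:
    """Hypothetical model of in-order replay processing on a server."""

    def __init__(self, last_committed: int) -> None:
        self.last_committed = last_committed
        self.pending = []    # min-heap of (transno, client) put aside
        self.processed = []  # replayed requests, in the order applied
        self.immediate = []  # requests at or below last_committed

    def receive(self, client: str, transno: int) -> None:
        if self.last_committed >= transno:
            # Already committed on disk. In the text only open requests
            # arrive this way, and they are processed immediately.
            self.immediate.append((transno, client))
            return
        heapq.heappush(self.pending, (transno, client))
        # Apply every request that is next in sequence. Anything beyond
        # a gap stays on the heap until the gap is filled.
        while self.pending and self.pending[0][0] == self.last_committed + 1:
            next_transno, next_client = heapq.heappop(self.pending)
            self.processed.append((next_transno, next_client))
            self.last_committed = next_transno

    def gap(self):
        """Return the first missing transno if requests are still held."""
        return self.last_committed + 1 if self.pending else None
```

A request that arrives ahead of its turn is held until another client supplies the intervening transaction; if none does, `gap()` reports the first missing transaction number.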
+    <section remap="h3">
+      <title>30.2.7 Gaps in the Replay Sequence</title>
+      <para>In some cases, a gap may occur in the replay sequence. This might be caused by lost replies, where the request was processed and committed to disk but the reply was not received by the client. It can also be caused by clients missing from recovery due to partial network failure or client death.</para>
+      <para>In the case where all clients have reconnected, but there is a gap in the replay sequence, the only possibility is that some requests were processed by the server but the reply was lost. Since the client must still have these requests in its resend list, they are processed after recovery is finished.</para>
+      <para>In the case where all clients have not reconnected, it is likely that the failed clients had requests that will no longer be replayed. The VBR feature is used to determine if a request following a transaction gap is safe to be replayed. Each item in the file system (MDS inode or OST object) stores on disk the number of the last transaction in which it was modified. Each reply from the server contains the previous version number of the objects that it affects. During VBR replay, the server matches the previous version numbers in the resend request against the current version number. If the versions match, the request is the next one that affects the object and can be safely replayed. For more information, see <xref linkend="dbdoclet.50438268_80068">Version-based Recovery</xref>.</para>
+    </section>
+    <section remap="h3">
+      <title>30.2.8 Lock Recovery</title>
+      <para>If all requests were replayed successfully and all clients reconnected, clients then perform lock replay. That is, every client sends information about every lock it holds from this server and its state (whether it was granted or not, what mode, what properties and so on), and then recovery completes successfully. Currently, Lustre does not do lock verification and just trusts clients to present an accurate lock state. This does not raise any additional security concerns since Lustre 1.x clients are trusted for other information (e.g. user ID) during normal operation also.</para>
+      <para>After all of the saved requests and locks have been replayed, the client sends an <literal>MDS_GETSTATUS</literal> request with last-replay flag set. The reply to that request is held back until all clients have completed replay (sent the same flagged getstatus request), so that clients don&apos;t send non-recovery requests before recovery is complete.</para>
+    </section>
+    <section remap="h3">
+      <title>30.2.9 Request Resend</title>
+      <para>Once all of the previously-shared state has been recovered on the server (the target file system is up-to-date with client cache and the server has recreated locks representing the locks held by the client), the client can resend any requests that did not receive an earlier reply. This processing is done like normal request processing, and, in some cases, the server may do reply reconstruction.</para>
      </section>
-    <section xml:id="dbdoclet.50438268_80068">
-      <title>30.4 Version-based <anchor xml:id="dbdoclet.50438268_marker-1288580" xreflabel=""/>Recovery</title>
-      <para>The Version-based Recovery (VBR) feature improves Lustre reliability in cases where client requests (RPCs) fail to replay during recovery
-          <footnote><para>There are two scenarios under which client RPCs are not replayed:   (1) Non-functioning or isolated clients do not reconnect, and they cannot replay their RPCs, causing a gap in the replay sequence. These clients get errors and are evicted.   (2) Functioning clients connect, but they cannot replay some or all of their RPCs that occurred after the gap caused by the non-functioning/isolated clients. These clients get errors (caused by the failed clients). With VBR, these requests have a better chance to replay because the &quot;gaps&quot; are only related to specific files that the missing client(s) changed.</para></footnote>.
-          .</para>
-      <para>In pre-VBR versions of Lustre, if the MGS or an OST went down and then recovered, a recovery process was triggered in which clients attempted to replay their requests. Clients were only allowed to replay RPCs in serial order. If a particular client could not replay its requests, then those requests were lost as well as the requests of clients later in the sequence. The &apos;&apos;downstream&apos;&apos; clients never got to replay their requests because of the wait on the earlier client'™s RPCs. Eventually, the recovery period would time out (so the component could accept new requests), leaving some number of clients evicted and their requests and data lost.</para>
-      <para>With VBR, the recovery mechanism does not result in the loss of clients or their data, because changes in inode versions are tracked, and more clients are able to reintegrate into the cluster. With VBR, inode tracking looks like this:</para>
-      <itemizedlist><listitem>
-          <para> Each inode<footnote><para>Usually, there are two inodes, a parent and a child.</para></footnote> stores a version, that is, the number of the last transaction (transno) in which the inode was changed.</para>
+  </section>
+  <section xml:id="dbdoclet.50438268_23736">
+    <title>30.3 Reply Reconstruction</title>
+    <para>When a reply is dropped, the MDS needs to be able to reconstruct the reply when the original request is re-sent. This must be done without repeating any non-idempotent operations, while preserving the integrity of the locking system. In the event of MDS failover, the information used to reconstruct the reply must be serialized on the disk in transactions that are joined or nested with those operating on the disk.</para>
+    <section remap="h3">
+      <title>30.3.1 Required State</title>
+      <para>For the majority of requests, it is sufficient for the server to store three pieces of data in the <literal>last_rcvd</literal> file:</para>
+      <itemizedlist>
+        <listitem>
+          <para> XID of the request</para>
          </listitem>
-
-<listitem>
-          <para> When an inode is about to be changed, a pre-operation version of the inode is saved in the client'™s data.</para>
+        <listitem>
+          <para> Resulting transno (if any)</para>
          </listitem>
-
-<listitem>
-          <para> The client keeps the pre-operation inode version and the post-operation version (transaction number) for replay, and sends them in the event of a server failure.</para>
+        <listitem>
+          <para> Result code (<literal>req-&gt;rq_status</literal>)</para>
          </listitem>
-
-<listitem>
-          <para> If the pre-operation version matches, then the request is replayed. The post-operation version is assigned on all inodes modified in the request.</para>
+      </itemizedlist>
+      <para>For open requests, the &quot;disposition&quot; of the open must also be stored.</para>
+    </section>
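The role of the three stored values can be illustrated with a short Python sketch. This is a hypothetical model, not Lustre code: the `Target` class and its fields are invented for the example, and it assumes that only the most recent request per client is recorded.

```python
class Target:
    """Hypothetical server target with a per-client last_rcvd-style record."""

    def __init__(self) -> None:
        self.next_transno = 1
        self.last_rcvd = {}  # client UUID -> (xid, transno, status)
        self.executions = 0  # counts real (non-reconstructed) executions

    def handle(self, client: str, xid: int) -> tuple:
        record = self.last_rcvd.get(client)
        if record is not None and record[0] == xid:
            # Resent request: rebuild the reply from the stored state
            # instead of repeating a non-idempotent operation.
            return record
        # First time this XID is seen: execute and record the outcome.
        self.executions += 1
        transno = self.next_transno
        self.next_transno += 1
        status = 0  # stand-in for req->rq_status
        self.last_rcvd[client] = (xid, transno, status)
        return self.last_rcvd[client]
```

Resending the same XID returns the stored transno and status without executing the operation a second time, which is the property reply reconstruction has to guarantee.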
+    <section remap="h3">
+      <title>30.3.2 Reconstruction of Open Replies</title>
+      <para>An open reply consists of up to three pieces of information (in addition to the contents of the &quot;request log&quot;):</para>
+      <itemizedlist>
+        <listitem>
+          <para>File handle</para>
          </listitem>
-
-</itemizedlist>
-              <note><para>An RPC can contain up to four pre-operation versions, because several inodes can be involved in an operation. In the case of a &apos;&apos;rename&apos;&apos; operation, four different inodes can be modified.</para></note>
-      <para>During normal operation, the server:</para>
-      <itemizedlist><listitem>
-          <para> Updates the versions of all inodes involved in a given operation</para>
+        <listitem>
+          <para>Lock handle</para>
          </listitem>
-
-<listitem>
-          <para> Returns the old and new inode versions to the client with the reply</para>
+        <listitem>
+          <para><literal>mds_body</literal> with information about the file created (for <literal>O_CREAT</literal>)</para>
          </listitem>
-
-</itemizedlist>
-      <para>When the recovery mechanism is underway, VBR follows these steps:</para>
-      <orderedlist><listitem>
-      <para>VBR only allows clients to replay transactions if the affected inodes have the same version as during the original execution of the transactions, even if there is gap in transactions due to a missed client.</para>
-  </listitem><listitem>
-      <para>The server attempts to execute every transaction that the client offers, even if it encounters a re-integration failure.</para>
-  </listitem><listitem>
-      <para>When the replay is complete, the client and server check if a replay failed on any transaction because of inode version mismatch. If the versions match, the client gets a successful re-integration message. If the versions do not match, then the client is evicted.</para>
-  </listitem></orderedlist>
-      <para>VBR recovery is fully transparent to users. It may lead to slightly longer recovery times if the cluster loses several clients during server recovery.</para>
-      <section remap="h3">
-        <title>30.4.1 <anchor xml:id="dbdoclet.50438268_marker-1288583" xreflabel=""/>VBR Messages</title>
-        <para>The VBR feature is built into the Lustre recovery functionality. It cannot be disabled. These are some VBR messages that may be displayed:</para>
-        <screen>DEBUG_REQ(D_WARNING, req, &quot;Version mismatch during replay\n&quot;);
-</screen>
-        <para>This message indicates why the client was evicted. No action is needed.</para>
-        <screen>CWARN(&quot;%s: version recovery fails, reconnecting\n&quot;);
-</screen>
-        <para>This message indicates why the recovery failed. No action is needed.</para>
+      </itemizedlist>
+      <para>The disposition, status and request data (re-sent intact by the client) are sufficient to determine which type of lock handle was granted, whether an open file handle was created, and which resource should be described in the <literal>mds_body</literal>.</para>
+      <section remap="h5">
+        <title>Finding the File Handle</title>
+        <para>The file handle can be found in the XID of the request and the list of per-export open file handles. The file handle contains the resource/FID.</para>
        </section>
-      <section remap="h3">
-        <title>30.4.2 Tips for <anchor xml:id="dbdoclet.50438268_marker-1288584" xreflabel=""/>Using VBR</title>
-        <para>VBR will be successful for clients which do not share data with other client. Therefore, the strategy for reliable use of VBR is to store a client'™s data in its own directory, where possible. VBR can recover these clients, even if other clients are lost.</para>
+      <section remap="h5">
+        <title>Finding the Resource/fid</title>
+        <para>The file handle contains the resource/fid.</para>
        </section>
-    </section>
-    <section xml:id="dbdoclet.50438268_83826">
-      <title>30.5 Commit on <anchor xml:id="dbdoclet.50438268_marker-1292182" xreflabel=""/>Share</title>
-      <para>The commit-on-share (COS) feature makes Lustre recovery more reliable by preventing missing clients from causing cascading evictions of other clients. With COS enabled, if some Lustre clients miss the recovery window after a reboot or a server failure, the remaining clients are not evicted.</para>
-              <note><para>The commit-on-share feature is enabled, by default.</para></note>
-      <section remap="h3">
-        <title>30.5.1 Working with Commit on Share</title>
-        <para>To illustrate how COS works, let&apos;s first look at the old recovery scenario. After a service restart, the MDS would boot and enter recovery mode. Clients began reconnecting and replaying their uncommitted transactions. Clients could replay transactions independently as long as their transactions did not depend on each other (one client&apos;s transactions did not depend on a different client&apos;s transactions). The MDS is able to determine whether one transaction is dependent on another transaction via the <link xl:href="LustreRecovery.html#50438268_80068">Version-based Recovery</link> feature.</para>
-        <para>If there was a dependency between client transactions (for example, creating and deleting the same file), and one or more clients did not reconnect in time, then some clients may have been evicted because their transactions depended on transactions from the missing clients. Evictions of those clients caused more clients to be evicted and so on, resulting in &quot;cascading&quot; client evictions.</para>
-        <para>COS addresses the problem of cascading evictions by eliminating dependent transactions between clients. It ensures that one transaction is committed to disk if another client performs a transaction dependent on the first one. With no dependent, uncommitted transactions to apply, the clients replay their requests independently without the risk of being evicted.</para>
+      <section remap="h5">
+        <title>Finding the Lock Handle</title>
+        <para>The lock handle can be found by walking the list of granted locks for the resource looking for one with the appropriate remote file handle (present in the re-sent request). Verify that the lock has the right mode (determined by performing the disposition/request/status analysis above) and is granted to the proper client.</para>
        </section>
-      <section remap="h3">
-        <title>30.5.2 Tuning Commit On Share</title>
-        <para>Commit on Share can be enabled or disabled using the mdt.commit_on_sharing tunable (0/1). This tunable can be set when the MDS is created (mkfs.lustre) or when the Lustre file system is active, using the lctl set/get_param or lctl conf_param commands.</para>
-        <para>To set a default value for COS (disable/enable) when the file system is created, use:</para>
-        <screen>--param mdt.commit_on_sharing=0/1
+    </section>
+  </section>
+  <section xml:id="dbdoclet.50438268_80068">
+    <title>30.4 Version-based Recovery</title>
+    <para>The Version-based Recovery (VBR) feature improves Lustre reliability in cases where client requests (RPCs) fail to replay during recovery
+          <footnote>
+        <para>There are two scenarios under which client RPCs are not replayed:   (1) Non-functioning or isolated clients do not reconnect, and they cannot replay their RPCs, causing a gap in the replay sequence. These clients get errors and are evicted.   (2) Functioning clients connect, but they cannot replay some or all of their RPCs that occurred after the gap caused by the non-functioning/isolated clients. These clients get errors (caused by the failed clients). With VBR, these requests have a better chance to replay because the &quot;gaps&quot; are only related to specific files that the missing client(s) changed.</para>
+      </footnote>.</para>
+    <para>In pre-VBR versions of Lustre, if the MGS or an OST went down and then recovered, a recovery process was triggered in which clients attempted to replay their requests. Clients were only allowed to replay RPCs in serial order. If a particular client could not replay its requests, then those requests were lost as well as the requests of clients later in the sequence. The &apos;&apos;downstream&apos;&apos; clients never got to replay their requests because of the wait on the earlier client&apos;s RPCs. Eventually, the recovery period would time out (so the component could accept new requests), leaving some number of clients evicted and their requests and data lost.</para>
+    <para>With VBR, the recovery mechanism does not result in the loss of clients or their data, because changes in inode versions are tracked, and more clients are able to reintegrate into the cluster. With VBR, inode tracking looks like this:</para>
+    <itemizedlist>
+      <listitem>
+        <para>Each inode<footnote>
+            <para>Usually, there are two inodes, a parent and a child.</para>
+          </footnote> stores a version, that is, the number of the last transaction (transno) in which the inode was changed.</para>
+      </listitem>
+      <listitem>
+        <para>When an inode is about to be changed, a pre-operation version of the inode is saved in the client&apos;s data.</para>
+      </listitem>
+      <listitem>
+        <para>The client keeps the pre-operation inode version and the post-operation version (transaction number) for replay, and sends them in the event of a server failure.</para>
+      </listitem>
+      <listitem>
+        <para>If the pre-operation version matches, then the request is replayed. The post-operation version is assigned on all inodes modified in the request.</para>
+      </listitem>
+    </itemizedlist>
+    <note>
+      <para>An RPC can contain up to four pre-operation versions, because several inodes can be involved in an operation. In the case of a &apos;&apos;rename&apos;&apos; operation, four different inodes can be modified.</para>
+    </note>
+    <para>During normal operation, the server:</para>
+    <itemizedlist>
+      <listitem>
+        <para> Updates the versions of all inodes involved in a given operation</para>
+      </listitem>
+      <listitem>
+        <para> Returns the old and new inode versions to the client with the reply</para>
+      </listitem>
+    </itemizedlist>
+    <para>When the recovery mechanism is underway, VBR follows these steps:</para>
+    <orderedlist>
+      <listitem>
+        <para>VBR only allows clients to replay transactions if the affected inodes have the same version as during the original execution of the transactions, even if there is a gap in transactions due to a missed client.</para>
+      </listitem>
+      <listitem>
+        <para>The server attempts to execute every transaction that the client offers, even if it encounters a re-integration failure.</para>
+      </listitem>
+      <listitem>
+        <para>When the replay is complete, the client and server check if a replay failed on any transaction because of inode version mismatch. If the versions match, the client gets a successful re-integration message. If the versions do not match, then the client is evicted.</para>
+      </listitem>
+    </orderedlist>
+    <para>VBR recovery is fully transparent to users. It may lead to slightly longer recovery times if the cluster loses several clients during server recovery.</para>
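The version check at the heart of VBR can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the Lustre implementation: the `vbr_replay` function, its dictionary-based inode table, and the inode names are hypothetical.

```python
def vbr_replay(versions: dict, pre_versions: dict, transno: int) -> bool:
    """Replay one request if every affected inode still has its pre-op version.

    versions:     inode id -> current version stored on the server
    pre_versions: inode id -> version the client saw before the operation
    transno:      post-operation version (the request's transaction number)
    """
    for inode, expected in pre_versions.items():
        if versions.get(inode) != expected:
            return False  # version mismatch: this replay fails
    for inode in pre_versions:
        versions[inode] = transno  # post-operation version on all inodes
    return True
```

A request whose inodes were untouched by a missing client still matches and replays even when there is a gap in transaction numbers; a request that depends on a missing transaction sees a stale version and fails.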
+    <section remap="h3">
+      <title>30.4.1 VBR Messages</title>
+      <para>The VBR feature is built into the Lustre recovery functionality. It cannot be disabled. These are some VBR messages that may be displayed:</para>
+      <screen>DEBUG_REQ(D_WARNING, req, &quot;Version mismatch during replay\n&quot;);</screen>
+      <para>This message indicates why the client was evicted. No action is needed.</para>
+      <screen>CWARN(&quot;%s: version recovery fails, reconnecting\n&quot;);</screen>
+      <para>This message indicates why the recovery failed. No action is needed.</para>
+    </section>
+    <section remap="h3">
+      <title>30.4.2 Tips for Using VBR</title>
+      <para>VBR will be successful for clients which do not share data with other clients. Therefore, the strategy for reliable use of VBR is to store a client&apos;s data in its own directory, where possible. VBR can recover these clients, even if other clients are lost.</para>
+    </section>
+  </section>
+  <section xml:id="dbdoclet.50438268_83826">
+    <title>30.5 Commit on Share</title>
+    <para>The commit-on-share (COS) feature makes Lustre recovery more reliable by preventing missing clients from causing cascading evictions of other clients. With COS enabled, if some Lustre clients miss the recovery window after a reboot or a server failure, the remaining clients are not evicted.</para>
+    <note>
+      <para>The commit-on-share feature is enabled by default.</para>
+    </note>
+    <section remap="h3">
+      <title>30.5.1 Working with Commit on Share</title>
+      <para>To illustrate how COS works, let&apos;s first look at the old recovery scenario. After a service restart, the MDS would boot and enter recovery mode. Clients began reconnecting and replaying their uncommitted transactions. Clients could replay transactions independently as long as their transactions did not depend on each other (one client&apos;s transactions did not depend on a different client&apos;s transactions). The MDS is able to determine whether one transaction is dependent on another transaction via the <xref linkend="dbdoclet.50438268_80068"/> feature.</para>
+      <para>If there was a dependency between client transactions (for example, creating and deleting the same file), and one or more clients did not reconnect in time, then some clients may have been evicted because their transactions depended on transactions from the missing clients. Evictions of those clients caused more clients to be evicted and so on, resulting in &quot;cascading&quot; client evictions.</para>
+      <para>COS addresses the problem of cascading evictions by eliminating dependent transactions between clients. It ensures that one transaction is committed to disk if another client performs a transaction dependent on the first one. With no dependent, uncommitted transactions to apply, the clients replay their requests independently without the risk of being evicted.</para>
+    </section>
+    <section remap="h3">
+      <title>30.5.2 Tuning Commit On Share</title>
+      <para>Commit on Share can be enabled or disabled using the <literal>mdt.commit_on_sharing</literal> tunable (0/1). This tunable can be set when the MDS is created (<literal>mkfs.lustre</literal>) or when the Lustre file system is active, using the <literal>lctl set/get_param</literal> or <literal>lctl conf_param</literal> commands.</para>
+      <para>To set a default value for COS (disable/enable) when the file system is created, use:</para>
+      <screen>--param mdt.commit_on_sharing=0/1
  </screen>
-        <para>To disable or enable COS when the file system is running, use:</para>
-        <screen>lctl set_param mdt.*.commit_on_sharing=0/1
+      <para>To disable or enable COS when the file system is running, use:</para>
+      <screen>lctl set_param mdt.*.commit_on_sharing=0/1
  </screen>
-                <note><para>Enabling COS may cause the MDS to do a large number of synchronous disk operations, hurting performance. Placing the ldiskfs journal on a low-latency external device may improve file system performance.</para></note>
-      </section>
+      <note>
+        <para>Enabling COS may cause the MDS to do a large number of synchronous disk operations, hurting performance. Placing the <literal>ldiskfs</literal> journal on a low-latency external device may improve file system performance.</para>
+      </note>
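+      <para>As an illustration of the runtime commands above, the current COS state can be read back before and after it is changed. This is a sketch only: it assumes the commands are run as root on the MDS, and the <literal>mdt.*</literal> wildcard simply matches whichever MDT names are present on that node.</para>
+      <screen>lctl get_param mdt.*.commit_on_sharing
+lctl set_param mdt.*.commit_on_sharing=1
+lctl get_param mdt.*.commit_on_sharing
+</screen>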
      </section>
+  </section>
  </chapter>
diff --git a/LustreTroubleshooting.xml b/LustreTroubleshooting.xml

index e2a6c5b..b2bdb6a 100644 (file)
--- a/LustreTroubleshooting.xml
+++ b/LustreTroubleshooting.xml
@@ -1,372 +1,477 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<chapter version="5.0" xml:lang="en-US" xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" xml:id='lustretroubleshooting'>
+<?xml version='1.0' encoding='UTF-8'?>
+<!-- This document was created with Syntext Serna Free. --><chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="lustretroubleshooting">
    <info>
-    <title xml:id='lustretroubleshooting.title'>Lustre Troubleshooting</title>
+    <title xml:id="lustretroubleshooting.title">Lustre Troubleshooting</title>
    </info>
    <para>This chapter provides information on troubleshooting Lustre, submitting a Lustre bug, and Lustre performance tips. It includes the following sections:</para>
-  <itemizedlist><listitem>
+  <itemizedlist>
+    <listitem>
        <para><xref linkend="dbdoclet.50438198_11171"/></para>
      </listitem>
-
-<listitem>
+    <listitem>
        <para><xref linkend="dbdoclet.50438198_30989"/></para>
      </listitem>
-
-<listitem>
+    <listitem>
        <para><xref linkend="dbdoclet.50438198_93109"/></para>
      </listitem>
-
-</itemizedlist>
-
-    <section xml:id="dbdoclet.50438198_11171">
-      <title>26.1 Lustre Error Messages</title>
-      <para>Several resources are available to help troubleshoot Lustre. This section describes error numbers, error messages and logs.</para>
-      <section remap="h3">
-        <title>26.1.1 Error <anchor xml:id="dbdoclet.50438198_marker-1296744" xreflabel=""/>Numbers</title>
-        <para>Error numbers for Lustre come from the Linux errno.h, and are located in /usr/include/asm/errno.h. Lustre does not use all of the available Linux error numbers. The exact meaning of an error number depends on where it is used. Here is a summary of the basic errors that Lustre users may encounter.</para>
-        <informaltable frame="all">
-          <tgroup cols="3">
-            <colspec colname="c1" colwidth="33*"/>
-            <colspec colname="c2" colwidth="33*"/>
-            <colspec colname="c3" colwidth="33*"/>
-            <thead>
-              <row>
-                <entry><para><emphasis role="bold">Error Number</emphasis></para></entry>
-                <entry><para><emphasis role="bold">Error Name</emphasis></para></entry>
-                <entry><para><emphasis role="bold">Description</emphasis></para></entry>
-              </row>
-            </thead>
-            <tbody>
-              <row>
-                <entry><para> -1</para></entry>
-                <entry><para> -EPERM</para></entry>
-                <entry><para> Permission is denied.</para></entry>
-              </row>
-              <row>
-                <entry><para> -2</para></entry>
-                <entry><para> -ENOENT</para></entry>
-                <entry><para> The requested file or directory does not exist.</para></entry>
-              </row>
-              <row>
-                <entry><para> -4</para></entry>
-                <entry><para> -EINTR</para></entry>
-                <entry><para> The operation was interrupted (usually CTRL-C or a killing process).</para></entry>
-              </row>
-              <row>
-                <entry><para> -5</para></entry>
-                <entry><para> -EIO</para></entry>
-                <entry><para> The operation failed with a read or write error.</para></entry>
-              </row>
-              <row>
-                <entry><para> -19</para></entry>
-                <entry><para> -ENODEV</para></entry>
-                <entry><para> No such device is available. The server stopped or failed over.</para></entry>
-              </row>
-              <row>
-                <entry><para> -22</para></entry>
-                <entry><para> -EINVAL</para></entry>
-                <entry><para> The parameter contains an invalid value.</para></entry>
-              </row>
-              <row>
-                <entry><para> -28</para></entry>
-                <entry><para> -ENOSPC</para></entry>
-                <entry><para> The file system is out-of-space or out of inodes. Use lfs df (query the amount of file system space) or lfs df -i (query the number of inodes).</para></entry>
-              </row>
-              <row>
-                <entry><para> -30</para></entry>
-                <entry><para> -EROFS</para></entry>
-                <entry><para> The file system is read-only, likely due to a detected error.</para></entry>
-              </row>
-              <row>
-                <entry><para> -43</para></entry>
-                <entry><para> -EIDRM</para></entry>
-                <entry><para> The UID/GID does not match any known UID/GID on the MDS. Update etc/hosts and etc/group on the MDS to add the missing user or group.</para></entry>
-              </row>
-              <row>
-                <entry><para> -107</para></entry>
-                <entry><para> -ENOTCONN</para></entry>
-                <entry><para> The client is not connected to this server.</para></entry>
-              </row>
-              <row>
-                <entry><para> -110</para></entry>
-                <entry><para> -ETIMEDOUT</para></entry>
-                <entry><para> The operation took too long and timed out.</para></entry>
-              </row>
-            </tbody>
-          </tgroup>
-        </informaltable>
-      </section>
-      <section remap="h3">
-        <title>26.1.2 <anchor xml:id="dbdoclet.50438198_40669" xreflabel=""/>Viewing Error <anchor xml:id="dbdoclet.50438198_marker-1291323" xreflabel=""/>Messages</title>
-        <para>As Lustre code runs on the kernel, single-digit error codes display to the application; these error codes are an indication of the problem. Refer to the kernel console log (dmesg) for all recent kernel messages from that node. On the node, /var/log/messages holds a log of all messages for at least the past day.</para>
-        <para>The error message initiates with &quot;LustreError&quot; in the console log and provides a short description of:</para>
-        <itemizedlist><listitem>
-            <para> What the problem is</para>
-          </listitem>
-
-<listitem>
-            <para> Which process ID had trouble</para>
-          </listitem>
-
-<listitem>
-            <para> Which server node it was communicating with, and so on.</para>
-          </listitem>
-
-</itemizedlist>
-        <para>Lustre logs are dumped to /proc/sys/lnet/debug_path.</para>
-        <para>Collect the first group of messages related to a problem, and any messages that precede &quot;LBUG&quot; or &quot;assertion failure&quot; errors. Messages that mention server nodes (OST or MDS) are specific to that server; you must collect similar messages from the relevant server console logs.</para>
-        <para>Another Lustre debug log holds information for Lustre action for a short period of time which, in turn, depends on the processes on the node to use Lustre. Use the following command to extract debug logs on each of the nodes, run</para>
-        <screen>$ lctl dk &lt;filename&gt;
-</screen>
-                <note><para>LBUG freezes the thread to allow capture of the panic stack. A system reboot is needed to clear the thread.</para></note>
-      </section>
+  </itemizedlist>
+  <section xml:id="dbdoclet.50438198_11171">
+    <title>26.1 Lustre Error Messages</title>
+    <para>Several resources are available to help troubleshoot Lustre. This section describes error numbers, error messages and logs.</para>
+    <section remap="h3">
+      <title>26.1.1 Error Numbers</title>
+      <para>Lustre error numbers come from the Linux <literal>errno.h</literal> header, located at <literal>/usr/include/asm/errno.h</literal>. Lustre does not use all of the available Linux error numbers. The exact meaning of an error number depends on where it is used. Here is a summary of the basic errors that Lustre users may encounter.</para>
+      <informaltable frame="all">
+        <tgroup cols="3">
+          <colspec colname="c1" colwidth="33*"/>
+          <colspec colname="c2" colwidth="33*"/>
+          <colspec colname="c3" colwidth="33*"/>
+          <thead>
+            <row>
+              <entry>
+                <para><emphasis role="bold">Error Number</emphasis></para>
+              </entry>
+              <entry>
+                <para><emphasis role="bold">Error Name</emphasis></para>
+              </entry>
+              <entry>
+                <para><emphasis role="bold">Description</emphasis></para>
+              </entry>
+            </row>
+          </thead>
+          <tbody>
+            <row>
+              <entry>
+                <para> -1</para>
+              </entry>
+              <entry>
+                <para>
+                  <literal>-EPERM</literal>
+                </para>
+              </entry>
+              <entry>
+                <para> Permission is denied.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> -2</para>
+              </entry>
+              <entry>
+                <para>
+                  <literal>-ENOENT</literal>
+                </para>
+              </entry>
+              <entry>
+                <para> The requested file or directory does not exist.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> -4</para>
+              </entry>
+              <entry>
+                <para>
+                  <literal>-EINTR</literal>
+                </para>
+              </entry>
+              <entry>
+                <para> The operation was interrupted (usually CTRL-C or a killing process).</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> -5</para>
+              </entry>
+              <entry>
+                <para>
+                  <literal>-EIO</literal>
+                </para>
+              </entry>
+              <entry>
+                <para> The operation failed with a read or write error.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> -19</para>
+              </entry>
+              <entry>
+                <para>
+                  <literal>-ENODEV</literal>
+                </para>
+              </entry>
+              <entry>
+                <para> No such device is available. The server stopped or failed over.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> -22</para>
+              </entry>
+              <entry>
+                <para>
+                  <literal>-EINVAL</literal>
+                </para>
+              </entry>
+              <entry>
+                <para> The parameter contains an invalid value.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> -28</para>
+              </entry>
+              <entry>
+                <para>
+                  <literal>-ENOSPC</literal>
+                </para>
+              </entry>
+              <entry>
+                <para> The file system is out-of-space or out of inodes. Use <literal>lfs df</literal> (query the amount of file system space) or <literal>lfs df -i</literal> (query the number of inodes).</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> -30</para>
+              </entry>
+              <entry>
+                <para>
+                  <literal>-EROFS</literal>
+                </para>
+              </entry>
+              <entry>
+                <para> The file system is read-only, likely due to a detected error.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> -43</para>
+              </entry>
+              <entry>
+                <para>
+                  <literal>-EIDRM</literal>
+                </para>
+              </entry>
+              <entry>
+                <para> The UID/GID does not match any known UID/GID on the MDS. Update <literal>/etc/hosts</literal> and <literal>/etc/group</literal> on the MDS to add the missing user or group.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> -107</para>
+              </entry>
+              <entry>
+                <para>
+                  <literal>-ENOTCONN</literal>
+                </para>
+              </entry>
+              <entry>
+                <para> The client is not connected to this server.</para>
+              </entry>
+            </row>
+            <row>
+              <entry>
+                <para> -110</para>
+              </entry>
+              <entry>
+                <para>
+                  <literal>-ETIMEDOUT</literal>
+                </para>
+              </entry>
+              <entry>
+                <para> The operation took too long and timed out.</para>
+              </entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </informaltable>
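+      <para>To look up the symbolic name and standard description for a number not listed above, one option is a one-line query of the C library&apos;s error table. This sketch assumes <literal>python3</literal> is installed on the node (it is not part of Lustre). Because Lustre reports errors as negative values, drop the sign first (28 for -28):</para>
+      <screen>$ python3 -c &apos;import errno, os; print(errno.errorcode[28], os.strerror(28))&apos;
+</screen>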
+    </section>
+    <section xml:id="dbdoclet.50438198_40669">
+      <title>26.1.2 Viewing Error Messages</title>
+      <para>Because Lustre code runs in the kernel, only numeric error codes are returned to the application; these codes indicate the nature of the problem. Refer to the kernel console log (<literal>dmesg</literal>) for all recent kernel messages from that node. On the node, <literal>/var/log/messages</literal> holds a log of all messages for at least the past day.</para>
+      <para>Error messages in the console log begin with &quot;LustreError&quot; and provide a short description of:</para>
+      <itemizedlist>
+        <listitem>
+          <para>What the problem is</para>
+        </listitem>
+        <listitem>
+          <para>Which process ID had trouble</para>
+        </listitem>
+        <listitem>
+          <para>Which server node it was communicating with, and so on.</para>
+        </listitem>
+      </itemizedlist>
+      <para>Lustre logs are dumped to <literal>/proc/sys/lnet/debug_path</literal>.</para>
+      <para>Collect the first group of messages related to a problem, and any messages that precede &quot;LBUG&quot; or &quot;assertion failure&quot; errors. Messages that mention server nodes (OST or MDS) are specific to that server; you must collect similar messages from the relevant server console logs.</para>
+      <para>Another Lustre debug log holds information about Lustre activity for a short period of time; how long depends on how heavily the processes on the node use Lustre. To extract debug logs on each of the nodes, run:</para>
+      <screen>$ lctl dk &lt;filename&gt;
+</screen>
+      <note>
+        <para>LBUG freezes the thread to allow capture of the panic stack. A system reboot is needed to clear the thread.</para>
+      </note>
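+      <para>A typical collection pass on an affected node might look like the following sketch. It assumes sufficient privileges to read the system logs; the output path <literal>/tmp/lustre-debug.log</literal> is only an example name, and the <literal>grep</literal> filters are a convenience rather than a required step:</para>
+      <screen>$ dmesg | grep LustreError
+$ grep LustreError /var/log/messages
+$ lctl dk /tmp/lustre-debug.log
+</screen>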
      </section>
-    <section xml:id="dbdoclet.50438198_30989">
-      <title>26.2 Reporting a Lustre <anchor xml:id="dbdoclet.50438198_marker-1296753" xreflabel=""/>Bug</title>
-      <para>If, after troubleshooting your Lustre system, you cannot resolve the problem, consider reporting a Lustre bug. The process for reporting a bug is described in the Lustre wiki topic <link xl:href="http://wiki.lustre.org/index.php/Reporting_Bugs">Reporting Bugs</link>.</para>
-      <para>You can also post a question to the <link xl:href="http://wiki.lustre.org/index.php/Lustre_Mailing_Lists">lustre-discuss mailing list</link> or search the <link xl:href="http://groups.google.com/group/lustre-discuss-list">lustre-discuss Archives</link> for information about your issue.</para>
-      <para>A Lustre diagnostics tool is available for downloading at: <link xl:href="http://downloads.lustre.org/public/tools/lustre-diagnostics/">http://downloads.lustre.org/public/tools/lustre-diagnostics/</link></para>
-      <para>You can run this tool to capture diagnostics output to include in the reported bug. To run this tool, enter one of these commands:</para>
-      <screen># lustre-diagnostics -t &lt;bugzilla bug #&gt;
+  </section>
+  <section xml:id="dbdoclet.50438198_30989">
+    <title>26.2 Reporting a Lustre Bug</title>
+    <para>If, after troubleshooting your Lustre system, you cannot resolve the problem, consider reporting a Lustre bug. The process for reporting a bug is described in the Lustre wiki topic <ulink xl:href="http://wiki.lustre.org/index.php/Reporting_Bugs">Reporting Bugs</ulink>.</para>
+    <para>You can also post a question to the <ulink xl:href="http://wiki.lustre.org/index.php/Lustre_Mailing_Lists">lustre-discuss mailing list</ulink> or search the <ulink xl:href="http://groups.google.com/group/lustre-discuss-list">lustre-discuss Archives</ulink> for information about your issue.</para>
+    <para>A Lustre diagnostics tool is available for downloading at: <ulink xl:href="http://downloads.lustre.org/public/tools/lustre-diagnostics/">http://downloads.lustre.org/public/tools/lustre-diagnostics/</ulink></para>
+    <para>You can run this tool to capture diagnostics output to include in the reported bug. To run this tool, enter one of these commands:</para>
+    <screen># lustre-diagnostics -t &lt;bugzilla bug #&gt;
  # lustre-diagnostics.
  </screen>
-      <para>Output is sent directly to the terminal. Use normal file redirection to send the output to a file, and then manually attach the file to the bug you are submitting.</para>
+    <para>Output is sent directly to the terminal. Use normal file redirection to send the output to a file, and then manually attach the file to the bug you are submitting.</para>
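+    <para>For example, the output might be captured as follows, where <literal>/tmp/lustre-diagnostics.out</literal> is an arbitrary file name chosen for illustration:</para>
+    <screen># lustre-diagnostics &gt; /tmp/lustre-diagnostics.out
+</screen>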
+  </section>
+  <section xml:id="dbdoclet.50438198_93109">
+    <title>26.3 Common Lustre Problems</title>
+    <para>This section describes how to address common issues encountered with Lustre.</para>
+    <section remap="h3">
+      <title>26.3.1 OST Object is Missing or Damaged</title>
+      <para>If the OSS fails to find an object or finds a damaged object, this message appears:</para>
+      <para><screen>OST object missing or damaged (OST &quot;ost1&quot;, object 98148, error -2)</screen></para>
+      <para>If the reported error is -2 (<literal>-ENOENT</literal>, or &quot;No such file or directory&quot;), then the object is missing. This can occur either because the MDS and OST are out of sync, or because an OST object was corrupted and deleted.</para>
+      <para>If you have recovered the file system from a disk failure by using e2fsck, then unrecoverable objects may have been deleted or moved to /lost+found on the raw OST partition. Because files on the MDS still reference these objects, attempts to access them produce this error.</para>
+      <para>If you have recovered a backup of the raw MDS or OST partition, then the restored partition is very likely to be out of sync with the rest of your cluster. No matter which server partition you restored from backup, files on the MDS may reference objects which no longer exist (or did not exist when the backup was taken); accessing those files produces this error.</para>
+      <para>If neither of those descriptions is applicable to your situation, then it is possible that you have discovered a programming error that allowed the servers to get out of sync. Please report this condition to the Lustre group, and we will investigate.</para>
+      <para>If the reported error is anything else (such as -5, &quot;<literal>I/O error</literal>&quot;), it likely indicates a storage failure. The low-level file system returns this error if it is unable to read from the storage device.</para>
+      <para><emphasis role="bold">Suggested Action</emphasis></para>
+      <para>If the reported error is -2, you can consider checking in <literal>/lost+found</literal> on your raw OST device, to see if the missing object is there. However, it is likely that this object is lost forever, and that the file that references the object is now partially or completely lost. Restore this file from backup, or salvage what you can and delete it.</para>
+      <para>If the reported error is anything else, then you should immediately inspect this server for storage problems.</para>
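+      <para>One way to look for a missing object in <literal>/lost+found</literal> is sketched below. It assumes the OST has been stopped so that its device can be mounted directly as <literal>ldiskfs</literal>; <literal>&lt;ostdev&gt;</literal> and the mount point <literal>/mnt/ost</literal> are placeholders, not names taken from your system:</para>
+      <screen># mount -t ldiskfs /dev/&lt;ostdev&gt; /mnt/ost
+# ls -l /mnt/ost/lost+found
+# umount /mnt/ost
+</screen>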
      </section>
-    <section xml:id="dbdoclet.50438198_93109">
-      <title>26.3 Common Lustre Problems</title>
-      <para>This section describes how to address common issues encountered with Lustre.</para>
-      <section remap="h3">
-        <title>26.3.1 OST Object is <anchor xml:id="dbdoclet.50438198_marker-1291349" xreflabel=""/>Missing or Damaged</title>
-        <para>If the OSS fails to find an object or finds a damaged object, this message appears:</para>
-        <para>OST object missing or damaged (OST &quot;ost1&quot;, object 98148, error -2)</para>
-        <para>If the reported error is -2 (-ENOENT, or &quot;No such file or directory&quot;), then the object is missing. This can occur either because the MDS and OST are out of sync, or because an OST object was corrupted and deleted.</para>
-        <para>If you have recovered the file system from a disk failure by using e2fsck, then unrecoverable objects may have been deleted or moved to /lost+found on the raw OST partition. Because files on the MDS still reference these objects, attempts to access them produce this error.</para>
-        <para>If you have recovered a backup of the raw MDS or OST partition, then the restored partition is very likely to be out of sync with the rest of your cluster. No matter which server partition you restored from backup, files on the MDS may reference objects which no longer exist (or did not exist when the backup was taken); accessing those files produces this error.</para>
-        <para>If neither of those descriptions is applicable to your situation, then it is possible that you have discovered a programming error that allowed the servers to get out of sync. Please report this condition to the Lustre group, and we will investigate.</para>
-        <para>If the reported error is anything else (such as -5, &quot;I/O error&quot;), it likely indicates a storage failure. The low-level file system returns this error if it is unable to read from the storage device.</para>
-        <para><emphasis role="bold">Suggested Action</emphasis></para>
-        <para>If the reported error is -2, you can consider checking in /lost+found on your raw OST device, to see if the missing object is there. However, it is likely that this object is lost forever, and that the file that references the object is now partially or completely lost. Restore this file from backup, or salvage what you can and delete it.</para>
-        <para>If the reported error is anything else, then you should immediately inspect this server for storage problems.</para>
-      </section>
-      <section remap="h3">
-        <title>26.3.2 OSTs <anchor xml:id="dbdoclet.50438198_marker-1291361" xreflabel=""/>Become Read-Only</title>
-        <para>If the SCSI devices are inaccessible to Lustre at the block device level, then ldiskfs remounts the device read-only to prevent file system corruption. This is a normal behavior. The status in /proc/fs/lustre/health_check also shows &quot;not healthy&quot; on the affected nodes.</para>
-        <para>To determine what caused the &quot;not healthy&quot; condition:</para>
-        <itemizedlist><listitem>
-            <para> Examine the consoles of all servers for any error indications</para>
-          </listitem>
-
-<listitem>
-            <para> Examine the syslogs of all servers for any LustreErrors or LBUG</para>
-          </listitem>
-
-<listitem>
-            <para> Check the health of your system hardware and network. (Are the disks working as expected, is the network dropping packets?)</para>
-          </listitem>
-
-<listitem>
-            <para> Consider what was happening on the cluster at the time. Does this relate to a specific user workload or a system load condition? Is the condition reproducible? Does it happen at a specific time (day, week or month)?</para>
-          </listitem>
-
-</itemizedlist>
-        <para>To recover from this problem, you must restart Lustre services using these file systems. There is no other way to know that the I/O made it to disk, and the state of the cache may be inconsistent with what is on disk.</para>
-      </section>
-      <section remap="h3">
-        <title>26.3.3 Identifying a <anchor xml:id="dbdoclet.50438198_marker-1291366" xreflabel=""/>Missing OST</title>
-        <para>If an OST is missing for any reason, you may need to know what files are affected. Although an OST is missing, the files system should be operational. From any mounted client node, generate a list of files that reside on the affected OST. It is advisable to mark the missing OST as 'unavailable' so clients and the MDS do not time out trying to contact it.</para>
-        <para> 1. Generate a list of devices and determine the OST's device number. Run:</para>
-        <screen>$ lctl dl 
-</screen>
-        <para>The lctl dl command output lists the device name and number, along with the device UUID and the number of references on the device.</para>
-        <para> 2. Deactivate the OST (on the OSS at the MDS). Run:</para>
-        <screen>$ lctl --device &lt;OST device name or number&gt; deactivate 
-</screen>
-        <para>The OST device number or device name is generated by the lctl dl command.</para>
-        <para>The deactivate command prevents clients from creating new objects on the specified OST, although you can still access the OST for reading.</para>
-                <note><para>If the OST later becomes available it needs to be reactivated, run:</para><para># lctl --device &lt;OST device name or number&gt; activate</para></note>
-
-         <para> 3. Determine all files that are striped over the missing OST, run:</para>
-        <screen># lfs getstripe -r -O {OST_UUID} /mountpoint
-</screen>
-        <para>This returns a simple list of filenames from the affected file system.</para>
-        <para> 4. If necessary, you can read the valid parts of a striped file, run:</para>
-        <screen># dd if=filename of=new_filename bs=4k conv=sync,noerror
-</screen>
-        <para> 5. You can delete these files with the unlink or munlink command.</para>
-        <screen># unlink|munlink filename {filename ...} 
-</screen>
-                <note><para>There is no functional difference between the unlink and munlink commands. The unlink command is for newer Linux distributions. You can run munlink if unlink is not available.</para><para> When you run the unlink or munlink command, the file on the MDS is permanently removed.</para></note>
-         <para> 6. If you need to know, specifically, which parts of the file are missing data, then you first need to determine the file layout (striping pattern), which includes the index of the missing OST). Run:</para>
-        <screen># lfs getstripe -v {filename}
-</screen>
-        <para> 7. Use this computation is to determine which offsets in the file are affected: [(C*N + X)*S, (C*N + X)*S + S - 1], N = { 0, 1, 2, ...}</para>
-        <para>where:</para>
-        <para>C = stripe count</para>
-        <para>S = stripe size</para>
-        <para>X = index of bad OST for this file</para>
-        <para>For example, for a 2 stripe file, stripe size = 1M, the bad OST is at index 0, and you have holes in the file at: [(2*N + 0)*1M, (2*N + 0)*1M + 1M - 1], N = { 0, 1, 2, ...}</para>
-        <para>If the file system cannot be mounted, currently there is no way that parses metadata directly from an MDS. If the bad OST does not start, options to mount the file system are to provide a loop device OST in its place or replace it with a newly-formatted OST. In that case, the missing objects are created and are read as zero-filled.</para>
-      </section>
-      <section xml:id="dbdoclet.50438198_69657">
-        <title>26.3.4 Fixing a Bad LAST_ID on an OST</title>
-        <para>Each OST contains a LAST_ID file, which holds the last object (pre-)created by the MDS  <footnote><para>The contents of the LAST_ID file must be accurate regarding the actual objects that exist on the OST.</para></footnote>. The MDT contains a lov_objid file, with values that represent the last object the MDS has allocated to a file.</para>
-        <para>During normal operation, the MDT keeps some pre-created (but unallocated) objects on the OST, and the relationship between LAST_ID and lov_objid should be LAST_ID &lt;= lov_objid. Any difference in the file values results in objects being created on the OST when it next connects to the MDS. These objects are never actually allocated to a file, since they are of 0 length (empty), but they do no harm. Creating empty objects enables the OST to catch up to the MDS, so normal operations resume.</para>
-        <para>However, in the case where lov_objid &lt; LAST_ID, bad things can happen as the MDS is not aware of objects that have already been allocated on the OST, and it reallocates them to new files, overwriting their existing contents.</para>
-        <para>Here is the rule to avoid this scenario:</para>
-        <para>LAST_ID &gt;= lov_objid and LAST_ID == last_physical_object and lov_objid &gt;= last_used_object</para>
-        <para>Although the lov_objid value should be equal to the last_used_object value, the above rule suffices to keep Lustre happy at the expense of a few leaked objects.</para>
-        <para>In situations where there is on-disk corruption of the OST, for example caused by running with write cache enabled on the disks, the LAST_ID value may become inconsistent and result in a message similar to:</para>
-        <screen>&quot;filter_precreate()) HOME-OST0003: Serious error: 
-objid 3478673 already exists; is this filesystem corrupt?&quot;
-</screen>
-        <para>A related situation may happen if there is a significant discrepancy between the record of previously-created objects on the OST and the previously-allocated objects on the MDS, for example if the MDS has been corrupted, or restored from backup, which may cause significant data loss if left unchecked. This produces a message like:</para>
-        <screen>&quot;HOME-OST0003: ignoring bogus orphan destroy request: 
-obdid 3438673 last_id 3478673&quot;
-</screen>
-        <para>To recover from this situation, determine and set a reasonable LAST_ID value.</para>
-                <note><para>The file system must be stopped on all servers before performing this procedure.</para></note>
-
-        <para>For hex &lt; -&gt; decimal translations:</para>
-        <para>Use GDB:</para>
-        <screen>(gdb) p /x 15028
-$2 = 0x3ab4
-</screen>
-        <para>Or bc:</para>
-        <screen>echo &quot;obase=16; 15028&quot; | bc
-</screen>
-        <para> 1. Determine a reasonable value for the LAST_ID file. Check on the MDS:</para>
-        <screen># mount -t ldiskfs /dev/&lt;mdsdev&gt; /mnt/mds
+    <section remap="h3">
+      <title>26.3.2 OSTs Become Read-Only</title>
+      <para>If the SCSI devices are inaccessible to Lustre at the block device level, then <literal>ldiskfs</literal> remounts the device read-only to prevent file system corruption. This is normal behavior. The status in <literal>/proc/fs/lustre/health_check</literal> also shows &quot;not healthy&quot; on the affected nodes.</para>
+      <para>To determine what caused the &quot;not healthy&quot; condition:</para>
+      <itemizedlist>
+        <listitem>
+          <para>Examine the consoles of all servers for any error indications</para>
+        </listitem>
+        <listitem>
+          <para>Examine the syslogs of all servers for any LustreErrors or <literal>LBUG</literal></para>
+        </listitem>
+        <listitem>
+          <para>Check the health of your system hardware and network. (Are the disks working as expected, is the network dropping packets?)</para>
+        </listitem>
+        <listitem>
+          <para>Consider what was happening on the cluster at the time. Does this relate to a specific user workload or a system load condition? Is the condition reproducible? Does it happen at a specific time (day, week or month)?</para>
+        </listitem>
+      </itemizedlist>
+      <para>To recover from this problem, you must restart Lustre services using these file systems. There is no other way to know that the I/O made it to disk, and the state of the cache may be inconsistent with what is on disk.</para>
+    </section>
+    <section remap="h3">
+      <title>26.3.3 Identifying a Missing OST</title>
+      <para>If an OST is missing for any reason, you may need to know what files are affected. Although an OST is missing, the file system should be operational. From any mounted client node, generate a list of files that reside on the affected OST. It is advisable to mark the missing OST as &apos;unavailable&apos; so clients and the MDS do not time out trying to contact it.</para>
+      <orderedlist>
+        <listitem>
+          <para>Generate a list of devices and determine the OST&apos;s device number. Run:</para>
+          <screen>$ lctl dl </screen>
+          <para>The lctl dl command output lists the device name and number, along with the device UUID and the number of references on the device.</para>
+        </listitem>
+        <listitem>
+          <para>Deactivate the OST (on the OSS at the MDS). Run:</para>
+          <screen>$ lctl --device &lt;OST device name or number&gt; deactivate</screen>
+          <para>The OST device number or device name is generated by the lctl dl command.</para>
+          <para>The <literal>deactivate</literal> command prevents clients from creating new objects on the specified OST, although you can still access the OST for reading.</para>
+        </listitem>
+        <note>
+          <para>If the OST later becomes available, it needs to be reactivated. Run:</para>
+          <screen># lctl --device &lt;OST device name or number&gt; activate</screen>
+        </note>
+        <listitem>
+          <para>Determine all files that are striped over the missing OST. Run:</para>
+          <screen># lfs getstripe -r -O {OST_UUID} /mountpoint</screen>
+          <para>This returns a simple list of filenames from the affected file system.</para>
+        </listitem>
+        <listitem>
+          <para>If necessary, you can read the valid parts of a striped file. Run:</para>
+          <screen># dd if=filename of=new_filename bs=4k conv=sync,noerror</screen>
+        </listitem>
+        <listitem>
+          <para>You can delete these files with the unlink or munlink command.</para>
+          <screen># unlink|munlink filename {filename ...} </screen>
+        </listitem>
+        <note>
+          <para>There is no functional difference between the <literal>unlink</literal> and <literal>munlink</literal> commands. The unlink command is for newer Linux distributions. You can run <literal>munlink</literal> if <literal>unlink</literal> is not available.</para>
+          <para>When you run the <literal>unlink</literal> or <literal>munlink</literal> command, the file on the MDS is permanently removed.</para>
+        </note>
+        <listitem>
+          <para>If you need to know, specifically, which parts of the file are missing data, then you first need to determine the file layout (striping pattern), which includes the index of the missing OST. Run:</para>
+          <screen># lfs getstripe -v {filename}</screen>
+        </listitem>
+        <listitem>
+          <para>Use this computation to determine which offsets in the file are affected: [(C*N + X)*S, (C*N + X)*S + S - 1], N = { 0, 1, 2, ...}</para>
+          <para>where:</para>
+          <para>C = stripe count</para>
+          <para>S = stripe size</para>
+          <para>X = index of bad OST for this file</para>
+        </listitem>
+      </orderedlist>
+      <para>For example, for a 2-stripe file with a stripe size of 1M where the bad OST is at index 0, the holes in the file are at: [(2*N + 0)*1M, (2*N + 0)*1M + 1M - 1], N = { 0, 1, 2, ...}</para>
+      <para>If the file system cannot be mounted, there is currently no way to parse metadata directly from an MDS. If the bad OST does not start, options to mount the file system are to provide a loop device OST in its place or replace it with a newly-formatted OST. In that case, the missing objects are created and are read as zero-filled.</para>
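The offset computation above can be sketched as a small shell function. This is only an illustration: the name `hole_ranges` and its argument order are hypothetical, and the stripe size is assumed to be given in bytes (1M = 1048576).

```shell
# hole_ranges C S X COUNT   (hypothetical helper, not part of Lustre)
#   C = stripe count, S = stripe size in bytes, X = index of the bad OST
#   Prints the first COUNT affected byte ranges as "start end", one per line,
#   using [(C*N + X)*S, (C*N + X)*S + S - 1] for N = 0, 1, 2, ...
hole_ranges() {
    c=$1
    s=$2
    x=$3
    count=$4
    n=0
    while [ "$n" -lt "$count" ]; do
        start=$(( (c * n + x) * s ))
        echo "$start $(( start + s - 1 ))"
        n=$(( n + 1 ))
    done
}

# The 2-stripe, 1M stripe size, bad OST at index 0 case from the text:
hole_ranges 2 1048576 0 3
```

For that case the first three ranges start at offsets 0, 2097152 and 4194304, each one stripe (1048576 bytes) long.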
+    </section>
+    <section xml:id="dbdoclet.50438198_69657">
+      <title>26.3.4 Fixing a Bad LAST_ID on an OST</title>
+      <para>Each OST contains a LAST_ID file, which holds the last object (pre-)created by the MDS  <footnote>
+          <para>The contents of the LAST_ID file must be accurate regarding the actual objects that exist on the OST.</para>
+        </footnote>. The MDT contains a lov_objid file, with values that represent the last object the MDS has allocated to a file.</para>
+      <para>During normal operation, the MDT keeps some pre-created (but unallocated) objects on the OST, and the relationship between LAST_ID and lov_objid should be LAST_ID &lt;= lov_objid. Any difference in the file values results in objects being created on the OST when it next connects to the MDS. These objects are never actually allocated to a file, since they are of 0 length (empty), but they do no harm. Creating empty objects enables the OST to catch up to the MDS, so normal operations resume.</para>
+      <para>However, in the case where lov_objid &lt; LAST_ID, bad things can happen as the MDS is not aware of objects that have already been allocated on the OST, and it reallocates them to new files, overwriting their existing contents.</para>
+      <para>Here is the rule to avoid this scenario:</para>
+      <para>LAST_ID &gt;= lov_objid and LAST_ID == last_physical_object and lov_objid &gt;= last_used_object</para>
+      <para>Although the lov_objid value should be equal to the last_used_object value, the above rule suffices to keep Lustre happy at the expense of a few leaked objects.</para>
+      <para>In situations where there is on-disk corruption of the OST, for example caused by running with write cache enabled on the disks, the LAST_ID value may become inconsistent and result in a message similar to:</para>
+      <screen>&quot;filter_precreate()) HOME-OST0003: Serious error: 
+objid 3478673 already exists; is this filesystem corrupt?&quot;</screen>
+      <para>A related situation may happen if there is a significant discrepancy between the record of previously-created objects on the OST and the previously-allocated objects on the MDS, for example if the MDS has been corrupted, or restored from backup, which may cause significant data loss if left unchecked. This produces a message like:</para>
+      <screen>&quot;HOME-OST0003: ignoring bogus orphan destroy request: 
+obdid 3438673 last_id 3478673&quot;</screen>
+      <para>To recover from this situation, determine and set a reasonable LAST_ID value.</para>
+      <note>
+        <para>The file system must be stopped on all servers before performing this procedure.</para>
+      </note>
+      <para>For hex &lt; -&gt; decimal translations:</para>
+      <para>Use GDB:</para>
+      <screen>(gdb) p /x 15028
+$2 = 0x3ab4</screen>
+      <para>Or <literal>bc</literal>:</para>
+      <screen>echo &quot;obase=16; 15028&quot; | bc</screen>
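If neither GDB nor bc is at hand, the shell's own printf can do the same conversions in both directions. The helper names `to_hex` and `to_dec` below are hypothetical and exist only for this sketch.

```shell
# Convert between decimal and hexadecimal with the POSIX printf utility.
# to_hex and to_dec are illustrative names, not Lustre commands.
to_hex() { printf '0x%x\n' "$1"; }
to_dec() { printf '%d\n' "$1"; }

to_hex 15028     # prints 0x3ab4
to_dec 0x3ab4    # prints 15028
```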
+      <orderedlist>
+        <listitem>
+          <para>Determine a reasonable value for the LAST_ID file. Check on the MDS:</para>
+          <screen># mount -t ldiskfs /dev/&lt;mdsdev&gt; /mnt/mds
  # od -Ax -td8 /mnt/mds/lov_objid
  </screen>
-        <para>There is one entry for each OST, in OST index order. This is what the MDS thinks is the last in-use object.</para>
-        <para> 2. Determine the OST index for this OST.</para>
-        <screen># od -Ax -td4 /mnt/ost/last_rcvd
-</screen>
-        <para>It will have it at offset 0x8c.</para>
-        <para> 3. Check on the OST. Use debugfs to check the LAST_ID value:</para>
-        <screen>debugfs -c -R &apos;dump /O/0/LAST_ID /tmp/LAST_ID&apos; /dev/XXX ; od -Ax -td8 /tmp/\
+          <para>There is one entry for each OST, in OST index order. This is what the MDS thinks is the last in-use object.</para>
+        </listitem>
+        <listitem>
+          <para>Determine the OST index for this OST.</para>
+          <screen># od -Ax -td4 /mnt/ost/last_rcvd
+</screen>
+          <para>It will have it at offset 0x8c.</para>
+        </listitem>
+        <listitem>
+          <para>Check on the OST. Use debugfs to check the LAST_ID value:</para>
+          <screen>debugfs -c -R &apos;dump /O/0/LAST_ID /tmp/LAST_ID&apos; /dev/XXX ; od -Ax -td8 /tmp/\
  LAST_ID
  </screen>
-        <para> 4. Check the objects on the OST:</para>
-        <screen>mount -rt ldiskfs /dev/{ostdev} /mnt/ost
+        </listitem>
+        <listitem>
+          <para>Check the objects on the OST:</para>
+          <screen>mount -rt ldiskfs /dev/{ostdev} /mnt/ost
  # note the ls below is a number one and not a letter L
  ls -1s /mnt/ost/O/0/d* | grep -v [a-z] |
  sort -k2 -n &gt; /tmp/objects.{diskname}
   
-tail -30 /tmp/objects.{diskname}
-</screen>
-        <para>This shows you the OST state. There may be some pre-created orphans. Check for zero-length objects. Any zero-length objects with IDs higher than LAST_ID should be deleted. New objects will be pre-created.</para>
-        <para>If the OST LAST_ID value matches that for the objects existing on the OST, then it is possible the lov_objid file on the MDS is incorrect. Delete the lov_objid file on the MDS and it will be re-created from the LAST_ID on the OSTs.</para>
-        <para>If you determine the LAST_ID file on the OST is incorrect (that is, it does not match what objects exist, does not match the MDS lov_objid value), then you have decided on a proper value for LAST_ID.</para>
-        <para>Once you have decided on a proper value for LAST_ID, use this repair procedure.</para>
-        <orderedlist><listitem>
-        <para>Access:</para>
-        <screen>mount -t ldiskfs /dev/{ostdev} /mnt/ost
-</screen>
-        </listitem><listitem>
-        <para>Check the current:</para>
-        <screen>od -Ax -td8 /mnt/ost/O/0/LAST_ID
-</screen>
-        </listitem><listitem>
-        <para>Be very safe, only work on backups:</para>
-        <screen>cp /mnt/ost/O/0/LAST_ID /tmp/LAST_ID
-</screen>
-        </listitem><listitem>
-        <para>Convert binary to text:</para>
-        <screen>xxd /tmp/LAST_ID /tmp/LAST_ID.asc
-</screen>
-        </listitem><listitem>
-        <para>Fix:</para>
-        <screen>vi /tmp/LAST_ID.asc
-</screen>
-        </listitem><listitem>
-        <para>Convert to binary:</para>
-        <screen>xxd -r /tmp/LAST_ID.asc /tmp/LAST_ID.new
-</screen>
-        </listitem><listitem>
-        <para>Verify:</para>
-        <screen>od -Ax -td8 /tmp/LAST_ID.new
-</screen>
-        </listitem><listitem>
-        <para>Replace:</para>
-        <screen>cp /tmp/LAST_ID.new /mnt/ost/O/0/LAST_ID
-</screen>
-        </listitem><listitem>
-        <para>Clean up:</para>
-        <screen>umount /mnt/ost
-</screen>
-        </listitem></orderedlist>
-      </section>
-      <section remap="h3">
-        <title>26.3.5 Handling/Debugging <anchor xml:id="dbdoclet.50438198_marker-1291446" xreflabel=""/>&quot;Bind: Address already in use&quot; Error</title>
-        <para>During startup, Lustre may report a bind: Address already in use error and reject to start the operation. This is caused by a portmap service (often NFS locking) which starts before Lustre and binds to the default port 988. You must have port 988 open from firewall or IP tables for incoming connections on the client, OSS, and MDS nodes. LNET will create three outgoing connections on available, reserved ports to each client-server pair, starting with 1023, 1022 and 1021.</para>
-        <para>Unfortunately, you cannot set sunprc to avoid port 988. If you receive this error, do the following:</para>
-        <itemizedlist><listitem>
-            <para> Start Lustre before starting any service that uses sunrpc.</para>
-          </listitem>
-
-<listitem>
-            <para> Use a port other than 988 for Lustre. This is configured in /etc/modprobe.conf as an option to the LNET module. For example:</para>
-          </listitem>
-
-</itemizedlist>
-        <screen>options lnet accept_port=988
-</screen>
-        <itemizedlist><listitem>
-            <para> Add modprobe ptlrpc to your system startup scripts before the service that uses sunrpc. This causes Lustre to bind to port 988 and sunrpc to select a different port.</para>
-          </listitem>
-
-</itemizedlist>
-                <note><para>You can also use the sysctl command to mitigate the NFS client from grabbing the Lustre service port. However, this is a partial workaround as other user-space RPC servers still have the ability to grab the port.</para></note>
-
-      </section>
-      <section remap="h3">
-        <title>26.3.6 Handling/Debugging <anchor xml:id="dbdoclet.50438198_marker-1291470" xreflabel=""/>Error &quot;- 28&quot;</title>
-        <para>A Linux error -28 (ENOSPC) that occurs during a write or sync operation indicates that an existing file residing on an OST could not be rewritten or updated because the OST was full, or nearly full. To verify if this is the case, on a client on which the OST is mounted, enter :</para>
-        <screen>lfs df -h
-</screen>
-        <para>To address this issue, you can do one of the following:</para>
-        <itemizedlist><listitem>
-            <para> Expand the disk space on the OST.</para>
-          </listitem>
-
-<listitem>
-            <para> Copy or stripe the file to a less full OST.</para>
-          </listitem>
-
-</itemizedlist>
-        <para>A Linux error -28 (ENOSPC) that occurs when a new file is being created may indicate that the MDS has run out of inodes and needs to be made larger. Newly created files do not written to full OSTs, while existing files continue to reside on the OST where they were initially created. To view inode information on the MDS, enter:</para>
-        <screen>lfs df -i
-</screen>
-        <para>Typically, Lustre reports this error to your application. If the application is checking the return code from its function calls, then it decodes it into a textual error message such as No space left on device. Both versions of the error message also appear in the system log.</para>
-        <para>For more information about the lfs df command, see <link xl:href="ManagingStripingFreeSpace.html#50438209_35838">Checking File System Free Space</link>.</para>
-        <para>Although it is less efficient, you can also use the grep command to determine which OST or MDS is running out of space. To check the free space and inodes on a client, enter:</para>
-        <screen>grep &apos;[0-9]&apos; /proc/fs/lustre/osc/*/kbytes{free,avail,total}
+tail -30 /tmp/objects.{diskname}</screen>
+          <para>This shows you the OST state. There may be some pre-created orphans. Check for zero-length objects. Any zero-length objects with IDs higher than LAST_ID should be deleted. New objects will be pre-created.</para>
+        </listitem>
+      </orderedlist>
+      <para>If the OST LAST_ID value matches that for the objects existing on the OST, then it is possible the lov_objid file on the MDS is incorrect. Delete the lov_objid file on the MDS and it will be re-created from the LAST_ID on the OSTs.</para>
+      <para>If you determine the LAST_ID file on the OST is incorrect (that is, it does not match the objects that exist or does not match the MDS lov_objid value), then decide on a proper value for LAST_ID.</para>
+      <para>Once you have decided on a proper value for LAST_ID, use this repair procedure.</para>
+      <orderedlist>
+        <listitem>
+          <para>Access:</para>
+          <screen>mount -t ldiskfs /dev/{ostdev} /mnt/ost</screen>
+        </listitem>
+        <listitem>
+          <para>Check the current:</para>
+          <screen>od -Ax -td8 /mnt/ost/O/0/LAST_ID</screen>
+        </listitem>
+        <listitem>
+          <para>Be very safe, only work on backups:</para>
+          <screen>cp /mnt/ost/O/0/LAST_ID /tmp/LAST_ID</screen>
+        </listitem>
+        <listitem>
+          <para>Convert binary to text:</para>
+          <screen>xxd /tmp/LAST_ID /tmp/LAST_ID.asc</screen>
+        </listitem>
+        <listitem>
+          <para>Fix:</para>
+          <screen>vi /tmp/LAST_ID.asc</screen>
+        </listitem>
+        <listitem>
+          <para>Convert to binary:</para>
+          <screen>xxd -r /tmp/LAST_ID.asc /tmp/LAST_ID.new</screen>
+        </listitem>
+        <listitem>
+          <para>Verify:</para>
+          <screen>od -Ax -td8 /tmp/LAST_ID.new</screen>
+        </listitem>
+        <listitem>
+          <para>Replace:</para>
+          <screen>cp /tmp/LAST_ID.new /mnt/ost/O/0/LAST_ID</screen>
+        </listitem>
+        <listitem>
+          <para>Clean up:</para>
+          <screen>umount /mnt/ost</screen>
+        </listitem>
+      </orderedlist>
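As an alternative to hand-editing the xxd dump, the replacement file can be generated from a decimal value. The sketch below assumes LAST_ID holds a single 8-byte little-endian integer, which is consistent with reading it via od -td8 on a little-endian host but is an assumption here; the function name `write_le64` is hypothetical. Always verify the result with od before copying it into place.

```shell
# write_le64 VALUE FILE   (hypothetical helper)
#   Writes VALUE to FILE as 8 bytes, least significant byte first.
#   ASSUMPTION: the on-disk LAST_ID format is one little-endian 64-bit integer.
write_le64() {
    v=$1
    i=0
    : > "$2"                      # truncate or create the output file
    while [ "$i" -lt 8 ]; do
        # emit the low byte through a portable octal escape, then shift right
        printf "\\$(printf '%03o' $(( v % 256 )))" >> "$2"
        v=$(( v / 256 ))
        i=$(( i + 1 ))
    done
}

# Demonstration on a scratch file; in the procedure above this would be
# /tmp/LAST_ID.new, checked with od before the final cp.
tmp=$(mktemp)
write_le64 15028 "$tmp"
od -An -tx1 "$tmp"
rm -f "$tmp"
```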
+    </section>
+    <section remap="h3">
+      <title>26.3.5 Handling/Debugging &quot;<literal>Bind: Address already in use</literal>&quot; Error</title>
+      <para>During startup, Lustre may report a <literal>bind: Address already in use</literal> error and refuse to start the operation. This is caused by a portmap service (often NFS locking) which starts before Lustre and binds to the default port 988. You must have port 988 open in the firewall or iptables for incoming connections on the client, OSS, and MDS nodes. LNET will create three outgoing connections on available, reserved ports to each client-server pair, starting with 1023, 1022 and 1021.</para>
+      <para>Unfortunately, you cannot set sunrpc to avoid port 988. If you receive this error, do the following:</para>
+      <itemizedlist>
+        <listitem>
+          <para>Start Lustre before starting any service that uses sunrpc.</para>
+        </listitem>
+        <listitem>
+          <para>Use a port other than 988 for Lustre. This is configured in /etc/modprobe.conf as an option to the LNET module. For example:</para>
+          <screen>options lnet accept_port=988</screen>
+        </listitem>
+      </itemizedlist>
+      <itemizedlist>
+        <listitem>
+          <para>Add modprobe ptlrpc to your system startup scripts before the service that uses sunrpc. This causes Lustre to bind to port 988 and sunrpc to select a different port.</para>
+        </listitem>
+      </itemizedlist>
+      <note>
+        <para>You can also use the <literal>sysctl</literal> command to keep the NFS client from grabbing the Lustre service port. However, this is a partial workaround as other user-space RPC servers still have the ability to grab the port.</para>
+      </note>
+    </section>
+    <section remap="h3">
+      <title>26.3.6 Handling/Debugging Error &quot;- 28&quot;</title>
+      <para>A Linux error -28 (<literal>ENOSPC</literal>) that occurs during a write or sync operation indicates that an existing file residing on an OST could not be rewritten or updated because the OST was full, or nearly full. To verify if this is the case, on a client on which the OST is mounted, enter:</para>
+      <screen>lfs df -h</screen>
+      <para>To address this issue, you can do one of the following:</para>
+      <itemizedlist>
+        <listitem>
+          <para>Expand the disk space on the OST.</para>
+        </listitem>
+        <listitem>
+          <para>Copy or stripe the file to a less full OST.</para>
+        </listitem>
+      </itemizedlist>
+      <para>A Linux error -28 (<literal>ENOSPC</literal>) that occurs when a new file is being created may indicate that the MDS has run out of inodes and needs to be made larger. Newly created files are not written to full OSTs, while existing files continue to reside on the OST where they were initially created. To view inode information on the MDS, enter:</para>
+      <screen>lfs df -i</screen>
+      <para>Typically, Lustre reports this error to your application. If the application is checking the return code from its function calls, then it decodes it into a textual error message such as <literal>No space left on device</literal>. Both versions of the error message also appear in the system log.</para>
+      <para>For more information about the <literal>lfs df</literal> command, see <xref linkend="dbdoclet.50438209_35838">Checking File System Free Space</xref>.</para>
+      <para>Although it is less efficient, you can also use the grep command to determine which OST or MDS is running out of space. To check the free space and inodes on a client, enter:</para>
+      <screen>grep &apos;[0-9]&apos; /proc/fs/lustre/osc/*/kbytes{free,avail,total}
  grep &apos;[0-9]&apos; /proc/fs/lustre/osc/*/files{free,total}
  grep &apos;[0-9]&apos; /proc/fs/lustre/mdc/*/kbytes{free,avail,total}
-grep &apos;[0-9]&apos; /proc/fs/lustre/mdc/*/files{free,total}
-</screen>
-                <note><para>You can find other numeric error codes along with a short name and text description in /usr/include/asm/errno.h.</para></note>
-
-      </section>
-      <section remap="h3">
-        <title>26.3.7 Triggering <anchor xml:id="dbdoclet.50438198_marker-1291480" xreflabel=""/>Watchdog for PID NNN</title>
-        <para>In some cases, a server node triggers a watchdog timer and this causes a process stack to be dumped to the console along with a Lustre kernel debug log being dumped into /tmp (by default). The presence of a watchdog timer does NOT mean that the thread OOPSed, but rather that it is taking longer time than expected to complete a given operation. In some cases, this situation is expected.</para>
-        <para>For example, if a RAID rebuild is really slowing down I/O on an OST, it might trigger watchdog timers to trip. But another message follows shortly thereafter, indicating that the thread in question has completed processing (after some number of seconds). Generally, this indicates a transient problem. In other cases, it may legitimately signal that a thread is stuck because of a software error (lock inversion, for example).</para>
-        <screen>Lustre: 0:0:(watchdog.c:122:lcw_cb()) 
-</screen>
-        <para>The above message indicates that the watchdog is active for pid 933:</para>
-        <para>It was inactive for 100000ms:</para>
-        <screen>Lustre: 0:0:(linux-debug.c:132:portals_debug_dumpstack()) 
-</screen>
-        <para>Showing stack for process:</para>
-        <screen>933 ll_ost_25     D F896071A     0   933      1    934   932 (L-TLB)
+grep &apos;[0-9]&apos; /proc/fs/lustre/mdc/*/files{free,total}</screen>
+      <note>
+        <para>You can find other numeric error codes along with a short name and text description in <literal>/usr/include/asm/errno.h</literal>.</para>
+      </note>
+    </section>
+    <section remap="h3">
+      <title>26.3.7 Triggering Watchdog for PID NNN</title>
+      <para>In some cases, a server node triggers a watchdog timer and this causes a process stack to be dumped to the console along with a Lustre kernel debug log being dumped into <literal>/tmp</literal> (by default). The presence of a watchdog timer does NOT mean that the thread OOPSed, but rather that it is taking longer than expected to complete a given operation. In some cases, this situation is expected.</para>
+      <para>For example, if a RAID rebuild is really slowing down I/O on an OST, it might trigger watchdog timers to trip. But another message follows shortly thereafter, indicating that the thread in question has completed processing (after some number of seconds). Generally, this indicates a transient problem. In other cases, it may legitimately signal that a thread is stuck because of a software error (lock inversion, for example).</para>
+      <screen>Lustre: 0:0:(watchdog.c:122:lcw_cb()) </screen>
+      <para>The above message indicates that the watchdog is active for pid 933:</para>
+      <para>It was inactive for 100000ms:</para>
+      <screen>Lustre: 0:0:(linux-debug.c:132:portals_debug_dumpstack()) </screen>
+      <para>Showing stack for process:</para>
+      <screen>933 ll_ost_25     D F896071A     0   933      1    934   932 (L-TLB)
  f6d87c60 00000046 00000000 f896071a f8def7cc 00002710 00001822 2da48cae
  0008cf1a f6d7c220 f6d7c3d0 f6d86000 f3529648 f6d87cc4 f3529640 f8961d3d
-00000010 f6d87c9c ca65a13c 00001fff 00000001 00000001 00000000 00000001
-</screen>
-        <para>Call trace:</para>
-        <screen>filter_do_bio+0x3dd/0xb90 [obdfilter]
+00000010 f6d87c9c ca65a13c 00001fff 00000001 00000001 00000000 00000001</screen>
+      <para>Call trace:</para>
+      <screen>filter_do_bio+0x3dd/0xb90 [obdfilter]
  default_wake_function+0x0/0x20
  filter_direct_io+0x2fb/0x990 [obdfilter]
  filter_preprw_read+0x5c5/0xe00 [obdfilter]
@@ -376,80 +481,75 @@ ost_handle+0x14c2/0x42d0 [ost]
  ptlrpc_server_handle_request+0x870/0x10b0 [ptlrpc]
  ptlrpc_main+0x42e/0x7c0 [ptlrpc]
  </screen>
-      </section>
-      <section remap="h3">
-        <title>26.3.8 Handling <anchor xml:id="dbdoclet.50438198_marker-1291503" xreflabel=""/>Timeouts on Initial Lustre Setup</title>
-        <para>If you come across timeouts or hangs on the initial setup of your Lustre system, verify that name resolution for servers and clients is working correctly. Some distributions configure /etc/hosts sts so the name of the local machine (as reported by the &apos;hostname&apos; command) is mapped to local host (127.0.0.1) instead of a proper IP address.</para>
-        <para>This might produce this error:</para>
-        <screen>LustreError:(ldlm_handle_cancel()) received cancel for unknown lock cookie
+    </section>
+    <section remap="h3">
+      <title>26.3.8 Handling Timeouts on Initial Lustre Setup</title>
+      <para>If you come across timeouts or hangs on the initial setup of your Lustre system, verify that name resolution for servers and clients is working correctly. Some distributions configure <literal>/etc/hosts</literal> so the name of the local machine (as reported by the &apos;hostname&apos; command) is mapped to localhost (127.0.0.1) instead of a proper IP address.</para>
+      <para>This might produce this error:</para>
+      <screen>LustreError:(ldlm_handle_cancel()) received cancel for unknown lock cookie
  0xe74021a4b41b954e from nid 0x7f000001 (0:127.0.0.1)
  </screen>
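A quick way to check for this misconfiguration is to see what the local host name resolves to. The following is a minimal sketch, assuming a Linux node where `getent` and `hostname` are available; the function names `is_loopback` and `check_host` are hypothetical, and only IPv4 loopback addresses are detected.

```shell
# is_loopback ADDR: succeeds when ADDR is in the IPv4 loopback range 127.0.0.0/8.
is_loopback() {
    case "$1" in
        127.*) return 0 ;;
        *)     return 1 ;;
    esac
}

# check_host NAME: report what NAME resolves to through the system resolver
# (which consults /etc/hosts on a typical configuration).
check_host() {
    addr=$(getent hosts "$1" | awk '{ print $1; exit }')
    if is_loopback "$addr"; then
        echo "WARNING: $1 resolves to loopback address $addr"
    else
        echo "$1 resolves to ${addr:-no address}"
    fi
}

check_host "$(hostname)"
```

If the warning appears on a server or client, correct the /etc/hosts entry so the host name maps to the address used for Lustre networking.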
-      </section>
-      <section remap="h3">
-        <title>26.3.9 Handling/Debugging <anchor xml:id="dbdoclet.50438198_marker-1291509" xreflabel=""/>&quot;LustreError: xxx went back in time&quot;</title>
-        <para>Each time Lustre changes the state of the disk file system, it records a unique transaction number. Occasionally, when committing these transactions to the disk, the last committed transaction number displays to other nodes in the cluster to assist the recovery. Therefore, the promised transactions remain absolutely safe on the disappeared disk.</para>
-        <para>This situation arises when:</para>
-        <itemizedlist><listitem>
-            <para> You are using a disk device that claims to have data written to disk before it actually does, as in case of a device with a large cache. If that disk device crashes or loses power in a way that causes the loss of the cache, there can be a loss of transactions that you believe are committed. This is a very serious event, and you should run e2fsck against that storage before restarting Lustre.</para>
-          </listitem>
-
-<listitem>
-            <para> As per the Lustre requirement, the shared storage used for failover is completely cache-coherent. This ensures that if one server takes over for another, it sees the most up-to-date and accurate copy of the data. In case of the failover of the server, if the shared storage does not provide cache coherency between all of its ports, then Lustre can produce an error.</para>
-          </listitem>
-
-</itemizedlist>
-        <para>If you know the exact reason for the error, then it is safe to proceed with no further action. If you do not know the reason, then this is a serious issue and you should explore it with your disk vendor.</para>
-        <para>If the error occurs during failover, examine your disk cache settings. If it occurs after a restart without failover, try to determine how the disk can report that a write succeeded, then lose the Data Device corruption or Disk Errors.</para>
-      </section>
-      <section remap="h3">
-        <title>26.3.10 Lustre Error: <anchor xml:id="dbdoclet.50438198_marker-1291517" xreflabel=""/>&quot;Slow Start_Page_Write&quot;</title>
-        <para>The slow start_page_write message appears when the operation takes an extremely long time to allocate a batch of memory pages. Use these pages to receive network traffic first, and then write to disk.</para>
-      </section>
-      <section remap="h3">
-        <title>26.3.11 Drawbacks in Doing <anchor xml:id="dbdoclet.50438198_marker-1291520" xreflabel=""/>Multi-client O_APPEND Writes</title>
-        <para>It is possible to do multi-client O_APPEND writes to a single file, but there are few drawbacks that may make this a sub-optimal solution. These drawbacks are:</para>
-        <itemizedlist><listitem>
-            <para>  Each client needs to take an EOF lock on all the OSTs, as it is difficult to know which OST holds the end of the file until you check all the OSTs. As all the clients are using the same O_APPEND, there is significant locking overhead.</para>
-          </listitem>
-
-<listitem>
-            <para> The second client cannot get all locks until the end of the writing of the first client, as the taking serializes all writes from the clients.</para>
-          </listitem>
-
-<listitem>
-            <para> To avoid deadlocks, the taking of these locks occurs in a known, consistent order. As a client cannot know which OST holds the next piece of the file until the client has locks on all OSTS, there is a need of these locks in case of a striped file.</para>
-          </listitem>
-
-</itemizedlist>
-      </section>
-      <section remap="h3">
-        <title>26.3.12 Slowdown Occurs <anchor xml:id="dbdoclet.50438198_marker-1291921" xreflabel=""/>During Lustre Startup</title>
-        <para>When Lustre starts, the Lustre file system needs to read in data from the disk. For the very first mdsrate run after the reboot, the MDS needs to wait on all the OSTs for object pre-creation. This causes a slowdown to occur when Lustre starts up.</para>
-        <para>After the file system has been running for some time, it contains more data in cache and hence, the variability caused by reading critical metadata from disk is mostly eliminated. The file system now reads data from the cache.</para>
-      </section>
-      <section remap="h3">
-        <title>26.3.13 Log Message 'Out of <anchor xml:id="dbdoclet.50438198_marker-1292113" xreflabel=""/>Memory' on OST</title>
-        <para>When planning the hardware for an OSS node, consider the memory usage of several components in the Lustre system. If insufficient memory is available, an 'out of memory' message can be logged.</para>
-        <para>During normal operation, several conditions indicate insufficient RAM on a server node:</para>
-        <itemizedlist><listitem>
-            <para> kernel &quot;Out of memory&quot; and/or &quot;oom-killer&quot; messages</para>
-          </listitem>
-
-<listitem>
-            <para> Lustre &quot;kmalloc of &apos;mmm&apos; (NNNN bytes) failed...&quot; messages</para>
-          </listitem>
-
-<listitem>
-            <para> Lustre or kernel stack traces showing processes stuck in &quot;try_to_free_pages&quot;</para>
-          </listitem>
-
-</itemizedlist>
-        <para>For information on determining the MDS memory and OSS memory requirements, see <link xl:href="SettingUpLustreSystem.html#50438256_26456">Determining Memory Requirements</link>.</para>
-      </section>
-      <section remap="h3">
-        <title>26.3.14 Setting SCSI <anchor xml:id="dbdoclet.50438198_marker-1294800" xreflabel=""/>I/O Sizes</title>
-        <para>Some SCSI drivers default to a maximum I/O size that is too small for good Lustre performance. we have fixed quite a few drivers, but you may still find that some drivers give unsatisfactory performance with Lustre. As the default value is hard-coded, you need to recompile the drivers to change their default. On the other hand, some drivers may have a wrong default set.</para>
-        <para>If you suspect bad I/O performance and an analysis of Lustre statistics indicates that I/O is not 1 MB, check /sys/block/&lt;device&gt;/queue/max_sectors_kb. If the max_sectors_kb value is less than 1024, set it to at least 1024 to improve performance. If changing max_sectors_kb does not change the I/O size as reported by Lustre, you may want to examine the SCSI driver code.</para>
-      </section>
+    </section>
+    <section remap="h3">
+      <title>26.3.9 Handling/Debugging &quot;LustreError: xxx went back in time&quot;</title>
+      <para>Each time Lustre changes the state of the disk file system, it records a unique transaction number. Periodically, as these transactions are committed to disk, the last committed transaction number is reported to other nodes in the cluster to assist recovery. Transactions that have been reported as committed are therefore expected to remain safe on disk. The error indicates that a previously reported transaction number is no longer present on the disk.</para>
+      <para>This situation arises when:</para>
+      <itemizedlist>
+        <listitem>
+          <para>You are using a disk device that claims to have data written to disk before it actually does, as in the case of a device with a large cache. If that disk device crashes or loses power in a way that causes the loss of the cache, transactions that you believe are committed can be lost. This is a very serious event, and you should run <literal>e2fsck</literal> against that storage before restarting Lustre.</para>
+        </listitem>
+        <listitem>
+          <para>Lustre requires that the shared storage used for failover be completely cache-coherent. This ensures that if one server takes over for another, it sees the most up-to-date and accurate copy of the data. If a server fails over and the shared storage does not provide cache coherency between all of its ports, Lustre can produce this error.</para>
+        </listitem>
+      </itemizedlist>
+      <para>If you know the exact reason for the error, then it is safe to proceed with no further action. If you do not know the reason, then this is a serious issue and you should explore it with your disk vendor.</para>
+      <para>If the error occurs during failover, examine your disk cache settings. If it occurs after a restart without failover, try to determine how the disk can report that a write succeeded and then lose the data, for example through device corruption or disk errors.</para>
+    </section>
+    <section remap="h3">
+      <title>26.3.10 Lustre Error: &quot;<literal>Slow Start_Page_Write</literal>&quot;</title>
+      <para>The slow <literal>start_page_write</literal> message appears when the operation takes an extremely long time to allocate a batch of memory pages. These pages are used to receive network traffic first and are then written to disk.</para>
+    </section>
+    <section remap="h3">
+      <title>26.3.11 Drawbacks in Doing Multi-client O_APPEND Writes</title>
+      <para>It is possible to do multi-client <literal>O_APPEND</literal> writes to a single file, but there are a few drawbacks that may make this a sub-optimal solution. These drawbacks are:</para>
+      <itemizedlist>
+        <listitem>
+          <para>  Each client needs to take an <literal>EOF</literal> lock on all the OSTs, as it is difficult to know which OST holds the end of the file until you check all the OSTs. As all the clients are using the same <literal>O_APPEND</literal>, there is significant locking overhead.</para>
+        </listitem>
+        <listitem>
+          <para> The second client cannot get all locks until the first client has finished writing, because taking the locks serializes all writes from the clients.</para>
+        </listitem>
+        <listitem>
+          <para> To avoid deadlocks, these locks are taken in a known, consistent order. As a client cannot know which OST holds the next piece of the file until the client has locks on all OSTs, all of these locks are needed for a striped file.</para>
+        </listitem>
+      </itemizedlist>
+    </section>
+    <section remap="h3">
+      <title>26.3.12 Slowdown Occurs During Lustre Startup</title>
+      <para>When Lustre starts, the Lustre file system needs to read in data from the disk. For the very first mdsrate run after the reboot, the MDS needs to wait on all the OSTs for object pre-creation. This causes a slowdown to occur when Lustre starts up.</para>
+      <para>After the file system has been running for some time, it contains more data in cache and hence, the variability caused by reading critical metadata from disk is mostly eliminated. The file system now reads data from the cache.</para>
+    </section>
+    <section remap="h3">
+      <title>26.3.13 Log Message &apos;<literal>Out of Memory</literal>&apos; on OST</title>
+      <para>When planning the hardware for an OSS node, consider the memory usage of several components in the Lustre system. If insufficient memory is available, an &apos;out of memory&apos; message can be logged.</para>
+      <para>During normal operation, several conditions indicate insufficient RAM on a server node:</para>
+      <itemizedlist>
+        <listitem>
+          <para> kernel &quot;<literal>Out of memory</literal>&quot; and/or &quot;<literal>oom-killer</literal>&quot; messages</para>
+        </listitem>
+        <listitem>
+          <para> Lustre &quot;<literal>kmalloc of &apos;mmm&apos; (NNNN bytes) failed...</literal>&quot; messages</para>
+        </listitem>
+        <listitem>
+          <para> Lustre or kernel stack traces showing processes stuck in &quot;<literal>try_to_free_pages</literal>&quot;</para>
+        </listitem>
+      </itemizedlist>
+      <para>For information on determining the MDS memory and OSS memory requirements, see <link linkend="dbdoclet.50438256_26456">Determining Memory Requirements</link>.</para>
+    </section>
+    <section remap="h3">
+      <title>26.3.14 Setting SCSI I/O Sizes</title>
+      <para>Some SCSI drivers default to a maximum I/O size that is too small for good Lustre performance. We have fixed quite a few drivers, but you may still find that some drivers give unsatisfactory performance with Lustre. As the default value is hard-coded, you need to recompile the drivers to change their default. On the other hand, some drivers may have a wrong default set.</para>
+      <para>If you suspect bad I/O performance and an analysis of Lustre statistics indicates that I/O is not 1 MB, check <literal>/sys/block/&lt;device&gt;/queue/max_sectors_kb</literal>. If the <literal>max_sectors_kb</literal> value is less than 1024, set it to at least 1024 to improve performance. If changing <literal>max_sectors_kb</literal> does not change the I/O size as reported by Lustre, you may want to examine the SCSI driver code.</para>
+    </section>
    </section>
  </chapter>
diff --git a/TroubleShootingRecovery.xml b/TroubleShootingRecovery.xml

index 2a4af7d..346e4ed 100644 (file)
--- a/TroubleShootingRecovery.xml
+++ b/TroubleShootingRecovery.xml
@@ -1,75 +1,72 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<chapter version="5.0" xml:lang="en-US" xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" xml:id='troubleshootingrecovery'>
+<?xml version='1.0' encoding='UTF-8'?>
+<!-- This document was created with Syntext Serna Free. -->
+<chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="troubleshootingrecovery">
    <info>
-    <title xml:id='troubleshootingrecovery.title'>Troubleshooting Recovery</title>
+    <title xml:id="troubleshootingrecovery.title">Troubleshooting Recovery</title>
    </info>
    <para>This chapter describes what to do if something goes wrong during recovery. It describes:</para>
-
-  <itemizedlist><listitem>
+  <itemizedlist>
+    <listitem>
        <para><xref linkend="dbdoclet.50438225_71141"/></para>
      </listitem>
-
-<listitem>
+    <listitem>
        <para><xref linkend="dbdoclet.50438225_37365"/></para>
      </listitem>
-
-<listitem>
+    <listitem>
        <para><xref linkend="dbdoclet.50438225_12316"/></para>
      </listitem>
-
-</itemizedlist>
-
-    <section xml:id="dbdoclet.50438225_71141">
-      <title>27.1 Recovering from Errors or <anchor xml:id="dbdoclet.50438225_marker-1292184" xreflabel=""/>Corruption on a Backing File System</title>
-      <para>When an OSS, MDS, or MGS server crash occurs, it is not necessary to run e2fsck on the file system. ldiskfs journaling ensures that the file system remains coherent. The backing file systems are never accessed directly from the client, so client crashes are not relevant.</para>
-      <para>The only time it is REQUIRED that e2fsck be run on a device is when an event causes problems that ldiskfs journaling is unable to handle, such as a hardware device failure or I/O error. If the ldiskfs kernel code detects corruption on the disk, it mounts the file system as read-only to prevent further corruption, but still allows read access to the device. This appears as error &quot;-30&quot; (EROFS) in the syslogs on the server, e.g.:</para>
-      <screen>Dec 29 14:11:32 mookie kernel: LDISKFS-fs error (device sdz): ldiskfs_looku\
-p: unlinked inode 5384166 in dir #145170469
-</screen>
-      <para>Dec 29 14:11:32 mookie kernel: Remounting filesystem read-only</para>
-      <para>In such a situation, it is normally required that e2fsck only be run on the bad device before placing the device back into service.</para>
-      <para>In the vast majority of cases, Lustre can cope with any inconsistencies it finds on the disk and between other devices in the file system.</para>
-              <note><para>lfsck is rarely required for Lustre operation.</para></note>
-      <para>For problem analysis, it is strongly recommended that e2fsck be run under a logger, like script, to record all of the output and changes that are made to the file system in case this information is needed later.</para>
-      <para>If time permits, it is also a good idea to first run e2fsck in non-fixing mode (-n option) to assess the type and extent of damage to the file system. The drawback is that in this mode, e2fsck does not recover the file system journal, so there may appear to be file system corruption when none really exists.</para>
-      <para>To address concern about whether corruption is real or only due to the journal not being replayed, you can briefly mount and unmount the ldiskfs filesystem directly on the node with Lustre stopped (NOT via Lustre), using a command similar to:</para>
-      <screen>mount -t ldiskfs /dev/{ostdev} /mnt/ost; umount /mnt/ost
-</screen>
-      <para>This causes the journal to be recovered.</para>
-      <para>The e2fsck utility works well when fixing file system corruption (better than similar file system recovery tools and a primary reason why ldiskfs was chosen over other file systems for Lustre). However, it is often useful to identify the type of damage that has occurred so an ldiskfs expert can make intelligent decisions about what needs fixing, in place of e2fsck.</para>
-      <screen>root# {stop lustre services for this device, if running} 
+  </itemizedlist>
+  <section xml:id="dbdoclet.50438225_71141">
+    <title>27.1 Recovering from Errors or Corruption on a Backing File System</title>
+    <para>When an OSS, MDS, or MGS server crash occurs, it is not necessary to run e2fsck on the file system. <literal>ldiskfs</literal> journaling ensures that the file system remains coherent. The backing file systems are never accessed directly from the client, so client crashes are not relevant.</para>
+    <para>The only time it is REQUIRED that <literal>e2fsck</literal> be run on a device is when an event causes problems that ldiskfs journaling is unable to handle, such as a hardware device failure or I/O error. If the ldiskfs kernel code detects corruption on the disk, it mounts the file system as read-only to prevent further corruption, but still allows read access to the device. This appears as error &quot;-30&quot; (<literal>EROFS</literal>) in the syslogs on the server, e.g.:</para>
+    <screen>Dec 29 14:11:32 mookie kernel: LDISKFS-fs error (device sdz):
+ldiskfs_lookup: unlinked inode 5384166 in dir #145170469
+Dec 29 14:11:32 mookie kernel: Remounting filesystem read-only</screen>
+    <para>In such a situation, <literal>e2fsck</literal> normally only needs to be run on the bad device before placing the device back into service.</para>
+    <para>In the vast majority of cases, Lustre can cope with any inconsistencies it finds on the disk and between other devices in the file system.</para>
+    <note>
+      <para><literal>lfsck</literal> is rarely required for Lustre operation.</para>
+    </note>
+    <para>For problem analysis, it is strongly recommended that <literal>e2fsck</literal> be run under a logger, such as <literal>script</literal>, to record all of the output and changes that are made to the file system in case this information is needed later.</para>
+    <para>If time permits, it is also a good idea to first run <literal>e2fsck</literal> in non-fixing mode (<literal>-n</literal> option) to assess the type and extent of damage to the file system. The drawback is that in this mode, <literal>e2fsck</literal> does not recover the file system journal, so there may appear to be file system corruption when none really exists.</para>
+    <para>To address concern about whether corruption is real or only due to the journal not being replayed, you can briefly mount and unmount the <literal>ldiskfs</literal> filesystem directly on the node with Lustre stopped (NOT via Lustre), using a command similar to:</para>
+    <screen>mount -t ldiskfs /dev/{ostdev} /mnt/ost; umount /mnt/ost</screen>
+    <para>This causes the journal to be recovered.</para>
+    <para>The <literal>e2fsck</literal> utility works well when fixing file system corruption (better than similar file system recovery tools and a primary reason why <literal>ldiskfs</literal> was chosen over other file systems for Lustre). However, it is often useful to identify the type of damage that has occurred so an <literal>ldiskfs</literal> expert can make intelligent decisions about what needs fixing, in place of <literal>e2fsck</literal>.</para>
+    <screen>root# {stop lustre services for this device, if running} 
  root# script /tmp/e2fsck.sda 
  Script started, file is /tmp/e2fsck.sda 
  root# mount -t ldiskfs /dev/sda /mnt/ost 
  root# umount /mnt/ost 
-root# e2fsck -fn /dev/sda   # don&apos;t fix file system, just check for corrupt\
-ion 
+root# e2fsck -fn /dev/sda   # don&apos;t fix file system, just check for corruption 
  : 
  [e2fsck output] 
  : 
-root# e2fsck -fp /dev/sda   # fix filesystem using &quot;prudent&quot; answers (usually\
- &apos;y&apos;)
+root# e2fsck -fp /dev/sda   # fix filesystem using &quot;prudent&quot; answers (usually &apos;y&apos;)
  </screen>
-      <para>In addition, the e2fsprogs package contains the lfsck tool, which does distributed coherency checking for the Lustre file system after e2fsck has been run. Running lfsck is NOT required in a large majority of cases, at a small risk of having some leaked space in the file system. To avoid a lengthy downtime, it can be run (with care) after Lustre is started.</para>
-    </section>
-    <section xml:id="dbdoclet.50438225_37365">
-      <title>27.2 Recovering from <anchor xml:id="dbdoclet.50438225_marker-1292186" xreflabel=""/>Corruption in the Lustre File System</title>
-      <para>In cases where the MDS or an OST becomes corrupt, you can run a distributed check on the file system to determine what sort of problems exist. Use lfsck to correct any defects found.</para>
-      <orderedlist><listitem>
-      <para>Stop the Lustre file system.</para>
-  </listitem><listitem>
-      <para>Run e2fsck -f on the individual MDS / OST that had problems to fix any local file system damage.</para>
-      <para>We recommend running e2fsck under script, to create a log of changes made to the file system in case it is needed later. After e2fsck is run, bring up the file system, if necessary, to reduce the outage window.</para>
-  </listitem><listitem>
-      <para>Run a full e2fsck of the MDS to create a database for lfsck. You <emphasis>must</emphasis> use the -n option for a mounted file system, otherwise you will corrupt the file system.</para>
-      <screen>e2fsck -n -v --mdsdb /tmp/mdsdb /dev/{mdsdev}
+    <para>In addition, the <literal>e2fsprogs</literal> package contains the <literal>lfsck</literal> tool, which does distributed coherency checking for the Lustre file system after <literal>e2fsck</literal> has been run. Running <literal>lfsck</literal> is NOT required in a large majority of cases, at a small risk of having some leaked space in the file system. To avoid a lengthy downtime, it can be run (with care) after Lustre is started.</para>
+  </section>
+  <section xml:id="dbdoclet.50438225_37365">
+    <title>27.2 Recovering from Corruption in the Lustre File System</title>
+    <para>In cases where the MDS or an OST becomes corrupt, you can run a distributed check on the file system to determine what sort of problems exist. Use <literal>lfsck</literal> to correct any defects found.</para>
+    <orderedlist>
+      <listitem>
+        <para>Stop the Lustre file system.</para>
+      </listitem>
+      <listitem>
+        <para>Run <literal>e2fsck -f</literal> on the individual MDS / OST that had problems to fix any local file system damage.</para>
+        <para>We recommend running <literal>e2fsck</literal> under <literal>script</literal>, to create a log of changes made to the file system in case it is needed later. After <literal>e2fsck</literal> is run, bring up the file system, if necessary, to reduce the outage window.</para>
+      </listitem>
+      <listitem>
+        <para>Run a full <literal>e2fsck</literal> of the MDS to create a database for <literal>lfsck</literal>. You <emphasis>must</emphasis> use the <literal>-n</literal> option for a mounted file system, otherwise you will corrupt the file system.</para>
+        <screen>e2fsck -n -v --mdsdb /tmp/mdsdb /dev/{mdsdev}
  </screen>
-      <para>The mdsdb file can grow fairly large, depending on the number of files in the file system (10 GB or more for millions of files, though the actual file size is larger because the file is sparse). It is quicker to write the file to a local file system due to seeking and small writes. Depending on the number of files, this step can take several hours to complete.</para>
-      <para><emphasis role="bold">Example</emphasis></para>
-      <screen>e2fsck -n -v --mdsdb /tmp/mdsdb /dev/sdb
+        <para>The <literal>mdsdb</literal> file can grow fairly large, depending on the number of files in the file system (10 GB or more for millions of files, though the actual file size is larger because the file is sparse). It is quicker to write the file to a local file system due to seeking and small writes. Depending on the number of files, this step can take several hours to complete.</para>
+        <para><emphasis role="bold">Example</emphasis></para>
+        <screen>e2fsck -n -v --mdsdb /tmp/mdsdb /dev/sdb
  e2fsck 1.39.cfs1 (29-May-2006)
-Warning: skipping journal recovery because doing a read-only filesystem che\
-ck.
+Warning: skipping journal recovery because doing a read-only filesystem check.
  lustre-MDT0000 contains a file system with errors, check forced.
  Pass 1: Checking inodes, blocks, and sizes
  MDS: ost_idx 0 max_id 288
@@ -109,22 +106,24 @@ lustre-MDT0000: ******* WARNING: Filesystem still has errors *******
     --------
     387 files
  </screen>
-  </listitem><listitem>
-      <para>Make this file accessible on all OSTs, either by using a shared file system or copying the file to the OSTs. The pdcp command is useful here.</para>
-      <para>The pdcp command (installed with pdsh), can be used to copy files to groups of hosts. Pdcp is available here:</para>
-      <para><link xl:href="http://sourceforge.net/projects/pdsh">http://sourceforge.net/projects/pdsh</link></para>
-  </listitem><listitem>
-      <para>Run a similar e2fsck step on the OSTs. The e2fsck --ostdb command can be run in parallel on all OSTs.</para>
-      <screen>e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/{ostNdb} \/dev/{ostNdev}
+      </listitem>
+      <listitem>
+        <para>Make this file accessible on all OSTs, either by using a shared file system or copying the file to the OSTs. The <literal>pdcp</literal> command is useful here.</para>
+        <para>The <literal>pdcp</literal> command (installed with <literal>pdsh</literal>) can be used to copy files to groups of hosts. <literal>pdcp</literal> is available here:</para>
+        <para><link xl:href="http://sourceforge.net/projects/pdsh">http://sourceforge.net/projects/pdsh</link></para>
+      </listitem>
+      <listitem>
+        <para>Run a similar <literal>e2fsck</literal> step on the OSTs. The <literal>e2fsck --ostdb</literal> command can be run in parallel on all OSTs.</para>
+        <screen>e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/{ostNdb} /dev/{ostNdev}
  </screen>
-      <para>The mdsdb file is read-only in this step; a single copy can be shared by all OSTs.</para>
-              <note><para>If the OSTs do not have shared file system access to the MDS, a stub mdsdb file, {mdsdb}.mdshdr, is generated. This can be used instead of the full mdsdb file.</para></note>
-       <para><emphasis role="bold">Example:</emphasis></para>
-      <screen>[root@oss161 ~]# e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb \ /tmp/ostdb /dev/\
-sda 
+        <para>The <literal>mdsdb</literal> file is read-only in this step; a single copy can be shared by all OSTs.</para>
+        <note>
+          <para>If the OSTs do not have shared file system access to the MDS, a stub <literal>mdsdb</literal> file, <literal>{mdsdb}.mdshdr</literal>, is generated. This can be used instead of the full <literal>mdsdb</literal> file.</para>
+        </note>
+        <para><emphasis role="bold">Example:</emphasis></para>
+        <screen>[root@oss161 ~]# e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb /dev/sda 
  e2fsck 1.39.cfs1 (29-May-2006)
-Warning: skipping journal recovery because doing a read-only filesystem che\
-ck.
+Warning: skipping journal recovery because doing a read-only filesystem check.
  lustre-OST0000 contains a file system with errors, check forced.
  Pass 1: Checking inodes, blocks, and sizes
  Pass 2: Checking directory structure
@@ -160,17 +159,15 @@ lustre-OST0000: ******* WARNING: Filesystem still has errors *******
     0 sockets
     --------
     368 files
- 
-</screen>
-  </listitem><listitem>
-      <para>Make the mdsdb file and all ostdb files available on a mounted client and run lfsck to examine the file system. Optionally, correct the defects found by lfsck.</para>
-      <screen>script /root/lfsck.lustre.log 
-lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/{ost1db} /tmp/{ost2db} ... /lus\
-tre/mount/point
-</screen>
-      <para><emphasis role="bold">Example:</emphasis></para>
-      <screen>script /root/lfsck.lustre.log
-lfsck -n -v --mdsdb /home/mdsdb --ostdb /home/{ost1db} \/mnt/lustre/client/
+ </screen>
+      </listitem>
+      <listitem>
+        <para>Make the <literal>mdsdb</literal> file and all <literal>ostdb</literal> files available on a mounted client and run <literal>lfsck</literal> to examine the file system. Optionally, correct the defects found by <literal>lfsck</literal>.</para>
+        <screen>script /root/lfsck.lustre.log 
+lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/{ost1db} /tmp/{ost2db} ... /lustre/mount/point</screen>
+        <para><emphasis role="bold">Example:</emphasis></para>
+        <screen>script /root/lfsck.lustre.log
+lfsck -n -v --mdsdb /home/mdsdb --ostdb /home/{ost1db} /mnt/lustre/client/
  MDSDB: /home/mdsdb
  OSTDB[0]: /home/ostdb
  MOUNTPOINT: /mnt/lustre/client/
@@ -186,40 +183,41 @@ lfsck: ost_idx 0: pass3: check for orphan objects
  lfsck: ost_idx 0: pass3 OK (321 files total)
  lfsck: pass4: check for duplicate object references
  lfsck: pass4 OK (no duplicates)
-lfsck: fixed 0 errors
-</screen>
-      <para>By default, lfsck reports errors, but it does not repair any inconsistencies found. lfsck checks for three kinds of inconsistencies:</para>
-      <itemizedlist><listitem>
-          <para> Inode exists but has missing objects (dangling inode). This normally happens if there was a problem with an OST.</para>
-        </listitem>
-
-<listitem>
-          <para> Inode is missing but OST has unreferenced objects (orphan object). Normally, this happens if there was a problem with the MDS.</para>
-        </listitem>
-
-<listitem>
-          <para> Multiple inodes reference the same objects. This can happen if the MDS is corrupted or if the MDS storage is cached and loses some, but not all, writes.</para>
-        </listitem>
-
-</itemizedlist>
-      <para>If the file system is in use and being modified while the --mdsdb and --ostdb steps are running, lfsck may report inconsistencies where none exist due to files and objects being created/removed after the database files were collected. Examine the lfsck results closely. You may want to re-run the test.</para>
-  </listitem></orderedlist>
-      <section remap="h3">
-        <title>27.2.1 <anchor xml:id="dbdoclet.50438225_13916" xreflabel=""/>Working with Orphaned <anchor xml:id="dbdoclet.50438225_marker-1292187" xreflabel=""/>Objects</title>
-        <para>The easiest problem to resolve is that of orphaned objects. When the -l option for lfsck is used, these objects are linked to new files and put into lost+found in the Lustre file system, where they can be examined and saved or deleted as necessary. If you are certain the objects are not useful, run lfsck with the -d option to delete orphaned objects and free up any space they are using.</para>
-        <para>To fix dangling inodes, use lfsck with the -c option to create new, zero-length objects on the OSTs. These files read back with binary zeros for stripes that had objects re-created. Even without lfsck repair, these files can be read by entering:</para>
-        <screen>dd if=/lustre/bad/file of=/new/file bs=4k conv=sync,noerror
-</screen>
-        <para>Because it is rarely useful to have files with large holes in them, most users delete these files after reading them (if useful) and/or restoring them from backup.</para>
-                <note><para>You cannot write to the holes of such files without having lfsck re-create the objects. Generally, it is easier to delete these files and restore them from backup.</para></note>
-        <para>To fix inodes with duplicate objects, use lfsck with the -c option to copy the duplicate object to a new object and assign it to a file. One file will be okay and the duplicate will likely contain garbage. By itself, lfsck cannot tell which file is the usable one.</para>
-      </section>
+lfsck: fixed 0 errors</screen>
+        <para>By default, <literal>lfsck</literal> reports errors, but it does not repair any inconsistencies found. <literal>lfsck</literal> checks for three kinds of inconsistencies:</para>
+        <itemizedlist>
+          <listitem>
+            <para>Inode exists but has missing objects (dangling inode). This normally happens if there was a problem with an OST.</para>
+          </listitem>
+          <listitem>
+            <para>Inode is missing but OST has unreferenced objects (orphan object). Normally, this happens if there was a problem with the MDS.</para>
+          </listitem>
+          <listitem>
+            <para>Multiple inodes reference the same objects. This can happen if the MDS is corrupted or if the MDS storage is cached and loses some, but not all, writes.</para>
+          </listitem>
+        </itemizedlist>
+        <para>If the file system is in use and being modified while the <literal>--mdsdb</literal> and <literal>--ostdb</literal> steps are running, <literal>lfsck</literal> may report inconsistencies where none exist due to files and objects being created/removed after the database files were collected. Examine the <literal>lfsck</literal> results closely. You may want to re-run the test.</para>
+      </listitem>
+    </orderedlist>
+    <section xml:id='dbdoclet.50438225_13916'>
+      <title>27.2.1 Working with Orphaned Objects</title>
+      <para>The easiest problem to resolve is that of orphaned objects. When the <literal>-l</literal> option for <literal>lfsck</literal> is used, these objects are linked to new files and put into <literal>lost+found</literal> in the Lustre file system, where they can be examined and saved or deleted as necessary. If you are certain the objects are not useful, run <literal>lfsck</literal> with the <literal>-d</literal> option to delete orphaned objects and free up any space they are using.</para>
+      <para>To fix dangling inodes, use <literal>lfsck</literal> with the <literal>-c</literal> option to create new, zero-length objects on the OSTs. These files read back with binary zeros for stripes that had objects re-created. Even without <literal>lfsck</literal> repair, these files can be read by entering:</para>
+      <screen>dd if=/lustre/bad/file of=/new/file bs=4k conv=sync,noerror</screen>
+      <para>Because it is rarely useful to have files with large holes in them, most users delete these files after reading them (if useful) and/or restoring them from backup.</para>
+      <note>
+        <para>You cannot write to the holes of such files without having <literal>lfsck</literal> re-create the objects. Generally, it is easier to delete these files and restore them from backup.</para>
+      </note>
+      <para>To fix inodes with duplicate objects, use <literal>lfsck</literal> with the <literal>-c</literal> option to copy the duplicate object to a new object and assign it to a file. One file will be okay and the duplicate will likely contain garbage. By itself, <literal>lfsck</literal> cannot tell which file is the usable one.</para>
      </section>
-    <section xml:id="dbdoclet.50438225_12316">
-      <title>27.3 Recovering from an <anchor xml:id="dbdoclet.50438225_marker-1292768" xreflabel=""/>Unavailable OST</title>
-      <para>One of the most common problems encountered in a Lustre environment is when an OST becomes unavailable, because of a network partition, OSS node crash, etc. When this happens, the OST's clients pause and wait for the OST to become available again, either on the primary OSS or a failover OSS. When the OST comes back online, Lustre starts a recovery process to enable clients to reconnect to the OST. Lustre servers put a limit on the time they will wait in recovery for clients to reconnect. The timeout length is determined by the obd_timeout parameter.</para>
-      <para>During recovery, clients reconnect and replay their requests serially, in the same order they were done originally. Until a client receives a confirmation that a given transaction has been written to stable storage, the client holds on to the transaction, in case it needs to be replayed. Periodically, a progress message prints to the log, stating how_many/expected clients have reconnected. If the recovery is aborted, this log shows how many clients managed to reconnect. When all clients have completed recovery, or if the recovery timeout is reached, the recovery period ends and the OST resumes normal request processing.</para>
-      <para>If some clients fail to replay their requests during the recovery period, this will not stop the recovery from completing. You may have a situation where the OST recovers, but some clients are not able to participate in recovery (e.g. network problems or client failure), so they are evicted and their requests are not replayed. This would result in any operations on the evicted clients failing, including in-progress writes, which would cause cached writes to be lost. This is a normal outcome; the recovery cannot wait indefinitely, or the file system would be hung any time a client failed. The lost transactions are an unfortunate result of the recovery process.</para>
-      <note><para>The version-based recovery (VBR) feature enables a failed client to be &apos;&apos;skipped&apos;&apos;, so remaining clients can replay their requests, resulting in a more successful recovery from a downed OST. For more information about the VBR feature, see <xref linkend='lustrerecovery'/>(Version-based Recovery).</para></note>
+  </section>
+  <section xml:id="dbdoclet.50438225_12316">
+    <title>27.3 Recovering from an Unavailable OST</title>
+    <para>One of the most common problems encountered in a Lustre environment is an OST becoming unavailable, for example because of a network partition or an OSS node crash. When this happens, the OST&apos;s clients pause and wait for the OST to become available again, either on the primary OSS or a failover OSS. When the OST comes back online, Lustre starts a recovery process to enable clients to reconnect to the OST. Lustre servers put a limit on the time they will wait in recovery for clients to reconnect. The timeout length is determined by the <literal>obd_timeout</literal> parameter.</para>
+    <para>During recovery, clients reconnect and replay their requests serially, in the same order they were done originally. Until a client receives a confirmation that a given transaction has been written to stable storage, the client holds on to the transaction, in case it needs to be replayed. Periodically, a progress message prints to the log, stating how_many/expected clients have reconnected. If the recovery is aborted, this log shows how many clients managed to reconnect. When all clients have completed recovery, or if the recovery timeout is reached, the recovery period ends and the OST resumes normal request processing.</para>
+    <para>If some clients fail to replay their requests during the recovery period, this will not stop the recovery from completing. You may have a situation where the OST recovers, but some clients are not able to participate in recovery (e.g. network problems or client failure), so they are evicted and their requests are not replayed. This would result in any operations on the evicted clients failing, including in-progress writes, which would cause cached writes to be lost. This is a normal outcome; the recovery cannot wait indefinitely, or the file system would be hung any time a client failed. The lost transactions are an unfortunate result of the recovery process.</para>
+    <note>
+      <para>The version-based recovery (VBR) feature enables a failed client to be &quot;skipped&quot;, so remaining clients can replay their requests, resulting in a more successful recovery from a downed OST. For more information about the VBR feature, see <xref linkend="lustrerecovery"/> (Version-based Recovery).</para>
+    </note>
    </section>
  </chapter>
I_LustreIntro.xml
InstallingLustreFromSourceCode.xml
LustreDebugging.xml
LustreProc.xml
LustreRecovery.xml
LustreTroubleshooting.xml
TroubleShootingRecovery.xml