LustreTroubleshooting.xml

   1 <?xml version='1.0' encoding='UTF-8'?><chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en-US" xml:id="lustretroubleshooting">
   2   <title xml:id="lustretroubleshooting.title">Lustre File System Troubleshooting</title>
   3   <para>This chapter provides information about troubleshooting a Lustre file system, submitting a
   4     bug to the Jira bug tracking system, and Lustre file system performance tips. It includes the
   5     following sections:</para>
   6   <itemizedlist>
   7     <listitem>
   8       <para><xref linkend="dbdoclet.50438198_11171"/></para>
   9     </listitem>
  10     <listitem>
  11       <para><xref linkend="dbdoclet.50438198_30989"/></para>
  12     </listitem>
  13     <listitem>
  14       <para><xref linkend="dbdoclet.50438198_93109"/></para>
  15     </listitem>
  16   </itemizedlist>
  17   <section xml:id="dbdoclet.50438198_11171">
  18       <title><indexterm><primary>troubleshooting</primary></indexterm>
  19           <indexterm><primary>lustre</primary><secondary>troubleshooting</secondary><see>troubleshooting</see></indexterm>
  20           <indexterm><primary>lustre</primary><secondary>errors</secondary><see>troubleshooting</see></indexterm>
  21           <indexterm><primary>errors</primary><see>troubleshooting</see></indexterm>
  22           Lustre Error Messages</title>
  23     <para>Several resources are available to help troubleshoot an issue in a Lustre file system.
  24       This section describes error numbers, error messages and logs.</para>
  25     <section remap="h3">
  26       <title><indexterm><primary>troubleshooting</primary><secondary>error numbers</secondary></indexterm>Error Numbers</title>
   27       <para>Error numbers are defined by the Linux operating system in
   28           <literal>/usr/include/asm-generic/errno.h</literal> (values 1 through 34 are in
   29         <literal>errno-base.h</literal> in the same directory). The Lustre software does not use all
   30         of the available Linux error numbers, and the exact meaning of an error number depends on where
   31         it is used. The following table summarizes the basic errors that Lustre file system users may encounter.</para>
  32       <informaltable frame="all">
  33         <tgroup cols="3">
  34           <colspec colname="c1" colwidth="33*"/>
  35           <colspec colname="c2" colwidth="33*"/>
  36           <colspec colname="c3" colwidth="33*"/>
  37           <thead>
  38             <row>
  39               <entry>
  40                 <para><emphasis role="bold">Error Number</emphasis></para>
  41               </entry>
  42               <entry>
  43                 <para><emphasis role="bold">Error Name</emphasis></para>
  44               </entry>
  45               <entry>
  46                 <para><emphasis role="bold">Description</emphasis></para>
  47               </entry>
  48             </row>
  49           </thead>
  50           <tbody>
  51             <row>
  52               <entry>
  53                 <para> -1</para>
  54               </entry>
  55               <entry>
  56                 <literal> -EPERM </literal>
  57               </entry>
  58               <entry>
  59                 <para> Permission is denied.</para>
  60               </entry>
  61             </row>
  62             <row>
  63               <entry> -2 </entry>
  64               <entry>
  65                 <literal> -ENOENT </literal>
  66               </entry>
  67               <entry>
  68                 <para> The requested file or directory does not exist.</para>
  69               </entry>
  70             </row>
  71             <row>
  72               <entry>
  73                 <para> -4</para>
  74               </entry>
  75               <entry>
  76                 <literal> -EINTR </literal>
  77               </entry>
  78               <entry>
   79                 <para> The operation was interrupted (usually by CTRL-C or by another process killing it).</para>
  80               </entry>
  81             </row>
  82             <row>
  83               <entry>
  84                 <para> -5</para>
  85               </entry>
  86               <entry>
  87                 <literal> -EIO </literal>
  88               </entry>
  89               <entry>
  90                 <para> The operation failed with a read or write error.</para>
  91               </entry>
  92             </row>
  93             <row>
  94               <entry>
  95                 <para> -19</para>
  96               </entry>
  97               <entry>
  98                 <literal> -ENODEV </literal>
  99               </entry>
 100               <entry>
 101                 <para> No such device is available. The server stopped or failed over.</para>
 102               </entry>
 103             </row>
 104             <row>
 105               <entry>
 106                 <para> -22</para>
 107               </entry>
 108               <entry>
 109                 <literal> -EINVAL </literal>
 110               </entry>
 111               <entry>
 112                 <para> The parameter contains an invalid value.</para>
 113               </entry>
 114             </row>
 115             <row>
 116               <entry>
 117                 <para> -28</para>
 118               </entry>
 119               <entry>
 120                 <literal> -ENOSPC </literal>
 121               </entry>
 122               <entry>
  123                 <para> The file system is out of space or out of inodes. Use <literal>lfs df</literal> (to query the amount of file system space) or <literal>lfs df -i</literal> (to query the number of inodes).</para>
 124               </entry>
 125             </row>
 126             <row>
 127               <entry>
 128                 <para> -30</para>
 129               </entry>
 130               <entry>
 131                 <literal> -EROFS </literal>
 132               </entry>
 133               <entry>
 134                 <para> The file system is read-only, likely due to a detected error.</para>
 135               </entry>
 136             </row>
 137             <row>
 138               <entry>
 139                 <para> -43</para>
 140               </entry>
 141               <entry>
 142                 <literal> -EIDRM </literal>
 143               </entry>
 144               <entry>
  145                 <para> The UID/GID does not match any known UID/GID on the MDS. Update <literal>/etc/passwd</literal> and <literal>/etc/group</literal> on the MDS to add the missing user or group.</para>
 146               </entry>
 147             </row>
 148             <row>
 149               <entry>
 150                 <para> -107</para>
 151               </entry>
 152               <entry>
 153                 <literal> -ENOTCONN </literal>
 154               </entry>
 155               <entry>
 156                 <para> The client is not connected to this server.</para>
 157               </entry>
 158             </row>
 159             <row>
 160               <entry>
 161                 <para> -110</para>
 162               </entry>
 163               <entry>
 164                 <literal> -ETIMEDOUT </literal>
 165               </entry>
 166               <entry>
 167                 <para> The operation took too long and timed out.</para>
 168               </entry>
 169             </row>
 170             <row>
 171               <entry>
 172                 <para> -122</para>
 173               </entry>
 174               <entry>
 175                 <literal> -EDQUOT </literal>
 176               </entry>
 177               <entry>
 178                 <para> The operation exceeded the user disk quota and was aborted.</para>
 179               </entry>
 180             </row>
 181           </tbody>
 182         </tgroup>
 183       </informaltable>
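      <para>When scanning logs, it can help to translate an error number back to its name. The
        following shell function is a minimal sketch that covers only the errors listed in the
        table above. The function name <literal>lustre_errno_name</literal> is hypothetical and is
        not part of the Lustre software.</para>
      <screen># Hypothetical helper: map an error number from the table above to its name.
# A leading minus sign is accepted, so both "28" and "-28" work.
lustre_errno_name() {
    case "${1#-}" in
        1)   echo EPERM ;;
        2)   echo ENOENT ;;
        4)   echo EINTR ;;
        5)   echo EIO ;;
        19)  echo ENODEV ;;
        22)  echo EINVAL ;;
        28)  echo ENOSPC ;;
        30)  echo EROFS ;;
        43)  echo EIDRM ;;
        107) echo ENOTCONN ;;
        110) echo ETIMEDOUT ;;
        122) echo EDQUOT ;;
        *)   echo UNKNOWN ;;
    esac
}

lustre_errno_name -28</screen>
      <para>For the input <literal>-28</literal>, the function prints <literal>ENOSPC</literal>.
        Numbers that are not in the table print <literal>UNKNOWN</literal>.</para>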
 184     </section>
 185     <section xml:id="dbdoclet.50438198_40669">
 186       <title><indexterm><primary>troubleshooting</primary><secondary>error messages</secondary></indexterm>Viewing Error Messages</title>
  187       <para>Because Lustre software runs in the kernel, only a numeric error code is returned to the
  188         application; this code is just an indication of the problem. Refer to the kernel console
  189         log (<literal>dmesg</literal>) for all recent kernel messages from that node. On the node,
  190           <literal>/var/log/messages</literal> holds a log of all messages for at least the past
  191         day.</para>
  192       <para>In the console log, an error message begins with &quot;LustreError&quot; and provides a short description of:</para>
 193       <itemizedlist>
 194         <listitem>
 195           <para>What the problem is</para>
 196         </listitem>
 197         <listitem>
 198           <para>Which process ID had trouble</para>
 199         </listitem>
 200         <listitem>
 201           <para>Which server node it was communicating with, and so on.</para>
 202         </listitem>
 203       </itemizedlist>
  204       <para>Lustre logs are dumped to the path specified in <literal>/proc/sys/lnet/debug_path</literal>.</para>
 205       <para>Collect the first group of messages related to a problem, and any messages that precede &quot;LBUG&quot; or &quot;assertion failure&quot; errors. Messages that mention server nodes (OST or MDS) are specific to that server; you must collect similar messages from the relevant server console logs.</para>
  206       <para>A separate Lustre debug log holds information about actions taken by the Lustre software
  207         for a short period of time, which in turn depends on the processes running on the Lustre node.
  208         To extract the debug log on each of the nodes, run:</para>
 209       <screen>$ lctl dk <replaceable>filename</replaceable></screen>
 210       <note>
 211         <para>LBUG freezes the thread to allow capture of the panic stack. A system reboot is needed to clear the thread.</para>
 212       </note>
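      <para>To pull only the Lustre-related lines out of a console log, the output can be filtered
        for the strings mentioned in this section. This is a sketch only: the function name
        <literal>lustre_console_errors</literal> is hypothetical, and the search pattern is an
        assumption that may need to be extended for your site.</para>
      <screen># Hypothetical helper: keep only the lines on standard input that mention
# LustreError, LBUG or an assertion.
lustre_console_errors() {
    grep -E 'LustreError|LBUG|ASSERT'
}

# Example use on a node (reads the kernel console log):
#   dmesg | lustre_console_errors</screen>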
 213     </section>
 214   </section>
 215   <section xml:id="dbdoclet.50438198_30989">
 216       <title><indexterm>
 217         <primary>troubleshooting</primary>
 218         <secondary>reporting bugs</secondary>
 219       </indexterm><indexterm>
 220         <primary>reporting bugs</primary>
 221         <see>troubleshooting</see>
 222       </indexterm> Reporting a Lustre File System Bug</title>
 223     <para>If you cannot resolve a problem by troubleshooting your Lustre file system, other options are:<itemizedlist>
 224         <listitem>
 225           <para>Post a question to the <link xmlns:xlink="http://www.w3.org/1999/xlink"
  226               xlink:href="https://lists.01.org/mailman/listinfo/hpdd-discuss">hpdd-discuss</link>
 227             email list or search the archives for information about your issue.</para>
 228         </listitem>
 229         <listitem>
 230           <para>Submit a ticket to the <link xmlns:xlink="http://www.w3.org/1999/xlink"
 231               xlink:href="https://jira.whamcloud.com/secure/Dashboard.jspa"
 232                 >Jira</link><abbrev><superscript>*</superscript></abbrev> bug tracking and project
 233             management tool used for the Lustre software project. If you are a first-time user,
 234             you'll need to open an account by clicking on <emphasis role="bold">Sign up</emphasis>
 235             on the Welcome page.</para>
 236         </listitem>
 237       </itemizedlist> To submit a Jira ticket, follow these steps:<orderedlist>
 238         <listitem>
 239           <para>To avoid filing a duplicate ticket, search for existing tickets for your issue.
 240               <emphasis role="italic">For search tips, see <xref
 241                 xmlns:xlink="http://www.w3.org/1999/xlink" linkend="section_jj2_4b1_kk"
 242               />.</emphasis></para>
 243         </listitem>
 244         <listitem>
 245           <para>To create a ticket, click <emphasis role="bold">+Create Issue</emphasis> in the
 246             upper right corner. <emphasis role="italic">Create a separate ticket for each issue you
 247               wish to submit.</emphasis></para>
 248         </listitem>
 249         <listitem>
 250           <para>In the form displayed, enter the following information:<itemizedlist>
 251               <listitem>
 252                 <para><emphasis role="italic">Project</emphasis> - Select <emphasis role="bold"
 253                     >Lustre</emphasis> or <emphasis role="bold">Lustre Documentation</emphasis> or
 254                   an appropriate project.</para>
 255               </listitem>
 256               <listitem>
 257                 <para><emphasis role="italic">Issue type</emphasis> - Select <emphasis role="bold"
 258                     >Bug</emphasis>.</para>
 259               </listitem>
 260               <listitem>
 261                 <para><emphasis role="italic">Summary</emphasis> - Enter a short description of the
 262                   issue. Use terms that would be useful for someone searching for a similar issue. A
 263                   LustreError or ASSERT/panic message often makes a good summary.</para>
 264               </listitem>
 265               <listitem>
 266                 <para><emphasis role="italic">Affects version(s)</emphasis> - Select your Lustre
 267                   release.</para>
 268               </listitem>
 269               <listitem>
 270                 <para><emphasis role="italic">Environment</emphasis> - Enter your kernel with
 271                   version number.</para>
 272               </listitem>
 273               <listitem>
 274                 <para><emphasis role="italic">Description</emphasis> - Include a detailed
 275                   description of <emphasis role="italic">visible symptoms</emphasis> and, if
 276                   possible, <emphasis role="italic">how the problem is produced</emphasis>. Other
 277                   useful information may include <emphasis role="italic">the behavior you expect to
 278                     see</emphasis> and <emphasis role="italic">what you have tried so far to
 279                     diagnose the problem</emphasis>.</para>
 280               </listitem>
 281               <listitem>
 282                 <para><emphasis role="italic">Attachments</emphasis> - Attach log sources such as
 283                   Lustre debug log dumps (see <xref xmlns:xlink="http://www.w3.org/1999/xlink"
 284                     linkend="dbdoclet.50438274_15874"/>), syslogs, or console logs. <emphasis
 285                     role="italic"><emphasis role="bold">Note:</emphasis></emphasis> Lustre debug
 286                   logs must be processed using <code>lctl df</code> prior to attaching to a Jira
 287                   ticket. For more information, see <xref xmlns:xlink="http://www.w3.org/1999/xlink"
 288                     linkend="dbdoclet.50438274_62472"/>. </para>
 289               </listitem>
 290             </itemizedlist>Other fields in the form are used for project tracking and are irrelevant
 291             to reporting an issue. You can leave these in their default state.</para>
 292         </listitem>
 293       </orderedlist></para>
 294     <section xml:id="section_jj2_4b1_kk">
 295       <title>Searching the Jira<superscript>*</superscript> Bug Tracker for Duplicate
 296         Tickets</title>
 297       <para>Before submitting a ticket, always search the Jira bug tracker for an existing ticket
  298         for your issue. This avoids duplicating effort and may immediately provide you with a
 299         solution to your problem. </para>
 300       <para>To do a search in the Jira bug tracker, select the <emphasis role="bold"
 301           >Issues</emphasis> tab and click on <emphasis role="bold">New filter</emphasis>. Use the
 302         filters provided to select criteria for your search. To search for specific text, enter the
 303         text in the "Contains text" field and click the magnifying glass icon.</para>
  304       <para>When searching for text such as an ASSERTION or LustreError message, you can remove NIDs
 305         and other installation-specific text from your search string by following the example
 306         below.</para>
 307       <para><emphasis role="italic">Original error message:</emphasis></para>
 308       <para><code>"(filter_io_26.c:</code><emphasis role="bold"
 309           >791</emphasis><code>:filter_commitrw_write()) ASSERTION(oti-&gt;oti_transno
 310           &lt;=obd-&gt;obd_last_committed) failed: oti_transno </code><emphasis role="bold"
 311           >752</emphasis>
 312         <code>last_committed </code><emphasis role="bold">750</emphasis><code>"</code></para>
  313       <para><emphasis role="italic">Optimized search string:</emphasis></para>
  314       <para><code>"(filter_io_26.c:" </code><emphasis role="bold">AND</emphasis>
  315         <code>":filter_commitrw_write()) ASSERTION(oti-&gt;oti_transno
  316           &lt;=obd-&gt;obd_last_committed) failed:"</code></para>
 317     </section>
 318   </section>
 319   <section xml:id="dbdoclet.50438198_93109">
 320     <title><indexterm>
 321         <primary>troubleshooting</primary>
 322         <secondary>common problems</secondary>
 323       </indexterm>Common Lustre File System Problems</title>
 324     <para>This section describes how to address common issues encountered with a Lustre file
 325       system.</para>
 326     <section remap="h3">
 327       <title>OST Object is Missing or Damaged</title>
 328       <para>If the OSS fails to find an object or finds a damaged object, this message appears:</para>
 329       <para><screen>OST object missing or damaged (OST &quot;ost1&quot;, object 98148, error -2)</screen></para>
 330       <para>If the reported error is -2 (<literal>-ENOENT</literal>, or &quot;No such file or directory&quot;), then the object is missing. This can occur either because the MDS and OST are out of sync, or because an OST object was corrupted and deleted.</para>
 331       <para>If you have recovered the file system from a disk failure by using e2fsck, then unrecoverable objects may have been deleted or moved to /lost+found on the raw OST partition. Because files on the MDS still reference these objects, attempts to access them produce this error.</para>
 332       <para>If you have recovered a backup of the raw MDS or OST partition, then the restored partition is very likely to be out of sync with the rest of your cluster. No matter which server partition you restored from backup, files on the MDS may reference objects which no longer exist (or did not exist when the backup was taken); accessing those files produces this error.</para>
 333       <para>If neither of those descriptions is applicable to your situation, then it is possible
 334         that you have discovered a programming error that allowed the servers to get out of sync.
 335         Please submit a Jira ticket (see <xref xmlns:xlink="http://www.w3.org/1999/xlink"
 336           linkend="dbdoclet.50438198_30989"/>).</para>
 337       <para>If the reported error is anything else (such as -5, &quot;<literal>I/O error</literal>&quot;), it likely indicates a storage failure. The low-level file system returns this error if it is unable to read from the storage device.</para>
 338       <para><emphasis role="bold">Suggested Action</emphasis></para>
 339       <para>If the reported error is -2, you can consider checking in <literal>/lost+found</literal> on your raw OST device, to see if the missing object is there. However, it is likely that this object is lost forever, and that the file that references the object is now partially or completely lost. Restore this file from backup, or salvage what you can and delete it.</para>
 340       <para>If the reported error is anything else, then you should immediately inspect this server for storage problems.</para>
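      <para>The decision described above can be sketched as a small shell function that reads the
        message and prints the matching suggestion. The function name
        <literal>ost_object_advice</literal> is hypothetical, and the parsing assumes the message
        has the same layout as the example shown in this section.</para>
      <screen># Hypothetical helper: read one "OST object missing or damaged" message on
# standard input and print the suggested action from this section.
ost_object_advice() {
    # Pull the object ID and the error number out of the message text.
    set -- $(sed -n 's/.*object \([0-9][0-9]*\), error \(-[0-9][0-9]*\)).*/\1 \2/p')
    case "$2" in
        -2) echo "object $1: missing (-ENOENT); check /lost+found on the raw OST device" ;;
        "") echo "no OST object error found in input" ;;
        *)  echo "object $1: error $2; inspect the server for storage problems" ;;
    esac
}

echo 'OST object missing or damaged (OST "ost1", object 98148, error -2)' | ost_object_advice</screen>
      <para>For the example message, the function reports object 98148 as missing and points to
        <literal>/lost+found</literal>. Any other error number produces the storage warning
        instead.</para>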
 341     </section>
 342     <section remap="h3">
 343       <title>OSTs Become Read-Only</title>
 344       <para>If the SCSI devices are inaccessible to the Lustre file system at the block device
 345         level, then <literal>ldiskfs</literal> remounts the device read-only to prevent file system
  346         corruption. This is normal behavior. The status in
 347           <literal>/proc/fs/lustre/health_check</literal> also shows &quot;not healthy&quot; on the
 348         affected nodes.</para>
 349       <para>To determine what caused the &quot;not healthy&quot; condition:</para>
 350       <itemizedlist>
 351         <listitem>
 352           <para>Examine the consoles of all servers for any error indications</para>
 353         </listitem>
 354         <listitem>
 355           <para>Examine the syslogs of all servers for any LustreErrors or <literal>LBUG</literal></para>
 356         </listitem>
 357         <listitem>
  358           <para>Check the health of your system hardware and network. (Are the disks working as expected? Is the network dropping packets?)</para>
 359         </listitem>
 360         <listitem>
 361           <para>Consider what was happening on the cluster at the time. Does this relate to a specific user workload or a system load condition? Is the condition reproducible? Does it happen at a specific time (day, week or month)?</para>
 362         </listitem>
 363       </itemizedlist>
 364       <para>To recover from this problem, you must restart Lustre services using these file systems. There is no other way to know that the I/O made it to disk, and the state of the cache may be inconsistent with what is on disk.</para>
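      <para>A node's health status can be checked from a script before working through the list
        above. The following is a minimal sketch; the function name
        <literal>lustre_health</literal> is hypothetical, and it assumes that the status file
        contains the text &quot;not healthy&quot; when a problem has been detected.</para>
      <screen># Hypothetical helper: report the state recorded in a health_check file.
# The optional argument overrides the default path, which is useful for testing.
lustre_health() {
    file="${1:-/proc/fs/lustre/health_check}"
    [ -r "$file" ] || { echo "cannot read $file"; return 1; }
    if grep -q "not healthy" "$file"; then
        echo "not healthy: check consoles, syslogs, hardware and network"
    else
        echo "healthy"
    fi
}

# Example use on a server node:
#   lustre_health</screen>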
 365     </section>
 366     <section remap="h3">
 367       <title>Identifying a Missing OST</title>
  368       <para>If an OST is missing for any reason, you may need to know which files are affected. Although an OST is missing, the file system should remain operational. From any mounted client node, generate a list of files that reside on the affected OST. It is advisable to mark the missing OST as &apos;unavailable&apos; so clients and the MDS do not time out trying to contact it.</para>
 369       <orderedlist>
 370         <listitem>
 371           <para>Generate a list of devices and determine the OST&apos;s device number. Run:</para>
 372           <screen>$ lctl dl </screen>
  373           <para>The <literal>lctl dl</literal> command output lists the device name and number, along with the device UUID and the number of references on the device.</para>
  374         </listitem>
  375         <listitem>
  376           <para>Deactivate the OST (on the MDS). Run:</para>
  377           <screen>$ lctl --device <replaceable>lustre_device_number</replaceable> deactivate</screen>
  378           <para>The OST device number or device name is listed in the output of the <literal>lctl dl</literal> command.</para>
  379           <para>The <literal>deactivate</literal> command prevents clients from creating new objects on the specified OST, although you can still access the OST for reading.</para>
  380           <note>
  381             <para>If the OST later becomes available, it needs to be reactivated. Run:</para>
 382             <screen># lctl --device <replaceable>lustre_device_number</replaceable> activate</screen>
 383           </note>
 384         </listitem>
 385         <listitem>
  386           <para>Determine all files that are striped over the missing OST. Run:</para>
  387           <screen># lfs getstripe -r -O {OST_UUID} /mountpoint</screen>
  388           <para>This returns a simple list of filenames from the affected file system.</para>
  389         </listitem>
  390         <listitem>
  391           <para>If necessary, you can read the valid parts of a striped file. Run:</para>
  392           <screen># dd if=filename of=new_filename bs=4k conv=sync,noerror</screen>
  393         </listitem>
  394         <listitem>
  395           <para>You can delete these files with the <literal>unlink</literal> or <literal>munlink</literal> command.</para>
 396           <screen># unlink|munlink filename {filename ...} </screen>
 397           <note>
 398             <para>There is no functional difference between the <literal>unlink</literal> and <literal>munlink</literal> commands. The unlink command is for newer Linux distributions. You can run <literal>munlink</literal> if <literal>unlink</literal> is not available.</para>
 399             <para>When you run the <literal>unlink</literal> or <literal>munlink</literal> command, the file on the MDS is permanently removed.</para>
 400           </note>
 401         </listitem>
 402         <listitem>
  403           <para>If you need to know specifically which parts of the file are missing data, first determine the file layout (striping pattern), which includes the index of the missing OST. Run:</para>
 404           <screen># lfs getstripe -v {filename}</screen>
 405         </listitem>
 406         <listitem>
  407           <para>Use this computation to determine which offsets in the file are affected: [(C*N + X)*S, (C*N + X)*S + S - 1], N = { 0, 1, 2, ...}</para>
 408           <para>where:</para>
 409           <para>C = stripe count</para>
 410           <para>S = stripe size</para>
 411           <para>X = index of bad OST for this file</para>
 412         </listitem>
 413       </orderedlist>
  414       <para>For example, for a file with 2 stripes and a stripe size of 1M, where the bad OST is at index 0, the holes in the file are at: [(2*N + 0)*1M, (2*N + 0)*1M + 1M - 1], N = { 0, 1, 2, ...}</para>
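      <para>The offset computation can also be scripted. The following sketch prints the first few
        affected byte ranges for a given layout. The function name
        <literal>stripe_holes</literal> is hypothetical, sizes are given in bytes, and 1M is
        assumed to mean 1048576 bytes.</para>
      <screen># Hypothetical helper: print the first COUNT byte ranges that fall on the bad OST.
# Arguments: stripe count (C), stripe size in bytes (S), index of the bad OST (X), COUNT.
stripe_holes() {
    c=$1 s=$2 x=$3 count=$4
    n=0
    while [ "$n" -lt "$count" ]; do
        start=$(( (c * n + x) * s ))
        end=$(( start + s - 1 ))
        echo "[$start, $end]"
        n=$(( n + 1 ))
    done
}

# Two stripes, 1 MiB stripe size, bad OST at index 0, first three holes:
stripe_holes 2 1048576 0 3</screen>
      <para>With these arguments the function prints [0, 1048575], [2097152, 3145727] and
        [4194304, 5242879], one range per line.</para>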
  415       <para>If the file system cannot be mounted, there is currently no tool that parses metadata directly from an MDS. If the bad OST does not start, the options for mounting the file system are to provide a loop device OST in its place or to replace it with a newly formatted OST. In that case, the missing objects are created and are read as zero-filled.</para>
 416     </section>
 417     <section xml:id="dbdoclet.repair_ost_lastid">
 418       <title>Fixing a Bad LAST_ID on an OST</title>
 419       <para>Each OST contains a LAST_ID file, which holds the last object (pre-)created by the MDS  <footnote>
 420           <para>The contents of the LAST_ID file must be accurate regarding the actual objects that exist on the OST.</para>
 421         </footnote>. The MDT contains a lov_objid file, with values that represent the last object the MDS has allocated to a file.</para>
 422       <para>During normal operation, the MDT keeps some pre-created (but unallocated) objects on the OST, and the relationship between LAST_ID and lov_objid should be LAST_ID &lt;= lov_objid. Any difference in the file values results in objects being created on the OST when it next connects to the MDS. These objects are never actually allocated to a file, since they are of 0 length (empty), but they do no harm. Creating empty objects enables the OST to catch up to the MDS, so normal operations resume.</para>
 423       <para>However, in the case where lov_objid &lt; LAST_ID, bad things can happen as the MDS is not aware of objects that have already been allocated on the OST, and it reallocates them to new files, overwriting their existing contents.</para>
 424       <para>Here is the rule to avoid this scenario:</para>
 425       <para>LAST_ID &gt;= lov_objid and LAST_ID == last_physical_object and lov_objid &gt;= last_used_object</para>
 426       <para>Although the lov_objid value should be equal to the last_used_object value, the above
 427         rule suffices to keep the Lustre file system happy at the expense of a few leaked
 428         objects.</para>
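      <para>Once the four values are known, the rule can be checked mechanically. The following is
        a minimal sketch: the function name <literal>check_last_id</literal> is hypothetical, and
        the numbers in the example call are made up for illustration.</para>
      <screen># Hypothetical helper: succeed only if the rule stated above holds.
# Arguments: LAST_ID, lov_objid, last_physical_object, last_used_object.
check_last_id() {
    [ "$1" -ge "$2" ] || { echo "LAST_ID is below lov_objid"; return 1; }
    [ "$1" -eq "$3" ] || { echo "LAST_ID does not match last_physical_object"; return 1; }
    [ "$2" -ge "$4" ] || { echo "lov_objid is below last_used_object"; return 1; }
    echo "values are consistent"
}

# Illustrative values only:
check_last_id 3478673 3478600 3478673 3478590</screen>
      <para>The example call prints <literal>values are consistent</literal>. If any of the three
        conditions fails, the function names the first failed condition and returns a non-zero
        status.</para>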
 429       <para>In situations where there is on-disk corruption of the OST, for example caused by running with write cache enabled on the disks, the LAST_ID value may become inconsistent and result in a message similar to:</para>
 430       <screen>&quot;filter_precreate()) HOME-OST0003: Serious error:
 431 objid 3478673 already exists; is this filesystem corrupt?&quot;</screen>
 432       <para>A related situation may happen if there is a significant discrepancy between the record of previously-created objects on the OST and the previously-allocated objects on the MDS, for example if the MDS has been corrupted, or restored from backup, which may cause significant data loss if left unchecked. This produces a message like:</para>
 433       <screen>&quot;HOME-OST0003: ignoring bogus orphan destroy request:
 434 obdid 3438673 last_id 3478673&quot;</screen>
 435       <para>To recover from this situation, determine and set a reasonable LAST_ID value.</para>
 436       <note>
 437         <para>The file system must be stopped on all servers before performing this procedure.</para>
 438       </note>
  439       <para>For decimal-to-hex translations:</para>
 440       <para>Use GDB:</para>
 441       <screen>(gdb) p /x 15028
 442 $2 = 0x3ab4</screen>
 443       <para>Or <literal>bc</literal>:</para>
 444       <screen>echo &quot;obase=16; 15028&quot; | bc</screen>
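      <para>If neither tool is installed, the shell's <literal>printf</literal> utility can
        convert in both directions. A brief sketch using the same value as above:</para>
      <screen>printf '0x%x\n' 15028    # decimal to hexadecimal, prints 0x3ab4
printf '%d\n' 0x3ab4     # hexadecimal to decimal, prints 15028</screen>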
 445       <orderedlist>
 446         <listitem>
 447           <para>Determine a reasonable value for the LAST_ID file. Check on the MDS:</para>
 448           <screen># mount -t ldiskfs <replaceable>/dev/mdt_device</replaceable> /mnt/mds
 449 # od -Ax -td8 /mnt/mds/lov_objid
 450 </screen>
 451           <para>There is one entry for each OST, in OST index order. This is what the MDS thinks is the last in-use object.</para>
 452         </listitem>
 453         <listitem>
 454           <para>Determine the OST index for this OST.</para>
 455           <screen># od -Ax -td4 /mnt/ost/last_rcvd
 456 </screen>
  457           <para>The OST index is stored at offset 0x8c.</para>
 458         </listitem>
 459         <listitem>
 460           <para>Check on the OST. Use debugfs to check the LAST_ID value:</para>
  461           <screen>debugfs -c -R &apos;dump /O/0/LAST_ID /tmp/LAST_ID&apos; /dev/XXX
  462 od -Ax -td8 /tmp/LAST_ID
 463 </screen>
 464         </listitem>
 465         <listitem>
 466           <para>Check the objects on the OST:</para>
 467           <screen>mount -rt ldiskfs /dev/{ostdev} /mnt/ost
  468 # note: the option to ls below is the number one (1), not the letter L
  469 ls -1s /mnt/ost/O/0/d* | grep -v &apos;[a-z]&apos; |
 470 sort -k2 -n &gt; /tmp/objects.{diskname}
 471
 472 tail -30 /tmp/objects.{diskname}</screen>
 473           <para>This shows you the OST state. There may be some pre-created orphans. Check for zero-length objects. Any zero-length objects with IDs higher than LAST_ID should be deleted. New objects will be pre-created.</para>
 474         </listitem>
 475       </orderedlist>
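      <para>The highest object ID in the listing is the value to compare against LAST_ID. The
        following sketch extracts it, assuming the listing file contains lines of the form
        &quot;blocks object_id&quot; sorted by object ID, as produced by the commands above. The
        function name <literal>highest_object_id</literal> is hypothetical.</para>
      <screen># Hypothetical helper: print the object ID on the last two-field line of a
# sorted listing such as /tmp/objects.{diskname}. Blank lines are ignored.
highest_object_id() {
    awk 'NF == 2 { id = $2 } END { print id }' "$1"
}

# Example use:
#   highest_object_id /tmp/objects.{diskname}</screen>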
 476       <para>If the OST LAST_ID value matches that for the objects existing on the OST, then it is possible the lov_objid file on the MDS is incorrect. Delete the lov_objid file on the MDS and it will be re-created from the LAST_ID on the OSTs.</para>
  477       <para>If you determine that the LAST_ID file on the OST is incorrect (that is, it does not match the objects that exist or does not match the MDS lov_objid value), then decide on a proper value for LAST_ID.</para>
 478       <para>Once you have decided on a proper value for LAST_ID, use this repair procedure.</para>
 479       <orderedlist>
 480         <listitem>
  481           <para>Mount the OST as <literal>ldiskfs</literal>:</para>
  482           <screen>mount -t ldiskfs /dev/{ostdev} /mnt/ost</screen>
  483         </listitem>
  484         <listitem>
  485           <para>Check the current LAST_ID value:</para>
  486           <screen>od -Ax -td8 /mnt/ost/O/0/LAST_ID</screen>
  487         </listitem>
  488         <listitem>
  489           <para>To be safe, work only on a copy:</para>
  490           <screen>cp /mnt/ost/O/0/LAST_ID /tmp/LAST_ID</screen>
  491         </listitem>
  492         <listitem>
  493           <para>Convert binary to text:</para>
  494           <screen>xxd /tmp/LAST_ID /tmp/LAST_ID.asc</screen>
  495         </listitem>
  496         <listitem>
  497           <para>Edit the value:</para>
 498           <screen>vi /tmp/LAST_ID.asc</screen>
 499         </listitem>
 500         <listitem>
 501           <para>Convert to binary:</para>
 502           <screen>xxd -r /tmp/LAST_ID.asc /tmp/LAST_ID.new</screen>
 503         </listitem>
 504         <listitem>
          <para>Verify the new value:</para>
 506           <screen>od -Ax -td8 /tmp/LAST_ID.new</screen>
 507         </listitem>
 508         <listitem>
          <para>Replace the LAST_ID file on the OST:</para>
 510           <screen>cp /tmp/LAST_ID.new /mnt/ost/O/0/LAST_ID</screen>
 511         </listitem>
 512         <listitem>
          <para>Clean up by unmounting the OST:</para>
 514           <screen>umount /mnt/ost</screen>
 515         </listitem>
 516       </orderedlist>
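      <para>The edit step is easy to get wrong because the hex dump lists bytes in storage order. The following sketch rehearses the conversion steps on a scratch file. It assumes the value is a 64-bit integer stored least significant byte first, which is what the <literal>od -td8</literal> check implies on a little-endian host. The file name <literal>/tmp/LAST_ID.demo</literal> and the values 21 and 26 are hypothetical.</para>
      <screen># Rehearsal on a scratch file only; this does not touch an OST.
# Create an 8-byte file holding 21 (0x15), least significant byte first.
echo 1500000000000000 | xxd -r -p - /tmp/LAST_ID.demo
od -Ax -td8 /tmp/LAST_ID.demo            # should report 21 on a little-endian host

# Convert to text, change the low byte from 0x15 to 0x1a (26), convert back.
xxd /tmp/LAST_ID.demo /tmp/LAST_ID.demo.asc
sed -i 's/: 1500/: 1a00/' /tmp/LAST_ID.demo.asc
xxd -r /tmp/LAST_ID.demo.asc /tmp/LAST_ID.demo.new
od -Ax -td8 /tmp/LAST_ID.demo.new        # should report 26 on a little-endian host</screen>
      <para>Only the hex columns of the text file are read back by <literal>xxd -r</literal>, so the printable-character column at the end of each line does not need to be kept in sync by hand.</para>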
 517     </section>
 518     <section remap="h3">
 519       <title><indexterm><primary>troubleshooting</primary><secondary>'Address already in use'</secondary></indexterm>Handling/Debugging &quot;<literal>Bind: Address already in use</literal>&quot; Error</title>
      <para>During startup, the Lustre software may report a <literal>bind: Address already in
          use</literal> error and refuse to start the operation. This is caused by a portmap service
        (often NFS locking) that starts before the Lustre file system and binds to the default port
        988. Port 988 must be open in the firewall or iptables rules for incoming connections on the
        client, OSS, and MDS nodes. LNet will create three outgoing connections on available,
        reserved ports to each client-server pair, starting with 1023, 1022 and 1021.</para>
      <para>Unfortunately, you cannot set sunrpc to avoid port 988. If you receive this error, do one of the following:</para>
      <itemizedlist>
        <listitem>
          <para>Start the Lustre file system before starting any service that uses sunrpc.</para>
        </listitem>
        <listitem>
          <para>Use a port other than 988 for the Lustre file system. This is configured in
              <literal>/etc/modprobe.d/lustre.conf</literal> as an option to the LNet module. The
            following line shows the default value; replace 988 with an unused port:</para>
          <screen>options lnet accept_port=988</screen>
        </listitem>
        <listitem>
          <para>Add <literal>modprobe ptlrpc</literal> to your system startup scripts before the
            service that uses sunrpc. This causes the Lustre file system to bind to port 988 and
            sunrpc to select a different port.</para>
        </listitem>
      </itemizedlist>
 545       <note>
        <para>You can also use the <literal>sysctl</literal> command to keep the NFS client from taking the Lustre service port. However, this is only a partial workaround because other user-space RPC servers can still take the port.</para>
 547       </note>
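      <para>As a sketch of this workaround, and assuming the kernel exposes the sunrpc reserved-port range as <literal>sunrpc.min_resvport</literal> (an assumption; this section does not name the setting), the lower bound of that range can be raised above 988:</para>
      <screen># Assumption: the sunrpc module is loaded, so the sunrpc.* settings exist.
# Show the current lower bound of the reserved-port range used by the NFS client.
sysctl sunrpc.min_resvport

# Raise the lower bound so that port 988 is outside the range (run as root).
sysctl -w sunrpc.min_resvport=989</screen>
      <para>A value set this way does not survive a reboot unless it is also added to the system sysctl configuration.</para>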
 548     </section>
 549     <section remap="h3">
      <title><indexterm><primary>troubleshooting</primary><secondary>'Error -28'</secondary></indexterm>Handling/Debugging Error &quot;-28&quot;</title>
      <para>A Linux error -28 (<literal>ENOSPC</literal>) that occurs during a write or sync operation indicates that an existing file residing on an OST could not be rewritten or updated because the OST was full, or nearly full. To verify whether this is the case, on a client on which the file system is mounted, enter:</para>
 552       <screen>lfs df -h</screen>
 553       <para>To address this issue, you can do one of the following:</para>
 554       <itemizedlist>
 555         <listitem>
 556           <para>Expand the disk space on the OST.</para>
 557         </listitem>
 558         <listitem>
 559           <para>Copy or stripe the file to a less full OST.</para>
 560         </listitem>
 561       </itemizedlist>
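      <para>As a sketch of the second option, a file can be copied into a new file whose stripe is placed on a chosen OST. The mount point <literal>/mnt/lustre</literal>, the file names, and OST index 3 are hypothetical; choose the index of an OST that <literal>lfs df -h</literal> shows has free space.</para>
      <screen># Show which OSTs currently hold the file's objects.
lfs getstripe /mnt/lustre/bigfile

# Create an empty file with one stripe starting on OST index 3, then copy the data into it.
lfs setstripe -c 1 -i 3 /mnt/lustre/bigfile.new
cp /mnt/lustre/bigfile /mnt/lustre/bigfile.new

# After verifying the copy, replace the original.
mv /mnt/lustre/bigfile.new /mnt/lustre/bigfile</screen>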
      <para>A Linux error -28 (<literal>ENOSPC</literal>) that occurs when a new file is being created may indicate that the MDS has run out of inodes and needs to be made larger. Newly created files are not written to full OSTs, while existing files continue to reside on the OST where they were initially created. To view inode information on the MDS, enter:</para>
 563       <screen>lfs df -i</screen>
 564       <para>Typically, the Lustre software reports this error to your application. If the
 565         application is checking the return code from its function calls, then it decodes it into a
 566         textual error message such as <literal>No space left on device</literal>. Both versions of
 567         the error message also appear in the system log.</para>
 568       <para>For more information about the <literal>lfs df</literal> command, see <xref linkend="dbdoclet.50438209_35838"/>.</para>
 569       <para>Although it is less efficient, you can also use the grep command to determine which OST or MDS is running out of space. To check the free space and inodes on a client, enter:</para>
      <screen>grep &apos;[0-9]&apos; /proc/fs/lustre/osc/*/kbytes{free,avail,total}
grep &apos;[0-9]&apos; /proc/fs/lustre/osc/*/files{free,total}
grep &apos;[0-9]&apos; /proc/fs/lustre/mdc/*/kbytes{free,avail,total}
grep &apos;[0-9]&apos; /proc/fs/lustre/mdc/*/files{free,total}</screen>
 574       <note>
 575         <para>You can find other numeric error codes along with a short name and text description in <literal>/usr/include/asm/errno.h</literal>.</para>
 576       </note>
 577     </section>
 578     <section remap="h3">
 579       <title>Triggering Watchdog for PID NNN</title>
      <para>In some cases, a server node triggers a watchdog timer, which causes a process stack to be dumped to the console and a Lustre kernel debug log to be dumped into <literal>/tmp</literal> (by default). A triggered watchdog timer does NOT mean that the thread OOPSed, but rather that it is taking longer than expected to complete a given operation. In some cases, this situation is expected.</para>
      <para>For example, if a RAID rebuild is significantly slowing down I/O on an OST, it might cause watchdog timers to trip. In that case, another message follows shortly thereafter, indicating that the thread in question has completed processing (after some number of seconds). Generally, this indicates a transient problem. In other cases, a watchdog timer may legitimately signal that a thread is stuck because of a software error (lock inversion, for example).</para>
 582       <screen>Lustre: 0:0:(watchdog.c:122:lcw_cb()) </screen>
      <para>The above message indicates that the watchdog was triggered for pid 933, which had been inactive for 100000 ms:</para>
 585       <screen>Lustre: 0:0:(linux-debug.c:132:portals_debug_dumpstack()) </screen>
 586       <para>Showing stack for process:</para>
      <screen>933 ll_ost_25     D F896071A     0   933      1    934   932 (L-TLB)
f6d87c60 00000046 00000000 f896071a f8def7cc 00002710 00001822 2da48cae
0008cf1a f6d7c220 f6d7c3d0 f6d86000 f3529648 f6d87cc4 f3529640 f8961d3d
00000010 f6d87c9c ca65a13c 00001fff 00000001 00000001 00000000 00000001</screen>
 591       <para>Call trace:</para>
      <screen>filter_do_bio+0x3dd/0xb90 [obdfilter]
default_wake_function+0x0/0x20
filter_direct_io+0x2fb/0x990 [obdfilter]
filter_preprw_read+0x5c5/0xe00 [obdfilter]
lustre_swab_niobuf_remote+0x0/0x30 [ptlrpc]
ost_brw_read+0x18df/0x2400 [ost]
ost_handle+0x14c2/0x42d0 [ost]
ptlrpc_server_handle_request+0x870/0x10b0 [ptlrpc]
ptlrpc_main+0x42e/0x7c0 [ptlrpc]
</screen>
 602     </section>
 603     <section remap="h3">
 604       <title><indexterm>
 605           <primary>troubleshooting</primary>
 606           <secondary>timeouts on setup</secondary>
 607         </indexterm>Handling Timeouts on Initial Lustre File System Setup</title>
      <para>If you encounter timeouts or hangs on the initial setup of your Lustre file system,
        verify that name resolution for servers and clients is working correctly. Some distributions
        configure <literal>/etc/hosts</literal> so the name of the local machine (as reported by the
        <literal>hostname</literal> command) is mapped to localhost (127.0.0.1) instead of a proper IP
        address.</para>
      <para>This can produce an error such as:</para>
      <screen>LustreError:(ldlm_handle_cancel()) received cancel for unknown lock cookie
0xe74021a4b41b954e from nid 0x7f000001 (0:127.0.0.1)
</screen>
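      <para>A quick way to check for this misconfiguration is to resolve the local host name and see whether it maps to a loopback address. This is a minimal sketch that assumes <literal>getent</literal> is available; it only reports what it finds and does not change anything.</para>
      <screen># uname -n prints the same node name that the hostname command reports.
name=$(uname -n)

# Take the first address that the system resolver returns for that name.
addr=$(getent hosts "$name" | awk '{print $1; exit}')

case "$addr" in
  127.*|::1) echo "WARNING: $name resolves to loopback address $addr" ;;
  "")        echo "WARNING: $name does not resolve" ;;
  *)         echo "$name resolves to $addr" ;;
esac</screen>
      <para>Run the check on every server and client; a warning on any node points to an <literal>/etc/hosts</literal> entry that should be corrected before retrying the setup.</para>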
 617     </section>
 618     <section remap="h3">
 619       <title>Handling/Debugging &quot;LustreError: xxx went back in time&quot;</title>
      <para>Each time the Lustre software changes the state of the disk file system, it records a
        unique transaction number. Periodically, as these transactions are committed to the disk, the
        last committed transaction number is reported to other nodes in the cluster to assist
        recovery. Transactions that have been reported as committed must therefore never disappear
        from the disk. This error indicates that a last committed transaction number is lower than a
        value that was reported earlier.</para>
 625       <para>This situation arises when:</para>
 626       <itemizedlist>
 627         <listitem>
          <para>You are using a disk device that claims to have data written to disk before it
            actually does, as in the case of a device with a large cache. If that disk device crashes
            or loses power in a way that causes the loss of the cache, transactions that you believe
            are committed can be lost. This is a very serious event, and you should run
            <literal>e2fsck</literal> against that storage before restarting the Lustre file system.</para>
 633         </listitem>
 634         <listitem>
          <para>The shared storage used for failover is not completely cache-coherent. The Lustre
            software requires cache coherency so that if one server takes over for another, it sees
            the most up-to-date and accurate copy of the data. If the shared storage does not provide
            cache coherency between all of its ports, the Lustre software can produce this error when
            a server fails over.</para>
 640         </listitem>
 641       </itemizedlist>
 642       <para>If you know the exact reason for the error, then it is safe to proceed with no further action. If you do not know the reason, then this is a serious issue and you should explore it with your disk vendor.</para>
      <para>If the error occurs during failover, examine your disk cache settings. If it occurs after a restart without failover, try to determine how the disk can report that a write succeeded and then lose the data, for example through device corruption or disk errors.</para>
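      <para>When a check of the storage is called for, a cautious first pass is to run it read-only so that nothing is modified until the damage is understood. This sketch assumes the target is unmounted and uses <literal>/dev/{ostdev}</literal> as a placeholder for the device. The options shown are standard <literal>e2fsck</literal> options: <literal>-f</literal> forces a check and <literal>-n</literal> answers no to all repair prompts.</para>
      <screen># Read-only check; reports problems without changing the device.
e2fsck -fn /dev/{ostdev}</screen>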
 644     </section>
 645     <section remap="h3">
 646       <title>Lustre Error: &quot;<literal>Slow Start_Page_Write</literal>&quot;</title>
      <para>The slow <literal>start_page_write</literal> message appears when the operation takes an extremely long time to allocate a batch of memory pages. These pages are used first to receive network traffic and then to write the data to disk.</para>
 648     </section>
 649     <section remap="h3">
 650       <title>Drawbacks in Doing Multi-client O_APPEND Writes</title>
      <para>It is possible to do multi-client <literal>O_APPEND</literal> writes to a single file, but there are a few drawbacks that may make this a sub-optimal solution. These drawbacks are:</para>
 652       <itemizedlist>
 653         <listitem>
          <para>Each client needs to take an <literal>EOF</literal> lock on all the OSTs, as it is difficult to know which OST holds the end of the file until you check all the OSTs. As all the clients are opening the same file with <literal>O_APPEND</literal>, there is significant locking overhead.</para>
 655         </listitem>
 656         <listitem>
          <para>The second client cannot get all of the locks until the first client has finished writing, because taking the locks serializes all writes from the clients.</para>
 658         </listitem>
 659         <listitem>
          <para>To avoid deadlocks, these locks are taken in a known, consistent order. As a client cannot know which OST holds the next piece of the file until the client has locks on all OSTs, all of these locks are needed for a striped file.</para>
 661         </listitem>
 662       </itemizedlist>
 663     </section>
 664     <section remap="h3">
 665       <title><indexterm>
 666           <primary>troubleshooting</primary>
 667           <secondary>slowdown during startup</secondary>
 668         </indexterm>Slowdown Occurs During Lustre File System Startup</title>
 669       <para>When a Lustre file system starts, it needs to read in data from the disk. For the very
 670         first mdsrate run after the reboot, the MDS needs to wait on all the OSTs for object
 671         pre-creation. This causes a slowdown to occur when the file system starts up.</para>
 672       <para>After the file system has been running for some time, it contains more data in cache and hence, the variability caused by reading critical metadata from disk is mostly eliminated. The file system now reads data from the cache.</para>
 673     </section>
 674     <section remap="h3">
      <title><indexterm><primary>troubleshooting</primary><secondary>OST out of memory</secondary></indexterm>Log Message &apos;<literal>Out of Memory</literal>&apos; on OST</title>
 676       <para>When planning the hardware for an OSS node, consider the memory usage of several
 677         components in the Lustre file system. If insufficient memory is available, an &apos;out of
 678         memory&apos; message can be logged.</para>
 679       <para>During normal operation, several conditions indicate insufficient RAM on a server node:</para>
 680       <itemizedlist>
 681         <listitem>
 682           <para> kernel &quot;<literal>Out of memory</literal>&quot; and/or &quot;<literal>oom-killer</literal>&quot; messages</para>
 683         </listitem>
 684         <listitem>
 685           <para> Lustre &quot;<literal>kmalloc of &apos;mmm&apos; (NNNN bytes) failed...</literal>&quot; messages</para>
 686         </listitem>
 687         <listitem>
 688           <para> Lustre or kernel stack traces showing processes stuck in &quot;<literal>try_to_free_pages</literal>&quot;</para>
 689         </listitem>
 690       </itemizedlist>
 691       <para>For information on determining the MDS memory and OSS memory requirements, see <xref linkend="dbdoclet.50438256_26456"/>.</para>
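      <para>To look for the indications listed above, the kernel log can be searched with a pattern like the following. This is a sketch that assumes the messages are still in the kernel ring buffer; on systems that log to files, the same pattern can be applied to the system log instead.</para>
      <screen># Search the kernel ring buffer for the low-memory indications.
dmesg | grep -E 'Out of memory|oom-killer|kmalloc of .* failed|try_to_free_pages'</screen>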
 692     </section>
 693     <section remap="h3">
 694       <title>Setting SCSI I/O Sizes</title>
      <para>Some SCSI drivers default to a maximum I/O size that is too small for good Lustre file
        system performance. Quite a few drivers have been fixed, but you may still find that some
        drivers give unsatisfactory performance with the Lustre file system. Where the default value
        is hard-coded, you need to recompile the driver to change it. In other drivers, the default
        is adjustable but may simply be set to an unsuitable value.</para>
      <para>If you suspect bad I/O performance and an analysis of Lustre file system statistics
        indicates that I/O requests are smaller than 1 MB, check
          <literal>/sys/block/<replaceable>device</replaceable>/queue/max_sectors_kb</literal>. If
        the <literal>max_sectors_kb</literal> value is less than 1024, set it to at least 1024 to
        improve performance. If changing <literal>max_sectors_kb</literal> does not change the I/O
        size as reported by the Lustre software, you may want to examine the SCSI driver
        code.</para>
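      <para>The check described above can be scripted. The following sketch lists every block device whose <literal>max_sectors_kb</literal> value is below 1024; it only reads the values. The device name <literal>sdX</literal> in the final comment is a placeholder.</para>
      <screen># Report block devices whose maximum I/O size is below 1024 KB.
for f in /sys/block/*/queue/max_sectors_kb; do
  [ -r "$f" ] || continue          # skip if the glob did not match a readable file
  kb=$(cat "$f")
  dev=${f#/sys/block/}             # strip the leading path
  dev=${dev%%/*}                   # keep only the device name
  if [ "$kb" -lt 1024 ]; then
    echo "$dev: max_sectors_kb=$kb"
  fi
done

# To raise the limit on one device (as root), replacing sdX with the device name:
# echo 1024 | tee /sys/block/sdX/queue/max_sectors_kb</screen>
      <para>A value written this way applies until the next reboot, so a permanent change needs to be reapplied at startup.</para>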
 707     </section>
 708   </section>
 709 </chapter>