LustreProc.xml

   1 <?xml version='1.0' encoding='UTF-8'?>
   2 <!-- This document was created with Syntext Serna Free. -->
   3 <chapter xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0"
   4   xml:lang="en-US" xml:id="lustreproc">
   5   <title xml:id="lustreproc.title">LustreProc</title>
   6   <para>The <literal>/proc</literal> file system acts as an interface to internal data structures in
   7     the kernel. This chapter describes entries in <literal>/proc</literal> that are useful for
   8     tuning and monitoring aspects of a Lustre file system. It includes these sections:</para>
   9   <itemizedlist>
  10     <listitem>
  11       <para><xref linkend="dbdoclet.50438271_90999"/></para>
  12     </listitem>
  13     <listitem>
  14       <para><xref linkend="dbdoclet.50438271_78950"/></para>
  15     </listitem>
  16     <listitem>
  17       <para><xref linkend="dbdoclet.50438271_83523"/></para>
  18       <para>The <literal>/proc</literal> directory provides a file-system like interface to internal
  19         data structures in the kernel. These data structures include settings and metrics for
  20         components such as memory, networking, file systems, and kernel housekeeping routines, which
  21         are available throughout the hierarchical file layout in <literal>/proc</literal>.
  22         Typically, metrics are accessed by reading from <literal>/proc</literal> files and settings
  23         are changed by writing to <literal>/proc</literal> files. </para>
  24       <para>The <literal>/proc</literal> directory contains files that allow an operator to
  25         interface with the Lustre file system to tune and monitor many aspects of system and
  26         application performance.</para>
  27     </listitem>
  28   </itemizedlist>
  29   <section xml:id="dbdoclet.50438271_90999">
  30     <title><indexterm>
  31         <primary>proc</primary>
  32       </indexterm> Lustre Entries in /proc</title>
  33     <para>This section describes <literal>/proc</literal> entries for Lustre.</para>
  34     <section remap="h3">
  35       <title>Locating Lustre File Systems and Servers</title>
  36       <para>Use the <literal>/proc</literal> files on the MGS to locate the following:</para>
  37       <itemizedlist>
  38         <listitem>
  39           <para> All known file systems</para>
  40           <screen>mgs# cat /proc/fs/lustre/mgs/MGS/filesystems
  41 testfs
  42 lustre</screen>
  43         </listitem>
  44       </itemizedlist>
  45       <itemizedlist>
  46         <listitem>
  47           <para> The names of the servers in a file system (for a file system that has at least one
  48             server running)</para>
  49           <screen>mgs# cat /proc/fs/lustre/mgs/MGS/live/testfs
  50 fsname: testfs
  51 flags: 0x0         gen: 7
  52 testfs-MDT0000
  53 testfs-OST0000</screen>
  54         </listitem>
  55       </itemizedlist>
  56       <para>All servers are named according to the convention
  57             <literal><replaceable>fsname</replaceable>-<replaceable>MDT|OSTnumber</replaceable></literal>.
  58         Server names for live servers are listed under
  59         <literal>/proc/fs/lustre/devices</literal>:</para>
  60       <screen>mds# cat /proc/fs/lustre/devices
  61 0 UP mgs MGS MGS 11
  62 1 UP mgc MGC192.168.10.34@tcp 1f45bb57-d9be-2ddb-c0b0-5431a4922670 5
  63 2 UP mdt MDS MDS_uuid 3
  64 3 UP lov lustre-mdtlov lustre-mdtlov_UUID 4
  65 4 UP mds lustre-MDT0000 lustre-MDT0000_UUID 7
  66 5 UP osc lustre-OST0000-osc lustre-mdtlov_UUID 5
  67 6 UP osc lustre-OST0001-osc lustre-mdtlov_UUID 5
  68 7 UP lov lustre-clilov-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0 4
  69 8 UP mdc lustre-MDT0000-mdc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0 5
  70 9 UP osc lustre-OST0000-osc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0 5
  71 10 UP osc lustre-OST0001-osc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0 5</screen>
  72       <para>A server name can also be displayed by viewing the device label at any time.</para>
  73       <screen>mds# e2label /dev/sda
  74 lustre-MDT0000</screen>
  75     </section>
  76     <section remap="h3">
  77       <title><indexterm>
  78           <primary>proc</primary>
  79           <secondary>timeouts</secondary>
  80         </indexterm>Timeouts in a Lustre File System</title>
  81       <para>Two types of timeouts are used in a Lustre file system.</para>
  82       <itemizedlist>
  83         <listitem>
  84           <para><emphasis role="italic">LND timeouts</emphasis> - LND timeouts ensure that
  85             point-to-point communications complete in a finite time in the presence of failures.
  86             These timeouts are logged with the <literal>S_LND</literal> flag set. They are not
  87             printed as console messages, so you should check the Lustre log for
  88               <literal>D_NETERROR</literal> messages or enable printing of
  89               <literal>D_NETERROR</literal> messages to the console (<literal>lctl set_param
  90               printk=+neterror</literal>).</para>
  91           <para>Congested routers can be a source of spurious LND timeouts. To avoid this situation,
  92             increase the number of LNET router buffers to reduce back-pressure and/or increase LND
  93             timeouts on all nodes on all connected networks. Also consider increasing the total
  94             number of LNET router nodes in the system so that the aggregate router bandwidth matches
  95             the aggregate server bandwidth.</para>
  96         </listitem>
  97       </itemizedlist>
  98       <itemizedlist>
  99         <listitem>
 100           <para><emphasis role="italic">Lustre timeouts</emphasis> - Lustre timeouts ensure that
 101             Lustre RPCs complete in a finite time in the presence of failures. These timeouts are
 102             always printed as console messages. If Lustre timeouts are not accompanied by LND
 103             timeouts, then increase the Lustre timeout on both servers and clients.</para>
 104         </listitem>
 105       </itemizedlist>
 106       <para>Specific Lustre timeouts include:</para>
 107       <itemizedlist>
 108         <listitem>
 109           <para><literal>/proc/sys/lustre/timeout</literal> - The time period that a client waits
 110             for a server to complete an RPC (default is 100s). Servers wait half of this time for a
 111             normal client RPC to complete and a quarter of this time for a single bulk request (read
 112             or write of up to 4 MB) to complete. The client pings recoverable targets (MDS and OSTs)
 113             at one quarter of the timeout, and the server waits one and a half times the timeout
 114             before evicting a client for being &quot;stale.&quot;</para>
 115           <note>
 116             <para>A Lustre client sends periodic &apos;ping&apos; messages to servers with which it
 117               has had no communication for a specified period of time. Any network activity between
 118               a client and a server in the file system also serves as a ping.</para>
 119           </note>
 120         </listitem>
 121         <listitem>
 122           <para><literal>/proc/sys/lustre/ldlm_timeout</literal> - The time period for which a
 123             server will wait for a client to reply to an initial AST (lock cancellation request),
 124             where the default is 20s for an OST and 6s for an MDS. If the client replies to the AST,
 125             the server will give it a normal timeout (half the client timeout) to flush any dirty
 126             data and release the lock.</para>
 127         </listitem>
 128         <listitem>
 129           <para><literal>/proc/sys/lustre/fail_loc</literal> - The internal debugging failure hook.
 130             See <literal>lustre/include/linux/obd_support.h</literal> for the definitions of
 131             individual failure locations. The default value is 0 (zero).</para>
 132         </listitem>
 133         <listitem>
 134           <para><literal>/proc/sys/lustre/dump_on_timeout</literal> - Triggers dumps of the Lustre
 135             debug log when timeouts occur. The default value is 0 (zero).</para>
 136         </listitem>
 137         <listitem>
 138           <para><literal>/proc/sys/lustre/dump_on_eviction</literal> - Triggers dumps of the Lustre
 139             debug log when an eviction occurs. The default value is 0 (zero). </para>
 140         </listitem>
 141       </itemizedlist>
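      <para>The relationships between the static timeout and the intervals derived from it can be
        sketched as follows. This is an illustrative calculation only, not Lustre code; the function
        and key names are hypothetical, and the ratios are taken from the description of
        <literal>/proc/sys/lustre/timeout</literal> above.</para>

```python
def derived_timeouts(obd_timeout: float = 100.0) -> dict:
    """Derive the intervals described above from the static timeout (in seconds).

    The key names are hypothetical labels chosen for this sketch.
    """
    return {
        # A client waits the full timeout for a server to complete an RPC.
        "client_rpc_wait": obd_timeout,
        # A server waits half of the timeout for a normal client RPC.
        "server_rpc_wait": obd_timeout / 2,
        # A server waits a quarter of the timeout for a single bulk request.
        "server_bulk_wait": obd_timeout / 4,
        # A client pings recoverable targets at one quarter of the timeout.
        "client_ping_interval": obd_timeout / 4,
        # A server waits one and a half times the timeout before evicting a client.
        "server_eviction_wait": obd_timeout * 1.5,
    }


if __name__ == "__main__":
    for name, seconds in derived_timeouts().items():
        print(f"{name}: {seconds:g}s")
```

      <para>With the default of 100 seconds, the sketch yields 50 seconds for a normal RPC on the
        server, 25 seconds for a bulk request and for the ping interval, and 150 seconds before
        eviction.</para>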
 142     </section>
 143     <section remap="h3">
 144       <title><indexterm>
 145           <primary>proc</primary>
 146           <secondary>adaptive timeouts</secondary>
 147         </indexterm>Adaptive Timeouts</title>
 148       <para>In a Lustre file system, an adaptive mechanism is used to set RPC timeouts. The adaptive
 149         timeouts feature (enabled by default) causes servers to track actual RPC completion times
 150         and to report estimated completion times for future RPCs back to clients. The clients use
 151         these estimates to set their future RPC timeout values. If server request processing slows
 152         down for any reason, the RPC completion estimates increase, and the clients allow more time
 153         for RPC completion.</para>
 154       <para>If RPCs queued on the server approach their timeouts, then the server sends an early
 155         reply to the client, telling the client to allow more time. In this manner, clients avoid
 156         RPC timeouts and disconnect/reconnect cycles. Conversely, as a server speeds up, RPC timeout
 157         values decrease, allowing faster detection of non-responsive servers and faster attempts to
 158         reconnect to the failover partner of the server.</para>
 159       <para>Adaptive timeouts were introduced in the Lustre 1.8.0.1 release. Prior to this release,
 160         the static <literal>obd_timeout</literal> (<literal>/proc/sys/lustre/timeout</literal>)
 161         value was used as the maximum completion time for all RPCs; this value also affected the
 162         client-server ping interval and initial recovery timer. With adaptive timeouts,
 163           <literal>obd_timeout</literal> is only used for the ping interval and initial recovery
 164         estimate. When a client reconnects during recovery, the server uses the client&apos;s
 165         timeout value to reset the recovery wait period; i.e., the server learns how long the client
 166         had been willing to wait, and takes this into account when adjusting the recovery
 167         period.</para>
 168       <section remap="h4">
 169         <title><indexterm>
 170             <primary>proc</primary>
 171             <secondary>configuring adaptive timeouts</secondary>
 172           </indexterm><indexterm>
 173             <primary>configuring</primary>
 174             <secondary>adaptive timeouts</secondary>
 175           </indexterm>Configuring Adaptive Timeouts</title>
 176         <para>A goal of adaptive timeouts is to relieve users from having to tune the
 177             <literal>obd_timeout</literal> value. In general, <literal>obd_timeout</literal> should
 178           no longer need to be changed. However, several parameters related to adaptive timeouts can
 179           be set by users. In most situations, the default values should be used.</para>
 180         <para>The following parameters can be set persistently system-wide using <literal>lctl
 181             conf_param</literal> on the MGS. For example, <literal>lctl conf_param
 182             testfs.sys.at_max=1500</literal> sets the <literal>at_max</literal> value for all
 183           servers and clients using the testfs file system.</para>
 184         <note>
 185           <para>Nodes using multiple Lustre file systems must use the same <literal>at_*</literal>
 186             values for all file systems.</para>
 187         </note>
 188         <informaltable frame="all">
 189           <tgroup cols="2">
 190             <colspec colname="c1" colwidth="30*"/>
 191             <colspec colname="c2" colwidth="80*"/>
 192             <thead>
 193               <row>
 194                 <entry>
 195                   <para><emphasis role="bold">Parameter</emphasis></para>
 196                 </entry>
 197                 <entry>
 198                   <para><emphasis role="bold">Description</emphasis></para>
 199                 </entry>
 200               </row>
 201             </thead>
 202             <tbody>
 203               <row>
 204                 <entry>
 205                   <para>
 206                     <literal> at_min </literal></para>
 207                 </entry>
 208                 <entry>
 209                   <para>Sets the minimum adaptive timeout (in seconds). Default value is 0. The
 210                       <literal>at_min</literal> parameter is the minimum processing time that a
 211                     server will report. Clients base their timeouts on this value, but they do not
 212                     use this value directly. If you experience cases in which, for unknown reasons,
 213                     the adaptive timeout value is too short and clients time out their RPCs (usually
 214                     due to temporary network outages), then you can increase the
 215                       <literal>at_min</literal> value to compensate for this. Ideally, users should
 216                     leave <literal>at_min</literal> set to its default.</para>
 217                 </entry>
 218               </row>
 219               <row>
 220                 <entry>
 221                   <para>
 222                     <literal> at_max </literal></para>
 223                 </entry>
 224                 <entry>
 225                   <para>Sets the maximum adaptive timeout (in seconds). The
 226                       <literal>at_max</literal> parameter is an upper-limit on the service time
 227                     estimate and is used as a &apos;failsafe&apos; in case of rogue/bad/buggy code
 228                     that would lead to never-ending estimate increases. If <literal>at_max</literal>
 229                     is reached, an RPC request is considered &apos;broken&apos; and will time
 230                     out.</para>
 231                   <para>Setting <literal>at_max</literal> to 0 causes adaptive timeouts to be
 232                     disabled and the static fixed-timeout method (<literal>obd_timeout</literal>) to
 233                     be used.</para>
 234                   <note>
 235                     <para>It is possible that slow hardware might validly cause the service estimate
 236                       to increase beyond the default value of <literal>at_max</literal>. In this
 237                       case, you should increase <literal>at_max</literal> to the maximum time you
 238                       are willing to wait for an RPC completion.</para>
 239                   </note>
 240                 </entry>
 241               </row>
 242               <row>
 243                 <entry>
 244                   <para>
 245                     <literal> at_history </literal></para>
 246                 </entry>
 247                 <entry>
 248                   <para>Sets a time period (in seconds) within which adaptive timeouts remember the
 249                     slowest event that occurred. Default value is 600.</para>
 250                 </entry>
 251               </row>
 252               <row>
 253                 <entry>
 254                   <para>
 255                     <literal> at_early_margin </literal></para>
 256                 </entry>
 257                 <entry>
 258                   <para>Sets how far before the deadline the Lustre server sends an early reply (in seconds).
 259                     Default value is 5<footnote>
 260                       <para>This default was chosen as a reasonable time in which to send a reply
 261                         from the point at which it was sent.</para>
 262                     </footnote>.</para>
 263                 </entry>
 264               </row>
 265               <row>
 266                 <entry>
 267                   <para>
 268                     <literal> at_extra </literal></para>
 269                 </entry>
 270                 <entry>
 271                   <para>Sets the incremental amount of time that a server asks for, with each early
 272                     reply. The server does not know how much time the RPC will take, so it asks for
 273                     a fixed value. Default value is 30<footnote>
 274                       <para>This default was chosen as a balance between sending too many early
 275                         replies for the same RPC and overestimating the actual completion
 276                         time.</para>
 277                     </footnote>. When a server finds a queued request about to time out (and needs
 278                     to send an early reply out), the server adds the <literal>at_extra</literal>
 279                     value. If the extended time also expires, the Lustre client enters recovery
 280                     status and reconnects to restore normal status.</para>
 281                   <para>If you see multiple early replies for the same RPC asking for multiple
 282                     30-second increases, change the <literal>at_extra</literal> value to a larger
 283                     number to cut down on early replies sent and, therefore, network load.</para>
 284                 </entry>
 285               </row>
 286               <row>
 287                 <entry>
 288                   <para>
 289                     <literal> ldlm_enqueue_min </literal></para>
 290                 </entry>
 291                 <entry>
 292                   <para>Sets the minimum lock enqueue time. Default value is 100. The
 293                       <literal>ldlm_enqueue</literal> time is the maximum of the measured enqueue
 294                     estimate (influenced by <literal>at_min</literal> and <literal>at_max</literal>
 295                     parameters), multiplied by a weighting factor, and the
 296                       <literal>ldlm_enqueue_min</literal> setting. LDLM lock enqueue timeouts were
 297                     previously based on the <literal>obd_timeout</literal> value; now they have a dedicated
 298                     minimum value. Lock enqueue timeouts increase as the measured enqueue times increase
 299                     (similar to adaptive timeouts).</para>
 300                 </entry>
 301               </row>
 302             </tbody>
 303           </tgroup>
 304         </informaltable>
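        <para>The way these parameters bound an estimate can be illustrated with a small sketch. It
          is a simplified model of the descriptions in the table, not the Lustre implementation. The
          function names are hypothetical, and the <literal>weight</literal> argument stands in for
          the weighting factor, whose value is not given here.</para>

```python
def service_estimate(measured, at_min, at_max, obd_timeout):
    """Bound a measured service time (seconds) as the table describes.

    Simplified model: at_max == 0 disables adaptive timeouts, so the static
    obd_timeout is used; otherwise the estimate is kept within
    [at_min, at_max].
    """
    if at_max == 0:
        return obd_timeout
    return min(max(measured, at_min), at_max)


def enqueue_timeout(enqueue_estimate, weight, ldlm_enqueue_min=100):
    """Return the larger of the weighted enqueue estimate and ldlm_enqueue_min.

    `weight` is a placeholder for the weighting factor; its real value is an
    assumption left to the caller.
    """
    return max(enqueue_estimate * weight, ldlm_enqueue_min)


if __name__ == "__main__":
    # A 900-second measurement is capped by a hypothetical at_max of 600.
    print(service_estimate(900, at_min=0, at_max=600, obd_timeout=100))
    # A small weighted estimate is raised to the ldlm_enqueue_min floor.
    print(enqueue_timeout(20, weight=2.0))
```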
 305         <para>Adaptive timeouts are enabled by default. To disable adaptive timeouts at run time,
 306           set <literal>at_max</literal> to 0. On the MGS, run:</para>
 307         <screen>$ lctl conf_param <replaceable>fsname</replaceable>.sys.at_max=0</screen>
 308         <note>
 309           <para>Changing the status of adaptive timeouts at runtime may cause a transient client
 310             timeout, recovery, and reconnection.</para>
 311         </note>
 312       </section>
 313       <section remap="h4">
 314         <title><indexterm>
 315             <primary>proc</primary>
 316             <secondary>interpreting adaptive timeouts</secondary>
 317           </indexterm>Interpreting Adaptive Timeout Information</title>
 318         <para>Adaptive timeout information can be read from the timeouts files in
 319             <literal>/proc/fs/lustre/*/</literal> for each server and client or by using the
 320             <literal>lctl</literal> command.</para>
 321         <para>To read information from a timeouts file, enter a command similar to:</para>
 322         <screen>cfs21:~# cat /proc/fs/lustre/ost/OSS/ost_io/timeouts</screen>
 323         <para>To use the <literal>lctl</literal> command, enter a command similar to:</para>
 324         <screen>$ lctl get_param -n ost.*.ost_io.timeouts</screen>
 325         <para>Example output:</para>
 326         <screen>service : cur 33  worst 34 (at 1193427052, 0d0h26m40s ago) 1 1 33 2</screen>
 327         <para>In this example, the <literal>ost_io</literal> service on this node is currently
 328           reporting an estimate of 33 seconds. The worst RPC service time was 34 seconds, and it
 329           happened 26 minutes ago.</para>
 330         <para>The output also provides a history of service times. In this example, four
 331           &quot;bins&quot; of <literal>adaptive_timeout_history</literal> are shown, with the
 332           maximum RPC time in each bin reported. In 0-150 seconds, the maximum RPC time was 1 second, with
 333           the same result in 150-300 seconds. From 300-450 seconds, the worst (maximum) RPC time was
 334           33 seconds, and from 450-600s the worst time was 2 seconds. The current estimated service
 335           time is the maximum value of the four bins (33 seconds in this example).</para>
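        <para>For scripted monitoring, a line in this format can be split into its fields. The
          following is a minimal sketch that assumes the exact layout of the example output above;
          the function name and dictionary keys are hypothetical, and other Lustre versions may
          format the line differently.</para>

```python
import re

# Layout assumed from the example: name, current estimate, worst time,
# epoch timestamp of the worst time, its age, then the history bins.
TIMEOUT_LINE = re.compile(
    r"^(.+?)\s*:\s*cur\s+(\d+)\s+worst\s+(\d+)"
    r"\s+\(at\s+(\d+),\s*(\S+)\s+ago\)\s+([\d\s]+)$"
)


def parse_timeouts_line(line):
    """Parse one 'cur ... worst ...' line into a dictionary of fields."""
    match = TIMEOUT_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized timeouts line: {line!r}")
    name, cur, worst, worst_at, ago, bins = match.groups()
    return {
        "name": name,
        "cur": int(cur),
        "worst": int(worst),
        "worst_at": int(worst_at),
        "worst_ago": ago,
        "bins": [int(value) for value in bins.split()],
    }


if __name__ == "__main__":
    sample = "service : cur 33  worst 34 (at 1193427052, 0d0h26m40s ago) 1 1 33 2"
    fields = parse_timeouts_line(sample)
    # The current estimate equals the largest history bin.
    print(fields["cur"], max(fields["bins"]))
```

        <para>Lines that do not follow this layout, such as a <literal>last reply</literal> line,
          raise <literal>ValueError</literal> in this sketch and would need separate handling.</para>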
 336         <para>Service times (as reported by the servers) are also tracked in the client OBDs:</para>
 337         <screen>cfs21:# lctl get_param osc.*.timeouts
 338 last reply : 1193428639, 0d0h00m00s ago
 339 network    : cur  1 worst  2 (at 1193427053, 0d0h26m26s ago)  1  1  1  1
 340 portal 6   : cur 33 worst 34 (at 1193427052, 0d0h26m27s ago) 33 33 33  2
 341 portal 28  : cur  1 worst  1 (at 1193426141, 0d0h41m38s ago)  1  1  1  1
 342 portal 7   : cur  1 worst  1 (at 1193426141, 0d0h41m38s ago)  1  0  1  1
 343 portal 17  : cur  1 worst  1 (at 1193426177, 0d0h41m02s ago)  1  0  0  1
 344 </screen>
 345         <para>In this case, the entry for portal 6, the <literal>OST_IO_PORTAL</literal> (see
 346             <literal>lustre/include/lustre/lustre_idl.h</literal>), shows the history of what the
 347             <literal>ost_io</literal> portal has reported as the service estimate.</para>
 348         <para>Server statistics files also show the range of estimates in the order
 349           min/max/sum/sumsq.</para>
 350         <screen>cfs21:~# lctl get_param mdt.*.mdt.stats
 351 ...
 352 req_timeout               6 samples [sec] 1 10 15 105
 353 ...
 354 </screen>
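        <para>Because the line reports the sample count together with the sum and the sum of
          squares, a mean and a standard deviation can be derived from it. The sketch below assumes
          the field order shown in the example (name, count, the word <literal>samples</literal>,
          unit, then min/max/sum/sumsq); the function name is hypothetical, and the population
          variance formula is a choice made for this illustration.</para>

```python
import math


def summarize_stats_line(line):
    """Derive mean and standard deviation from a min/max/sum/sumsq stats line."""
    fields = line.split()
    name, count = fields[0], int(fields[1])
    unit = fields[3].strip("[]")
    minimum, maximum, total, sum_sq = (float(value) for value in fields[4:8])
    mean = total / count
    # Population variance: E[x^2] - (E[x])^2, floored at zero against rounding.
    variance = max(sum_sq / count - mean ** 2, 0.0)
    return {
        "name": name,
        "unit": unit,
        "count": count,
        "min": minimum,
        "max": maximum,
        "mean": mean,
        "stddev": math.sqrt(variance),
    }


if __name__ == "__main__":
    sample = "req_timeout               6 samples [sec] 1 10 15 105"
    print(summarize_stats_line(sample))
```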
 355       </section>
 356     </section>
 357     <section remap="h3">
 358       <title><indexterm>
 359           <primary>proc</primary>
 360           <secondary>LNET</secondary>
 361         </indexterm><indexterm>
 362           <primary>LNET</primary>
 363           <secondary>proc</secondary>
 364         </indexterm>LNET Information</title>
 365       <para>This section describes <literal>/proc</literal> entries containing LNET information.
 366         These entries include:<itemizedlist>
 367           <listitem>
 368             <para><literal>/proc/sys/lnet/peers</literal> - Shows all NIDs known to this node and
 369               provides information on the queue state.</para>
 370             <para>Example:</para>
 371             <screen># cat /proc/sys/lnet/peers
 372 nid                refs   state  max  rtr  min   tx    min   queue
 373 0@lo               1      ~rtr   0    0    0     0     0     0
 374 192.168.10.35@tcp  1      ~rtr   8    8    8     8     6     0
 375 192.168.10.36@tcp  1      ~rtr   8    8    8     8     6     0
 376 192.168.10.37@tcp  1      ~rtr   8    8    8     8     6     0</screen>
 377             <para>The fields are explained in the table below:</para>
 378             <informaltable frame="all">
 379               <tgroup cols="2">
 380                 <colspec colname="c1" colwidth="30*"/>
 381                 <colspec colname="c2" colwidth="80*"/>
 382                 <thead>
 383                   <row>
 384                     <entry>
 385                       <para><emphasis role="bold">Field</emphasis></para>
 386                     </entry>
 387                     <entry>
 388                       <para><emphasis role="bold">Description</emphasis></para>
 389                     </entry>
 390                   </row>
 391                 </thead>
 392                 <tbody>
 393                   <row>
 394                     <entry>
 395                       <para>
 396                         <literal>
 397                           <replaceable>refs</replaceable>
 398                         </literal>
 399                       </para>
 400                     </entry>
 401                     <entry>
 402                       <para>A reference count (principally used for debugging).</para>
 403                     </entry>
 404                   </row>
 405                   <row>
 406                     <entry>
 407                       <para>
 408                         <literal>
 409                           <replaceable>state</replaceable>
 410                         </literal>
 411                       </para>
 412                     </entry>
 413                     <entry>
 414                       <para>Only valid to refer to routers. Possible values:</para>
 415                       <itemizedlist>
 416                         <listitem>
 417                           <para><literal>~rtr</literal> (indicates this node is not a router)</para>
 418                         </listitem>
 419                         <listitem>
 420                           <para><literal>up/down</literal> (indicates this node is a router)</para>
 421                         </listitem>
 422                         <listitem>
 423                           <para><literal>auto_fail</literal> (if enabled)</para>
 424                         </listitem>
 425                       </itemizedlist>
 426                     </entry>
 427                   </row>
 428                   <row>
 429                     <entry>
 430                       <para>
 431                         <literal> max </literal></para>
 432                     </entry>
 433                     <entry>
 434                       <para>Maximum number of concurrent sends from this peer.</para>
 435                     </entry>
 436                   </row>
 437                   <row>
 438                     <entry>
 439                       <para>
 440                         <literal> rtr </literal></para>
 441                     </entry>
 442                     <entry>
 443                       <para>Routing buffer credits.</para>
 444                     </entry>
 445                   </row>
 446                   <row>
 447                     <entry>
 448                       <para>
 449                         <literal> min </literal></para>
 450                     </entry>
 451                     <entry>
 452                       <para>Minimum routing buffer credits seen.</para>
 453                     </entry>
 454                   </row>
 455                   <row>
 456                     <entry>
 457                       <para>
 458                         <literal> tx </literal></para>
 459                     </entry>
 460                     <entry>
 461                       <para>Send credits.</para>
 462                     </entry>
 463                   </row>
 464                   <row>
 465                     <entry>
 466                       <para>
 467                         <literal> min </literal></para>
 468                     </entry>
 469                     <entry>
 470                       <para>Minimum send credits seen.</para>
 471                     </entry>
 472                   </row>
 473                   <row>
 474                     <entry>
 475                       <para>
 476                         <literal> queue </literal></para>
 477                     </entry>
 478                     <entry>
 479                       <para>Total bytes in active/queued sends.</para>
 480                     </entry>
 481                   </row>
 482                 </tbody>
 483               </tgroup>
 484             </informaltable>
 485             <para>Credits work like a semaphore. They are initialized to allow a certain number of
 486               operations (8 in the example above). LNET keeps track of the minimum value so that
 487               you can see how congested a resource is.</para>
 488             <para>A value of <literal>rtr/tx</literal> less than <literal>max</literal> indicates
 489               operations are in progress. The number of operations is equal to
 490                 <literal>rtr</literal> or <literal>tx</literal> subtracted from
 491                 <literal>max</literal>.</para>
 492             <para>A value of <literal>rtr/tx</literal> greater than <literal>max</literal> indicates
 493               operations are blocking.</para>
 494             <para>LNET also limits concurrent sends and router buffers allocated to a single peer so
 495               that no peer can occupy all these resources.</para>
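            <para>The credit arithmetic described above can be sketched as follows. The code
              assumes the nine-column layout of the example table (header line plus one row per
              NID) and reads the columns by position, because the <literal>min</literal> heading
              appears twice. The function name and result keys are hypothetical, and treating
              <literal>max</literal> minus the minimum as the peak number of credits in use is an
              interpretation made for this sketch.</para>

```python
def peer_credit_usage(peers_text):
    """Compute credits in use per NID from the text of the peers table."""
    usage = {}
    rows = peers_text.strip().splitlines()[1:]  # skip the header line
    for row in rows:
        nid, _refs, _state, limit, rtr, rtr_min, tx, tx_min, _queue = row.split()
        limit, rtr, rtr_min, tx, tx_min = (
            int(value) for value in (limit, rtr, rtr_min, tx, tx_min)
        )
        usage[nid] = {
            # Operations in progress: rtr or tx subtracted from max.
            "rtr_in_progress": limit - rtr,
            "tx_in_progress": limit - tx,
            # Largest number of credits seen in use (interpretation, see above).
            "rtr_peak": limit - rtr_min,
            "tx_peak": limit - tx_min,
        }
    return usage


if __name__ == "__main__":
    sample = """
nid                refs   state  max  rtr  min   tx    min   queue
0@lo               1      ~rtr   0    0    0     0     0     0
192.168.10.35@tcp  1      ~rtr   8    8    8     8     6     0
"""
    for nid, counters in peer_credit_usage(sample).items():
        print(nid, counters)
```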
 496           </listitem>
 497         </itemizedlist><itemizedlist>
 498           <listitem>
 499             <para><literal>/proc/sys/lnet/nis</literal> - Shows the current queue health on this
 500               node.</para>
 501             <para>Example:</para>
 502             <screen># cat /proc/sys/lnet/nis
 503 nid                    refs   peer    max   tx    min
 504 0@lo                   3      0       0     0     0
 505 192.168.10.34@tcp      4      8       256   256   252
 506 </screen>
 507             <para> The fields are explained below:</para>
 508             <informaltable frame="all">
 509               <tgroup cols="2">
 510                 <colspec colname="c1" colwidth="30*"/>
 511                 <colspec colname="c2" colwidth="80*"/>
 512                 <thead>
 513                   <row>
 514                     <entry>
 515                       <para><emphasis role="bold">Field</emphasis></para>
 516                     </entry>
 517                     <entry>
 518                       <para><emphasis role="bold">Description</emphasis></para>
 519                     </entry>
 520                   </row>
 521                 </thead>
 522                 <tbody>
 523                   <row>
 524                     <entry>
 525                       <para>
 526                         <literal> nid </literal></para>
 527                     </entry>
 528                     <entry>
 529                       <para>Network interface.</para>
 530                     </entry>
 531                   </row>
 532                   <row>
 533                     <entry>
 534                       <para>
 535                         <literal> refs </literal></para>
 536                     </entry>
 537                     <entry>
 538                       <para>Internal reference counter.</para>
 539                     </entry>
 540                   </row>
 541                   <row>
 542                     <entry>
 543                       <para>
 544                         <literal> peer </literal></para>
 545                     </entry>
 546                     <entry>
 547                       <para>Number of peer-to-peer send credits on this NID. Credits are used to
 548                         size buffer pools.</para>
 549                     </entry>
 550                   </row>
 551                   <row>
 552                     <entry>
 553                       <para>
 554                         <literal> max </literal></para>
 555                     </entry>
 556                     <entry>
 557                       <para>Total number of send credits on this NID.</para>
 558                     </entry>
 559                   </row>
 560                   <row>
 561                     <entry>
 562                       <para>
 563                         <literal> tx </literal></para>
 564                     </entry>
 565                     <entry>
 566                       <para>Current number of send credits available on this NID.</para>
 567                     </entry>
 568                   </row>
 569                   <row>
 570                     <entry>
 571                       <para>
 572                         <literal> min </literal></para>
 573                     </entry>
 574                     <entry>
 575                       <para>Lowest number of send credits available on this NID.</para>
 576                     </entry>
 577                   </row>
 578                   <row>
 579                     <entry>
 580                       <para>
 581                         <literal> queue </literal></para>
 582                     </entry>
 583                     <entry>
 584                       <para>Total bytes in active/queued sends.</para>
 585                     </entry>
 586                   </row>
 587                 </tbody>
 588               </tgroup>
 589             </informaltable>
            <para>Subtracting <literal>tx</literal> from <literal>max</literal>
              (<literal>max</literal> - <literal>tx</literal>) yields the number of sends
              currently active. A large or increasing number of active sends may indicate a
              problem.</para>
 593             <para>Example:</para>
 594             <screen># cat /proc/sys/lnet/nis
 595 nid                   refs       peer       max        tx         min
 596 0@lo                  2          0          0          0          0
 597 10.67.73.173@tcp      4          8          256        256        253
 598 </screen>
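            <para>The subtraction described above can be scripted. The following is a minimal
              sketch, not part of the Lustre tools: the function name
              <literal>lnet_active_sends</literal> is hypothetical, and the column positions are
              assumed to match the sample output above (<literal>max</literal> in the fourth
              column, <literal>tx</literal> in the fifth).</para>
            <screen><![CDATA[
```shell
# Hypothetical helper: read the output of `cat /proc/sys/lnet/nis` on
# standard input and print "nid active_sends" for each NID, where
# active_sends = max - tx. The header line is skipped.
# Assumed column order: nid refs peer max tx min
lnet_active_sends() {
    awk 'NR > 1 && NF >= 6 { print $1, $4 - $5 }'
}
```
]]></screen>
            <para>For example, <literal>cat /proc/sys/lnet/nis | lnet_active_sends</literal>
              prints one line per NID. For the sample output above, it would report 0 active sends
              on both NIDs.</para>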
 599           </listitem>
 600         </itemizedlist></para>
 601     </section>
 602     <section remap="h3">
 603       <title><indexterm>
 604           <primary>proc</primary>
 605           <secondary>free space</secondary>
 606         </indexterm>Free Space Distribution</title>
      <para>Free-space stripe weighting, as set, gives a priority of &quot;0&quot; to free space
        (versus trying to place the stripes &quot;widely&quot;, that is, distributed evenly across
        OSSs and OSTs to maximize network balancing). To adjust this priority as a percentage, use
        the <literal>/proc</literal> tunable <literal>qos_prio_free</literal>:</para>
 611       <screen>$ cat /proc/fs/lustre/lov/<replaceable>fsname</replaceable>-mdtlov/qos_prio_free</screen>
 612       <para>The default is 90%. You can permanently set this value by running this command on the
 613         MGS:</para>
 614       <screen>$ lctl conf_param <replaceable>fsname</replaceable>-MDT0000.lov.qos_prio_free=90</screen>
 615       <para>Setting the priority to 100% means that OSS distribution does not count in the
 616         weighting, but the stripe assignment is still done via weighting. If OST 2 has twice as much
 617         free space as OST 1, it is twice as likely to be used, but it is NOT guaranteed to be
 618         used.</para>
 619       <para>Also note that free-space stripe weighting does not activate until two OSTs are
 620         imbalanced by more than 20%. Until then, a faster round-robin stripe allocator is used. (The
 621         round-robin order also maximizes network balancing.)</para>
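      <para>As a rough illustration of the weighting described above, the sketch below computes each
        OST&apos;s share of stripe selections when only free space is considered (as with a
        priority of 100%). It is a simplification and not the MDS allocator: the function name
        <literal>ost_free_space_share</literal> is hypothetical, and any other factors the real
        allocator applies are ignored.</para>
      <screen><![CDATA[
```shell
# Hypothetical helper: print the approximate selection share (percent) of
# each OST when stripes are weighted purely by free space.
# Arguments: free space of OST 1, OST 2, ... (all in the same unit).
ost_free_space_share() {
    printf '%s\n' "$@" | awk '
        { free[NR] = $1; total += $1 }
        END {
            if (total == 0) exit 1   # nothing to weight
            for (i = 1; i <= NR; i++)
                printf "OST %d: %.1f%%\n", i, 100 * free[i] / total
        }'
}
```
]]></screen>
      <para>Running <literal>ost_free_space_share 100 200</literal> prints a share of 33.3% for OST
        1 and 66.7% for OST 2, so OST 2 is twice as likely to be chosen but is not guaranteed to be
        chosen.</para>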
 622       <section remap="h4">
 623         <title><indexterm>
 624             <primary>proc</primary>
 625             <secondary>striping</secondary>
 626           </indexterm>Managing Stripe Allocation</title>
 627         <para>The MDS uses two methods to manage stripe allocation and determine which OSTs to use
 628           for file object storage:</para>
 629         <itemizedlist>
 630           <listitem>
 631             <para><emphasis role="bold">QOS</emphasis></para>
 632             <para>Quality of Service (QOS) considers an OST&apos;s available blocks, speed, and the
 633               number of existing objects, etc. Using these criteria, the MDS selects OSTs with more
 634               free space more often than OSTs with less free space.</para>
 635           </listitem>
 636         </itemizedlist>
 637         <itemizedlist>
 638           <listitem>
 639             <para><emphasis role="bold">RR</emphasis></para>
 640             <para>Round-Robin (RR) allocates objects evenly across all OSTs. The RR stripe allocator
 641               is faster than QOS, and used often because it distributes space usage/load best in
 642               most situations, maximizing network balancing and improving performance.</para>
 643           </listitem>
 644         </itemizedlist>
 645         <para>Whether QOS or RR is used depends on the setting of the
 646             <literal>qos_threshold_rr</literal> proc tunable. The
 647             <literal>qos_threshold_rr</literal> variable specifies a percentage threshold where the
 648           use of QOS or RR becomes more/less likely. The <literal>qos_threshold_rr</literal> tunable
 649           can be set as an integer, from 0 to 100, and results in this stripe allocation
 650           behavior:</para>
 651         <itemizedlist>
 652           <listitem>
            <para>If <literal>qos_threshold_rr</literal> is set to 0, then QOS is always
              used.</para>
          </listitem>
          <listitem>
            <para>If <literal>qos_threshold_rr</literal> is set to 100, then RR is always
              used.</para>
          </listitem>
          <listitem>
            <para>The larger the <literal>qos_threshold_rr</literal> setting, the greater the
              possibility that RR is used instead of QOS.</para>
 663           </listitem>
 664         </itemizedlist>
 665       </section>
 666     </section>
 667   </section>
 668   <section xml:id="dbdoclet.50438271_78950">
 669     <title><indexterm>
 670         <primary>proc</primary>
 671         <secondary>I/O tunables</secondary>
 672       </indexterm>Lustre I/O Tunables</title>
 673     <para>This section describes I/O tunables.</para>
    <para><literal>llite.<replaceable>fsname-instance</replaceable>.max_cached_mb</literal></para>
 675     <screen>client# lctl get_param llite.lustre-ce63ca00.max_cached_mb
 676 128</screen>
 677     <para>This tunable is the maximum amount of inactive data cached by the client (default is 3/4
 678       of RAM).</para>
 679     <section remap="h3">
 680       <title><indexterm>
 681           <primary>proc</primary>
 682           <secondary>RPC tunables</secondary>
 683         </indexterm>Client I/O RPC Stream Tunables</title>
 684       <para>The Lustre engine always attempts to pack an optimal amount of data into each I/O RPC
 685         and attempts to keep a consistent number of issued RPCs in progress at a time. Lustre
 686         exposes several tuning variables to adjust behavior according to network conditions and
 687         cluster size. Each OSC has its own tree of these tunables. For example:</para>
 688       <screen>$ ls -d /proc/fs/lustre/osc/OSC_client_ost1_MNT_client_2 /localhost
 689 /proc/fs/lustre/osc/OSC_uml0_ost1_MNT_localhost
 690 /proc/fs/lustre/osc/OSC_uml0_ost2_MNT_localhost
 691 /proc/fs/lustre/osc/OSC_uml0_ost3_MNT_localhost
 692 $ ls /proc/fs/lustre/osc/OSC_uml0_ost1_MNT_localhost
blocksize filesfree max_dirty_mb ost_server_uuid stats</screen>
 694       <para>... and so on.</para>
 695       <para>RPC stream tunables are described below.</para>
 696       <para>
 697         <itemizedlist>
 698           <listitem xml:id="lustreproc.maxdirtymb">
            <para><literal>osc.<replaceable>osc_instance</replaceable>.max_dirty_mb</literal> - This
              tunable controls how many MBs of dirty data can be written and queued up in the
              OSC. POSIX file writes that are cached contribute to this count.
              When the limit is reached, additional writes stall until previously-cached writes are
              written to the server. This may be changed by writing a single ASCII integer to the
              file. Only values between 0 and 2048 or 1/4 of RAM are allowable. If 0 is given, no
              writes are cached, and performance suffers noticeably unless you use large writes
              (1 MB or more).</para>
 707           </listitem>
 708           <listitem>
            <para><literal>osc.<replaceable>osc_instance</replaceable>.cur_dirty_bytes</literal> -
              This tunable is a read-only value that returns the current number of bytes written
              and cached on this OSC.</para>
 712           </listitem>
 713           <listitem>
 714             <para><literal>osc.<replaceable>osc_instance</replaceable>.max_pages_per_rpc</literal> -
 715               This tunable is the maximum number of pages that will undergo I/O in a single RPC to
 716               the OST. The minimum is a single page and the maximum for this setting is 1024 (for
 717               systems with 4kB <literal>PAGE_SIZE</literal>), with the default maximum of 1MB in the
 718               RPC. It is also possible to specify a units suffix (e.g. <literal>4M</literal>), so
 719               that the RPC size can be specified independently of the client
 720                 <literal>PAGE_SIZE</literal>.</para>
 721           </listitem>
 722           <listitem>
            <para><literal>osc.<replaceable>osc_instance</replaceable>.max_rpcs_in_flight</literal>
              - This tunable is the maximum number of concurrent RPCs in flight from an OSC to its
              OST. If the OSC tries to initiate an RPC but finds that it already has that many
              RPCs outstanding, it waits to issue further RPCs until some complete. The
              minimum setting is 1 and the maximum setting is 256. If you are looking to improve
              small file I/O performance, increase the <literal>max_rpcs_in_flight</literal>
              value.</para>
 729           </listitem>
 730         </itemizedlist>
 731       </para>
 732       <para>To maximize performance, the value for <literal>max_dirty_mb</literal> is recommended to
 733         be 4 * <literal>max_pages_per_rpc</literal> * <literal>max_rpcs_in_flight</literal>.</para>
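      <para>The recommendation above can be worked through as a small calculation. The sketch below
        is illustrative only: the function name <literal>recommended_max_dirty_mb</literal> is
        hypothetical, and it assumes that <literal>max_pages_per_rpc</literal> is first converted
        from pages to MB (using the client <literal>PAGE_SIZE</literal>, 4096 bytes by default
        here) so that the result is in MB.</para>
      <screen><![CDATA[
```shell
# Hypothetical helper: compute 4 * (RPC size in MB) * max_rpcs_in_flight.
# Arguments: max_pages_per_rpc, max_rpcs_in_flight, page size in bytes
# (optional, default 4096).
# Assumption: pages per RPC are converted to MB before multiplying.
recommended_max_dirty_mb() {
    pages=$1
    rpcs=$2
    page_size=${3:-4096}
    echo $(( 4 * pages * page_size * rpcs / 1048576 ))
}
```
]]></screen>
      <para>For instance, <literal>recommended_max_dirty_mb 256 8</literal> (1 MB RPCs, 8 RPCs in
        flight) prints <literal>32</literal>. The result must still fall within the range allowed
        for <literal>max_dirty_mb</literal>.</para>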
 734       <note>
 735         <para>The <literal><replaceable>osc_instance</replaceable></literal> is typically
 736               <literal><replaceable>fsname</replaceable>-OST<replaceable>ost_index</replaceable>-osc-<replaceable>mountpoint_instance</replaceable></literal>.
 737           The <literal><replaceable>mountpoint_instance</replaceable></literal> is a unique value
 738           per mount point to allow associating osc, mdc, lov, lmv, and llite parameters for the same
 739           mount point. For <literal><replaceable>osc_instance</replaceable></literal> examples,
 740           refer to the sample command output.</para>
 741       </note>
 742     </section>
 743     <section remap="h3">
 744       <title><indexterm>
 745           <primary>proc</primary>
 746           <secondary>watching RPC</secondary>
 747         </indexterm>Watching the Client RPC Stream</title>
 748       <para>The same directory contains an <literal>rpc_stats</literal> file with a histogram
 749         showing the composition of previous RPCs. The histogram can be cleared by writing any value
 750         into the <literal>rpc_stats</literal> file.</para>
 751       <screen># cat /proc/fs/lustre/osc/testfs-OST0000-osc-c45f9c00/rpc_stats
 752 snapshot_time:                       1174867307.156604 (secs.usecs)
 753 read RPCs in flight:                 0
 754 write RPCs in flight:                0
 755 pending write pages:                 0
 756 pending read pages:                  0
 757                 read                                write
 758 pages per rpc   rpcs  %   cum   %    |   rpcs   %   cum     %
 759 1:              0     0   0          |   0          0       0
 760
 761                 read                                write
 762 rpcs in flight  rpcs  %   cum   %    |   rpcs   %   cum     %
 763 0:              0     0   0          |   0          0       0
 764
 765                 read                                write
 766 offset          rpcs  %   cum   %    |   rpcs   %   cum     %
 767 0:              0     0   0          |   0          0       0
 768
 769
 770 # cat /proc/fs/lustre/osc/testfs-OST0000-osc-ffff810058d2f800/rpc_stats
 771 snapshot_time:            1372786692.389858 (secs.usecs)
 772 read RPCs in flight:      0
 773 write RPCs in flight:     1
 774 dio read RPCs in flight:  0
 775 dio write RPCs in flight: 0
 776 pending write pages:      256
 777 pending read pages:       0
 778
 779                      read                   write
 780 pages per rpc   rpcs   % cum % |       rpcs   % cum %
 781 1:                 0   0   0   |          0   0   0
 782 2:                 0   0   0   |          1   0   0
 783 4:                 0   0   0   |          0   0   0
 784 8:                 0   0   0   |          0   0   0
 785 16:                0   0   0   |          0   0   0
 786 32:                0   0   0   |          2   0   0
 787 64:                0   0   0   |          2   0   0
 788 128:               0   0   0   |          5   0   0
 789 256:             850 100 100   |      18346  99 100
 790
 791                      read                   write
 792 rpcs in flight  rpcs   % cum % |       rpcs   % cum %
 793 0:               691  81  81   |       1740   9   9
 794 1:                48   5  86   |        938   5  14
 795 2:                29   3  90   |       1059   5  20
 796 3:                17   2  92   |       1052   5  26
 797 4:                13   1  93   |        920   5  31
 798 5:                12   1  95   |        425   2  33
 799 6:                10   1  96   |        389   2  35
 800 7:                30   3 100   |      11373  61  97
 801 8:                 0   0 100   |        460   2 100
 802
 803                      read                   write
 804 offset          rpcs   % cum % |       rpcs   % cum %
 805 0:               850 100 100   |      18347  99  99
 806 1:                 0   0 100   |          0   0  99
 807 2:                 0   0 100   |          0   0  99
 808 4:                 0   0 100   |          0   0  99
 809 8:                 0   0 100   |          0   0  99
 810 16:                0   0 100   |          1   0  99
 811 32:                0   0 100   |          1   0  99
 812 64:                0   0 100   |          3   0  99
 813 128:               0   0 100   |          4   0 100
 814
 815 </screen>
 816       <para>Where:</para>
 817       <informaltable frame="all">
 818         <tgroup cols="2">
 819           <colspec colname="c1" colwidth="40*"/>
 820           <colspec colname="c2" colwidth="60*"/>
 821           <thead>
 822             <row>
 823               <entry>
 824                 <para><emphasis role="bold">Field</emphasis></para>
 825               </entry>
 826               <entry>
 827                 <para><emphasis role="bold">Description</emphasis></para>
 828               </entry>
 829             </row>
 830           </thead>
 831           <tbody>
 832             <row>
 833               <entry>
 834                 <para> {read,write} RPCs in flight</para>
 835               </entry>
 836               <entry>
 837                 <para>Number of read/write RPCs issued by the OSC, but not complete at the time of
 838                   the snapshot. This value should always be less than or equal to
 839                     <literal>max_rpcs_in_flight</literal>.</para>
 840               </entry>
 841             </row>
 842             <row>
 843               <entry>
 844                 <para> pending {read,write} pages</para>
 845               </entry>
 846               <entry>
 847                 <para>Number of pending read/write pages that have been queued for I/O in the
 848                   OSC.</para>
 849               </entry>
 850             </row>
 851             <row>
 852               <entry>dio {read,write} RPCs in flight</entry>
 853               <entry>Direct I/O (as opposed to block I/O) read/write RPCs issued but not completed
 854                 at the time of the snapshot.</entry>
 855             </row>
 856             <row>
 857               <entry>
 858                 <para> pages per RPC</para>
 859               </entry>
 860               <entry>
                <para>When an RPC is sent, the number of pages it consists of is recorded (in
                  order). A single page RPC increments the <literal>1:</literal> row.</para>
 863               </entry>
 864             </row>
 865             <row>
 866               <entry>
 867                 <para> RPCs in flight</para>
 868               </entry>
 869               <entry>
                <para>When an RPC is sent, the number of other RPCs that are pending is recorded.
                  When the first RPC is sent, the <literal>0:</literal> row is incremented. If an
                  RPC is sent while another is pending, the <literal>1:</literal> row is
                  incremented, and so on. The number of pending RPCs is not tabulated when an RPC
                  <emphasis>completes</emphasis>.</para>
 875                 <para>This table is a good way to visualize the concurrency of the RPC stream.
 876                   Ideally, you will see a large clump around the
 877                     <literal>max_rpcs_in_flight</literal> value, which shows that the network is
 878                   being kept busy.</para>
 879               </entry>
 880             </row>
 881             <row>
 882               <entry>
 883                 <para> offset</para>
 884               </entry>
 885               <entry>
 886                 <para> </para>
 887               </entry>
 888             </row>
 889           </tbody>
 890         </tgroup>
 891       </informaltable>
      <para>Each row in the table shows the number of read or write RPCs recorded for the statistic
        (rpcs), the relative percentage of total reads or writes (%), and the cumulative percentage
        to that point in the table for the statistic (cum %).</para>
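      <para>To make the relationship between the three columns concrete, the sketch below recomputes
        the <literal>%</literal> and <literal>cum %</literal> values from a list of per-row RPC
        counts. The function name <literal>rpc_histogram_percent</literal> is hypothetical, and
        truncating percentages to whole numbers is an assumption chosen because it is consistent
        with the sample output above.</para>
      <screen><![CDATA[
```shell
# Hypothetical helper: read "bucket count" pairs on standard input and
# print "bucket count % cum%" for each row.
# Assumption: percentages are truncated to integers.
rpc_histogram_percent() {
    awk '
        { bucket[NR] = $1; count[NR] = $2; total += $2 }
        END {
            for (i = 1; i <= NR; i++) {
                cum += count[i]
                pct = total ? int(100 * count[i] / total) : 0
                cumpct = total ? int(100 * cum / total) : 0
                print bucket[i], count[i], pct, cumpct
            }
        }'
}
```
]]></screen>
      <para>Feeding it the rows <literal>1: 1</literal> and <literal>2: 3</literal> prints
        <literal>1: 1 25 25</literal> followed by <literal>2: 3 75 100</literal>.</para>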
 895     </section>
 896     <section remap="h3">
 897       <title><indexterm>
 898           <primary>proc</primary>
 899           <secondary>read/write survey</secondary>
 900         </indexterm>Client Read-Write Offset Survey</title>
      <para>The <literal>rw_offset_stats</literal> parameter maintains statistics for occurrences
        where a series of read or write calls from a process did not access the next sequential
        location. The offset field is reset to 0 (zero) whenever a different file is
        read/written.</para>
      <para>Read/write offset statistics are off by default. The statistics can be activated by
        writing anything into the <literal>rw_offset_stats</literal> file.</para>
 906       <para>Example:</para>
 907       <screen># cat /proc/fs/lustre/llite/lustre-f57dee00/rw_offset_stats
 908 snapshot_time: 1155748884.591028 (secs.usecs)
 909              RANGE   RANGE    SMALLEST   LARGEST
 910 R/W   PID    START   END      EXTENT     EXTENT    OFFSET
 911 R     8385   0       128      128        128       0
 912 R     8385   0       224      224        224       -128
 913 W     8385   0       250      50         100       0
 914 W     8385   100     1110     10         500       -150
 915 W     8384   0       5233     5233       5233      0
 916 R     8385   500     600      100        100       -610</screen>
 917       <para>Where:</para>
 918       <informaltable frame="all">
 919         <tgroup cols="2">
 920           <colspec colname="c1" colwidth="50*"/>
 921           <colspec colname="c2" colwidth="50*"/>
 922           <thead>
 923             <row>
 924               <entry>
 925                 <para><emphasis role="bold">Field</emphasis></para>
 926               </entry>
 927               <entry>
 928                 <para><emphasis role="bold">Description</emphasis></para>
 929               </entry>
 930             </row>
 931           </thead>
 932           <tbody>
 933             <row>
 934               <entry>
 935                 <para> R/W</para>
 936               </entry>
 937               <entry>
 938                 <para>Whether the non-sequential call was a read or write</para>
 939               </entry>
 940             </row>
 941             <row>
 942               <entry>
 943                 <para> PID </para>
 944               </entry>
 945               <entry>
 946                 <para>Process ID which made the read/write call.</para>
 947               </entry>
 948             </row>
 949             <row>
 950               <entry>
 951                 <para> Range Start/Range End</para>
 952               </entry>
 953               <entry>
 954                 <para>Range in which the read/write calls were sequential.</para>
 955               </entry>
 956             </row>
 957             <row>
 958               <entry>
 959                 <para> Smallest Extent </para>
 960               </entry>
 961               <entry>
 962                 <para>Smallest extent (single read/write) in the corresponding range.</para>
 963               </entry>
 964             </row>
 965             <row>
 966               <entry>
 967                 <para> Largest Extent </para>
 968               </entry>
 969               <entry>
 970                 <para>Largest extent (single read/write) in the corresponding range.</para>
 971               </entry>
 972             </row>
 973             <row>
 974               <entry>
 975                 <para> Offset </para>
 976               </entry>
 977               <entry>
 978                 <para>Difference between the previous range end and the current range start.</para>
                <para>For example, the fourth row of the sample output indicates that the writes
                  in the range 100 to 1110 were sequential, with a minimum write of 10 and a
                  maximum write of 500. This range was started with an offset of -150, which is
                  the difference between the previous entry&apos;s range end (250) and this
                  entry&apos;s range start (100) for the same file.</para>
 984                 <para>The <literal>rw_offset_stats</literal> file can be cleared by writing to
 985                   it:</para>
 986                 <para><literal>lctl set_param llite.*.rw_offset_stats=0</literal></para>
 987               </entry>
 988             </row>
 989           </tbody>
 990         </tgroup>
 991       </informaltable>
 992     </section>
 993     <section xml:id="lustreproc.clientstats" remap="h3">
 994       <title><indexterm>
 995           <primary>proc</primary>
 996           <secondary>client stats</secondary>
 997         </indexterm>Client Statistics </title>
 998       <para>The <literal>stats</literal> parameter maintains statistics of activity across the VFS
 999         interface of the Lustre file system. Only non-zero parameters are displayed in the file.
1000         This section describes the statistics that accumulate during typical operation of a
1001         client.</para>
1002       <para>Client statistics are enabled by default. The statistics can be cleared by echoing an
1003         empty string into the <literal>stats</literal> file or by using the command: <literal>lctl
1004           set_param llite.*.stats=0</literal>. Statistics for an individual file system can be
1005         displayed, for example, as shown below:</para>
1006       <screen>client# lctl get_param llite.*.stats
1007 snapshot_time          1308343279.169704 secs.usecs
1008 dirty_pages_hits       14819716 samples [regs]
1009 dirty_pages_misses     81473472 samples [regs]
1010 read_bytes             36502963 samples [bytes] 1 26843582 55488794
1011 write_bytes            22985001 samples [bytes] 0 125912 3379002
1012 brw_read               2279 samples [pages] 1 1 2270
1013 ioctl                  186749 samples [regs]
1014 open                   3304805 samples [regs]
1015 close                  3331323 samples [regs]
1016 seek                   48222475 samples [regs]
1017 fsync                  963 samples [regs]
1018 truncate               9073 samples [regs]
1019 setxattr               19059 samples [regs]
1020 getxattr               61169 samples [regs]
1021 </screen>
1022       <note>
1023         <para>Statistics for all mounted file systems can be discovered by issuing the
1024             <literal>lctl</literal> command <literal>lctl get_param llite.*.stats</literal></para>
1025       </note>
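      <para>The byte counters in this output can be summarized further. The sketch below derives the
        mean request size for each <literal>[bytes]</literal> counter by dividing its sum by its
        sample count. The function name <literal>llite_mean_bytes</literal> is hypothetical, and
        the field layout it assumes (sample count, then min, max, and sum) follows the sample
        output above and the field descriptions in the table below.</para>
      <screen><![CDATA[
```shell
# Hypothetical helper: read llite stats output on standard input and print
# "name mean_bytes" for every [bytes] counter (sum divided by samples).
# Assumed line layout: name count samples [bytes] min max sum
llite_mean_bytes() {
    awk '$3 == "samples" && $4 == "[bytes]" && $2 > 0 {
        printf "%s %.2f\n", $1, $7 / $2
    }'
}
```
]]></screen>
      <para>For example, <literal>lctl get_param llite.*.stats | llite_mean_bytes</literal> prints
        one line each for <literal>read_bytes</literal> and <literal>write_bytes</literal> when
        those counters are present.</para>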
1026       <informaltable frame="all">
1027         <tgroup cols="2">
1028           <colspec colname="c1" colwidth="3*"/>
1029           <colspec colname="c2" colwidth="7*"/>
1030           <thead>
1031             <row>
1032               <entry>
1033                 <para><emphasis role="bold">Field</emphasis></para>
1034               </entry>
1035               <entry>
1036                 <para><emphasis role="bold">Description</emphasis></para>
1037               </entry>
1038             </row>
1039           </thead>
1040           <tbody>
1041             <row>
1042               <entry>
1043                 <para>
1044                   <literal>snapshot_time</literal></para>
1045               </entry>
1046               <entry>
                <para>UNIX epoch time at which the stats file was read.</para>
1048               </entry>
1049             </row>
1050             <row>
1051               <entry>
1052                 <para>
                  <literal>dirty_pages_hits</literal></para>
1054               </entry>
1055               <entry>
1056                 <para>A count of the number of write operations that have been satisfied by the
1057                   dirty page cache. See <xref xmlns:xlink="http://www.w3.org/1999/xlink"
1058                     linkend="lustreproc.maxdirtymb"/> for dirty cache behavior in a Lustre file
1059                   system.</para>
1060               </entry>
1061             </row>
1062             <row>
1063               <entry>
1064                 <para>
                  <literal>dirty_pages_misses</literal></para>
1066               </entry>
1067               <entry>
1068                 <para>A count of the number of write operations that were not satisfied by the dirty
1069                   page cache.</para>
1070               </entry>
1071             </row>
1072             <row>
1073               <entry>
1074                 <para>
1075                   <literal>read_bytes</literal></para>
1076               </entry>
1077               <entry>
1078                 <para>A count of the number of read operations that have occurred (samples). Three
1079                   additional parameters are given:</para>
1080                 <variablelist>
1081                   <varlistentry>
1082                     <term>min</term>
1083                     <listitem>
1084                       <para>The minimum number of bytes read in a single request since the counter
1085                         was reset.</para>
1086                     </listitem>
1087                   </varlistentry>
1088                   <varlistentry>
1089                     <term>max</term>
1090                     <listitem>
1091                       <para>The maximum number of bytes read in a single request since the counter
1092                         was reset.</para>
1093                     </listitem>
1094                   </varlistentry>
1095                   <varlistentry>
1096                     <term>sum</term>
1097                     <listitem>
1098                       <para>The accumulated sum of bytes of all read requests since the counter was
1099                         reset.</para>
1100                     </listitem>
1101                   </varlistentry>
1102                 </variablelist>
1103               </entry>
1104             </row>
1105             <row>
1106               <entry>
1107                 <para>
1108                   <literal>write_bytes</literal></para>
1109               </entry>
1110               <entry>
1111                 <para>A count of the number of write operations that have occurred (samples). Three
1112                   additional parameters are given:</para>
1113                 <variablelist>
1114                   <varlistentry>
1115                     <term>min</term>
1116                     <listitem>
1117                       <para>The minimum number of bytes written in a single request since the
1118                         counter was reset.</para>
1119                     </listitem>
1120                   </varlistentry>
1121                   <varlistentry>
1122                     <term>max</term>
1123                     <listitem>
1124                       <para>The maximum number of bytes written in a single request since the
1125                         counter was reset.</para>
1126                     </listitem>
1127                   </varlistentry>
1128                   <varlistentry>
1129                     <term>sum</term>
1130                     <listitem>
1131                       <para>The accumulated sum of bytes of all write requests since the counter was
1132                         reset.</para>
1133                     </listitem>
1134                   </varlistentry>
1135                 </variablelist>
1136               </entry>
1137             </row>
1138             <row>
1139               <entry>
1140                 <para>
1141                   <literal>brw_read</literal></para>
1142               </entry>
1143               <entry>
1144                 <para>A count of the number of pages that have been read.</para>
1145                 <warning>
1146                   <para><literal>brw_</literal> stats are only tallied when the lloop device driver
1147                     is present. lloop device is not currently supported.</para>
1148                 </warning>
1149                 <para>Three additional parameters are given:</para>
1150                 <variablelist>
                  <varlistentry>
                    <term>min</term>
                    <listitem>
                      <para>The minimum number of pages read in a single brw read request since the
                        counter was reset.</para>
                    </listitem>
                  </varlistentry>
                  <varlistentry>
                    <term>max</term>
                    <listitem>
                      <para>The maximum number of pages read in a single brw read request since the
                        counter was reset.</para>
                    </listitem>
                  </varlistentry>
                  <varlistentry>
                    <term>sum</term>
                    <listitem>
                      <para>The accumulated sum of pages of all brw read requests since the counter
                        was reset.</para>
                    </listitem>
                  </varlistentry>
1172                 </variablelist>
1173               </entry>
1174             </row>
1175             <row>
1176               <entry>
1177                 <para>
1178                   <literal>ioctl</literal></para>
1179               </entry>
1180               <entry>
                <para>A count of the combined file and directory <literal>ioctl</literal>
                  operations.</para>
1183               </entry>
1184             </row>
1185             <row>
1186               <entry>
1187                 <para>
1188                   <literal>open</literal></para>
1189               </entry>
1190               <entry>
1191                 <para>A count of the number of open operations that have succeeded.</para>
1192               </entry>
1193             </row>
1194             <row>
1195               <entry>
1196                 <para>
1197                   <literal>close</literal></para>
1198               </entry>
1199               <entry>
1200                 <para>A count of the number of close operations that have succeeded.</para>
1201               </entry>
1202             </row>
1203             <row>
1204               <entry>
1205                 <para>
1206                   <literal>seek</literal></para>
1207               </entry>
1208               <entry>
1209                 <para>A count of the number of times <literal>seek</literal> has been called.</para>
1210               </entry>
1211             </row>
1212             <row>
1213               <entry>
1214                 <para>
1215                   <literal>fsync</literal></para>
1216               </entry>
1217               <entry>
1218                 <para>A count of the number of times <literal>fsync</literal> has been
1219                   called.</para>
1220               </entry>
1221             </row>
1222             <row>
1223               <entry>
1224                 <para>
1225                   <literal>truncate</literal></para>
1226               </entry>
1227               <entry>
1228                 <para>A count of the total number of calls to both locked and lockless
1229                   truncate.</para>
1230               </entry>
1231             </row>
1232             <row>
1233               <entry>
1234                 <para>
1235                   <literal>setxattr</literal></para>
1236               </entry>
1237               <entry>
1238                 <para>A count of the number of times <literal>ll_setxattr</literal> has been
1239                   called.</para>
1240               </entry>
1241             </row>
1242             <row>
1243               <entry>
1244                 <para>
1245                   <literal>getxattr</literal></para>
1246               </entry>
1247               <entry>
1248                 <para>A count of the number of times <literal>ll_getxattr</literal> has been
1249                   called.</para>
1250               </entry>
1251             </row>
1252           </tbody>
1253         </tgroup>
1254       </informaltable>
1255     </section>
1256     <section remap="h3">
1257       <title><indexterm>
1258           <primary>proc</primary>
1259           <secondary>read/write survey</secondary>
1260         </indexterm>Client Read-Write Extents Survey</title>
1261       <para><emphasis role="bold">Client-Based I/O Extent Size Survey</emphasis></para>
      <para>The <literal>extents_stats</literal> histogram in the <literal>llite</literal>
        directory shows statistics for the sizes of read and write I/O extents. This file does not
        maintain per-process statistics.</para>
1265       <para>Example:</para>
1266       <screen>client# lctl get_param llite.testfs-*.extents_stats
1267 snapshot_time:                     1213828728.348516 (secs.usecs)
1268                        read           |            write
1269 extents          calls  %      cum%   |     calls  %     cum%
1270
1271 0K - 4K :        0      0      0      |     2      2     2
1272 4K - 8K :        0      0      0      |     0      0     2
1273 8K - 16K :       0      0      0      |     0      0     2
1274 16K - 32K :      0      0      0      |     20     23    26
1275 32K - 64K :      0      0      0      |     0      0     26
1276 64K - 128K :     0      0      0      |     51     60    86
1277 128K - 256K :    0      0      0      |     0      0     86
1278 256K - 512K :    0      0      0      |     0      0     86
1279 512K - 1024K :   0      0      0      |     0      0     86
1280 1M - 2M :        0      0      0      |     11     13    100</screen>
1281       <para>The file can be cleared by issuing the following command:</para>
1282       <screen>client# lctl set_param llite.testfs-*.extents_stats=0</screen>
1283       <para><emphasis role="bold">Per-Process Client I/O Statistics</emphasis></para>
      <para>The <literal>extents_stats_per_process</literal> file maintains the I/O extent size
        statistics on a per-process basis, so you can track the statistics for each of the last
          <literal>MAX_PER_PROCESS_HIST</literal> processes.</para>
1287       <para>Example:</para>
      <screen>client# lctl get_param llite.testfs-*.extents_stats_per_process
1289 snapshot_time:                     1213828762.204440 (secs.usecs)
1290                           read            |             write
1291 extents            calls   %      cum%    |      calls   %       cum%
1292
1293 PID: 11488
1294    0K - 4K :       0       0       0      |      0       0       0
1295    4K - 8K :       0       0       0      |      0       0       0
1296    8K - 16K :      0       0       0      |      0       0       0
1297    16K - 32K :     0       0       0      |      0       0       0
1298    32K - 64K :     0       0       0      |      0       0       0
1299    64K - 128K :    0       0       0      |      0       0       0
1300    128K - 256K :   0       0       0      |      0       0       0
1301    256K - 512K :   0       0       0      |      0       0       0
1302    512K - 1024K :  0       0       0      |      0       0       0
1303    1M - 2M :       0       0       0      |      10      100     100
1304
1305 PID: 11491
1306    0K - 4K :       0       0       0      |      0       0       0
1307    4K - 8K :       0       0       0      |      0       0       0
1308    8K - 16K :      0       0       0      |      0       0       0
1309    16K - 32K :     0       0       0      |      20      100     100
1310
1311 PID: 11424
1312    0K - 4K :       0       0       0      |      0       0       0
1313    4K - 8K :       0       0       0      |      0       0       0
1314    8K - 16K :      0       0       0      |      0       0       0
1315    16K - 32K :     0       0       0      |      0       0       0
1316    32K - 64K :     0       0       0      |      0       0       0
1317    64K - 128K :    0       0       0      |      16      100     100
1318
1319 PID: 11426
1320    0K - 4K :       0       0       0      |      1       100     100
1321
1322 PID: 11429
1323    0K - 4K :       0       0       0      |      1       100     100
1324
1325 </screen>
      <para>Each row in the table shows the number of reads or writes occurring for the statistic
        (calls), the relative percentage of total reads or writes (%), and the cumulative
        percentage to that point in the table for the statistic (cum%).</para>
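      <para>As an illustration of how the three columns relate, the following Python sketch
        recomputes the <literal>%</literal> and <literal>cum%</literal> columns from the write call
        counts in the <literal>extents_stats</literal> example above. The function name is
        hypothetical, and truncating percentages to whole numbers is an assumption chosen because it
        reproduces the sample output; it is not taken from the Lustre source.</para>
      <programlisting language="python">
```python
def histogram_percentages(calls):
    """Return (percent, cumulative percent) pairs for a list of bucket counts."""
    total = sum(calls)
    rows = []
    running = 0
    for count in calls:
        running += count
        if total == 0:
            rows.append((0, 0))
        else:
            # Assumption: percentages are truncated to integers, as in the sample.
            rows.append((100 * count // total, 100 * running // total))
    return rows


# Write call counts for the buckets 0K-4K through 1M-2M in the example above.
write_calls = [2, 0, 0, 20, 0, 51, 0, 0, 0, 11]
for percent, cumulative in histogram_percentages(write_calls):
    print(percent, cumulative)
```
</programlisting>
      <para>With these counts, the 16K - 32K bucket holds 20 of 84 calls, which gives 23 in the
          <literal>%</literal> column and 26 in the <literal>cum%</literal> column.</para>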
1329     </section>
1330     <section xml:id="dbdoclet.50438271_55057">
1331       <title><indexterm>
1332           <primary>proc</primary>
1333           <secondary>block I/O</secondary>
1334         </indexterm>Watching the OST Block I/O Stream</title>
      <para>Similarly, the <literal>brw_stats</literal> histogram in the obdfilter directory shows
        statistics for the number of I/O requests sent to the disk, their size, and whether they are
        contiguous on the disk.</para>
      <screen>oss# lctl get_param obdfilter.testfs-OST0000.brw_stats
1364 snapshot_time:         1372775039.769045 (secs.usecs)
1365
1366                            read      |      write
1367 pages per bulk r/w     rpcs  % cum % |  rpcs   % cum %
1368 1:                     108 100 100   |    39   0   0
1369 2:                       0   0 100   |     6   0   0
1370 4:                       0   0 100   |     1   0   0
1371 8:                       0   0 100   |     0   0   0
1372 16:                      0   0 100   |     4   0   0
1373 32:                      0   0 100   |    17   0   0
1374 64:                      0   0 100   |    12   0   0
1375 128:                     0   0 100   |    24   0   0
1376 256:                     0   0 100   | 23142  99 100
1377
1378                            read      |      write
1379 discontiguous pages    rpcs  % cum % |  rpcs   % cum %
1380 0:                     108 100 100   | 23245 100 100
1381
1382                            read      |      write
1383 discontiguous blocks   rpcs  % cum % |  rpcs   % cum %
1384 0:                     108 100 100   | 23243  99  99
1385 1:                       0   0 100   |     2   0 100
1386
1387                            read      |      write
1388 disk fragmented I/Os   ios   % cum % |   ios   % cum %
1389 0:                      94  87  87   |     0   0   0
1390 1:                      14  12 100   | 23243  99  99
1391 2:                       0   0 100   |     2   0 100
1392
1393                            read      |      write
1394 disk I/Os in flight    ios   % cum % |   ios   % cum %
1395 1:                      14 100 100   | 20896  89  89
1396 2:                       0   0 100   |  1071   4  94
1397 3:                       0   0 100   |   573   2  96
1398 4:                       0   0 100   |   300   1  98
1399 5:                       0   0 100   |   166   0  98
1400 6:                       0   0 100   |   108   0  99
1401 7:                       0   0 100   |    81   0  99
1402 8:                       0   0 100   |    47   0  99
1403 9:                       0   0 100   |     5   0 100
1404
1405                            read      |      write
1406 I/O time (1/1000s)     ios   % cum % |   ios   % cum %
1407 1:                      94  87  87   |     0   0   0
1408 2:                       0   0  87   |     7   0   0
1409 4:                      14  12 100   |    27   0   0
1410 8:                       0   0 100   |    14   0   0
1411 16:                      0   0 100   |    31   0   0
1412 32:                      0   0 100   |    38   0   0
1413 64:                      0   0 100   | 18979  81  82
1414 128:                     0   0 100   |   943   4  86
1415 256:                     0   0 100   |  1233   5  91
1416 512:                     0   0 100   |  1825   7  99
1417 1K:                      0   0 100   |   99   0  99
1418 2K:                      0   0 100   |     0   0  99
1419 4K:                      0   0 100   |     0   0  99
1420 8K:                      0   0 100   |    49   0 100
1421
1422                            read      |      write
1423 disk I/O size          ios   % cum % |   ios   % cum %
1424 4K:                     14 100 100   |    41   0   0
1425 8K:                      0   0 100   |     6   0   0
1426 16K:                     0   0 100   |     1   0   0
1427 32K:                     0   0 100   |     0   0   0
1428 64K:                     0   0 100   |     4   0   0
1429 128K:                    0   0 100   |    17   0   0
1430 256K:                    0   0 100   |    12   0   0
1431 512K:                    0   0 100   |    24   0   0
1432 1M:                      0   0 100   | 23142  99 100
1433 </screen>
1434       <para>The fields are explained below:</para>
1435       <informaltable frame="all">
1436         <tgroup cols="2">
1437           <colspec colname="c1" colwidth="50*"/>
1438           <colspec colname="c2" colwidth="50*"/>
1439           <thead>
1440             <row>
1441               <entry>
1442                 <para><emphasis role="bold">Field</emphasis></para>
1443               </entry>
1444               <entry>
1445                 <para><emphasis role="bold">Description</emphasis></para>
1446               </entry>
1447             </row>
1448           </thead>
1449           <tbody>
1450             <row>
1451               <entry>
1452                 <para>
1453                   <literal>pages per bulk r/w</literal></para>
1454               </entry>
1455               <entry>
1456                 <para>Number of pages per RPC request, which should match aggregate client
1457                     <literal>rpc_stats</literal>.</para>
1458               </entry>
1459             </row>
1460             <row>
1461               <entry>
1462                 <para>
1463                   <literal>discontiguous pages</literal></para>
1464               </entry>
1465               <entry>
1466                 <para>Number of discontinuities in the logical file offset of each page in a single
1467                   RPC.</para>
1468               </entry>
1469             </row>
1470             <row>
1471               <entry>
1472                 <para>
1473                   <literal>discontiguous blocks</literal></para>
1474               </entry>
1475               <entry>
1476                 <para>Number of discontinuities in the physical block allocation in the file system
1477                   for a single RPC.</para>
1478               </entry>
1479             </row>
1480             <row>
1481               <entry>
1482                 <para><literal>disk fragmented I/Os</literal></para>
1483               </entry>
1484               <entry>
1485                 <para>Number of I/Os that were not written entirely sequentially.</para>
1486               </entry>
1487             </row>
1488             <row>
1489               <entry>
1490                 <para><literal>disk I/Os in flight</literal></para>
1491               </entry>
1492               <entry>
1493                 <para>Number of disk I/Os currently pending.</para>
1494               </entry>
1495             </row>
1496             <row>
1497               <entry>
1498                 <para><literal>I/O time (1/1000s)</literal></para>
1499               </entry>
1500               <entry>
1501                 <para>Amount of time for each I/O operation to complete.</para>
1502               </entry>
1503             </row>
1504             <row>
1505               <entry>
1506                 <para><literal>disk I/O size</literal></para>
1507               </entry>
1508               <entry>
1509                 <para>Size of each I/O operation.</para>
1510               </entry>
1511             </row>
1512           </tbody>
1513         </tgroup>
1514       </informaltable>
1515       <para>Each row in the table shows the number of reads or writes occurring for the statistic
1516         (ios), the relative percentage of total reads or writes (%), and the cumulative percentage
1517         to that point in the table for the statistic (cum %).</para>
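      <para>A histogram in this format is straightforward to post-process. The Python sketch below
        parses the <literal>pages per bulk r/w</literal> histogram from the example output and
        reports how many write RPCs carried the largest number of pages. The function name and the
        parsing rules (data rows contain both a colon and a vertical bar) are assumptions made for
        this illustration, not part of any Lustre tool.</para>
      <programlisting language="python">
```python
SAMPLE = """\
                           read      |      write
pages per bulk r/w     rpcs  % cum % |  rpcs   % cum %
1:                     108 100 100   |    39   0   0
2:                       0   0 100   |     6   0   0
4:                       0   0 100   |     1   0   0
8:                       0   0 100   |     0   0   0
16:                      0   0 100   |     4   0   0
32:                      0   0 100   |    17   0   0
64:                      0   0 100   |    12   0   0
128:                     0   0 100   |    24   0   0
256:                     0   0 100   | 23142  99 100
"""


def parse_histogram(block):
    """Map each bucket label to a (read count, write count) tuple."""
    rows = {}
    for line in block.splitlines():
        label, colon, rest = line.partition(":")
        if not colon or "|" not in rest:
            continue  # header, timestamp, or blank line
        read_part, _, write_part = rest.partition("|")
        # The first number on each side of the bar is the raw count.
        rows[label.strip()] = (int(read_part.split()[0]), int(write_part.split()[0]))
    return rows


rows = parse_histogram(SAMPLE)
total_writes = sum(write for _, write in rows.values())
full_size = rows["256"][1]
print(f"{full_size} of {total_writes} write RPCs carried 256 pages")
```
</programlisting>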
1518       <para>For each Lustre service, the following information is provided:</para>
1519       <itemizedlist>
1520         <listitem>
1521           <para>Number of requests</para>
1522         </listitem>
1523         <listitem>
1524           <para>Request wait time (avg, min, max and std dev)</para>
1525         </listitem>
1526         <listitem>
1527           <para>Service idle time (% of elapsed time)</para>
1528         </listitem>
1529       </itemizedlist>
1530       <para>Additionally, data on each Lustre service is provided by service type:</para>
1531       <itemizedlist>
1532         <listitem>
1533           <para>Number of requests of this type</para>
1534         </listitem>
1535         <listitem>
1536           <para>Request service time (avg, min, max and std dev)</para>
1537         </listitem>
1538       </itemizedlist>
1539     </section>
1540     <section remap="h3">
1541       <title><indexterm>
1542           <primary>proc</primary>
1543           <secondary>readahead</secondary>
1544         </indexterm>Using File Readahead and Directory Statahead</title>
      <para>Lustre 1.6.5.1 introduced file readahead and directory statahead functionality that
        reads data into memory in anticipation of a process actually requesting the data. File
        readahead reads file content data into memory. Directory statahead reads metadata into
        memory. When readahead and/or statahead work well, a data-consuming process finds that the
        information it needs is already available when requested, so it does not have to wait for
        network I/O.</para>
      <para>Since Lustre 2.2.0, the directory statahead feature has been improved to enhance
        directory traversal performance. The improvements address two main issues:</para>
      <orderedlist>
        <listitem>
          <para>A race condition between the statahead thread and other VFS operations while
            processing asynchronous getattr RPC replies.</para>
        </listitem>
        <listitem>
          <para>File size/block attributes were not pre-fetched, so the traversing thread had to
            send synchronous glimpse size RPCs to the OST(s).</para>
        </listitem>
      </orderedlist>
      <para>The first issue is resolved by using a statahead local dcache, and the second is
        resolved by using asynchronous glimpse lock (AGL) RPCs to pre-fetch file size/block
        attributes from the OST(s).</para>
1567       <section remap="h4">
1568         <title>Tuning File Readahead</title>
1569         <para>File readahead is triggered when two or more sequential reads by an application fail
1570           to be satisfied by the Linux buffer cache. The size of the initial readahead is 1 MB.
1571           Additional readaheads grow linearly, and increment until the readahead cache on the client
1572           is full at 40 MB.</para>
1573         <para><literal> llite.<replaceable>fsname-instance</replaceable>.max_read_ahead_mb
1574           </literal></para>
        <para>This tunable controls the maximum amount of data read ahead on a file. Files are read
          ahead in RPC-sized chunks (1 MB or the size of the <literal>read()</literal> call, if
          larger) after the second sequential read on a file descriptor. Random reads are done at
          the size of the <literal>read()</literal> call only (no readahead). Reads to
          non-contiguous regions of the file reset the readahead algorithm, and readahead is not
          triggered again until sequential reads resume. To disable readahead, set this tunable to
          0. The default value is 40 MB.</para>
1581         <para><literal> llite.<replaceable>fsname-instance</replaceable>.max_read_ahead_whole_mb
1582           </literal></para>
1583         <para>This tunable controls the maximum size of a file that is read in its entirety,
1584           regardless of the size of the <literal>read()</literal>.</para>
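        <para>The behavior described for <literal>max_read_ahead_mb</literal> can be pictured with a
          small Python model. This is only a sketch: the class and method names are hypothetical,
          and growing the window by one chunk per sequential read is an assumed reading of
          &quot;grow linearly&quot;. It is not the Lustre client implementation.</para>
        <programlisting language="python">
```python
MB = 1024 * 1024


class ReadaheadModel:
    """Toy model of a per-file-descriptor readahead window."""

    def __init__(self, max_read_ahead_mb=40, rpc_size=MB):
        self.max_bytes = max_read_ahead_mb * MB
        self.rpc_size = rpc_size
        self.next_offset = None      # offset a sequential read would start at
        self.sequential_reads = 0    # length of the current sequential run
        self.window = 0              # current readahead window in bytes

    def read(self, offset, size):
        """Record a read() call and return the readahead window in bytes."""
        if offset == self.next_offset:
            self.sequential_reads += 1
        else:
            # First read, or a read to a non-contiguous region: reset.
            self.sequential_reads = 1
            self.window = 0
        self.next_offset = offset + size
        if self.max_bytes == 0 or self.sequential_reads == 1:
            return 0  # readahead disabled, or no second sequential read yet
        # Chunks are RPC-sized, or the size of the read() call if larger.
        chunk = max(self.rpc_size, size)
        # Assumption: the window grows by one chunk per sequential read.
        self.window = min(self.window + chunk, self.max_bytes)
        return self.window


model = ReadaheadModel()
for offset in (0, 4096, 8192, 50 * MB, 50 * MB + 4096):
    print(offset, model.read(offset, 4096) // MB)
```
</programlisting>
        <para>In this model the window is 0 MB for the first read, 1 MB and then 2 MB for the next
          two sequential reads, drops back to 0 MB after the jump to a non-contiguous offset, and
          restarts at 1 MB once reads are sequential again.</para>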
1585       </section>
1586       <section remap="h4">
1587         <title>Tuning Directory Statahead and AGL</title>
        <para>Many system commands, such as <literal>ls -l</literal>, <literal>du</literal>, and
            <literal>find</literal>, traverse a directory sequentially. To make these commands run
          efficiently, directory statahead and AGL (asynchronous glimpse lock) can be enabled to
          improve traversal performance.</para>
1592         <para><literal> /proc/fs/lustre/llite/*/statahead_max </literal></para>
        <para>This proc interface controls whether directory statahead is enabled and sets the
          maximum statahead window size (that is, how many files can be pre-fetched by the
          statahead thread). By default, statahead is enabled and the value of
            <literal>statahead_max</literal> is 32.</para>
1597         <para>To disable statahead, run:</para>
1598         <screen>lctl set_param llite.*.statahead_max=0</screen>
        <para>To set the maximum statahead window size (n), run:</para>
        <screen>lctl set_param llite.*.statahead_max=n</screen>
        <para>The maximum value of n is 8192.</para>
        <para>AGL can be controlled as follows:</para>
        <screen>lctl set_param llite.*.statahead_agl=n</screen>
        <para>If n is 0, AGL is disabled; otherwise, AGL is enabled.</para>
1605         <para><literal> /proc/fs/lustre/llite/*/statahead_stats </literal></para>
1606         <para>This is a read-only interface that indicates the current statahead and AGL
1607           status.</para>
1608         <note>
          <para>AGL is affected by statahead because the inodes processed by AGL are built by the
            statahead thread; that is, the statahead thread provides the input to the AGL pipeline.
            If statahead is disabled, AGL is therefore disabled as well.</para>
1612         </note>
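        <para>The interaction between the two tunables can be summarized in a few lines of Python.
          The sketch below is illustrative only: the function name is hypothetical, and the default
          of 1 for <literal>statahead_agl</literal> is an assumption, since this section does not
          state the AGL default.</para>
        <programlisting language="python">
```python
STATAHEAD_MAX_LIMIT = 8192  # maximum value of statahead_max stated above


def statahead_state(statahead_max=32, statahead_agl=1):
    """Return the effective statahead and AGL state for the given settings."""
    if statahead_max not in range(0, STATAHEAD_MAX_LIMIT + 1):
        raise ValueError("statahead_max must be between 0 and 8192")
    statahead_enabled = statahead_max > 0
    # AGL consumes inodes built by the statahead thread, so it cannot run
    # when statahead itself is disabled.
    agl_enabled = statahead_enabled and statahead_agl != 0
    return {
        "statahead": statahead_enabled,
        "window": statahead_max,
        "agl": agl_enabled,
    }


print(statahead_state())                  # statahead and AGL both active
print(statahead_state(statahead_max=0))   # disabling statahead disables AGL
print(statahead_state(statahead_agl=0))   # statahead without AGL
```
</programlisting>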
1613       </section>
1614     </section>
1615     <section remap="h3">
1616       <title><indexterm>
1617           <primary>proc</primary>
1618           <secondary>read cache</secondary>
1619         </indexterm>OSS Read Cache</title>
1620       <para>The OSS read cache feature provides read-only caching of data on an OSS. This
1621         functionality uses the regular Linux page cache to store the data. Just like caching from a
1622         regular filesystem in Linux, OSS read cache uses as much physical memory as is
1623         allocated.</para>
1624       <para>OSS read cache improves Lustre performance in these situations:</para>
1625       <itemizedlist>
1626         <listitem>
1627           <para>Many clients are accessing the same data set (as in HPC applications and when
1628             diskless clients boot from Lustre)</para>
1629         </listitem>
1630         <listitem>
1631           <para>One client is storing data while another client is reading it (essentially
1632             exchanging data via the OST)</para>
1633         </listitem>
1634         <listitem>
1635           <para>A client has very limited caching of its own</para>
1636         </listitem>
1637       </itemizedlist>
1638       <para>OSS read cache offers these benefits:</para>
1639       <itemizedlist>
1640         <listitem>
1641           <para>Allows OSTs to cache read data more frequently</para>
1642         </listitem>
1643         <listitem>
1644           <para>Improves repeated reads to match network speeds instead of disk speeds</para>
1645         </listitem>
1646         <listitem>
1647           <para>Provides the building blocks for OST write cache (small-write aggregation)</para>
1648         </listitem>
1649       </itemizedlist>
1650       <section remap="h4">
1651         <title>Using OSS Read Cache</title>
1652         <para>OSS read cache is implemented on the OSS, and does not require any special support on
1653           the client side. Since OSS read cache uses the memory available in the Linux page cache,
1654           you should use I/O patterns to determine the appropriate amount of memory for the cache;
1655           if the data is mostly reads, then more cache is required than for writes.</para>
        <para>OSS read cache is enabled by default and is managed by the following tunables:</para>
1657         <itemizedlist>
1658           <listitem>
1659             <para><literal>read_cache_enable</literal> controls whether data read from disk during a
1660               read request is kept in memory and available for later read requests for the same
1661               data, without having to re-read it from disk. By default, read cache is enabled
1662                 (<literal>read_cache_enable = 1</literal>).</para>
1663           </listitem>
1664         </itemizedlist>
        <para>When the OSS receives a read request from a client, it reads data from disk into its
          memory and sends the data as a reply to the request. If read cache is enabled, this data
          stays in memory after the client&apos;s request is finished, and the OSS skips reading
          data from disk when subsequent read requests for the same data are received. The read
          cache is managed by the Linux kernel globally across all OSTs on that OSS, and the least
          recently used cache pages are dropped from memory when the amount of free memory is
          running low.</para>
1672         <para>If read cache is disabled (<literal>read_cache_enable = 0</literal>), then the OSS
1673           will discard the data after the client&apos;s read requests are serviced and, for
1674           subsequent read requests, the OSS must read the data from disk.</para>
1675         <para>To disable read cache on all OSTs of an OSS, run:</para>
1676         <screen>root@oss1# lctl set_param obdfilter.*.read_cache_enable=0</screen>
1677         <para>To re-enable read cache on one OST, run:</para>
1678         <screen>root@oss1# lctl set_param obdfilter.{OST_name}.read_cache_enable=1</screen>
1679         <para>To check if read cache is enabled on all OSTs on an OSS, run:</para>
1680         <screen>root@oss1# lctl get_param obdfilter.*.read_cache_enable</screen>
1681         <itemizedlist>
1682           <listitem>
1683             <para><literal>writethrough_cache_enable</literal> controls whether data sent to the OSS
1684               as a write request is kept in the read cache and available for later reads, or if it
1685               is discarded from cache when the write is completed. By default, writethrough cache is
1686               enabled (<literal>writethrough_cache_enable = 1</literal>).</para>
1687           </listitem>
1688         </itemizedlist>
1689         <para>When the OSS receives write requests from a client, it receives data from the client
1690           into its memory and writes the data to disk. If writethrough cache is enabled, this data
1691           stays in memory after the write request is completed, allowing the OSS to skip reading
1692           this data from disk if a later read request, or partial-page write request, for the same
1693           data is received.</para>
        <para>If writethrough cache is disabled (<literal>writethrough_cache_enable = 0</literal>),
          then the OSS discards the data after the client&apos;s write request is completed, and
          for subsequent read requests, or partial-page write requests, the OSS must re-read the
          data from disk.</para>
1698         <para>Enabling writethrough cache is advisable if clients are doing small or unaligned
1699           writes that would cause partial-page updates, or if the files written by one node are
1700           immediately being accessed by other nodes. Some examples where this might be useful
1701           include producer-consumer I/O models or shared-file writes with a different node doing I/O
1702           not aligned on 4096-byte boundaries. Disabling writethrough cache is advisable in the case
1703           where files are mostly written to the file system but are not re-read within a short time
1704           period, or files are only written and re-read by the same node, regardless of whether the
1705           I/O is aligned or not.</para>
1706         <para>To disable writethrough cache on all OSTs of an OSS, run:</para>
1707         <screen>root@oss1# lctl set_param obdfilter.*.writethrough_cache_enable=0</screen>
1708         <para>To re-enable writethrough cache on one OST, run:</para>
1709         <screen>root@oss1# lctl set_param obdfilter.{OST_name}.writethrough_cache_enable=1</screen>
        <para>To check if writethrough cache is enabled on all OSTs of an OSS, run:</para>
        <screen>root@oss1# lctl get_param obdfilter.*.writethrough_cache_enable</screen>
1712         <itemizedlist>
1713           <listitem>
1714             <para><literal>readcache_max_filesize</literal> controls the maximum size of a file that
1715               both the read cache and writethrough cache will try to keep in memory. Files larger
1716               than <literal>readcache_max_filesize</literal> will not be kept in cache for either
1717               reads or writes.</para>
1718           </listitem>
1719         </itemizedlist>
1720         <para>This can be very useful for workloads where relatively small files are repeatedly
1721           accessed by many clients, such as job startup files, executables, log files, etc., but
1722           large files are read or written only once. By not putting the larger files into the cache,
1723           it is much more likely that more of the smaller files will remain in cache for a longer
1724           time.</para>
1725         <para>When setting <literal>readcache_max_filesize</literal>, the input value can be
1726           specified in bytes, or can have a suffix to indicate other binary units such as <emphasis
1727             role="bold">K</emphasis>ilobytes, <emphasis role="bold">M</emphasis>egabytes, <emphasis
1728             role="bold">G</emphasis>igabytes, <emphasis role="bold">T</emphasis>erabytes, or
1729             <emphasis role="bold">P</emphasis>etabytes.</para>
        <para>To limit the maximum cached file size to 32 MB on all OSTs of an OSS, run:</para>
1731         <screen>root@oss1# lctl set_param obdfilter.*.readcache_max_filesize=32M</screen>
        <para>To disable the maximum cached file size limit on an OST, run:</para>
1733         <screen>root@oss1# lctl set_param obdfilter.{OST_name}.readcache_max_filesize=-1</screen>
1734         <para>To check the current maximum cached file size on all OSTs of an OSS, run:</para>
1735         <screen>root@oss1# lctl get_param obdfilter.*.readcache_max_filesize</screen>
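        <para>Taken together, the three tunables decide whether data stays in the OSS cache after a
          request completes. The Python sketch below models that decision as described in this
          section. The function names are hypothetical, the two cache switches are treated as
          independent, and -1 is treated as &quot;no size limit&quot;; these are assumptions for
          illustration rather than a description of the OSS code.</para>
        <programlisting language="python">
```python
# Binary unit suffixes accepted when setting readcache_max_filesize.
UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40, "P": 2**50}


def parse_size(value):
    """Convert a value such as '32M' or '4096' to a number of bytes."""
    text = str(value).strip().upper()
    if text[-1:] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def stays_in_cache(operation, file_size, read_cache_enable=1,
                   writethrough_cache_enable=1, readcache_max_filesize=-1):
    """Return True if data from a 'read' or 'write' request is kept in cache."""
    enabled = {"read": read_cache_enable,
               "write": writethrough_cache_enable}[operation]
    if not enabled:
        return False
    if readcache_max_filesize == -1:
        return True  # assumption: -1 means the size limit is disabled
    # Files larger than the limit are not kept for either reads or writes.
    return not file_size > readcache_max_filesize


limit = parse_size("32M")
print(stays_in_cache("read", 4096, readcache_max_filesize=limit))
print(stays_in_cache("write", 64 * 2**20, readcache_max_filesize=limit))
```
</programlisting>
        <para>With a 32 MB limit, the 4 KB file in this example remains cached after a read, while
          data written to the 64 MB file is not kept.</para>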
1736       </section>
1737     </section>
1738     <section remap="h3">
1739       <title><indexterm>
1740           <primary>proc</primary>
1741           <secondary>OSS journal</secondary>
1742         </indexterm>OSS Asynchronous Journal Commit</title>
1743       <para>The OSS asynchronous journal commit feature synchronously writes data to disk without
1744         forcing a journal flush. This reduces the number of seeks and significantly improves
1745         performance on some hardware.</para>
1746       <note>
        <para>Asynchronous journal commit does not work with O_DIRECT writes; for these writes, a
          journal flush is still forced.</para>
1749       </note>
1750       <para>When asynchronous journal commit is enabled, client nodes keep data in the page cache (a
1751         page reference). Lustre clients monitor the last committed transaction number (transno) in
1752         messages sent from the OSS to the clients. When a client sees that the last committed
1753         transno reported by the OSS is at least the bulk write transno, it releases the reference on
1754         the corresponding pages. To avoid page references being held for too long on clients after a
1755         bulk write, a 7 second ping request is scheduled (jbd commit time is 5 seconds) after the
1756         bulk write reply is received, so the OSS has an opportunity to report the last committed
1757         transno.</para>
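      <para>The client-side bookkeeping described above can be modeled with a small Python
        sketch. It is a toy model under stated assumptions, not Lustre code:
        <literal>PinnedPages</literal> and its methods are hypothetical names, and pages are
        represented by plain strings.</para>
      <screen><![CDATA[
```python
# Sketch only: a toy model of the client-side bookkeeping described above.
# PinnedPages and its methods are hypothetical names, not Lustre code.
class PinnedPages:
    def __init__(self):
        self.pending = {}  # bulk write transno -> pages still referenced

    def bulk_write_replied(self, transno, pages):
        """Remember the pages pinned by a bulk write until it commits."""
        self.pending[transno] = list(pages)

    def last_committed_seen(self, last_committed):
        """Release pages of every write whose transno is covered by last_committed."""
        released = []
        for transno in sorted(self.pending):
            if last_committed >= transno:
                released.extend(self.pending.pop(transno))
        return released


client = PinnedPages()
client.bulk_write_replied(10, ["page-a", "page-b"])
client.bulk_write_replied(12, ["page-c"])
print(client.last_committed_seen(11))  # ['page-a', 'page-b']
```
]]></screen>
      <para>In this model, the pages of the write with transno 12 stay referenced until a later
        message reports a last committed transno of at least 12.</para>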
      <para>If the OSS crashes before the journal commit occurs, then the intermediate data is lost.
        However, the OSS recovery functionality introduced with the asynchronous journal commit
        feature causes clients to replay their write requests and compensate for the missing disk
        updates by restoring the state of the file system.</para>
      <para>To enable asynchronous journal commit, set the <literal>sync_journal</literal> parameter
        to zero (<literal>sync_journal=0</literal>):</para>
1764       <screen>$ lctl set_param obdfilter.*.sync_journal=0
1765 obdfilter.lol-OST0001.sync_journal=0</screen>
      <para>By default, asynchronous journal commit is disabled
          (<literal>sync_journal=1</literal>), which forces a journal flush after every bulk
        write.</para>
1769       <para>When asynchronous journal commit is used, clients keep a page reference until the
1770         journal transaction commits. This can cause problems when a client receives a blocking
1771         callback, because pages need to be removed from the page cache, but they cannot be removed
1772         because of the extra page reference.</para>
1773       <para>This problem is solved by forcing a journal flush on lock cancellation. When this
1774         happens, the client is granted the metadata blocks that have hit the disk, and it can safely
1775         release the page reference before processing the blocking callback. The parameter which
1776         controls this action is <literal>sync_on_lock_cancel</literal>, which can be set to the
1777         following values:</para>
1778       <itemizedlist>
1779         <listitem>
1780           <para><literal>always</literal>: Always force a journal flush on lock cancellation</para>
1781         </listitem>
1782         <listitem>
1783           <para><literal>blocking</literal>: Force a journal flush only when the local cancellation
1784             is due to a blocking callback</para>
1785         </listitem>
1786         <listitem>
1787           <para><literal>never</literal>: Do not force any journal flush</para>
1788         </listitem>
1789       </itemizedlist>
      <para>The following example shows <literal>sync_on_lock_cancel</literal> set so that no
        journal flush is forced:</para>
1792       <screen>$ lctl get_param obdfilter.*.sync_on_lock_cancel
1793 obdfilter.lol-OST0001.sync_on_lock_cancel=never</screen>
      <para>By default, <literal>sync_on_lock_cancel</literal> is set to
        <literal>never</literal>, because asynchronous journal commit is disabled by default.</para>
      <para>When asynchronous journal commit is enabled (<literal>sync_journal=0</literal>),
          <literal>sync_on_lock_cancel</literal> is automatically set to
        <literal>always</literal> if it was previously set to <literal>never</literal>.</para>
      <para>Similarly, when asynchronous journal commit is disabled
          (<literal>sync_journal=1</literal>), <literal>sync_on_lock_cancel</literal> is forced to
        <literal>never</literal>.</para>
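      <para>The coupling between the two parameters can be summarized as a small decision
        function. This is a sketch of the rules stated above, not the OSS implementation;
        <literal>apply_sync_journal</literal> is a hypothetical name.</para>
      <screen><![CDATA[
```python
# Sketch only: models the documented coupling between the two parameters.
# apply_sync_journal is a hypothetical helper, not an lctl or Lustre API.
def apply_sync_journal(sync_journal, sync_on_lock_cancel):
    """Return the sync_on_lock_cancel value in effect after setting sync_journal."""
    if sync_journal == 0:
        # Asynchronous commit enabled: "never" is upgraded to "always".
        return "always" if sync_on_lock_cancel == "never" else sync_on_lock_cancel
    # Asynchronous commit disabled: the value is forced to "never".
    return "never"


print(apply_sync_journal(0, "never"))     # always
print(apply_sync_journal(0, "blocking"))  # blocking
print(apply_sync_journal(1, "always"))    # never
```
]]></screen>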
1802     </section>
1803     <section remap="h3">
1804       <title><indexterm>
1805           <primary>proc</primary>
1806           <secondary>mballoc history</secondary>
1807         </indexterm><literal>mballoc</literal> History</title>
1808       <para><literal> /proc/fs/ldiskfs/sda/mb_history </literal></para>
      <para>Multi-Block-Allocate (<literal>mballoc</literal>) enables Lustre to ask
          <literal>ldiskfs</literal> to allocate multiple blocks with a single request to the block
        allocator. Typically, an <literal>ldiskfs</literal> file system allocates only one block at
        a time. Each <literal>mballoc</literal>-enabled partition has this file. This is sample
        output:</para>
1814       <screen>pid  inode  goal       result      found grps cr   merge tail broken
1815 2838 139267 17/12288/1 17/12288/1  1     0    0    M     1    8192
1816 2838 139267 17/12289/1 17/12289/1  1     0    0    M     0    0
1817 2838 139267 17/12290/1 17/12290/1  1     0    0    M     1    2
1818 2838 24577  3/12288/1  3/12288/1   1     0    0    M     1    8192
1819 2838 24578  3/12288/1  3/771/1     1     1    1          0    0
1820 2838 32769  4/12288/1  4/12288/1   1     0    0    M     1    8192
1821 2838 32770  4/12288/1  4/12289/1   13    1    1          0    0
1822 2838 32771  4/12288/1  5/771/1     26    2    1          0    0
1823 2838 32772  4/12288/1  5/896/1     31    2    1          1    128
1824 2838 32773  4/12288/1  5/897/1     31    2    1          0    0
1825 2828 32774  4/12288/1  5/898/1     31    2    1          1    2
1826 2838 32775  4/12288/1  5/899/1     31    2    1          0    0
1827 2838 32776  4/12288/1  5/900/1     31    2    1          1    4
1828 2838 32777  4/12288/1  5/901/1     31    2    1          0    0
1829 2838 32778  4/12288/1  5/902/1     31    2    1          1    2</screen>
1830       <para>The parameters are described below:</para>
1831       <informaltable frame="all">
1832         <tgroup cols="2">
1833           <colspec colname="c1" colwidth="50*"/>
1834           <colspec colname="c2" colwidth="50*"/>
1835           <thead>
1836             <row>
1837               <entry>
1838                 <para><emphasis role="bold">Parameter</emphasis></para>
1839               </entry>
1840               <entry>
1841                 <para><emphasis role="bold">Description</emphasis></para>
1842               </entry>
1843             </row>
1844           </thead>
1845           <tbody>
1846             <row>
1847               <entry>
1848                 <para>
1849                   <emphasis role="bold">
1850                     <literal>pid</literal>
1851                   </emphasis></para>
1852               </entry>
1853               <entry>
1854                 <para>Process that made the allocation.</para>
1855               </entry>
1856             </row>
1857             <row>
1858               <entry>
1859                 <para>
1860                   <emphasis role="bold">
1861                     <literal>inode</literal>
1862                   </emphasis></para>
1863               </entry>
1864               <entry>
                <para>Inode for which the blocks were allocated.</para>
1866               </entry>
1867             </row>
1868             <row>
1869               <entry>
1870                 <para>
1871                   <emphasis role="bold">
1872                     <literal>goal</literal>
1873                   </emphasis></para>
1874               </entry>
1875               <entry>
1876                 <para>Initial request that came to <literal>mballoc</literal>
1877                   (group/block-in-group/number-of-blocks)</para>
1878               </entry>
1879             </row>
1880             <row>
1881               <entry>
1882                 <para>
1883                   <emphasis role="bold">
1884                     <literal>result</literal>
1885                   </emphasis></para>
1886               </entry>
1887               <entry>
1888                 <para>What <literal>mballoc</literal> actually found for this request.</para>
1889               </entry>
1890             </row>
1891             <row>
1892               <entry>
1893                 <para>
1894                   <emphasis role="bold">
1895                     <literal>found</literal>
1896                   </emphasis></para>
1897               </entry>
1898               <entry>
1899                 <para>Number of free chunks <literal>mballoc</literal> found and measured before the
1900                   final decision.</para>
1901               </entry>
1902             </row>
1903             <row>
1904               <entry>
1905                 <para>
1906                   <emphasis role="bold">
1907                     <literal>grps</literal>
1908                   </emphasis></para>
1909               </entry>
1910               <entry>
1911                 <para>Number of groups <literal>mballoc</literal> scanned to satisfy the
1912                   request.</para>
1913               </entry>
1914             </row>
1915             <row>
1916               <entry>
1917                 <para>
1918                   <emphasis role="bold">
1919                     <literal>cr</literal>
1920                   </emphasis></para>
1921               </entry>
1922               <entry>
1923                 <para>Stage at which <literal>mballoc</literal> found the result:</para>
1924                 <para><emphasis role="bold">0</emphasis> - best in terms of resource allocation. The
1925                   request was 1MB or larger and was satisfied directly via the kernel buddy
1926                   allocator.</para>
1927                 <para><emphasis role="bold">1</emphasis> - regular stage (good at resource
1928                   consumption)</para>
1929                 <para><emphasis role="bold">2</emphasis> - fs is quite fragmented (not that bad at
1930                   resource consumption)</para>
1931                 <para><emphasis role="bold">3</emphasis> - fs is very fragmented (worst at resource
1932                   consumption)</para>
1933               </entry>
1934             </row>
1935             <row>
1936               <entry>
1937                 <para>
1938                   <emphasis role="bold">
1939                     <literal>queue</literal>
1940                   </emphasis></para>
1941               </entry>
1942               <entry>
1943                 <para>Total bytes in active/queued sends.</para>
1944               </entry>
1945             </row>
1946             <row>
1947               <entry>
1948                 <para>
1949                   <emphasis role="bold">
1950                     <literal>merge</literal>
1951                   </emphasis></para>
1952               </entry>
1953               <entry>
                <para>Whether the request hit the goal. A hit is good because the extents code can
                  then merge the new blocks into an existing extent, eliminating the need for
                  extents tree growth.</para>
1957               </entry>
1958             </row>
1959             <row>
1960               <entry>
1961                 <para>
1962                   <emphasis role="bold">
1963                     <literal>tail</literal>
1964                   </emphasis></para>
1965               </entry>
1966               <entry>
1967                 <para>Number of blocks left free after the allocation breaks large free
1968                   chunks.</para>
1969               </entry>
1970             </row>
1971             <row>
1972               <entry>
1973                 <para>
1974                   <emphasis role="bold">
1975                     <literal>broken</literal>
1976                   </emphasis></para>
1977               </entry>
1978               <entry>
1979                 <para>How large the broken chunk was.</para>
1980               </entry>
1981             </row>
1982           </tbody>
1983         </tgroup>
1984       </informaltable>
      <para>Most users are probably interested in <literal>found</literal> and
        <literal>cr</literal>. If <literal>cr</literal> is 0 or 1 and <literal>found</literal> is
        less than 100, then <literal>mballoc</literal> is doing quite well.</para>
      <para>Also, the number of blocks in the request (the third number in the goal triple) shows
        how many blocks the <literal>obdfilter</literal> requested. If the
        <literal>obdfilter</literal> is making many small requests (just a few blocks), then either
        the client is performing I/O to many small files, or something may be wrong with the client
        (it is better if the client sends large I/O requests). This can be investigated with the
        OSC <literal>rpc_stats</literal> or OST <literal>brw_stats</literal> mentioned above.</para>
      <para>The number of groups scanned (<literal>grps</literal> column) should be small. If it
        often reaches a few dozen, then either the disk file system is quite fragmented or
          <literal>mballoc</literal> is doing something wrong in the group selection part.</para>
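      <para>The rule of thumb above can be applied mechanically to captured
        <literal>mb_history</literal> output. The following Python sketch is one possible way to
        do so; <literal>parse_mb_history</literal> and <literal>looks_healthy</literal> are
        hypothetical helpers, and the parser assumes, as in the sample output, that the
        <literal>merge</literal> column is either <literal>M</literal> or blank.</para>
      <screen><![CDATA[
```python
# Sketch only: parse_mb_history and looks_healthy are hypothetical helpers.
def parse_mb_history(text):
    """Parse mb_history output into a list of dicts, one per allocation."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "pid":
            continue  # skip blank lines and the header
        # Assumption: the merge column is "M" or blank, so a row has
        # either 10 fields (with "M") or 9 fields (without).
        records.append({
            "pid": int(fields[0]),
            "inode": int(fields[1]),
            "goal": fields[2],
            "result": fields[3],
            "found": int(fields[4]),
            "grps": int(fields[5]),
            "cr": int(fields[6]),
            "merge": len(fields) == 10,
            "tail": int(fields[-2]),
            "broken": int(fields[-1]),
        })
    return records


def looks_healthy(record):
    """Apply the rule of thumb: cr is 0 or 1 and found is below 100."""
    return record["cr"] in (0, 1) and record["found"] < 100


sample = """pid  inode  goal       result      found grps cr   merge tail broken
2838 139267 17/12288/1 17/12288/1  1     0    0    M     1    8192
2838 32771  4/12288/1  5/771/1     26    2    1          0    0
"""
for rec in parse_mb_history(sample):
    print(rec["inode"], rec["grps"], looks_healthy(rec))
```
]]></screen>
      <para>The same records can also be scanned for large <literal>grps</literal> values to spot
        the fragmentation symptom described above.</para>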
1996     </section>
1997     <section remap="h3">
1998       <title><indexterm>
1999           <primary>proc</primary>
2000           <secondary>mballoc tunables</secondary>
2001         </indexterm><literal>mballoc</literal> Tunables</title>
      <para>Lustre <literal>ldiskfs</literal> includes a multi-block allocator to improve the
        efficiency of space allocation in the OST storage. Multi-block allocation adds the
        following features:</para>
2005       <itemizedlist>
2006         <listitem>
2007           <para> Pre-allocation for single files (helps to resist fragmentation)</para>
2008         </listitem>
2009         <listitem>
2010           <para> Pre-allocation for a group of files (helps to pack small files into large,
2011             contiguous chunks)</para>
2012         </listitem>
2013         <listitem>
2014           <para> Stream allocation (helps to decrease the seek rate)</para>
2015         </listitem>
2016       </itemizedlist>
2017       <para>The following <literal>mballoc</literal> tunables are available:</para>
2018       <informaltable frame="all">
2019         <tgroup cols="2">
2020           <colspec colname="c1" colwidth="50*"/>
2021           <colspec colname="c2" colwidth="50*"/>
2022           <thead>
2023             <row>
2024               <entry>
2025                 <para><emphasis role="bold">Field</emphasis></para>
2026               </entry>
2027               <entry>
2028                 <para><emphasis role="bold">Description</emphasis></para>
2029               </entry>
2030             </row>
2031           </thead>
2032           <tbody>
2033             <row>
2034               <entry>
2035                 <para>
2036                   <literal>mb_max_to_scan</literal></para>
2037               </entry>
2038               <entry>
                <para>Maximum number of free chunks that <literal>mballoc</literal> examines before
                  making a final decision. This limit avoids livelock.</para>
2041               </entry>
2042             </row>
2043             <row>
2044               <entry>
2045                 <para>
2046                   <literal>mb_min_to_scan</literal></para>
2047               </entry>
2048               <entry>
2049                 <para>Minimum number of free chunks that <literal>mballoc</literal> searches before
2050                   picking the best chunk for allocation. This is useful for a very small request, to
2051                   resist fragmentation of big free chunks.</para>
2052               </entry>
2053             </row>
2054             <row>
2055               <entry>
2056                 <para>
2057                   <literal>mb_order2_req</literal></para>
2058               </entry>
2059               <entry>
                <para>For requests equal to 2^N (where N &gt;= <literal>mb_order2_req</literal>), a
                  very fast search via buddy structures is used.</para>
2062               </entry>
2063             </row>
2064             <row>
2065               <entry>
2066                 <para>
2067                   <literal>mb_small_req</literal></para>
2068               </entry>
2069               <entry morerows="1">
                <para>All requests are divided into 3 categories:</para>
                <para>&lt; <literal>mb_small_req</literal> (packed together to form large,
                  aggregated requests)</para>
                <para>&lt; <literal>mb_large_req</literal> (allocated mostly linearly)</para>
                <para>&gt; <literal>mb_large_req</literal> (very large requests, so the arm seek
                  does not matter)</para>
                <para>The idea is to pack small requests into large requests, and then place all
                  large requests (including those compounded from small ones) close to one
                  another, causing as few arm seeks as possible.</para>
2077               </entry>
2078             </row>
2079             <row>
2080               <entry>
2081                 <para>
2082                   <literal>mb_large_req</literal></para>
2083               </entry>
2084             </row>
2085             <row>
2086               <entry>
2087                 <para>
2088                   <literal>mb_prealloc_table</literal></para>
2089               </entry>
2090               <entry>
2091                 <para>The amount of space to preallocate depends on the current file size. The idea
2092                   is that for small files we do not need 1 MB preallocations and for large files, 1
2093                   MB preallocations are not large enough; it is better to preallocate 4 MB.</para>
2094               </entry>
2095             </row>
2096             <row>
2097               <entry>
2098                 <para>
2099                   <literal>mb_group_prealloc</literal></para>
2100               </entry>
2101               <entry>
2102                 <para>The amount of space (in kilobytes) preallocated for groups of small
2103                   requests.</para>
2104               </entry>
2105             </row>
2106           </tbody>
2107         </tgroup>
2108       </informaltable>
2109     </section>
2110     <section remap="h3">
2111       <title><indexterm>
2112           <primary>proc</primary>
2113           <secondary>locking</secondary>
2114         </indexterm>Locking</title>
2115       <para><literal> ldlm.namespaces.<replaceable>osc_name|mdc_name</replaceable>.lru_size
2116         </literal></para>
2117       <para>The <literal>lru_size</literal> parameter is used to control the number of client-side
2118         locks in an LRU queue. LRU size is dynamic, based on load. This optimizes the number of
2119         locks available to nodes that have different workloads (e.g., login/build nodes vs. compute
2120         nodes vs. backup nodes).</para>
2121       <para>The total number of locks available is a function of the server&apos;s RAM. The default
2122         limit is 50 locks/1 MB of RAM. If there is too much memory pressure, then the LRU size is
2123         shrunk. The number of locks on the server is limited to
2124           <replaceable>targets_on_server</replaceable> * <replaceable>client_count</replaceable> *
2125           <replaceable>client_lru_size</replaceable>.</para>
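      <para>As a worked example of this arithmetic, the Python sketch below computes both
        quantities. The helper names and the sample numbers (16 GB of RAM, 4 targets, 100
        clients, an LRU size of 800) are illustrative assumptions only.</para>
      <screen><![CDATA[
```python
# Sketch only: the helper names and sample numbers are illustrative.
LOCKS_PER_MB = 50  # default limit stated above: 50 locks per 1 MB of RAM


def default_lock_limit(server_ram_mb):
    """Default lock limit derived from the server's RAM in MB."""
    return LOCKS_PER_MB * server_ram_mb


def max_server_locks(targets_on_server, client_count, client_lru_size):
    """Upper bound: targets_on_server * client_count * client_lru_size."""
    return targets_on_server * client_count * client_lru_size


print(default_lock_limit(16384))      # 819200
print(max_server_locks(4, 100, 800))  # 320000
```
]]></screen>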
2126       <itemizedlist>
2127         <listitem>
2128           <para>To enable automatic LRU sizing, set the <literal>lru_size</literal> parameter to 0.
2129             In this case, the <literal>lru_size</literal> parameter shows the current number of
2130             locks being used on the export. LRU sizing is enabled by default starting with Lustre
2131             1.6.5.1.</para>
2132         </listitem>
2133         <listitem>
          <para>To specify a maximum number of locks, set the <literal>lru_size</literal> parameter
            to a value other than 0 (the formerly used value, 100 *
              <replaceable>core_count</replaceable>, is acceptable). We
            recommend that you only increase the LRU size on a few login nodes where users access
            the file system interactively.</para>
2138         </listitem>
2139       </itemizedlist>
2140       <para>To clear the LRU on a single client, and as a result flush client cache, without
2141         changing the <literal>lru_size</literal> value:</para>
2142       <screen>$ lctl set_param ldlm.namespaces.<replaceable>osc_name|mdc_name</replaceable>.lru_size=clear</screen>
      <para>If you shrink the LRU size below the number of existing unused locks, then the unused
        locks are canceled immediately. Writing <literal>clear</literal> to
        <literal>lru_size</literal> cancels all locks without changing the value.</para>
2146       <note>
        <para>Currently, the <literal>lru_size</literal> parameter can only be set temporarily with
            <literal>lctl set_param</literal>; it cannot be set permanently.</para>
2149       </note>
      <para>To disable automatic LRU sizing, run this command on the Lustre clients:</para>
      <screen>$ lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))</screen>
      <para>Replace <literal>NR_CPU</literal> with the number of CPUs on the node.</para>
2153       <para>To determine the number of locks being granted:</para>
2154       <screen>$ lctl get_param ldlm.namespaces.*.pool.limit</screen>
2155     </section>
2156     <section xml:id="dbdoclet.50438271_87260">
2157       <title><indexterm>
2158           <primary>proc</primary>
2159           <secondary>thread counts</secondary>
2160         </indexterm>Setting MDS and OSS Thread Counts</title>
      <para>MDS and OSS thread counts (minimum and maximum) can be set via the
          <literal>{min,max}_thread_count</literal> tunable. For each service, a new
          <literal>/proc/fs/lustre/{service}/*/threads_{min,max,started}</literal> entry is created.
        The tunable, <literal>{service}.threads_{min,max,started}</literal>, can be used to set the
        minimum and maximum thread counts or get the current number of running threads for the
        following services.</para>
2167       <informaltable frame="all">
2168         <tgroup cols="2">
2169           <colspec colname="c1" colwidth="50*"/>
2170           <colspec colname="c2" colwidth="50*"/>
2171           <tbody>
2172             <row>
2173               <entry>
2174                 <para>
2175                   <emphasis role="bold">Service</emphasis></para>
2176               </entry>
2177               <entry>
2178                 <para>
2179                   <emphasis role="bold">Description</emphasis></para>
2180               </entry>
2181             </row>
2182             <row>
2183               <entry>
2184                 <literal> mdt.MDS.mds </literal>
2185               </entry>
2186               <entry>
2187                 <para>normal metadata ops</para>
2188               </entry>
2189             </row>
2190             <row>
2191               <entry>
2192                 <literal> mdt.MDS.mds_readpage </literal>
2193               </entry>
2194               <entry>
2195                 <para>metadata readdir</para>
2196               </entry>
2197             </row>
2198             <row>
2199               <entry>
2200                 <literal> mdt.MDS.mds_setattr </literal>
2201               </entry>
2202               <entry>
2203                 <para>metadata setattr</para>
2204               </entry>
2205             </row>
2206             <row>
2207               <entry>
2208                 <literal> ost.OSS.ost </literal>
2209               </entry>
2210               <entry>
2211                 <para>normal data</para>
2212               </entry>
2213             </row>
2214             <row>
2215               <entry>
2216                 <literal> ost.OSS.ost_io </literal>
2217               </entry>
2218               <entry>
2219                 <para>bulk data IO</para>
2220               </entry>
2221             </row>
2222             <row>
2223               <entry>
2224                 <literal> ost.OSS.ost_create </literal>
2225               </entry>
2226               <entry>
2227                 <para>OST object pre-creation service</para>
2228               </entry>
2229             </row>
2230             <row>
2231               <entry>
2232                 <literal> ldlm.services.ldlm_canceld </literal>
2233               </entry>
2234               <entry>
2235                 <para>DLM lock cancel</para>
2236               </entry>
2237             </row>
2238             <row>
2239               <entry>
2240                 <literal> ldlm.services.ldlm_cbd </literal>
2241               </entry>
2242               <entry>
2243                 <para>DLM lock grant</para>
2244               </entry>
2245             </row>
2246           </tbody>
2247         </tgroup>
2248       </informaltable>
2249       <itemizedlist>
2250         <listitem>
2251           <para>To temporarily set this tunable, run:</para>
          <screen># lctl {get,set}_param {service}.threads_{min,max,started} </screen>
2253         </listitem>
2254       </itemizedlist>
2255       <itemizedlist>
2256         <listitem>
          <para>To permanently set this tunable, run:</para>
          <screen># lctl conf_param {service}.threads_{min,max,started} </screen>
        </listitem>
      </itemizedlist>
      <para>The following examples show how to set thread counts and get the number of running
        threads for the <literal>ost_io</literal> service.</para>
2263       <itemizedlist>
2264         <listitem>
2265           <para>To get the number of running threads, run:</para>
2266           <screen># lctl get_param ost.OSS.ost_io.threads_started</screen>
2267           <para>The command output will be similar to this:</para>
2268           <screen>ost.OSS.ost_io.threads_started=128</screen>
2269         </listitem>
2270       </itemizedlist>
2271       <itemizedlist>
2272         <listitem>
          <para>To get the maximum number of threads (512 in this example), run:</para>
2274           <screen># lctl get_param ost.OSS.ost_io.threads_max</screen>
2275           <para>The command output will be:</para>
2276           <screen>ost.OSS.ost_io.threads_max=512</screen>
2277         </listitem>
2278       </itemizedlist>
2279       <itemizedlist>
2280         <listitem>
          <para>To set the maximum thread count to 256 instead of 512 (for example, to avoid
            overloading the storage array with requests), run:</para>
2283           <screen># lctl set_param ost.OSS.ost_io.threads_max=256</screen>
2284           <para>The command output will be:</para>
2285           <screen>ost.OSS.ost_io.threads_max=256</screen>
2286         </listitem>
2287       </itemizedlist>
2288       <itemizedlist>
2289         <listitem>
2290           <para> To check if the new <literal>threads_max</literal> setting is active, run:</para>
2291           <screen># lctl get_param ost.OSS.ost_io.threads_max</screen>
2292           <para>The command output will be similar to this:</para>
2293           <screen>ost.OSS.ost_io.threads_max=256</screen>
2294         </listitem>
2295       </itemizedlist>
2296       <note>
2297         <para>Currently, the maximum thread count setting is advisory because Lustre does not reduce
2298           the number of service threads in use, even if that number exceeds the
2299             <literal>threads_max</literal> value. Lustre does not stop service threads once they are
2300           started.</para>
2301       </note>
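      <para>Because the maximum is advisory, it can be useful to compare the two values. The
        Python sketch below works on text already captured from <literal>lctl
        get_param</literal>; <literal>parse_params</literal> and <literal>over_limit</literal>
        are hypothetical helpers, and the sample values are made up for illustration.</para>
      <screen><![CDATA[
```python
# Sketch only: parse_params and over_limit are hypothetical helpers that
# work on text already captured from "lctl get_param".
def parse_params(output):
    """Turn 'name=value' lines into a dict of integer values."""
    params = {}
    for line in output.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep:
            params[name] = int(value)
    return params


def over_limit(params, service):
    """True when more threads are running than threads_max allows."""
    return params[service + ".threads_started"] > params[service + ".threads_max"]


captured = "ost.OSS.ost_io.threads_started=512\nost.OSS.ost_io.threads_max=256\n"
print(over_limit(parse_params(captured), "ost.OSS.ost_io"))  # True
```
]]></screen>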
2302     </section>
2303   </section>
2304   <section xml:id="dbdoclet.50438271_83523">
2305     <title><indexterm>
2306         <primary>proc</primary>
2307         <secondary>debug</secondary>
2308       </indexterm>Debug</title>
2309     <para><literal> /proc/sys/lnet/debug </literal></para>
2310     <para>By default, Lustre generates a detailed log of all operations to aid in debugging. The
2311       level of debugging can affect the performance or speed you achieve with Lustre. Therefore, it
2312       is useful to reduce this overhead by turning down the debug level<footnote>
2313         <para>This controls the level of Lustre debugging kept in the internal log buffer. It does
2314           not alter the level of debugging that goes to syslog.</para>
2315       </footnote> to improve performance. Raise the debug level when you need to collect the logs
2316       for debugging problems. The debugging mask can be set with &quot;symbolic names&quot; instead
2317       of the numerical values that were used in prior releases. The new symbolic format is shown in
      <para>All of the commands below must be run as root, as indicated by the
        <literal>#</literal> prompt.</para>
2320       <para>All of the commands below must be run as root; note the <literal>#</literal>
2321         nomenclature.</para>
2322     </note>
    <para>To check the current debug level, examine the <literal>sysctl</literal> that controls
      debugging:</para>
2325     <screen># sysctl lnet.debug
2326 lnet.debug = ioctl neterror warning error emerg ha config console</screen>
2327     <para>To turn off debugging (except for network error debugging), run this command on all
2328       concerned nodes:</para>
2329     <screen># sysctl -w lnet.debug=&quot;neterror&quot;
2330 lnet.debug = neterror</screen>
2331     <para>To turn off debugging completely, run this command on all concerned nodes:</para>
2332     <screen># sysctl -w lnet.debug=0
2333 lnet.debug = 0</screen>
2334     <para>To set an appropriate debug level for a production environment, run:</para>
2335     <screen># sysctl -w lnet.debug=&quot;warning dlmtrace error emerg ha rpctrace vfstrace&quot;
2336 lnet.debug = warning dlmtrace error emerg ha rpctrace vfstrace</screen>
2337     <para>The flags above collect enough high-level information to aid debugging, but they do not
2338       cause any serious performance impact.</para>
2339     <para>To clear all flags and set new ones, run:</para>
2340     <screen># sysctl -w lnet.debug=&quot;warning&quot;
2341 lnet.debug = warning</screen>
2342     <para>To add new flags to existing ones, prefix them with a
2343       &quot;<literal>+</literal>&quot;:</para>
2344     <screen># sysctl -w lnet.debug=&quot;+neterror +ha&quot;
2345 lnet.debug = +neterror +ha
2346 # sysctl lnet.debug
2347 lnet.debug = neterror warning ha</screen>
2348     <para>To remove flags, prefix them with a &quot;<literal>-</literal>&quot;:</para>
2349     <screen># sysctl -w lnet.debug=&quot;-ha&quot;
2350 lnet.debug = -ha
2351 # sysctl lnet.debug
2352 lnet.debug = neterror warning</screen>
2353     <para>You can verify and change the debug level using the <literal>/proc</literal> interface in
2354       Lustre. To use the flags with <literal>/proc</literal>, run:</para>
2355     <screen># lctl get_param debug
2356 debug=
2357 neterror warning
2358 # lctl set_param debug=+ha
2359 # lctl get_param debug
2360 debug=
2361 neterror warning ha
2362 # lctl set_param debug=-warning
2363 # lctl get_param debug
2364 debug=
2365 neterror ha</screen>
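    <para>The flag syntax shown in these examples can be modeled with a Python set. This is a
      sketch of the observable behavior only, not the kernel's parser:
      <literal>apply_debug_mask</literal> is a hypothetical helper, and treating a value as
      relative only when every token carries a <literal>+</literal> or <literal>-</literal>
      prefix is an assumption.</para>
    <screen><![CDATA[
```python
# Sketch only: models the flag syntax shown above with a Python set.
# apply_debug_mask is a hypothetical helper, not the kernel's parser.
def apply_debug_mask(current, value):
    """Return the new set of debug flags after writing `value`."""
    tokens = value.split()
    # Assumption: the write is relative only if every token has a +/- prefix.
    relative = bool(tokens) and all(tok[0] in "+-" for tok in tokens)
    flags = set(current) if relative else set()
    for tok in tokens:
        if tok.startswith("+"):
            flags.add(tok[1:])
        elif tok.startswith("-"):
            flags.discard(tok[1:])
        elif tok != "0":
            flags.add(tok)  # plain names replace the previous mask
    return flags


mask = {"neterror", "warning"}
mask = apply_debug_mask(mask, "+ha")
mask = apply_debug_mask(mask, "-warning")
print(sorted(mask))  # ['ha', 'neterror']
```
]]></screen>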
2366     <para><literal> /proc/sys/lnet/subsystem_debug </literal></para>
2367     <para>This controls the debug logs for subsystems (see <literal>S_*</literal>
2368       definitions).</para>
2369     <para><literal> /proc/sys/lnet/debug_path </literal></para>
2370     <para>This indicates the location where debugging symbols should be stored for
2371         <literal>gdb</literal>. The default is set to
2372         <literal>/r/tmp/lustre-log-localhost.localdomain</literal>.</para>
    <para>These values can also be set via <literal>sysctl -w lnet.debug={value}</literal>.</para>
2374     <note>
2375       <para>The above entries only exist when Lustre has already been loaded.</para>
2376     </note>
2377     <para><literal> /proc/sys/lnet/panic_on_lbug </literal></para>
    <para>This causes Lustre to call <literal>panic</literal> when it detects an internal
      problem (an <literal>LBUG</literal>); the panic crashes the node. This is particularly useful
      when a kernel crash dump utility is configured. The crash dump is triggered when the internal
      inconsistency is detected by Lustre.</para>
2382     <para><literal> /proc/sys/lnet/upcall </literal></para>
2383     <para>This specifies the path to the binary that is invoked when an
2384         <literal>LBUG</literal> is encountered. The binary is called with four parameters, in
2385       this order: the string <literal>LBUG</literal>, the name of the source file where the
2386       <literal>LBUG</literal> occurred, the function name, and the line number in that
2387       file.</para>
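         <para>As an illustration of the four-parameter convention described above, the following
           Python sketch shows what a minimal upcall handler might look like. It is a sketch under
           stated assumptions, not a program shipped with Lustre: the function name
             <literal>describe_lbug</literal>, the message format, and the demonstration values are
           all hypothetical.</para>
         <programlisting language="python"><![CDATA[
```python
import sys


def describe_lbug(args):
    """Build a one-line summary from the four upcall parameters.

    Expected order: the string "LBUG", the source file, the function
    name, and the line number.
    """
    if len(args) != 4:
        raise ValueError("expected 4 parameters, got %d" % len(args))
    event, source_file, function, line = args
    return "%s in %s() at %s:%s" % (event, function, source_file, line)


if __name__ == "__main__":
    args = sys.argv[1:]
    if len(args) != 4:
        # Hypothetical demonstration values, used when the script is
        # not started with the four upcall parameters.
        args = ["LBUG", "example.c", "example_function", "123"]
    print(describe_lbug(args))
```
]]></programlisting>
         <para>A real handler would more likely write the summary to a log or notify an
           administrator than print it.</para>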
2388     <section remap="h3">
2389       <title>RPC Information for Other OBD Devices</title>
2390       <para>Some OBD devices maintain a count of the number of RPC events that they process.
2391         Sometimes these events are more specific to operations of the device, like llite, than
2392         actual raw RPC counts.</para>
2393       <screen>$ find /proc/fs/lustre/ -name stats
2394 /proc/fs/lustre/osc/lustre-OST0001-osc-ce63ca00/stats
2395 /proc/fs/lustre/osc/lustre-OST0000-osc-ce63ca00/stats
2396 /proc/fs/lustre/osc/lustre-OST0001-osc/stats
2397 /proc/fs/lustre/osc/lustre-OST0000-osc/stats
2398 /proc/fs/lustre/mdt/MDS/mds_readpage/stats
2399 /proc/fs/lustre/mdt/MDS/mds_setattr/stats
2400 /proc/fs/lustre/mdt/MDS/mds/stats
2401 /proc/fs/lustre/mds/lustre-MDT0000/exports/
2402        ab206805-0630-6647-8543-d24265c91a3d/stats
2403 /proc/fs/lustre/mds/lustre-MDT0000/exports/
2404        08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0/stats
2405 /proc/fs/lustre/mds/lustre-MDT0000/stats
2406 /proc/fs/lustre/ldlm/services/ldlm_canceld/stats
2407 /proc/fs/lustre/ldlm/services/ldlm_cbd/stats
2408 /proc/fs/lustre/llite/lustre-ce63ca00/stats
2409 </screen>
2410       <section remap="h4">
2411         <title><indexterm>
2412             <primary>proc</primary>
2413             <secondary>statistics</secondary>
2414           </indexterm>Interpreting OST Statistics</title>
2415         <note>
2416           <para>See also <xref linkend="dbdoclet.50438219_84890"/> (<literal>llobdstat</literal>)
2417             and <xref linkend="dbdoclet.50438273_80593"/> (<literal>collectl</literal>).</para>
2418         </note>
2419         <para>The OST <literal>.../stats</literal> files can be used to track client statistics
2420           (client activity) for each OST. It is possible to get a periodic dump of values from these
2421           files (for example, every 10 seconds) that shows the RPC rates (similar to
2422             <literal>iostat</literal>) by using the <literal>llstat</literal> tool:</para>
2423         <screen># llstat /proc/fs/lustre/osc/lustre-OST0000-osc/stats
2424 /usr/bin/llstat: STATS on 09/14/07
2425        /proc/fs/lustre/osc/lustre-OST0000-osc/ stats on 192.168.10.34@tcp
2426 snapshot_time                      1189732762.835363
2427 ost_create                 1
2428 ost_get_info               1
2429 ost_connect                1
2430 ost_set_info               1
2431 obd_ping                   212</screen>
2432         <para>To clear the statistics, give the <literal>-c</literal> option to
2433             <literal>llstat</literal>. To specify how frequently the statistics should be reported
2434           (in seconds), use an integer for the <literal>-i</literal> option. This is sample output
2435           with the <literal>-c</literal> and <literal>-i10</literal> options used, providing
2436           statistics every 10 seconds:</para>
2437         <screen role="smaller">$ llstat -c -i10 /proc/fs/lustre/ost/OSS/ost_io/stats
2438
2439 /usr/bin/llstat: STATS on 06/06/07
2440         /proc/fs/lustre/ost/OSS/ost_io/ stats on 192.168.16.35@tcp
2441 snapshot_time                              1181074093.276072
2442
2443 /proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074103.284895
2444 Name         Cur.  Cur. #
2445              Count Rate Events Unit   last   min    avg       max    stddev
2446 req_waittime 8     0    8     [usec]  2078   34     259.75    868    317.49
2447 req_qdepth   8     0    8     [reqs]  1      0      0.12      1      0.35
2448 req_active   8     0    8     [reqs]  11     1      1.38      2      0.52
2449 reqbuf_avail 8     0    8     [bufs]  511    63     63.88     64     0.35
2450 ost_write    8     0    8     [bytes] 169767 72914  212209.62 387579 91874.29
2451
2452 /proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074113.290180
2453 Name         Cur.  Cur. #
2454              Count Rate Events Unit   last    min   avg       max    stddev
2455 req_waittime 31    3    39    [usec]  30011   34    822.79    12245  2047.71
2456 req_qdepth   31    3    39    [reqs]  0       0     0.03      1      0.16
2457 req_active   31    3    39    [reqs]  58      1     1.77      3      0.74
2458 reqbuf_avail 31    3    39    [bufs]  1977    63    63.79     64     0.41
2459 ost_write    30    3    38    [bytes] 1028467 15019 315325.16 910694 197776.51
2460
2461 /proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074123.325560
2462 Name         Cur.  Cur. #
2463              Count Rate Events Unit   last    min    avg       max    stddev
2464 req_waittime 21    2    60    [usec]  14970   34     784.32    12245  1878.66
2465 req_qdepth   21    2    60    [reqs]  0       0      0.02      1      0.13
2466 req_active   21    2    60    [reqs]  33      1      1.70      3      0.70
2467 reqbuf_avail 21    2    60    [bufs]  1341    63     63.82     64     0.39
2468 ost_write    21    2    59    [bytes] 7648424 15019  332725.08 910694 180397.87
2469 </screen>
2470         <para>Where:</para>
2471         <informaltable frame="all">
2472           <tgroup cols="2">
2473             <colspec colname="c1" colwidth="50*"/>
2474             <colspec colname="c2" colwidth="50*"/>
2475             <thead>
2476               <row>
2477                 <entry>
2478                   <para><emphasis role="bold">Parameter</emphasis></para>
2479                 </entry>
2480                 <entry>
2481                   <para><emphasis role="bold">Description</emphasis></para>
2482                 </entry>
2483               </row>
2484             </thead>
2485             <tbody>
2486               <row>
2487                 <entry>
2488                   <para>
2489                     <literal> Cur. Count </literal></para>
2490                 </entry>
2491                 <entry>
2492                   <para>Number of events of each type sent in the last interval (in this example,
2493                     10s)</para>
2494                 </entry>
2495               </row>
2496               <row>
2497                 <entry>
2498                   <para>
2499                     <literal> Cur. Rate </literal></para>
2500                 </entry>
2501                 <entry>
2502                   <para>Number of events per second in the last interval</para>
2503                 </entry>
2504               </row>
2505               <row>
2506                 <entry>
2507                   <para>
2508                     <literal> #Events </literal></para>
2509                 </entry>
2510                 <entry>
2511                   <para>Total number of such events since the system started</para>
2512                 </entry>
2513               </row>
2514               <row>
2515                 <entry>
2516                   <para>
2517                     <literal> Unit </literal></para>
2518                 </entry>
2519                 <entry>
2520                   <para>Unit of measurement for that statistic (microseconds, requests,
2521                     buffers, bytes)</para>
2522                 </entry>
2523               </row>
2524               <row>
2525                 <entry>
2526                   <para>
2527                     <literal> last </literal></para>
2528                 </entry>
2529                 <entry>
2530                   <para>Average rate of these events (in units/event) for the last interval during
2531                     which they arrived. For example, an <literal>ost_destroy</literal> entry (not
2532                     shown in the sample output above) might report an average of 736 microseconds
2533                     per destroy for the 400 object destroys in the previous 10 seconds.</para>
2534                 </entry>
2535               </row>
2536               <row>
2537                 <entry>
2538                   <para>
2539                     <literal> min </literal></para>
2540                 </entry>
2541                 <entry>
2542                   <para>Minimum rate (in units/event) since the service started</para>
2543                 </entry>
2544               </row>
2545               <row>
2546                 <entry>
2547                   <para>
2548                     <literal> avg </literal></para>
2549                 </entry>
2550                 <entry>
2551                   <para>Average rate (in units/event) since the service started</para>
2552                 </entry>
2553               </row>
2554               <row>
2555                 <entry>
2556                   <para>
2557                     <literal> max </literal></para>
2558                 </entry>
2559                 <entry>
2560                   <para>Maximum rate (in units/event) since the service started</para>
2561                 </entry>
2562               </row>
2563               <row>
2564                 <entry>
2565                   <para>
2566                     <literal> stddev </literal></para>
2567                 </entry>
2568                 <entry>
2569                   <para>Standard deviation (not measured in all cases)</para>
2570                 </entry>
2571               </row>
2572             </tbody>
2573           </tgroup>
2574         </informaltable>
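             <para>To make the column layout concrete, the following Python sketch splits one row
               of the sample output above into named fields. It is a minimal illustration, not part
               of <literal>llstat</literal>: the function name
                 <literal>parse_llstat_row</literal> and the field names are hypothetical. The
               final line checks an observation drawn from the sample only, namely that
                 <literal>Cur. Rate</literal> there matches <literal>Cur. Count</literal> divided
               by the 10-second interval and truncated to a whole number.</para>
             <programlisting language="python"><![CDATA[
```python
# Hypothetical field names, in the column order of the sample output.
COLUMNS = ["cur_count", "cur_rate", "events", "unit",
           "last", "min", "avg", "max", "stddev"]


def parse_llstat_row(row):
    """Split one data row into its statistic name and a dict of fields."""
    fields = row.split()
    name, values = fields[0], fields[1:]
    record = dict(zip(COLUMNS, values))
    for key in ("cur_count", "cur_rate", "events", "last", "min", "max"):
        record[key] = int(record[key])
    for key in ("avg", "stddev"):
        record[key] = float(record[key])
    # "[usec]" becomes "usec".
    record["unit"] = record["unit"].strip("[]")
    return name, record


# Row copied from the second interval of the sample output.
row = "req_waittime 31    3    39    [usec]  30011   34    822.79    12245  2047.71"
name, record = parse_llstat_row(row)
print(name, record["unit"], record["cur_count"], record["cur_rate"])

INTERVAL_SECONDS = 10  # matches the -i10 option used in the sample
print(record["cur_count"] // INTERVAL_SECONDS == record["cur_rate"])
```
]]></programlisting>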
2575         <para>The events common to all services are:</para>
2576         <informaltable frame="all">
2577           <tgroup cols="2">
2578             <colspec colname="c1" colwidth="50*"/>
2579             <colspec colname="c2" colwidth="50*"/>
2580             <thead>
2581               <row>
2582                 <entry>
2583                   <para><emphasis role="bold">Parameter</emphasis></para>
2584                 </entry>
2585                 <entry>
2586                   <para><emphasis role="bold">Description</emphasis></para>
2587                 </entry>
2588               </row>
2589             </thead>
2590             <tbody>
2591               <row>
2592                 <entry>
2593                   <para>
2594                     <literal> req_waittime </literal></para>
2595                 </entry>
2596                 <entry>
2597                   <para>Amount of time a request waited in the queue before being handled by an
2598                     available server thread.</para>
2599                 </entry>
2600               </row>
2601               <row>
2602                 <entry>
2603                   <para>
2604                     <literal> req_qdepth </literal></para>
2605                 </entry>
2606                 <entry>
2607                   <para>Number of requests waiting to be handled in the queue for this
2608                     service.</para>
2609                 </entry>
2610               </row>
2611               <row>
2612                 <entry>
2613                   <para>
2614                     <literal> req_active </literal></para>
2615                 </entry>
2616                 <entry>
2617                   <para>Number of requests currently being handled.</para>
2618                 </entry>
2619               </row>
2620               <row>
2621                 <entry>
2622                   <para>
2623                     <literal> reqbuf_avail </literal></para>
2624                 </entry>
2625                 <entry>
2626                   <para>Number of unsolicited lnet request buffers for this service.</para>
2627                 </entry>
2628               </row>
2629             </tbody>
2630           </tgroup>
2631         </informaltable>
2632         <para>Some service-specific events of interest are:</para>
2633         <informaltable frame="all">
2634           <tgroup cols="2">
2635             <colspec colname="c1" colwidth="50*"/>
2636             <colspec colname="c2" colwidth="50*"/>
2637             <thead>
2638               <row>
2639                 <entry>
2640                   <para><emphasis role="bold">Parameter</emphasis></para>
2641                 </entry>
2642                 <entry>
2643                   <para><emphasis role="bold">Description</emphasis></para>
2644                 </entry>
2645               </row>
2646             </thead>
2647             <tbody>
2648               <row>
2649                 <entry>
2650                   <para>
2651                     <literal> ldlm_enqueue </literal></para>
2652                 </entry>
2653                 <entry>
2654                   <para>Time it takes to enqueue a lock (this includes file open on the MDS)</para>
2655                 </entry>
2656               </row>
2657               <row>
2658                 <entry>
2659                   <para>
2660                     <literal> mds_reint </literal></para>
2661                 </entry>
2662                 <entry>
2663                   <para>Time it takes to process an MDS modification record (includes create,
2664                       <literal>mkdir</literal>, <literal>unlink</literal>, <literal>rename</literal>
2665                     and <literal>setattr</literal>)</para>
2666                 </entry>
2667               </row>
2668             </tbody>
2669           </tgroup>
2670         </informaltable>
2671       </section>
2672       <section remap="h4">
2673         <title><indexterm>
2674             <primary>proc</primary>
2675             <secondary>statistics</secondary>
2676           </indexterm>Interpreting MDT Statistics</title>
2677         <note>
2678           <para>See also <xref linkend="dbdoclet.50438219_84890"/> (<literal>llobdstat</literal>)
2679             and <xref linkend="dbdoclet.50438273_80593"/> (<literal>collectl</literal>).</para>
2680         </note>
2681         <para>The MDT <literal>.../stats</literal> files can be used to track MDT statistics for
2682           the MDS. Here is sample output for an MDT <literal>stats</literal> file:</para>
2683         <screen># cat /proc/fs/lustre/mds/*-MDT0000/stats
2684 snapshot_time                   1244832003.676892 secs.usecs
2685 open                            2 samples [reqs]
2686 close                           1 samples [reqs]
2687 getxattr                        3 samples [reqs]
2688 process_config                  1 samples [reqs]
2689 connect                         2 samples [reqs]
2690 disconnect                      2 samples [reqs]
2691 statfs                          3 samples [reqs]
2692 setattr                         1 samples [reqs]
2693 getattr                         3 samples [reqs]
2694 llog_init                       6 samples [reqs]
2695 notify                          16 samples [reqs]</screen>
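             <para>A stats file in the format shown above can be read into a simple mapping for
               monitoring scripts. The Python sketch below is an illustration only: the function
               name <literal>parse_stats</literal> is hypothetical, the embedded text is an excerpt
               of the sample output rather than a live file, and the parser assumes every counter
               line has the form <literal>name count samples [unit]</literal>.</para>
             <programlisting language="python"><![CDATA[
```python
# Excerpt of the sample output above, embedded so the sketch runs
# without a Lustre file system.
SAMPLE = """\
snapshot_time                   1244832003.676892 secs.usecs
open                            2 samples [reqs]
close                           1 samples [reqs]
getattr                         3 samples [reqs]
notify                          16 samples [reqs]
"""


def parse_stats(text):
    """Return {operation: sample count} for each counter line."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Counter lines have at least four fields with "samples" third;
        # the snapshot_time line has only three and is skipped.
        if len(fields) >= 4 and fields[2] == "samples":
            counters[fields[0]] = int(fields[1])
    return counters


counters = parse_stats(SAMPLE)
# List the operations from most to least frequent.
for name in sorted(counters, key=counters.get, reverse=True):
    print(name, counters[name])
```
]]></programlisting>
             <para>On a live system, the same function could be given the contents of a stats file
               read from <literal>/proc</literal> in place of the embedded sample.</para>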
2696       </section>
2697     </section>
2698   </section>
2699 </chapter>